
Preliminary Digitization Guidelines
From the document: "To assist museums, galleries and other heritage institutions who are interested in using the World Wide Web as a means to achieve their educational and communication objectives, CHIN has outlined some issues to consider in planning and preparing for this new activity and some preliminary guidelines relating to digitization of text and images."
This document is also available in French: Lignes directrices préliminaires concernant la numérisation

Editor/Author: Canadian Heritage Information Network (CHIN)
Organization/Publisher: Canadian Heritage Information Network (CHIN)
Date of Publication: 1996
Submitted by: CHIN/RCIP
Topic: Digitization


      

Preliminary Digitization Guidelines

To assist museums, galleries and other heritage institutions who are interested in using the World Wide Web as a means to achieve their educational and communication objectives, CHIN has outlined some issues to consider in planning and preparing for this new activity and some preliminary guidelines relating to digitization of text and images. The suggestions in these guidelines can be followed using most standard equipment currently available. CHIN is also developing more comprehensive and detailed guidelines. These guidelines are intended primarily for institutions using their own scanning equipment, although they will also be useful in discussions with third-party suppliers of scanning services.

Image Digitization - Preliminary Guidelines

The process of producing digital images consists of a number of steps. The source material must first be identified and prepared for capture. During image capture, the images are digitized by processing hardware and software and recorded to disc, where they are manipulated and modified in various ways. The digitized images are then stored along with related textual information. Following this, the images are linked to their text for retrieval, and reformatted for presentation over the World Wide Web.

Before any images are digitized, be sure that you have the right to do so! Virtually all photographs are protected by copyright, as are many of the objects they depict, and the consent of the copyright owner is required before you digitize images for display. It is the responsibility of the collection holder to ensure that copyright is owned, or that clearance is obtained, to digitize and display the material over the World Wide Web.

Preparation

Images may be captured from a number of sources, but the most common of these are likely to be prints, negatives, or transparencies. All of these come in various sizes. As an initial project in digitization, it is suggested that institutions use existing photographic sources, because original photography is time-consuming and costly. It is recommended that 35mm film or photographic prints be used as the source of images. The hardware and software needed to handle these are more readily available than those needed for other formats.

Image capture from the object itself should be minimized, as the handling and exposure to lights tend to be bad for the object, and there is the further danger of accidental damage. If it is necessary to go to the object itself, it is best to take a photograph, and work from the photographic medium. It is possible to return to a photographic source again and again with only minimal damage.

The better the photographic source, the better the digital image that can be obtained. All images should be in good condition, and free of foreign objects and fingerprints. Make sure all images are in their correct orientation and not reversed. Such errors may be very difficult to detect after digitization.

Plan the number and types of images to be scanned. This will allow an estimate of the needed storage space, and the equipment and human resources required. It will also enable the grouping of similar sorts of material so that images can be run in batches. This is more efficient as it reduces the time required for set-up and calibration of equipment.
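The storage estimate described above can be sketched as simple arithmetic. This is only an illustration: the batch names and counts are hypothetical, and the 12 megabyte per-image figure is borrowed from the recommendation given under Image Capture in these guidelines.

```python
def estimate_storage_bytes(image_count, bytes_per_image):
    """Return the total storage needed for a batch of uncompressed scans."""
    if image_count < 0 or bytes_per_image < 0:
        raise ValueError("counts and sizes must be non-negative")
    return image_count * bytes_per_image


# Hypothetical project plan: images grouped into batches of similar material.
batches = {"35mm slides": 500, "prints": 200}

# Assumed uncompressed size of one archival scan (12 megabytes).
BYTES_PER_SCAN = 12_000_000

total = sum(estimate_storage_bytes(n, BYTES_PER_SCAN) for n in batches.values())
print(f"Estimated archive size: {total / 1_000_000_000:.1f} GB")
```

Grouping the counts by batch, as here, also gives a rough picture of how many set-up and calibration runs the project will need.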

Image Capture

Prints are commonly captured by placing them on a copy stand, with a digital camera taking the place of the traditional photographic camera. The camera is raised or lowered to increase or decrease the field of view and to minimize the amount of cropping needed later. Another commonly used capture method is the flatbed scanner, which resembles a photocopier. The prints are placed face down on a glass plate, and either the whole image or a selected portion is scanned.

The most commonly used form of film at this time is 35mm, whether positive (transparencies) or negative. Because this size is so standard, a number of 35mm slide scanners are available on the market. These may take framed slides or may accept unmounted film strips, depending on the film mounting options available.

Scanners generally come with their own software drivers, but most of the major image manipulation software also includes drivers for several scanners. All cameras and scanners are controlled from microcomputers.

As part of the scanning procedure, record the technical details of the process and store the information for future reference. This makes it possible to reproduce the process if a file becomes corrupted, to modify the process if the scanned image is not satisfactory, or to share the details with others. A recommended set of standards for documenting the image files will be provided in the comprehensive guidelines sent to successful applicants.
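One way to record the technical details of a scan is a small structured record saved alongside the image. The sketch below is an assumption, not the documentation standard promised in the comprehensive guidelines: the field names, the `ScanRecord` class, and the sample values are all hypothetical.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ScanRecord:
    """Hypothetical set of technical details kept for one scanned image."""
    file_name: str
    source_type: str      # e.g. "35mm transparency" or "print"
    device: str           # scanner or camera used
    width_px: int
    height_px: int
    bit_depth: int
    scan_date: str        # ISO 8601 date, e.g. "1996-05-14"
    operator_notes: str = ""


def to_json(record):
    """Serialize a scan record so it can be stored or shared with others."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


# Hypothetical example values.
record = ScanRecord(
    file_name="P1996-0042.tif",
    source_type="35mm transparency",
    device="slide scanner (model not specified)",
    width_px=2000,
    height_px=2000,
    bit_depth=24,
    scan_date="1996-05-14",
    operator_notes="Batch 3, calibrated before run.",
)
print(to_json(record))
```

Keeping the record in a plain text format such as JSON makes it easy to reproduce or adjust the scan later without depending on the scanning software itself.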

Images may be digitized at various levels of resolution, and at different colour depths. As a general rule, all images should be captured at the highest possible quality that your technology and budget allow. The greater the resolution and depth of colour, the larger the scanned image file, and the greater the time spent to scan each image. It is always possible to reduce the quality of an electronic image, but it is not possible to improve the quality. This is an initial investment that will prove its value in the long run as the digitized images can be reused for a variety of purposes.

It is recommended that original scanned images be in 24-bit colour (16 million colours), with a minimum resolution of 2000 x 2000 pixels. This will produce a file of up to 12 megabytes per image, depending on the image shape. These images should be copied, and the original scans archived on some type of back-up medium such as CD-ROM, digital tape, or a large (gigabyte) hard drive. This allows the scan to be used for other purposes at a later date. All further manipulations should be done on the copies.
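The figures in this recommendation follow from simple arithmetic, sketched below. The function names are illustrative only; the calculation assumes an uncompressed image with no file header overhead.

```python
def colour_count(bits_per_pixel):
    """Number of distinct colours a given bit depth can represent."""
    return 2 ** bits_per_pixel


def uncompressed_size_bytes(width_px, height_px, bits_per_pixel):
    """Size of the raw pixel data, ignoring any file header."""
    return width_px * height_px * bits_per_pixel // 8


print(colour_count(24))                         # 16777216, roughly 16 million
print(uncompressed_size_bytes(2000, 2000, 24))  # 12000000 bytes, about 12 megabytes
```

A non-square image scanned within the same 2000-pixel limit has fewer pixels, which is why the file is "up to" 12 megabytes.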

Textual documentation of the image and the object which it represents must also be recorded, and the image and its associated text must always be related in some way. A simple method for doing this is to have the electronic file name of the image tie it clearly to the textual documentation as a photo number or code of some form.
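A minimal sketch of this file-naming approach follows. The photo number format, the catalogue contents, and the function names are hypothetical; the only idea taken from the text is that the file name carries the code that identifies the textual record.

```python
from pathlib import PurePosixPath

# Hypothetical textual documentation, keyed by photo number.
catalogue = {
    "P1996-0042": {"title": "Beaded moccasins", "object_number": "1972.15.3"},
}


def image_file_name(photo_number, extension="tif"):
    """Build an image file name that embeds the photo number."""
    return f"{photo_number}.{extension}"


def record_for_image(file_name, records):
    """Look up the textual record whose key matches the file name stem."""
    photo_number = PurePosixPath(file_name).stem
    return records.get(photo_number)


name = image_file_name("P1996-0042")
print(name)                               # P1996-0042.tif
print(record_for_image(name, catalogue))  # the matching catalogue entry
```

Because the link depends only on the file name stem, the same record is found whether the image is the archival TIFF or a later presentation copy in another format.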

Quality control is vital to the success of an imaging project. Each captured image should be verified individually. This will often require comparing the scanned image with the original photograph, or even the object itself. Somehow there is always one image that is reversed, or upside down, and these errors can be very difficult to find!

Storage and compression

Image files tend to be very large, taking up a great deal of disc space. These large files need to be reduced in size for storage, transmission, and display. For archival purposes, the original scanned images should be compressed using "lossless compression". This will generally reduce an image file size by about half, and the file can be reconstructed without loss of information. For the World Wide Web, file sizes must be more severely reduced, as many users cannot easily download huge image files. This further reduction in file size is achieved through "lossy compression"; depending on the degree of compression, some information is lost, and the image cannot be perfectly reconstructed. The greater the compression, the greater the information loss.
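The defining property of lossless compression, that the original can be rebuilt exactly, can be demonstrated with Python's standard zlib module. This is an illustration under stated assumptions: zlib stands in for whatever lossless method an imaging package offers, and the synthetic banded "image" compresses far better than a real photograph would, so the ratio printed here says nothing about the "about half" figure above.

```python
import zlib


def lossless_round_trip(data):
    """Compress data, then decompress it again, returning both results."""
    compressed = zlib.compress(data, 9)
    restored = zlib.decompress(compressed)
    return compressed, restored


# Synthetic stand-in for pixel data: a 64 x 64 greyscale image of flat bands.
pixels = bytes((y // 8) * 32 for y in range(64) for x in range(64))

compressed, restored = lossless_round_trip(pixels)
print(len(pixels), "bytes before,", len(compressed), "bytes after")
print("Identical after round trip:", restored == pixels)
```

A lossy method would fail the final comparison, which is why it is reserved for presentation copies rather than archival scans.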

On the World Wide Web, compression ratios are likely to range from 10:1 to 40:1, depending on the starting file size and the quality of the image display desired. JPEG is a popular compression method, and images in most file formats, including Kodak Photo CD and Macintosh PICT, can be converted to it. The medium range of compression will suffice for most Web uses, providing a reasonable compromise between file size and picture quality.
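To see what these ratios mean in practice, the short sketch below applies them to a hypothetical 1,200,000-byte image. The starting size and function name are assumptions chosen for illustration; actual results depend on the image content and the software used.

```python
def compressed_size_bytes(original_bytes, ratio):
    """Approximate file size after compression at a given ratio (e.g. 10 for 10:1)."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return original_bytes / ratio


original = 1_200_000  # hypothetical uncompressed size in bytes

for ratio in (10, 25, 40):
    size = compressed_size_bytes(original, ratio)
    print(f"{ratio}:1 -> about {size / 1000:.0f} kilobytes")
```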

Depending on the technology used, images can be stored in a variety of file formats. Some of the more commonly encountered are BMP (Microsoft Windows Bitmap), JPEG, PCD (Kodak's Photo CD format), PICT (for the Macintosh), and TIFF. Any of these file formats is easily converted to a World Wide Web format.

The GIF format, commonly encountered on the World Wide Web, will only render 8-bit colour (256 colours), and so is not recommended as a file storage format.

Retrieval

The only efficient and widely supported way of retrieving images from the Web is by means of text associated with them. That means that a certain amount of textual information must be entered along with the image. For World Wide Web distribution there are a number of retrieval choices.

a.- Make a hypertext link from a text document to the image file.

b.- Create a small, thumbnail image "inline" as part of the text document, which can then be selected to retrieve a full size image.

c.- Give the large image an informative title.

d.- Present the image as part of an exhibition which has a general, informative introduction.

The actual presentation choice is not important in this context; what is important is that a link is maintained between the image and the information about the image.
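As an illustration of choice (b), the sketch below builds the HTML for an inline thumbnail that links to the full-size image and carries a descriptive title. The file paths, title, and function name are hypothetical; the point is only that the image and its text travel together.

```python
from html import escape


def thumbnail_link(full_image, thumbnail, title):
    """Return HTML for a thumbnail that links to the full-size image."""
    return (
        f'<a href="{escape(full_image, quote=True)}">'
        f'<img src="{escape(thumbnail, quote=True)}" '
        f'alt="{escape(title, quote=True)}">'
        f"</a>"
    )


# Hypothetical file names and title.
print(thumbnail_link(
    "images/P1996-0042.jpg",
    "thumbs/P1996-0042.gif",
    "Beaded moccasins, c. 1890",
))
```

Escaping the title guards against characters such as quotation marks or ampersands breaking the page markup.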

Presentation

Using image manipulation software, the working copy of the image should be cropped to remove the non-image areas about the edges. The colour, contrast, brightness, and sharpness may be adjusted if required. This process is called "image optimization", or "image enhancement", and most higher end or full-featured image processing software packages can be used for this purpose. The image is then saved for output display.
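In practice these adjustments are made interactively in an image processing package, but the underlying operations are simple. The sketch below shows cropping and a brightness adjustment on a tiny greyscale image held as a list of rows; the function names and sample values are illustrative assumptions, not the behaviour of any particular package.

```python
def crop(pixels, left, top, right, bottom):
    """Keep rows top..bottom-1 and columns left..right-1 of a 2-D pixel grid."""
    return [row[left:right] for row in pixels[top:bottom]]


def adjust_brightness(pixels, offset):
    """Add an offset to every pixel, clamping results to the 0-255 range."""
    return [[max(0, min(255, value + offset)) for value in row] for row in pixels]


# Hypothetical 4 x 4 scan: a 2 x 2 image surrounded by a black border.
scan = [
    [0,   0,   0, 0],
    [0, 100, 120, 0],
    [0, 140, 250, 0],
    [0,   0,   0, 0],
]

working_copy = crop(scan, left=1, top=1, right=3, bottom=3)
print(working_copy)                         # [[100, 120], [140, 250]]
print(adjust_brightness(working_copy, 10))  # [[110, 130], [150, 255]]
```

Both functions return new lists and leave the input unchanged, in keeping with the advice to manipulate working copies rather than the original scan.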

Regardless of how images are stored, an additional decision must be made with respect to the appropriate format for display in the World Wide Web environment. Most images on the World Wide Web are in GIF or JPEG format. The GIF file format is probably the most common, as it is the original format used on the Web, but it can only render 8-bit colour (256 colours). The JPEG file format is gaining popularity. It has the advantage of higher compression, resulting in smaller image files, and it can render images in either 24-bit colour (16 million colours) or 8-bit colour (256 colours). The disadvantages are that JPEG files may not be handled by older browsers, and that a compressed 24-bit JPEG file is still larger than the comparable 8-bit GIF file.

For the World Wide Web, it is suggested that images be presented in compressed GIF or JPEG format. They should be in 8-bit colour (256 colours) and have a maximum resolution of 400 x 400 pixels. This will ensure that the entire image can be fully displayed on a medium resolution display monitor. This produces a presentation file of up to 160 kilobytes, smaller when compressed. Thumbnail images should also be in 8-bit colour, but need only have a resolution of 128 x 128 pixels or less, producing a compressed file of under 10 kilobytes in size. Remember that you do not have complete control over display picture quality; you can set a "best possible" level, but beyond that, the characteristics and settings of the user's monitor will be the determining factor.
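The size limits above can be applied to an archival scan by scaling it down while preserving its proportions. The sketch below only computes the target dimensions and the resulting uncompressed size; the function name is an assumption, and the actual resampling would be done in image manipulation software.

```python
def fit_within(width_px, height_px, max_side):
    """Scale dimensions down to fit a max_side x max_side box, keeping proportions."""
    if width_px <= 0 or height_px <= 0 or max_side <= 0:
        raise ValueError("dimensions must be positive")
    # Never scale up: an image already within the limit is left unchanged.
    scale = min(max_side / width_px, max_side / height_px, 1.0)
    return max(1, round(width_px * scale)), max(1, round(height_px * scale))


print(fit_within(2000, 1500, 400))  # (400, 300) for presentation
print(fit_within(2000, 1500, 128))  # (128, 96) for a thumbnail

# 8-bit colour uses one byte per pixel, so a full 400 x 400 image is:
print(400 * 400 * 1)                # 160000 bytes, about 160 kilobytes
```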

Digitization of Text-based Documents

All information for presentation on the World Wide Web, including text and graphics, must be in a digital electronic format. Textual and graphical information which is not already in such a format must be converted for use on the World Wide Web. This can be done manually by keying in the information using a word processor or text editor. It may also be done using electronic document conversion, either by copying the information as document images, or by means of a scanner equipped with Optical Character Recognition (OCR) software.

Keying in the information using a word processor is the most straightforward way of digitizing, and the procedure is commonplace. For small to moderate quantities of information, or when original text is newly created, this is the most efficient and effective method of digitization. Keying, however, can be an unnecessarily lengthy procedure when large quantities of reasonably clean, clear source material already exist. Nor is keying suited to capturing information other than the printed word, such as company letterhead, logos, or graphics.

Electronic document conversion, whether it be document imaging or Optical Character Recognition, is well suited for digitizing large quantities of text. It is most effective when working from originals which are in good condition. Torn or mutilated paper confuses the scanner software. Poorly defined lettering is difficult to reproduce, so for good scanning results the type should be clear and legible, and should contrast well with the background. These problems can be alleviated somewhat by batching documents of similar quality together in a run and adjusting the scanner for each batch of documents. However, poor quality source material still tends to result in lower quality, less accurate electronic documents.

The two methods of electronic document conversion differ fundamentally, and the choice of one over the other will depend on the character of the documents to be converted and their anticipated use.

A document imaging process scans the document as an image and produces a digital graphics image file, much as a fax machine does. The advantage of this is that the scanning machine does not make any mistakes, because it does not do any interpretation of text. It reproduces what it "sees", and leaves the interpretation to the people viewing it. One disadvantage is that the information within the image, be it text, graphics, logo, or picture, cannot be manipulated or changed. A second disadvantage is that the information cannot be found and retrieved without significant additional work, such as associating keywords with the document image using other software. A third is that the resulting document image files are considerably larger than equivalent OCR files, and hence take more time to transmit and download on the World Wide Web. If extensive manipulation of the information is required or searching the information is anticipated, it would be more advantageous to use OCR. If, however, the original documents contain logos or other graphics that need to be digitized for later use, document imaging is the preferred method.

The second electronic document conversion process scans the document in conjunction with Optical Character Recognition software. This allows the scanner to read the document character by character and save the text portions as a text or word processor file or store them in a database. In the OCR process, the software will discard non-textual material such as pictures, graphs, charts, and the straight lines which define forms of various sorts. It will also throw away, or be confused by, handwritten annotations. Performance deteriorates with coloured or strongly textured paper. There is also a marked decline in performance when several different fonts, scripts, and character sets have been used. This method results in text that can be manipulated and reformatted for presentation on the World Wide Web.

Since the OCR software is interpreting the document, an error correction step is required. This is probably the most time-consuming part of the process. Good, modern scanning software has some intelligence, and can indicate which characters it thinks it could not read. The OCR process works well, with up to 99% accuracy on good, clean source material, and is the method of choice when text manipulation is required.
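Even at 99% accuracy, the correction step involves real work, as a rough calculation shows. The page length of 2,000 characters used below is an assumption for illustration, as is the function name; actual error counts depend on the source material.

```python
def expected_ocr_errors(character_count, accuracy):
    """Estimate how many characters will need correction at a given accuracy."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return round(character_count * (1.0 - accuracy))


CHARS_PER_PAGE = 2000  # assumed typical page length

print(expected_ocr_errors(CHARS_PER_PAGE, 0.99))        # about 20 errors per page
print(expected_ocr_errors(CHARS_PER_PAGE * 100, 0.99))  # about 2000 in a 100-page document
```

Estimates like this can help decide whether OCR with correction or manual keying is the more efficient route for a given set of documents.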


      

Date Published: 2002-04-27
Last Modified: 2003-12-08
© CHIN 2006. All Rights Reserved