An Introduction to Document Image Processing (DIP)
Printed documents are the very basis of business today. Almost every facet of business life depends upon the creation of a printed document, often in multiple drafts and often, even in a small office, in multiple copies.

To be useful, documents must be transported, stored and retrieved when required by anyone who needs them. As the quantity of paper documents in a business continues to grow, the ease of access to a particular piece of paper diminishes.

The modern office needs more efficient systems to handle the mounting volume of paper produced today. The combination of scanners with optical disk technology opens up many opportunities for new uses of DIP.

Document Image Processing

Document Image Processing (DIP) is a means of electronically managing information which has been distributed on paper.

DIP has four steps:

Scan: converting the printed document to a digitised image.
Index: labelling the digitised document.
Store: allocating space on electronically readable media for the scanned material.
Retrieve: recovering the digitised material for viewing or printing.

Document Image Processing (DIP), Electronic Filing Systems (EFS) and Document Management Systems (DMS) provide similar functions and are terms used by different areas of the IT industry.

Note: Although DIP is used throughout this document, any of the above terms could have been used.

Why DIP?


The problem of managing growing quantities of paper-based information faces the manager of every office in every business. Searching, retrieving and storing paper documents costs money. It has been estimated that the cost of storing one sheet of paper for one year is 16 pence. Multiply that by the number of sheets of paper flooding through every office and the costs form a significant part of business expenditure.

A Document Image Processing system offers a solution to two problems: maintaining efficient document filing and retrieval, and escalating storage costs.

The document imaging system is an integration of hardware and software providing all the elements needed to achieve these efficiencies and cost reductions, and to integrate the digitised data with other electronically generated data.

Scan


The most common input method is to scan the document with an image scanner. Direct input from a fax card, modem, or other imaging device is also feasible. However, it is important that an image scanner is included in every system for the transfer of ordinary paper-based documents.

The scanner may need to cope with a range of documents (e.g. A5, A4 or A3 paper), and be capable of scanning large volumes of documents during the normal day to day running of the system. In a DIP system any scanner which is considered should be a high throughput device. As such the scanner will need to be fitted with an ADF (Automatic Document Feeder) to provide fast and reliable input; the flat bed of the scanner provides the facility to scan fragile and special documents.

The colour content of paper-based information is usually decorative and is unlikely to be required for its information content, so it is not usual for a DIP system to use a colour scanner. Most colour scanners are designed for general, low-volume desktop publishing use, and are not suitable for the rigours of a DIP system.

Scanned data has to be transferred from the scanner to the workstation or host and possibly via a network to a file server. For a scanner which can scan 50 pages a minute through an automatic document feeder, up to 1000 megabytes of electronic data may need to be transferred. This will be doubled when scanning A3 pages.

Scanned images can occupy a large amount of disk space as well as cause data transfer bottlenecks. In order to reduce this large overhead the system uses data compression. An A4 page with text and graphics scanned at 300 dpi using grey scaling will produce an electronic file of around 8 megabytes. The same A4 document at 300 dpi, compressed, will occupy approximately 500 kilobytes of memory.

In a DIP system this file size can be further reduced by scanning the page at 200 dpi and using line-art mode instead of grey scale mode. The same A4 page, scanned at 200 dpi as line-art with group IV compression, will occupy only 46 kilobytes of memory.
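The uncompressed sizes behind these figures can be checked with a little arithmetic. The sketch below assumes an A4 page of 8.27 x 11.69 inches, 8 bits per pixel for grey scale and 1 bit per pixel for line-art; the function name and constants are illustrative only and do not come from any particular DIP product.

```python
# Assumed A4 page size in inches (210 x 297 mm).
A4_INCHES = (8.27, 11.69)

def raw_image_bytes(dpi, bits_per_pixel, page_inches=A4_INCHES):
    """Return the uncompressed size in bytes of one scanned page."""
    width_px = round(page_inches[0] * dpi)
    height_px = round(page_inches[1] * dpi)
    return width_px * height_px * bits_per_pixel // 8

grey_300 = raw_image_bytes(300, 8)     # grey scale at 300 dpi (assumed 8 bits per pixel)
lineart_200 = raw_image_bytes(200, 1)  # line-art at 200 dpi (1 bit per pixel)

print(f"300 dpi grey scale: {grey_300 / 1e6:.1f} MB uncompressed")
print(f"200 dpi line-art:   {lineart_200 / 1e3:.0f} KB uncompressed")
# Compression ratio needed to bring the line-art page down to the quoted 46 kilobytes.
print(f"ratio to reach 46 KB: {lineart_200 / 46_000:.1f}:1")
```

Under these assumptions the grey scale page comes out a little over 8 megabytes, in line with the figure quoted above, and the line-art page is already many times smaller before any compression is applied.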

Until recently, the only realistic way of compressing scanned documents was by use of a specialist card installed in the PC or workstation. In these installations, the scanner outputs data via a video interface connected directly to the compression card in the host. The entire scanned image is then assembled and compressed by the specialist card before transferring the image data for DIP processing. While this gives the required compression, it limits the flexibility of the scanner to interface to different hardware platforms, and requires the transfer of large files between the scanner video port and the host.

An alternative to using a scanner with video output and a PC based compression card is to utilise the range of SCSI scanners which are now available. The SCSI-II interface has long been accepted as the standard for interfacing data products (such as disk drives), allowing high speed transfer rates of up to 10 megabytes per second. Scanners fitted with a SCSI-II interface offer the benefits of lower cost and ease of interfacing, and the interface itself is able to cope with the large amounts of data to be transferred.
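A rough check of whether the interface can keep up is sketched below. It combines figures quoted in this document (50 pages a minute, around 8 megabytes per uncompressed A4 grey scale page, 10 megabytes per second for SCSI-II); pairing them in this way is an assumption made for illustration, and real sustained transfer rates will be lower than the peak figure.

```python
PAGES_PER_MINUTE = 50       # scanner speed quoted in the text
PAGE_MEGABYTES = 8          # uncompressed A4 grey scale page at 300 dpi, quoted in the text
SCSI2_MB_PER_SECOND = 10    # peak SCSI-II transfer rate quoted in the text

def required_mb_per_second(pages_per_minute, page_megabytes):
    """Average data rate needed to move every scanned page to the host."""
    return pages_per_minute * page_megabytes / 60

needed = required_mb_per_second(PAGES_PER_MINUTE, PAGE_MEGABYTES)
headroom = SCSI2_MB_PER_SECOND / needed

print(f"required: {needed:.2f} MB/s, interface peak: {SCSI2_MB_PER_SECOND} MB/s")
print(f"headroom factor: {headroom:.2f}")
```

With these numbers the scanner needs somewhat under 7 megabytes per second, which fits within the quoted peak rate but leaves little margin, one reason compression remains attractive.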

However, the need for compression remains, for both storage and network transmission reasons. Data compression may be hardware or software based, using either the scanner manufacturer's compression board (which also provides faster data transfer to the host) or specialist interface boards and system software. Compression in these cases is usually to group IV standard.
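To see why line-art pages compress so well, consider a single scan line that is mostly white. The sketch below uses plain run-length encoding, which is a much simpler technique than group IV (group IV is a two-dimensional coding scheme and is not reproduced here); the scan line itself is a made-up example.

```python
from itertools import groupby

def run_lengths(row):
    """Encode a row of 0/1 pixels as (pixel_value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(row)]

def expand(runs):
    """Rebuild the original row of pixels from its run-length pairs."""
    row = []
    for value, length in runs:
        row.extend([value] * length)
    return row

# Hypothetical scan line of 1654 pixels: white (0) with two short black (1) strokes.
row = [0] * 600 + [1] * 12 + [0] * 400 + [1] * 8 + [0] * 634
runs = run_lengths(row)

print(f"{len(row)} pixels described by {len(runs)} runs: {runs}")
```

The whole line is described by a handful of runs, and expanding them restores every pixel, so nothing is lost from the scanned image.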

Other compression techniques are being researched and will be introduced by system integrators using software and hardware modules.

Indexing


When the document has been scanned the digitised image needs to be categorised before it is stored. This stage of the process is equivalent to a user deciding exactly where in a filing cabinet a document should be stored.

Indexing is the most important feature in any DIP system. The ability to retrieve scanned data quickly and accurately relies upon each document being appropriately labeled. This is a function of the DIP software, whether the software is an off-the-shelf package or a customised system for a particular customer. The indexing can be either command driven or menu driven. The latter provides an easier solution and usually allows even non-specialist users to achieve fast and accurate indexing and document retrieval. 'Shrink wrapped' software usually uses menu driven indexing.

The search and retrieval of a document relies on a software maintained database. The indexing fields in the database records need to contain relevant and local information. For example, index fields could be Date, Addressee, Author, Department, Subject, Headline, Topic, Keywords and possibly Extract.
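As an illustration, an index record built from some of these fields might look like the sketch below. The class name, the image_ref field and all of the sample values are hypothetical; an actual DIP package will define its own record layout.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class IndexRecord:
    """One database entry describing a scanned document (illustrative layout)."""
    document_date: date
    addressee: str
    author: str
    department: str
    subject: str
    keywords: list = field(default_factory=list)
    extract: str = ""
    image_ref: str = ""  # hypothetical pointer to where the image is stored

# Made-up sample values, for illustration only.
record = IndexRecord(
    document_date=date(1996, 3, 14),
    addressee="Purchasing Manager",
    author="J. Smith",
    department="Accounts",
    subject="Overdue payment",
    keywords=["invoice", "reminder"],
    image_ref="volume-07/000123",
)

print(record)
```

Keeping the image reference in the record, rather than the image itself, mirrors the separation between the index database and the image store.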

The more fields identified for the search of a document, the faster and more accurate the search. Other functions which may be performed at this stage are partial or complete OCR (Optical Character Recognition), or removal of unwanted pages (e.g. incorrectly received or corrupted facsimiles).

A scanned document is initially saved as an image file - a picture made up of individual pixels of black, white, or varying shades of grey. In this way the scanned document can include text, graphics, photographs, or a combination of all three. It is important that an unaltered image is always stored as the central record of the document, thus ensuring that no information is lost or corrupted from the original document. However, with DIP, additional notes or documents can be easily added or attached to the document, in much the same way as gummed notes can be stuck to a letter, adding comments or listing outstanding actions to be completed. These notes can be in the form of text or voice and will not add to or alter the original scanned image.

Store


On completion of the indexing routine the index record is transferred to a database on a fixed magnetic disk drive (hard drive). The image data is usually transferred to an optical or external large capacity storage device.

The saving of scanned image data is the equivalent of the user storing the hard copy of a document in a filing cabinet.

With most DIP systems the hard drive is used as a temporary buffer store prior to the image data being transferred to removable media such as an optical drive. This transfer is usually transparent to the user and the path will have been configured during installation.

EFS and smaller DIP systems use off-line media, which need to be reloaded when required, or fixed drives of up to 10 Gbyte capacity, allocated for image storage only. The hard drive database index will identify the correct re-loadable media or storage device (defined during indexing). Larger, more complex systems use auto-changers such as optical jukeboxes and tape library systems.

The storage required by the scanned image varies and depends upon such factors as scanner resolution, page size, compression ratio and page content. Typically, using group IV compression, a 1 MB storage device will hold 32 A4 text pages scanned, using line-art mode, at 200 dpi. The same capacity will store just 14 of the same pages at 300 dpi, and only 8 pages at 400 dpi.

To put this into perspective, the contents of two three-drawer filing cabinets, filled with the same typical pages as above, can be stored as scanned images on a single 620 MB disk. The floor space for the filing cabinets containing the hard copy is approximately 24 square feet (2.5 sq. metres), including opening and viewing space.
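The page counts implied by these figures can be worked out directly. The sketch below simply multiplies the pages-per-megabyte figures quoted above by the disk capacity; the names are illustrative, and real capacity will vary with page content.

```python
# Pages per megabyte quoted in the text for A4 line-art text pages
# with group IV compression, keyed by scan resolution in dpi.
PAGES_PER_MB = {200: 32, 300: 14, 400: 8}

def pages_on_disk(capacity_mb, dpi):
    """Estimate how many typical A4 text pages fit on a disk."""
    return capacity_mb * PAGES_PER_MB[dpi]

for dpi in sorted(PAGES_PER_MB):
    print(f"{dpi} dpi: about {pages_on_disk(620, dpi)} pages on a 620 megabyte disk")
```

At 200 dpi this gives just under 20,000 pages, which is the order of magnitude the filing cabinet comparison relies on.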

Retrieve


In the hard copy filing cabinet situation problems arise when clerical staff change or if posts are reorganised. A new clerk may find his or her predecessor's filing system difficult to work with.

The retrieval process in a DIP solution is equivalent to the user trying to find a document in a hard copy filing system.

The DIP electronic filing system addresses the filing cabinet problems by providing templates, based on the index fields, for the search and retrieval of stored electronic data. Even someone new to the electronic filing system will be able to understand and adhere to the established indexing and retrieval methods. Filing templates facilitate a consistent index format and easy identification and retrieval of scanned material.

The retrieval of a document starts with a search of the database. The user defines a number of search parameters; the more fields used, the greater the chance of finding a particular document with a single search. Boolean (and, or, not) operators may also be used to further refine the search pattern.
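A minimal sketch of such a search is shown below, using an in-memory list in place of a real index database. The records, field names and search function are all hypothetical; a DIP package would normally express the same conditions through its search template or query language.

```python
# Made-up index records; a real system would hold these in its database.
records = [
    {"id": 1, "author": "Smith", "department": "Sales", "keywords": {"invoice", "overdue"}},
    {"id": 2, "author": "Jones", "department": "Sales", "keywords": {"contract"}},
    {"id": 3, "author": "Smith", "department": "Legal", "keywords": {"contract", "renewal"}},
]

def search(records, predicate):
    """Return the ids of all records for which the predicate is true."""
    return [r["id"] for r in records if predicate(r)]

# One field only: every document written by Smith.
by_author = search(records, lambda r: r["author"] == "Smith")

# AND with NOT: written by Smith, but not filed under Sales.
smith_not_sales = search(
    records, lambda r: r["author"] == "Smith" and not r["department"] == "Sales"
)

# OR: anything indexed under either keyword.
contract_or_invoice = search(
    records, lambda r: "contract" in r["keywords"] or "invoice" in r["keywords"]
)

print(by_author, smith_not_sales, contract_or_invoice)
```

Searching on the author alone matches two documents, while adding a second condition narrows the result to one, which is the effect of using more fields in a single search.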

Once the required database records have been identified their associated document images can quickly be retrieved from the image storage device for display or printing. The image storage device can be either remote or local to the retrieval workstation.

The Impact on Business Efficiency


Document Image Processing or Document Management can justifiably be claimed to radically affect the way in which any company carries out its business by causing managers to review their workflow practices. This is not just the case in an enterprise-wide system; issues such as filing, re-filing, preventing loss of information due to removal or theft, and wear on frequently referenced documents are critical even in local systems.

 

 

 


© Copyright 1993-96, Fujitsu Europe Limited