Automating Capture: The Key to a Paperless Office

In any well-thought-out document management initiative, you need to take a good look at the documents being managed and determine the best way to minimize the effort required for their digital conversion. Scanning paper is no longer an issue, since high-speed scanning is readily available through multifunctional printers (MFPs) and high-speed, low-cost desktop scanners. The real challenge is how to “file” your documents so they can be found quickly and easily when needed. The ultimate goal of digital filing, or indexing as it is most commonly known, is to automate this process by extracting the index data through one or more capture techniques.

The challenge for many, however, is that extracting data from documents is often more of an art than a science. This is never truer than in high-volume transactional applications such as the management of Proof of Delivery Tickets, Packing Slips or Vendor Invoices. Do it right and your application will sing to the users with its ease of use. Choose the wrong capture approach and your work of art will have all of the appeal of a poorly executed paint-by-numbers project.

Three of the most common data extraction tools used by the document management community are Bar Code Recognition, Zonal Based Optical Character Recognition (OCR) and Database Look Ups. Each of these is a powerful tool with its own advantages and pitfalls. Knowing how to use them to your advantage can dramatically improve the returns you realize from your document management initiative.


3 Most Common Data Extraction Tools


Bar Code Recognition is one of the most widely used techniques because of its unparalleled degree of accuracy. Bar codes are considered old technology, but their simplicity is their strength. The only real challenge with bar codes is that they are less dynamic than the other approaches, since you typically have to incorporate them into your documents. As a result, bar code applications are most commonly reserved for documents that you create. Adding a bar code font is a simple process, and many good bar code fonts are available on the web at little or no cost. You can easily create Word documents or Excel spreadsheets that render a key field, such as a customer ID, as a bar code, or even embed the bar code in your AS/400 print stream.
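As a rough illustration of preparing a key field for a bar code font, the sketch below assumes a Code 39 font, which is a common choice but not one this article specifies. Code 39 wraps the data in asterisks as start/stop characters and supports an optional modulo-43 check character. The function name `code39_text` and the sample customer ID are hypothetical.

```python
# Code 39 character set in checksum order (values 0 through 42).
CODE39_CHARS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ-. $/+%"


def code39_text(key_field: str, with_check_digit: bool = False) -> str:
    """Return the text to format with a Code 39 font for one key field.

    Assumption: the font expects '*' as the start/stop character,
    which is the usual convention for Code 39 fonts.
    """
    value = key_field.strip().upper()
    if not value:
        raise ValueError("key field is empty")
    for ch in value:
        if ch not in CODE39_CHARS:
            raise ValueError(f"{ch!r} cannot be encoded in Code 39")
    if with_check_digit:
        # Optional modulo-43 check character: sum the character values.
        total = sum(CODE39_CHARS.index(ch) for ch in value)
        value += CODE39_CHARS[total % 43]
    return f"*{value}*"


if __name__ == "__main__":
    print(code39_text("cust-1042"))
```

The same concatenation can be done with a formula in a spreadsheet cell; the point is only that the key field is wrapped in start/stop characters before the font is applied.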

If you are managing documents coming from outside your organization, a bar code sticker or cover sheet is another option to consider. We see this a lot with sales order processing, because incoming orders have no consistent layout and the key field can be pulled directly from a line-of-business application such as QuickBooks or Great Plains to create the label.


Database Look Ups are another option to automate indexing as they allow you to re-use data that has already been captured, eliminating the need to perform data entry twice. Typically they would be used in conjunction with Bar Code Recognition or Optical Character Recognition to fully automate the capture process.

In the Sales Order example above, the first step would most often be to enter data about the order into the accounting or ERP system. During this process, a bar code containing a key piece of information, such as the order number, can easily be generated from the order entry screen, allowing the order processor to affix it as a sticker or cover sheet prior to capture. Periodically, the order processor would take the stack of completed Sales Orders with their bar codes to the multifunctional printer for scanning.

During the scanning process, the bar code would be read, capturing the key piece of data that we will use for the database lookup. At the same time, the bar code can automatically separate the batch of documents into individual records.
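The separation step can be sketched as follows. This is a minimal illustration, not any particular capture product's behavior: it assumes the scanner software hands back each page with the bar code value it found (or nothing), and that every page carrying a bar code starts a new record. The names `separate_batch` and `order_number` are hypothetical.

```python
from typing import List, Optional, Tuple

# Each scanned page is (page_number, barcode_value); barcode_value is None
# when no bar code was found on that page.
Page = Tuple[int, Optional[str]]


def separate_batch(pages: List[Page]) -> List[dict]:
    """Split a scanned batch into records, starting a new one at each bar code."""
    documents: List[dict] = []
    current = None
    for page_number, barcode in pages:
        if barcode is not None:
            # A bar code marks the first page of a new record.
            current = {"order_number": barcode, "pages": []}
            documents.append(current)
        elif current is None:
            # Pages scanned before any bar code go to an unindexed record.
            current = {"order_number": None, "pages": []}
            documents.append(current)
        current["pages"].append(page_number)
    return documents


if __name__ == "__main__":
    batch = [(1, "SO-1001"), (2, None), (3, "SO-1002"), (4, None), (5, None)]
    for doc in separate_batch(batch):
        print(doc)
```

This version keeps the bar-coded page in the record, which suits a sticker on the first page. With a separate cover sheet you would likely drop that page instead.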

Now that we have our documents scanned, separated and indexed with the order number, the database lookup can be utilized to pull over any additional information about the record such as customer name, terms, amounts, etc. Since all this information already resides in the accounting or ERP solution, it is a simple process to schedule a batch update using a database look up to complete your indexing.
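A database lookup of this kind might look like the sketch below. It uses an in-memory SQLite database purely as a stand-in for the accounting or ERP system; the table name `sales_orders`, its columns, and the sample row are all assumptions made for illustration.

```python
import sqlite3
from typing import Dict, Iterable, Optional


def complete_index(conn: sqlite3.Connection,
                   order_numbers: Iterable[str]) -> Dict[str, Optional[dict]]:
    """Look up the remaining index fields for each scanned order number."""
    results: Dict[str, Optional[dict]] = {}
    for order_number in order_numbers:
        row = conn.execute(
            "SELECT customer_name, terms, amount "
            "FROM sales_orders WHERE order_number = ?",
            (order_number,),
        ).fetchone()
        if row is None:
            # No match: leave the record for manual indexing.
            results[order_number] = None
        else:
            results[order_number] = {
                "customer_name": row[0],
                "terms": row[1],
                "amount": row[2],
            }
    return results


def demo_connection() -> sqlite3.Connection:
    """Build a throwaway database standing in for the ERP system."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales_orders ("
        "order_number TEXT PRIMARY KEY, customer_name TEXT, "
        "terms TEXT, amount REAL)"
    )
    conn.execute(
        "INSERT INTO sales_orders VALUES "
        "('SO-1001', 'Acme Supply', 'Net 30', 1250.0)"
    )
    return conn


if __name__ == "__main__":
    print(complete_index(demo_connection(), ["SO-1001", "SO-9999"]))
```

Order numbers that find no match are returned as unmatched rather than guessed at, so a scheduled batch update can flag them for an operator.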


Optical Character Recognition is a powerful option that offers the possibility of eliminating data entry altogether. The ability to convert a scanned image into high-value data that can be used to index your records is enticing, especially given recent trends in which organizations take the extracted data and feed it to their production accounting or ERP solutions to create transactions.

In order for this to happen, you first need to optimize the quality of the extracted data. One of the most common misconceptions about OCR is that it will deliver 100% accuracy. Even with the most sophisticated OCR engines, this simply is not possible due to the real world challenges presented in a production scanning environment. OCR should be viewed as a means to eliminate much of the work surrounding manual indexing by delivering results reflecting anywhere from 70-90% accuracy depending on the documents you are capturing. This realistic expectation should not be discounted as it still represents an enormous cost savings to OCR users.

Given these limitations, it’s critical to combine tools for optimizing the OCR results with an effective QA process for correcting any misreads.

There are several options available for improving OCR accuracy. One of the most common issues with OCR templates is that the paper frequently shifts during the scanning process while your template remains static. Consequently, the shift in paper results in your OCR zone extracting incorrect information. This can be addressed, however, through Page Registration. When you create an “Anchor Point” for your template, the system identifies any movement in the paper position and registers the zone accordingly. Page Registration allows you to create much tighter zones with higher rates of accuracy, especially in densely populated documents where there is little margin for error.
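The idea behind Page Registration can be shown with simple arithmetic. The sketch below assumes the capture software reports where it found the anchor point on the scanned page, and it handles only a horizontal and vertical shift, not rotation or skew. The `Zone` type, the function name and the pixel values are hypothetical.

```python
from typing import NamedTuple, Tuple


class Zone(NamedTuple):
    """An OCR zone in pixels, measured from the top-left of the page."""
    left: int
    top: int
    width: int
    height: int


def register_zone(zone: Zone,
                  template_anchor: Tuple[int, int],
                  found_anchor: Tuple[int, int]) -> Zone:
    """Move a template zone by the amount the anchor point has shifted."""
    dx = found_anchor[0] - template_anchor[0]
    dy = found_anchor[1] - template_anchor[1]
    # Only the position changes; the zone keeps its size.
    return Zone(zone.left + dx, zone.top + dy, zone.width, zone.height)


if __name__ == "__main__":
    invoice_number_zone = Zone(left=400, top=120, width=200, height=40)
    # The anchor was at (50, 50) on the template but found at (58, 44).
    print(register_zone(invoice_number_zone, (50, 50), (58, 44)))
```

Because the zone follows the anchor, it can be drawn tightly around the target field instead of being padded to absorb paper movement.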

Additional features for enhancing your OCR results include image enhancement tools that allow you to remove errant pixels or color backgrounds that might impede effective results. These tools commonly require extensive testing to optimize your results and are best implemented by a professional who is experienced with improving image quality.

Finally, your OCR capture process should include technology that reports a confidence level for what it has extracted and routes questionable results to an operator for manual quality assurance. This approach ensures the accuracy of your data while still eliminating most of the manual capture effort. In fact, we have seen many organizations go from four data entry operators to a single quality assurance administrator, reducing their capture costs by 75%.
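Confidence-based routing can be sketched in a few lines. The example assumes the OCR engine returns a confidence score between 0 and 1 for each extracted field; the 0.90 threshold, the `Extraction` type and the sample values are illustrative assumptions, and a real threshold would be tuned through testing on your own documents.

```python
from typing import List, NamedTuple, Tuple


class Extraction(NamedTuple):
    """One field value returned by the OCR engine (hypothetical shape)."""
    document_id: str
    field: str
    value: str
    confidence: float  # assumed to range from 0.0 to 1.0


def route_by_confidence(
    extractions: List[Extraction], threshold: float = 0.90
) -> Tuple[List[Extraction], List[Extraction]]:
    """Split results into auto-accepted values and values needing review."""
    accepted: List[Extraction] = []
    review: List[Extraction] = []
    for item in extractions:
        if item.confidence >= threshold:
            accepted.append(item)
        else:
            # Questionable reads go to the quality assurance operator.
            review.append(item)
    return accepted, review


if __name__ == "__main__":
    results = [
        Extraction("doc-1", "invoice_number", "INV-20417", 0.98),
        Extraction("doc-2", "invoice_number", "INV-2O418", 0.61),
    ]
    accepted, review = route_by_confidence(results)
    print(len(accepted), "accepted;", len(review), "sent to review")
```

Only the low-confidence reads reach a person, which is what lets a single quality assurance operator cover work that previously required full manual keying.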


We encourage you to explore ways to automate the capture of your documents. If you have questions on what might be the right approach for you, consult a professional for advice or contract them for assistance. The benefit an automated capture process provides will likely offset any investment while providing returns for years to come.