SIENA

Stimulation Initiative for European Neural Applications

Esprit Project 9811

Case Studies of Successful Applications

Neural OCR Processing of Social Security Forms

Company Background

CENDAR (Centro de Tesorería de la Seguridad Social) is the branch of Spain's Social Security in charge of collecting every company contribution and processing the associated forms. As such, it is the main centre concerned with revenue collection and therefore plays a crucial role in the financial administration of the various Social Security services.

Description of the problem

Every Spanish company must submit several forms each month reflecting its payroll and the resulting contributions it has to make to the Social Security for the workers it employs. From the point of view of revenue collection, the main form to be processed is the so-called TC-1 form, which summarises a company's number of workers, their salaries and the amounts deducted from these to meet mandatory Social Security payments. This implies the monthly processing of more than 2 million such forms. Moreover, OCR is only the starting point of the whole TC-1 processing: it has to be completed in about 12 days, since several other operations have to be performed on the extracted information. Since a typical TC-1 form has about 15 fields filled in, with an average of 130 characters per form, recognition speed is also a crucial factor for any successful system.
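The speed requirement can be made concrete with a rough calculation. The sketch below uses only the figures quoted above (2 million forms, 130 characters per form, a 12-day window), treating "more than 2 million" and "about 12 days" as round numbers. The assumption of round-the-clock operation (`HOURS_PER_DAY = 24`) is ours, not the source's, so the per-second rates are a lower bound on what the system must sustain.

```python
# Back-of-the-envelope load estimate from the figures quoted in the text.
FORMS_PER_MONTH = 2_000_000   # "more than 2 million": treated as a lower bound
CHARS_PER_FORM = 130          # average characters per TC-1 form
WINDOW_DAYS = 12              # "about 12 days" available for OCR
HOURS_PER_DAY = 24            # ASSUMPTION: continuous operation


def required_throughput(forms, chars_per_form, days, hours_per_day):
    """Return the sustained rates needed to finish within the window."""
    seconds = days * hours_per_day * 3600
    total_chars = forms * chars_per_form
    return {
        "total_chars": total_chars,
        "forms_per_day": forms / days,
        "forms_per_second": forms / seconds,
        "chars_per_second": total_chars / seconds,
    }


load = required_throughput(FORMS_PER_MONTH, CHARS_PER_FORM,
                           WINDOW_DAYS, HOURS_PER_DAY)
for name, value in load.items():
    print(f"{name}: {value:,.2f}")
```

Under these assumptions the system has to read 260 million characters per month, which works out to roughly two forms (about 250 characters) per second, sustained over the whole window. With shorter working days the required rate rises proportionally.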

Application of Neural Network Techniques

The sheer number of TC-1 documents to be processed every month makes an automated approach mandatory. This is particularly crucial for the first step in that process, the conversion of their printed information into electronic form. Neural-based products are becoming the tools of choice for large OCR applications. Of course, forms such as the TC-1 have to be processed in an extremely accurate fashion: they are closely analysed to ensure that the proper revenues are collected and to detect possible discrepancies between the revenue information reflected in the TC-1 forms and the actual payments. In the case of CENDAR, a system jointly developed by KEON and IIC is currently in use, yielding excellent recognition rates: more than 75 % of the TC-1 fields are correctly recognised, and more than 30 % of the documents are processed in a fully automated way. The system is able to meet its processing load running on six medium-level UNIX machines.
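To give a sense of what six machines have to absorb, the following sketch divides the monthly volume across them. It assumes the load is split evenly between machines and spread evenly over a 12-day window; neither assumption is stated in the source, and the form and character counts are the approximate figures reported for the TC-1.

```python
# Rough per-machine share of the monthly TC-1 load.
FORMS_PER_MONTH = 2_000_000   # approximate monthly volume from the text
CHARS_PER_FORM = 130          # average characters per form
MACHINES = 6                  # UNIX machines reported in the text
WINDOW_DAYS = 12              # processing window reported in the text


def per_machine_daily_load(forms, chars_per_form, machines, days):
    """Forms and characters per machine per day, ASSUMING an even split."""
    forms_per_machine_day = forms / (machines * days)
    return forms_per_machine_day, forms_per_machine_day * chars_per_form


forms_day, chars_day = per_machine_daily_load(
    FORMS_PER_MONTH, CHARS_PER_FORM, MACHINES, WINDOW_DAYS)
print(f"forms per machine per day: {forms_day:,.0f}")
print(f"characters per machine per day: {chars_day:,.0f}")
```

On these assumptions each machine handles on the order of 28,000 forms, or about 3.6 million characters, per day.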

Benefits

The KEON-IIC system ensures prompt and precise processing of the TC-1 information. Of course, not all forms can be processed automatically (this is never the case in any OCR application, and especially not in those involving very large numbers of forms from many different original sources). The system's recognition rates nevertheless greatly reduce the number of forms to be processed manually. The resulting savings in time and money for CENDAR are substantial.
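The reduction in manual work can be estimated from the reported rates. The sketch below treats "more than 75 %" and "more than 30 %" as lower bounds, so the results are upper bounds on what is left for manual handling. It also assumes that every form carries exactly 15 fields, whereas the source only says "about 15".

```python
# Upper bounds on the manual workload implied by the reported rates.
FORMS_PER_MONTH = 2_000_000
FIELDS_PER_FORM = 15          # ASSUMPTION: "about 15" taken as exactly 15
FIELD_RATE_PERCENT = 75       # fields correctly recognised (lower bound)
AUTO_DOC_PERCENT = 30         # documents fully automated (lower bound)

total_fields = FORMS_PER_MONTH * FIELDS_PER_FORM

# Integer arithmetic keeps the counts exact.
manual_forms_max = FORMS_PER_MONTH * (100 - AUTO_DOC_PERCENT) // 100
manual_fields_max = total_fields * (100 - FIELD_RATE_PERCENT) // 100

print(f"forms needing some manual work: at most {manual_forms_max:,}")
print(f"fields needing manual entry:    at most {manual_fields_max:,} "
      f"of {total_fields:,}")
```

In other words, at least 600,000 forms per month need no human intervention at all, and in the remaining forms operators only have to deal with the fields the system could not read, rather than keying in all 30 million fields.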

Generalisation

The OCR application described here is typical of the requirements of large-scale OCR applications. They must combine a powerful individual character recognition tool (neural in this case) with a rather sophisticated form navigation system that enables the system to locate the different form fields that actually contain printed information, accurately select the relevant rasters and correctly segment them into their constituent characters. This combination of a universal recognition tool with a tailored module for selecting and segmenting printed information offers a viable OCR procedure for large-scale general document processing.
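The division of labour described above can be illustrated with a toy pipeline. This is a minimal sketch, not the KEON-IIC implementation: the source does not describe how fields are located or segmented, so segmentation here uses a simple blank-column split, and the character recogniser is an exact-match lookup table standing in for the neural classifier. All names (`has_ink`, `segment_characters`, `read_field`, `process_form`, `toy_classify`) and the 3-pixel-high glyphs are hypothetical.

```python
from typing import Callable, Dict, List

Raster = List[List[int]]  # rows of pixels, 1 = ink, 0 = background


def has_ink(raster: Raster) -> bool:
    """Form navigation step: does this field contain printed information?"""
    return any(any(row) for row in raster)


def segment_characters(raster: Raster) -> List[Raster]:
    """Split a field raster into character rasters at blank columns."""
    if not raster:
        return []
    width = len(raster[0])
    inked = [any(row[x] for row in raster) for x in range(width)]
    segments: List[Raster] = []
    start = None
    for x, on in enumerate(inked + [False]):  # sentinel closes the last run
        if on and start is None:
            start = x
        elif not on and start is not None:
            segments.append([row[start:x] for row in raster])
            start = None
    return segments


def read_field(raster: Raster, classify: Callable[[Raster], str]) -> str:
    """Skip empty fields; otherwise segment and classify each character."""
    if not has_ink(raster):
        return ""
    return "".join(classify(ch) for ch in segment_characters(raster))


def process_form(fields: Dict[str, Raster],
                 classify: Callable[[Raster], str]) -> Dict[str, str]:
    return {name: read_field(r, classify) for name, r in fields.items()}


# Stand-in for the neural recogniser: exact lookup of two toy glyphs.
TEMPLATES = {
    ((1,), (1,), (1,)): "1",
    ((1, 1), (0, 1), (0, 1)): "7",
}


def toy_classify(ch: Raster) -> str:
    return TEMPLATES.get(tuple(tuple(row) for row in ch), "?")


form = {
    "workers": [[1, 0, 1, 1],
                [1, 0, 0, 1],
                [1, 0, 0, 1]],
    "unused": [[0, 0], [0, 0], [0, 0]],
}
result = process_form(form, toy_classify)
print(result)
```

The point of the structure is that `classify` is passed in as a parameter: the recogniser is generic, while field location and segmentation are the parts that would be tailored to each form layout.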

Contact persons

Luis Pelayo, CENDAR, c/Alcuñeza s/n - 28850 Madrid Spain.

Alberto Pérez, Instituto de Ingeniería del Conocimiento - IIC Universidad Autónoma de Madrid - Módulo C-XVI planta 2 - 28049 Madrid - Spain - Phone: +34 1 397 39 73; Fax: +34 1 397 39 72; E-mail: alberto@irene.iic.uam.es.