Stimulation Initiative for European Neural Applications
Esprit Project 9811
Neural OCR Processing of Social Security Forms
Company Background
CENDAR (Centro de Tesorería de la Seguridad Social) is the branch of
Spain's Social Security in charge of collecting every company
contribution and processing the associated forms. As such, it is the
main centre concerned with revenue collection and therefore plays a
crucial role in the financial administration of the various Social
Security services.
Description of the problem
Every month, each Spanish company has to submit several forms
reflecting its payroll and the resulting contributions it has to make
to the Social Security for the workers it employs. From the point of
view of revenue collection, the main form to be processed is the
so-called TC-1 form, which contains a summary of a company's number of
workers, their salaries and the amounts deducted from these in order
to meet mandatory Social Security payments. This implies the monthly
processing of more than 2 million such forms. Moreover, OCRing these
documents is just the starting point of the whole TC-1 processing: it
has to be completed in about 12 days, since several other operations
have to be performed with the extracted information. Since a typical
TC-1 form has about 15 fields filled in, with an average of 130
characters per form, recognition speed is also a crucial factor for
any successful system.
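To make the speed requirement concrete, the figures above can be turned
into a rough sustained throughput. The short Python sketch below does
only that arithmetic. It assumes round-the-clock operation during the
12 days, which the text does not state, and the function and constant
names are illustrative.

```python
# Figures quoted in the text (approximate thresholds and averages).
FORMS_PER_MONTH = 2_000_000   # "more than 2 million" TC-1 forms
PROCESSING_DAYS = 12          # "about 12 days" available for OCR
CHARS_PER_FORM = 130          # average characters per form

# Assumption (not from the source): the machines run 24 hours a day.
HOURS_PER_DAY = 24


def required_throughput(forms, days, chars_per_form, hours_per_day):
    """Return (forms per second, characters per second) needed to finish on time."""
    seconds = days * hours_per_day * 3600
    forms_per_second = forms / seconds
    return forms_per_second, forms_per_second * chars_per_form


forms_s, chars_s = required_throughput(
    FORMS_PER_MONTH, PROCESSING_DAYS, CHARS_PER_FORM, HOURS_PER_DAY
)
print(f"{forms_s:.2f} forms/s, {chars_s:.0f} characters/s")
# prints "1.93 forms/s, 251 characters/s"
```

With a shorter working day the required rate rises proportionally; for
instance, setting HOURS_PER_DAY to 8 triples both figures.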
Neural Network Techniques application
The sheer number of TC-1 documents to be processed every month makes
an automated approach mandatory. This is particularly crucial for the
first step of that process, the conversion of their printed
information into electronic form. Neural-based products are becoming
the tools of choice for large OCR applications. Of course, forms such
as TC-1 have to be processed in an extremely accurate fashion: they
are closely analysed to ensure that the proper revenues are collected
and to detect possible discrepancies between the revenue information
reflected in the TC-1 forms and the actual payments. In the case of
CENDAR, a system jointly developed by KEON and IIC is currently in
use, yielding excellent recognition rates: more than 75 % of the TC-1
fields are correctly recognised, and the rate of documents processed
in a fully automated way is above 30 %. The system is able to meet its
processing load running on six mid-range UNIX machines.
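As a rough illustration of what these rates mean in monthly volume, the
Python sketch below applies the quoted threshold values (75 % of
fields, 30 % of documents, six machines) to the form volume and field
count given earlier. It assumes the load is spread evenly over the
machines, which the text does not state, and all names are
illustrative. Because the quoted rates are lower bounds, the review
figures are worst-case estimates at those thresholds.

```python
# Figures quoted in the text, taken at their threshold values.
FORMS_PER_MONTH = 2_000_000   # "more than 2 million" forms
FIELDS_PER_FORM = 15          # "about 15 fields filled in"
FIELD_PCT = 75                # "more than 75 %" of fields recognised
AUTO_PCT = 30                 # "above 30 %" of documents fully automated
MACHINES = 6                  # six UNIX machines


def monthly_workload(forms, fields_per_form, field_pct, auto_pct, machines):
    """Estimate monthly volumes using integer arithmetic on percentages."""
    total_fields = forms * fields_per_form
    return {
        # Assumption: forms are distributed evenly across machines.
        "forms_per_machine": forms // machines,
        "forms_fully_automated": forms * auto_pct // 100,
        "forms_needing_review": forms * (100 - auto_pct) // 100,
        "fields_needing_review": total_fields * (100 - field_pct) // 100,
    }


work = monthly_workload(
    FORMS_PER_MONTH, FIELDS_PER_FORM, FIELD_PCT, AUTO_PCT, MACHINES
)
for name, value in work.items():
    print(f"{name}: {value:,}")
```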
Benefits
The KEON-IIC system ensures prompt and precise processing of the
TC-1 information. Of course, not all forms can be processed
automatically (this is never the case in any OCR application, and
especially not in those involving a very large number of forms from
many different original sources). The system's recognition rates
nevertheless greatly reduce the number of forms to be processed
manually. The savings in time and money that CENDAR obtains are
therefore substantial.
Generalisation
The OCR application described here is typical of the requirements of
large-scale OCR applications. They must combine a powerful individual
character recognition tool (neural in this case) with a rather
sophisticated form navigation system that enables the system to
locate the parts of each form that actually contain printed
information, accurately select the relevant rasters and correctly
segment them into their constituent characters. This combination of a
universal recognition tool with a tailored module for selecting and
segmenting printed information presents a viable OCR procedure for
large-scale general document processing.
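The segmentation step mentioned above can be illustrated with a
generic technique. The text does not describe how the KEON-IIC system
segments rasters, so the Python sketch below is not its method: it
shows plain vertical-projection segmentation, which splits a binary
field raster wherever a column contains no ink. The function name and
the toy raster are illustrative only.

```python
def segment_columns(raster):
    """Split a binary raster into character column spans.

    `raster` is a list of equal-length rows of 0/1 pixels (1 = ink).
    Returns a list of (start, end) column indices, end exclusive.
    """
    if not raster:
        return []
    width = len(raster[0])
    # Vertical projection: amount of ink in each column.
    ink = [sum(row[x] for row in raster) for x in range(width)]

    spans, start = [], None
    for x, amount in enumerate(ink):
        if amount and start is None:
            start = x                    # a character begins
        elif not amount and start is not None:
            spans.append((start, x))     # an empty column ends it
            start = None
    if start is not None:
        spans.append((start, width))     # character touching the right edge
    return spans


# Toy 3-row raster holding three separated glyphs.
field = [
    [1, 1, 0, 0, 1, 0, 1, 1, 1],
    [1, 0, 0, 0, 1, 0, 1, 0, 1],
    [1, 1, 0, 0, 1, 0, 1, 1, 1],
]
spans = segment_columns(field)
print(spans)  # prints [(0, 2), (4, 5), (6, 9)]
```

Each span would then be cropped and passed to the character
recogniser. Touching or overlapping characters defeat this simple
rule, which is one reason a tailored segmentation module is needed in
practice.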
Contact persons
Luis Pelayo, CENDAR, c/Alcuñeza s/n - 28850 Madrid - Spain.
Alberto Pérez, Instituto de Ingeniería del Conocimiento - IIC,
Universidad Autónoma de Madrid - Módulo C-XVI planta 2 - 28049 Madrid
- Spain.
Phone: +34 1 397 39 73; Fax: +34 1 397 39 72;
E-mail: alberto@irene.iic.uam.es.