FSU Libraries
 

Building the search engine for the Florida Administrative Code Database:

Introduction:

This page is targeted toward persons involved in records management who might want to use a similar method to digitize and index a record set.

The scanning process:

Pages were scanned on a Cannon Dr-5010C sheet fed scanner, using the native driver. The scanner was set so that for each sheet of paper in the binder, the scanner produced two pdfs of one page each, each pdf showing an image of a side of the page (front or back) at 300 dots-per-inch resolution in black and white.

In anticipation of building a database containing all pages, each scanned pdf of a page was given a unique name from every other scan at the time of creation. The name of the scanned pages was later used to derive the primary key for one of the tables in the database.

Optical Character Recognition:

Pages were processed using optical character recognition in Adobe Acrobat Professional 9.0. This program allows one to do batch processing of hundreds of documents at once. In order to do this (1) open Adobe Acrobat Professional, (2) select the menu option "Advanced" (3) select "Document Processing" in the drop down menu (4) select "Batch Processing" in the submenu (5) click the "New Sequence" button (6) give the sequence a name (7) click "Select Commands" (8) choose "Recognize Text Using OCR" (9) click the "Add" button.

Now run the batch processing on a set of files: (1) Place the set of pdfs in a single folder on your computer (2) in Adobe Acrobat Professional 9 select the menu option "Advanced" (3) select "Document Processing" in the drop down menu (4) select "Batch Processing" in the submenu (5) select the sequence to run (ie. the sequence you just created) (6) click "OK" (7) select the folder to process (ie. the folder with all the pdfs you want to run OCR on) (8) click the "Select" button (9) select the folder you would like the OCRed files to go into (10) “Run” the sequence. While the batch processing is running, you will be able to use your computer, but will not be able to use Adobe Acrobat to do anything else.

Troubleshooting: If you have trouble with Adobe Acrobat crashing during batch OCR, then check the pdfs you are processing to make sure that pages are right-side up.

Planning the Database:

Look at the organization of the document set, and plan out what will be required for search:

Before you begin to index the records, you should have a clear idea of what a perfect index would look like.

Once you have an idea of what a perfect index would look like, keep that perfect index in mind as you look at the pdfs and play with extracting a small set to Excel so you can look for patterns in that. (see the section on Automated Metadata Creation below) Try to determine what fields on your index might pop out of the page. For example, page number might always be the last thing on the page, and the last line in the spreadsheet.

Experiment with scripting on your sample spreadsheet and revisit your perfect index to determine whether automated indexing will capture enough information for the index to be useful. It is likely that human indexing will be required to supplement automated indexing, because some data will be illegible to OCR programs run on scanned documents and because some field may require data which does not occur with a predictable enough pattern to be captured by scripting. Revise your database based on what is needed for search and the availability of staff for supplemental metadata creation.

Automated Metadata Creation:

Extract text from pdfs to Excel:

A program called A-PDF to Excel was used to transfer text from pdfs into Excel spreadsheet format. This program is available at http://www.a-pdf.com/to-excel/download.htm A-PDF to Excel looks at the OCRed text in a pdf and allows one to have the full-text of each pdf printed to a new sheet in an Excel workbook. The A-PDF to Excel program requires that Microsoft Excel be installed on the computer.

Several other similar programs exist. What is important to recreate the method used here is that the program pulls textual data from a set of pdfs, and dumps that data into Excel in a way that is useful.

The set of files produced by extracting text from pdfs to Excel is available here: http://fsulawrc.com/excelVBAfiles/index.html Because the extractor could not extract all 30,000 pdfs to a single spreadsheet, pdfs were sorted into folders, and each folder was processed separately. Scripting was then run on the individual Excel files and the resulting indexes combined into a master index.

Use scripting to extract information and create an index:

Microsoft Office products support basic scripting. Microsoft Visual Basic Applications allows one to look for patterns in a document, and then extract strings of characters using pattern recognition. The largest entry barrier to scripting is getting started.

Before any scripting was begun, the pages were examined in order to determine what fields were necessary in order to build a search. These fields which will have a unique combination for every page were the Chapter Number, page number within that chapter, and supplement number (which tells the date on which the page was placed into the binder). Other pages necessary for search were: a unique identifier for each page, and the supplement number which superceded the page (which tells the date on which the page was removed from the binder). Also before beginning scripting, the Excel spreadsheet extractions of text were examined in order to spot patterns. Planning and articulating was must be done and is possible to do is more difficult than is scripting.

The following articles may be helpful in getting started on scripting:

The script used to extract fields necessary for search of the Florida Administrative Code is available at http://fsulawrc.com/excelVBAfiles/index.html#VBA (bottom of the page) or at http://fsulawrc.com/excelVBAfiles/VBAscriptForFAC.docx

Effectiveness of automated metadata creation in the Florida Administrative Code database:

The goal database schema is described below:

Page (uniquePage, suppNo, chNoAfterDash, pNo, supercededBy, imageName)

  • Primary Key: uniquePage
  • Foreign Key: suppNo references Supplement.suppNo

Rule (ruleNo, historyNotes)

  • Primary Key: ruleNo

RulePart (rulePartKey, uniquePage, rulePart)

  • Primary Key: rulePartKey
  • Foreign Key: uniquePage references Page.uniquePage
  • Foreign Key: rulePart references Rule.ruleNo

Supplement (suppNo, suppReleased, notesAboutSupp)

  • Primary Key: suppNo

Discussion:

The Supplement table was filled by hand. Fewer than 200 supplements were issued, and correctly matching a supplement number to a date is vital for knowing the date on which any given page was placed into or removed from the binders.

Rule and RulePart could not be filled using automated metadata generation, and are not filled at the present time.

The Page table was partially filled using the script and extracted Excel files at http://fsulawrc.com/excelVBAfiles/index.html After combining indexes into a single spreadsheet, and cleaning up the data by hand, the result was this spreadsheet which partially fills the values for the Page table.: http://fsulawrc.com/excelVBAfiles/FACFAWforimport.xls This automated metadata enabled an imperfect search. A demo of a search based solely on automated metadata may be found at http://fsulawrc.com/automatedindex.php This demo is intended to assist in assessing strengths and limitations of the automated metadata when supplemental human generated metadata is unavailable.

Because automated metadata creation left values for several fields blank in the Page table, human created metadata was required to fill these fields.

What this database contains: This database contains pages from the Florida Administrative Code printed by the Division of Elections between 1970 and 1983.

Alternative to search:

The Making of the Historic Florida Administrative Code database:

Related Resources: