ABBYY XML to TEI XML

Background

Currently we have successfully taken Dublin Core and DjVu XML files and created eleven exemplar marked up texts. These are basic TEI XML files, without any semantic enhancement. They serve as our baseline documents.

Unfortunately, from our point of view, DjVu discards much information that is useful to us. With DjVu, we are left solely with text content and the co-ordinates of the blocks that the text has been divided into by the OCR process. This is the same problem that undermined Lu's work; block co-ordinates alone are not sufficient to identify concepts and boundaries within source XML.

Taxonomists make use of three typographical cues to identify concepts and boundaries: font size, bold and italic. While these are missing from DjVu, they are present in ABBYY XML as the fs="", bold="true" and italic="true" attributes of the <formatting> element. Therefore, to ensure these typographical cues are available to us, we need to work with ABBYY XML. However, recording this extra information in ABBYY XML is not without problems.

The scale of the problem

ABBYY XML files are very large because much detail is recorded about each character. Two files are attached to this page to demonstrate the scale of the problem.

A line of the file was chosen at random. Line 81 of our source text, bulletinofbritis51entolond_abbyy.xml, states that the Bulletin is published in four specialist series. The raw text is 202 characters long and can be seen in the attached file, abbyyxml_line81_actually_says.txt. ABBYY XML uses 45,533 characters to encode this one line, roughly 225 characters of markup for every character of text, as can be seen in the attached file, abbyyxml_line81.txt. There are 84,263 lines in the source XML.

As noted, this level of detail means that ABBYY XML files are very large; bulletinofbritis51entolond_abbyy.xml, for example, is 237 MB. This size renders the files well-nigh unusable. No normal XML editor can load such large files, nor can plain text editors. The problem extends to specialist programming editors such as Notepad++. Therefore, we need to adopt a different approach to processing such files.

Using PHP's XMLReader to process one node at a time

The only approach to manipulating files that are too large to be loaded into memory in their entirety is to load them a section at a time. In this case the obvious choice for us, given our existing use of PHP, is PHP's XMLReader. This is an XML parser that reads the XML file one node at a time, thereby reducing the memory demands to a manageable level.

The script works by reading each node from the source file, testing the node type and name in an if/else cascade, then processing accordingly. The script takes approximately ten seconds to process our 200 MB files.
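The node-at-a-time loop described above can be sketched as follows. This is a minimal illustration, not the attached conversion script: the function name, the summary it returns and the tiny sample document are invented for the example, and only the formatting element with its bold and italic attributes is taken from this page.

```php
<?php
// Minimal sketch: stream an ABBYY-style file node by node with XMLReader.
// summariseFormatting() and the sample XML below are hypothetical.

function summariseFormatting(string $path): array
{
    $reader = new XMLReader();
    if (!$reader->open($path)) {
        throw new RuntimeException("Cannot open $path");
    }

    $summary = ['formatting' => 0, 'bold' => 0, 'italic' => 0, 'text' => ''];

    // read() advances one node at a time, so the whole file is never in memory
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'formatting') {
            $summary['formatting']++;
            if ($reader->getAttribute('bold') === 'true') {
                $summary['bold']++;
            }
            if ($reader->getAttribute('italic') === 'true') {
                $summary['italic']++;
            }
        } elseif ($reader->nodeType === XMLReader::TEXT) {
            $summary['text'] .= $reader->value;
        }
        // every other node type is ignored in this sketch
    }

    $reader->close();
    return $summary;
}

// Usage with a small invented sample written to a temporary file
$sample = '<page><line><formatting fs="9" bold="true">Bulletin</formatting>'
        . '<formatting fs="9" italic="true"> of the</formatting></line></page>';
$path = tempnam(sys_get_temp_dir(), 'abbyy');
file_put_contents($path, $sample);
$summary = summariseFormatting($path);
unlink($path);
print_r($summary);
```

A real conversion would write TEI output inside each branch rather than counting; the point here is only the shape of the read-and-test loop.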

Using PHP's built-in string functions to process one line at a time

An alternative approach is to treat the file as a text file, which it is after all. We can process the file one line at a time, so avoiding any memory problems, and still exploit the XML structure by matching the tag names directly instead of going through an XML parser. Doing that with PHP's built-in string functions might be faster than parsing the text as XML. The final optimised script takes around ten seconds to convert 200 MB ABBYY XML files to TEI XML, which is comparable to the XMLReader version. I have not conducted detailed timing tests, but may do so later in the project if time allows.

The text based code is more complex. The complexity arises because we are presented with a line, which might contain multiple items of data we need to capture. In contrast, XMLReader presents us with a node, which can contain only one item of data. Therefore, testing for relevant data is far simpler in the XMLReader script.

There is one advantage to the text based approach, however, in that it can process invalid XML. So, should we want to split the ABBYY source into smaller pieces for whatever reason, and break the XML in the process by losing closing tags for example, we can still process the smaller, invalid file. The text based script may also prove useful to later projects that have similar XML conversion needs, and for which XMLReader may not be a suitable tool. Therefore, the text based script is included in this page as working example code.

Wherever possible, PHP's built-in functions have been used as they are the fastest means of processing string data. Only one regular expression has been used, and that was because it is the most efficient way to achieve the tricky task of retrieving the font size value from the formatting element's fs attribute. In all other cases, PHP's strpos and substr have been used.
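As an illustration of this mix of string functions and a single regular expression, here is a minimal sketch. The function names, the regular expression and the sample line are assumptions made for the example and are not copied from ABBYY_to_TEI.php_.txt.

```php
<?php
// Sketch of line-based extraction; all names and the sample line are hypothetical.

// Return the fs value of the first <formatting ...> tag on the line, or null.
function fontSizeFromLine(string $line): ?string
{
    // Cheap strpos test first, so the regular expression only runs on relevant lines
    if (strpos($line, '<formatting') === false) {
        return null;
    }
    return preg_match('/\bfs="([^"]*)"/', $line, $m) === 1 ? $m[1] : null;
}

// True if the first formatting tag on the line carries e.g. bold="true".
function hasFlag(string $line, string $flag): bool
{
    $start = strpos($line, '<formatting');
    if ($start === false) {
        return false;
    }
    $end = strpos($line, '>', $start);
    if ($end === false) {
        return false;
    }
    // Cut out just the opening tag, then look for the attribute inside it
    $tag = substr($line, $start, $end - $start + 1);
    return strpos($tag, $flag . '="true"') !== false;
}

$line = '<formatting lang="EnglishUnitedKingdom" fs="8.5" bold="true">';
var_dump(fontSizeFromLine($line));   // string(3) "8.5"
var_dump(hasFlag($line, 'bold'));    // bool(true)
var_dump(hasFlag($line, 'italic'));  // bool(false)
```

Because nothing here parses the document as a whole, the same functions work on a fragment with missing closing tags.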

During the optimisation various utilities were written, one of which is attached here as it may be of general utility when analysing other XML files. It is find_first_five.php_.txt. This PHP script retrieves the first five characters of every line in the XML file and then produces a simple list of the values found and their frequency. This utility was used to confirm which elements needed to be processed by the text based conversion script, which has to locate named elements explicitly in each line.
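The idea behind such a utility can be sketched in a few lines. This is a reconstruction from the description above, not the attached file: the function name is hypothetical, and whether the original trims leading whitespace before taking the five characters is not stated, so this sketch takes each line literally.

```php
<?php
// Sketch: count how often each five-character line prefix occurs.
// firstFiveFrequencies() is a hypothetical name.

function firstFiveFrequencies(string $path): array
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }

    $counts = [];
    // fgets reads one line at a time, so file size is not a concern
    while (($line = fgets($handle)) !== false) {
        $key = (string) substr(rtrim($line, "\r\n"), 0, 5);
        $counts[$key] = ($counts[$key] ?? 0) + 1;
    }
    fclose($handle);

    arsort($counts); // most frequent prefix first
    return $counts;
}

// Usage with a small invented sample
$path = tempnam(sys_get_temp_dir(), 'five');
file_put_contents($path, "<page>\n<line>\n</line>\n<line>\n</line>\n</page>\n");
$counts = firstFiveFrequencies($path);
unlink($path);
print_r($counts);
```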

Concatenated string output

Another optimisation concerns the way output lines are written. This optimisation applies to both scripts. There are three alternatives to writing output:

  1. use multiple fwrite statements to build up each output line - relatively slow, as you might expect when the hardware is called each time there is some output to write
  2. concatenate the output as a string and use one fwrite statement for each line - faster than option 1, but can be slow because of the way PHP rewrites the whole string each time data is added to it
  3. collect the output in an array and use one fwrite statement for each line - this can be the fastest way to handle complex output lines such as the ones we are creating, with many TEI elements on one line; each part of the final line is added to an array as it is created, then the array is converted to a string (using implode) and that string is written out

Tests showed that in our case option 2 was the fastest, because for each new line the string is simply overwritten whereas the array has the extra overhead of an explicit unset to clear it before reuse. Therefore, the final script uses string concatenation to build up output lines.
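Options 2 and 3 can be compared side by side in a sketch like the one below. The helper names and the sample parts are hypothetical, and the output goes to an in-memory stream rather than a real file so the example is self-contained.

```php
<?php
// Sketch of option 2 (string concatenation) and option 3 (array plus implode).
// Both helpers issue a single fwrite per output line.

function writeLineByString($out, array $parts): void
{
    $line = '';                 // starting from a fresh string needs no explicit reset
    foreach ($parts as $part) {
        $line .= $part;
    }
    fwrite($out, $line . "\n");
}

function writeLineByArray($out, array $parts): void
{
    $buffer = [];
    foreach ($parts as $part) {
        $buffer[] = $part;      // collect the pieces as they are created
    }
    fwrite($out, implode('', $buffer) . "\n");
}

$parts = ['<hi rend="bold">', 'Bulletin', '</hi>'];
$out = fopen('php://memory', 'w+');
writeLineByString($out, $parts);
writeLineByArray($out, $parts);
rewind($out);
$written = stream_get_contents($out);
fclose($out);
echo $written;
```

Both helpers produce the same bytes, so the choice between them is purely one of speed. Timing each variant in a loop over real data, for example with microtime(true), is the only reliable way to see which is faster in a given setup.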

The final scripts

The scripts produce effectively identical output; see below.

The XMLReader conversion script is ABBYY_to_TEI_by_XMLReader.php_.txt, with explanatory notes in ABBYY_to_TEI_by_XMLReader_notes.odt.

The text based conversion script is ABBYY_to_TEI.php_.txt, with explanatory notes in ABBYY_to_TEI_notes.odt.


Other notes

Sample TEI file

See bulletinofbritis51entolond_abbyy_to_tei_by_xmlreader.xml_.txt for an example file converted to TEI with typographical mark up.

teiHeader data

The final scripts make use of already converted Dublin Core data. The conversion from Dublin Core to TEI was made as part of an earlier workflow in this project based on DjVu XML source for our main document text. In that workflow the Dublin Core XML was converted into a transition XSL that was used by the XSL that converted the DjVu XML into TEI XML. Rather than revisit this work, the scripts simply read the appropriate lines out of the already converted Dublin Core file in the transition XSL. This makes for a slightly cumbersome process but it works.

If there is time at the end of the ABLE project I propose to revisit this two stage workflow so as to integrate the Dublin Core conversion directly into the ABBYY conversion process.

Schema comparison

See ABBYYtoTEI_notes.odt for a table mapping elements between the two schemas.

The document also includes an explanation of the design choice in mapping fontsize data and of encoding proper name data.

Almost identical output from the two scripts

The only difference between the output of the two conversion scripts is how they handle the single quote entity.

The XMLReader conversion script writes single quotes as &#039;. The entity &apos; becomes ' when read from the source file. Unfortunately we cannot write ' to the output because it is not standard XML. However, the standard PHP routine to convert text characters back to entities converts ' to &#039;. It doesn't seem worth the effort to correct this to &apos;, as the entity is valid XML and any later processing to make the text understandable to a human reader is going to convert all entities to text anyway.

The text based conversion script writes single quotes as &apos; because that is how they are encoded in the ABBYY source XML and the text based script does a direct copy.

In other words,

  • the source text Museum&apos;s resources
  • is converted by the XMLReader script to Museum&#039;s resources
  • but retained by the text based script as Museum&apos;s resources
  • which, in both cases, is rendered into text as Museum's resources.

The four other predefined XML entities (&amp;, &lt;, &gt; and &quot;) that XMLReader converts to their character equivalents are all converted back to the same entity encoding they originally had. Therefore, this problem does not affect them.
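The behaviour can be reproduced in isolation. The sketch below assumes the routine in question is htmlspecialchars, which this page does not name; with the ENT_HTML401 flag it writes the numeric form, and with ENT_XML1 it writes the named form.

```php
<?php
// Sketch: two ways PHP can re-encode a single quote.
// Assumes htmlspecialchars is the routine involved (not confirmed by this page).

$text = "Museum's resources";

// Numeric form, as written by the XMLReader script
$numeric = htmlspecialchars($text, ENT_QUOTES | ENT_HTML401);

// Named form, matching the ABBYY source
$named = htmlspecialchars($text, ENT_QUOTES | ENT_XML1);

echo $numeric, "\n"; // Museum&#039;s resources
echo $named, "\n";   // Museum&apos;s resources

// Both forms decode to the same text
$fromNumeric = html_entity_decode($numeric, ENT_QUOTES | ENT_XML1);
$fromNamed   = html_entity_decode($named, ENT_QUOTES | ENT_XML1);
var_dump($fromNumeric === $fromNamed); // bool(true)
```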

On TEI's hi rend=""

Brown University code all the formatting attributes within the rend string, making use of brackets to separate them. They call these rendition ladders. However, they have not created a DTD to support this use of the rend attribute. Furthermore, rendition ladders break both specific TEI and general XML guidelines. They also complicate any conversion of the XML into another schema. Do not be tempted to follow Brown’s example. See http://www.wwp.brown.edu/encoding/guide/html/about.html, if you must!

Attachment                                                     Size
abbyyxml_line81_actually_says.txt                              202 bytes
abbyyxml_line81.txt                                            44.47 KB
find_first_five.php_.txt                                       855 bytes
ABBYYtoTEI_notes.odt                                           22.08 KB
ABBYY_to_TEI.php_.txt                                          4.42 KB
ABBYY_to_TEI_by_XMLReader.php_.txt                             4.08 KB
ABBYY_to_TEI_by_XMLReader_notes.odt                            24.78 KB
ABBYY_to_TEI_notes.odt                                         24.94 KB
bulletinofbritis51entolond_abbyy_to_tei_by_xmlreader.xml_.txt  1.96 MB

Comments

Mark up of font size information on change

As first written, the TEI output contained every reference to font size in the ABBYY source. This has been changed in the current version of the conversion scripts so that font size is written only when it changes. This makes the final XML file considerably smaller without loss of information.
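A minimal sketch of the write-on-change idea follows. The function name, the shape of the input and the rend value format (fs followed by the size) are all assumptions made for illustration; the attached scripts may encode the size differently. The text is assumed to be already entity-escaped.

```php
<?php
// Sketch: open a new <hi> only when the font size differs from the previous run.
// markFontSizeChanges() and the "fs9" style rend values are hypothetical.

function markFontSizeChanges(array $runs): string
{
    $out = '';
    $current = null; // font size of the <hi> that is currently open, if any

    foreach ($runs as [$size, $text]) {
        if ($size !== $current) {
            if ($current !== null) {
                $out .= '</hi>';            // close the previous size
            }
            $out .= '<hi rend="fs' . $size . '">';
            $current = $size;
        }
        $out .= $text;                      // same size: just append the text
    }

    if ($current !== null) {
        $out .= '</hi>';
    }
    return $out;
}

$runs = [
    ['9', 'Bulletin '],
    ['9', 'of the '],
    ['12', 'British Museum'],
];
$tei = markFontSizeChanges($runs);
echo $tei, "\n";
// <hi rend="fs9">Bulletin of the </hi><hi rend="fs12">British Museum</hi>
```

Two consecutive runs at the same size share one element, which is where the saving in file size comes from.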

Mark up of font size information

Font size is a possible cue for article boundaries and the like, so should be recorded to support later enhancements. Note, font size is not used in taXMLit.

Currently, I am using the <hi> element's rend attribute to encode actual font size. This is valid TEI, because the contents of the rend attribute can be anything.

Alternative approaches:

- use a relative size (smaller, medium, larger, etc.) because that is how font size is used as a cue; however, why lose the full information? We can discard the absolute value later, perhaps as part of a taXMLit conversion routine, but within the scope of the ABLE project we are recording the information in full.

- use a dedicated font size element instead of <hi>; but that would need a customised TEI DTD. It would seem appropriate to customise the DCL TEI DTD to keep everything in one place, and this would be needed if the DCL TEI->taXMLit conversion programs are to be used later on.

I recommend continuing with the rend approach, because it is the most complete record of the font size information, yet is flexibly presented so as to be immediately usable and has minimum demands for supporting changes.

Exemplar documents

All eleven exemplar documents are available in TEI format with typographical information for people able to get within the OU's firewall at \\penelope\MCSUsers\MCS-Groups\Computing\ABLE\BoB\wip.
