Applying paragraph types

Identifies paragraph types based on keywords.

Run apply_paragraph_types.php against the *_abbyy_to_tei_by_xmlreader.xml files.

Keywords are defined in a separate file, paragraph_keywords.php, to make changing them easier.

The script inserts comments with the proposed paragraph type.

All eleven exemplar documents have been marked up using the preferred comment approach (*_tei_annotated.xml).

One exemplar document, bulletinofbritis51entolond, has been marked up using the TEI <div> approach (*_tei_annotated_with_div.xml). See Mark up options and apply_paragraph_types_v1-1.php below.

Attachment                                Size
apply_paragraph_types.php_.txt            1.68 KB
paragraph_keywords.php_.txt               3.71 KB
apply_paragraph_types_v1-1.php_.txt       2.45 KB

Comments

Keywords

We use a fairly simple, robust paragraph type identification method: we test the paragraph's first word against a list of keywords. The keywords are stored in an associative array, so that each keyword has a paragraph type associated with it. So, if the first word is present in the list of keywords the associated paragraph type is used in the mark up.
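As an illustration only, the first-word lookup can be sketched as below. This is a Python stand-in, not the project's PHP: the dictionary PARAGRAPH_KEYWORDS and the function paragraph_type are hypothetical names, and the three entries are a small excerpt rather than the real contents of paragraph_keywords.php.

```python
# Hypothetical excerpt; the full list lives in paragraph_keywords.php.
PARAGRAPH_KEYWORDS = {
    "Female": "SameLanguageDescriptionParagraph",
    "Female:": "SameLanguageDescriptionParagraph",
    "Female.": "SameLanguageDescriptionParagraph",
}


def paragraph_type(text):
    """Return the type associated with the paragraph's first word, or None."""
    words = text.split()
    if not words:
        return None
    # Exact match: punctuation and case are part of the keyword.
    return PARAGRAPH_KEYWORDS.get(words[0])
```

Because the match is exact, a paragraph opening with "female." or "Female," would not be identified under this sketch.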

Currently, the associations are all taXMLit paragraph element types. This is in the long-term expectation that the marked up files will be converted to full taXMLit. There is also an immediate benefit: taXMLit provides us with a pre-defined list of appropriate paragraph types, which is better than our having to draw up our own list, probably with less rigorous definitions than those developed and refined for taXMLit over the years.

Because each test looks only at a paragraph's first word, the scope of our mark up is always one paragraph. An alternative approach for some paragraph types, such as description, would be to carry the identification forward to subsequent paragraphs until a change in paragraph type is detected. Doing this reliably is more difficult than it appears owing to several problems, such as an intervening distribution map or page header that breaks the flow, so there would be many false attributions unless additional checks were implemented.

The keywords are very simple, and require a full match against the tested word including any punctuation. Thus 'Female', 'Female:' and 'Female.' are all valid keywords indicating a 'SameLanguageDescriptionParagraph'. An alternative would be to remove the punctuation before conducting the match test, but further analysis would be required of the source material to ensure this did not lead to false positives.

As can be seen in the example above, even though we are drawing all our exemplar texts from one journal, there are three variations of the same keyword.
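The punctuation-stripping alternative could look something like the following Python sketch. It is not what the scripts currently do; KEYWORDS and paragraph_type_ignoring_punctuation are hypothetical names used for illustration.

```python
import string

# One entry now stands in for 'Female', 'Female:' and 'Female.'.
KEYWORDS = {"Female": "SameLanguageDescriptionParagraph"}


def paragraph_type_ignoring_punctuation(text):
    """Match the first word after removing any trailing punctuation."""
    words = text.split()
    if not words:
        return None
    bare = words[0].rstrip(string.punctuation)
    return KEYWORDS.get(bare)
```

The trade-off is visible straight away: a paragraph beginning "Female, as noted earlier" would now also match, which is the kind of false positive that would need checking against the source material.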

Punctuation could provide a generalisable clue to keywords and phrases. Usually, if a paragraph opens with a keyword such as description, the word is followed by a period. Thus, a generic keyword test could be to look for a period after the first one, or possibly two, words in a paragraph. The other punctuation used in this context is a colon, though other cues are also used such as the word being upper case throughout.
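A rough Python sketch of such a generic test is given below. It is untested against the exemplar texts; the pattern LEADING_LABEL and the function leading_label are hypothetical, and the rule (one or two opening words ending in a period or colon, or an all upper case first word) is only one possible reading of the idea.

```python
import re

# One or two leading words, the last immediately followed by a period or colon.
LEADING_LABEL = re.compile(r"^\s*(\w+(?:\s+\w+)?)[.:](?:\s|$)")


def leading_label(text):
    """Return a candidate keyword or phrase from the start of a paragraph, or None."""
    match = LEADING_LABEL.match(text)
    if match:
        return match.group(1)
    words = text.split()
    # Fall back to the upper case cue; ignore single letters such as 'A'.
    if words and len(words[0]) > 1 and words[0].isupper():
        return words[0]
    return None
```

A test like this would suggest candidates rather than decide types, since initials at the start of a paragraph (for example "A. B. Smith") would also be picked up.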

We do not convert case before testing. We keep it as another distinguishing feature of the keyword. There is scope in a follow up project to evaluate the consequences of ignoring case when matching keywords.

Paragraph scope

It might seem obvious, but the scope of the mark up is a paragraph, ie our inserted comment refers to the immediately following paragraph only.

In taXMLit, a paragraph might be tagged as a <DiscussionParagraph>. Subsequent paragraphs might also be tagged as <DiscussionParagraph>. TaXMLit then encloses the paragraphs within a <DiscussionBody> element, which in turn is enclosed within a <Discussions> element. Thus, taXMLit also records the overall extent of the discussion, separating out a discussion title should one be present, and so on. It provides more meta-data than we do in our mark up.

DCL's TEI documents do not attempt to emulate taXMLit scope definitions. Each <div type="discussion"> encloses one paragraph only, though it would be valid XML if it were to enclose more.


The question of scope opens up the challenging task of identifying when scope ends. For example, a discussion paragraph might be followed by several other paragraphs that extend the discussion but are not identifiable as such, because they are simply a continuation of the free text and so lack any keywords. This could be addressed by maintaining a context derived from the original keyword and using textual analysis to check whether the subsequent paragraphs have the characteristics of that type of paragraph. That, however, is an opportunity for another project.
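To make the idea of maintaining a context concrete, here is a minimal Python sketch. Nothing like this exists in the current scripts. The function assign_with_context is hypothetical, and its continues argument is a placeholder for whatever textual analysis would decide that a paragraph still belongs to the current type.

```python
def assign_with_context(paragraphs, keywords, continues):
    """Yield (text, paragraph_type, inherited) for each paragraph.

    `continues(text, current_type)` is a hypothetical callback standing in
    for the textual analysis; it returns True if `text` looks like a
    continuation of a paragraph of `current_type`.
    """
    current = None
    for text in paragraphs:
        words = text.split()
        keyword_type = keywords.get(words[0]) if words else None
        if keyword_type is not None:
            # A keyword starts a new context.
            current = keyword_type
            yield text, current, False
        elif current is not None and continues(text, current):
            # No keyword, but the analysis says the context still applies.
            yield text, current, True
        else:
            # Anything else ends the context.
            current = None
            yield text, None, False
```

With a deliberately crude callback such as `lambda text, current: not text.isupper()`, an all upper case page header ends the context, and the paragraphs after it are left untyped even if they do continue the discussion. That is the broken-flow problem in miniature.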

Mark up options

We currently use the simplest option, which is to precede the paragraph with an XML comment containing the appropriate taXMLit descriptor, like this:

<!-- SameLanguageDescriptionParagraph follows -->

The benefits of this approach are that we have something in the file, it is easily adapted, the text is easily retrievable and the XML still validates with minimal work required from the script writer.
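The comment approach can be sketched in Python with the standard library's ElementTree, as below. This is not a port of apply_paragraph_types.php: the function annotate is hypothetical, and the sketch assumes un-namespaced <p> elements, which may not hold for the real TEI files.

```python
import xml.etree.ElementTree as ET


def annotate(root, keywords):
    """Insert '<!-- Type follows -->' before each <p> whose first word is a keyword."""
    # Snapshot the elements first, because the tree is modified while looping.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag != "p":
                continue
            words = "".join(child.itertext()).split()
            ptype = keywords.get(words[0]) if words else None
            if ptype is not None:
                index = list(parent).index(child)
                parent.insert(index, ET.Comment(f" {ptype} follows "))
    return root
```

For example, with the single keyword 'Female.' mapped to 'SameLanguageDescriptionParagraph', a paragraph opening "Female. Length 5 mm." gains a preceding comment and all other paragraphs are left unchanged.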

There are two other realistic approaches to adding this information that we could use:

1. Full taXMLit mark up - while it would be good to include taXMLit, and using namespaces would allow us to do this, the nature of taXMLit means that I cannot guarantee to build the correct path to the text element at this stage. One option is to include the final element only; for the example above, that means wrapping the existing TEI <p> tags with taXMLit <SameLanguageDescriptionParagraph> tags. I'm not sure what that gains us, though.

2. TEI mark up in the style of DCL - DCL wrap TEI <div> tags around the existing TEI <p> tags, eg <div type="description">. While I could do this, once again I'm not sure what it gains us, except possibly that we can exploit DCL's conversion routines at a later date. There's also the question as to why the TEI <div> tag is at the same level as the others. TEI offers a pre-defined hierarchy of <div> elements, from <div0> to <div7>, which would seem more appropriate to use.

Being pragmatic, rather than try to resolve all the XML issues within ABLE, I have kept to the simple comment approach as at least this shows the paragraph type can be identified and appropriate mark up inserted. This is the approach used in apply_paragraph_types.php.

For completeness, I have also prepared apply_paragraph_types_v1-1.php that applies DCL style <div> tags around the identified paragraphs. Note, because this script uses paragraph_keywords.php, the associated paragraph types are taXMLit's definitions not DCL's. For example, where DCL would describe a paragraph as type="description" the script will apply type="SameLanguageDescriptionParagraph". It doesn't seem worth generating another associative array for the DCL paragraph types just for this proof of concept script. If we want to exploit DCL's TEI to taXMLit conversion software at a later date we might change this; however, as a general principle applying taXMLit descriptions is preferable because it is more precise.
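The <div> wrapping can be illustrated with a similar Python sketch. Again, this is not the code in apply_paragraph_types_v1-1.php: wrap_in_div is a hypothetical function, and un-namespaced <p> elements are assumed.

```python
import xml.etree.ElementTree as ET


def wrap_in_div(root, keywords):
    """Wrap each identified <p> in <div type="...">, one paragraph per <div>."""
    # Snapshot the elements first, because the tree is modified while looping.
    for parent in list(root.iter()):
        for index, child in enumerate(list(parent)):
            if child.tag != "p":
                continue
            words = "".join(child.itertext()).split()
            ptype = keywords.get(words[0]) if words else None
            if ptype is None:
                continue
            div = ET.Element("div", {"type": ptype})
            # Keep any text that followed the paragraph after the new <div>.
            div.tail = child.tail
            child.tail = None
            parent[index] = div
            div.append(child)
    return root
```

Because the type attribute is taken directly from the keyword list, the output carries taXMLit names such as type="SameLanguageDescriptionParagraph" rather than DCL's type="description".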
