We seek to develop methods for extracting concepts from scanned literature, extending the work of Lu et al. (2008) and targeting the biodiversity domain. The project will use image analysis (feature extraction), document layout analysis (reverse-engineering a DTD) and Natural Language Processing (NLP).

Automatic Biodiversity Literature Enhancement


We plan to extend, and establish the generality of, the mark-up and metadata extraction from scanned literature developed by Lu et al. (2008), targeting the biodiversity domain. Metadata will focus on proper nouns (taxon, people and place names) and dates. We will enhance the searchability of those terms by combining associative techniques from Natural Language Processing (NLP) with knowledge of likely Optical Character Recognition (OCR) errors, for example by allowing a search for Pica to recover Pioa, provided the context of Pioa is a bird, ideally a magpie. The project will work with approximately 10 scanned volumes (approx. 3000 pages), which will be rendered into several alternative XML structures. The project deliverables will be made available on the project website and through the Biodiversity Heritage Library (BHL) as exemplar data sets, which we hope will stimulate further research into the automatic extraction of meaning from scanned literature. If fully successful, the software developed here will be applied to the BHL collection of over 6 million pages. BHL scanners produce structural XML output, and a small part of the project will examine the feasibility of developing software to create compatible files starting from plain image scans.
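The Pica/Pioa example can be illustrated with a minimal sketch. It is not the project's software: it assumes a small hand-written table of character pairs that OCR tends to confuse (OCR_CONFUSIONS) and a hand-written list of context words for each search term (CONTEXT_TERMS), both hypothetical, and it stands in a weighted edit distance plus a simple word window for the associative NLP techniques the project has in mind.

```python
import re

# Hypothetical table of character pairs that OCR commonly confuses.
# A real system would estimate these from scanned data.
OCR_CONFUSIONS = {
    frozenset("co"), frozenset("ce"), frozenset("il"),
    frozenset("l1"), frozenset("o0"), frozenset("uv"),
}

# Hypothetical context words that make a near-match plausible for a query.
CONTEXT_TERMS = {
    "pica": {"bird", "magpie", "corvid", "nest", "plumage"},
}


def ocr_distance(a, b, confusion_cost=0.2):
    """Edit distance in which a known OCR confusion is a cheap substitution."""
    a, b = a.lower(), b.lower()
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        curr = [float(i)]
        for j, cb in enumerate(b, 1):
            if ca == cb:
                sub = 0.0
            elif frozenset((ca, cb)) in OCR_CONFUSIONS:
                sub = confusion_cost
            else:
                sub = 1.0
            curr.append(min(prev[j] + 1.0,        # delete a character of a
                            curr[j - 1] + 1.0,    # insert a character of b
                            prev[j - 1] + sub))   # match or substitute
        prev = curr
    return prev[-1]


def search(query, text, window=5, max_distance=0.5):
    """Return (index, token) pairs matching the query.

    Exact matches are always returned. Near matches (likely OCR errors)
    are returned only when a context word occurs within `window` tokens.
    """
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    cues = CONTEXT_TERMS.get(query.lower(), set())
    hits = []
    for idx, token in enumerate(tokens):
        distance = ocr_distance(token, query)
        if distance == 0:
            hits.append((idx, token))
        elif distance <= max_distance:
            nearby = tokens[max(0, idx - window): idx + window + 1]
            if cues & {t.lower() for t in nearby}:
                hits.append((idx, token))
    return hits
```

With these assumptions, search("Pica", "A magpie, Pioa, was seen near the nest.") recovers the token Pioa because "magpie" and "nest" occur nearby, whereas the same misspelling in a sentence about typography is ignored. The threshold and window size are arbitrary choices that would need tuning against real scans.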

Alternative website

The project also maintains an OU-hosted website.
