Improving Access to Text

Optical Character Recognition (OCR) for the mass digitisation of textual information: Improving Access to Text

A JISC-sponsored event.
UKOLN, University of Bath
24th September 2009
Website

Summary

This workshop will provide an opportunity for participants to learn about the current state-of-the-art in the digitisation of historical texts, to look at improvements in digitisation techniques currently being explored in research projects such as the EU-funded IMPACT project, and to explore how Optical Character Recognition (OCR) is used in practical digitisation contexts and workflows.

The workshop met its objectives, but it was very much an introductory event. There was little co-ordination among the presenters, and about five of the presentations included their own overview of the digitisation/OCR process. The slides will be available on the workshop's website soon.

Welcome

Aly Conteh, British Library

A good introduction, covering the problem with historical texts that everyone at the workshop was concerned with: ‘even ten years ago we could OCR modern text and get about 95% accuracy, but as we go for older texts accuracy goes down’. This presentation had the best process flowcharts; I have drawn a compound version of them and attached the file, Digitisation_Workflow, to this page.

Digitisation overview: past, present & future

Neil Fitzgerald, British Library

Mainly an overview of the challenges, but a better discussion of this topic was given in the later presentation Enhancing images for optimal text recognition. Of interest to us:

  • many digitisation projects are actually about creating better metadata and digitisation is only a means to that end
  • many digitisation projects do not understand their intended audience and consequently users demand changes during and after the project
  • collaborative correction can work; the examples given were our favourite Australian newspapers project (which has apparently now made its collaborative correction software freely available), Distributed Proofreaders, as used by Project Gutenberg, and Google's use, and acquisition, of reCAPTCHA.

Introduction to Optical Character Recognition (OCR)

Günter Mühlberger, Innsbruck University Library

Another solid introduction but with much repetition of other presentations. Interesting points to consider:

  • Required accuracy varies with the intended users: modern e-books might tolerate 1 character mistake in 1,000, while traditional publishers aimed for 1 in 200,000 in their printed texts; researchers, however, want (need) 100% accuracy.
  • What are the units of measure for accuracy (characters, words, meaningful words, etc.)? A minimal sketch of two such measures follows this list.
  • Does measurement include accuracy of layout?
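
To make the units-of-measure question concrete, here is a minimal Python sketch (not from the workshop) of two common measures, character accuracy and word accuracy, both defined from edit distance against a ground-truth transcription. The sample strings are invented, and note that neither measure says anything about layout.

    # Minimal sketch: character and word accuracy for one OCR'd line,
    # each defined as 1 - (edit distance / ground-truth length).
    def levenshtein(a, b):
        """Edit distance between two sequences (strings or word lists)."""
        prev = list(range(len(b) + 1))
        for i, x in enumerate(a, 1):
            curr = [i]
            for j, y in enumerate(b, 1):
                curr.append(min(prev[j] + 1,              # deletion
                                curr[j - 1] + 1,          # insertion
                                prev[j - 1] + (x != y)))  # substitution
            prev = curr
        return prev[-1]

    truth = "The quick brown fox jumps over the lazy dog."
    ocr = "Tne quick brovvn fox jumps ovcr the lazy dog."

    char_acc = 1 - levenshtein(truth, ocr) / len(truth)
    word_acc = 1 - levenshtein(truth.split(), ocr.split()) / len(truth.split())
    print("character accuracy: %.3f" % char_acc)  # high: only 4 bad characters
    print("word accuracy:      %.3f" % word_acc)  # much lower: 3 bad words of 9

Even on this toy example the two measures diverge sharply, which is why published accuracy figures are hard to compare without knowing the unit of measure.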

Enhancing images for optimal text recognition

Apostolos Antonacopoulos, University of Salford

The presentation had nothing to do with its stated title, but it contained the best round-up of OCR problems I have come across.

Improving and adding value to OCR results – the IMPACT project

Michael Day, UKOLN, University of Bath

Originally sub-titled ‘some IMPACT tools’; however, as none have been developed yet, this was another presentation that did not match its title. What we got instead was a 15-minute introduction to the IMPACT project. This is an EU project, with the European Commission paying €11.5M of the total €15.5M cost. It involves 15 partners delivering 22 workpackages, grouped into four sub-projects.
Problems with OCR accuracy are addressed by presenting both image and text to the researcher, who is expected to sort out ambiguous search results for themselves. The IMPACT project team are keen on collaborative correction, but plan to minimise the need for it by connecting to existing authoritative lists of names, etc., to produce clean data in the first place.
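
No IMPACT tools were demonstrated, but the authority-list idea is easy to illustrate. The following Python sketch maps noisy OCR tokens onto an authoritative list using difflib; the list and the 0.8 similarity cutoff are invented for the example.

    # Rough illustration of cleaning OCR output against an authoritative
    # list: each OCR token is replaced by the closest authority entry,
    # if one is similar enough. The list and cutoff are invented examples.
    from difflib import get_close_matches

    AUTHORITY = ["Birmingham", "Manchester", "Bath", "Salford", "Innsbruck"]

    def normalise(token):
        """Map an OCR token to its closest authority entry, if any."""
        match = get_close_matches(token, AUTHORITY, n=1, cutoff=0.8)
        return match[0] if match else token

    print(normalise("Birmingharn"))  # 'rn' misread for 'm' -> Birmingham
    print(normalise("Sa1ford"))      # '1' misread for 'l' -> Salford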

Case studies 1: British Library/JISC Newspaper project

Aly Conteh, British Library

A good presentation, but see this D-Lib article (presumably the Tanner, Muñoz and Ros paper listed below) for a more detailed write-up of the material. As an aside, the BL project team found that the tools they used required higher OCR accuracy than human readers did. Oh, and they avoided copyright issues altogether by not including any copyrighted material in the project.

Case studies 2: Target Language Resources for Digitization of Historical Collections

Christoph Ringlstetter, Center for Information and Language Processing, University of Munich

The most interesting element of this presentation was the difficulty of recognising word boundaries in German, given the language's propensity for compound words. Of more relevance to us was the need for the digitisation team to build historical lexica to cope with changes in language and spelling over time.
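
The talk did not show the Munich team's actual lexicon-building code, but the basic idea can be sketched in a few lines of Python: expand modern headwords with known historical spelling patterns, then invert the result into a lookup table. The word list and patterns below are toy examples only.

    # Illustrative sketch of a historical lexicon: modern headwords are
    # expanded with known historical spelling patterns (e.g. Tal/Thal,
    # sein/seyn), then inverted into a table mapping historical forms
    # back to modern ones. Words and patterns are invented examples.
    PATTERNS = [("t", "th"), ("ei", "ey")]
    MODERN = ["tal", "teil", "sein"]

    def variants(word):
        """Generate plausible historical spellings of a modern word."""
        forms = {word}
        for new, old in PATTERNS:
            forms |= {f.replace(new, old) for f in forms}
        return forms

    # Historical form -> modern form; over-generated variants are
    # harmless because the table is only ever used for lookup.
    lexicon = {v: w for w in MODERN for v in variants(w)}

    print(lexicon.get("thal"))  # -> tal
    print(lexicon.get("seyn"))  # -> sein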

Case studies 3: Using digitised text collections in research and learning

Emma Huber, UKOLN, University of Bath

A poorly presented set of slides describing some sociological research that found students and researchers considered using digital resources to be cheating. I and many others in the audience, especially anyone involved in distance-based work or learning, were rather surprised at the presenter's conclusions. On querying the result we learnt that the sample consisted of only 15 people, mainly archival researchers based at Oxford: arguably not representative of the majority of users of digital resources.

Panel discussion: digitisation challenges, with an introduction to the developing Centre of Competence

A bit disappointing, as the session consisted mainly of people asking for help. I was even among those offering solutions, or at least avenues to explore for solutions.


Two useful papers:
  • Rose Holley. "How Good Can It Get? Analysing and Improving OCR Accuracy in Large Scale Historic Newspaper Digitisation Programs." D-Lib Magazine, March/April 2009, vol. 15, no. 3/4. <doi:10.1045/march2009-holley>
  • Simon Tanner, Trevor Muñoz and Pich Hemy Ros. "Measuring Mass Text Digitization Quality and Usefulness: Lessons Learned from Assessing the OCR Accuracy of the British Library's 19th Century Online Newspaper Archive." D-Lib Magazine, July/August 2009, vol. 15, no. 7/8. <doi:10.1045/july2009-munoz>
Attachments:
  • Digitisation_Workflow.png (34.09 KB)
  • Digitisation_Workflow.odg (10.5 KB)

Comments

Citizen science - health warning

Neil Fitzgerald highlighted an issue when relying on volunteers to mark up digitised texts: you are at the mercy of the volunteers' interests. He gave an example from the Australian newspapers digitisation project, where there is a disproportionate number of tags related to railways. This does not indicate that the papers were obsessed with writing articles about railways; rather, it is the result of a group of railway enthusiasts who diligently processed the digitised articles looking for references to their favourite subject.
