Project Meetings

Project Meeting 3

12th March 2009, Natural History Museum

See file ABLE Minutes 12-03-2009.doc

Comments on the meeting from Anton

  1. You asked "Do we worry about formatting?" Although I know this wasn't the angle you were exploring, from an OCR point of view we do need to: bold and italic text essentially need their own font learning, separate from normal text.
  2. Image pre-processing: there are standard things to do. I am not sure we will have much novel to say in this area, although such processing might be fine-tuned to particular sets of scans with particular artefacts.
  3. On ABBYY OCR errors, I think I'd like to have a go, perhaps with Dauvit, at font learning using Tesseract. It will be interesting at least to see if the errors are similar. If not, that could be a win also: two guesses are probably better than one if combined in the right way. Related to this, and to other ideas we have for using approaches such as dictionaries and character frequency, we need a way of voting / combining results. This seems to be the area of Multiple Classifier Systems, and there are standard approaches to combining votes. Reading to be done...
    I do have some other ideas. One is performing recognition at the level of two or three adjacent characters. Only certain character combinations are likely to occur next to each other, and, assuming independent likelihood of malformation, there should be that much more for the OCR engine to work with. On the other hand, I don't know how the engine would react to being fed 'characters' that are actually pairs of characters separated by whitespace. Also, the number of classes to train would be of the order of the square of the number of characters, but I reckon it could work. (In fact, two adjacent characters may be more likely to both be malformed, but ... it's worth a thought.) This is kind of like combining votes from adjacent characters. (It may be a rather heavy way of doing what might be done using knowledge of character and digraph frequency.)
  4. Spell-checker-type analysis. I'd like to suggest Soundex-type matching here, adapted to get away from the first-letter problem. Instead of Soundex (which fixes the first letter), form the 'index' term (not sure what the correct word is) from equivalence classes of confusable characters, e.g.
    {o,c} are in a confusion class, call this letter 1 (i.e., the OCR engine is liable to confuse them)
    {s,f} are in a confusion class, call this letter 2
    and so on...
    - Knowing these classes comes from analysis of the hand-marked versus OCR'd text
    Now e.g., sock begins 211 (first three letters)
    Do the same for dictionary words to form a new dictionary of fuzzed up dictionary words.  This is an M257 TMA problem :-)
    Now match against the fuzzed up dictionary words (perhaps only if the word was not found in the unfuzzed dictionary).  Howzat?
    [And then of course there is also Levenshtein-style / edit-distance matching if exact matching doesn't work]
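
A rough sketch of the confusion-class matching from point 4, in Python. The two classes are just the ones from the example above; real classes would have to come from the hand-marked versus OCR'd comparison. All the names here (CONFUSION_CLASSES, fuzz_key, build_fuzzed_dictionary, candidates) are made up for illustration, and the key is kept as a tuple rather than a string so that class numbers cannot be mistaken for digits in the text.

```python
# Hypothetical confusion classes, taken from the example in the notes.
# Real classes would be derived from hand-marked versus OCR'd text.
CONFUSION_CLASSES = [{"o", "c"}, {"s", "f"}]


def build_class_map(classes):
    """Map each confusable character to its class number (1, 2, ...)."""
    mapping = {}
    for number, members in enumerate(classes, start=1):
        for ch in members:
            mapping[ch] = number
    return mapping


CLASS_MAP = build_class_map(CONFUSION_CLASSES)


def fuzz_key(word):
    """Replace every confusable character by its class number.

    Characters outside any class are kept as they are, so
    'sock' becomes (2, 1, 1, 'k').
    """
    return tuple(CLASS_MAP.get(ch, ch) for ch in word.lower())


def build_fuzzed_dictionary(words):
    """Index dictionary words by fuzz key; several words may share a key."""
    fuzzed = {}
    for word in words:
        fuzzed.setdefault(fuzz_key(word), set()).add(word.lower())
    return fuzzed


def candidates(ocr_word, dictionary, fuzzed):
    """Try the plain dictionary first, then fall back to the fuzzed one."""
    word = ocr_word.lower()
    if word in dictionary:
        return {word}
    return fuzzed.get(fuzz_key(word), set())


if __name__ == "__main__":
    dictionary = {"sock", "dock", "fish"}
    fuzzed = build_fuzzed_dictionary(dictionary)
    print(fuzz_key("sock"))
    print(candidates("fock", dictionary, fuzzed))
```

Unlike Soundex, nothing here pins the first letter, so a misread initial character (e.g. 'fock' for 'sock') still lands on the right dictionary entry. A key that matches several dictionary words returns all of them, and choosing between those is left open, as is the edit-distance fallback.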
ABLE Minutes 12-03-2009.doc (50.5 KB)