Comparison of text content

For each Bulletin we have several files that contain text: ABBYY XML, DjVu text, DjVu XML and PDF. We want to know whether the text in these files is the same, using the Bulletin of the British Museum (Natural History), Entomology series, Volume 51 as our exemplar.
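
As an illustration of the kind of comparison involved, here is a minimal Python sketch. It is not one of the project's own scripts, and it assumes the text has already been extracted from each source into a plain string. The function names and the two sample strings are hypothetical. It reduces each text to lower-case word tokens, so that differences in whitespace and punctuation are ignored, and then uses the standard library's difflib to report a similarity ratio and the token runs that differ.

```python
import difflib
import re


def normalise(text):
    """Lower-case the text and return its word tokens, ignoring whitespace and punctuation."""
    return re.findall(r"\w+", text.lower())


def compare(text_a, text_b):
    """Return (similarity ratio, list of differing token runs) for two texts."""
    a, b = normalise(text_a), normalise(text_b)
    # autojunk=False stops difflib from discarding frequent tokens in long texts.
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    diffs = [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
    return matcher.ratio(), diffs


# Hypothetical sample strings standing in for text taken from two of the sources.
djvu_text = "The specimens were described in detail."
abbyy_text = "The specimens were descri bed in detail"

ratio, diffs = compare(djvu_text, abbyy_text)
print(f"similarity: {ratio:.3f}")
for tag, left, right in diffs:
    print(tag, left, right)
```

A ratio of 1.0 would mean the two token sequences are identical. Comparing at word level rather than character level is a design choice made here for brevity; it shows up split or merged words clearly but hides differences that normalisation removes, such as punctuation.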

The attached zip file contains notes, PHP scripts and numerous output files produced during the investigation into the differences among these sources. Because files with the extension .zip cannot be attached to this page, the archive has been given the suffix _.txt.

Attachment: text_output.zip_.txt (2.07 MB)

Comments

Needleman-Wunsch applies here too

If picking up this work in a later project, Alistair's work on Needleman-Wunsch alignments could be usefully applied to these texts too.

Dauvit
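
For readers unfamiliar with the technique named in the comment above: Needleman-Wunsch is a dynamic-programming algorithm for global alignment of two sequences. The following is a minimal Python sketch applied to word tokens. It is not Alistair's implementation, and the scoring values (match +1, mismatch -1, gap -1) and the sample sentences are assumptions chosen only for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align token lists a and b; return (score, list of aligned pairs).

    A pair (x, None) or (None, y) marks a gap in one of the sequences.
    The scoring parameters are illustrative assumptions.
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right cell to recover one optimal alignment.
    aligned = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
            match if a[i - 1] == b[j - 1] else mismatch
        ):
            aligned.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned.append((a[i - 1], None))  # token present only in a
            i -= 1
        else:
            aligned.append((None, b[j - 1]))  # token present only in b
            j -= 1
    aligned.reverse()
    return score[n][m], aligned


# Hypothetical token sequences standing in for two versions of the same passage.
total, pairs = needleman_wunsch("the quick brown fox".split(), "the brown fox".split())
for left, right in pairs:
    print(left, right)
```

The table has one cell per pair of positions, so time and memory grow with the product of the two sequence lengths. For a whole volume it would probably be necessary to align page by page or paragraph by paragraph rather than all at once.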
