Each Bulletin comes with several files that contain its text: ABBYY XML, DjVu text, DjVu XML and PDF. We want to know whether the text is the same across these files, using Bulletin of the British Museum (Natural History) Entomology series, Volume 51 as our exemplar.
The attached zip file contains notes, PHP scripts and numerous output files produced during the investigation into the differences among these sources. Because files with the extension .zip cannot be attached to this page, the archive has been given the suffix _.txt.
Attachment | Size |
---|---|
text_output.zip_.txt | 2.07 MB |
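
As an illustration of the kind of comparison described above, here is a minimal Python sketch that checks whether two extracted texts agree once whitespace differences are ignored. It is not one of the PHP scripts in the attachment; the function names `normalise` and `compare_texts`, the word-level comparison, and the sample strings are all assumptions made for this example.

```python
import difflib
import re


def normalise(text):
    """Collapse runs of whitespace so that layout differences are ignored."""
    return re.sub(r"\s+", " ", text).strip()


def compare_texts(a, b):
    """Return (similarity ratio, list of non-matching word spans)."""
    words_a = normalise(a).split()
    words_b = normalise(b).split()
    matcher = difflib.SequenceMatcher(None, words_a, words_b, autojunk=False)
    diffs = [
        (tag, " ".join(words_a[i1:i2]), " ".join(words_b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
    return matcher.ratio(), diffs


# Hypothetical strings standing in for text pulled from two of the sources.
djvu_text = "Bulletin of the  British Museum\n(Natural History)"
abbyy_text = "Bulletin of the British Museum (Natural History)"

ratio, diffs = compare_texts(djvu_text, abbyy_text)
print(ratio, diffs)
```

Comparing at word level keeps the list of differences readable; a character-level comparison would be a reasonable alternative if the sources mostly differ in individual misrecognised characters.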
Comments
Needleman-Wunsch applies here too
If picking up this work in a later project, Alistair's work on Needleman-Wunsch alignments could be usefully applied to these texts too.
Dauvit
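
For anyone following up on this suggestion, the sketch below shows a generic textbook Needleman-Wunsch global alignment in Python, applied to lists of words. It is not Alistair's implementation, which is not reproduced on this page; the function name `needleman_wunsch`, the scoring values (match +1, mismatch -1, gap -1) and the choice of word tokens are assumptions made for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align sequences a and b.

    Returns (score, alignment), where alignment is a list of pairs and
    None marks a gap in one of the sequences.
    """
    n, m = len(a), len(b)

    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Score for pairing a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # pair the two items
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, b[j - 1]))
            j -= 1
    pairs.reverse()
    return score[n][m], pairs


# Hypothetical word lists standing in for two versions of the same passage.
total, alignment = needleman_wunsch(["the", "quick", "fox"], ["the", "fox"])
print(total, alignment)
```

The table of scores grows with the product of the two sequence lengths, so for a whole volume it would probably be necessary to align page by page or paragraph by paragraph rather than in one pass.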