Identifying Latin text

The taxonomic convention of giving formal names in Latin provides a useful cue when analysing digitised texts.

Anton is working on a suite of Java programs, as a NetBeans project, to exploit this cue.

The basic idea is to use MakeMapsMain to gather statistics about ngrams in a set of texts. Statistics are gathered for Latin from the taxonomic database file taxons.txt (method makeLatinMaps), and for English from some fairly arbitrarily chosen English texts, on the assumption that the words in them are non-Latin (method makeEnglishMaps).
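The ngram-gathering step can be sketched roughly as follows. This is an illustrative sketch only, not the actual MakeMapsMain code; the class and method names here are assumptions. The idea is to slide a window of length n over each word and tally how often each ngram occurs.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of ngram counting (names are illustrative,
// not the real MakeMapsMain API).
public class NgramCounter {

    // Slide a window of length n over each word and count each ngram.
    public static Map<String, Integer> countNgrams(Iterable<String> words, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : words) {
            for (int i = 0; i + n <= word.length(); i++) {
                String gram = word.substring(i, i + n);
                counts.merge(gram, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // "taxon" yields the bigrams ta, ax, xo, on
        System.out.println(countNgrams(List.of("taxon"), 2));
    }
}
```

Running the counter over the taxonomic word list would give one map per ngram length for Latin, and the same run over the English texts gives the non-Latin maps.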

The ngram-gathering routines rely on BookMaker to create a list of words from the text files: delimiter characters are used to extract individual words, and a junk list is used to further clean up what is left.
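The word-listing step might look something like the sketch below. The delimiter set and the method names are assumptions for illustration; BookMaker's actual implementation may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative sketch of BookMaker-style word extraction: split on a
// set of delimiter characters, then drop tokens found on a junk list.
public class WordLister {

    // Assumed delimiter set: whitespace plus common punctuation.
    private static final String DELIMITERS = "[\\s.,;:()\\[\\]\"'!?]+";

    public static List<String> listWords(String text, Set<String> junk) {
        List<String> words = new ArrayList<>();
        for (String token : text.split(DELIMITERS)) {
            if (!token.isEmpty() && !junk.contains(token.toLowerCase())) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(listWords("Quercus robur, the oak.", Set.of("the")));
    }
}
```

Splitting on delimiters handles most of the tokenisation; the junk list then removes known noise words that survive the split.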

ScoringMain can then be used to classify the words in one of the test files according to a chosen metric.

Each word is scored for the likelihood that it is Latin or non-Latin, based on the probability that the ngrams in the word appear in Latin or in non-Latin text. The higher of the two probabilities determines the word's language classification. A confidence score is also available, reporting how far apart the Latin and non-Latin scores were. Because all scoring is done at the word level, any individual word can be scored on its own, but the ScoringMain class evaluates how well the metrics perform on the test texts, again assuming each text is entirely Latin or entirely non-Latin.
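One common way to implement this kind of scoring is to sum log-probabilities of a word's ngrams under each language model, classify by the higher total, and use the gap between the two totals as a confidence value. The sketch below follows that pattern; it is a hedged illustration under assumed names and smoothing, not the metric ScoringMain actually uses.

```java
import java.util.Map;

// Illustrative word scorer: naive log-probability sum per language,
// classification by the higher score, confidence as the score gap.
// Names and the smoothing floor are assumptions, not the real API.
public class WordScorer {

    // Score a word under one language's ngram probability map.
    static double score(String word, Map<String, Double> ngramProb, int n) {
        double logProb = 0.0;
        for (int i = 0; i + n <= word.length(); i++) {
            String gram = word.substring(i, i + n);
            // Smooth unseen ngrams with a small floor probability (assumption).
            logProb += Math.log(ngramProb.getOrDefault(gram, 1e-6));
        }
        return logProb;
    }

    /** Returns "latin" or "non-latin", whichever model scores higher. */
    public static String classify(String word, Map<String, Double> latin,
                                  Map<String, Double> english, int n) {
        return score(word, latin, n) >= score(word, english, n)
                ? "latin" : "non-latin";
    }

    /** Confidence: absolute gap between the two language scores. */
    public static double confidence(String word, Map<String, Double> latin,
                                    Map<String, Double> english, int n) {
        return Math.abs(score(word, latin, n) - score(word, english, n));
    }

    public static void main(String[] args) {
        // Toy bigram models for demonstration only.
        Map<String, Double> latin = Map.of("us", 0.5, "qu", 0.4);
        Map<String, Double> english = Map.of("th", 0.5, "he", 0.4);
        System.out.println(classify("quus", latin, english, 2));
        System.out.println(classify("the", latin, english, 2));
    }
}
```

A small confidence value flags words whose Latin and non-Latin scores are close, which is exactly where manual review is most useful.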

Preliminary results are available, with the transition-probability metric scoring particularly well.

Scratchpads developed and conceived by: Vince Smith, Simon Rycroft, Dave Roberts, Ben Scott...