One of the questions we posed ourselves was how far we could get by looking for capital letters.
Capital letters are of interest because they are a typographical clue not only to proper nouns but also to genus names.
The attached script, i_c.php, works on an input text file that has been pre-processed to remove all common words. It looks for words that have an initial capital letter and, for each one, checks whether the word is recognised in the Global Names Index (GNI). Note that a word may have more than one reference in GNI; for example, Scrophulariaceae is recognised as both Scrophulariaceae and Scrophulariaceæ.
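The approach described above can be sketched roughly as follows. This is an illustrative Python sketch, not the attached i_c.php: the function names, the in-memory `SAMPLE_INDEX`, and the choice to report each distinct word only once are all assumptions made for the example. In particular, `lookup_in_sample_index` is a hypothetical stand-in for a real query against GNI, whose interface is not reproduced here.

```python
import re
from typing import Callable, Dict, List

# Hypothetical stand-in for GNI: maps a word to the name strings the index
# holds for it. A real run would query the Global Names Index instead.
SAMPLE_INDEX: Dict[str, List[str]] = {
    "Scrophulariaceae": ["Scrophulariaceae", "Scrophulariaceæ"],
    "Musca": ["Musca"],
}


def lookup_in_sample_index(word: str) -> List[str]:
    """Return the references held for `word`, or an empty list if unknown."""
    return SAMPLE_INDEX.get(word, [])


def capitalised_words(text: str) -> List[str]:
    """Return distinct words with an initial capital, in order of first appearance."""
    seen: Dict[str, None] = {}
    # [^\W\d_]+ matches runs of letters only, including non-ASCII letters.
    for word in re.findall(r"[^\W\d_]+", text):
        if word[0].isupper() and word not in seen:
            seen[word] = None
    return list(seen)


def find_candidate_names(
    text: str, lookup: Callable[[str], List[str]]
) -> Dict[str, List[str]]:
    """Map each capitalised word that the lookup recognises to its references."""
    results: Dict[str, List[str]] = {}
    for word in capitalised_words(text):
        references = lookup(word)
        if references:  # a word may have one reference or several
            results[word] = references
    return results


if __name__ == "__main__":
    sample = "Musca domestica London Scrophulariaceae larvae Musca"
    for name, refs in find_candidate_names(sample, lookup_in_sample_index).items():
        print(name, refs)
```

Passing the lookup in as a function keeps the capital-letter filter separate from the name check, so the stand-in index could be swapped for a real GNI query without touching the rest of the sketch.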
The script is commented. If time allows in the project, we will progress beyond this basic work. For Bulletin 51 (Entomology), there are 4,815 words in the processed input file, of which 2,509 (about 52%) have an initial capital letter, and among those there are 687 GNI references. The output from i_c.php for this volume is also attached to this page.
Attachment | Size |
---|---|
bulletinofbritis51entolond_i-c.txt | 138.17 KB |
i_c.php_.txt | 4.19 KB |