Placeholder page for notes on various spell checkers.
The obvious(?) approach to looking for taxons in a document is to scan the document against a list of known taxons. Easy!
However…
So, instead, investigate the use of a negative list of words: check the document against a dictionary to remove common words, leaving only the unusual ones, which should hopefully include taxons.
This is not a perfect concept. For example, Formica minor would not be identified because minor is a common word and so would be removed by spell checking the document against a common-word dictionary.
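The negative-list idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: COMMON_WORDS is a tiny hypothetical stand-in for a real common-word dictionary, and unusual_words is a name invented here, not part of any spell checker.

```python
import re

# Hypothetical stand-in for a real common-word dictionary (assumption).
COMMON_WORDS = {
    "the", "ant", "was", "found", "in", "a", "garden",
    "minor", "and", "is", "common",
}


def unusual_words(text, common_words=COMMON_WORDS):
    """Return words not in common_words, in order of first appearance."""
    seen = set()
    result = []
    # [^\W\d_]+ matches runs of letters, including non-ASCII letters.
    for word in re.findall(r"[^\W\d_]+", text):
        key = word.lower()
        if key not in common_words and key not in seen:
            seen.add(key)
            result.append(word)
    return result


sample = "The ant Formica minor was found in a garden and Lasius niger is common"
print(unusual_words(sample))  # ['Formica', 'Lasius', 'niger']
```

In this sketch Lasius niger survives intact, but minor is filtered out as a common word, so Formica minor is only partially recovered. This is the limitation described above.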
Comments
Hunspell
Hunspell (http://hunspell.sourceforge.net/) is the default spell checker for many open source projects, including the latest versions of OpenOffice, Firefox and Thunderbird.
It is based on MySpell and can use MySpell dictionaries. MySpell was an earlier project to integrate various open source spell checkers for OpenOffice.
Programs can access Hunspell through the Enchant interface, developed as part of AbiWord. This means it is also accessible from PHP, Java, Python, Ruby, etc. All inputs and outputs are in UTF-8 encoding. (MySpell uses 8-bit character encodings.)
Specialist dictionaries
Published Dictionaries
While there are many spell checkers on the market, there are few with the specialist dictionaries we require.
Spellex
Spellex market several specialist dictionaries covering medical, legal and other fields. There are two of potential relevance to us:
These two dictionaries are not suitable as their coverage is not aligned with our requirements.
WordNet
WordNet is not intended as a dictionary, in the sense of a list of words and their meanings. It goes beyond that scope to group nouns, verbs, adjectives and adverbs into sets of cognitive synonyms (synsets). Each synset expresses a distinct concept. The synsets are interlinked and can be navigated with the supplied browser.
This scope means that WordNet contains "open-class words" only: nouns, verbs, adjectives, and adverbs. So, many common words such as pronouns and conjunctions are missing from WordNet's word list.
WordNet can be used as a standalone utility, as a library called from other programs, or online.
More detail at http://wordnet.princeton.edu/.
Ispell
This comment is only for background information:
Aspell
A free software spell checker, it is the standard spell checker for the GNU software system. It is also available for other platforms including Windows. The current GNU Aspell version is 0.60.6, released April, 2008. The current Windows version of Aspell is 0.50.3, released December 2002.
Aspell is the successor to Ispell, designed to spell check UTF-8 documents without the need for a special dictionary.
It is available as a standalone utility (as used by David in his initial investigations into this approach of a negative word list) and as a library called by other programs, for example, Notepad++.
More information at http://aspell.net/.
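As a rough sketch of using the standalone utility for the negative word list, the Python below pipes text through the aspell list command, which prints the words Aspell does not recognise. The function name aspell_unknown_words and the default language "en" are assumptions made for illustration. The function returns None if Aspell or the chosen dictionary is not installed.

```python
import shutil
import subprocess


def aspell_unknown_words(text, lang="en"):
    """Return the sorted, de-duplicated words Aspell does not recognise.

    Returns None when the aspell executable is missing or the call fails
    (for example, because no dictionary is installed for lang).
    """
    if shutil.which("aspell") is None:
        return None
    result = subprocess.run(
        ["aspell", "list", f"--lang={lang}", "--encoding=utf-8"],
        input=text,
        capture_output=True,
        encoding="utf-8",
    )
    if result.returncode != 0:
        return None
    # "aspell list" writes one unrecognised word per line.
    return sorted(set(result.stdout.split()))


candidates = aspell_unknown_words("The ant Lasius niger was found in a garden")
print(candidates)
```

Which words come back depends on the installed dictionary, so the output is best treated as a list of candidates to review rather than as confirmed taxons.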