Spell checkers

Placeholder page for notes on various spell checkers.

Why spell checkers?

The obvious(?) approach to finding taxons in a document is to scan the document against a list of known taxons. Easy!

However…

  1. It is difficult to build a fully comprehensive positive list of valid taxons. Think of Chris’s comments on weevils: 70,000 or so identified of the estimated 200,000+ varieties. So we would always be chasing an ever-changing list of valid taxons.
  2. It is difficult to get all taxonomists to share a common dictionary. See the current taxacom debate at http://markmail.org/thread/5a4dbfngsrkivqs2.

So, instead, investigate the use of a negative list of words: check the document against a dictionary of common words and remove every word found there, leaving only the unusual words, which should hopefully include the taxons.

This is not a perfect concept. For example, Formica minor would not be identified in full, because minor is a common word and so would be removed by spell checking the document against a common word dictionary.
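
The negative-list idea, and its weakness, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: COMMON_WORDS is a tiny hypothetical stand-in for a real common-word dictionary, and candidate_terms is a made-up helper name, not part of any spell checker.

```python
import re

# Hypothetical stand-in for a real common-word dictionary; a real run would
# load the word list of a spell checker instead of this hand-picked set.
COMMON_WORDS = {"the", "ant", "was", "found", "near", "a", "nest", "of", "minor", "and"}


def candidate_terms(text, common_words=COMMON_WORDS):
    """Return the words in text that are NOT in the common-word list.

    Words are compared case-insensitively and reported once, in the order
    in which they first appear.
    """
    seen = set()
    candidates = []
    for token in re.findall(r"[A-Za-z]+", text):
        key = token.lower()
        if key in common_words or key in seen:
            continue
        seen.add(key)
        candidates.append(token)
    return candidates


sample = "The ant Formica minor was found near a nest of Lasius niger."
print(candidate_terms(sample))  # ['Formica', 'Lasius', 'niger']
```

Lasius niger survives as a whole name, but Formica minor loses its second word because minor is in the common-word list, which is exactly the limitation described above.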

Comments

Hunspell

Hunspell (→ http://hunspell.sourceforge.net/) is the default spell checker for many open source projects, including the latest versions of OpenOffice, Firefox and Thunderbird.
It is based on MySpell and can use MySpell dictionaries. MySpell was an earlier project to integrate various open source spell checkers for OpenOffice.
Programs can access Hunspell through, for example, the Enchant interface developed as part of Abiword. This means it is also accessible in PHP, Java, Python, Ruby, etc. All inputs and outputs are in UTF-8 encoding. (MySpell uses 8-bit ASCII.)

Specialist dictionaries

Published Dictionaries

While there are many spell checkers on the market, there are few with the specialist dictionaries we require.

Spellex

Spellex market several specialist dictionaries covering medical, legal and other fields. There are two of potential relevance to us:

  • Biotech “a comprehensive spell checking solution for the biotechnology field covering molecular biology, biotechnology, biomedicine, general biology, general chemistry, organic chemistry, biochemistry, biophysics, microbiology, cell biology, molecular genetics, and the life sciences.”
  • Botanical “a comprehensive botanical terminology spell checker which includes the correct spelling of tens of thousands of vascular plants, mosses, liverworts, hornworts, and lichens grown in the United States, its territories, and around the world.”

These two dictionaries are not suitable as their coverage is not aligned with our requirements.

    WordNet

    WordNet is not intended as a dictionary, in the sense of a list of words and their meanings. It goes beyond that scope to group nouns, verbs, adjectives and adverbs into sets of cognitive synonyms (synsets). Each synset expresses a distinct concept. The synsets are interlinked and can be navigated with the supplied browser.

    This scope means that WordNet contains "open-class words" only: nouns, verbs, adjectives, and adverbs. So, many common words such as pronouns and conjunctions are missing from WordNet's word list.
    WordNet can be used as a standalone utility, as a library called from other programs, or online.

    More detail at http://wordnet.princeton.edu/.

    Ispell

    This comment is only for background information:

    • Ispell is an old UNIX spell checker, superseded by Aspell.
    • Owing to its age (it was developed in the early ’70s) it can only manage Unicode text with special dictionaries.
    • Another limitation is that it will only suggest corrections that are based on a Levenshtein distance of 1.
    • Ispell is still available from http://www.lasr.cs.ucla.edu/geoff/ispell.html.
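
To make the distance-1 limitation concrete, here is a minimal sketch of plain Levenshtein distance and a suggestion filter built on it. It is not Ispell's actual implementation; the function names and the WORDS list are hypothetical and chosen only for illustration.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions and substitutions
    needed to turn string a into string b (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def suggestions(word, dictionary, max_distance=1):
    """Dictionary entries within max_distance edits of word."""
    return [entry for entry in dictionary if levenshtein(word, entry) <= max_distance]


# Hypothetical word list, for illustration only.
WORDS = ["formica", "formic", "format", "lasius"]

print(suggestions("formika", WORDS))  # ['formica'] (one substitution)
print(suggestions("fromika", WORDS))  # [] (more than one edit away)
```

With a cut-off of 1, a word containing two or more mistakes receives no suggestions at all, even when the intended word is in the dictionary.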

    Aspell

    A free software spell checker, it is the standard spell checker for the GNU software system. It is also available for other platforms including Windows. The current GNU Aspell version is 0.60.6, released April, 2008. The current Windows version of Aspell is 0.50.3, released December 2002.
    Aspell is the successor to Ispell, designed to spell check UTF-8 documents without the need for a special dictionary.
    It is available as a standalone utility (as used by David in his initial investigations into this approach of a negative word list) and as a library called by other programs, for example, Notepad++.
    More information at http://aspell.net/.

    Scratchpads developed and conceived by: Vince Smith, Simon Rycroft, Dave Roberts, Ben Scott...