Word clustering

One of our intended deliverables is a set of lists of confusable terms that can help us understand the orthographic variants of a word. These lists will support later work on searching and fuzzy matching.

Attached to this page are some scripts that advance this work:

  • soundexer.php: script to calculate a word’s Soundex value.
  • levenshteiner.php: script to calculate the Levenshtein distance between each pair of words in a file, and write out a list of the closer matches.
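
The two measures named above can be sketched as follows. This is an illustrative Python sketch of the general techniques, not the code in the attached PHP scripts. It assumes the common American Soundex rules (PHP also provides built-in soundex() and levenshtein() functions, though this page does not say whether the attached scripts use them). The function names, the sample words and the max_distance threshold are hypothetical, since the page does not state what counts as a "closer" match.

```python
from itertools import combinations

# Soundex digit for each consonant; vowels, h, w and y have no digit.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the four-character American Soundex code for a word."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    code = letters[0].upper()
    last = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate repeated codes
        digit = CODES.get(ch, "")
        if digit and digit != last:
            code += digit
            if len(code) == 4:
                break
        last = digit  # a vowel resets this, so a repeated code is kept
    return code.ljust(4, "0")


def levenshtein(a, b):
    """Return the number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter word as the row to save memory
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def close_matches(words, max_distance=2):
    """List word pairs within max_distance edits (threshold is an assumption)."""
    pairs = []
    for first, second in combinations(sorted(set(words)), 2):
        distance = levenshtein(first, second)
        if distance <= max_distance:
            pairs.append((first, second, distance))
    return pairs


if __name__ == "__main__":
    sample = ["colour", "color", "thorax", "thoracic"]  # hypothetical words
    for word in sample:
        print(word, soundex(word))
    for first, second, distance in close_matches(sample):
        print(first, second, distance)
```

Comparing every pair of words takes time proportional to the square of the vocabulary size, so for a whole volume it may help to compare only words that share a Soundex code or have similar lengths. That is a design suggestion, not a description of how the attached scripts work.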

The output from these scripts for our sample file, Bulletin of the British Museum (Natural History) Volume 51 Entomology Series, is attached. All eleven of our exemplar volumes have been processed. For access to the other output files, and for full details of this work, ask to see the ABLE\BoB\wordClustering and ABLE\wordClustering folders on the OU's penelope server.

Attachment                                              Size
levenshteiner.php_.txt                                  3.34 KB
soundexer.php_.txt                                      1.33 KB
bulletinofbritis51entolond_levenshteined.csv_.txt       344.76 KB
bulletinofbritis51entolond_soundexed.csv_.txt           79.18 KB