The Lancaster Stemming Algorithm
Introduction

The Paice/Husk Stemmer was developed in the Computing Department of Lancaster University in the late 1980s. It was designed by Chris Paice with the assistance of Gareth Husk, and was first implemented in the Pascal programming language. As Pascal has declined in popularity, further implementations have been written in ANSI C and Java. A Perl version has also been implemented by Mary Taffet at the Center for Natural Language Processing at Syracuse University. All of these versions are available from this site, and all (except possibly the C version) are believed to be accurate (but note the Disclaimer below).

The Paice/Husk Stemmer consists of a stemming algorithm and a separate set of stemming rules. The standard rule set provides a rather 'strong' or 'heavy' stemmer, one that conflates words quite aggressively. This strength can be extremely advantageous for index compression. A heavy stemmer, however, tends to produce a rather large number of overstemming errors (unrelated words reduced to the same stem) relative to the number of understemming errors (related words left with different stems). Users who would prefer a lighter stemmer can develop their own rule sets.

The Paice/Husk Stemmer is a simple iterative stemmer, in that endings are removed piecemeal in an indefinite number of stages. As stated above, it uses a rule file, which is first read into a list. Full details regarding the operation of the stemmer are provided in the Stemming Algorithms section.
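
The idea of removing endings piecemeal from a list of rules can be illustrated with a minimal Python sketch. It is not the Paice/Husk implementation and does not use the standard rule set or its rule-file syntax: the `Rule` fields, the toy rules in `TOY_RULES`, and the `MIN_STEM_LEN` acceptability check are all assumptions made for illustration only.

```python
from typing import List, NamedTuple


class Rule(NamedTuple):
    """One hypothetical stemming rule (illustrative format only)."""
    ending: str       # suffix that must match the end of the current stem
    remove: int       # number of trailing characters to strip
    append: str       # string added after stripping (may be empty)
    keep_going: bool  # True: try another pass; False: stop after this rule


# Toy rules for illustration, NOT the standard Paice/Husk rule set.
TOY_RULES: List[Rule] = [
    Rule("ies", 3, "y", True),
    Rule("ing", 3, "", True),
    Rule("ness", 4, "", True),
    Rule("ful", 3, "", True),
    Rule("s", 1, "", True),
]

# Assumed acceptability check: never leave a stem shorter than this.
MIN_STEM_LEN = 3


def stem(word: str, rules: List[Rule] = TOY_RULES) -> str:
    """Strip endings one at a time until no rule applies or a rule says stop."""
    current = word.lower()
    while True:
        for rule in rules:
            if not current.endswith(rule.ending):
                continue
            candidate = current[:len(current) - rule.remove] + rule.append
            if len(candidate) < MIN_STEM_LEN:
                continue  # result too short, so try the next rule instead
            current = candidate
            if not rule.keep_going:
                return current
            break  # a rule fired: start another pass over the rule list
        else:
            return current  # no rule applied in this pass, so we are done


if __name__ == "__main__":
    for w in ["hopefulness", "ponies", "sings"]:
        print(w, "->", stem(w))
```

With these toy rules, "hopefulness" is reduced in two stages ("hopeful", then "hope"), which is the piecemeal behaviour described above. Keeping the rules in a plain list, separate from the loop, mirrors the split between algorithm and rule set: a lighter stemmer is obtained by supplying a different list rather than changing the code.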

 
