The Paice/Husk Stemmer

The Paice/Husk Stemmer was developed by Chris Paice at Lancaster University in the late 1980s, and was originally implemented with assistance from Gareth Husk. The Stemmer has been implemented in Pascal, C, Perl and Java. Implementations of the Stemmer are available at a
The Paice/Husk Stemmer is a simple iterative Stemmer – that is to say, it removes the endings from a word in an indefinite number of steps. The Stemmer uses a separate rule file, which is first read into an array or list. This file is divided into a series of sections, each section corresponding to a letter of the alphabet. The section for a given letter, say "e", contains the rules for all endings ending with "e", the sections being ordered alphabetically. An index can thus be built, leading from the last letter of the word to be stemmed to the first rule for that letter.
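The indexing step can be sketched in Python as follows. This is a minimal illustration rather than any of the distributed implementations: the function name `build_rule_index` and the sample rules are hypothetical, and the sketch assumes that the rules are already grouped into per-letter sections and that each rule string begins with the final letter of the ending it handles.

```python
def build_rule_index(rules):
    """Map each letter to the position of the first rule in its section.

    Assumes `rules` is already grouped into per-letter sections and that
    each rule string starts with the last letter of the ending it handles.
    """
    index = {}
    for position, rule in enumerate(rules):
        letter = rule[0]
        # Record only the first rule seen for each letter.
        index.setdefault(letter, position)
    return index


# Hypothetical sample rules, for illustration only.
sample_rules = ["a1>", "e1>", "e1i>", "s1>"]
rule_index = build_rule_index(sample_rules)
```

With this index, the last letter of a word such as "estate" leads directly to the start of the "e" section, without scanning the rules for earlier letters.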
When a word is to be processed, the stemmer takes its last letter and uses the index to find the first rule for that letter. The rule is examined, and is accepted if:
If a rule is accepted then it is applied to the word. If it is not accepted, the rule index is incremented by one and the next rule is tried. However, if the first letter of the next rule does not match with the last letter of the word, this implies that no ending can be removed, and so the process terminates.
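The search described above might look like the following Python sketch. The names `find_rule` and `is_acceptable` are hypothetical; in particular, `is_acceptable` is a caller-supplied stand-in for the acceptance conditions, which this sketch does not try to reproduce. The rule list and index in the usage lines are made up for illustration.

```python
def find_rule(word, rules, index, is_acceptable):
    """Return the first acceptable rule for `word`, or None if there is none.

    `is_acceptable(word, rule)` is a caller-supplied predicate standing in
    for the acceptance conditions (hypothetical).
    """
    if not word:
        return None
    last = word[-1]
    position = index.get(last)
    if position is None:
        return None  # no section exists for this letter
    # Stay inside the section whose rules start with the word's last letter.
    while position < len(rules) and rules[position][0] == last:
        rule = rules[position]
        if is_acceptable(word, rule):
            return rule
        position += 1  # rule rejected: try the next one
    # Left the section for this letter: no ending can be removed.
    return None


# Hypothetical rules and index, for illustration only.
demo_rules = ["e1i>", "e1>", "s1>"]
demo_index = {"e": 0, "s": 2}
chosen = find_rule("estate", demo_rules, demo_index, lambda w, r: r == "e1>")
```

Here the first "e" rule is rejected by the stand-in predicate, so the search moves on to the second, which is returned. If every "e" rule had been rejected, the search would have reached the "s" section and stopped.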
When a rule is applied to a word, this usually means that the ending of the word is removed or replaced. For example, the rule
e1> { -e - }
means 'if the current word/stem ends with "e" then delete 1 letter and continue' (the curly brackets just contain a comment showing the rule in another form). So this is a simple 'e-removal' rule, which for example would convert "estate" to "estat". After applying this rule, the new final letter (now "t") would be taken and used to access a different section of the rule table. If, however, the final symbol had been "." instead of ">", the process would have terminated, and "estat" would have been returned at once.
Suppose now that the rule had said:
e1i> { -e -i }
In this case, the "e" would have been removed and then replaced by the letter "i" – giving, in the present case, "estati".
Once a rule has been found to match, it is not applied at once, but must first be checked to confirm that it would leave an acceptable stem. For example, it would not be sensible to apply the 'e-removal' rule to the word "me", since the remaining stem would be too short - and would not even contain a vowel!
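The check-then-apply behaviour can be sketched in Python as below. This is an illustration under stated assumptions, not the actual implementation. The rule layout is assumed to be: ending letters, a digit, optional replacement letters, then ">" or ".". Endings of more than one letter are assumed to be written in reverse. The acceptability test (a minimum stem length of 2 and at least one vowel) is a simplified guess. The names `RULE_PATTERN`, `is_acceptable_stem` and `apply_rule` are hypothetical.

```python
import re

# Assumed rule layout: ending letters, a digit (letters to delete),
# optional replacement letters, then ">" (continue) or "." (stop).
RULE_PATTERN = re.compile(r"^([a-z]+)(\d)([a-z]*)([>.])$")
VOWELS = set("aeiou")


def is_acceptable_stem(stem, min_length=2):
    """Hypothetical check: long enough and contains at least one vowel."""
    return len(stem) >= min_length and any(ch in VOWELS for ch in stem)


def apply_rule(word, rule):
    """Try one rule on `word`.

    Returns (new_word, keep_going) if the rule is applied, or None if the
    ending does not match or the remaining stem is not acceptable.
    """
    match = RULE_PATTERN.match(rule)
    if match is None:
        raise ValueError(f"unrecognised rule: {rule!r}")
    ending, count, replacement, symbol = match.groups()
    # Assumption: multi-letter endings are written reversed in the rule.
    if not word.endswith(ending[::-1]):
        return None
    stem = word[: len(word) - int(count)]
    # Check the stem before committing to the change.
    if not is_acceptable_stem(stem):
        return None
    return stem + replacement, symbol == ">"
```

Under these assumptions, `apply_rule("estate", "e1>")` gives `("estat", True)`, `apply_rule("estate", "e1i>")` gives `("estati", True)`, and `apply_rule("me", "e1>")` gives `None`, because the remaining stem "m" is too short.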
More details about this can be found on the 'How the Stemmer Operates' page.
Figure 1. Paice/Husk Stemmer

More on the Paice/Husk Stemmer at the Official Paice/Husk Homepage
Comments or questions about these web pages to cdp@comp.lancs.ac.uk