Stemming

What is Porter Stemming?



The Porter Stemmer is a conflation stemmer developed by Martin Porter at the University of Cambridge in 1980. It is based on the idea that the suffixes in the English language (approximately 1200) are mostly made up of combinations of smaller and simpler suffixes. It is a linear step stemmer: it has five steps, and each step applies a set of rules. Within a step, if a suffix rule matches a word, the conditions attached to that rule are tested on the stem that would result if the suffix were removed in the way the rule defines. For example, a condition might require that the number of vowel characters in the stem that are followed by a consonant character (the Measure) be greater than one for the rule to be applied.
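The Measure condition described above can be sketched in a few lines of Python. This is a minimal illustration rather than Porter's reference code: the function names `is_consonant` and `measure` are chosen here for the example, and the treatment of "y" (counted as a vowel when it follows a consonant) is taken from Porter's paper, not from the text above.

```python
def is_consonant(word, i):
    """Return True if word[i] is a consonant in Porter's sense."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # "y" is a consonant at the start of a word or after a vowel,
        # and a vowel when the preceding letter is a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count the vowel characters that are directly followed by a consonant."""
    m = 0
    prev_was_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and prev_was_vowel:
            m += 1  # a vowel-to-consonant transition
        prev_was_vowel = not consonant
    return m


def passes_measure_condition(stem, minimum=1):
    """Example condition: the Measure must be greater than `minimum`."""
    return measure(stem) > minimum
```

With this sketch, `measure("tree")` is 0, `measure("trouble")` is 1 and `measure("private")` is 2. A rule that removes "al" under the condition that the Measure is greater than one would therefore fire on "revival" (candidate stem "reviv", Measure 2) but not on "final" (candidate stem "fin", Measure 1).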

Once a rule passes its conditions it is accepted: the rule fires, the suffix is removed, and control moves to the next step. If the rule is not accepted, the next rule in the step is tested. This continues until either a rule from that step fires or there are no more rules in that step; in both cases control then passes to the next step. The process is repeated for all five steps, and the stemmer returns the resulting stem once control has passed from step five. See figure 2.
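The control flow just described can be sketched as follows. The rule table `STEPS` is a small illustrative subset written for this example (three short steps rather than Porter's full five), and the names `STEPS` and `run_steps` are assumptions of the sketch, so it should not be read as a complete Porter implementation.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # "y" is a vowel only when it follows a consonant
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    # Number of vowel characters directly followed by a consonant
    m, prev_was_vowel = 0, False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and prev_was_vowel:
            m += 1
        prev_was_vowel = not consonant
    return m


# Each rule is (suffix, replacement, condition on the candidate stem).
# Illustrative subset only, not the full Porter rule set.
STEPS = [
    [
        ("sses", "ss", lambda stem: True),
        ("ies", "i", lambda stem: True),
        ("ss", "ss", lambda stem: True),
        ("s", "", lambda stem: True),
    ],
    [
        ("ational", "ate", lambda stem: measure(stem) > 0),
        ("tional", "tion", lambda stem: measure(stem) > 0),
    ],
    [
        ("al", "", lambda stem: measure(stem) > 1),
        ("ance", "", lambda stem: measure(stem) > 1),
    ],
]


def run_steps(word, steps=STEPS):
    """Pass the word through each step in order and return the final stem."""
    for rules in steps:
        for suffix, replacement, condition in rules:
            if not word.endswith(suffix):
                continue  # rule does not match: test the next rule
            stem = word[: len(word) - len(suffix)]
            if condition(stem):
                word = stem + replacement
                break  # rule fired: control moves to the next step
            # condition failed: fall through and test the next rule
    return word
```

Under these assumed rules, `run_steps("caresses")` gives "caress", `run_steps("relational")` gives "relate" and `run_steps("revival")` gives "reviv". The inner `break` is what moves control to the next step as soon as one rule fires, while a failed condition simply lets the loop test the next rule in the same step.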

The Porter Stemmer is very widely available and is used in many applications. Porter himself maintains a website offering implementations in Java, C and Perl, together with a copy of the paper defining the algorithm; other implementations are available elsewhere on the Web. Porter's algorithm is probably the stemmer most widely used in information retrieval (IR) research.

Figure 2. Porter Stemmer flow chart


