The Lancaster Stemming Algorithm
  Stemmer Stength:    
Home


A 'weak' or 'light' stemmer is one which only merges a few of the most highly related words together - for example, just singular and plural forms, or inflective variants of verbs ("react", "reacts", "reacting", "reacted"). A 'strong' or 'heavy' stemmer, on the other hand, merges a much wider variety of forms (… "reaction", "reactions", "reactive", "reactivity", "reactant" etc.). In IR searching, light stemming seems to be most effective; heavy stemming increases the chances of ambiguity and confusion (for instance, merging "author" with "authority" is generally not helpful). In other situations - e.g., for displaying terms in a user interface - a heavier stemmer may be quite useful.

Whether the impact of a stemmer is positive depends not only on its strength, but also on whether performs its task accurately - stemmer strength metrics do not represent actual stemming accuracy

Frakes & Fox (in press) have listed the following ways to measure stemmer strength:

  • Number of words per conflation class

  • This is the average size of the groups of words coverted to a particular stem (regardless of whether they are all correct). Thus, if the words "engineer", "engineering" and "engineered", and no others, were all stemmed to the stem "engineer", then the size of that conflation class would be 3. If the conflation of 1,000 different words resulted in 250 distinct stems, then the mean number of words per conflation class would be 4.

This metric is obviously dependent on the number of words processed, but for a word collection of given size, a higher value indicates a heavier stemmer. The value is easily calculated as follows:

WC = Mean number of words per conflation class

N = Number of unique words before Stemming

S = Number of unique stems after Stemming

  • Index Compression

  • The Index Compression Factor represents the extent to which a collection of unique words is reduced (compressed) by stemming, the idea being that the heavier the Stemmer, the greater the Index Compression Factor. This can be calculated by;

IC = Index Compression Factor

N = Number of unique words before Stemming

S = Number of unique stems after Stemming

  • The Word Change Factor

  • This is a simply the proportion of the words in a sample that have been changed in any way by the stemming process, the idea being that the larger the number of words that are altered, the greater the strength of the Stemmer.
  • Number of Characters Removed

  • There are various metrics which use the principle that strong stemmers remove more characters from words than weaker stemmers. One way is to compute the average number of letters removed when a stemmer is applied to a text collection. Thus, suppose that the nine words "react", "reacts", "reacting", "reacted", "reaction", "reactions", "reactive", "reactivity" and "reactivities" are all reduced to "react"; the numbers of characters removed are then 0, 1, 3, 2, 3, 4, 3, 5 and 7, respectively. This gives a mean removal rate of 28/9 = 3.11.

Obtaining such data for a large collection of words enables us to compute not only the mean but also the mode and the median of the resulting distribution, in case these seem more useful.

The meaning of 'number of characters removed' is not clear-cut for those stemmers which replace endings rather than just removing them. Thus, a stemmer may replace the endings "-ies" and "-ied" by "-y". We could count this as 3 (number of letters removed), or 2 (reduction in length), or even 4 (letters removed + letters added). Alternatively, we could use a metric such as the Hamming Distance, or Modified Hamming Distance, as discussed next.

  • Hamming Distance

  • The Hamming Distance takes two strings of equal length and counts the number of corresponding positions where the characters are different. If the strings are of different lengths, we can use the Modified Hamming Distance , MHD. Thus, suppose the string lengths are P and Q, where P<Q, we use the formula

MHD = HD(1,P) + (Q-P)

where HD(1,P) is the Hamming Distance for the first P characters of both strings. Applying this to a stemmer, suppose that the word "parties" is converted to "party". In this case, P=5 and Q=7, so that HD(1,P) = 1 by comparing "parti" with "party", and (Q-P) = 2, giving MHD = 3. Clearly, we can compute the average MHD value for every word in the original sample.

 

Introduction
Background Information
Stemming Algorithms
Algorithm Implemenatations
Evaluation Techniques
Error Counting
Stemmer Strength
Inter-Stemmer Similarity
Evaluation Program
Resources
Bibliography