Stemming

Stemming Performance


Direct Assessment

The most primitive method for assessing the performance of a stemmer is to examine its behaviour when applied to samples of words - especially words which have already been arranged into 'conflation groups'. This way, specific errors (e.g., failing to merge "maintained" with "maintenance", or wrongly merging "experiment" with "experience") can be identified, and the rules adjusted accordingly. This approach is of very limited utility on its own, but can be used to complement other methods, such as the error-counting approach outlined later.

 

Information Retrieval

The most obvious method for comparing the usefulness of Stemmers for the field of IR is by their impact on IR performance, using a testing system and a ‘test collection’ of documents, queries and relevance judgments. This involves substituting different Stemmers to see which gives the best results in terms of performance metrics such as Precision and Recall. However, there are problems with using such a technique for deciding which stemmer to use in an IR system, since the results are frequently indecisive. Thus, the ‘best’ Stemmer may be different for different databases and different searches.


Error Counting

It is possible to evaluate stemming by counting the numbers of two kinds of errors that occur during stemming, namely;

Using a sample file of grouped words, these errors are then counted. This evaluation method was described by Chris Paice in a paper entitled 'Method for Evaluation of Stemming Algorithms Based on Error Counting' JASIS 47(8), August 1996, 632-649. This method returns a value for an Under-Stemming (or Conflation) index;

UI = Under-Stemming Index
CI = Conflation Index: proportion of equivalent word pairs which were successfully grouped to the same stem.

UI= 1 - CI

Also a value for an Over-Stemming (or Distinctness) index;

OI = Over-Stemming index
DI = Distinctness Index: proportion of non-equivalent word pairs which remained distinct after stemming.

OI= 1 - DI

See Stemming Errors for a fuller description of all this.

The purpose of this error-counting approach is that, although it is advantageous to have the index of terms compressed, this is only useful up to a point. This is because, as conflation becomes 'heavier', the merging of distinct concepts becomes increasingly frequent. At this point, small increases in Recall are gained at the expense of a major loss of Precision.

One question mark over this approach concerns the validity of the grouped file against which the errors are assessed. In Paice (1996), these grouped files were constructed by human judgment, during scrutiny of sample word lists.


Stemmer Strength

A 'weak' or 'light' stemmer is one which only merges a few of the most highly related words together - for example, just singular and plural forms, or inflective variants of verbs ("react", "reacts", "reacting", "reacted"). A 'strong' or 'heavy' stemmer, on the other hand, merges a much wider variety of forms (… "reaction", "reactions", "reactive", "reactivity", "reactant" etc.). In IR searching, light stemming seems to be most effective; heavy stemming increases the chances of ambiguity and confusion (for instance, merging "author" with "authority" is generally not helpful). In other situations - e.g., for displaying terms in a user interface - a heavier stemmer may be quite useful.

Whether the impact of a stemmer is positive depends not only on its strength, but also on whether performs its task accurately - stemmer strength metrics do not represent actual stemming accuracy (for further discussion of which, see Stemming Errors).

Frakes & Fox (in press) have listed the following ways to measure stemmer strength:

 

This metric is obviously dependent on the number of words processed, but for a word collection of given size, a higher value indicates a heavier stemmer. The value is easily calculated as follows:

WC = Mean number of words per conflation class

N = Number of unique words before Stemming

S = Number of unique stems after Stemming

Mean number of words per conflation class

IC = Index Compression Factor

N = Number of unique words before Stemming

S = Number of unique stems after Stemming

Index Compression Factor

 

Obtaining such data for a large collection of words enables us to compute not only the mean but also the mode and the median of the resulting distribution, in case these seem more useful.

The meaning of 'number of characters removed' is not clear-cut for those stemmers which replace endings rather than just removing them. Thus, a stemmer may replace the endings "-ies" and "-ied" by "-y". We could count this as 3 (number of letters removed), or 2 (reduction in length), or even 4 (letters removed + letters added). Alternatively, we could use a metric such as the Hamming Distance, or Modified Hamming Distance, as discussed next.

 

MHD = HD(1,P) + (Q-P)

where HD(1,P) is the Hamming Distance for the first P characters of both strings. Applying this to a stemmer, suppose that the word "parties" is converted to "party". In this case, P=5 and Q=7, so that HD(1,P) = 1 by comparing "parti" with "party", and (Q-P) = 2, giving MHD = 3. Clearly, we can compute the average MHD value for every word in the original sample.

 


Inter-Stemmer Similarity

It is possible to compare two separate stemming algorithms by comparing the output they produce. This provides a measure of the similarity (or conversely, the distance) between the two algorithms. The approach is to take a set of words and apply both algorithms in turn, thus producing two output lists. Corresponding stems in the two output lists are then compared to give a measure of similarity between the stemmers. This could be useful, for example, for deciding whether two separate implementations of the same algorithm really do the same thing (for example, it can be used for comparing the various available implementations of Porter's algorithm).

A simple way to measure inter-stemmer similarity is by simply counting the proportion of the stem-pairs which are identical. A more sensitive measure, however, would be one based on the average Modified Hamming Distance between the pairs of stems. This will give a measure of inter-stemmer distance, so (as suggested by Frakes & Fox) it may be more useful to use the inverse of the average MHD as a similarity value.

As with stemmer strength, inter-stemmer similarity is not directly related to accuracy. Thus, you could have two stemmers which are very dissimilar and yet which are virtually identical in their ability to conflate related words. To see this, suppose you take a particular stemmer, and modify it so that it systematically recodes the last 3 letters of each stem (e.g., "A" to "B", "B" to "C", "C" to "D", etc.). The modified stemmer would group a collection of words in exactly the same way as the original, but the similarity between the two would be quite low. A different kind of metric would be needed to notice the similarity in this case!

A New Similarity Metric

The following metric was developed by Chris O'Neill of the Lancaster University Computing Department during the course of a student project produced under the supervision of Chris Paice during 2001.

This approach starts by considering the average Hamming distance SSM between the corresponding stems in two lists, where:

A and B are the stemmers being compared,

HD is the Modified Hamming distance between two strings, and

N is the number of words in the word list.

Stemmer Similarity Metric

However, this approach leaves the following problems:

A new metric was therefore developed which, though still based on the Hamming distance, also takes into account the lengths of the words being compared. It uses the idea that there is a maximum and minimum modified Hamming distance that can obtain between two strings. The minimum distance is zero, when both strings are identical, whereas the maximum distance MD is the length of the longer string. The relative distance RD is obtained from the modified hamming distance MHD thus:

Relative Distance

Therefore, by calculating the sum of the relative distances for all pairs of stems, then dividing this by the number of words, an average distance metric is obtained. This was then further developed by subtracting the value from one, then multiplying it by 100 to give a percentage, so that 0% represents no similarity and 100% total similarity. Thus, the new stemmer similarity metric is given by the formula:

New Similarity Metric

where

U & V are the Stemmers being compared,

N = Number of words in the sample

MHD = Modified Hamming Distance

MD = Maximum Distance

 


Back to: The Offical Paice/Husk Homepage

BackBack to: What is Stemming?


Lancaster University WWW | Computing Department Intranet | Computing Department FTP server

Comments or questions about these web pages to cdp@comp.lancs.ac.uk