The Lancaster Stemming Algorithm
  Inter-Stemmer Similarity:    
Home

It is possible to compare two separate stemming algorithms by comparing the output they produce. This provides a measure of the similarity (or conversely, the distance) between the two algorithms. The approach is to take a set of words and apply both algorithms in turn, thus producing two output lists. Corresponding stems in the two output lists are then compared to give a measure of similarity between the stemmers. This could be useful, for example, for deciding whether two separate implementations of the same algorithm really do the same thing (for example, it can be used for comparing the various available implementations of Porter's algorithm).

A simple way to measure inter-stemmer similarity is by simply counting the proportion of the stem-pairs which are identical. A more sensitive measure, however, would be one based on the average Modified Hamming Distance between the pairs of stems. This will give a measure of inter-stemmer distance , so (as suggested by Frakes & Fox) it may be more useful to use the inverse of the average MHD as a similarity value.

As with stemmer strength, inter-stemmer similarity is not directly related to accuracy. Thus, you could have two stemmers which are very dissimilar and yet which are virtually identical in their ability to conflate related words. To see this, suppose you take a particular stemmer, and modify it so that it systematically recodes the last 3 letters of each stem (e.g., "A" to "B", "B" to "C", "C" to "D", etc.). The modified stemmer would group a collection of words in exactly the same way as the original, but the similarity between the two would be quite low. A different kind of metric would be needed to notice the similarity in this case!

A New Similarity Metric

The following metric was developed by Chris O'Neill of the Lancaster University Computing Department during the course of a student project produced under the supervision of Chris Paice during 2001.

This approach starts by considering the average Hamming distance SSM between the corresponding stems in two lists, where:

A and B are the stemmers being compared,

HD is the Modified Hamming distance between two strings, and

N is the number of words in the word list.

However, this approach leaves the following problems:

  • It would be better if the result could be normalised to give a value between 1 and 0,
  • The above metric fails to take into consideration the overall lengths of the two strings. Thus, removing ‘s' from the word ‘reds' to give ‘red', is a removal of 25% of the word, and would have a Hamming distance of 1. Removing the suffix ‘s' from the word ‘methylenedioxymethamphetamins' to give ‘methylenedioxymethamphetamin' is a removal of 3.4% of the word, yet this also gives a modified Hamming distance of 1.

A new metric was therefore developed which, though still based on the Hamming distance, also takes into account the lengths of the words being compared. It uses the idea that there is a maximum and minimum modified Hamming distance that can obtain between two strings. The minimum distance is zero, when both strings are identical, whereas the maximum distance MD is the length of the longer string. The relative distance RD is obtained from the modified hamming distance MHD thus:

Therefore, by calculating the sum of the relative distances for all pairs of stems, then dividing this by the number of words, an average distance metric is obtained. This was then further developed by subtracting the value from one, then multiplying it by 100 to give a percentage, so that 0% represents no similarity and 100% total similarity. Thus, the new stemmer similarity metric is given by the formula:

where

U & V are the Stemmers being compared,

N = Number of words in the sample

MHD = Modified Hamming Distance

MD = Maximum Distance

Introduction
Background Information
Stemming Algorithms
Algorithm Implemenatations
Evaluation Techniques
Error Counting
Stemmer Strength
Inter-Stemmer Similarity
Evaluation Program
Resources
Bibliography