![]() |
The Lancaster Stemming Algorithm |
| Inter-Stemmer Similarity: | |||
| Home | It is possible to compare two separate stemming algorithms by comparing the output they produce. This provides a measure of the similarity (or conversely, the distance) between the two algorithms. The approach is to take a set of words and apply both algorithms in turn, thus producing two output lists. Corresponding stems in the two output lists are then compared to give a measure of similarity between the stemmers. This could be useful, for example, for deciding whether two separate implementations of the same algorithm really do the same thing (for example, it can be used for comparing the various available implementations of Porter's algorithm). A simple way to measure inter-stemmer similarity is by simply counting the proportion of the stem-pairs which are identical. A more sensitive measure, however, would be one based on the average Modified Hamming Distance between the pairs of stems. This will give a measure of inter-stemmer distance , so (as suggested by Frakes & Fox) it may be more useful to use the inverse of the average MHD as a similarity value. As with stemmer strength, inter-stemmer similarity is not directly related to accuracy. Thus, you could have two stemmers which are very dissimilar and yet which are virtually identical in their ability to conflate related words. To see this, suppose you take a particular stemmer, and modify it so that it systematically recodes the last 3 letters of each stem (e.g., "A" to "B", "B" to "C", "C" to "D", etc.). The modified stemmer would group a collection of words in exactly the same way as the original, but the similarity between the two would be quite low. A different kind of metric would be needed to notice the similarity in this case! A New Similarity Metric The following metric was developed by Chris O'Neill of the Lancaster University Computing Department during the course of a student project produced under the supervision of Chris Paice during 2001. This approach starts by considering the average Hamming distance SSM between the corresponding stems in two lists, where: A and B are the stemmers being compared, HD is the Modified Hamming distance between two strings, and N is the number of words in the word list.
However, this approach leaves the following problems:
A new metric was therefore developed which, though still based on the Hamming distance, also takes into account the lengths of the words being compared. It uses the idea that there is a maximum and minimum modified Hamming distance that can obtain between two strings. The minimum distance is zero, when both strings are identical, whereas the maximum distance MD is the length of the longer string. The relative distance RD is obtained from the modified hamming distance MHD thus:
Therefore, by calculating the sum of the relative distances for all pairs of stems, then dividing this by the number of words, an average distance metric is obtained. This was then further developed by subtracting the value from one, then multiplying it by 100 to give a percentage, so that 0% represents no similarity and 100% total similarity. Thus, the new stemmer similarity metric is given by the formula:
where U & V are the Stemmers being compared, N = Number of words in the sample MHD = Modified Hamming Distance MD = Maximum Distance |
||
| Introduction | |||
| Stemming Algorithms | |||
| Algorithm Implemenatations | |||
| Evaluation Techniques | |||
| Error Counting | |||
| Stemmer Strength | |||
| Inter-Stemmer Similarity | |||
| Evaluation Program | |||
| Resources | |||
| Bibliography | |||