![]() |
The Lancaster Stemming Algorithm |
| Stemmer Stength: | |||
| Home |
Whether the impact of a stemmer is positive depends not only on its strength, but also on whether performs its task accurately - stemmer strength metrics do not represent actual stemming accuracy Frakes & Fox (in press) have listed the following ways to measure stemmer strength:
This is the average size of the groups of words coverted to a particular stem (regardless of whether they are all correct). Thus, if the words "engineer", "engineering" and "engineered", and no others, were all stemmed to the stem "engineer", then the size of that conflation class would be 3. If the conflation of 1,000 different words resulted in 250 distinct stems, then the mean number of words per conflation class would be 4. This metric is obviously dependent on the number of words processed, but for a word collection of given size, a higher value indicates a heavier stemmer. The value is easily calculated as follows: WC = Mean number of words per conflation class N = Number of unique words before Stemming S = Number of unique stems after Stemming
The Index Compression Factor represents the extent to which a collection of unique words is reduced (compressed) by stemming, the idea being that the heavier the Stemmer, the greater the Index Compression Factor. This can be calculated by; IC = Index Compression Factor N = Number of unique words before Stemming S = Number of unique stems after Stemming
This is a simply the proportion of the words in a sample that have been changed in any way by the stemming process, the idea being that the larger the number of words that are altered, the greater the strength of the Stemmer.
There are various metrics which use the principle that strong stemmers remove more characters from words than weaker stemmers. One way is to compute the average number of letters removed when a stemmer is applied to a text collection. Thus, suppose that the nine words "react", "reacts", "reacting", "reacted", "reaction", "reactions", "reactive", "reactivity" and "reactivities" are all reduced to "react"; the numbers of characters removed are then 0, 1, 3, 2, 3, 4, 3, 5 and 7, respectively. This gives a mean removal rate of 28/9 = 3.11. Obtaining such data for a large collection of words enables us to compute not only the mean but also the mode and the median of the resulting distribution, in case these seem more useful. The meaning of 'number of characters removed' is not clear-cut for those stemmers which replace endings rather than just removing them. Thus, a stemmer may replace the endings "-ies" and "-ied" by "-y". We could count this as 3 (number of letters removed), or 2 (reduction in length), or even 4 (letters removed + letters added). Alternatively, we could use a metric such as the Hamming Distance, or Modified Hamming Distance, as discussed next.
The Hamming Distance takes two strings of equal length and counts the number of corresponding positions where the characters are different. If the strings are of different lengths, we can use the Modified Hamming Distance , MHD. Thus, suppose the string lengths are P and Q, where P<Q, we use the formula MHD = HD(1,P) + (Q-P) where HD(1,P) is the Hamming Distance for the first P characters of both strings. Applying this to a stemmer, suppose that the word "parties" is converted to "party". In this case, P=5 and Q=7, so that HD(1,P) = 1 by comparing "parti" with "party", and (Q-P) = 2, giving MHD = 3. Clearly, we can compute the average MHD value for every word in the original sample.
|
||
| Introduction | |||
| Stemming Algorithms | |||
| Algorithm Implemenatations | |||
| Evaluation Techniques | |||
| Error Counting | |||
| Stemmer Strength | |||
| Inter-Stemmer Similarity | |||
| Evaluation Program | |||
| Resources | |||
| Bibliography | |||