The Lancaster Stemming Algorithm
  Error Counting:    
Home

It is possible to evaluate stemming by counting the numbers of two kinds of errors that occur during stemming, namely;

  • Under-Stemming
    This refers to words that should be grouped together by stemming, but aren't. This causes a single concept to be spread over various different stems, which will tend to decrease the Recall in an IR search.
  • Over-Stemming
    This refers to words that shouldn't be grouped together by stemming, but are. This causes the meanings of the stems to be diluted, which will effect Precision of IR.

Using a sample file of grouped words, these errors are then counted. This evaluation method was described by Chris Paice in a paper entitled 'Method for Evaluation of Stemming Algorithms Based on Error Counting' JASIS 47(8), August 1996, 632-649. This method returns a value for an Under-Stemming (or Conflation) index;

UI = Under-Stemming Index
CI = Conflation Index: proportion of equivalent word pairs which were successfully grouped to the same stem.

UI= 1 - CI

Also a value for an Over-Stemming (or Distinctness) index;

OI = Over-Stemming index
DI = Distinctness Index: proportion of non-equivalent word pairs which remained distinct after stemming.

OI= 1 - DI

An Example

The following examples use the so-called Truncate(n) stemmer, which simply retains the first n letters of the word, where n is a suitable integer, such as 4, 5 or 6. If the word has less than n letters to start with, it is returned unchanged.

Consider the seven words shown in the first column of the table below, which fall into two distinct conceptual groups. We will look at the effect of applying Truncate(5) and Truncate(4) to these words.

Truncate(N)


Under-Stemming Truncate(5)

1 = A pair of groupable words with identical stems. This represents a successful stemming operation.
0 = A pair of groupable words with non-identical stems. This represents an Understemming error.
X = A pair of non-groupable words. These pairs are not considered when counting Understemming errors.

In this case, the proportion of word pairs successfully merged is 10 out of 22, hence UI= 1 - (10/22) = 0.545.


Over-Stemming Truncate(5)

1 = A pair of non-groupable words with non-identical stems. This represents a successful stemming operation.
0 = A pair of non-groupable words with identical stems. This represents an Overstemming error.
X = A pair of groupable words. These pairs are not considered when counting Overstemming errors.

In this case, the proportion of word pairs which were correctly not merged to the same stem is 20 out of 20, hence OI = 1 - (20/20), OI = 0.


Under-Stemming Truncate(4)

In this case, the proportion of word pairs correctly merged is 22 out of 22, hence UI= 1 - (22/22) = 0.


Over-Stemming Truncate(4)

Here, the proportion of word pairs which were correctly not conflated is 0 out of 20, so that OI = 1 - (20/20) = 1.

The purpose of this error-counting approach is that, although it is advantageous to have the index of terms compressed, this is only useful up to a point. This is because, as conflation becomes 'heavier', the merging of distinct concepts becomes increasingly frequent. At this point, small increases in Recall are gained at the expense of a major loss of Precision.

One question mark over this approach concerns the validity of the grouped file against which the errors are assessed. In Paice (1996), these grouped files were constructed by human judgment, during scrutiny of sample word lists.

Introduction
Background Information
Stemming Algorithms
Algorithm Implemenatations
Evaluation Techniques
Error Counting
Stemmer Strength
Inter-Stemmer Similarity
Evaluation Program
Resources
Bibliography