Another Stemmer

Chris D. Paice

Department of Computing
Lancaster University
Bailrigg, Lancaster LA1 4YR
U.K.
cdp@comp.lancs.ac.uk


1. Conflation and Stemming

In natural language processing, conflation is the process of merging or lumping together nonidentical words which refer to the same principal concept. This can relate both to words which are entirely different in form (e.g, "group" and "collection"), and to words which share some common root (e.g., "group", "grouping", "subgroups".) In the former case the words can only be mapped by referring to a dictionary or thesaurus, but in the latter case use can be made of the orthographic similarities between the forms. One popular approach is to remove affixes from the input words, thus reducing them to a stem; if this could be done correctly, all the variant forms of a word would be converted to the same standard form. Since the process is aimed at mapping for retrieval purposes, the stem need not be a linguistically correct lemma or root (see also Frakes 1982).

It is commonly accepted that removal of word-endings (sometimes called suffix stripping) is a good idea; removal of prefixes can be useful in some subject domains (chemistry is an obvious example), but is not so widely practised. In this note I will use the term stemming to refer exclusively to the removal (or sometimes replacement) of inflectional and derivational suffixes.

A number of stemming algorithms have been described in the literature, notable among them being Lovins (1968), Dawson (1974) and Porter (1980); for other sources, see Lennon et al. (1981) or Frakes (1982).

The Lovins stemmer refers to a table of 294 endings, arranged on a 'longest match' principle. Once an ending has been removed, the truncated form is referred to a separate table of 'recoding rules" which deals with double consonants and carries out various other adjustments (e.g., "-iev" is changed to "-ief" and "-olv" to "-olut"). The Dawson stemmer uses a much more comprehensive list of around 1,200 suffixes, organised as a set of branched character trees for rapid access. In this case there is no recoding stage, remaining differences between stemmed words being allowed for afterwards by a 'partial matching' routine. By contrast, the Porter algorithm proceeds through a fixed succession of five stages, with a different lookup table being used each time. In all these methods, the individual rules are accompanied by conditions designed to improve the accuracy of the stemming and to prevent words from being shortened too far (e.g., to prevent "ring" and "red" being converted to "r").

The performance of stemming and conflation procedures are usually measured in terms of their effect on retrieval performance in IR test collections - see in particular the work by Lennon et al. (1981). It is fairly clear that conflation generally has a beneficial effect on retrieval performance, but (a) there is little to choose between different algorithms, and (b) the results vary quite a lot between different test collections (see also Frakes 1982, chapter II).

References in the literature to stemming algorithms habitually talk about 'Porter's algorithm', 'Lovins' algorithm' etc. as though these are fixed and immutable entities. In fact, the papers by Porter and by Lovins do two different things: they describe a general algorithm for stemming, and they also provide a specific collection of rules to which the algorithm may be applied [1]. It is obvious that their rule tables could be adjusted in order to optimise them, or to adapt them to different applications. The fact that nobody apparently tries to do this [2] may be tied up with the fact that these procedures are only ever evaluated in terms of overall IR system performance. This means that the evaluations do not provide any detailed feedback on what is actually happening during the stemming process; hence they provide no guidance on how the rule tables might be improved.


2. The Paice/Husk Stemmer [3]

This note outlines a stemming algorithm which has been routinely used at Lancaster University for several years. Though it has not so far been formally evaluated, it appears to work well, and is quite efficient and easy to implement [4].

The Paice/Husk stemmer is iterative, and uses just one table of rules; each rule may specify either deletion or replacement of an ending, so there is no need for anything corresponding to Lovins' recoding stage. The rules are grouped into sections corresponding to the final letter of the suffix; this means that the rule table is accessed quickly by looking up the final letter of the current word or truncated word. Within each section the ordering of the rules is significant. Some rules are restricted to 'intact' words - i.e., words from which no ending has yet been removed. A simple blanket acceptability test is applied before any matching rule is activated After a rule has been applied, processing may be allowed to continue iteratively, or may be terminated.

There now follows a more detailed description of the stemmer; a sample set of rules is given in the Appendix.


2.1 The Rule Format

In the following, the term form refers to any word or part-word which is being considered for stemming. The original ward before any changes have been made to it is said to be intact. Each line in the rule-table (see Appendix) holds a separate stemming rule. Braces {...} enclose comments describing the action of each rule.

The rules in the table are grouped into sections, each containing all those rules relating to a particular final letter (known as the section letter).

In the current Pascal implementation the rules are held in an array. Each rule is represented as a string of characters (nine characters suffice for all rules at the present time). The procedure which reads in the rules constructs an index array allowing fast access to the relevant section. Each rule has five components, two of which are optional:

Within each section, the order of the rules is significant

Example 1: the rule "sei3y>" means: if the word end in "-ies" then replace the last three letters by. "-y" and then apply the stemmer again to the truncated form.

Example 2: the rule "mu*2." means: if the word ends in "-um" and if the word is intact, then remove the last two letters and terminate. This converts "maximum" to "maxim" but leaves "presum" (from "presumably" etc.) unchanged.

Example 3: the rule "ylp0." means: if the ward ends in "-ply" then leave it unchanged and terminate. This ensures that the subsequent rule "yl2>" does not remove the "-ly" from "multiply".

Example 4: the rule "nois4j>' causes "-sion" endings to be replaced by "-j". This acts as a dummy, causing activation of the "j" section of the rules (q.v.). Hence "provision" is converted first to "provij" and then to "provid"


2.2 The Algorithm

1. Select relevant section:

Inspect the final letter of the form;
if there is no section corresponding to that letter, then terminate;
otherwise, consider the first rule in the relevant section.

2. Check applicability of rule:

If the final letters of the form do not match the reversed ending in the rule, then goto 4;
if the ending matches, and the intact flag is set, and the form is not intact, then goto 4;
if the acceptability conditions (see below) are not satisfied, then goto 4.

3. Apply rule:

Delete from the right end of the form the number of characters specified by the remove total; if there is an append string, then append it to the form;
if the continuation symbol is "." then terminate;
otherwise (if the continuation symbol is ">") then goto 1

4. Look for another rule:

Move to the next rule in the table;
if the section letter has changed then terminate;
otherwise goto 2.


2.3 Acceptability Conditions
If these conditions were not present, the words "rent", "rant", "rice", "rage", "rise", "rate", "ration", and "river" would all be reduced to "r" by the rules shown. The conditions used are:
a) if the form starts with a vowel then at least two letters must remain after stemming (e.g.,
"owed"/"owing" -> "ow", but not "ear" -> "e").
b) if the form starts with a consonant men at least three letters must remain after stemming and at least one of these must be a vowel or "y" (e.g., "saying" -> "say" and "crying" ->
"cry", but not "string" -> "str", "meant" -> "me" or "cement" -> "ce").
These conditions wrongly prevent the stemming of various short-rooted words (e.g., "doing", "dying", "being"); it is probably best to deal with these separately by lexical lookup.


Notes:
[1] Of course, the present paper does exactly the same thing!
[2] I daresay there are people who tinker with the rules, but they presumably do so in an ad hoc fashion; no systematic work on stemmer optimisation seems to have appeared in the literature.
[3] 'Paice/Husk' serves to distinguish this from a more primitive algorithm outlined in Paice (1977), and acknowledges the contribution of Gareth Husk in implementing and testing the algorithm.
[4] The author can supply Pascal procedures to read in a rule set and to apply the stemmer.
[5] This is of course not essential, but makes the implementation slightly simpler.


References

Dawson,J.L 1974: "Suffix removal and word connation," ALLC Bulletin, 2(3), 33-46 (1974).

Frakes,W.B. 1982: Term Conflation for Information Retrieval, PhD. dissertation, Syracuse University, August 1982.

Lennon,M., Pierce,D.S., Tarry,B.D. and Willett,P. 1981: "An evaluation of some conflation algorithms for information retrieval", Journal of information Science, 3, 177-183 (1981).

Lovins,J.B. 1968: "Development of a stemming algorithm". Mechanical Translation and Computational Linguistics, 11, 22-31(1968).

Paice,C.D. 1977: Information Retrieval and the Computer, London: MacDonald & Jane's, 1977; Chapter 4.

Porter,M.F. 1980: "An algorithm for suffix stripping", Program, 14, 130-137 (1980)

Ulmschneider,J. and Doszkocs,T, 1983: "A practical stemming algorithm for online search assistance". Online Review, 7(4), (1983).