![]() |
The Lancaster Stemming Algorithm |
| Porter: | |||
| Home | The Porter stemmer was first presented in 1980 and was developed by Martin Porter at the University of Cambridge. The stemmer is a context sensitive suffix removal algorithm. It is the most widely used of all the stemmers and implementations in many languages are available. The stemmer is divided into a number of linear steps, five or six depending upon the definition of a step, that are used to produce the final stem. The following section will describe each step in turn, combined with a diagram displaying all the key steps of the algorithm. The following definitions regarding the stemmer need to be made before the steps can be explained. A consonant is a letter other than A, E, I, O, U and Y preceded by a consonant. For example the in the word boy the consonants are B and Y, but in try they are T and R. A vowel is any letter that is not a consonant. A list of consonants greater than or equal to length one will be denoted by a C and a similar list of vowels by a V. Any word can therefore be represented by the single form; [C] (VC)m [V] Where the superscript m denotes m repetitions of VC and the square brackets [] denote the optional presence of their contents [19] The value m is called the measure of a word and can take any value greater than or equal to zero, and is used to decide whether a given suffix should be removed. All such rules are of the form; (condition) S1 -> S2 Which means that the suffix S1 is replaced by S2 if the remaining letters of S1 will satisfy the condition. The first step of the algorithm is designed to deal with past participles and plurals. This step is the most complex and is separated into three parts in the original definition, 1a, 1b and 1c. The first part deals with plurals, for example sses -> ss and removal of s. The second part removes ed and ing, or performs eed ? ee where appropriate. The second part continues only if ed or ing is removed and transforms the remaining stem to ensure that certain suffices are recognised later. The third part simply transforms a terminal y to an i, this part is inserted as step 2 in the flowchart. The remaining steps are relatively straightforward and contain rules to deal with different order classes of suffices, initially transforming double suffices to a single suffix and then removing suffices providing the relevant conditions are met. |
||
| Introduction | |||
| Stemming Algorithms | |||
| Porter | |||
| Lovins | |||
| Paice/Husk | |||
| Others | |||
| Algorithm Implemenatations | |||
| Evaluation Techniques | |||
| Evaluation Program | |||
| Resources | |||
| Bibliography | |||