Stemming

What is Stemming?


 

Information Retrieval

Information Retrieval (IR) is essentially a matter of deciding which documents in a collection should be retrieved to satisfy a user's need for information. The user's information need is represented by a query or profile, which contains one or more search terms, plus perhaps some additional information such as importance weights. The retrieval decision is then made by comparing the terms of the query with the index terms (important words or phrases) appearing in the document itself. The decision may be binary (retrieve/reject), or it may involve estimating the degree of relevance that the document has to the query.
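The comparison described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function names, the sample documents, the weights and the threshold are all invented here, and real IR systems use far more refined relevance estimates.

```python
def score(query_weights, index_terms):
    """Graded decision: sum the weights of the query terms found in the document."""
    return sum(w for term, w in query_weights.items() if term in index_terms)


def retrieve(query_weights, index_terms, threshold=0.0):
    """Binary decision: retrieve the document if its score exceeds the threshold."""
    return score(query_weights, index_terms) > threshold


# Hypothetical query: search terms with importance weights.
query = {"stemming": 2.0, "algorithm": 1.0}

# Hypothetical collection: each document is reduced to its set of index terms.
docs = {
    "d1": {"stemming", "algorithm", "retrieval"},
    "d2": {"algorithm", "sorting"},
    "d3": {"cooking"},
}

# Rank documents by estimated relevance, best first.
ranking = sorted(docs, key=lambda d: score(query, docs[d]), reverse=True)
print(ranking)                                     # ['d1', 'd2', 'd3']
print([d for d in docs if retrieve(query, docs[d])])  # ['d1', 'd2']
```

Setting a higher threshold turns the same score into a stricter retrieve/reject rule.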

Unfortunately, the words that appear in documents and in queries often have many morphological variants. Thus, pairs of terms such as "computing" and "computation" will not be recognised as equivalent without some form of natural language processing (NLP).
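A small Python sketch makes the problem concrete. The helper name and the sample terms are assumptions made for this illustration; the point is only that plain string comparison treats two variants of the same word as unrelated.

```python
def matching_terms(query_terms, index_terms):
    """Return the query terms that appear verbatim among the index terms."""
    return set(query_terms) & set(index_terms)


# The query uses one variant, the document another.
print(matching_terms({"computing"}, {"computation", "theory"}))  # set()
```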

Stemming

In most cases, morphological variants of words have similar semantic interpretations and can be considered as equivalent for the purpose of IR applications. For this reason, a number of so-called stemming algorithms, or stemmers, have been developed, which attempt to reduce a word to its stem or root form. Thus, the key terms of a query or document are represented by stems rather than by the original words. This not only means that different variants of a term can be conflated to a single representative form; it also reduces the dictionary size, that is, the number of distinct terms needed for representing a set of documents. A smaller dictionary size results in a saving of storage space and processing time.
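The effect of conflation on dictionary size can be illustrated with a deliberately crude suffix stripper. This is a toy sketch, not any published stemming algorithm: the suffix list, the minimum stem length and the word list are assumptions chosen for the example.

```python
# Toy suffix list for this illustration only; real stemmers use far richer rules.
SUFFIXES = ("ations", "ation", "ing", "ers", "er", "es", "ed", "s")


def toy_stem(word, min_stem_len=3):
    """Strip the longest matching suffix, keeping at least min_stem_len letters."""
    word = word.lower()
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


words = ["computing", "computation", "computers", "indexing", "indexed", "indexes"]
stems = {toy_stem(w) for w in words}

print(len(set(words)), "distinct words ->", len(stems), "distinct stems:", sorted(stems))
# 6 distinct words -> 2 distinct stems: ['comput', 'index']
```

Indexing the stems instead of the original words means the dictionary holds two entries rather than six, and a query for any variant matches documents containing the others.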

For IR purposes, it doesn't usually matter whether the stems generated are genuine words or not. Thus, "computation" might be stemmed to "comput", provided that (a) different words with the same 'base meaning' are conflated to the same form, and (b) words with distinct meanings are kept separate. An algorithm which attempts to convert a word to its linguistically correct root ("compute" in this case) is sometimes called a lemmatiser.
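Conditions (a) and (b) can be checked mechanically for any candidate stemmer, given groups of words that are assumed to share a base meaning. The sketch below is hypothetical throughout: the function name, the word groups and the use of simple truncation as a stand-in for a stemmer are all assumptions made for illustration.

```python
def check_conflation(stem, groups):
    """Return (a_holds, b_holds) for a stemming function and a list of word groups.

    Each group lists words assumed to share one base meaning.
    (a) holds if every group collapses to a single stem.
    (b) holds if no stem is shared by two different groups.
    """
    stems_per_group = [{stem(w) for w in group} for group in groups]
    a_holds = all(len(s) == 1 for s in stems_per_group)
    seen, b_holds = set(), True
    for s in stems_per_group:
        if seen & s:
            b_holds = False
        seen |= s
    return a_holds, b_holds


groups = [
    ["computing", "computation", "computer"],
    ["compulsion", "compulsory"],
]

# Truncation to a fixed length stands in for a stemmer here.
print(check_conflation(lambda w: w[:6], groups))  # (True, True)
print(check_conflation(lambda w: w[:4], groups))  # (True, False): too aggressive
print(check_conflation(lambda w: w, groups))      # (False, True): too cautious
```

Note that the six-letter truncation passes both checks even though "comput" and "compul" are not genuine words, which is exactly the point made above.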

Products that use stemming algorithms include search engines such as Lycos and Google, as well as thesauruses and other products that use NLP for the purpose of IR. Stemmers and lemmatisers also have wider applications within the field of computational linguistics.

⇒ Information on the nature of Stemming Errors is available here

⇒ Information on the Evaluation and Performance of Stemming algorithms is available here

⇒ A Stemming Bibliography is available here

 

Different Stemming Algorithms:

Word lists to test out Stemming Algorithms:



Comments or questions about these web pages to cdp@comp.lancs.ac.uk