Note: These tutorials are out of date; please see the Lemur Wiki instead.
Lemur Project Tutorials:
Starting Out
Indexing: Stopword Lists and Stemmers
Stopword Lists
When building an index, it is standard practice to filter out very frequent words. These words are known as "stopwords".
A stopword is a frequently used word, such as "a" or "the", that is typically not indexed because it occurs so often that it is not useful to query on.
When you create a parameter file in the Indri format, you can define a stopword list by adding a <stopper> tag and listing each word inside it in its own <word> tag, as in the following:
<stopper>
<word>a</word>
<word>and</word>
<word>the</word>
</stopper>
If you are not using the Indri format, you can specify a stopword file that contains one stopword per line.
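To make the effect of a stopword list concrete, here is a minimal Python sketch of stopword filtering. It is not Lemur or Indri code: the function names `load_stopwords` and `remove_stopwords` are hypothetical, and the lowercasing and whitespace tokenization are simplifying assumptions. The input lines mimic a stopword file with one stopword per line.

```python
def load_stopwords(lines):
    """Build a stopword set from an iterable of lines, one stopword per line."""
    stopwords = set()
    for line in lines:
        word = line.strip().lower()
        if word:  # skip blank lines
            stopwords.add(word)
    return stopwords


def remove_stopwords(tokens, stopwords):
    """Return the tokens that are not in the stopword set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stopwords]


# The same three words as the <stopper> example, as they would appear in a file.
stopword_lines = ["a\n", "and\n", "the\n"]
stopwords = load_stopwords(stopword_lines)

# Whitespace tokenization is an assumption made for this illustration only.
tokens = "the cat and the dog chased a ball".split()
filtered = remove_stopwords(tokens, stopwords)
print(filtered)  # ['cat', 'dog', 'chased', 'ball']
```

Because `load_stopwords` accepts any iterable of lines, an open file object can be passed in place of the list used here.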
Word Stemming
Stemming attempts to reduce the different forms of a word to a single term. For example, given the word "stopping", a stemmer would convert the term to "stop". This provides a basic kind of synonym matching, because a query and a document that use different forms of the same word still match. It also yields a smaller index, because only one form of each word is stored.
The Lemur Toolkit provides implementations for both the Porter and Krovetz stemming algorithms.
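The following Python sketch illustrates how stemming collapses several word forms into one index term. It is a deliberately naive suffix stripper written for this illustration. It is not the Porter or Krovetz algorithm and not part of the Lemur Toolkit, and the function name `naive_stem` is hypothetical. The real algorithms apply far more rules and handle many cases this toy gets wrong.

```python
def naive_stem(word):
    """Toy stemmer: strip a few common English suffixes. NOT Porter or Krovetz."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Only strip when a stem of at least three letters remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling after "ing"/"ed", e.g. "stopp" -> "stop".
            if suffix != "s" and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word


words = ["stop", "stops", "stopped", "stopping"]

# Without stemming, each form would be a separate index term.
unstemmed_terms = set(words)

# With stemming, all four forms collapse to a single term.
stemmed_terms = {naive_stem(w) for w in words}

print(len(unstemmed_terms), sorted(stemmed_terms))  # 4 ['stop']
```

In this example, four distinct surface forms become one stored term, which is the index-size reduction described above.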
The Lemur Project
Last modified: June 21, 2007. 09:14:12 am




