
 
CMU - Language Technologies Institute
Carnegie Mellon University
CIIR, University of Massachusetts Amherst
University of Massachusetts
 

The Lemur Project is sponsored by the Advanced Research and Development Activity in Information Technology (ARDA) under its Statistical Language Modeling for Information Retrieval Research Program and by the National Science Foundation.


Note: These tutorials are out of date; please see the Lemur Wiki instead.


Lemur Project Tutorials:
Starting Out

Indexing: Stopword Lists and Stemmers


Contents

  1. Stopword Lists
  2. Word Stemming

Stopword Lists

While building an index, it is common to filter out very frequent words, known as "stopwords". A stopword is a frequently used word, such as "a" or "the", that is typically not indexed because it occurs in so many documents that querying on it is not useful.
 
In a parameter file that uses the Indri format, you can define a stopword list by adding a <stopper> tag and listing each word within it, as in the following:

  <stopper>
    <word>a</word>
    <word>and</word>
    <word>the</word>
  </stopper>
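
To make the effect of such a list concrete, here is a minimal Python sketch that reads the words out of a <stopper> fragment and drops them from a list of tokens. It only illustrates the idea and is not how the toolkit processes its parameter file; the names STOPPER_XML, stopwords_from_stopper and remove_stopwords are made up for this example, and the lowercasing step is an assumption.

```python
import xml.etree.ElementTree as ET

# The <stopper> fragment from the tutorial, held in a string for illustration.
STOPPER_XML = """
<stopper>
  <word>a</word>
  <word>and</word>
  <word>the</word>
</stopper>
"""

def stopwords_from_stopper(xml_text):
    """Return the set of words listed in a <stopper> element (hypothetical helper)."""
    root = ET.fromstring(xml_text.strip())
    return {w.text.strip().lower() for w in root.iter("word") if w.text}

def remove_stopwords(tokens, stopwords):
    """Drop every token whose lowercase form is in the stopword set."""
    return [t for t in tokens if t.lower() not in stopwords]

stopwords = stopwords_from_stopper(STOPPER_XML)
tokens = "the cat and the dog chased a ball".split()
kept = remove_stopwords(tokens, stopwords)
print(kept)  # ['cat', 'dog', 'chased', 'ball']
```

Only the four content words survive, so only those would be candidates for indexing in this toy setup.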

If you are not using the Indri format, you can specify a stopword file that contains one stopword per line.
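
As a rough sketch of the one-word-per-line layout, the following Python reads such a file into a set. The function name load_stopwords is hypothetical, and skipping blank lines and lowercasing each word are assumptions made for this example, not documented behavior of the toolkit.

```python
import io

def load_stopwords(lines):
    """Build a stopword set from an iterable of lines, one word per line."""
    words = set()
    for line in lines:
        word = line.strip().lower()
        if word:  # skip blank lines
            words.add(word)
    return words

# Stand-in for an open stopword file; with a real file, pass open(path) instead.
sample_file = io.StringIO("a\nand\nthe\n\nof\n")
stoplist = load_stopwords(sample_file)
print(sorted(stoplist))  # ['a', 'and', 'of', 'the']
```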

Word Stemming

Stemming attempts to reduce the different forms of a word to a single term. For example, given the word "stopping", the stemmer would convert the term to "stop". This lets a query match other forms of the same word (for example "stops" or "stopped"), and it has the added advantage of producing a smaller index, because only one form of any particular word is stored.
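
The idea can be sketched with a deliberately crude suffix stripper in Python. This is not the Porter or Krovetz algorithm, and it will mis-stem many real words; the function toy_stem and its rules are invented here purely to show how several surface forms collapse into one index term.

```python
def toy_stem(word):
    """Very rough suffix stripper for illustration only (not Porter or Krovetz)."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "stopp" -> "stop".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word

words = ["stop", "stops", "stopped", "stopping"]
stems = [toy_stem(w) for w in words]
vocabulary = set(stems)
print(stems)                              # ['stop', 'stop', 'stop', 'stop']
print(len(set(words)), len(vocabulary))   # 4 1
```

Four distinct surface forms become a single vocabulary entry, which is where the reduction in index size comes from.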

The Lemur Toolkit provides implementations for both the Porter and Krovetz stemming algorithms.

 



 


The Lemur Project
Last modified: June 21, 2007. 09:14:12 am