
 
Carnegie Mellon University, Language Technologies Institute
University of Massachusetts Amherst, CIIR
 

The Lemur Project is sponsored by the Advanced Research and Development Activity in Information Technology (ARDA) under its Statistical Language Modeling for Information Retrieval Research Program and by the National Science Foundation.


Note: These tutorials are out of date, please see the Lemur Wiki instead.


Lemur Project Tutorials:
Starting Out

Indexing: Creating a Simple Index


Contents

  1. Preparing your Documents
  2. Creating a Parameter File
  3. Building an Indri Index
  4. Building other Index Types

This page gives a quick "beginner's" overview. For a more detailed view of indexing, see the Intermediate Track's indexing pages.

Preparing your Documents

The Lemur Toolkit can inherently index several types of documents. The three most common types are TREC text, TREC web, and standard web (HTML) documents. See the next page for a more detailed overview of the document types. If you have a set of plaintext documents that you wish to index, one of the easiest ways to prepare them is to use a Perl script that iterates through the documents, wrapping each one in TREC <DOC> tags and adding a unique <DOCNO>.
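The tagging step described above can be sketched in a few lines. The example below uses Python rather than Perl, and it is only a minimal illustration under stated assumptions: the function names to_trec and convert_directory are hypothetical, the source files are assumed to be UTF-8 ".txt" files in a single directory, and the <TEXT> wrapper follows the common TREC text convention rather than anything stated on this page.

```python
from pathlib import Path


def to_trec(doc_id: str, text: str) -> str:
    """Wrap one plaintext document in TREC-style tags.

    <DOC> and <DOCNO> come from the tutorial text; the <TEXT> element
    is an assumption based on the usual TREC text layout.
    """
    return (
        "<DOC>\n"
        f"<DOCNO>{doc_id}</DOCNO>\n"
        "<TEXT>\n"
        f"{text.strip()}\n"
        "</TEXT>\n"
        "</DOC>\n"
    )


def convert_directory(src_dir: str, out_file: str) -> int:
    """Write every .txt file in src_dir into one TREC text file.

    The file name without its extension is used as the DOCNO, which is
    unique within a single directory. Returns the number of documents.
    """
    count = 0
    with open(out_file, "w", encoding="utf-8") as out:
        for path in sorted(Path(src_dir).glob("*.txt")):
            out.write(to_trec(path.stem, path.read_text(encoding="utf-8")))
            count += 1
    return count
```

Calling convert_directory("/path/to/plaintext", "/path/to/text/files/corpus.trec") would produce one file that the "trectext" class in the next section can read. Documents that themselves contain angle-bracket markup may need escaping or cleaning first; this sketch does not attempt that.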

Creating a Parameter File

After your documents are prepared, you should create a parameter file that will be used to guide the indexer to where your source documents are and where to place the index.
A sample parameter file looks like:

<parameters>
  <corpus>
    <path>/path/to/text/files/</path>
    <class>trectext</class>
    <annotations>/path/to/offset/annotations</annotations>
  </corpus>
  <memory>256m</memory>
  <index>/path/to/your/index</index>
</parameters>
	
In the parameter file above, the "corpus / path" defines where to find your source files. If this path is a directory, it will tell the indexer to index all files in the directory. The "class" parameter of the corpus defines what type of documents the source documents are. The example above uses trectext. Other common class types include "trecweb", "html" and "pdf". If the <class> parameter is left out, the index builder will attempt to parse the files based on their file extension, skipping over any files that it does not know how to process.
 
The "annotations" path for the corpus is a path to an offset annotation file to use for this corpus. For more information on using offset annotations, please see the "Working with Offset Annotations" section in the tutorials.
 
The "memory" parameter is a "soft limit" on the amount of memory the indexer should use before flushing its buffers to disk.
 
Finally, the "index" parameter tells the indexer where to place the built index.

For more information on valid parameters, please see the indexing section for IndriBuildIndex.
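If you maintain several corpora, it can be convenient to generate the parameter file rather than write it by hand. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the function name build_indri_params is hypothetical, and only the elements shown in the sample above (corpus path, class, memory, index) are emitted. The optional "annotations" element is left out.

```python
import xml.etree.ElementTree as ET


def build_indri_params(corpus_path, index_path, doc_class="trectext", memory="256m"):
    """Return an IndriBuildIndex parameter file as an XML string.

    Element names mirror the sample parameter file in this tutorial.
    """
    params = ET.Element("parameters")
    corpus = ET.SubElement(params, "corpus")
    ET.SubElement(corpus, "path").text = corpus_path
    ET.SubElement(corpus, "class").text = doc_class
    ET.SubElement(params, "memory").text = memory
    ET.SubElement(params, "index").text = index_path
    # ET.indent only exists on Python 3.9+; skip pretty-printing otherwise.
    if hasattr(ET, "indent"):
        ET.indent(params)
    return ET.tostring(params, encoding="unicode")


if __name__ == "__main__":
    xml_text = build_indri_params("/path/to/text/files/", "/path/to/your/index")
    with open("build_params.xml", "w", encoding="utf-8") as fh:
        fh.write(xml_text + "\n")
```

Building the tree with ElementTree instead of string concatenation means that special characters in paths are escaped automatically.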

Building an Indri Index

To launch an index build, run "IndriBuildIndex [param_file]" where [param_file] is the filename of the parameter file you have built.
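When the build is part of a larger script, the same command can be launched programmatically. The sketch below assumes the IndriBuildIndex executable is on your PATH; the helper names indexer_command and run_indexer are hypothetical, and the only command-line form used is the one given above (the executable followed by the parameter file).

```python
import shutil
import subprocess


def indexer_command(param_file, executable="IndriBuildIndex"):
    """Return the argument list for an index build."""
    return [executable, param_file]


def run_indexer(param_file, executable="IndriBuildIndex"):
    """Run the indexer and raise if it is missing or exits with an error."""
    if shutil.which(executable) is None:
        raise FileNotFoundError(f"{executable} was not found on PATH")
    # check=True raises CalledProcessError on a non-zero exit status.
    subprocess.run(indexer_command(param_file, executable), check=True)
```

Passing the arguments as a list avoids shell quoting problems when the parameter file path contains spaces.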

Building other Index Types

To build an index type other than an Indri index, run "BuildIndex [param_file]", where [param_file] is the filename of the parameter file you have built. Parameter files for BuildIndex accept the following parameters:
 

  1. index: name of the index table-of-contents file, without the extension. Use a full path here so that the index can be used later from other directories, e.g. /lemur/indexes/myindex
  2. indexType: the type of the index you want to build
    • key for KeyfileIncIndex (.key)
    • indri for IndriIndex (.ind)
  3. memory: memory (in bytes) to use as a soft-limit before flushing the buffers to disk.
  4. stopwords: name of file containing the stopword list.
  5. acronyms: name of file containing the acronym list, currently not supported by IndriIndex. These acronyms will still be indexed in lowercase by IndriIndex.
  6. countStopWords: If true, count stopwords in document length.
  7. docFormat:
    • trec for standard TREC formatted documents
    • web for web TREC formatted documents
    • chinese for segmented Chinese text (TREC format, GB encoding)
    • chinesechar for unsegmented Chinese text (TREC format, GB encoding)
    • arabic for Arabic text (TREC format, Windows CP1256 encoding)
  8. stemmer:
    • porter Porter stemmer.
    • krovetz Krovetz stemmer.
    • arabic Arabic stemmer; requires additional parameters:
      1. arabicStemDir: Path to directory of data files used by the Arabic stemmers.
      2. arabicStemFunc: Which stemming algorithm to apply, one of:
        • arabic_stop : arabic_stop
        • arabic_norm2 : table normalization
        • arabic_norm2_stop : table normalization with stopping
        • arabic_light10 : light9 plus ll prefix
        • arabic_light10_stop : light10 and remove stop words
  9. dataFiles: name of file containing list of datafiles to index.
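To show how these parameters fit together, the sketch below assembles a BuildIndex parameter file and checks the enumerated values against the lists above. It rests on an assumption this page does not confirm: that BuildIndex accepts the same XML <parameters> layout as the IndriBuildIndex sample, with one element per parameter name. The function name build_index_params is hypothetical, and the Arabic stemmer with its additional parameters is deliberately not covered.

```python
import xml.etree.ElementTree as ET

# Allowed values, copied from the parameter list in this tutorial.
INDEX_TYPES = {"key", "indri"}
DOC_FORMATS = {"trec", "web", "chinese", "chinesechar", "arabic"}
STEMMERS = {"porter", "krovetz"}  # "arabic" needs extra parameters; omitted here


def build_index_params(index, index_type, doc_format, data_files,
                       stemmer=None, stopwords=None, memory=None):
    """Return a BuildIndex parameter file as an XML string.

    ASSUMPTION: the XML layout mirrors the IndriBuildIndex sample.
    data_files is the name of a file listing the data files to index.
    """
    if index_type not in INDEX_TYPES:
        raise ValueError(f"unknown indexType: {index_type}")
    if doc_format not in DOC_FORMATS:
        raise ValueError(f"unknown docFormat: {doc_format}")
    if stemmer is not None and stemmer not in STEMMERS:
        raise ValueError(f"unsupported stemmer in this sketch: {stemmer}")

    params = ET.Element("parameters")
    settings = [
        ("index", index),
        ("indexType", index_type),
        ("memory", memory),
        ("stopwords", stopwords),
        ("docFormat", doc_format),
        ("stemmer", stemmer),
        ("dataFiles", data_files),
    ]
    for name, value in settings:
        if value is not None:  # optional parameters are simply left out
            ET.SubElement(params, name).text = str(value)
    return ET.tostring(params, encoding="unicode")
```

For example, build_index_params("/lemur/indexes/myindex", "key", "trec", "/lemur/filelist.txt", stemmer="krovetz") yields a parameter file for a KeyfileIncIndex over the files listed in the hypothetical filelist.txt.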

 



 


The Lemur Project
Last modified: June 21, 2007. 09:14:12 am