Note: These tutorials are out of date; please see the Lemur Wiki instead.
Lemur Project Tutorials:
Starting Out
Indexing: Lemur Index Types
Overview
As of version 4.4, the Lemur Toolkit supports two index types:
| Index Type | Description |
|---|---|
| KeyfileIncIndex | An index that stores the indexed terms in an inverted list as well as the positions of the terms within the documents. This index can also have a document manager associated with it to store properties (metadata) about each document. The KeyfileIncIndex also has the additional ability to support incremental building by allowing new documents to be added to the index. |
| IndriIndex | The Indri index is the most complex and durable indexing format that the Lemur Toolkit supports. It stores the terms in an inverted list along with the positions of the terms within each document. It can also store metadata and field properties for each document, and it keeps a compressed copy of the original document for quick, cached retrieval. The Indri index also inherently supports adding documents to the index on the fly. |
Feature List
Trying to decide which index type to use for your application? The tables below describe and compare the features available in the Keyfile and Indri indexes:
Storage
| Feature | Keyfile Index | Indri Index |
|---|---|---|
| Stored Term Positions | Yes | Yes |
| Stored Document Representation | Only stores the term vector as term IDs. | Keeps an internal copy of the source document |
| Space Usage | Approx. 1.2x the corpus size | Approx. 2x the corpus size |

Indexing
| Feature | Keyfile Index | Indri Index |
|---|---|---|
| Indexable Document Formats | TREC Text; TREC Web; HTML | TREC Text; TREC Web; HTML; Plain Text; XML; PDF; MBox; Microsoft Word and PowerPoint (Windows-only) |
| Stored Metadata | Yes | Yes |
| Fields / Annotations support | No | Yes. Can be either in-line or in the form of offset annotations |
| Document Priors | No | Yes |
| Incremental Indexing | Yes, but offline only | Yes, allows for incremental indexing while index is in-use (online). |

Retrieval
| Feature | Keyfile Index | Indri Index |
|---|---|---|
| Query Language | Uses an implementation of the InQuery Query Language | Uses the Indri Query Language |
| Wildcard Support | No | Yes (Suffix-based wildcards only) |
| INEX Task Support (NEXI query language support) | No | Yes |
| XPath-like support for structured document queries | No | Yes |

Applications
| Feature | Keyfile Index | Indri Index |
|---|---|---|
| Index Building | BuildIndex | IndriBuildIndex or BuildIndex |
| Batch Retrieval | RetEval | IndriRunQuery or RetEval |
| Gathering Anchor Text | n/a | harvestlinks |
| Adding Priors | n/a | makeprior |
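
The comparison above can also be encoded as data, which may be handy when the choice of index type has to be made programmatically. The sketch below is a hypothetical Python helper, not part of the Lemur Toolkit: the feature keys and the names `FEATURES`, `SPACE_FACTOR`, `candidates`, and `estimated_index_bytes` are invented for this illustration, and the values are copied from the tables, including the approximate 1.2x and 2x space figures.

```python
# Feature support as listed in the tables above (True = supported).
# All keys and names here are hypothetical labels chosen for this sketch.
FEATURES = {
    "KeyfileIncIndex": {
        "term_positions": True,
        "stored_metadata": True,
        "stores_source_document": False,  # keeps only term vectors as term IDs
        "fields": False,
        "document_priors": False,
        "online_incremental_indexing": False,  # incremental, but offline only
        "wildcards": False,
        "nexi_queries": False,
    },
    "IndriIndex": {
        "term_positions": True,
        "stored_metadata": True,
        "stores_source_document": True,
        "fields": True,
        "document_priors": True,
        "online_incremental_indexing": True,
        "wildcards": True,  # suffix-based wildcards only
        "nexi_queries": True,
    },
}

# Approximate index size as a multiple of the corpus size (from the table).
SPACE_FACTOR = {"KeyfileIncIndex": 1.2, "IndriIndex": 2.0}


def candidates(required):
    """Return the index types that support every feature in `required`."""
    known = set(FEATURES["IndriIndex"])
    unknown = set(required) - known
    if unknown:
        raise KeyError(f"unknown feature(s): {sorted(unknown)}")
    return [
        name
        for name, supported in FEATURES.items()
        if all(supported[feature] for feature in required)
    ]


def estimated_index_bytes(corpus_bytes, index_type):
    """Rough on-disk index size, using the approximate factors above."""
    return int(round(corpus_bytes * SPACE_FACTOR[index_type]))


print(candidates(["term_positions", "stored_metadata"]))
# ['KeyfileIncIndex', 'IndriIndex']
print(candidates(["fields", "wildcards"]))
# ['IndriIndex']
print(estimated_index_bytes(1_000_000, "KeyfileIncIndex"))
# 1200000
```

The space figures are approximations, so the estimate is best treated as a planning number rather than a guarantee.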
The Lemur Project
Last modified: June 21, 2007. 09:14:12 am



