
 
CMU - Language Technologies Institute
Carnegie Mellon University
CIIR, University of Massachusetts Amherst
University of Massachusetts
 

The Lemur Project is sponsored by the Advanced Research and Development Activity in Information Technology (ARDA) under its Statistical Language Modeling for Information Retrieval Research Program and by the National Science Foundation.


Note: These tutorials are out of date. Please see the Lemur Wiki instead.


Lemur Project Tutorials:
Starting Out

Indexing: Lemur Index Types


Contents

  1. Overview
  2. Feature List

Overview

As of version 4.4, the Lemur Toolkit supports two index types:

| Index Type | Description |
| --- | --- |
| KeyfileIncIndex | An index that stores the indexed terms in an inverted list, along with the positions of the terms within the documents. A document manager can be associated with the index to store properties (metadata) about each document. The KeyfileIncIndex also supports incremental building, so new documents can be added to an existing index. |
| IndriIndex | The Indri Index is the most complex and durable indexing format that the Lemur Toolkit supports. It stores the terms in an inverted list, along with the positions of the terms within each document. It can also store metadata and field properties for each document, and it keeps a compressed version of the original document for quick, cached retrieval. The Indri index inherently supports adding documents to the index on the fly. |

Feature List

Trying to decide which index type to use for your application? The tables below describe and compare the features available in Keyfile indexes and Indri indexes:

Storage

| Feature | Keyfile Index | Indri Index |
| --- | --- | --- |
| Stored Term Positions | Yes | Yes |
| Stored Document Representation | Only stores the term vector as term IDs. | Keeps an internal copy of the source document |
| Space Usage | Approx 1.2x the corpus size | Approx 2x the corpus size |
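
To make the space figures concrete, here is a minimal sketch that turns the approximate multipliers from the Storage table into a disk-space estimate. The function name and the dictionary keys are hypothetical, chosen for illustration only, and the multipliers are the rough values quoted above, so actual index sizes will vary with the corpus.

```python
# Rough index-size multipliers taken from the Storage table above.
# These are approximations from the tutorial, not guarantees.
SPACE_MULTIPLIER = {
    "keyfile": 1.2,
    "indri": 2.0,
}


def estimate_index_size_gb(corpus_size_gb: float, index_type: str) -> float:
    """Return a rough index size in GB for a corpus of the given size."""
    if corpus_size_gb < 0:
        raise ValueError("corpus size must be non-negative")
    try:
        multiplier = SPACE_MULTIPLIER[index_type.lower()]
    except KeyError:
        raise ValueError(f"unknown index type: {index_type!r}") from None
    return corpus_size_gb * multiplier


if __name__ == "__main__":
    for kind in ("keyfile", "indri"):
        size = estimate_index_size_gb(10.0, kind)
        print(f"{kind}: roughly {size:.1f} GB for a 10 GB corpus")
```

By these figures, the Indri index takes roughly 1.7 times the disk space of a Keyfile index for the same corpus. The Storage table attributes this to the Indri index keeping its own copy of each source document.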
Indexing

| Feature | Keyfile Index | Indri Index |
| --- | --- | --- |
| Indexable Document Formats | TREC Text; TREC Web; HTML | TREC Text; TREC Web; HTML; Plain Text; XML; PDF; MBox; Microsoft Word and PowerPoint (Windows-only) |
| Stored Metadata | Yes | Yes |
| Fields / Annotations support | No | Yes. Can be either in-line or in the form of offset annotations |
| Document Priors | No | Yes |
| Incremental Indexing | Yes, but offline only | Yes, allows for incremental indexing while the index is in use (online). |
Retrieval

| Feature | Keyfile Index | Indri Index |
| --- | --- | --- |
| Query Language | Uses an implementation of the InQuery Query Language | Uses the Indri Query Language |
| Wildcard Support | No | Yes (suffix-based wildcards only) |
| INEX Task Support (NEXI query language support) | No | Yes |
| XPath-like support for structured document queries | No | Yes |
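
As a rough illustration of the Indri-only retrieval features, the sketch below assembles query strings in what is, to the best of our understanding, Indri Query Language syntax: `#combine(...)` for a bag of terms, `#N(...)` for an ordered window, and a trailing `*` for a suffix-based wildcard. The helper function names are hypothetical and are not part of the Lemur or Indri APIs. Verify the operator syntax against the Indri Query Language documentation before relying on it.

```python
def combine(*parts: str) -> str:
    """Wrap terms or sub-expressions in an Indri-style #combine operator."""
    return "#combine(" + " ".join(parts) + ")"


def ordered_window(width: int, *terms: str) -> str:
    """Build an ordered-window expression; width 1 requests an exact phrase."""
    if width < 1:
        raise ValueError("window width must be at least 1")
    return f"#{width}(" + " ".join(terms) + ")"


def suffix_wildcard(prefix: str) -> str:
    """Append a trailing '*' so the term matches any word with this prefix."""
    if not prefix:
        raise ValueError("a wildcard needs a non-empty prefix")
    return prefix + "*"


if __name__ == "__main__":
    query = combine(
        ordered_window(1, "language", "model"),
        suffix_wildcard("retriev"),
    )
    print(query)  # prints: #combine(#1(language model) retriev*)
```

According to the table above, a query like this depends on wildcard support, which only the Indri index provides.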
Applications

| Feature | Keyfile Index | Indri Index |
| --- | --- | --- |
| Index Building | BuildIndex | IndriBuildIndex or BuildIndex |
| Batch Retrieval | RetEval | IndriRunQuery or RetEval |
| Gathering Anchor Text | n/a | harvestlinks |
| Adding Priors | n/a | makeprior |
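
The applications above are driven by parameter files. As a hedged sketch, the snippet below generates a small XML parameter file of the kind IndriBuildIndex is generally understood to accept, with `index`, `memory`, and `corpus` (`path`, `class`) elements. The function name, the paths, and the default values are illustrative assumptions. Check the element names and accepted values against the IndriBuildIndex documentation for your version.

```python
import xml.etree.ElementTree as ET


def build_indri_parameters(index_path: str,
                           corpus_path: str,
                           corpus_class: str = "trectext",
                           memory: str = "512m") -> str:
    """Return an XML parameter document as a string.

    The element names follow the commonly documented IndriBuildIndex
    layout; treat them as assumptions until checked for your version.
    """
    params = ET.Element("parameters")
    ET.SubElement(params, "index").text = index_path
    ET.SubElement(params, "memory").text = memory
    corpus = ET.SubElement(params, "corpus")
    ET.SubElement(corpus, "path").text = corpus_path
    ET.SubElement(corpus, "class").text = corpus_class
    return ET.tostring(params, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    xml_text = build_indri_parameters("/data/indexes/demo", "/data/corpora/demo")
    with open("build_params.xml", "w", encoding="utf-8") as handle:
        handle.write(xml_text)
    print(xml_text)
```

The resulting file would then be passed to the indexer on the command line, for example `IndriBuildIndex build_params.xml`.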

 


Previous: Stopword Lists and Stemmers Back to TOC  

 


The Lemur Project
Last modified: June 21, 2007. 09:14:12 am