Building Indexes

From LemurWiki


This page gives a quick overview of the indexing process and the differences between index types. For more detailed information on both building indexes and using annotations, see:



Index Types

The Lemur Toolkit comes with two types of indexes that can be built using the out-of-the-box tools: the KeyFile and the Indri index types.

Both index types index term positions and metadata, and both allow incremental updating. There are, however, some significant differences between the two:

  • In addition to metadata, Indri indexes can store field and annotation data, both of which can be searched. See the section Fields and Metadata for more information.
  • Although both indexes can be updated incrementally, a KeyFile index requires exclusive access during the update (i.e. the index must be offline), whereas an Indri index can be added to at any time while it is in use.
  • Both index types allow retrieval using the InQuery Query Language, but an Indri index also allows the use of the Indri Query Language.
  • The Indri indexer can index more document types (see the "Document Preparation" section below).
  • Finally, there are some structural differences between the indexes that are built. See the "Time and Space Requirements" section below and the Indri Repository Structure section for details.

Document Preparation

The Lemur Toolkit can handle several document formats without any modification. Both KeyFile and Indri indexes can handle TREC Text, TREC Web formatted text, and HTML text. In addition, Indri indexes can handle XML, PDF, plain text, MBox (Unix mailboxes) and, on Windows machines with MS Office installed, Microsoft Word and PowerPoint.

If your documents are not already in one of those formats, see the section Indexer File Formats for ways of wrapping them into TREC Text or TREC Web formatted documents.
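As a rough illustration of what such wrapping might look like, the sketch below puts plain-text documents into the commonly used TREC Text layout (a DOC element containing DOCNO and TEXT). The function names and the check for embedded closing tags are assumptions made for this example, not part of the toolkit. The Indexer File Formats section remains the reference for what the parsers actually accept.

```python
def to_trec_text(doc_id: str, body: str) -> str:
    """Wrap one plain-text document in a minimal TREC Text record.

    Hypothetical helper: the tag layout follows the usual TREC Text
    convention (<DOC>, <DOCNO>, <TEXT>).
    """
    # A body containing a closing tag would end the record early,
    # so this sketch rejects it rather than guessing how to escape it.
    for tag in ("</DOC>", "</TEXT>"):
        if tag in body:
            raise ValueError(f"document {doc_id!r} contains {tag}")
    return (
        "<DOC>\n"
        f"<DOCNO>{doc_id}</DOCNO>\n"
        "<TEXT>\n"
        f"{body.strip()}\n"
        "</TEXT>\n"
        "</DOC>\n"
    )


def wrap_collection(docs: dict) -> str:
    """Concatenate the records for a mapping of document id -> text."""
    return "".join(to_trec_text(doc_id, text) for doc_id, text in docs.items())
```

Writing the result of wrap_collection to a single file also keeps the number of source files small, which matters for indexing speed.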

Alternatively, as a much more advanced option, you can write your own parser to index your files online (see Creating your own Parser).

Indexing Parameters

Basic parameters for building an index include where to find your data files, where to place the index, how much memory to use while indexing, stopword lists, stemming, fields, and various other settings.

The parameters for BuildIndex or IndriBuildIndex can be passed in via a parameter file (or multiple parameter files), or they can be specified on the command line. See the section Build Index and IndriBuildIndex for full details of available indexing parameters.
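As an illustration, the sketch below generates an XML parameter file of the kind IndriBuildIndex reads. The element names used here (parameters, index, memory, corpus/path, corpus/class, stemmer/name, stopper/word) are given from memory of the Indri documentation and should be checked against the Build Index and IndriBuildIndex section. The function name, paths, and default values are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_parameters(index_path, corpus_path, corpus_class,
                     memory="512m", stemmer=None, stopwords=()):
    """Return an IndriBuildIndex-style parameter file as an XML string.

    Hypothetical helper; element names are assumptions to verify
    against the IndriBuildIndex documentation.
    """
    root = ET.Element("parameters")
    ET.SubElement(root, "index").text = index_path      # where the index goes
    ET.SubElement(root, "memory").text = memory         # memory to use while indexing

    corpus = ET.SubElement(root, "corpus")
    ET.SubElement(corpus, "path").text = corpus_path    # where the data files are
    ET.SubElement(corpus, "class").text = corpus_class  # document format, e.g. trectext

    if stemmer:
        stem = ET.SubElement(root, "stemmer")
        ET.SubElement(stem, "name").text = stemmer

    if stopwords:
        stopper = ET.SubElement(root, "stopper")
        for word in stopwords:
            ET.SubElement(stopper, "word").text = word

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    xml_text = build_parameters("/tmp/my_index", "/data/corpus", "trectext",
                                stemmer="krovetz", stopwords=["a", "the"])
    with open("build_params.xml", "w", encoding="utf-8") as handle:
        handle.write(xml_text)
```

The resulting file would then be given to the indexer as its parameter file. Generating it from a script is only a convenience; writing the XML by hand works equally well.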

Time and Space Requirements

The time to build an index depends on the computer used to build it, but on an average machine (Windows XP Professional, Pentium 4, 2.6 GHz, 1 GB RAM) it takes approximately 1 hour to index approximately 10 GB of text. Newer and faster machines will, of course, perform better.

One of the most common causes of slow indexing is the opening and closing of many source files. If you have a lot of small data files in TREC Text or TREC Web format, consider concatenating them into a few larger data files. This has been shown to reduce indexing time quite effectively.
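A minimal sketch of such concatenation follows. It assumes that every input file holds only complete documents, so that files can be joined byte for byte. The function name, the "*.trectext" pattern, the bundle naming, and the 256 MB default are all choices made for this example.

```python
from pathlib import Path


def concatenate_files(source_dir, output_dir,
                      max_bytes=256 * 1024 * 1024, pattern="*.trectext"):
    """Join many small files into a few bundles of roughly max_bytes each.

    Returns the list of bundle paths that were written.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    written = []
    current = None       # open handle of the bundle being filled
    current_size = 0     # bytes written to that bundle so far

    # sorted() lists the inputs before any bundle is created.
    for src in sorted(Path(source_dir).glob(pattern)):
        data = src.read_bytes()
        # Start a new bundle when this file would push a non-empty one
        # past the limit. A single oversized file still gets its own bundle.
        if current is None or (current_size > 0
                               and current_size + len(data) > max_bytes):
            if current is not None:
                current.close()
            bundle_path = out / f"bundle_{len(written):04d}.trectext"
            current = bundle_path.open("wb")
            written.append(bundle_path)
            current_size = 0
        current.write(data)
        current_size += len(data)
        # Keep a newline between files so adjacent records do not share a line.
        if not data.endswith(b"\n"):
            current.write(b"\n")
            current_size += 1

    if current is not None:
        current.close()
    return written
```

Documents are never split across bundles, because each source file is written whole.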

In terms of space, a KeyFile index can be expected to take up about 1/2 the size of the source data, whereas an Indri index with no annotations can be expected to take up roughly the same size as the original data (because it keeps compressed copies of the original source text). With annotations, expect the size of an Indri index to increase in proportion to the number of annotations.
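The rules of thumb above can be turned into a back-of-the-envelope estimator, as in the sketch below. The ratios (0.5 for KeyFile, 1.0 for Indri without annotations) and the rate of 10 GB per hour come from the figures quoted in this section. The function names are made up for this example, and the annotation overhead is deliberately left out because no figure is given for it.

```python
# Approximate index size as a fraction of the source data size.
SIZE_RATIO = {
    "keyfile": 0.5,  # about half the size of the source data
    "indri": 1.0,    # roughly the same size, with no annotations
}


def estimate_index_size_gb(source_gb: float, index_type: str) -> float:
    """Rough index size in GB for the given amount of source text."""
    try:
        ratio = SIZE_RATIO[index_type.lower()]
    except KeyError:
        raise ValueError(f"unknown index type: {index_type!r}") from None
    return source_gb * ratio


def estimate_build_hours(source_gb: float, gb_per_hour: float = 10.0) -> float:
    """Rough build time; the default rate is the reference machine's."""
    if gb_per_hour <= 0:
        raise ValueError("gb_per_hour must be positive")
    return source_gb / gb_per_hour


if __name__ == "__main__":
    for kind in ("keyfile", "indri"):
        print(kind, estimate_index_size_gb(40, kind), "GB,",
              estimate_build_hours(40), "hours")
```

These numbers are order-of-magnitude guides only. Actual results depend on the hardware, the document formats, and the number of source files.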
