
 
CMU - Language Technologies Institute
Carnegie Mellon University
CIIR, University of Massachusetts Amherst
University of Massachusetts
 

The Lemur Project is sponsored by the Advanced Research and Development Activity in Information Technology (ARDA) under its Statistical Language Modeling for Information Retrieval Research Program and by the National Science Foundation.


Note: These tutorials are out of date; please see the Lemur Wiki instead.


Lemur Project Tutorials:
Starting Out

Indexing: Creating a Simple Index


The three document formats most commonly used with Lemur are the standard TREC text format, the TREC web format, and HTML.

TREC Text Documents

The most common document type is a plain-text document in TREC format. A document in TREC format must be enclosed in a <DOC> tagset. At a minimum, it must also include a <DOCNO> tagset enclosing the document ID and a <TEXT> tagset enclosing the text to be indexed. As an example:

	<DOC>
	<DOCNO>document_id</DOCNO>
	<TEXT>
		Index this document text.
	</TEXT>
	</DOC>
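
The markup above can also be generated programmatically. The following is a minimal sketch in Python, not part of the Lemur distribution; the function names `to_trec_text` and `write_trec_file` are hypothetical, and the sketch assumes the text contains no stray <DOC> or <TEXT> tags of its own.

```python
def to_trec_text(doc_id, text):
    """Wrap one document in the minimal TREC text markup:
    a <DOC> tagset containing a <DOCNO> and a <TEXT> tagset."""
    lines = [
        "<DOC>",
        f"<DOCNO>{doc_id}</DOCNO>",
        "<TEXT>",
        text,
        "</TEXT>",
        "</DOC>",
    ]
    return "\n".join(lines) + "\n"


def write_trec_file(path, docs):
    """Write (doc_id, text) pairs to one file, one <DOC> block per pair.
    (Hypothetical helper; the file name and encoding are up to you.)"""
    with open(path, "w", encoding="utf-8") as out:
        for doc_id, text in docs:
            out.write(to_trec_text(doc_id, text))


if __name__ == "__main__":
    print(to_trec_text("document_id", "Index this document text."), end="")
```

Run as a script, this prints a document equivalent to the example above, without the indentation.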
	

TREC Web format

The "trecweb" format is similar to TREC text, with the caveat that the main body contains HTML-formatted text. It may also contain an optional <DOCHDR> tagset at the beginning that holds the header information from the HTTP request. This information may include the original URL, the date the page text was gathered, the server it was gathered from, and various other details.

Also note that if you are using the IndriBuildIndex application with TRECWeb text, the URL of the page gathered from the header information will automatically be pushed into a field named "url" in the document.
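
As a rough illustration of the layout described above, the Python sketch below assembles a trecweb-style document. The function name `to_trecweb` is hypothetical, and two details are assumptions rather than statements from this tutorial: that the URL is the first line inside <DOCHDR>, and that the HTML body follows the header directly. Check both against your own collection before relying on them.

```python
def to_trecweb(doc_id, url, http_headers, html):
    """Build one trecweb-style document.

    Assumed layout (not confirmed by the tutorial): the URL is the first
    line inside <DOCHDR>, followed by the remaining HTTP header lines,
    and the HTML body comes right after </DOCHDR>.
    """
    lines = ["<DOC>", f"<DOCNO>{doc_id}</DOCNO>", "<DOCHDR>", url]
    lines.extend(http_headers)          # e.g. "Date: ...", "Server: ..."
    lines.extend(["</DOCHDR>", html, "</DOC>"])
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    doc = to_trecweb(
        "document_id",
        "http://www.example.com/index.html",   # placeholder URL
        ["Server: ExampleServer"],              # placeholder header line
        "<html><body>Index this page text.</body></html>",
    )
    print(doc, end="")
```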

HTML Format

HTML documents are standard, well-formed HTML pages. They need no pre-processing before indexing and are assumed to contain only one document per file.

Other document types

Currently, the Indri index builder knows of the following other document types:

xml   XML formatted data, one document per file (same as html, but without link processing)
mbox   Unix mailbox files
doc   Microsoft Word documents (Windows only, requires Microsoft Office)
ppt   Microsoft PowerPoint documents (Windows only, requires Microsoft Office)
pdf   Adobe PDF
txt   Text documents
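
When a collection mixes several of these types, it can help to pick the document type name from the file name. The Python sketch below does this with a simple lookup. The type names come from the list above, but pairing each one with a file extension is an illustrative assumption, and `guess_document_class` is a hypothetical helper rather than anything provided by Indri.

```python
import os

# Document type names as listed above, keyed by a typical file extension.
# The extension-to-type pairing is an assumption made for illustration.
INDRI_CLASS_BY_EXTENSION = {
    ".html": "html",
    ".htm": "html",
    ".xml": "xml",
    ".mbox": "mbox",
    ".doc": "doc",
    ".ppt": "ppt",
    ".pdf": "pdf",
    ".txt": "txt",
}


def guess_document_class(path, default=None):
    """Return the document type name for a file, based on its extension.
    Returns `default` when the extension is not in the table."""
    extension = os.path.splitext(path)[1].lower()
    return INDRI_CLASS_BY_EXTENSION.get(extension, default)


if __name__ == "__main__":
    for name in ["notes.txt", "slides.PPT", "archive.mbox", "data.bin"]:
        print(name, guess_document_class(name))
```

TREC text and TREC web files are left out of the table on purpose, since their file names do not follow a fixed extension and the type has to be stated explicitly.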

 



 


The Lemur Project
Last modified: June 21, 2007. 09:14:12 am