
 
CMU - Language Technologies Institute
Carnegie Mellon University
CIIR, University of Massachusetts Amherst
University of Massachusetts
 

The Lemur Project is sponsored by the Advanced Research and Development Activity in Information Technology (ARDA) under its Statistical Language Modeling for Information Retrieval Research Program and by the National Science Foundation.


Note: These tutorials are out of date; please see the Lemur Wiki instead.


Lemur Project Tutorials:
Starting Out

Offset Annotations: Overview


In-line Annotations

Most people are familiar with in-line annotations, also called in-line field definitions. These are found in tagged text such as HTML or XML. For example, consider the following HTML snippet:

<h1>The Lemur Toolkit</h1>
<h2>for Language Modeling and Information Retrieval</h2>
<p>Language modeling has recently emerged as an attractive new
framework for text information retrieval, leveraging work on language
modeling from other areas such as speech recognition and statistical
natural language processing.</p>
  
When this document is parsed and readied to be indexed, the text enclosed by the HTML markup tags is indexed, but we can also index the markup tags themselves as fields. For instance, in HTML, the <h1> tag is typically used for title text. If we tell the indexer to mark where any <h1> fields occur, then we can perform queries based on that field.
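To make the idea of marking where fields exist more concrete, here is a minimal sketch in Python. It is not how the Lemur or Indri parsers are implemented; it only illustrates the general technique of stripping the tags while remembering which stretch of the remaining text each tag covered. The class name `FieldCollector` and its attributes are hypothetical names chosen for this illustration, and extents are counted in characters for simplicity.

```python
from html.parser import HTMLParser


class FieldCollector(HTMLParser):
    """Collect the plain text of a document plus (tag, start, end)
    character extents for a chosen set of tags (hypothetical helper)."""

    def __init__(self, field_tags):
        super().__init__()
        self.field_tags = set(field_tags)
        self.text_parts = []   # pieces of text seen so far, tags removed
        self.length = 0        # number of characters of text seen so far
        self.open_fields = []  # stack of (tag, start) for tags not yet closed
        self.fields = []       # completed (tag, start, end) extents

    def handle_starttag(self, tag, attrs):
        if tag in self.field_tags:
            self.open_fields.append((tag, self.length))

    def handle_endtag(self, tag):
        # Close the field only if it matches the most recently opened one.
        if self.open_fields and self.open_fields[-1][0] == tag:
            name, start = self.open_fields.pop()
            self.fields.append((name, start, self.length))

    def handle_data(self, data):
        self.text_parts.append(data)
        self.length += len(data)

    @property
    def text(self):
        return "".join(self.text_parts)


html = (
    "<h1>The Lemur Toolkit</h1>\n"
    "<h2>for Language Modeling and Information Retrieval</h2>\n"
)

collector = FieldCollector(["h1", "h2"])
collector.feed(html)
collector.close()

# Show which text each recorded field covers.
for name, start, end in collector.fields:
    print(name, start, end, repr(collector.text[start:end]))
```

With extents like these recorded at indexing time, a query can be limited to the text that fell inside a given field, such as the <h1> title.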

Offset Annotations

In-line annotations are definitely useful, but what if you have a document with no markup in it and wish to run it through, say, a named entity tagger or a part-of-speech tagger? In this case, you can certainly have the tagger edit and mark up the original document, and then index the result. Alternatively, you can use an "offset annotation" file to tell the indexer where the fields would exist if they were in the source text.

An offset annotation file contains annotations to be used with a document that are not in-lined with the document itself. That is to say, the original document does not have to be modified; rather, you create an "offset annotation" file that tells the indexer which tag and attribute annotations to add to the document while indexing.
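The following Python sketch illustrates the concept only. It does not reproduce the actual offset annotation file format, and it is not part of the Lemur or Indri API. The `OffsetAnnotation` type, the `tag_phrase` stand-in for a named entity tagger, and the field names are all hypothetical, and offsets are counted in characters as a simplifying assumption (a real indexer may count differently). The point is that the annotations live apart from the document, which is never edited.

```python
from typing import NamedTuple


class OffsetAnnotation(NamedTuple):
    """A stand-off annotation (hypothetical type for illustration)."""
    name: str    # field name, e.g. "organization"
    start: int   # offset into the unmodified text (characters, by assumption)
    length: int  # number of characters covered


def tag_phrase(text, phrase, name):
    """Stand-in for a real tagger: annotate the first occurrence of phrase."""
    start = text.index(phrase)
    return OffsetAnnotation(name, start, len(phrase))


def annotated_spans(text, annotations):
    """Return (name, covered_text) pairs without altering the source text."""
    spans = []
    for ann in annotations:
        end = ann.start + ann.length
        if ann.start < 0 or end > len(text):
            raise ValueError(f"annotation {ann} falls outside the document")
        spans.append((ann.name, text[ann.start:end]))
    return spans


# The source document contains no markup at all.
document = (
    "The Lemur Project involves Carnegie Mellon University "
    "and the University of Massachusetts."
)

# Annotations are kept separately from the document.
annotations = [
    tag_phrase(document, "Carnegie Mellon University", "organization"),
    tag_phrase(document, "University of Massachusetts", "organization"),
]

for name, covered in annotated_spans(document, annotations):
    print(name, repr(covered))
```

Because each annotation refers to the text only by position, the tagger's output can be regenerated or replaced without touching the source document.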

Note: Currently, storing and indexing annotations and fields only works with Indri-style indexes. For a more detailed comparison of Indri indexes versus KeyFile indexes, please see this page.

 


  [Back to TOC] [Next: Preparing Text for Offset Annotations]

 


The Lemur Project
Last modified: June 21, 2007. 09:14:12 am