CMU - Language Technologies Institute
Carnegie Mellon University
CIIR, University of Massachusetts Amherst
University of Massachusetts
 

The Lemur Project is sponsored by the Advanced Research and Development Activity in Information Technology (ARDA) under its Statistical Language Modeling for Information Retrieval Research Program and by the National Science Foundation.


Note: These tutorials are out of date; please see the Lemur Wiki instead.


Lemur Project Tutorials:
Starting Out

Offset Annotations: Preparing Text for Offset Annotations


Creating an offset annotation file consists of two parts. First, the appropriate annotation tags must be found within the source text. Second, a process must align those annotation tags with the byte offsets of the original text.

Note: the following code on this page and the next has not been thoroughly tested. It is only intended to give the reader an example of how processing offset annotations might happen. Use at your own risk.

For this example scenario, we will be using the Monty Tagger, a simple part-of-speech tagger.

A simple Java wrapper around the Monty Tagger might look like:

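import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// JMontyTagger comes from the Monty Tagger distribution, which must be on the
// classpath; add whatever import your copy of the tagger requires.

public class MontyTaggerWrapper {
  public static void main(String[] args) {
    // args[0] is the filename of the plain-text file to tag
    if (args.length < 1) {
      System.err.println("Usage: java MontyTaggerWrapper <file>");
      return;
    }

    // create a new tagger
    JMontyTagger mt = new JMontyTagger();

    // accumulates the text of the whole document
    StringBuilder document = new StringBuilder();

    // try-with-resources closes the reader even if reading fails
    try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
      String inLine;

      // read the file line by line, joining the lines with single spaces
      while ((inLine = in.readLine()) != null) {
        document.append(inLine).append(' ');
      }

      // tag it
      String taggedDocument = mt.Tag(document.toString());

      // print the tagged text to stdout
      System.out.println(taggedDocument);
    } catch (java.io.FileNotFoundException e) {
      System.err.println("!! Cannot find file: " + args[0]);
    } catch (IOException e) {
      System.err.println("!! I/O Error reading: " + args[0]);
    }
  }
}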
  

Running a plain-text document through this wrapper prints the tagged tokens to stdout. For example, suppose we had a file named testfile.txt that contained the following text:


Lemur is a toolkit designed to facilitate research in language modeling and
information retrieval.
  
The output of running "java MontyTaggerWrapper testfile.txt" may look like:

Lemur/NNP is/VBZ a/DT toolkit/NN designed/VBN to/TO facilitate/VB research/NN in/IN
language/NN modeling/NN and/CC information/NN retrieval/NN ./.
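
To illustrate the second part of the process (aligning tags with byte offsets), here is a minimal sketch that is not part of the original tutorial code. It assumes the source file is UTF-8, that the tagger emits whitespace-separated token/TAG pairs as in the output above, and that each token appears verbatim and in order in the source text. The class and method names (OffsetAlignerSketch, Annotation, align) are hypothetical, and the sketch does not write any particular annotation file format; it only computes the offsets.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class OffsetAlignerSketch {
  /** One aligned token: its tag plus byte offset and byte length in the source. */
  static final class Annotation {
    final String token;
    final String tag;
    final int byteOffset;
    final int byteLength;

    Annotation(String token, String tag, int byteOffset, int byteLength) {
      this.token = token;
      this.tag = tag;
      this.byteOffset = byteOffset;
      this.byteLength = byteLength;
    }
  }

  /** Aligns each token/TAG pair in the tagged text with its position in the source. */
  static List<Annotation> align(String source, String tagged) {
    List<Annotation> result = new ArrayList<>();
    int searchFrom = 0; // character position in source where the next search starts

    for (String pair : tagged.trim().split("\\s+")) {
      // split on the LAST slash so that tokens containing '/' stay intact
      int slash = pair.lastIndexOf('/');
      if (slash <= 0) {
        continue; // not a token/TAG pair
      }
      String token = pair.substring(0, slash);
      String tag = pair.substring(slash + 1);

      int charPos = source.indexOf(token, searchFrom);
      if (charPos < 0) {
        continue; // assumption violated: the tagger altered this token, so skip it
      }

      // convert the character position to a byte offset under UTF-8
      int byteOffset = source.substring(0, charPos).getBytes(StandardCharsets.UTF_8).length;
      int byteLength = token.getBytes(StandardCharsets.UTF_8).length;
      result.add(new Annotation(token, tag, byteOffset, byteLength));

      searchFrom = charPos + token.length();
    }
    return result;
  }

  public static void main(String[] args) {
    String source = "Lemur is a toolkit designed to facilitate research in language modeling and\n"
        + "information retrieval.\n";
    String tagged = "Lemur/NNP is/VBZ a/DT toolkit/NN designed/VBN to/TO facilitate/VB "
        + "research/NN in/IN language/NN modeling/NN and/CC information/NN retrieval/NN ./.";

    // print tag, byte offset and byte length for each aligned token
    for (Annotation a : align(source, tagged)) {
      System.out.println(a.tag + "\t" + a.byteOffset + "\t" + a.byteLength);
    }
  }
}
```

Searching the original text for each token, rather than counting positions in the tagger's input, keeps the offsets tied to the file on disk even though the wrapper joins lines with spaces. The plain substring search is a simplification: if the tagger drops or rewrites a token, a later search can match in the wrong place, so a real implementation would need stricter checks.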
  

If you are working with TREC text, you will want to strip the surrounding TREC markup tags (<DOC>, <DOCNO> and <TEXT>) from the text to be tagged.
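
As a rough sketch of that stripping step (not taken from the tutorial's own code), the following uses regular expressions to drop the <DOCNO> element together with its contents and to remove the bare <DOC> and <TEXT> tags. The class and method names (TrecMarkupStripperSketch, strip) are hypothetical, and the sketch assumes the tags are uppercase and not nested.

```java
import java.util.regex.Pattern;

class TrecMarkupStripperSketch {
  // <DOCNO>...</DOCNO> including its contents; (?s) lets '.' match line breaks
  private static final Pattern DOCNO = Pattern.compile("(?s)<DOCNO>.*?</DOCNO>");

  // opening and closing <DOC> and <TEXT> tags; the text between them is kept
  private static final Pattern TAGS = Pattern.compile("</?(?:DOC|TEXT)>");

  /** Returns the document text with the surrounding TREC markup removed. */
  static String strip(String trecDocument) {
    String withoutDocno = DOCNO.matcher(trecDocument).replaceAll("");
    return TAGS.matcher(withoutDocno).replaceAll("").trim();
  }

  public static void main(String[] args) {
    String doc = "<DOC>\n<DOCNO> doc-1 </DOCNO>\n<TEXT>\n"
        + "Lemur is a toolkit.\n</TEXT>\n</DOC>\n";
    System.out.println(strip(doc));
  }
}
```

Removing markup shifts every position relative to the original file, so offsets computed on the stripped text would have to be mapped back to the TREC file before they are used as annotations.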

 


[Previous: Overview] [Back to TOC] [Next: Indexing a corpus with offset annotations]

 


The Lemur Project
Last modified: June 21, 2007. 09:14:12 am