Note: These tutorials are out of date, please see the Lemur Wiki instead.
Lemur Project Tutorials:
Starting Out
Offset Annotations: Preparing Text for Offset Annotations
Creating an offset annotation file consists of two parts. First, the appropriate annotation tags must be found within the source text. Second, a process must align those annotation tags with the byte offsets of the original text.
Note: the following code on this page and the next has not been thoroughly tested. It is only intended to give the reader an example of how processing offset annotations might happen. Use at your own risk.
For this example scenario, we will be using the Monty Tagger, a simple part-of-speech tagger.
A simple Java wrapper for the Monty Tagger might look like:
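import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;

// JMontyTagger is assumed to be available on the classpath.
public class MontyTaggerWrapper {
    public static void main(String[] args) {
        if (args.length < 1) {
            System.err.println("Usage: java MontyTaggerWrapper <file>");
            return;
        }
        // create a new tagger
        JMontyTagger mt = new JMontyTagger();
        String inLine;
        // accumulates the whole document; starts empty rather than uninitialized
        StringBuilder document = new StringBuilder();
        try {
            // assumes a plain text file
            // args[0] is the filename of the file to tag
            BufferedReader in = new BufferedReader(new FileReader(args[0]));
            // read in the file, appending each line to the overall document
            while ((inLine = in.readLine()) != null) {
                document.append(inLine).append(" ");
            }
            in.close();
            // tag it
            String taggedDocument = mt.Tag(document.toString());
            // print the tagged file to stdout
            System.out.println(taggedDocument);
        } catch (FileNotFoundException e) {
            System.err.println("!! Cannot find file: " + args[0]);
        } catch (IOException e) {
            System.err.println("!! I/O Error reading: " + args[0]);
        }
    }
}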
Running a plain-text document through this wrapper prints the tagged tokens to stdout. For example, suppose we had a file named testfile.txt that contained the following text:
Lemur is a toolkit designed to facilitate research in language modeling and information retrieval.
The output of running "java MontyTaggerWrapper testfile.txt" may look like:
Lemur/NNP is/VBZ a/DT toolkit/NN designed/VBN to/TO facilitate/VB research/NN in/IN language/NN modeling/NN and/CC information/NN retrieval/NN ./.
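The second part of the process, aligning the tags with offsets in the original text, can be sketched by walking through the source and locating each tagged token in order. The following is only a minimal illustration and, like the other code on this page, is not thoroughly tested. The class name TagAligner and its Annotation type are hypothetical and not part of Lemur. It assumes the tagger leaves the token text unchanged and that the source is plain ASCII, so that character offsets equal byte offsets.

```java
import java.util.ArrayList;
import java.util.List;

public class TagAligner {
    /** One aligned annotation: tag name, start offset, and length in the source text. */
    public static final class Annotation {
        public final String tag;
        public final int start;
        public final int length;

        Annotation(String tag, int start, int length) {
            this.tag = tag;
            this.start = start;
            this.length = length;
        }
    }

    /** Locates each token of "token/TAG" output in the source, scanning left to right. */
    public static List<Annotation> align(String source, String tagged) {
        List<Annotation> result = new ArrayList<>();
        int cursor = 0; // search position in the source; only moves forward
        for (String pair : tagged.trim().split("\\s+")) {
            int slash = pair.lastIndexOf('/');
            if (slash <= 0) {
                continue; // not of the form token/TAG
            }
            String token = pair.substring(0, slash);
            String tag = pair.substring(slash + 1);
            int start = source.indexOf(token, cursor);
            if (start < 0) {
                continue; // token not found after the cursor; skip it
            }
            result.add(new Annotation(tag, start, token.length()));
            cursor = start + token.length();
        }
        return result;
    }

    public static void main(String[] args) {
        String source = "Lemur is a toolkit.";
        String tagged = "Lemur/NNP is/VBZ a/DT toolkit/NN ./.";
        // print one line per annotation: tag, start offset, length
        for (Annotation a : align(source, tagged)) {
            System.out.println(a.tag + "\t" + a.start + "\t" + a.length);
        }
    }
}
```

For text that is not plain ASCII, the character offsets produced here would still need to be converted to byte offsets in the file's encoding.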
If you are working with TREC text, you will want to strip the surrounding TREC markup tags (<DOC>, <DOCNO> and <TEXT>) from the text to be tagged.
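One way to strip the TREC markup is to pull out only the body of the <TEXT> element before tagging. The sketch below is an assumption about how that might be done, not a Lemur utility: the class name TrecTextExtractor, its method names, and the sample document ID are hypothetical, and it handles only the first <TEXT> element of a single document. It also reports where the body begins in the original string, since offsets computed on the stripped text would have to be shifted by that amount to refer to the original.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TrecTextExtractor {
    // Matches the body of the first <TEXT>...</TEXT> element; DOTALL lets '.' span newlines.
    private static final Pattern TEXT_BODY =
            Pattern.compile("<TEXT>(.*?)</TEXT>", Pattern.DOTALL);

    /** Returns the text inside <TEXT>...</TEXT>, or the empty string if none is found. */
    public static String extractText(String trecDocument) {
        Matcher m = TEXT_BODY.matcher(trecDocument);
        return m.find() ? m.group(1) : "";
    }

    /** Returns the character offset where the <TEXT> body starts in the original, or -1. */
    public static int textOffset(String trecDocument) {
        Matcher m = TEXT_BODY.matcher(trecDocument);
        return m.find() ? m.start(1) : -1;
    }

    public static void main(String[] args) {
        // hypothetical sample document
        String doc = "<DOC>\n<DOCNO>doc-1</DOCNO>\n<TEXT>\nLemur is a toolkit.\n</TEXT>\n</DOC>";
        System.out.println(extractText(doc).trim());
        System.out.println(textOffset(doc));
    }
}
```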
The Lemur Project
Last modified: June 21, 2007. 09:14:12 am




