Note: These tutorials are out of date; please see the Lemur Wiki instead.
Lemur Project Tutorials:
Starting Out
Offset Annotations: Creating an offset annotation file
Once the data has been tagged, it needs to be lined up with the original byte offsets of the text and output to a properly formatted offset annotation file.
An offset annotation file consists of 9 tab-delimited columns. The columns are described as follows:
- docno: the external document id of the document in which the annotation occurs.
- type: TAG or ATTRIBUTE.
- id: an id number for the annotation; each line should have a unique id >= 1.
- name: for a TAG, the name or type of the annotation; for an ATTRIBUTE, the attribute name, or key.
- start: start and length together define the annotation's extent. The start should be a byte offset relative to the start of the document.
- length: for a TAG, the number of bytes the annotation spans; meaningless for an ATTRIBUTE.
- value: for a TAG, an optional INT64 (for numeric values); for an ATTRIBUTE, a string that is the attribute's value.
- parentid: for a TAG, the id number of another TAG to be considered the parent of this one, which is how hierarchical annotations can be expressed; a TAG that has no parent has parentid = 0. For an ATTRIBUTE, the id number of the TAG to which it belongs and from which it inherits its start and length. NOTE: the file must be sorted so that any line that uses a given id in this column comes *after* the line that uses that id in the id column.
- debug: ignored by the OffsetAnnotator; can contain any information that is helpful to a human reading the file.
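As a small illustration of the column layout described above, the following sketch builds one TAG line and one ATTRIBUTE line. The helper name format_annotation, the attribute name "lemma", and the 0 placeholders in the ATTRIBUTE's start and length columns are assumptions made for this example; they are not part of the Lemur toolkit.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lemur): joins the nine columns with tabs,
# in the order docno, type, id, name, start, length, value, parentid, debug.
sub format_annotation {
    my (%f) = @_;
    my @order = qw(docno type id name start length value parentid debug);
    return join("\t", map { $f{$_} } @order) . "\n";
}

# A TAG with no parent (parentid = 0) that starts at byte 0 and spans 5 bytes.
my $tagLine = format_annotation(
    docno => "01", type => "TAG", id => 1, name => "NNP",
    start => 0, length => 5, value => 1, parentid => 0, debug => "Lemur",
);

# An ATTRIBUTE that belongs to TAG 1. It inherits its extent from that TAG,
# so the 0s in start and length are placeholders (an assumption of this sketch).
# The name "lemma" and value "lemur" are made up for illustration.
my $attrLine = format_annotation(
    docno => "01", type => "ATTRIBUTE", id => 2, name => "lemma",
    start => 0, length => 0, value => "lemur", parentid => 1, debug => "",
);

# The ATTRIBUTE line comes after the TAG line whose id it uses as parentid.
print $tagLine, $attrLine;
```

Printing the TAG before the ATTRIBUTE keeps the file in the required order, since the ATTRIBUTE's parentid refers to an id that has already appeared.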
To align the above text and part-of-speech tagged text and create an offset annotation file, we can use a Perl script much like the following:
#!/usr/bin/perl
# usage: lineupAnnotations.pl <docno> <orig_text> <pos_tagged_text>
use strict;
use warnings;

die("usage: lineupAnnotations.pl <docno> <orig_text> <pos_tagged_text>\n") unless @ARGV == 3;
my ($thisDocNo, $inputFilename, $posTagFilename) = @ARGV;

# read the POS tagged text into a single string
my $posTaggedText = "";
open(POSIN, $posTagFilename) || die("Cannot open POS tagged file.\n");
while (<POSIN>) {
    $posTaggedText .= $_;
}
close(POSIN);

# read the original text into a single string
my $origDocText = "";
open(ORIGDOC, $inputFilename) || die("Cannot open original document.\n");
while (<ORIGDOC>) {
    $origDocText .= $_;
}
close(ORIGDOC);

# split the tagged text on runs of whitespace (no empty tokens are produced)
my @taggedTokens = split(' ', $posTaggedText);

# byte offset in the original text at which to look for the next token
my $currentOffset = 0;

# loop through the tokens and print out our annotations
# to stdout
my $currentTagID = 1;
foreach my $taggedToken (@taggedTokens) {
    # split the token at the last / into the word and its tag
    my ($thisToken, $thisTag) = $taggedToken =~ m/^(.*)\/(.*)$/;
    next unless defined($thisTag);

    # find the next occurrence of this token in the original text
    my $tagStart = index($origDocText, $thisToken, $currentOffset);
    if ($tagStart < 0) {
        warn("Cannot find token \"$thisToken\" in original document; skipping.\n");
        next;
    }

    # get the length of this token
    my $tagLen = length($thisToken);

    # now, print the offset annotation information
    print "$thisDocNo\tTAG\t$currentTagID\t$thisTag\t$tagStart\t$tagLen\t$currentTagID\t0\t$thisToken\n";

    # continue the search after this token, and give the next line a new id
    $currentOffset = $tagStart + $tagLen;
    $currentTagID++;
}
If the above Perl script were called with a DOCNO of "01", the original text, and the text output from the tagger, it would produce the following:
01 TAG 1 NNP 0 5 1 0 Lemur
01 TAG 2 VBZ 7 2 2 0 is
01 TAG 3 DT 10 1 3 0 a
01 TAG 4 NN 12 7 4 0 toolkit
01 TAG 5 VBN 20 8 5 0 designed
01 TAG 6 TO 29 2 6 0 to
01 TAG 7 VB 32 10 7 0 facilitate
01 TAG 8 NN 43 8 8 0 research
01 TAG 9 IN 52 2 9 0 in
01 TAG 10 NN 55 8 10 0 language
01 TAG 11 NN 64 8 11 0 modeling
01 TAG 12 CC 77 3 12 0 and
01 TAG 13 NN 81 11 13 0 information
01 TAG 14 NN 93 9 14 0 retrieval
The first line of the data above represents an offset annotation with the following attributes:
- The document ID is "01"
- It is a TAG (as opposed to an ATTRIBUTE)
- The annotation ID is "1"
- The annotation name is "NNP"
- The starting byte offset of this annotation is at 0 (from the beginning of the document)
- The length of the annotation is 5 bytes
- The tag's value is set to 1 (this is optional and can be arbitrary, but this could be used for such things as searching for tags within a certain numeric range)
- The annotation's parent ID is 0 (meaning that it has no parent)
- And finally, for the optional debug field, we have chosen to include the word or phrase that this annotation corresponds to.
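The same breakdown can be done programmatically. The sketch below splits one tab-delimited line into named fields; the helper name parse_annotation is hypothetical, and the input line is the first record from the output above, written with explicit tabs.

```perl
use strict;
use warnings;

# Column names in file order, as listed in the column description.
my @COLUMNS = qw(docno type id name start length value parentid debug);

# Hypothetical helper (not part of Lemur): splits one tab-delimited line
# into a hash reference keyed by column name.
sub parse_annotation {
    my ($line) = @_;
    chomp($line);
    # The -1 limit keeps a trailing empty debug field instead of dropping it.
    my @values = split(/\t/, $line, -1);
    die("Expected 9 columns, found " . scalar(@values) . "\n")
        unless @values == @COLUMNS;
    my %fields;
    @fields{@COLUMNS} = @values;
    return \%fields;
}

my $first = parse_annotation("01\tTAG\t1\tNNP\t0\t5\t1\t0\tLemur\n");

# The last byte covered by the annotation is start + length - 1.
my $lastByte = $first->{start} + $first->{length} - 1;
print "$first->{type} $first->{name} in document $first->{docno} ",
      "covers bytes $first->{start} through $lastByte\n";
```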
You can then use a similar approach to process your whole corpus, continuously appending the stdout output to your final offset annotations file.
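Before indexing, it may be worth checking the combined file against the column rules described earlier. The following is a minimal sketch of such a check, not a Lemur tool: the name check_annotations is hypothetical, and it assumes that ids only need to be unique within a single docno, which the tutorial does not state explicitly.

```perl
use strict;
use warnings;

# Hypothetical checker (not part of Lemur): returns a list of problems found
# in the given annotation lines.
# Assumption: ids only need to be unique within one docno.
sub check_annotations {
    my (@lines) = @_;
    my (%seen, @problems);
    my $lineNo = 0;
    foreach my $line (@lines) {
        $lineNo++;
        chomp(my $copy = $line);
        my @f = split(/\t/, $copy, -1);
        if (@f != 9) {
            push(@problems, "line $lineNo: expected 9 columns");
            next;
        }
        my ($docno, $type, $id, $parentid) = @f[0, 1, 2, 7];
        push(@problems, "line $lineNo: type must be TAG or ATTRIBUTE")
            unless $type eq "TAG" || $type eq "ATTRIBUTE";
        if ($id !~ /^\d+$/ || $id < 1) {
            push(@problems, "line $lineNo: id must be a number >= 1");
            next;
        }
        push(@problems, "line $lineNo: id $id is not unique")
            if $seen{$docno}{$id};
        if ($parentid !~ /^\d+$/) {
            push(@problems, "line $lineNo: parentid must be a number");
        } elsif ($parentid != 0 && !$seen{$docno}{$parentid}) {
            # The parent's own line must come earlier in the file.
            push(@problems, "line $lineNo: parentid $parentid has not appeared yet");
        }
        $seen{$docno}{$id} = 1;
    }
    return @problems;
}

my @problems = check_annotations(
    "01\tTAG\t1\tNNP\t0\t5\t1\t0\tLemur\n",
    "01\tTAG\t2\tVBZ\t7\t2\t2\t3\tis\n",    # parentid 3 is used before id 3 exists
);
print "$_\n" foreach @problems;
```

In practice the lines would be read from the annotations file rather than passed in literally; the list form is used here only to keep the example self-contained.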
The Lemur Project
Last modified: June 21, 2007. 09:14:12 am




