CMU - Language Technologies Institute
Carnegie Mellon University
CIIR, University of Massachusetts Amherst
University of Massachusetts
 

The Lemur Project is sponsored by the Advanced Research and Development Activity in Information Technology (ARDA) under its Statistical Language Modeling for Information Retrieval Research Program and by the National Science Foundation.


Note: These tutorials are out of date; please see the Lemur Wiki instead.


Lemur Project Tutorials:
Starting Out

Offset Annotations: Creating an offset annotation file


Once the data has been tagged, it needs to be lined up with the original byte offsets of the text and output to a properly formatted offset annotation file.

An offset annotation file consists of 9 tab-delimited columns. The columns are described below:

docno
external document id corresponding to the document in which the annotation occurs.
type
TAG or ATTRIBUTE
id
an id number for the annotation; each line should have a unique id >= 1.
name
for TAG, the name or type of the annotation; for ATTRIBUTE, the attribute name, or key
start
start and length define the annotation's extent. The values should be byte offsets relative to the start of the document.
length
meaningless for an ATTRIBUTE. For a TAG, it's the number of bytes the annotation spans.
value
for TAG, an optional INT64 (for numeric values); for ATTRIBUTE, a string that is the attribute's value
parentid
for TAG, the id number of another TAG to be considered the parent of this one; this is how hierarchical annotations can be expressed. A TAG that has no parent has parentid = 0. For ATTRIBUTE, the id number of the TAG to which it belongs and from which it inherits its start and length. NOTE: the file must be sorted so that any line that uses a given id in this column comes *after* the line that uses that id in the id column.
debug
ignored by the OffsetAnnotator; can contain any information that is beneficial to a human reading the file
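
The column rules above can be checked mechanically before a file is handed to the indexer. The following is a minimal sketch, not part of the Lemur distribution: the subroutine name check_annotation_lines and the sample lines are hypothetical. It also assumes that ids are scoped to a single docno, which the description above does not state explicitly.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of problems found in the given
# offset annotation lines (an empty list means no problems were found).
# ASSUMPTION: ids only need to be unique within one docno.
sub check_annotation_lines {
  my @lines = @_;
  my (%seen, @problems);   # %seen maps docno -> id -> type
  my $lineNo = 0;
  foreach my $line (@lines) {
    $lineNo++;
    chomp(my $copy = $line);
    # a limit of -1 keeps trailing empty columns (e.g. an empty debug column)
    my @cols = split(/\t/, $copy, -1);
    if (@cols != 9) {
      push @problems, "line $lineNo: expected 9 columns, found " . scalar(@cols);
      next;
    }
    my ($docno, $type, $id, $name, $start, $length, $value, $parentid, $debug) = @cols;
    if ($type ne 'TAG' && $type ne 'ATTRIBUTE') {
      push @problems, "line $lineNo: type must be TAG or ATTRIBUTE";
    }
    if ($id !~ /^\d+$/ || $id < 1) {
      push @problems, "line $lineNo: id must be an integer >= 1";
      next;
    }
    if (exists $seen{$docno}{$id}) {
      push @problems, "line $lineNo: id $id is already used in document $docno";
    }
    if ($parentid !~ /^\d+$/) {
      push @problems, "line $lineNo: parentid must be a non-negative integer";
    } elsif ($parentid == 0) {
      push @problems, "line $lineNo: an ATTRIBUTE needs the id of its TAG as parentid"
        if $type eq 'ATTRIBUTE';
    } elsif (!exists $seen{$docno}{$parentid}) {
      push @problems, "line $lineNo: parentid $parentid is used before the line that defines id $parentid";
    } elsif ($seen{$docno}{$parentid} ne 'TAG') {
      push @problems, "line $lineNo: parentid $parentid does not refer to a TAG";
    }
    $seen{$docno}{$id} = $type;
  }
  return @problems;
}

# illustrative lines: a sentence TAG, a child TAG, and an ATTRIBUTE of the child
my @sample = (
  "01\tTAG\t1\tsentence\t0\t18\t0\t0\tLemur is a toolkit\n",
  "01\tTAG\t2\tNNP\t0\t5\t0\t1\tLemur\n",
  "01\tATTRIBUTE\t3\tlemma\t0\t0\tlemur\t2\tlemma of Lemur\n",
);
my @problems = check_annotation_lines(@sample);
print @problems ? join("\n", @problems) . "\n" : "no problems found\n";
```

The sample also shows how the parentid column expresses hierarchy: the NNP tag points at the sentence tag, and the ATTRIBUTE points at the NNP tag. Swapping the first two lines would trigger the ordering check.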

To align the original text with the part-of-speech tagged text and create an offset annotation file, we can use a Perl script much like the following:


#!/usr/bin/perl
use strict;
use warnings;

# usage: lineupAnnotations.pl <docno> <orig_text> <pos_tagged_text>
die "usage: $0 <docno> <orig_text> <pos_tagged_text>\n" unless @ARGV == 3;
my ($thisDocNo, $inputFilename, $posTagFilename) = @ARGV;

# read the POS-tagged text into a single string
my $posTaggedText = "";
open(POSIN, "<", $posTagFilename) || die("Cannot open POS tagged file.\n");
while (<POSIN>) {
  $posTaggedText .= $_;
}
close(POSIN);

# read the original text into a single string
# (no decoding layer is applied, so positions and lengths are in bytes)
my $origDocText = "";
open(ORIGDOC, "<", $inputFilename) || die("Cannot open original document.\n");
while (<ORIGDOC>) {
  $origDocText .= $_;
}
close(ORIGDOC);

# split the tagged text into token/TAG pairs on runs of whitespace
my @taggedTokens = split(/\s+/, $posTaggedText);

# position in the original text from which to search for the next token
my $currentOffset = 0;

# loop through the tokens and print out our annotations
# to stdout
my $currentTagID = 1;
foreach my $thisPair (@taggedTokens) {
  # split the pair at the last / into the token and its tag
  my ($thisToken, $thisTag) = $thisPair =~ m/^(.*)\/(.*)$/;
  next unless defined($thisTag) && length($thisToken) > 0;

  # find the next occurrence of this token in the original text
  my $tokenStart = index($origDocText, $thisToken, $currentOffset);
  if ($tokenStart < 0) {
    warn "Could not find token '$thisToken' after offset $currentOffset; skipping.\n";
    next;
  }

  # get the length of this token
  my $tagLen = length($thisToken);

  # now, print the offset annotation information
  print "$thisDocNo\tTAG\t$currentTagID\t$thisTag\t$tokenStart\t$tagLen\t$currentTagID\t0\t$thisToken\n";

  # move past this token and advance to the next annotation id
  $currentOffset = $tokenStart + $tagLen;
  $currentTagID++;
}
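
The script above reads both files without a decoding layer, so its positions and lengths are counted in bytes, which is what the start and length columns ask for. If you instead decode the input as UTF-8 (for example, to use Unicode-aware regular expressions), character counts and byte counts diverge for non-ASCII text. The following sketch shows one way to convert a character extent back to a byte extent with the core Encode module; the subroutine name byte_extent and the sample string are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: given decoded (character) text, a character offset
# and a character length, return the matching byte offset and byte length
# in the UTF-8 encoded document.
sub byte_extent {
  my ($text, $charStart, $charLen) = @_;
  # bytes before the token = encoded length of everything preceding it
  my $byteStart = length(encode('UTF-8', substr($text, 0, $charStart)));
  # bytes in the token = encoded length of the token itself
  my $byteLen   = length(encode('UTF-8', substr($text, $charStart, $charLen)));
  return ($byteStart, $byteLen);
}

# "caf\x{e9}" is 4 characters but 5 bytes in UTF-8,
# so every token after it starts one byte later than its character offset
my $text = "caf\x{e9} menu";
my ($start, $len) = byte_extent($text, 5, 4);
print "menu: byte offset $start, byte length $len\n";
```

Here the token "menu" starts at character 5 but at byte 6, so writing the character offset into the annotation file would shift the annotation by one byte.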
  

If the above Perl script were called with a DOCNO of "01", the original text, and the text output from the tagger, it would produce the following output:


01  TAG  1   NNP   0   5   1   0  Lemur
01  TAG  2   VBZ   7   2   2   0  is
01  TAG  3   DT    10  1   3   0  a
01  TAG  4   NN    12  7   4   0  toolkit
01  TAG  5   VBN   20  8   5   0  designed
01  TAG  6   TO    29  2   6   0  to
01  TAG  7   VB    32  10  7   0  facilitate
01  TAG  8   NN    43  8   8   0  research
01  TAG  9   IN    52  2   9   0  in
01  TAG  10  NN    55  8   10  0  language
01  TAG  11  NN    64  8   11  0  modeling
01  TAG  12  CC    77  3   12  0  and
01  TAG  13  NN    81  11  13  0  information
01  TAG  14  NN    93  9   14  0  retrieval
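
A quick way to sanity-check a generated file is to cut each TAG's extent out of the original document and compare it with the debug column, which the alignment script fills with the token text. The sketch below assumes exactly that convention and treats the document as a byte string; the subroutine name mismatched_lines and the sample data are hypothetical and are not the tutorial's own text.

```perl
use strict;
use warnings;

# Hypothetical helper: return the TAG lines whose extent (start, length)
# does not cover the token stored in the debug column.
# ASSUMPTION: the debug column holds the annotated token text.
sub mismatched_lines {
  my ($docText, @lines) = @_;
  my @bad;
  foreach my $line (@lines) {
    chomp(my $copy = $line);
    my ($docno, $type, $id, $name, $start, $length, $value, $parentid, $debug)
      = split(/\t/, $copy, -1);
    # skip malformed lines and ATTRIBUTE lines (they inherit their extent)
    next unless defined($debug) && $type eq 'TAG';
    # guard against extents that run past the end of the document
    my $span = $start + $length <= length($docText)
             ? substr($docText, $start, $length) : '';
    push @bad, $copy if $span ne $debug;
  }
  return @bad;
}

# illustrative document and annotations
my $doc = "Lemur is a toolkit";
my @annotations = (
  "01\tTAG\t1\tNNP\t0\t5\t1\t0\tLemur\n",
  "01\tTAG\t2\tVBZ\t6\t2\t2\t0\tis\n",
  "01\tTAG\t3\tDT\t10\t1\t3\t0\ta\n",   # wrong on purpose: "a" starts at byte 9
);
print "misaligned: $_\n" for mismatched_lines($doc, @annotations);
```

Only the third line is reported, because byte 10 of the sample document is a space rather than the token "a".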
  

The first line of the output above, the one for the token "Lemur", represents an offset annotation with the following attributes:

docno
01
type
TAG
id
1
name
NNP
start
0
length
5
value
1
parentid
0
debug
Lemur

You can then use a similar approach to process your whole corpus, appending each document's stdout output to your final offset annotation file.
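
One way to automate this is a small driver that runs the alignment command once per document and appends whatever it prints to a single file. This is only a sketch under stated assumptions: the subroutine name append_corpus_annotations, the output file name, and the per-document file names in the example call are all hypothetical, and the command is assumed to take the same three arguments as the script above.

```perl
use strict;
use warnings;

# Hypothetical driver: run an annotation command once per document and
# append everything it prints to a single offset annotation file.
# $command is an array ref such as ['perl', 'lineupAnnotations.pl'];
# each entry of @docs is [docno, original text file, tagged text file].
sub append_corpus_annotations {
  my ($outFile, $command, @docs) = @_;
  open(my $out, '>>', $outFile) or die "Cannot open $outFile: $!\n";
  foreach my $doc (@docs) {
    my ($docno, $origFile, $taggedFile) = @$doc;
    # list-form pipe open avoids shell quoting problems in file names
    open(my $pipe, '-|', @$command, $docno, $origFile, $taggedFile)
      or die "Cannot run @$command: $!\n";
    print {$out} $_ while <$pipe>;
    close($pipe) or warn "Command failed for document $docno\n";
  }
  close($out) or die "Cannot close $outFile: $!\n";
}

# Example call (all file names are placeholders):
# append_corpus_annotations('corpus.annotations',
#   ['perl', 'lineupAnnotations.pl'],
#   ['01', 'doc01.txt', 'doc01.pos'],
#   ['02', 'doc02.txt', 'doc02.pos']);
```

Opening the output file in append mode means the driver can be rerun for additional documents without overwriting earlier results, so remove the file first if you want to start from scratch.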

 


[Previous: Preparing Text for Offset Annotations] [Back to TOC] [Next: Indexing a corpus with offset annotations]

 


The Lemur Project
Last modified: June 21, 2007. 09:14:12 am