Lemur Toolkit Discussion:
A place for users of the Lemur toolkit to discuss their experiences and problems with the software.

 

Note: as of Monday, June 23rd 2008, these forums are now read-only.
Please use the new forums on SourceForge

How to get text out of ParsedDocument
Posted by: leondz (IP Logged)
Date: February 15, 2008 10:57AM

Hi,

I've got a result, and I'd like to get the text of all documents returned from this result in Java. I'm currently using an array of ParsedDocument, but both the text and content properties of the ParsedDocument object seem to include all the document metadata, instead of just the info inside <TEXT></TEXT> tags, and the getContent method doesn't seem to exist.

I'm using Indri 2.6, and indexed the AQUAINT corpus using a trectext parser.
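
One possible workaround, sketched here as plain string handling rather than anything Indri provides, is to post-process the returned document string and keep only what lies between the <TEXT> tags. The class and method names below (TextTagExtractor, extractText) are hypothetical, the sample document is abridged from the AQUAINT example in this thread, and the sketch assumes a single upper-case <TEXT>...</TEXT> pair per document:

```java
public class TextTagExtractor {

    // Returns the trimmed text between the first <TEXT> tag and the next
    // </TEXT> tag, or null if either tag is missing.
    // Assumption: tags are upper-case and appear at most once per document.
    public static String extractText(String doc) {
        final String open = "<TEXT>";
        final String close = "</TEXT>";
        int start = doc.indexOf(open);
        if (start < 0) {
            return null;
        }
        start += open.length();
        int end = doc.indexOf(close, start);
        if (end < 0) {
            return null;
        }
        return doc.substring(start, end).trim();
    }

    public static void main(String[] args) {
        // Abridged sample document, for illustration only.
        String doc = "<DOC>\n"
                + "<DOCNO> APW19981210.1535 </DOCNO>\n"
                + "<TEXT>\n"
                + "TOKYO (AP) _ For decades, Tokyo Tower has been to the Japanese\n"
                + "</TEXT>\n"
                + "</DOC>";
        System.out.println(extractText(doc));
    }
}
```

In real use the argument would be the text field of a ParsedDocument; the stripped string no longer lines up with any offsets computed against the full document.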

Re: How to get text out of ParsedDocument
Posted by: leondz (IP Logged)
Date: February 28, 2008 09:44AM

Hi,

Can anyone help? I'll add a bit more detail as I'm coming up against a brick wall.

Using a QueryEnvironment's documents() call to extract text, and a ScoredExtentResult[] to hold results:

results = env.runQuery(query, indriResultCap);

rawtext = env.documents(results);

I'm trying to extract relevant passages. However, the content of rawtext[n].text contains an entire document in TREC format:


<DOC>
<DOCNO> APW19981210.1535 </DOCNO>
<DOCTYPE> NEWS STORY </DOCTYPE>
<DATE_TIME> 12/10/1998 20:14:00 </DATE_TIME>
<HEADER>
w2120 &Cx1f; wstm-
r i &Cx13; &Cx11; BC-FEA-Japan-ToweringDil 12-10 0629
</HEADER>
<BODY>
<SLUG> BC-FEA-Japan-Towering Dilemma,0629 </SLUG>
<HEADLINE>
Heaven Forbid? Tokyo landmark may be headed for the suburbs
</HEADLINE>
&UR; AP Photos TOK110,112 &QL;
&UR; By NAOMI OKADA &QC;
&UR; Associated Press Writer &QC;

<TEXT>
TOKYO (AP) _ For decades, Tokyo Tower has been to the Japanese
capital's skyline what the Empire State Building is to New York or
the Eiffel Tower to Paris.
About
...


I'm just interested in the data between <TEXT> tags. To do passage-level retrieval, results[n].begin and results[n].end contain the boundaries of the relevant text, but using these as character offsets into the retrieved text rarely returns relevant text: the span usually begins at 0 (which is the SGML document header) and runs a paragraph or two into the document. The line of code used to print the excerpt (in Java) is based on that from Trevor Strohman's Indri Hints:


// find position of the relevant passage within the document
byteBegin = results[i].begin;
byteEnd = results[i].end;

// extract the relevant passage from the ScoredResultVector array
text[i] = rawtext[i].text.substring(byteBegin, byteEnd);



How should ScoredExtentResult.begin / ScoredExtentResult.end be used in order to retrieve the relevant passage of text for a query?

Thanks!



Edited 1 time(s). Last edit at 02/28/2008 10:09AM by leondz.

Re: How to get text out of ParsedDocument
Posted by: dfisher (IP Logged)
Date: February 28, 2008 10:42AM

1) The content field of the ParsedDocument is accessed as an attribute in Java, e.g.:

ParsedDocument[] docs = null;
docs = env.documents(results);
System.out.println(docs[0].text);
System.out.println("%%%" + docs[0].content);

Note that the content field will differ from the text field only in the case where the input documents are trecweb formatted. For your collection, the two will be identical.

2) The begin and end attributes of a ScoredExtentResult are token positions. They need to be translated to byte offsets if you want to access the relevant substring of the document text, e.g.:

QueryEnvironment env = new QueryEnvironment();
env.addIndex("/usr/ind1/tmp2/dfisher/src/indri/test/index/GX146-86");
String query = "#combine[passage20:5](white house)";
ScoredExtentResult [] results = env.runQuery(query, 1);
ParsedDocument[] docs = null;
docs = env.documents(results);
int begin = results[0].begin;
int end = results[0].end - 1;
int byteBegin = docs[0].positions[begin].begin;
int byteEnd = docs[0].positions[end].end;
String passage = docs[0].text.substring(byteBegin, byteEnd);
System.out.println(passage);
env.close();

which outputs:

indri6:/usr/ind1/tmp2/dfisher/src/indri/test> java -Djava.library.path=`pwd` -classpath .:indri.jar IndriTest
they would consider providing
additional help for workers. Last year, the White House remained silent about
the issue until House


Note that if the query is not a passage-restricted query, the begin and end offsets will span the full content of the document.
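
To make the token-to-offset translation concrete without needing an index, here is a self-contained sketch. Everything in it is a stand-in: Extent plays the role of one entry in the positions array, tokenize() is a naive whitespace tokenizer that is not how Indri tokenizes, and passage() does the same lookup as the example above with bounds clamping added as a defensive assumption. It also assumes ASCII text, so that byte offsets and Java String indices coincide.

```java
import java.util.ArrayList;
import java.util.List;

public class PassageOffsets {

    // Hypothetical stand-in for one token's offsets in the document text:
    // begin is inclusive, end is exclusive.
    public static final class Extent {
        public final int begin;
        public final int end;

        public Extent(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    // Naive whitespace tokenizer that records each token's offsets.
    // It only exists to produce a positions array for the demonstration.
    public static Extent[] tokenize(String text) {
        List<Extent> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int start = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                out.add(new Extent(start, i));
            }
        }
        return out.toArray(new Extent[0]);
    }

    // Maps the token span [tokenBegin, tokenEnd) to the matching substring.
    // The end token is exclusive, so the last token used is tokenEnd - 1.
    // Out-of-range values are clamped instead of throwing.
    public static String passage(String text, Extent[] positions,
                                 int tokenBegin, int tokenEnd) {
        int first = Math.max(0, tokenBegin);
        int last = Math.min(tokenEnd, positions.length) - 1;
        if (first > last) {
            return "";
        }
        return text.substring(positions[first].begin, positions[last].end);
    }

    public static void main(String[] args) {
        String text = "the White House remained silent";
        Extent[] positions = tokenize(text);
        // Tokens 1 and 2 are "White" and "House".
        System.out.println(passage(text, positions, 1, 3));
    }
}
```

The point of the indirection is that a token span of [1, 3) says nothing about characters until it is looked up in the positions array.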

--
David Fisher (dfisher@cs.umass.edu)
Senior Software Engineer

Re: How to get text out of ParsedDocument
Posted by: leondz (IP Logged)
Date: February 29, 2008 05:52AM

Thanks! This is a real help.

I'm still having a problem identifying the beginning byte. Using the following code:

// run an Indri query, returning all results
results = env.runQuery(query, indriResultCap);

// fetch the parsed documents for the results
rawtext = env.documents(results);
...
begin = results[i].begin;
end = results[i].end - 1;

byteBegin = rawtext[i].positions[begin].begin;
byteEnd = rawtext[i].positions[end].end;

text[i] = rawtext[i].text.substring(byteBegin, byteEnd);


The results returned begin significantly before the opening <TEXT> tag:


12/10/1998 20:14:00 </DATE_TIME>
<HEADER>
w2120 &Cx1f; wstm-
r i &Cx13; &Cx11; BC-FEA-Japan-ToweringDil 12-10 0629
</HEADER>
<BODY>
<SLUG> BC-FEA-Japan-Towering Dilemma,0629 </SLUG>
<HEADLINE>
Heaven Forbid? Tokyo landmark may be headed for the suburbs
</HEADLINE>
&UR; AP Photos TOK110,112 &QL;
&UR; By NAOMI OKADA &QC;
&UR; Associated Press Writer &QC;

<TEXT>
TOKYO (AP) _ For decades, Tokyo Tower has been to the Japanese
capital's skyline what the Empire State Building is to New York or
the Eiffel Tower to Paris.
...


What have I missed?

Re: How to get text out of ParsedDocument
Posted by: dfisher (IP Logged)
Date: February 29, 2008 08:13AM

What is the query that you are using?

Is the extract of the document above the complete output for text[i]?

If you indexed the date field (DATE_TIME), as discussed in this thread: [www.lemurproject.org], the first token of the document will be 12. If the query does not contain an extent restriction, the ScoredExtentResult will be the span [0..number of tokens in document], which is consistent with your elided output above.
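
If the goal is passage-level spans, the plain query terms need to be wrapped in an extent restriction like the one in the earlier example, #combine[passage20:5](white house). A small helper for building such a string might look like the sketch below. The class and method names are hypothetical, and reading the two numbers as passage width and step (both in tokens) is my understanding of the query language, so please check it against the Indri query documentation.

```java
public class PassageQuery {

    // Wraps plain query terms in a passage-restricted #combine operator,
    // e.g. build("white house", 20, 5) gives "#combine[passage20:5](white house)".
    // Assumption: width and step are the passage window size and increment in tokens.
    public static String build(String terms, int width, int step) {
        if (width <= 0 || step <= 0) {
            throw new IllegalArgumentException("width and step must be positive");
        }
        return "#combine[passage" + width + ":" + step + "](" + terms.trim() + ")";
    }

    public static void main(String[] args) {
        System.out.println(build("how tall is the eiffel tower", 20, 5));
    }
}
```

The helper does no escaping, so punctuation in the terms would have to be cleaned up before the string is handed to runQuery.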

--
David Fisher (dfisher@cs.umass.edu)
Senior Software Engineer

Re: How to get text out of ParsedDocument
Posted by: leondz (IP Logged)
Date: February 29, 2008 09:20AM

Hi,

The query I'm using is "how tall is the eiffel tower".

The extract of the document given isn't complete - it ends at the end of the body text, which is perfect.

That's a correct assumption; I implemented your suggestions re: the DATE_TIME field.

In an attempt to be able to retrieve the contents of <text> and avoid the preamble, I've reindexed with the following config:


<parameters>
  <index>/share/nlp.raid1/trec_qa/darwin/indexes/indri/aquaint</index>
  <corpus>
    <path>/share/nlp.raid1/trec_qa/darwin/corpora/aquaint.decompressed</path>
    <class>trectext</class>
  </corpus>
  <memory>800m</memory>
  <stemmer>
    <name>porter</name>
  </stemmer>
  <metadata>
    <field>
      <name>date</name>
      <numeric>true</numeric>
      <parserName>DateFieldAnnotator</parserName>
    </field>
    <field>header</field>
    <field>slug</field>
    <forward>date</forward>
  </metadata>
  <field>text</field>
  <field>p</field>
  <field>headline</field>
</parameters>

However, after doing this, queries such as "#combine[text](How tall is the Eiffel tower)" return 0 results. What's going on?

Re: How to get text out of ParsedDocument
Posted by: dfisher (IP Logged)
Date: February 29, 2008 09:45AM

Please review the indexing parameters documentation at: [www.lemurproject.org]

Fields need to be specified with a name element, e.g.:

<field>
  <name>text</name>
</field>

--
David Fisher (dfisher@cs.umass.edu)
Senior Software Engineer


