Lemur 4.4 release notes
4.4 (Jan 12, 2007)
- 4.4 includes changes to the keyed file package that are not backwards compatible with previous versions. All indexes will need to be rebuilt.
- 4.4 removes support for the deprecated index classes, InvIndex and InvFPIndex. These should be replaced with a KeyfileIncIndex or an indri index.
- 4.4 corrects various issues in the 4.3.x distribution package, adds a wildcard operator to the Indri Query Language, a snippet builder for indri retrieval results and other features described below.
- Applications compiled with the Lemur Toolkit require the following libraries: z, iberty, pthread, and m on linux, and additionally socket and nsl on solaris. Applications built in Visual Studio require the additional library wsock32.lib.
- We have tested using GCC 3.2 (Solaris), 3.2.2 (Linux), 3.4 (Linux), 3.4.3 (Linux x86_64), 4.0.2 (Linux), VC++ .NET 7.1 (Windows XP), and Visual Studio 2005 (Windows XP).
Bugs Fixed (4.4):
- Problem: TaggedDocumentIterator can terminate the metadata before reading the end metadata tag.
  Solution: Don't test for the end tag against data in the buffer from a previous iteration.
- Problem: UTF8Transcoder won't compile with GCC 3.3 due to an optimizer bug that consumes all available memory.
  Solution: Refactored the initialization code to work around the optimizer issue.
- Problem: Indri's C# QueryAnnotationNode wrapper has no members.
  Solution: Remove spurious struct declaration that hid the members from SWIG.
- Problem: In the indri distribution, C# wrapper dll name incorrect, module name incorrect.
  Solution: Correct the names.
- Problem: ODNode missing matched extents in QueryAnnotation.
  Solution: DNode::copy was missing duplicate->setNodeName( nodeName() );
- Problem: IndriTermInfoList drops last entry in the list.
  Solution: Have the count_iterator advance to the end after reading the final entry rather than before.
- Problem: IndriRunQuery prints wrong document names.
  Solution: Load correct document names for the slice being printed.
- Problem: ISet and CSet add cause a segfault after grow.
  Solution: Don't reference sn after a call to grow.
- Problem: Indri's configure doesn't set -lantlr in LIBS when antlr is found externally.
  Solution: Add -lantlr to LIBS.
- Problem: AnchorTextAnnotator can leak TagExtents.
  Solution: Only allocate a TagExtent if it will be inserted into tags.
- Problem: AnchorTextAnnotator mainbody tags don't begin their scope until the beginning of the last tag in the input document.
  Solution: Sort the added tags so that the mainbody scope is from the beginning of the mainbody.
- Problem: In IndriRunQuery, _annotation is used uninitialized in QueryThread, causing delete of a random value.
  Solution: Initialize the value.
- Problem: makeprior can delete the same MergeFile twice.
  Solution: Take the file off of the stack, rather than a reference.
- Problem: IndriTextHandler can add spurious empty documents if multiple </DOC> tags are encountered.
  Solution: Don't add a document that doesn't have a docno.
- Problem: TaggedTextParser can leak TagExtents.
  Solution: Delete the TagExtents in the destructor.
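
The ISet/CSet fix listed above ("Don't reference sn after a call to grow") is an instance of a general C++ hazard: a pointer or reference into a container's storage dangles once that storage is reallocated. The following is a minimal sketch of the safe pattern, not Lemur's actual code. TinySet and its members are hypothetical names chosen for illustration; the point is that the slot is looked up by index in the current buffer after growing, rather than through a reference taken beforehand.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for a growable set; this is NOT Lemur's ISet/CSet.
class TinySet {
public:
    TinySet() : cap_(2), size_(0), nodes_(new int[cap_]) {}
    ~TinySet() { delete[] nodes_; }
    TinySet(const TinySet&) = delete;
    TinySet& operator=(const TinySet&) = delete;

    // Append a value and return the index it was stored at.
    std::size_t add(int value) {
        std::size_t idx = size_;
        if (size_ == cap_) {
            // grow() replaces nodes_, so any pointer or reference into the
            // old buffer taken before this call would now be dangling.
            grow();
        }
        // Safe: index into the current buffer only after grow() has run.
        nodes_[idx] = value;
        ++size_;
        return idx;
    }

    int at(std::size_t i) const { return nodes_[i]; }
    std::size_t size() const { return size_; }

private:
    // Double the capacity and move the existing entries to a new buffer.
    void grow() {
        std::size_t newCap = cap_ * 2;
        int* bigger = new int[newCap];
        std::memcpy(bigger, nodes_, size_ * sizeof(int));
        delete[] nodes_;
        nodes_ = bigger;
        cap_ = newCap;
    }

    std::size_t cap_;
    std::size_t size_;
    int* nodes_;
};
```

Holding an index (or re-fetching the element after growing) keeps the access valid regardless of how many times the buffer is reallocated.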
Known Problems:
This is a list of bugs and known problems with the current version of Lemur (4.4) and Indri (2.4). Many problems have fixes or workarounds that are posted on the forum. Please check there if you do not see something here.
- No known problems currently exist.
Enhancements (4.4):
- A wildcard operator has been added to the Indri Query Language. See the Indri Query Language Reference for more detail.
- URL tokenization in indri has been changed to tokenize on the '.' character, rather than removing it. For example, www.lemurproject.org used to tokenize as wwwlemurprojectorg and now tokenizes as www lemurproject org.
- A URLTextAnnotator has been added to indri that extracts URLs from the metadata fields, tokenizes them, and adds the tokens to the parsed document text for indexing.
- The keyfile btree package has been updated to permit storing variable sized records in the index block (efficiency improvement). This change requires that all indexes be rebuilt.
- A snippet builder has been added to the Indri API. This is a C++ implementation of the PHP snippet generator used by the site search PHP UI.
- QueryRequest and QueryResult have been added to the indri API. These provide an encapsulated form for running a query and collecting typically useful results in a single structure. The results include external document id, text snippet, score, and requested metadata values (such as title). The API is available in C++ and Java.
- The Windows installers now test for the presence of Visual Studio 2005 and install project files for that version when the source install option is selected.
- The summarization applications have been upgraded to use any index opened via IndexManager::open.
- A UIMA CAS consumer and search application for Indri have been added to the toolkit. See UIMA Indri Components.
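
The URL tokenization change described in the list above can be illustrated with a small sketch. This is not Indri's tokenizer; stripDots and splitOnDots are hypothetical helper names that only reproduce the before and after behaviour stated in these notes for the example www.lemurproject.org.

```cpp
#include <string>
#include <vector>

// Pre-4.4 behaviour as described in the notes: the '.' characters are
// removed, so the whole host name collapses into a single token.
std::string stripDots(const std::string& url) {
    std::string out;
    for (char c : url) {
        if (c != '.') {
            out += c;
        }
    }
    return out;
}

// 4.4 behaviour as described in the notes: '.' acts as a token separator.
std::vector<std::string> splitOnDots(const std::string& url) {
    std::vector<std::string> tokens;
    std::string current;
    for (char c : url) {
        if (c == '.') {
            if (!current.empty()) {
                tokens.push_back(current);  // close the token before the dot
            }
            current.clear();
        } else {
            current += c;
        }
    }
    if (!current.empty()) {
        tokens.push_back(current);  // trailing token after the last dot
    }
    return tokens;
}
```

With the second form, a query term such as lemurproject can match the URL, which is not possible when the host name is indexed as one concatenated token.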
Deprecations:
The following deprecated classes have been removed from the Lemur Toolkit.
- IndexWithCat, BasicIndexWithCat -- There is no application that supports constructing one of these, nor interacting with one if it could be built. Support for categories can be realized with metadata in an indri repository, or as field data in an indri repository, either inline, or specified as offset annotations.
- FlattextDocMgr -- Scales poorly, loads entire lookup table into memory. Functionality it provides is provided by KeyfileDocMgr, ElemDocMgr, and an indri repository.
- InvIndex, InvFPIndex, InvPushIndex, InvFPPushIndex, InvIndexMerge, InvFPIndexMerge, InvFPTextHandler, InvTermList, InvFPTermPropList -- Scales poorly, loads entire vocabulary, etc. into memory. Functionality is provided by KeyfileIncIndex and an indri repository.
- IncFPPushIndex, IncFPTextHandler, IncIndexer(application) -- Incremental versions of prior. Same scale problems, incremental functionality provided by KeyfileIncIndex and an indri repository.
- InvPassagePushIndex, InvPassageTextHandler, IncPassagePushIndex, IncPassageTextHandler, PassageIndexer (application) -- Fixed window passage indexes for prior. Same scale problems, adds little value. Fixed window passage retrieval is supported directly by the Indri query language and via the TextQueryRetMethod::scoreDocPassages API.
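
For readers unfamiliar with the term, a fixed window passage is a run of consecutive token positions of a set length within a document. The sketch below only illustrates how such window boundaries can be computed; it is not the removed passage index code, nor the implementation behind TextQueryRetMethod::scoreDocPassages. The function name fixedWindows and its windowSize and stride parameters are assumptions made for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Returns [begin, end) token offsets of fixed-size windows over a document
// of docLength tokens. A stride smaller than windowSize yields overlapping
// passages; a stride equal to windowSize yields disjoint ones.
std::vector<std::pair<std::size_t, std::size_t>>
fixedWindows(std::size_t docLength, std::size_t windowSize, std::size_t stride) {
    std::vector<std::pair<std::size_t, std::size_t>> passages;
    if (windowSize == 0 || stride == 0) {
        return passages;  // nothing sensible to compute
    }
    for (std::size_t begin = 0; begin < docLength; begin += stride) {
        // The final window is truncated at the end of the document.
        std::size_t end = std::min(begin + windowSize, docLength);
        passages.push_back(std::make_pair(begin, end));
        if (end == docLength) {
            break;  // the last window already reaches the document end
        }
    }
    return passages;
}
```

Each resulting window could then be scored independently, with the best passage score standing in for the document.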
The Lemur Project
Last modified: June 21, 2007, 09:12:26 am

