Lemur 4.1 release notes
(October 06, 2005)
- Due to the changes necessary for 64-bit compilation, this version is not backwards compatible with Lemur version 4.0. All indexes and document managers need to be rebuilt to ensure proper behavior.
- Applications compiled with the Lemur Toolkit require the following libraries: z, iberty, pthread, and m on linux, and additionally socket and nsl on solaris. Applications built in Visual Studio require the additional library wsock32.lib.
- We have tested using GCC 3.2 (solaris), 3.2.2 (linux), 3.4 (linux), 3.4.3 (linux x86_64), and VC++ .NET 7.1 (Windows XP).
- Enhancements:
- Lemur 4.1 only
- INEX task activity support, including nexi query language support in IndriRunQuery. See INEX for information about the INEX task. See IndriRunQuery for a description of the INEX task output parameters. Nexi query language queries are specified with a query type of nexi. Indri query language queries are specified with a query type of indri. See the next item for an example of a nexi query.
- Addition of CDATA element for parameter files. This enables embedding otherwise unparseable data, such as a nexi query containing the < operator, in an element. The format is:
<!CDATA[data to escape]]>
The data to escape may not contain the character sequence ']]', as that would prematurely terminate the expression. For example:
<query>
<number>163</number>
<type>nexi</type>
<text> <!CDATA[//article[.//pdt >= 1995]//sec[about(., Text and Index Compression Algorithms)]]]> </text>
</query>
- 64-bit clean API. The lemur toolkit and indri configure, compile, and execute on linux x86_64 platforms.
- Multithreaded query support for IndriRunQuery (lemur 4.1)/runquery (indri 2.1).
- Support for offset annotations when constructing an indri repository. See offset annotations for details.
- Throw an exception in TaggedDocumentIterator::_readLine if the buffer exceeds 50M. This gives the user some feedback in the case of passing in trecweb as the file class when the documents are not trecweb format. Without this, trying to read the metadata causes the buffer to grow to roughly 2 * size of input file, consuming all available memory on some machines.
- mbox file class environment for indri repository construction. Can be used to index unix mail (mbox format) files.
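The CDATA format described in the enhancements above can be sketched as a small helper. This is a minimal illustration only: `isCdataSafe`, `wrapCdata`, and `nexiQueryElement` are hypothetical names, not part of the Lemur or Indri API, and the only rules they encode are the ones stated in these notes (the `<!CDATA[ ... ]]>` form and the ban on `]]` inside the data).

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not a Lemur/Indri function): true when the data
// does not contain "]]", the sequence these notes say would prematurely
// terminate the CDATA expression.
bool isCdataSafe(const std::string& data) {
    return data.find("]]") == std::string::npos;
}

// Hypothetical helper: wraps data in the CDATA form shown in these notes.
// Precondition: the data passes isCdataSafe.
std::string wrapCdata(const std::string& data) {
    assert(isCdataSafe(data));
    return "<!CDATA[" + data + "]]>";
}

// Hypothetical helper: builds a <query> element with a query type of nexi,
// laid out like the example in these notes.
std::string nexiQueryElement(int number, const std::string& query) {
    return "<query> <number>" + std::to_string(number) +
           "</number> <type>nexi</type> <text> " + wrapCdata(query) +
           " </text> </query>";
}
```

Calling `nexiQueryElement(163, ...)` with the query text from the example above reproduces that element, so a query containing the < or >= operators can be placed in a parameter file without confusing the reader.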
- Lemur 4.1 only
- Bugs Fixed:
- Problem: LocalQueryServer::documentIDsFromMetadata and LocalQueryServer::documentsFromMetadata return empty results.
Solution: Copy the results into the result vector for return.
- Problem: TermList delta encoding of field extents can pass negative values to RVL::compress_int.
Solution: Compute delta encoding with respect to previous extent begin to ensure that no negative values are passed to RVL::compress_int.
- Problem: The PLSA application does not run correctly when doTrain is false on Windows due to opening its data files in text mode, rather than binary mode.
Solution: Open the data files in binary mode.
- Problem: If a parameter file contains mismatched opening and closing tags for an element, e.g., <position> with </prosition>, or no closing tag for an element, the XMLReader can throw an uncaught std::string exception.
Solution: Add a test for a missing close tag and throw an Exception that ParamPushFile can report to the user.
- Problem: Indri query language context restriction queries can fail. Queries of the form: #combine[speech]( romeo.speaker juliet.line ) do not correctly retrieve documents according to the context restrictions.
Solution: Correctly compute the context restriction in ContextCountAccumulator.
- Problem: Indri 2.0 top documents list can be computed incorrectly at indexing time. A query run with optimizations turned off (via the -skipping=false parameter) can return different results than the optimized query.
Solution: Correctly construct the topdocs list during indexing by inserting all candidates into the topdocs queue.
- Problem: DistRetEval can leak memory.
Solution: Delete the leaking pointers as needed.
- Problem: ContextCountAccumulator miscounts occurrences.
Solution: Correctly count zero length extents.
- Problem: greedy_vector::erase does not copy all elements downward.
Solution: Copy all elements downward when erasing an element.
- Problem: Field id is off by one in MemoryIndex::fieldTermCount and MemoryIndex::fieldDocumentCount.
Solution: Use the correct field id value as the index.
- Problem: IndriTextHandler leaks memory.
Solution: Make the docsource buffer local to handleEndDoc and delete it after it has been used.
- Problem: indri::index::CombinedVocabularyIterator does not initialize properly if its first VocabularyIterator is empty.
Solution: If _first is empty in startIteration, set _usingSecond to true and call startIteration on _second.
- Problem: Smoothing support for WeightedExtentOr nodes is missing.
Solution: Add a case for WeightedExtentOr nodes.
- Problem: [indri] configure does not set JAR when --with-javahome is not specified and javac is found on the path.
Solution: Have configure set JAR in this case.
- Problem: [indri] configure does not have an option to enable or disable building of the java or php wrapper libraries when --enable-swig is given.
Solution: Add --enable-php and --enable-java options to permit building either or both of the wrapper libraries.
- Problem: Duplicate terms in #wsyn and #syn operators cause score mismatches.
Solution: Filter duplicate term extents so that they are not double counted.
- Problem: IndexWriter::_writeFieldList can overflow totalCount for a field for a sufficiently large collection.
Solution: Use a 64-bit integer to maintain the totalCount.
- Problem: JelinekMercerTermScoreFunction smoothing bug: the value of _foregroundLambda is incorrect.
Solution: Set _foregroundLambda = (1 - _collectionLambda) instead of _foregroundLambda = _collectionLambda + _documentLambda.
- Problem: indri query parser fails on numbers followed by a high ascii character.
Solution: Update query parser to accept the token.
- Problem: keyfilecode::init_file_name produces a file open error when there is a "." in directory path. Regression of a bug fixed in 3.1.1.
Solution: Reinstate bug fix from 3.1.1 to stop scanning for a file extension when a path separator is encountered.
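The idea behind the init_file_name fix in the last item can be sketched as follows. This is not the keyfile source: `findExtension` and `withExtension` are hypothetical helpers, and the choice of '/' and '\\' as path separators is an assumption. The sketch only shows why a backward scan for an extension has to stop at a path separator.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not the keyfile code): scans backward from the end
// of the path and returns the position of the '.' that starts the
// extension of the last path component, or std::string::npos when that
// component has no extension.
std::size_t findExtension(const std::string& path) {
    for (std::size_t i = path.size(); i > 0; --i) {
        const char c = path[i - 1];
        if (c == '.') {
            return i - 1;
        }
        // Stop at a path separator (assumed to be '/' or '\\') so that a
        // "." in a directory name is never mistaken for an extension.
        if (c == '/' || c == '\\') {
            break;
        }
    }
    return std::string::npos;
}

// Hypothetical helper: replaces the extension of the last path component,
// or appends one when there is none. ext must begin with '.'.
std::string withExtension(const std::string& path, const std::string& ext) {
    assert(!ext.empty() && ext[0] == '.');
    const std::size_t dot = findExtension(path);
    const std::string base =
        (dot == std::string::npos) ? path : path.substr(0, dot);
    return base + ext;
}
```

Without the separator check, a path such as /data/index.v1/keyfile would be cut at the "." in the directory name, and the resulting file name would point outside the intended directory, which is one way a file open error of the kind described could arise.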
