Lemur 4.5 release notes
4.5 (Jun 21, 2007)
- 4.5 includes changes to the indri Repository structure that are not backwards compatible with previous versions. All indri indexes will need to be rebuilt.
- 4.5 corrects various issues in the 4.4.x distribution package, adds a date field annotator for encoding dates as numeric fields, document deletion and index compaction for indri Repositories, efficiency improvements to the OffsetAnnotationAnnotator, improvements to the CGI, and other features described below.
- Applications compiled with the Lemur Toolkit require the following libraries: z, iberty, pthread, and m on Linux, and additionally socket and nsl on Solaris. Applications built in Visual Studio require the additional library wsock32.lib. The Java jar files were built with Java 5 (JDK 1.5.0). We have tested using GCC 3.2 (Solaris), 3.2.2 (Linux), 3.4 (Linux), 3.4.3 (Linux x86_64), 4.0.2 (Linux), VC++ .NET 7.1 (Windows XP), and Visual Studio 2005 (Windows XP).
Bugs Fixed (4.5):
-
Problem: TopDocument::greater returns an incorrect result on Windows with Visual Studio 2005
Solution: When one == two or one == k*two (count and length), greater returns true when it should return false, because of a floating point precision issue. Change the comparison to use the fractions on both sides.
-
Problem: count_iterator assert breaks IndriTermInfoList
Solution: The assert in _buildValue is not required; the condition need not be true (in fact it must be false on the final call). Remove the assert.
-
Problem: documentsFromMetadata and documentIDsFromMetadata delete
response twice
Solution: Only delete the response once.
-
Problem: Repository merges more often than needed when indexing with a
large number of fields
Solution: Change the number of open files estimate for merging to include one fields file, rather than one file per field.
-
Problem: _lookupTermID in IndexWriter has hardcoded constants that don't
work with a large number of fields
Solution: Dynamically allocate the necessary memory, rather than using a fixed size buffer.
-
Problem: Memory parameter can overflow int in IndriBuildIndex.
Solution: Change the call to env.setMemory that uses the operator int signature to use the operator INT64 signature.
-
Problem: Conversion of string to INT64 with suffix includes suffix in
the value.
Solution: Strip the trailing suffix (k,m,g,K,M,G) before computing the value.
-
Problem: IndriFile::openTemporary fails on WIN32
Solution: Use OPEN_ALWAYS, not OPEN_EXISTING in open.
-
Problem: Java wrapper does not fill in the docid attribute of QueryResult
Solution: Fill in the attribute value.
-
Problem: put_rec can fail without reporting an error.
Solution: Catch and report the error.
-
Problem: Writing a document with no positions to a compressed collection
corrupts the metadata.
Solution: Write the #POSITIONS# key and data, even if the positions vector is empty.
-
Problem: makeprior is off by one inserting default values at the end of
the list of documents.
Solution: Insert a default value for the final document.
-
Problem: makeprior skips entries reading in its table for testing
compressibility.
Solution: seek over two UINT32, not 8.
-
Problem: BulkTreeReader can overflow an int when handling very large
trees, yielding incorrect results.
Solution: The product of two 32-bit integers can overflow. Change dataSize() to return UINT64 to ensure proper promotion when multiplying its return value by an integer.
-
Problem: Property calls destructor on objects more than once,
corrupting memory.
Solution: Don't call the destructor more than once.
-
Problem: ArabicStemmer does not set its identifier correctly.
Solution: Assign the value of the identifier to iden.
-
Problem: ArabicStemmer can overflow its stem buffer, corrupting memory.
Solution: Don't attempt to stem terms longer than the stem buffer.
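
The two memory parameter fixes above (stripping the k/m/g suffix before conversion, and carrying the value as a 64-bit integer) can be illustrated with a minimal C++ sketch. This is not the Indri source: the function name parseWithSuffix is hypothetical, and the decimal multipliers (k = 10^3, m = 10^6, g = 10^9) are an assumption made for the example.

```cpp
#include <cstdint>
#include <cstdlib>
#include <string>

// Hypothetical helper, not the Indri implementation.
// Converts strings such as "500k" or "3G" to a 64-bit value.
// Assumption: decimal multipliers (k = 10^3, m = 10^6, g = 10^9).
std::int64_t parseWithSuffix(const std::string& text) {
    if (text.empty()) {
        return 0;
    }

    std::int64_t multiplier = 1;
    switch (text.back()) {
        case 'k': case 'K': multiplier = 1000LL; break;
        case 'm': case 'M': multiplier = 1000000LL; break;
        case 'g': case 'G': multiplier = 1000000000LL; break;
        default: break;  // no recognized suffix
    }

    // Strip the trailing suffix so it is not treated as part of the number.
    std::string digits = text;
    if (multiplier != 1) {
        digits.pop_back();
    }

    // Both operands are 64-bit, so the product is not truncated to 32 bits.
    return std::strtoll(digits.c_str(), nullptr, 10) * multiplier;
}
```

Under these assumptions a value such as 3g does not fit in a 32-bit int, which is why the result type is 64-bit.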
Known Problems:
This is a list of bugs and known problems with the current version of Lemur (4.5) and Indri (2.5). Many problems have fixes or workarounds that are posted on the forum. Please check there if you do not see something here.
-
Problem: The #band and #uw operators cast a signed value to an unsigned value, causing extent matching to fail.
Solution: In /retrieval/src/UnorderedWindowNode.cpp, do not cast iterators to an unsigned value. See the following Phorum post for details: http://www.lemurproject.org/phorum/read.php?11,3808
-
Problem: There is an off-by-one error in DocumentStructureNode that causes it to attempt to load the document structure for one past the last document in the index. This can cause a crash, as seen when using the INEX 1.4 data on Solaris.
Solution: In /retrieval/src/DocumentStructureNode.cpp, do not allow the document ID to be greater than or equal to the maximum document ID for the index. See the following Phorum post for details: http://www.lemurproject.org/phorum/read.php?11,3821
-
Problem: When merging indexes, the IndexWriter fails to correctly compute the maximumDocument for the merged indexes.
Solution: Correctly compute the maximum number of documents from the document base. See the following Phorum post for details: http://www.lemurproject.org/phorum/read.php?11,3828
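
The signed-to-unsigned cast problem listed above is a general C++ pitfall. The sketch below is a generic illustration only; it does not reproduce the code in UnorderedWindowNode.cpp, and the function names and the window test itself are hypothetical. It assumes that positions in an unordered window may arrive in either order, so their difference can be negative.

```cpp
// Generic illustration, not the Indri source.

// Buggy pattern: a negative gap is converted to a very large unsigned
// value, so the comparison fails even though the positions are close.
bool withinWindowBuggy(int first, int second, int width) {
    return static_cast<unsigned int>(second - first) <
           static_cast<unsigned int>(width);
}

// Signed arithmetic keeps the negative gap intact, so it can be handled.
bool withinWindow(int first, int second, int width) {
    int gap = second - first;
    if (gap < 0) {
        gap = -gap;  // positions may appear in either order
    }
    return gap < width;
}
```

For positions 10 and 7 with a width of 5, the buggy version reports no match while the signed version matches.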
Enhancements (4.5):
-
#scoreif, #scoreifnot added as alternatives to #filreq, #filrej in the
Indri query language.
-
URL tokenization of absolute-url and relative-url (e.g., <a
href="http://www.lemurproject.org/">) in the indri HTML parser has been
changed to tokenize on the '.' and '/' characters, rather than injecting the
untokenized url into the document terms. The leading scheme (e.g., http://)
is removed.
-
CGI: Slight modifications were made to the Lemur CGI to allow the use of an
IndriDaemon to process text queries. More information can be found in
the README file under ./site-search/cgi/
-
OffsetAnnotationAnnotator: Optimizations for speed were made to the way the Indri indexer
processes and indexes large external offset annotation files. Now, by
default the indexer will judge the size of the incoming offset
annotation file and pre-allocate its internal buffers accordingly. As
an alternative, if the document IDs in your offset annotation files
are in the same order as your corpus, you can tell the indexer to not
pre-allocate this memory, and instead incrementally read the offset
annotation file. Note that if your IDs are not in the same order as
your incoming corpus, the annotations will be ignored. See the
documentation regarding IndriBuildIndex parameters for usage.
-
DateFieldAnnotator: The Indri DateFieldAnnotator converts date strings in a numeric field to
the number of days since 01/01/1600. This enables the use of the date
operators in the Indri query language when the field being indexed is
named "date".
Acceptable date formats:
- 11-01-2004 (DD-MM-YYYY)
- 11-JAN-2004 (DD-Month-YYYY)
- 2004-01-11 (YYYY-MM-DD)
- January 11 2004 (Month DD YYYY)
- 11 January 2004 (DD Month YYYY)
- 01/11/2004 (MM/DD/YYYY)
- 2004/01/11 (YYYY/MM/DD)
- 20040111 (YYYYMMDD)
Four digit years are required. The leading 0 can be omitted in all formats except YYYYMMDD, e.g., 1/11/2004. Month names may be an abbreviation or the full name.
Specify in the index build parameters file with:
  <field>
    <name>date</name>
    <numeric>true</numeric>
    <parserName>DateFieldAnnotator</parserName>
  </field>
-
Document deletion with space reclamation, compaction, and repository
merging have been added to the indri Repository API. The dumpindex
application has three new commands that enable these interactions.
These commands change the data inside the repository:
  Command        Argument        Description
  compact (c)    None            Compact the repository, releasing space used by deleted documents.
  delete (del)   Document ID     Delete the specified document from the repository.
  merge (m)      Input indexes   Merge a list of Indri repositories together into one repository.
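
The DateFieldAnnotator conversion described above (a date mapped to a number of days since 01/01/1600) can be sketched as follows. This is a minimal illustration, not the Indri implementation: the function names are hypothetical, and it assumes the proleptic Gregorian calendar with 01/01/1600 counted as day 0. The annotator's actual calendar handling and starting offset may differ.

```cpp
#include <cstdint>

// Hypothetical helper: days relative to 1970-01-01 in the proleptic
// Gregorian calendar, using the well-known days-from-civil arithmetic.
std::int64_t daysFromCivil(int year, int month, int day) {
    // Treat January and February as months 11 and 12 of the previous year.
    if (month <= 2) {
        year -= 1;
    }
    const int era = (year >= 0 ? year : year - 399) / 400;
    const int yearOfEra = year - era * 400;                       // [0, 399]
    const int shiftedMonth = (month > 2) ? month - 3 : month + 9;  // March = 0
    const int dayOfYear = (153 * shiftedMonth + 2) / 5 + day - 1;  // [0, 365]
    const int dayOfEra =
        yearOfEra * 365 + yearOfEra / 4 - yearOfEra / 100 + dayOfYear;
    return static_cast<std::int64_t>(era) * 146097 + dayOfEra - 719468;
}

// Assumption: 01/01/1600 maps to 0.
std::int64_t daysSince1600(int year, int month, int day) {
    return daysFromCivil(year, month, day) - daysFromCivil(1600, 1, 1);
}
```

Under these assumptions, the example date used in the format list (January 11 2004) maps to 147568. Storing a single integer per date is what allows numeric comparison in queries.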
The Lemur Project
Last modified: December 19, 2007. 13:33:05 pm

