- This version is not backwards compatible with Lemur version
3.x. All indexes and document managers need to be
rebuilt to ensure proper behavior.
- The syntax for parameter files has been changed to use the
Indri syntax. See Lemur Parameter
Files for a description of the new format and the conversion
script.
- Applications compiled with the Lemur Toolkit require the
following libraries: z, iberty,
pthread, and m on Linux, and additionally
socket and nsl on Solaris. Applications built in
Visual Studio require the additional library wsock32.lib.
- We have tested using GCC 3.2 (Solaris), 3.2.2 (Linux),
3.4 (Linux), and VC++ .NET 7.1 (Windows XP).
- Enhancements:
- Added C++ namespaces for indri and lemur components. The namespaces
lemur::api and indri::api contain most of the classes
needed to build applications with the Lemur Toolkit. Existing
applications should recompile after adding using
namespace lemur::api; (and/or using
namespace indri::api;) to the source file. Users of the toolkit
can focus their attention on those two namespaces for most
of their programming needs, which should ease the learning curve.
- Indri 2.0
- Threads and concurrent indri repository access. See concurrency.pdf for a detailed description of
the changes.
- Numeric field handling in IndriBuildIndex updated to automatically
configure conversion of numeric field elements to integer values.
- An xml FileClassEnvironment for individual xml documents has been
added to indri.
- removeIndex and removeServer have been added to the
QueryEnvironment API.
- LemurIndriIndex updated to use indri 2.0.
- IndexManager handling of indri repositories updated to use the
top-level directory name of the indri repository when opening a
LemurIndriIndex. This is the same value as is used for the
index parameter when building the index.
- New parameter file format. See Lemur
Parameter Files for a description of the new format and the
conversion script.
- RVLCompress, RVLCompressStream, and RVLDecompressStream updated to
handle compression and decompression of signed and unsigned values in a
more elegant fashion, replacing separately named methods with overloaded
methods.
- The keyfile package has been updated to clean up some compilation
warnings.
- The Lemur Toolkit documentation has been extensively updated.
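
The RVLCompress entry above describes replacing separately named methods with overloaded ones. The following is a minimal sketch of that design idea only. The `rvl_sketch` namespace, the `put` and `get` names, and the byte layout (7-bit groups with zigzag mapping for signed values) are assumptions made for illustration and are not the Lemur RVLCompress API or its on-disk format.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical names and byte layout; not the Lemur RVLCompress API.
namespace rvl_sketch {

// Append an unsigned value as 7-bit groups, low-order group first.
// The high bit of each byte marks "more bytes follow".
inline void put(std::vector<std::uint8_t>& out, std::uint32_t value) {
    while (value >= 0x80u) {
        out.push_back(static_cast<std::uint8_t>((value & 0x7Fu) | 0x80u));
        value >>= 7;
    }
    out.push_back(static_cast<std::uint8_t>(value));
}

// Overload for signed values: map to unsigned with zigzag encoding
// (0, -1, 1, -2, ... becomes 0, 1, 2, 3, ...) and reuse the unsigned path.
inline void put(std::vector<std::uint8_t>& out, std::int32_t value) {
    std::uint32_t u = static_cast<std::uint32_t>(value);
    std::uint32_t zigzag = (u << 1) ^ (value < 0 ? 0xFFFFFFFFu : 0u);
    put(out, zigzag);
}

// Read one unsigned value starting at pos; return the position after it.
inline std::size_t get(const std::vector<std::uint8_t>& in, std::size_t pos,
                       std::uint32_t& value) {
    value = 0;
    int shift = 0;
    while (pos < in.size() && shift < 32) {
        std::uint8_t byte = in[pos++];
        value |= static_cast<std::uint32_t>(byte & 0x7Fu) << shift;
        if ((byte & 0x80u) == 0) break;  // last byte of this value
        shift += 7;
    }
    return pos;
}

// Overload for signed values: read the unsigned form, then undo zigzag.
inline std::size_t get(const std::vector<std::uint8_t>& in, std::size_t pos,
                       std::int32_t& value) {
    std::uint32_t zigzag = 0;
    pos = get(in, pos, zigzag);
    value = static_cast<std::int32_t>((zigzag >> 1) ^ (0u - (zigzag & 1u)));
    return pos;
}

}  // namespace rvl_sketch
```

With overloads like these, the caller writes the same call for both kinds of value and the argument type selects the signed or unsigned path.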
- Deprecations:
- FUtil::fileExist has been removed from the
API. It has been replaced with
indri::file::Path::exists. The files FUtil.hpp and
FUtil.cpp have been removed from the distribution.
- The parsing applications no longer default to using the
TrecParser. If the
docFormat parameter is not specified, or contains a value of an
unknown type, the application will exit with an error.
- The files error.h, error.c, ht.h, parameters.h,
parameters.c, util.h, and util.c have been removed from the
distribution.
- Bugs Fixed:
- Problem: FlattextDocMgr does not know about documents in
files other than the first file listed in dataFiles parameter.
Solution: Remove writing of spurious data to the lookup file.
- Problem: KeyfileIncIndex emits a shuffle error message while
indexing wsj87.dat when no memory parameter is provided (using 96M).
Solution: Change the underlying keyfile package to correctly
shuffle keys when the compression prefix changes.
- Problem: BuildIndex fails silently when building a
KeyfileIncIndex on Windows when built in Debug mode.
Solution: Initialize the _readOnly attribute in the
KeyfileIncIndex constructor.
- Problem: A malformed query can cause a seg fault in the indri QueryEnvironment.
Solution: Change the grammar generated by antlr so that it does not
generate default exception
handlers, allowing exceptions to propagate up to the calling QueryEnvironment,
where they are handled.
- Problem: Windows binary install missing runtime dll files.
Solution: Add the appropriate dll files MSVCP71.dll and
MSVCR71.dll to the distribution.
- Problem: MMRSumm::autoMMRQuery leaks memory.
Solution: Add appropriate calls to delete allocated objects.
- Problem: ChineseParser miscounts document length.
Solution: Increment position after calling yyless, rather
than before.
- Problem: If an application makes multiple instances of
SimpleKLRetMethod with different indexes, the static allocation
of the distQuery arrays in the expansion methods can result in
memory access or corruption errors if the different indexes have
different sizes.
Solution: Change uses of static allocation in methods to
use dynamic allocation. Applies to FreqVector,
ResultFile, SimpleKLRetMethod, and TFIDFRetMethod.
- Problem: BasicDocStream::hasMore can return an uninitialized value.
Solution: Initialize the variable moreDoc in hasMore.
- Problem: Lines 94 and 95 of ChineseCharParser.l contain
unterminated comments, effectively removing
lines 95 and 96 from the source.
Solution: Properly terminate the comments.
- Problem: IndriTextHandler causes a bus error on solaris.
Solution: Add initialization of curdocno to constructor so
that free is not called with an uninitialized value.
- Problem: Summarization applications produce a runtime error
initializing a std::string with NULL.
Solution: Initialize empty strings with "".
- Problem: QryBasedSample segfaults inside recursive call to randomWord.
Solution: Convert randomWord to use iteration instead of
recursion. Change uses of char * in map/set to uses of std::string
so that calls to find behave correctly.
- Problem: QryBasedSample dumps core on memory corruption on Linux.
Solution: Move delete(dbm) out of the while loop, preventing
multiple delete calls on the object.
- Problem: Indri query parser does not parse negative numbers.
Solution: Add appropriate production for recognizing
negative numbers to the query grammar.
- Problem: BasicIndexWithCat::catCount off by one.
Solution: Don't subtract 1 from the termCount to derive the
catCount.
- Problem: ElemDocMgr crashes on build if filenames have spaces.
Solution: Change dataFiles reader to not split file names on
spaces.
- Problem: Filenames containing spaces in TOC files cause
problems in document managers and indexes.
Solution: Change TOC and other data file readers to not
split file names on spaces.
- Problem: Indri query parser does not accept negative numbers
as query terms.
Solution: Add a production to the query grammar.
- Problem: Using Indri #base64quote operator causes a segfault.
Solution: Change production to reference correct automatic
variable for the scope.
- Problem: Indri query grammar does not accept the BASE64
encoding pad character '='.
Solution: Add the character to BASESIXFOUR_CHAR as a valid
alternative.
- Problem: NetworkServerProxy::termCount deletes an XMLNode
twice, causing a segfault on linux.
Solution: Only delete the node once.
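
The QryBasedSample fix listed above replaces char * keys in map/set with std::string. The sketch below shows why that matters for calls to find. It is a hypothetical demonstration and not code from QryBasedSample, and both function names are invented for illustration.

```cpp
#include <set>
#include <string>

// Hypothetical demonstration; not code from QryBasedSample.

// With const char* keys the set orders and compares by address, so a
// probe with the same characters at a different address is not found.
inline bool found_by_pointer_key(const char* stored, const char* probe) {
    std::set<const char*> seen;
    seen.insert(stored);
    return seen.find(probe) != seen.end();
}

// With std::string keys the set compares character content, so any
// probe with the same characters is found.
inline bool found_by_string_key(const char* stored, const char* probe) {
    std::set<std::string> seen;
    seen.insert(stored);
    return seen.find(probe) != seen.end();
}
```

Calling both functions with two separate buffers that hold the same word makes the difference visible: the pointer-keyed lookup misses, while the string-keyed lookup succeeds.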