Lemur 4.3.2 release notes
4.3 (June 21, 2006), 4.3.1 (June 26, 2006), 4.3.2 (July 27, 2006)
- 4.3.2 corrects sixteen issues in the 4.3.1 distribution package, adds child, parent, and ancestor references to the Indri Query Language, and adds an enhancement to the CGI application.
- Due to the changes necessary for handling large numbers of fields, an indri Repository built with this version is not backwards compatible with Lemur version 4.3. All indri indexes need to be rebuilt to ensure proper behavior. Indices built with version 4.3.1 should not be affected.
- The indri Java wrappers have been refactored to facilitate integrating with the new Lemur Toolkit Java wrappers. The package edu.umass.cs.indri has been renamed lemurproject.indri. The package edu.umass.cs.indri.ui has been renamed lemurproject.indri.ui. Users of the indri Java wrappers will need to refactor their code accordingly.
- Applications compiled with the Lemur Toolkit require the following libraries: z, iberty, pthread, and m on linux, and additionally socket and nsl on solaris. Applications built in Visual Studio require the additional library wsock32.lib.
- We have tested using GCC 3.2 (solaris), 3.2.2 (linux), 3.4 (linux), 3.4.3 (linux x86_64), 4.0.2 (linux), and VC++ .NET 7.1 (Windows XP).
- Problem: UTF8Transcoder.hpp and UTF8CaseNormalization.cpp exercise an optimizer bug in GCC 3.3.1 on linux, exhausting virtual memory.
Solution: Refactor both to initialize their hash tables in a way that does not exercise the bug.
- Problem: JAVADOC binary path is not set when --with-javahome is not supplied to configure.
Solution: Set path to JAVADOC in configure.
- Problem: Indri query language parser fails to parse #any:fieldname.
Solution: Correct the ambiguity in the query parser grammar. Also add the syntax #any(fieldname), which will supersede the #any:fieldname syntax.
- Problem: Indri query language parser fails to correctly parse the #date: operators.
Solution: Rename the #date: operators to #datebefore, #datebetween, #dateafter. Rework the query grammar accordingly. Add the operator #dateequals to the query grammar as well.
- Problem: Indri query language parser fails to correctly parse some of the example formats for date expressions.
Solution: Rework the query grammar to correctly handle the example formats.
- Problem: Indri query language ODNode fails to copy the node name when duplicating a node, causing the matching extents for the node to be omitted when using QueryEnvironment::runAnnotatedQuery.
Solution: Copy the node name when duplicating.
- Problem: Files created for an indri repository have 0600 (user only read/write) permissions.
Solution: Have files created use the user's umask to set the file permissions.
- Problem: DocumentStructure and MemoryIndex emit spurious debugging output.
Solution: Remove the spurious debugging output.
- Problem: DiskKeyfileVocabularyIterator skips the first entry in the list.
Solution: Refactor to not advance the list pointer twice when initializing iteration.
- Problem: Indri csharp wrappers have incorrect module name, causing the wrong shared library name to be specified, which prevents loading.
Solution: Set the module name correctly.
- Problem: Indri csharp wrappers for some vector types generated incorrect access wrapper code, preventing use of the objects.
Solution: Rework the SWIG interface to correctly wrap the vector types.
- Problem: If SWIG is not set by configure, the Makefile still attempts to run the program during build.
Solution: Don't run $(SWIG) if it is not set to an executable pathname.
- Problem: site-search crawler can get stuck in a crawler trap on some wiki pages.
Solution: Rework the exclusion patterns for wiki pages to prevent fetching the offending pages.
- Problem: DeletedDocumentList rewrites its data, even if there were no changes to the Repository.
Solution: Only write out the deleted document list on initial build, and when it has changed.
- Problem: site-search php interface is somewhat ugly.
Solution: Rework the style sheet and output format to more closely match Google, Yahoo, MSN, etc. style output.
- Problem: CGI would place extraneous characters into query box after a search.
Solution: Code modified to ensure that a copy of the original query is preserved.
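
The file permission fix above relies on standard POSIX behavior: the mode requested when a file is created is masked by the process umask. The following is a minimal Python sketch of that mechanism only, not the indri code; the helper names create_file and demo are hypothetical, and it assumes a POSIX system with no default ACLs on the temporary directory.

```python
import os
import stat
import tempfile


def create_file(path, requested_mode):
    """Create an empty file with requested_mode and return its actual mode.

    The kernel clears every permission bit that is set in the process
    umask, so the resulting mode is requested_mode & ~umask.
    """
    fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_EXCL, requested_mode)
    os.close(fd)
    return stat.S_IMODE(os.stat(path).st_mode)


def demo(umask_value=0o022):
    """Compare a hard-coded 0600 request with a 0666 request under a umask."""
    old = os.umask(umask_value)  # os.umask returns the previous mask
    try:
        with tempfile.TemporaryDirectory() as d:
            # Hard-coded owner-only mode: the umask cannot widen it.
            fixed = create_file(os.path.join(d, "fixed"), 0o600)
            # Permissive request: the umask decides the final mode.
            by_umask = create_file(os.path.join(d, "by_umask"), 0o666)
            return fixed, by_umask
    finally:
        os.umask(old)  # restore the caller's umask
```

Requesting a permissive mode and letting the umask narrow it leaves the decision with the user: a umask of 022 produces group- and world-readable files, while a umask of 077 still produces owner-only files.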
This is a list of bugs and known problems with the current version of Lemur (4.3.2) and Indri (2.3.2). Many problems have fixes or workarounds that are posted on the forum. Please check there if you do not see something here.
- No known problems currently exist.
- Child, parent, and ancestor references have been added to the Indri Query Language. See the Indri Query Language Reference for more detail.
- The summaries created for search results by the CGI application were messy and allowed HTML markup tags. They have been cleaned up and now mirror the PHP version of the summary generation code.
- Problem: The file site-search/crawl-index.in does not copy the file excluded_hosts to the crawl directory, causing a crawl to fail.
Solution: Add excluded_hosts to the files to be copied.
- Problem: The file site-search/cgi/Makefile contains a typo that causes make install to fail.
Solution: Change REAME to README in the install target.
- Problem: Indri query parser fails to parse #date operators correctly.
Solution: Update grammar rule for #date operators.
- Problem: NumericFieldAnnotator overwrites numeric values specified in offset annotations.
Solution: For numeric fields given in offset annotations, the field parameter for the given field needs to specify a different parserName parameter, e.g. <parserName>OffsetAnnotationAnnotator</parserName>
- Problem: contentLength is off by one in some document extractors.
Solution: Don't include the trailing null in the content string length.
- Problem: Calling DocumentStructure.loadStructure() with nothing in the field vector causes errors.
Solution: Don't resize the vector when the number of nodes is negative.
- Problem: Compilation failures on Mac OS X due to missing "isnan()" function used in ShrinkageBeliefNode.cpp.
Solution: Add additional checks in configure to enable proper compilation.
- Problem: UTF8CaseNormalizationTransformation uses excessive memory.
Solution: The _buffers_allocated needs to be cleared on each call to transform, rather than once in the transformation's destructor.
- Problem: LemurIndriIndex::document fails on OOV docid.
Solution: Return OOV (0) for OOV document ids.
- Problem: Offset annotation annotator consumes excessive memory.
Solution: Delete offset annotation annotators when finished with them in IndexEnvironment.
- Problem: Keys are propagated incorrectly up the tree on block splits in split_block. If the propagated key replaces (insertions are OK) either the first key of a new left block or the last key of a new right block, and has a different prefix than the key it is replacing, the prefix can be set incorrectly in the new block, which means that all of the keys in that block will be wrong.
Solution: Set the prefix correctly.
- Problem: Specifying an offset annotation file when the corpus path is a directory tries to open multiple non-existent annotation files.
Solution: Check whether a file exists before trying to open it.
- Problem: arabic stemmer parameter incorrect.
Solution: Extraneous _ in arabic_light10_stop removed.
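
The contentLength fix above is a classic off-by-one: a buffer holding a null-terminated string is one byte longer than its content. The following minimal Python sketch illustrates the idea only; the function name content_length is hypothetical and is not part of the Lemur or indri API.

```python
def content_length(buffer: bytes) -> int:
    """Return the length of the content held in buffer.

    A single trailing NUL terminator, if present, is not counted,
    because it marks the end of the string rather than being content.
    """
    if buffer.endswith(b"\x00"):
        return len(buffer) - 1
    return len(buffer)
```

For the six-byte buffer b"hello\x00" this reports a content length of 5, where counting the whole buffer would report 6.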
- Improved documentation and tutorials
- Out of the box site search, complete with crawler, cgi, and PHP UI. See site search for details.
- Improved JNI support for using the Lemur toolkit with Java applications. The indri API Java wrappers that were previously only available in the indri standalone distribution are now available within the Lemur Toolkit.
- C# wrapper support for using the Lemur toolkit and indri with C# applications.
- PHP wrapper support for using indri with PHP applications.
- Java retrieval and indexing GUIs are now part of the main Lemur Toolkit distribution. The retrieval GUI has been rewritten to use the new Java wrappers.
- The Indri applications harvestlinks, dumpindex, and makeprior have been added to the Lemur toolkit distribution.
The following classes are being marked as deprecated and will be removed from a future revision of the Lemur Toolkit.
- IndexWithCat, BasicIndexWithCat -- There is no application that supports constructing one of these, nor interacting with one if it could be built. Support for categories can be realized with metadata in an indri repository, or as field data in an indri repository, either inline, or specified as offset annotations.
- FlattextDocMgr -- Scales poorly, loads entire lookup table into memory. Functionality it provides is provided by KeyfileDocMgr, ElemDocMgr, and an indri repository.
- InvIndex, InvFPIndex, InvPushIndex, InvFPPushIndex, InvIndexMerge, InvFPIndexMerge, InvFPTextHandler, InvTermList, InvFPTermPropList -- Scales poorly, loads entire vocabulary, etc. into memory. Functionality is provided by KeyfileIncIndex and an indri repository.
- IncFPPushIndex, IncFPTextHandler, IncIndexer(application) -- Incremental versions of prior. Same scale problems, incremental functionality provided by KeyfileIncIndex and an indri repository.
- InvPassagePushIndex, InvPassageTextHandler, IncPassagePushIndex, IncPassageTextHandler, PassageIndexer (application) -- Fixed window passage indexes for prior. Same scale problems, adds little value.
