Lemur 4.7 release notes
4.7 (Jun 23, 2008)
- 4.7 corrects various issues in the 4.6 distribution package and adds a relevance judgment UI to the Lemur Retrieval UI, a Java-based trec_eval alternative, a PageRank (TM http://www.google.com/technology/) application, a performance-enhanced harvestlinks, a Firefox query log toolbar, a SOAP server for Indri repositories, and more.
- Applications compiled with the Lemur Toolkit require the following libraries: z, iberty, pthread, and m on Linux, and additionally socket and nsl on Solaris. Applications built in Visual Studio require the additional library wsock32.lib. The Java jar files were built with Java 5 (jdk 1.5.0). The Java UIs require Java 5. We have tested using GCC 3.2 (Solaris), 3.2.2 (Linux), 3.4 (Linux), 3.4.3 (Linux x86_64), 4.0.2 (Linux), VC++ .NET 7.1 (Windows XP), and Visual Studio 2005 (Windows XP).
Bugs Fixed (4.7):
(see SourceForge for the complete tickets):
| Bug # | Issue |
|---|---|
| 1866819 | docPool used uninitialized, producing an error on VS2005. Initialize to NULL. |
| 1866820 | SimpleKLQueryModel loads OOV terms causing segfault. If an OOV term is put into the model, the retrieval step will fetch the document list for it, receiving a NULL, causing a segfault. |
| 1866821 | String.hpp fails to compile with GCC 4.1. The ostream operators need a second declaration outside of the String class and inside the namespace. |
| 1866915 | CORI retrieval method produces scores > 1.0. The value of rmax for score adjustment does not take into account the weight of each query term. |
| 1878939 | UTF8CaseNormalizationTransformation::transform doesn't copy. The buf_index is not advanced as new characters are copied. |
| 1893631 | harvestlinks yields incorrect anchor text due to the HTMLParser not adjusting the token counters after inserting the original URL into the document's terms vector. Subsequent access to the anchor text terms drifts backwards through the terms array as hrefs are processed. |
| 1893632 | harvestlinks strips url parameters due to HTMLParser::normalizeURL removing parameters from the url. |
| 1893637 | Combiner produces duplicate entries. When running harvestlinks, the Combiner will process the final entry in a file twice, producing duplicated output and incorrect link counts. |
| 1897910 | backward indexing empty metadata yields Exception. addDocument needs to test the value to ensure that it is not empty before trying to add it to the reverse lookup table. |
| 1907126 | LemurIndriIndex fails to open repository with extension. If the Indri repository has a file extension, e.g., CACM.index, LemurIndriIndex will fail to open the repository because it strips the extension from the path. |
| 1911091 | Windows lemur.lib missing dependencies in installer. The combined lemur.lib built for the binary install on Windows does not include the contrib libraries: zlib.lib, antlr.lib, xpdf.lib. This can cause linking of projects built against lemur.lib to fail with undefined externals. |
| 1911208 | wildcard operator fails on numerics. The query grammar has been updated to allow numerics to appear in wildcard expressions. |
| 1913665 | keyfile: **prefix_simple_insert failed in replace_max_key. When using long keys, such as URLs, the size needed to insert a key can be overestimated, causing prefix_simple_insert to fail. |
| 1926060 | path queries fail to retrieve documents due to the TagList not setting the parent field for the first tag in the outermost containing scope. |
| 1927219 | nexi language queries yield default scores. The ShrinkageBeliefNode is using the wrong Extent constructor, passing a weight of 1, which is being interpreted as the begin. Signature mismatch due to constructor changes in Extent when adding the parent field. |
| 1927244 | nexi language queries yield incorrect scores when a query term occurs in the last inner field of the document. |
| 1927493 | "about" cannot appear as a term in a nexi query. Add a special case to rawText to enable the ABOUT keyword to appear as a term. |
| 1937678 | Adding an index twice to a QueryEnvironment. If an index is added twice to a query environment (a logical error), it is impossible to remove both instances via the removeIndex API. Change addIndex and addServer to silently ignore an entry that has already been added. |
| 1950814 | Out-of-bounds memory accesses in Krovetz stemmer. Several suffix rules access word[j-1] without regard to the possibility that j == 0. |
| 1988909 | bits/atomicity.h not found for GCC 4.2+. Configure and atomic.hpp updated to find ext/atomicity.h. |
| 1993141 | Memory leak in LemurIndriIndex plugged. |
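The fix described for bug 1937678 (ignore an entry that has already been added) can be illustrated with a minimal sketch. This is not the Indri QueryEnvironment code: the class `IndexRegistry` and its methods are hypothetical names used only to show the idea, and Python is used purely for illustration.

```python
class IndexRegistry:
    """Hypothetical stand-in for the list of indexes held by a query environment."""

    def __init__(self):
        self._paths = []

    def add_index(self, path):
        # Silently ignore a path that is already registered, so a single
        # remove_index call is always enough to drop it again.
        if path in self._paths:
            return
        self._paths.append(path)

    def remove_index(self, path):
        # Removing an unknown path is a no-op.
        if path in self._paths:
            self._paths.remove(path)

    def paths(self):
        # Return a copy so callers cannot mutate the internal list.
        return list(self._paths)


registry = IndexRegistry()
registry.add_index("CACM.index")
registry.add_index("CACM.index")   # duplicate add is ignored
registry.remove_index("CACM.index")
print(registry.paths())
```

Because the duplicate add is dropped, one remove leaves the registry empty, which is the behavior the ticket asks for.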
Known Problems:
This is a list of bugs and known problems with the current version of Lemur (4.7) and Indri (2.7). Many problems have fixes or workarounds that are posted on the Lemur Forums. There may also be open bug tickets on SourceForge; see https://sourceforge.net/tracker/?group_id=161383&atid=819615 for the complete tickets. Please check there if you do not see something here.
- No known problems currently exist.
Enhancements (4.7):
- ireval: Implemented in Java, ireval provides a command line interface similar to trec_eval. In addition to all of the metrics provided by trec_eval, ireval also computes Normalized Discounted Cumulative Gain (NDCG). Additionally, ireval can be used to compare the performance of a pair of system outputs, providing the paired t-test, Wilcoxon's Sign test, and randomization test for statistical significance testing. Ireval's output has been validated against trec_eval v8.1. A future version of ireval will include a graphical user interface and report generation.
- PageRank: Computes floating point raw scores, binned integer page ranks, and prior probabilities suitable for installation in an Indri repository via the makeprior application.
- Harvestlinks: Version 4.7 includes an updated version of the Harvestlinks utility. The updated version utilizes the already present keyfile code that is actively maintained within the Lemur Toolkit as well as streamlining the link extraction and sorting process. The new version also provides speed improvements especially when harvesting links from large data sets. The rewritten code and standardized classes used with the new version will also make it easier to maintain the code for future optimizations.
- Query log toolbar: The Lemur Query Log Toolbar is a Firefox add-on that monitors a variety of user actions, collects the data, and allows the aggregation of these logs through a Query Log Server. On the client side, a set of configuration options enables researchers and users to specify toolbar behavior. Several privacy filters are in place that enable users to specify information that will not be collected and/or shared. A version of the toolbar for Internet Explorer is planned for the December 2008 release.
- Query Log Toolbar Server: The server side, the Lemur transaction database, is a set of Java servlets that aggregate the user log data into a MySQL database. The servlets, hosted via a servlet container such as Apache Tomcat, also include several configuration options for specifying the database connection and the levels of privacy allowed on the client toolbar side. Also included is an application to run a server if you do not have a servlet container.
- Indri SOAP server: Provides a web service that allows clients to add a document to an index, delete a document from an index, retrieve document vectors from an index, and query an index in a language-independent fashion. This enables building user interfaces that access indri indexes in the developer's language of choice, such as Python, PHP 5, Ruby, C#, or Java.
- Relevance judgment UI: To support the creation of relevance data for evaluating queries and datasets, we have augmented the Java-based retrieval UI for the Lemur Toolkit to allow a user to specify scores for query results. The queries, scores, and resulting document IDs can then be exported to a qrel file on disk for use with the standard trec_eval utility. Furthermore, previous qrel judgments can be loaded and manipulated by the program before being saved to disk.
- Mac OS X Binary Intel Image: A Mac OS X Intel binary install disk image has been added for users who just want to run the Lemur Toolkit applications.
- Lemur CGI: Various query speedups, URL normalization and code tidying to Indri style searches.
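For readers unfamiliar with the NDCG metric mentioned under ireval above, the following is a minimal sketch of one common formulation: each graded relevance value is used directly as the gain, and the result at rank i is discounted by log2(i + 1). These notes do not state which gain and discount functions ireval uses, so this formulation is an assumption; the function names `dcg` and `ndcg` are illustrative and are not part of ireval.

```python
import math


def dcg(gains):
    """Discounted cumulative gain of graded relevance values in ranked order."""
    # The result at 1-based rank i is discounted by log2(i + 1).
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))


def ndcg(ranked_gains, judged_gains):
    """DCG of the ranking divided by the DCG of the ideal ordering.

    ranked_gains: relevance of each returned document, in rank order.
    judged_gains: relevance of every judged document for the query.
    """
    ideal_dcg = dcg(sorted(judged_gains, reverse=True))
    # A query with no relevant documents gets a score of 0 by convention here.
    return dcg(ranked_gains) / ideal_dcg if ideal_dcg > 0 else 0.0


judged = [3, 2, 1]
print(ndcg([3, 2, 1], judged))  # ideal ordering
print(ndcg([1, 2, 3], judged))  # reversed ordering scores lower
```

A ranking that returns the judged documents in decreasing order of relevance scores 1.0, and any other ordering of the same documents scores lower.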



