Lemur 4.12 release notes
4.12 (June 21, 2010)
- 4.12 corrects various issues in the 4.11 distribution package and provides improvements to Indri, the Lemur CGI, the query log toolbars and server, and more.
- Applications compiled with the Lemur Toolkit require the following libraries on Linux: z, iberty, pthread, and m. Applications built in Visual Studio require the additional library wsock32.lib. The Java jar files were built with Java 6 (jdk 1.6.0). The Java UIs require Java 6. We have tested using GCC 3.4.6 (CentOS 4.2 Linux x86_64), 4.1.2 (CentOS 5.3 Linux), 4.3.3 (Ubuntu 10.04 Linux), 4.2.1 (OS/X), Visual Studio 2005 (Windows XP), and Visual Studio 2008 (Windows Vista, WIN32 and x86_64).
Enhancements :
Bugs Fixed :
(see SourceForge for the complete tickets):
- 3014524
- Update google parser for query log toolbar server.
- 3014521
- Query log toolbar server can now be run with an optional hostname parameter, which will be used instead of localhost if specified.
- 3013328
- Fix crash on large queries in the CGI.
- 3013325
- Fix CGI snippets.
- 3013315
- Fix crash in CGI when fewer than 50 documents are returned.
- 3013313
- Fix CGI to get document text when using multiple indri indexes.
- 3004284
- Fix memory leaks in QueryEnvironment::expressionCount and QueryEnvironment::expressionList.
- 3000138
- Fix snippet generation with queries that use the #max operator.
- 2989973
- Prevent SIGPIPE being raised in IndriDaemon.
- 2985880
- Prevent field-restricted queries when using the non-LM baseline retrieval.
- 2982858
- Modify the query parser to transform hyphenated terms into #1 expressions. In terms of query matching, this is the closest equivalent to the way tokens are split on hyphens at indexing time.
- 2979952
- Fix error in termCountUnique for repositories with more than one internal index.
- 2979952
- Fix the iterator to be aware of line endings. Windows-format documents, with \r\n line endings, caused the iterator to fail to match the closing document tag.
- 2939425
- Fix libxpdf linking with GCC 4.3+.
- 2939370
- Fix processing of Heritrix-generated WARC files.
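Two of the fixes above (the hyphenated-term rewrite in 2982858 and the \r\n line-ending fix) describe small text transformations that are easy to picture with a short example. The sketch below is only an illustration in Python, not the Indri query parser or document iterator: the function names, the regular expression, and the default `</DOC>` tag are assumptions made for this example.

```python
import re

# Hypothetical helpers for illustration only; these are NOT part of the
# Lemur/Indri code base and do not reproduce its parser or iterator.

# One or more word characters, followed by at least one "-word" group.
_HYPHENATED = re.compile(r"\b\w+(?:-\w+)+\b")


def rewrite_hyphenated_terms(query: str) -> str:
    """Rewrite each hyphenated term as an ordered-window #1(...) expression."""

    def to_phrase(match: re.Match) -> str:
        # "cross-lingual" -> ["cross", "lingual"] -> "#1(cross lingual)"
        parts = match.group(0).split("-")
        return "#1(" + " ".join(parts) + ")"

    return _HYPHENATED.sub(to_phrase, query)


def is_closing_doc_tag(line: str, tag: str = "</DOC>") -> bool:
    """Return True if the line is the closing tag, with LF or CRLF endings."""
    # Strip any trailing CR/LF characters before comparing, so a Windows
    # line ending does not prevent the match. The tag name is an assumption.
    return line.rstrip("\r\n") == tag


if __name__ == "__main__":
    print(rewrite_hyphenated_terms("cross-lingual retrieval"))
    print(is_closing_doc_tag("</DOC>\r\n"))
```

With the rewrite, a query term such as cross-lingual is matched as the two adjacent tokens cross and lingual, which is what an index built by splitting on hyphens would contain.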
Feature Requests Implemented :
(see SourceForge for the complete tickets):
- 3013321
- Add a query timeout parameter that can be set from the config file to the CGI.
- 2935868
- Add setOrdinalField and setParentalField to the IndexEnvironment Java API wrapper.
Deprecations :
The 1.x UIMA processors have been deprecated and removed.
The following components have been deprecated and will be removed in a subsequent release, most likely Dec. 2010.
- The distributed retrieval module, including the distrib subdirectory and the applications CollSelIndex, DistRetEval, and QryBasedSample.
- The summarization module, including the summarization subdirectory and the applications BasicSummApp and MMRSummApp.
- The cross-lingual retrieval application, XLingRetEval, and its support classes, XLingDocModel, XLingRetMethod, PDict, and the application PDictManager.
- The DocumentManager API, including ElemDocMgr, KeyfileDocMgr, IndriDocMgr, DocMgrMgr, and the application BuildDocMgr. Applications which used this API will need to be updated to use the Indri ParsedDocument API.
- The KeyfileIncIndex and associated support files. The only index type will be the Indri repository. Users of the indexType key will have to reindex their data with IndriBuildIndex. As the indri repository provides stemming and stopping internally, the separate step of stemming and stopping queries is not required. The associated support files include the Lemur TextHandler chain APIs, Parser, Stemmer, Stopper. Classes that will be removed are: ArabicParser, ArabicStemmer, Arabic_Stemmer, BrillPosParser, BrillPosTokenizer, ChineseCharParser, ChineseParser, IdentifinderParser, InqArabicParser, KStemmer, PorterStemmer, ReutersParser, TextHandlerManager, TrecParser, WebParser, and WriterTextHandler. The applications ParseToFile, ParseQuery, BuildPropIndex, and BuildIndex will be removed. Until such time as RetEval is removed, users of that application should prepare their input queries in BasicDoc format.
- The java GUI applications LemurIndex and LemurRet. The IndexUI and RetUI applications provide the same functionality via the Indri API.
- Ancillary applications, dumpDoc, dumpTerm, EstimateDirPrior, GenL2Norm, RetQueryClarity, and TwoStageRetEval.
The following components have been deprecated and will be removed in a subsequent release. Some of these will be rewritten to use the Indri API, with the goal of having all of the components capable of handling GOV2-scale (25M pages, greater than MAX_INT32 terms) collections. The Lemur Project would welcome user community contributions in these areas.
- The clustering component, including the cluster subdirectory and the applications Cluster, OfflineCluster, and PLSA. The k-means implementation (OfflineCluster) will be updated to handle GOV2-scale collections.
- RelFBEval. The IndriRunQuery application will be updated to enable running true relevance feedback experiments before this application is removed.
- QueryClarity. This application will be re-implemented to enable use of GOV2-scale collections.