Lemur 4.12 release notes

4.12 (June 21, 2010)

(Older versions: 1.1, 1.9, 2.0, 2.1, 2.2, 3.0, 3.1, 4.0, 4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 4.7, 4.8, 4.9, 4.10, 4.11)


Enhancements :

Bugs Fixed :

(see SourceForge for the complete tickets):

3014524
Update the Google parser for the query log toolbar server.
3014521
Query log toolbar server can now be run with an optional hostname parameter, which will be used instead of localhost if specified.
3013328
Fix crash on large queries in the CGI.
3013325
Fix CGI snippets.
3013315
Fix crash in CGI when fewer than 50 documents are returned.
3013313
Fix CGI to get document text when using multiple Indri indexes.
3004284
Fix memory leaks in QueryEnvironment::expressionCount and QueryEnvironment::expressionList.
3000138
Fix snippet generation with queries that use the #max operator.
2989973
Prevent SIGPIPE being raised in IndriDaemon.
2985880
Prevent field-restricted queries when using the non-LM baseline retrieval.
2982858
Modify the query parser to transform hyphenated terms into #1 expressions. In terms of query matching, this most closely reproduces the effect of splitting tokens on hyphens at indexing time.
2979952
Fix error in termCountUnique for repositories with more than one internal index.
2979952
Fix the iterator to handle Windows-format documents, whose \r\n line endings caused it to fail to match the closing document tag.
2939425
Fix libxpdf linking with GCC 4.3+.
2939370
Fix processing of Heritrix-generated WARC files.
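
For ticket 2982858 above, the rewrite of hyphenated terms can be illustrated with a small sketch. This is not the Indri parser code: the function name rewrite_hyphenated_terms and the token pattern (ASCII letters and digits joined by single hyphens) are assumptions made for illustration only. It shows the general idea that a hyphenated term becomes a #1 ordered-window expression over its parts.

```python
import re

# Hypothetical helper, not part of the Lemur/Indri API.
# Assumed token shape: runs of ASCII letters/digits joined by single hyphens.
_HYPHENATED = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)+")


def rewrite_hyphenated_terms(query: str) -> str:
    """Replace each hyphenated token with a #1(...) expression over its parts."""

    def to_ordered_window(match: re.Match) -> str:
        # Split the matched token on hyphens and wrap the parts in #1( ... ).
        parts = match.group(0).split("-")
        return "#1(" + " ".join(parts) + ")"

    return _HYPHENATED.sub(to_ordered_window, query)


if __name__ == "__main__":
    print(rewrite_hyphenated_terms("cross-lingual retrieval"))
```

Terms without hyphens pass through unchanged, so a query that contains no hyphenated tokens is returned as-is.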

Feature Requests Implemented :

(see SourceForge for the complete tickets):

3013321
Add a query timeout parameter that can be set from the config file to the CGI.
2935868
Add setOrdinalField and setParentalField to the IndexEnvironment Java API wrapper.

Deprecations :

The 1.x UIMA processors have been deprecated and removed.

The following components have been deprecated and will be removed in a subsequent release, most likely Dec. 2010.

  1. The distributed retrieval module, including the distrib subdirectory and the applications CollSelIndex, DistRetEval, and QryBasedSample.
  2. The summarization module, including the summarization subdirectory and the applications BasicSummApp and MMRSummApp.
  3. The cross-lingual retrieval application, XLingRetEval, and its support classes, XLingDocModel, XLingRetMethod, PDict, and the application PDictManager.
  4. The DocumentManager API, including ElemDocMgr, KeyfileDocMgr, IndriDocMgr, DocMgrMgr, and the application BuildDocMgr. Applications which used this API will need to be updated to use the Indri ParsedDocument API.
  5. The KeyfileIncIndex and associated support files. The only index type will be the Indri repository. Users of the indexType key will have to reindex their data with IndriBuildIndex. As the Indri repository provides stemming and stopping internally, the separate step of stemming and stopping queries is not required. The associated support files include the Lemur TextHandler chain APIs, Parser, Stemmer, Stopper. Classes that will be removed are: ArabicParser, ArabicStemmer, Arabic_Stemmer, BrillPosParser, BrillPosTokenizer, ChineseCharParser, ChineseParser, IdentifinderParser, InqArabicParser, KStemmer, PorterStemmer, ReutersParser, TextHandlerManager, TrecParser, WebParser, and WriterTextHandler. The applications ParseToFile, ParseQuery, BuildPropIndex, and BuildIndex will be removed. Until such time as RetEval is removed, users of that application should prepare their input queries in BasicDoc format.
  6. The Java GUI applications LemurIndex and LemurRet. The IndexUI and RetUI applications provide the same functionality via the Indri API.
  7. Ancillary applications: dumpDoc, dumpTerm, EstimateDirPrior, GenL2Norm, RetQueryClarity, and TwoStageRetEval.

The following components have been deprecated and will be removed in a subsequent release. Some of these will be rewritten to use the Indri API, with the goal of having all of the components capable of handling GOV2-scale (25M pages, greater than MAX_INT32 terms) collections. The Lemur Project would welcome user community contributions in these areas.

  1. The clustering component, including the cluster subdirectory and the applications Cluster, OfflineCluster, and PLSA. The k-means implementation (OfflineCluster) will be updated to handle GOV2-scale collections.
  2. RelFBEval. The IndriRunQuery application will be updated to enable running true relevance feedback experiments before this application is removed.
  3. QueryClarity. This application will be re-implemented to enable use of GOV2-scale collections.