The Lemur Toolkit

The Lemur Toolkit APIs have been deprecated. The final release of the Lemur Toolkit is version 4.12, released June 21, 2010.

The Lemur Toolkit is designed to facilitate research in language modeling and information retrieval (IR), where IR is broadly interpreted to include such technologies as ad hoc and distributed retrieval with structured queries, cross-language IR, summarization, filtering, and categorization. The system's underlying architecture was built to support the technologies above. We provide many useful sample applications, but have designed the toolkit to allow you to easily program your own customizations and applications.

Features

Sophisticated structured query languages (using InQuery and Indri)
Support for XML and structured document retrieval
Commonly used with a wide range of research test collections (e.g., TREC CDs 1-5, wt10g, RCV1, gov, gov2)
"Out-of-the-box" site search capability for indexing your own web pages
Interactive interfaces for Windows, Linux, and Web
Distributed information retrieval and document clustering applications
Cross-platform, fast and modular code written in C++
C++, Java and C# APIs
Free and open-source software
In use for over 6 years by a large and growing user community

Indexing

Multiple indexing methods for small, medium and large-scale (terabyte) collections
Built-in support for English, Chinese and Arabic text
Porter and Krovetz word stemming
Incremental indexing
Out-of-the-box indexing support for TREC Text, TREC Web, plain text, HTML, XML, PDF, MBox, Microsoft Word, and Microsoft PowerPoint
Indexes inline and offset text annotations (e.g., part-of-speech and named entities)
Indexes document attributes

Retrieval

Supports major language modeling approaches such as Indri and KL-divergence, as well as vector space, tf.idf, Okapi and InQuery
Relevance and pseudo-relevance feedback
Wildcard term expansion (using Indri)
Passage and XML element retrieval
Cross-lingual retrieval
Smoothing via Dirichlet priors and Markov chains
Supports arbitrary document priors (e.g., PageRank, URL depth)
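
As an illustration of the Dirichlet prior smoothing listed above, the following is a minimal, self-contained C++ sketch of query-likelihood scoring with a Dirichlet-smoothed document language model. It does not use the Lemur API: the names TermCounts, totalCount, dirichletProb and queryLogLikelihood are hypothetical, and the smoothing parameter mu is assumed to be chosen by the caller. The estimate used is P(term | doc) = (tf + mu * P(term | collection)) / (|doc| + mu).

```cpp
#include <cassert>
#include <cmath>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical type for this sketch: term -> occurrence count,
// used both for a single document and for the whole collection.
using TermCounts = std::unordered_map<std::string, long>;

// Total number of term occurrences in a count table.
long totalCount(const TermCounts& counts) {
    long total = 0;
    for (const auto& entry : counts) total += entry.second;
    return total;
}

// Dirichlet-smoothed estimate of P(term | document):
//   (tf(term, doc) + mu * P(term | collection)) / (docLen + mu)
double dirichletProb(const std::string& term,
                     const TermCounts& doc, long docLen,
                     const TermCounts& collection, long collectionLen,
                     double mu) {
    assert(collectionLen > 0 && mu > 0.0);
    const auto d = doc.find(term);
    const auto c = collection.find(term);
    const double tf = (d == doc.end()) ? 0.0 : static_cast<double>(d->second);
    const double cf = (c == collection.end()) ? 0.0 : static_cast<double>(c->second);
    const double pCollection = cf / static_cast<double>(collectionLen);
    return (tf + mu * pCollection) / (static_cast<double>(docLen) + mu);
}

// Log-likelihood of a query under the smoothed document model.
// Query terms that never occur in the collection have probability zero
// even after smoothing; this sketch simply skips them.
double queryLogLikelihood(const std::vector<std::string>& query,
                          const TermCounts& doc,
                          const TermCounts& collection,
                          double mu) {
    const long docLen = totalCount(doc);
    const long collectionLen = totalCount(collection);
    double score = 0.0;
    for (const auto& term : query) {
        const double p = dirichletProb(term, doc, docLen,
                                       collection, collectionLen, mu);
        if (p > 0.0) score += std::log(p);
    }
    return score;
}
```

Documents would be ranked by descending queryLogLikelihood. Because the collection model contributes mu pseudo-counts to every document, a query term missing from a document lowers its score instead of eliminating it, and a larger mu pulls short documents more strongly toward the collection distribution.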

Download

Lemur can be obtained from the SourceForge Lemur Project page.

Release History