The Lemur Toolkit
The Lemur Toolkit APIs have been deprecated. The final release of the Lemur Toolkit is version 4.12, dated June 21, 2010.
The Lemur Toolkit is designed to facilitate research in language modeling and information retrieval (IR). Here IR is interpreted broadly to include such technologies as ad hoc and distributed retrieval with structured queries, cross-language IR, summarization, filtering, and categorization, and the system's underlying architecture was built to support them. We provide many useful sample applications, and the toolkit is designed so that you can easily program your own customizations and applications.
- Sophisticated structured query languages (using InQuery and Indri)
- Support for XML and structured document retrieval
- Commonly used with a wide range of research test collections (e.g., TREC CDs 1-5, wt10g, RCV1, gov, gov2)
- Index your web pages with an "out-of-the-box" site search capability
- Interactive interfaces for Windows, Linux, and Web
- Distributed information retrieval and document clustering applications
- Cross-platform, fast and modular code written in C++
- C++, Java and C# APIs
- Free and open-source software
- In use for over 6 years by a large and growing user community
Indexing
- Multiple indexing methods for small, medium and large-scale (terabyte) collections
- Built-in support for English, Chinese and Arabic text
- Porter and Krovetz word stemming
- Incremental indexing
- Out-of-the-box indexing support for TREC Text, TREC Web, plain text, HTML, XML, PDF, MBox, Microsoft Word, and Microsoft PowerPoint
- Indexes inline and offset text annotations (e.g., part-of-speech and named entities)
- Indexes document attributes
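
To make the indexing features above more concrete, here is a minimal sketch of a positional inverted index that accepts documents one at a time, which is the basic idea behind incremental indexing. It is an illustration only and does not use or reproduce the Lemur Toolkit's index formats or APIs. The class name `InvertedIndex`, its methods, and the whitespace tokenizer are all assumptions made for this example.

```python
from collections import defaultdict


class InvertedIndex:
    """Toy positional inverted index (hypothetical; not the Lemur API)."""

    def __init__(self):
        # term -> {doc_id: [token positions]}
        self.postings = defaultdict(dict)
        # doc_id -> number of tokens, needed later by most ranking functions
        self.doc_lengths = {}

    def add_document(self, doc_id, text):
        """Add one document; earlier documents are left untouched."""
        if doc_id in self.doc_lengths:
            raise ValueError(f"document {doc_id!r} is already indexed")
        # Assumption: lowercase + whitespace split stands in for a real tokenizer.
        tokens = text.lower().split()
        for position, token in enumerate(tokens):
            self.postings[token].setdefault(doc_id, []).append(position)
        self.doc_lengths[doc_id] = len(tokens)

    def lookup(self, term):
        """Return {doc_id: [positions]} for a term, or an empty dict."""
        return self.postings.get(term.lower(), {})


index = InvertedIndex()
index.add_document("d1", "Lemur indexes text")
# A second document can be added later without rebuilding the index.
index.add_document("d2", "Indri indexes annotated text")
print(index.lookup("indexes"))
```

Storing token positions rather than only document identifiers is what makes phrase and proximity matching possible; a real system would also add stemming and an on-disk representation.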
Retrieval
- Supports major language modeling approaches such as Indri and KL-divergence, as well as vector space, tf.idf, Okapi and InQuery
- Relevance- and pseudo-relevance feedback
- Wildcard term expansion (using Indri)
- Passage and XML element retrieval
- Cross-lingual retrieval
- Smoothing via Dirichlet priors and Markov chains
- Supports arbitrary document priors (e.g., PageRank, URL depth)
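
Two of the retrieval features above, Dirichlet-prior smoothing and document priors, can be sketched with the standard query-likelihood formula, in which a document's term probability is estimated as (tf + mu * P(term | collection)) / (document length + mu). The code below is a minimal illustration of that formula, not the toolkit's implementation. The function name `dirichlet_score`, the value of `mu`, and the sample documents are assumptions made for this example.

```python
import math
from collections import Counter


def dirichlet_score(query_terms, doc_terms, collection_tf, collection_len,
                    mu=2500.0, log_prior=0.0):
    """Query log-likelihood with Dirichlet smoothing (hypothetical helper).

    log_prior is an optional log document prior added to the score.
    """
    doc_tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = log_prior
    for term in query_terms:
        p_collection = collection_tf.get(term, 0) / collection_len
        if p_collection == 0.0:
            # Assumption: terms unseen in the whole collection are skipped.
            continue
        # Dirichlet-smoothed estimate of P(term | document).
        p_term = (doc_tf[term] + mu * p_collection) / (doc_len + mu)
        score += math.log(p_term)
    return score


# Made-up sample data for illustration only.
docs = {
    "d1": "language modeling for information retrieval".split(),
    "d2": "structured document retrieval with xml".split(),
}
collection_tf = Counter(term for terms in docs.values() for term in terms)
collection_len = sum(collection_tf.values())

query = ["language", "retrieval"]
scores = {
    name: dirichlet_score(query, terms, collection_tf, collection_len)
    for name, terms in docs.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Because the prior enters as an additive log term, any per-document quantity that can be expressed as a probability can be plugged in through `log_prior` without changing the smoothing code.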