Lemur Components
The Lemur Project has the following components and sub-projects. Click on the name to find out more about each one.
Indri
Indri is a search engine that provides state-of-the-art text search and a rich structured query language for text collections of up to 50 million documents (single machine) or 500 million documents (distributed search). Available for Linux and Windows.
Galago
Galago is a Java toolkit for experimenting with text search. It is based on small, pluggable components that are easy to replace and change, both during indexing and during retrieval.
Lemur Toolkit
The Lemur Toolkit is designed to facilitate research in language modeling and information retrieval (IR), where IR is broadly interpreted to include such technologies as ad hoc and distributed retrieval with structured queries, cross-language IR, summarization, filtering, and categorization. The system's underlying architecture was built to support these technologies. We provide many useful sample applications, but have designed the toolkit so that you can easily program your own customizations and applications. The final release of the Lemur Toolkit is version 4.12, released June 21, 2010.
Lemur Query Log Toolbar
A web browser plugin that captures user search and browsing behavior to support research on information-seeking behavior, learning to rank, and related topics. Available for Firefox and Internet Explorer.
RankLib
RankLib is a library of learning-to-rank algorithms. Full details are in the RankLib Documentation on the Lemur Project wiki.
ClueWeb09 Dataset
A dataset of 1 billion high-PageRank web pages in ten languages, collected in January and February 2009. The dataset was used by several tracks of the TREC conference in 2009 and 2010.
ClueWeb12 Dataset
A dataset of 1 billion English pages collected February through April 2012.