Lucindri
Lucindri is an open-source implementation of Indri search logic and structured query language using the Lucene Search Engine. Lucindri consists of two components: the indexer and the searcher.
Getting Started
Lucindri requires the 64-bit version of Java 11. If you don't have it already, download the Java 11 JDK.
Downloading executable jar files
Lucindri executable jar files are available from SourceForge, on the Lemur Project page.
Building from the source code
The Lucindri source code is available on GitHub.
To build the Lucindri source code, Apache Maven is required.
To get started, first clone trec-car-tools from the Trema Lab at UNH.
After cloning trec-car-tools, build it using Maven:
mvn clean install
Next, clone the lucindri repository and build each module with mvn clean install, in this order:
- LucindriAnalyzer
- LucindriSearcher
- LucindriIndexer
Lucindri Indexer
The main class in the indexer is org.lemurproject.lucindri.indexer.BuildIndex. This program takes a single properties file as its argument. See index.properties in the indexer directory for an example.
Description of indexing properties:
#implementation options
# documentFormat options = text, wsj, gov2, json, wapo, warc, trectext, cw09, cw12, car, marco
documentFormat=[text | wsj | gov2 | json | wapo | warc | trectext | cw09 | cw12 | car | marco]
# indexing platform options = lucene, solr
indexingPlatform=[lucene | solr]
#data options
dataDirectory=[Directory or file where data is]
indexDirectory=[Directory where index will be written]
indexName=[Name of the index]
#field options
#If indexFullText is set to true, a field with all document text is created. This is recommended.
#fulltext is the default field for queries if it is indexed
indexFullText=[true (recommended) | false]
fieldNames=[Comma separated list of field names to be stored (e.g. title, url, body)]
#analyzer options
stemmer=[kstem | porter | none]
removeStopwords=[true | false]
ignoreCase=[true | false]
#solr options - only needed if indexingPlatform=solr
host=[host name or IP address]
port=[port number]
Example index.properties:
#implementation options
# documentFormat options = text, wsj, gov2, json, wapo, warc, trectext, cw09, cw12, car, marco
documentFormat=cw09
indexingPlatform=lucene
#data options
dataDirectory=/usr/home/data/cw09data
indexDirectory=/usr/home/
indexName=CW09_lucindri_index
#field options
#If indexFullText is set to true, a field with all document text is created. This is recommended.
#fulltext is the default field for queries if it is indexed
indexFullText=true
fieldNames=title,url
#analyzer options
stemmer=kstem
removeStopwords=true
ignoreCase=true
#solr options - only needed if indexingPlatform=solr
#Not needed because lucene is selected as the indexing platform
The LucindriIndexer can be run from inside an IDE by invoking the main class (org.lemurproject.lucindri.indexer.BuildIndex), or from the jar file in the target directory. Use at least 2G of heap space (preferably 4G - 8G).
java -jar -Xmx4G LucindriIndexer-1.0-jar-with-dependencies.jar index.properties
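Because the properties file uses the standard Java properties syntax, it can be sanity-checked before starting a long indexing run. The following is a minimal sketch and is not part of Lucindri: the class name IndexPropertiesCheck and its rules (for example, treating stemmer as required) are assumptions derived only from the property descriptions above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Lucindri: checks an index.properties
// file against the option values documented above.
final class IndexPropertiesCheck {
    static final Set<String> FORMATS = Set.of("text", "wsj", "gov2", "json", "wapo",
            "warc", "trectext", "cw09", "cw12", "car", "marco");
    static final Set<String> PLATFORMS = Set.of("lucene", "solr");
    static final Set<String> STEMMERS = Set.of("kstem", "porter", "none");

    // Parses properties text; lines starting with '#' are treated as comments.
    static Properties parse(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p;
    }

    // Returns human-readable problems; an empty list means none were found.
    static List<String> validate(Properties p) {
        List<String> problems = new ArrayList<>();
        requireOneOf(p, "documentFormat", FORMATS, problems);
        requireOneOf(p, "indexingPlatform", PLATFORMS, problems);
        requireOneOf(p, "stemmer", STEMMERS, problems); // assumed to be required
        requirePresent(p, List.of("dataDirectory", "indexDirectory", "indexName"), problems);
        // host and port are only needed when Solr is the indexing platform.
        if ("solr".equals(p.getProperty("indexingPlatform", "").trim())) {
            requirePresent(p, List.of("host", "port"), problems);
        }
        return problems;
    }

    private static void requireOneOf(Properties p, String key, Set<String> allowed,
                                     List<String> problems) {
        String value = p.getProperty(key, "").trim();
        if (!allowed.contains(value)) {
            problems.add(key + " must be one of " + new TreeSet<>(allowed)
                    + " but was '" + value + "'");
        }
    }

    private static void requirePresent(Properties p, List<String> keys, List<String> problems) {
        for (String key : keys) {
            if (p.getProperty(key, "").isBlank()) {
                problems.add("missing " + key);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java IndexPropertiesCheck index.properties
        validate(parse(Files.readString(Path.of(args[0])))).forEach(System.out::println);
    }
}
```

Running the check on the example index.properties above should report no problems; it says nothing about whether the data directory actually contains documents in the declared format.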
Lucindri Searcher
The Lucindri Searcher has Indri Dirichlet and Jelinek-Mercer smoothing rules (a.k.a. Similarity in Lucene) implemented. The results are printed in TREC format.
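For readers unfamiliar with the two smoothing rules, the sketch below computes the standard textbook forms of Dirichlet and Jelinek-Mercer smoothed term probabilities. It illustrates the general formulas only and is not Lucindri's Similarity code; the class and parameter names are hypothetical, and it assumes that lambda is the weight placed on the collection model.

```java
// Hypothetical illustration of the two smoothing formulas; not Lucindri code.
final class SmoothingSketch {

    // Dirichlet smoothing: (tf + mu * P(w|C)) / (|D| + mu)
    static double dirichlet(double tf, double docLength, double collectionProb, double mu) {
        return (tf + mu * collectionProb) / (docLength + mu);
    }

    // Jelinek-Mercer smoothing: (1 - lambda) * tf / |D| + lambda * P(w|C)
    // Assumption: lambda weights the collection model (cf. "collectionLambda").
    static double jelinekMercer(double tf, double docLength, double collectionProb, double lambda) {
        double docProb = docLength > 0 ? tf / docLength : 0.0;
        return (1.0 - lambda) * docProb + lambda * collectionProb;
    }

    public static void main(String[] args) {
        // A term occurring 3 times in a 100-token document, with collection probability 0.001.
        System.out.println("dirichlet      = " + dirichlet(3, 100, 0.001, 2000));
        System.out.println("jelinek-mercer = " + jelinekMercer(3, 100, 0.001, 0.4));
    }
}
```

With Dirichlet smoothing, longer documents rely less on the collection model; with Jelinek-Mercer, the mixing weight is the same for every document.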
The main class in the searcher is org.lemurproject.lucindri.searcher.IndriSearch. It takes an XML parameter file, which contains the queries, as an argument. The query parameters follow the same format as Indri.
Retrieval Parameters
- index: path to an Indri Repository. Specified as <index>/path/to/repository</index> in the parameter file and as -index=/path/to/repository on the command line. This element can be specified multiple times to combine Repositories.
- count: an integer value specifying the maximum number of results to return for a given query. Specified as <count>number</count> in the parameter file and as -count=number on the command line.
- query: an Indri query language query to run. This element can be specified multiple times.
- rule: specifies the smoothing rule (TermScoreFunction) to apply.
- Format of the rule is: ( key ":" value ) [ "," key ":" value ]*
Valid methods:
- dirichlet (also 'd', 'dir') (default mu=2000)
- jelinek-mercer (also 'jm', 'linear') (default collectionLambda=0.4); collectionLambda is also known simply as "lambda"
Here is an example rule in parameter file format:
<rule>dirichlet:1000</rule>
OR
<rule>jm:.3</rule>
The first rule corresponds to Dirichlet smoothing with mu equal to 1000; the second corresponds to Jelinek-Mercer smoothing with lambda equal to 0.3.
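Rule strings in the form shown in the examples (a method name, a colon, then the parameter value) can be interpreted along the following lines. This is a minimal sketch, not Lucindri's own parser: the class SmoothingRule is hypothetical, and it handles only the short method:value form from the examples, using the aliases and defaults listed above.

```java
import java.util.Locale;

// Hypothetical parser for rule strings such as "dirichlet:1000" or "jm:.3".
final class SmoothingRule {
    final String method; // canonical name: "dirichlet" or "jelinek-mercer"
    final double value;  // mu for dirichlet, lambda for jelinek-mercer

    private SmoothingRule(String method, double value) {
        this.method = method;
        this.value = value;
    }

    static SmoothingRule parse(String rule) {
        String[] parts = rule.trim().split(":", 2);
        String name = parts[0].trim().toLowerCase(Locale.ROOT);
        String method;
        double defaultValue;
        switch (name) {
            case "d":
            case "dir":
            case "dirichlet":
                method = "dirichlet";
                defaultValue = 2000.0; // documented default mu
                break;
            case "jm":
            case "linear":
            case "jelinek-mercer":
                method = "jelinek-mercer";
                defaultValue = 0.4; // documented default collectionLambda
                break;
            default:
                throw new IllegalArgumentException("unknown smoothing method: " + name);
        }
        // Fall back to the documented default when no value is given.
        boolean hasValue = parts.length == 2 && !parts[1].isBlank();
        double value = hasValue ? Double.parseDouble(parts[1].trim()) : defaultValue;
        return new SmoothingRule(method, value);
    }

    public static void main(String[] args) {
        SmoothingRule r = parse("jm:.3");
        System.out.println(r.method + " " + r.value);
    }
}
```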
Here is an example query file:
<parameters>
<index>PATH_TO_INDEX</index>
<trecFormat>true</trecFormat>
<rule>dirichlet:2000</rule>
<count>100</count>
<query>
<number> 51 </number>
<text>#5(president clinton)</text>
</query>
<query>
<number> 52 </number>
<text> #combine( avp ) </text>
</query>
</parameters>
The LucindriSearcher can be run from inside an IDE by invoking the main class (org.lemurproject.lucindri.searcher.IndriSearch), or from the jar file in the target directory. Use at least 2G of heap space (preferably 4G - 8G).
java -jar -Xmx4G LucindriSearcher-1.0-jar-with-dependencies.jar queries.xml
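When there are many queries, the parameter file can be generated rather than written by hand. The sketch below builds a file with the same elements as the example query file above. It is not part of Lucindri: the class QueryFileBuilder is hypothetical, and escaping of XML special characters in the query text is an assumption about what a standard XML parser expects.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical generator for a queries.xml parameter file; not Lucindri code.
final class QueryFileBuilder {

    // Escapes the characters that would otherwise break XML element content.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // queries maps a query number to its Indri query language text.
    static String build(String indexPath, String rule, int count, Map<String, String> queries) {
        StringBuilder sb = new StringBuilder("<parameters>\n");
        sb.append("<index>").append(escape(indexPath)).append("</index>\n");
        sb.append("<trecFormat>true</trecFormat>\n");
        sb.append("<rule>").append(escape(rule)).append("</rule>\n");
        sb.append("<count>").append(count).append("</count>\n");
        for (Map.Entry<String, String> q : queries.entrySet()) {
            sb.append("<query>\n");
            sb.append("<number>").append(escape(q.getKey())).append("</number>\n");
            sb.append("<text>").append(escape(q.getValue())).append("</text>\n");
            sb.append("</query>\n");
        }
        return sb.append("</parameters>\n").toString();
    }

    public static void main(String[] args) {
        Map<String, String> queries = new LinkedHashMap<>(); // keeps insertion order
        queries.put("51", "#5(president clinton)");
        queries.put("52", "#combine( avp )");
        System.out.print(build("PATH_TO_INDEX", "dirichlet:2000", 100, queries));
    }
}
```

The output can be redirected to queries.xml and passed to the searcher as shown above.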
Lucindri Query Language
Lucindri Fields
Lucindri documents are stored in fields, which are specified at index time. If indexFullText is set to true during indexing, a fulltext field is created and is used as the default query field if no field is specified.
You can search any field by typing the term you are looking for followed by a period "." and then the field name.
For example:
President.fulltext Obama.title
Lucindri implements these Indri belief operators:
- #combine (equivalent to #and)
- Example: #combine(dog training)
- #or
- Example: #or(dog cat)
- #not
- Example: #and(president #not(obama))
- #wand (weighted and)
- Example: #wand(0.2 president 0.8 obama)
- #wsum (weighted sum)
- Example: #wsum(0.2 president 0.8 obama)
- #max
- Example: #max(dog train) - returns maximum of b(dog) and b(train)
- #band (boolean and)
- #band(Q) is scored as #uw(Q) - an unordered window of the length of the document
- #N (also known as #nearN and #windowN)
- ordered window - terms must appear ordered, with at most N-1 terms between each
- Example: #2(white house) - matches "white * house" (where * is any word or null)
- #uwN (unordered window)
- unordered window - all terms must appear within window of length N in any order
- Example: #uw2(white house) - matches "white house" and "house white"
- #syn (synonym)
- Example: #syn( #1(united states) #1(united states of america) )
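The matching conditions for the #N and #uwN window operators described above can be made concrete with a small sketch over a tokenized document. This is an illustration of the stated rules only, not Lucindri's implementation: the class WindowMatchSketch is hypothetical, it works on plain token lists rather than an index, and it ignores scoring.

```java
import java.util.List;

// Hypothetical illustration of window matching; not Lucindri code.
final class WindowMatchSketch {

    // #N: terms appear in order, with at most n-1 tokens between adjacent terms,
    // i.e. consecutive matched positions differ by at most n.
    static boolean orderedWindow(int n, List<String> terms, List<String> doc) {
        if (terms.isEmpty()) {
            return false;
        }
        for (int start = 0; start < doc.size(); start++) {
            if (doc.get(start).equals(terms.get(0)) && matchFrom(n, terms, 1, doc, start)) {
                return true;
            }
        }
        return false;
    }

    // Tries to place terms[termIdx..] after prevPos, each within n positions of the last.
    private static boolean matchFrom(int n, List<String> terms, int termIdx,
                                     List<String> doc, int prevPos) {
        if (termIdx == terms.size()) {
            return true;
        }
        int limit = Math.min(doc.size() - 1, prevPos + n);
        for (int pos = prevPos + 1; pos <= limit; pos++) {
            if (doc.get(pos).equals(terms.get(termIdx))
                    && matchFrom(n, terms, termIdx + 1, doc, pos)) {
                return true;
            }
        }
        return false;
    }

    // #uwN: some span of n consecutive tokens contains every term, in any order.
    static boolean unorderedWindow(int n, List<String> terms, List<String> doc) {
        if (terms.isEmpty()) {
            return false;
        }
        for (int start = 0; start < doc.size(); start++) {
            int end = Math.min(doc.size(), start + n);
            if (doc.subList(start, end).containsAll(terms)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> doc = List.of("the", "white", "big", "house");
        System.out.println(orderedWindow(2, List.of("white", "house"), doc));
        System.out.println(unorderedWindow(2, List.of("white", "house"), doc));
    }
}
```

Note that the unordered check treats repeated query terms as a single requirement, which is a simplification.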
And these term operators: