A while back I found the need to add searching to this and other websites that I run. I toyed with the idea of developing a search library and almost got to the point of starting one when I told myself to stop being stupid and look around for a pre-written library. As always there was one available, and it goes by the name Lucene. I didn't expect much from this library at first. It wasn't the sort of project that I expected to get much attention, but as long as it did basically what I wanted I was willing to give it a go - was I in for a surprise. Lucene is an amazingly powerful search library that can index and store vast amounts of data. It is written in such a way that it can be used to search any type of data that can be split up into fields. Typically that is going to be something like web pages, but it could just as easily store other types of data.
Removing Documents
All the tutorials on using Lucene tell you how to add documents to the index. This is great, as most of the time it's what you want to do. The problem is that the world is not a static place and it is generally necessary to add and remove documents over time. To remove a document you need to do a little forward planning. Hopefully each document you add to the index will have some sort of unique key. If the documents you are indexing are built from a database you can use the database primary key. If they are pages found on the Internet, perhaps use the URL of the page. Whatever happens, though, you need a unique key. Add this unique key to the document in a field called key or some such. This field has to be indexed, though. Field.Index.TOKENIZED works only when the analyzer leaves the key unchanged (a plain integer under StandardAnalyzer, for example); for keys such as URLs, Field.Index.UN_TOKENIZED is the safer choice, because deletion matches the exact term stored in the index. You can then remove the document from the index like this:
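try {
    // Build a term that matches the unique key stored with the document
    Term term = new Term( SomeDocument.KEY, Integer.toString( page.getKey() ) );
    // deleteDocuments returns the number of documents that were removed
    int deleted = getSearcher().getIndexReader().deleteDocuments( term );
} catch( IOException ioe ) {
    //Log Exception
}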
The only thing to remember is that the index writer must be closed before the index reader tries to delete a document. Failure to close the writer first results in an exception being thrown (java.io.IOException Lock obtain timed out: Lock...).
Problems
This section lists the problems that you will likely run into when starting out with Lucene. For the most part Lucene is fairly simple to use, and if you are having difficulty you are probably doing something wrong or trying to do something Lucene wasn't designed to do. There are a couple of genuine gotchas, though, and they are listed below.
Bad File Descriptor
This little chestnut appears when you try to read values from a Hit after the IndexSearcher has been closed. If you use an in-memory directory implementation then this exception never arises as there is, of course, never a FileDescriptor. If you then switch over to using a file-based directory, watch out for this exception.
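As far as I can tell the cause is that the stored values are read from the index files on demand, so the safe habit is to copy whatever you need out of the hits before closing the searcher. The sketch below shows the same failure in plain Java with nothing but the standard library. LazyStore is a made-up class standing in for the searcher, not part of Lucene, and the exact exception message will vary by platform.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

// Toy model only: LazyStore is hypothetical and stands in for a searcher
// whose results are read from disk on demand.
final class LazyStore implements AutoCloseable {
    private final RandomAccessFile file;

    LazyStore(Path path) throws IOException {
        this.file = new RandomAccessFile(path.toFile(), "r");
    }

    // Reads the stored value lazily; fails with an IOException once closed.
    String readValue() throws IOException {
        file.seek(0);
        byte[] bytes = new byte[(int) file.length()];
        file.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    @Override
    public void close() throws IOException {
        file.close();
    }

    // Safe pattern: copy the value into a plain String while the store is open.
    static String readThenClose(Path path) throws IOException {
        try (LazyStore store = new LazyStore(path)) {
            return store.readValue();
        }
    }
}
```

Calling readValue() after close() throws an IOException, whereas the String returned by readThenClose() stays usable because it no longer depends on the open file.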
Analyzer Types
This is one that will trip you up and produce strange results if you aren't expecting it. When you build an index you have to use an Analyzer to strip out all the non-searchable words and perform various other analysis on the input data. The same analyzer has to be used when you search the index as well, or the search will fail in mysterious ways. This advice is true whether you use the QueryParser to produce a query or you try to roll your own query. I tried to roll my own query for hours before I realized what I was doing wrong. The problem is that with a large data set, queries will often return results even if you use the wrong (or no) analyzer, simply because they hit a document by chance. In my case some queries would return a full set of results and others would return nothing even though I knew that the term appeared in the data set. In one case I found a search that would return a complete set of results if the term was lower case and nothing if the term was upper case, despite having both upper and lower case versions of the term in the data set.
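The upper case versus lower case mystery is easier to see in a toy model. The sketch below is plain Java, not Lucene: ToyIndex and its analyze method are made-up stand-ins for an index and an analyzer that lower-cases its input, which I am assuming is roughly what was happening in my case.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: ToyIndex is hypothetical and is not Lucene's API.
final class ToyIndex {
    // Maps each indexed term to the ids of the documents that contain it.
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Stand-in for an Analyzer: split on non-alphanumerics and lower-case.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.split("[^A-Za-z0-9]+")) {
            if (!token.isEmpty()) {
                terms.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    // Index time: the text always goes through the analyzer.
    void add(int docId, String text) {
        for (String term : analyze(text)) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    // Hand-rolled query: looks up the raw term with no analysis at all.
    Set<Integer> searchRaw(String term) {
        return postings.getOrDefault(term, new TreeSet<>());
    }

    // Parser-style query: analyzes the input first, then ORs the term lookups.
    Set<Integer> searchAnalyzed(String input) {
        Set<Integer> result = new TreeSet<>();
        for (String term : analyze(input)) {
            result.addAll(searchRaw(term));
        }
        return result;
    }
}
```

If document 1 contains "Lucene" and document 2 contains "LUCENE", both are stored under the term "lucene". A raw lookup for "lucene" finds both documents and a raw lookup for "Lucene" finds nothing, while searchAnalyzed("Lucene") finds both because the query went through the same analysis as the data.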
The reason I had tried to roll my own query was to search multiple fields of a document in one request. This appeared, initially, to be impossible because the QueryParser accepts only one field to search. A little head scratching later I felt a bit stupid. The query parser can be called multiple times with different fields to produce multiple queries. These queries can then be combined in a BooleanQuery and fed to the searcher. This avoids any problems with the analyzer, because every query is still produced by the parser.
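The shape of that idea can be sketched without Lucene at all. In the toy model below, ToyFieldIndex is a made-up class, searchField plays the part of one parser call against one field, and searchFields plays the part of the combined query by ORing the per-field results. It is only an illustration of the approach, not Lucene's API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: ToyFieldIndex is hypothetical and is not Lucene's API.
final class ToyFieldIndex {
    // field name -> term -> ids of the documents holding that term in that field
    private final Map<String, Map<String, Set<Integer>>> fields = new HashMap<>();

    // Stand-in for an Analyzer: split on non-alphanumerics and lower-case.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.split("[^A-Za-z0-9]+")) {
            if (!token.isEmpty()) {
                terms.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    // Index time: analyze the text and record it under the named field.
    void add(int docId, String field, String text) {
        Map<String, Set<Integer>> postings =
                fields.computeIfAbsent(field, f -> new HashMap<>());
        for (String term : analyze(text)) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    // One single-field query: the input is analyzed, then any term may match.
    Set<Integer> searchField(String field, String input) {
        Set<Integer> result = new TreeSet<>();
        Map<String, Set<Integer>> postings = fields.get(field);
        if (postings == null) {
            return result;
        }
        for (String term : analyze(input)) {
            Set<Integer> ids = postings.get(term);
            if (ids != null) {
                result.addAll(ids);
            }
        }
        return result;
    }

    // Combined query: run the single-field query per field and OR the results.
    Set<Integer> searchFields(String input, String... fieldNames) {
        Set<Integer> result = new TreeSet<>();
        for (String field : fieldNames) {
            result.addAll(searchField(field, input));
        }
        return result;
    }
}
```

Because every per-field lookup goes through the same analyze step that was used at index time, the combined search cannot fall into the analyzer trap described above.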