20100423

Speeding up Lucene Index

Lucene is an indexing and search technology that also provides spellchecking, hit highlighting and advanced analysis/tokenization capabilities. The .NET version is an automatic port of Java Lucene.
The solution works perfectly in most cases, but recently I've found it too slow. The directory can be stored in the file system or in memory. While search indexes are cached in memory, document contents are loaded from the directory each time. I haven't checked the Java implementation, but the .NET one seems to parse the file very slowly. The file system version additionally requires lots of disk access, which is probably well cached by the operating system but is not efficient for large indexes (in my case the file size is over 500MB). The memory directory implementation tends to leak, which is unacceptable.
The solution below demonstrates how to enable document caching in Lucene.
The cache needs to be added to the FieldsReader class. The code below shows only the changes that need to be made:
  • A cache collection (a SortedList keyed by document number) needs to be added.
  • The cache needs to be cleared at the end of the Close() method.
  • At the beginning of the Doc() method, the document is returned from the cache if it has already been loaded.
  • The document is added to the cache after it is read in the Doc() method.
This change has a big memory cost, but it speeds up large queries by up to 10 times.
Further optimization can be found here.

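using System.Collections.Generic;

public sealed class FieldsReader
{
    // Cache of already loaded documents, keyed by document number.
    private SortedList<int, Document> cache = new SortedList<int, Document>();
//---------
    public void Close()
    {
//---------
        // Drop the cached documents once the reader is closed.
        cache.Clear();
    }
//---------
    public Document Doc(int n)
    {
        // Return the cached copy if this document was already loaded.
        Document cached;
        if (cache.TryGetValue(n, out cached))
        {
            return cached;
        }
//---------
        // Remember the freshly read document for later calls.
        cache[n] = doc;
        return doc;
    }
//---------
}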
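
The snippet above shows only the lines added to FieldsReader, so it cannot be compiled on its own. As a rough illustration of the same pattern, here is a minimal self-contained sketch. FakeDocument, CachingReader and the load delegate are hypothetical stand-ins invented for this example and are not part of Lucene. It also assumes a Dictionary instead of the SortedList used above, since the cache is only looked up by key and never iterated in order.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for Lucene's Document, used only to keep the sketch self-contained.
public sealed class FakeDocument
{
    public readonly int Id;
    public FakeDocument(int id) { Id = id; }
}

// Hypothetical reader that caches documents by number, mirroring the four changes listed above.
public sealed class CachingReader
{
    // Change 1: the cache itself (assumption: Dictionary rather than SortedList).
    private readonly Dictionary<int, FakeDocument> cache = new Dictionary<int, FakeDocument>();

    // Stands in for the expensive read from the directory.
    private readonly Func<int, FakeDocument> load;

    // Counts how many times the expensive read actually ran.
    public int Loads { get; private set; }

    public CachingReader(Func<int, FakeDocument> load)
    {
        this.load = load;
    }

    public FakeDocument Doc(int n)
    {
        // Change 3: return the cached document if it was already loaded.
        FakeDocument doc;
        if (cache.TryGetValue(n, out doc))
        {
            return doc;
        }

        doc = load(n);
        Loads++;

        // Change 4: store the document after reading it.
        cache[n] = doc;
        return doc;
    }

    public void Close()
    {
        // Change 2: clear the cache when the reader is closed.
        cache.Clear();
    }
}

public static class Demo
{
    public static void Main()
    {
        var reader = new CachingReader(n => new FakeDocument(n));

        var first = reader.Doc(7);
        var second = reader.Doc(7);
        Console.WriteLine(ReferenceEquals(first, second)); // prints "True"
        Console.WriteLine(reader.Loads);                   // prints "1"

        reader.Close();
        reader.Doc(7);
        Console.WriteLine(reader.Loads);                   // prints "2"
    }
}
```

The second call to Doc(7) never reaches the loader, which is where the speed-up comes from. The sketch is not thread-safe and the cache grows without limit until Close() is called, so the memory cost mentioned above applies here as well.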
