Lucene and SOLR Get Commercial Support
ruphus13 writes "Two of the technical leads and core committers of the Lucene Project have launched Lucid Imagination, a venture backed company now offering commercial versions of Lucene and SOLR in the hopes of making it the de facto choice of search technologies used by companies within their products. 'The Lucene search library ranks amongst the top 5 Apache projects, installed at over 4,000 global companies. Although OStatic is primarily Drupal-based, our site's search is based on Lucene. According to Lucid Imagination officials, the Solr search server, which transforms the Lucene search library into a ready-to-use search platform for building applications, is the fastest growing Lucene sub-project...Lucid's business model is roughly comparable to Red Hat's very successful model, in that it centers on support and services for free, open source software.'"
Re:oookay. (Score:3, Informative)
Lucene is a full-text indexer and search library. Solr is a full-text indexer and search server, based on Lucene.
Re:possible alternative: xapian (Score:2, Informative)
About to move to the Java port of Lucene... (Score:5, Informative)
We're currently using the Zend PHP port of Lucene. It was nice, because we were able to use all our existing code for loading our PHP objects from the database for indexing. It worked fine, as long as our indexes stayed small.
Now we have several indexes weighing in at 300+ megabytes, and Zend Lucene has proven to be absolute crap. It takes seconds of CPU time and hundreds of megs of RAM to process simple queries against these indexes. When tested in Luke [getopt.org], the same queries against the same indexes finish in milliseconds with minimal memory usage. Either the Zend port or PHP itself is clearly unsuitable for production use on large indexes.
Either way, we're going to switch it out for Solr ASAP, and we anticipate the development overhead should be minimal -- we'll keep using the same code to load our objects, and pass them to Solr via JSON.
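A minimal sketch of that hand-off, assuming the objects are already loaded as plain dictionaries. The field names (id, title, body) and the helper to_solr_json are hypothetical, and how the payload actually gets sent depends on the Solr version and its configured update handler, so the sketch stops at building the JSON.

```python
import json

def to_solr_json(records):
    """Serialize already-loaded objects into a JSON array of flat documents.

    Field names here are illustrative only; a real schema defines its own.
    """
    docs = []
    for rec in records:
        docs.append({
            "id": str(rec["id"]),   # unique key, sent as a string
            "title": rec["title"],
            "body": rec["body"],
        })
    return json.dumps(docs)

# Stand-ins for objects loaded by the existing database code.
records = [
    {"id": 1, "title": "First post", "body": "Lucene and Solr"},
    {"id": 2, "title": "Second post", "body": "Full-text search"},
]
payload = to_solr_json(records)
```

The point of keeping the loader and the serializer separate is that the existing object-loading code does not need to change when the search backend does.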
full-text search (Score:5, Informative)
You mentioned SQL SELECTs elsewhere. Full-text search isn't like a SELECT. It's more like what happens when you Google something: many documents are searched in a split second, and complex queries are possible, such as documents containing one phrase but not another, documents that mention X with Y within a few sentences of it, or documents that mention X and Y but not Z. Yes, SQL lets you do that, but not for text, except in very inefficient ways.
From what I've seen of it (which is very little), Lucene lets you, as a programmer, index data using your own field names. So, say you're indexing Word documents and HTML documents. You can extract most of the text and index it as "maincontent", but separately extract the author, title and subtitle, indexing those individually. This lets you query attributes, like: "space nasa and not genre:sci-fi". Full-text search also does ranking based on the occurrences of different words you query by, etc. Presumably Lucene would let you specify which fields/attributes are included in a search, and which ones have the highest scores in search results, for instance.
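To make the field idea concrete, here is a toy inverted index in Python. It is not how Lucene is implemented and it does no ranking; the class TinyFieldIndex and the sample documents are made up for illustration. It only shows why per-field indexing makes a query like "space nasa and not genre:sci-fi" cheap: each term lookup is a set fetch, and the boolean operators become set operations.

```python
from collections import defaultdict

class TinyFieldIndex:
    """Toy inverted index keyed by (field, term). Illustration only."""

    def __init__(self):
        # (field, term) -> set of document ids containing that term
        self.postings = defaultdict(set)

    def add(self, doc_id, fields):
        for field, text in fields.items():
            for term in text.lower().split():
                self.postings[(field, term)].add(doc_id)

    def lookup(self, field, term):
        # .get avoids creating empty entries for unknown terms
        return self.postings.get((field, term.lower()), set())

index = TinyFieldIndex()
index.add(1, {"maincontent": "NASA launches space probe", "genre": "news"})
index.add(2, {"maincontent": "Space opera about NASA pilots", "genre": "sci-fi"})
index.add(3, {"maincontent": "Gardening tips", "genre": "news"})

# Roughly: space AND nasa AND NOT genre:sci-fi
both_terms = index.lookup("maincontent", "space") & index.lookup("maincontent", "nasa")
hits = both_terms - index.lookup("genre", "sci-fi")
```

Here hits ends up containing only document 1, since document 2 matches both words but is excluded by its genre field.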
Yeah, I don't get where $5m USD went on that either. I didn't think it was THAT big a problem. But maybe it is. Personally, I'm holding out for a decent Triple API, which will hopefully make all but the indexer of this obsolete.
Re:oookay. (Score:3, Informative)
No. It's a search engine for your website. It's not quite as simple as a SELECT query. "Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform." [apache.org] That does quite a bit more than a SELECT query could hope to do.
Comment removed (Score:5, Informative)
Re:About to move to the Java port of Lucene... (Score:1, Informative)
I found the original Java libraries to be plenty fast as well. We index millions of records, and it's always been plenty fast returning even the most complex queries. Granted, it probably isn't as fast as the C library, but it is the most updated and feature-rich. And many of the later features that the C library lacks make it COMPLETELY worth it.
SOLR has several advantages (Score:4, Informative)
I agree, Xapian is nice, and we considered it for a while. However, in the end, the decision was made to use SOLR because of one overriding factor in its favor: it takes care of all the nasty details to enable concurrent access, which makes developing web applications just so much easier. With SOLR you just don't have to worry about who might currently be reading or writing to the index, and the index replication features are very powerful, too.
That, and facet searches are very nice, too (e.g., searching for a keyword, then automatically displaying the number of hits per category and refining per category).
SOLR has Python bindings, too, by the way. They currently are not in the official repository, but recently maintenance on them has picked up, and they work in a very Pythonic way.
Re:Based on open source? #5? (Score:1, Informative)
Re:full-text search (Score:2, Informative)
From what I've seen of it (which is very little), Lucene lets you, as a programmer, index data using your own field names. So, say you're indexing Word documents and HTML documents. You can extract most of the text and index it as "maincontent", but separately extract the author, title and subtitle, indexing those individually. This lets you query attributes, like: "space nasa and not genre:sci-fi". Full-text search also does ranking based on the occurrences of different words you query by, etc. Presumably Lucene would let you specify which fields/attributes are included in a search, and which ones have the highest scores in search results, for instance.
You've certainly hit close to the mark. I work on a site that uses Solr, and it works every bit as well as others have said. You can tell it what fields you want to search. You can tell it what order you want results sorted in (and you can sort on more than one column to break relevancy ties). You can tell it you want matches in one column weighted more than another. You can tell it you want the terms to be within X words of each other. And you can tell it what words should not be in the results.
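For anyone curious what that looks like on the wire, here is a rough sketch that only builds the query string and sends nothing. The field names (title, description, price) and the helper build_query are hypothetical, and it assumes a Solr version where the edismax query parser is available; older setups would need a different parser and different parameters.

```python
from urllib.parse import urlencode, parse_qs

def build_query(phrase, within, exclude):
    """Build Solr request parameters for a weighted proximity search.

    Field names are placeholders; a real schema defines its own.
    """
    params = {
        # terms within `within` words of each other, minus an excluded word
        "q": '"%s"~%d -%s' % (phrase, within, exclude),
        "defType": "edismax",              # assumed to be available
        "qf": "title^3 description",       # title matches weigh three times as much
        "sort": "score desc, price asc",   # price breaks relevancy ties
        "rows": "10",
    }
    return urlencode(params)

query_string = build_query("wireless headphones", 5, "refurbished")
```

The resulting string would be appended to the select handler URL of whatever core or collection the site uses.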
And then there's the other results it can offer. Faceted search is fantastic. If you have products split by department, you can facet by department and your search for widgets can then return not only the results, but a list of the departments the current results were found in with a result count for each. (Very common feature on ecommerce sites, especially those using Endeca.)
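Conceptually, the facet part is just a count of the matching results grouped by a field. A toy version in plain Python, with made-up product data and a hypothetical helper facet_counts (Solr computes this from the index rather than by scanning results like this):

```python
from collections import Counter

def facet_counts(results, field):
    """Count how many matching results fall under each value of `field`."""
    return Counter(doc[field] for doc in results)

# Made-up catalogue for illustration.
products = [
    {"name": "Steel widget", "department": "Hardware"},
    {"name": "Widget gift set", "department": "Toys"},
    {"name": "Brass widget", "department": "Hardware"},
    {"name": "Garden hose", "department": "Outdoor"},
]

# A naive substring "search" stands in for the real query.
results = [p for p in products if "widget" in p["name"].lower()]
by_department = facet_counts(results, "department")
```

A page would then show the three matching products alongside a department list with a count next to each entry, and clicking a department would rerun the search filtered to it.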
They also have more-like-this results you can use as well as match highlighting. I haven't had the opportunity to try the spelling correction parts yet.
And the indexes can be incredibly small. After indexing over 1 million pages of information, the index data folders were under 500MB. The Lucene indexer can literally hold our entire search set in RAM while it's running.