Lucene in Action
Simon P. Chappell writes "I don't know about you, but I hardly bother with browser bookmarks any more. I used to have so many bookmarks, back in the early days of Netscape's 4 series, that I would have to regularly trim and edit my bookmark file to prevent my browser from crashing on startup -- that's a lot of bookmarks, folks! Now, I go to my favourite web search engine, enter a couple of appropriate search terms and voila, there's my page! Search engines are so ubiquitous that we rarely give much thought to the technology that powers them. Lucene in Action by Otis Gospodnetic and Erik Hatcher, both committers on the Lucene project, goes behind the HTML and takes you on a guided tour of Lucene, one of a generation of powerful Free and Open-Source search engines now available." Read on for the rest of Chappell's review.
Lucene in Action | |
author | Gospodnetic and Hatcher |
pages | 421 (7 pages of index) |
publisher | Manning |
rating | 9 |
reviewer | Simon P. Chappell |
ISBN | 1932394281 |
summary | Solid introduction to Lucene |
Who's it for?
Lucene is a library and framework, rather than a complete application. It truly is an engine, around which you are expected to build and extend your own application. Like Lucene, the book is targeted at those who are looking for a tool to build their own search facility application rather than just "download and go." The book does include a number of case studies of Lucene usage (including at least one download and go search engine) but those are included to show how to use and adapt Lucene to fit differing environments rather than as ends in themselves.
The Structure
The book is sensibly divided into two parts. The first part looks at "Core Lucene" functionality, while the second part addresses "Applied Lucene".
Part one has six chapters, covering the central components and inner workings of Lucene. It's here that the book starts with a tutorial introduction, familiarising the reader with the concepts of Lucene as a search engine around which you wrap your own code. The other five chapters move steadily through good search engine fare, with indexing getting the whole of chapter two to itself. The discussion of how to retrieve text from the documents being indexed is mentioned here but postponed until chapter seven, where it is dealt with exhaustively. Chapter three covers searching, and especially how Lucene ranks documents.
Chapter four examines analysis. In its chapter introduction, the book explains that "Analysis, in Lucene, is the process of converting field text into its most fundamental indexed representation, terms." This process is performed by an analyser, which tokenises text according to its own built-in rules. Each analyser has a different emphasis: some want only dictionary words, others might explicitly include acronyms, and sometimes you'll want an analyser that blocks stop words (those words in a language that are part of its structure but add nothing to the information being conveyed by the text; classic examples of stop words in English include "a", "and" and "the").
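To make the idea of analysis concrete, here is a minimal Python sketch. It is not Lucene's analyser code; the function name `analyze`, the tokenising rule (lowercase, split on anything that is not a letter or digit) and the three-word stop list are all assumptions chosen for illustration.

```python
import re

# Assumed, illustrative stop-word list; real analysers ship longer ones.
STOP_WORDS = {"a", "and", "the"}

def analyze(text, stop_words=STOP_WORDS):
    """Toy analyser: turn raw field text into a list of index terms."""
    # Lowercase, then keep runs of letters and digits as tokens.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Drop stop words, which carry structure but little information.
    return [t for t in tokens if t not in stop_words]

print(analyze("The quick fox and a lazy dog"))
# prints ['quick', 'fox', 'lazy', 'dog']
```

Swapping in a different stop list or token pattern changes which terms reach the index, which is the sense in which each analyser has its own emphasis.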
Chapter five looks at advanced search techniques; everything from sorting search results, searching on multiple fields to filtering searches. Many free or open source software tools are extensible, and Lucene is no exception. Chapter six addresses creating and using custom components within Lucene, everything from custom sort methods to custom filters.
Part two, the final four chapters, covers Applied Lucene. It is dedicated to practical uses of Lucene and answers the question "So, what can I do with a search engine?" Chapter seven covers ways and means to parse common, non-plain text document formats. The primary formats covered are RTF, XML, PDF, HTML and Microsoft Word. The ability to parse and index these file formats will cover the search engine needs of the majority of Lucene users. Chapter eight looks at a number of Lucene tools and extensions that are available, many of them being free and open source software. Chapter nine covers ports of Lucene. While for many users Lucene being a Java library is not a problem, some users want its functionality in environments that do not have Java. The chapter looks at ports written in C++, C#, Perl and Python. Lastly, chapter ten takes a thorough look at seven Lucene case studies. Perhaps the "star" case study is the one about Nutch, a download and go search engine written by Doug Cutting, the original author of Lucene.
There are three appendices. The first offers installation advice for Lucene; a useful addition that those newer to working with Java libraries will surely appreciate. The second appendix has a very well explained description of the Lucene index format. This is the kind of information that can be hard to find, so it is welcome in a book of this sort. The last appendix contains a number of categorised resource references. The number and breadth of the resources provided could amount to quite an incredible education in information retrieval theory if the reader were inclined to read them all.
What's to Like?
There are several things to like about this book. Let's start with the fact that the authors are part of the core development team of Lucene. This gives them both credibility and an excellent understanding of the internal workings of Lucene. Co-author Erik Hatcher is a fantastic writer, having previously been a co-author of the only Ant book worth bothering with, Manning's Java Development with Ant. (Full disclosure: I do know Erik personally.)
The structure of the book is well thought out and each chapter does seem to move your understanding forward when combined with what you learned from the preceding ones. The division into core and applied Lucene is also helpful. While you'd hope that this was the case, it often isn't; hence I note it as a positive.
I especially appreciate that this book does not fill up page after page with API documentation. The authors appear to have grasped that if you have Internet access to download the software, you might just be able to access the documentation online; rather, they concentrate on the way to use the software. What a concept!
As a part of Manning's "in Action" series, the book has excellent layout and has obviously been thoroughly edited by both technical evaluators and copyeditors. This might seem to be a small thing to some, but a well-edited book stands out clearly from the crowd.
What's to consider?
If you are looking for a book on using and configuring a download and go style of search engine, this book would be less suitable. While the case study on Nutch is of good length, it would be too short to be useful as a configuration guide.
Conclusion
I enjoyed reading this book. If you have any text searching needs, this book will be more than sufficient equipment to guide you to successful completion. Even if you are just looking to download a pre-written search engine, this book will provide a good background to the nature of information retrieval in general and text indexing and searching specifically.
You can purchase Lucene in Action from bn.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.
Re:Raise your hand if... (Score:3, Informative)
Re:Raise your hand if... (Score:1)
Re:Raise your hand if... (Score:1)
I wrote a Firefox extension to have the best of both worlds: my bookmarks are still stored locally, but they're now automatically backed up on a web site. Additionally, with this extension my other machines can synchronize with that web site, so I've got all my bookmarks stored locally on all my machines and they are periodically and automatically synchronized between them.
Check i
My solution (Score:3, Interesting)
Re:My solution (Score:1)
Re:My solution - Booby online PIM (Score:2)
Re:My solution - Booby online PIM (Score:2)
Re:My solution - Booby online PIM (Score:1)
My solution is not really high tech, but it has one major advantage over bookmarks - it's way, way faster. Scrolling through bookmarks is slow and tedious because scrolling in general is slow and tedious. If you have them on your home page, the first thing you are presented with is links. I
Re:My solution (Score:1, Interesting)
The "home" button on a browser is supposed to take you to YOUR OWN web space, maintained by you - i.e. your home. Some bits might be your front garden, visible to others, others private.
People who use the "home" button as just another bookmark to a search engine are missing the point of the web.
It isn't helped by the fact that current browsers aren't actually good as editors (unlike the original web browser vision) - Your web site should be a WYSIWYG-editable personal/private pseudowiki. Many
Re:My solution (Score:1)
My solution is similar, but I just put a PHP page on my home desktop that serves my FF bookmarks.html. It took about 2 minutes to create and now I can get to my bookmarks from anywhere.
The only downside is that I would have to spend a little more time if I ever also wanted to add new bookmarks from anywhere.
Wow, open source search engines. (Score:2, Interesting)
Also, I agree with the author that bookmark functionality has gone the way of the dinosaurs... with the exception of the "open all tabs" feature found in many browsers today... that is about the only one that I use often.
I'm just wondering how the "search" functio
Re:Wow, open source search engines. (Score:1, Funny)
I was looking for a good book on Open Source search engines.
Well, you could have used Google to fi-- oh, I see.
Re:Wow, open source search engines. (Score:2)
Lucene in a nutshell (over-simplified) (Score:5, Informative)
The basic idea is that you want to build an index, and then search it, to find some document.
A document has several fields (e.g. text, title, lastModificationDate, author, categories, summary, url, etc.) which may be indexed, stored, or both.
You usually build your Lucene documents based on some real documents (e.g. web pages, PDF, records in a database, etc.), and then add them to the index.
Once you have an index, you build a query to search one or more fields (Lucene provides a QueryParser class, which handles the most common cases), and you get a Hits collection containing the documents matching your query in some order (this can be customized).
Before a document is added to the index, it is passed through an Analyzer which converts the text in the fields to terms, which are the basic internal concept that is indexed.
Another interesting feature of Lucene indexes is that they can be searched while they are being built without noticeable loss of search performance, and that they are process-safe (many processes can access them for reading, only one for writing). The drawback is that the indexes are append-only (actually a separate index is created if you modify an index), but periodic optimization of the indexes removes unnecessary entries and inefficiencies.
Hope this helps!
juancn
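The document/field/index flow described above can be sketched in a few lines of Python. This is a toy inverted index, not Lucene's API: the class `ToyIndex`, its `add` and `search` methods, the whitespace tokenising and the example.com URLs are all made up for illustration. It only tries to show the difference between fields that are indexed (searchable) and fields that are stored (returned with a hit).

```python
from collections import defaultdict

class ToyIndex:
    """Toy inverted index; names and behaviour are illustrative only."""

    def __init__(self):
        self.postings = defaultdict(set)  # (field, term) -> set of doc ids
        self.stored = []                  # doc id -> stored field values

    def add(self, indexed, stored):
        """Add one document: `indexed` fields are searchable, `stored` are kept."""
        doc_id = len(self.stored)
        self.stored.append(dict(stored))
        for field, text in indexed.items():
            # Stand-in for an analyser: lowercase and split on whitespace.
            for term in text.lower().split():
                self.postings[(field, term)].add(doc_id)
        return doc_id

    def search(self, field, term):
        """Return the stored fields of documents whose `field` contains `term`."""
        ids = sorted(self.postings.get((field, term.lower()), set()))
        return [self.stored[i] for i in ids]

index = ToyIndex()
index.add({"title": "Lucene in Action"}, {"url": "http://example.com/lucene"})
index.add({"title": "Java Development with Ant"}, {"url": "http://example.com/ant"})

print(index.search("title", "lucene"))
# prints [{'url': 'http://example.com/lucene'}]
```

Note that the title text is searchable but never returned, while the URL is returned but not searchable; a field could also be both.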
Re:Wow, open source search engines. (Score:1)
Lucene is a search engine
Re:Wow, open source search engines. (Score:2)
Bookmarks are better (Score:4, Interesting)
I tend to use bookmarks in Firefox and the autocomplete about equally, and make use of the Quick Links toolbar for my most popular sites.
The Firefox bookmark all tabs feature is a breakthrough, since you can close your browser, and reopen it to the same set of tabs as before, which is great when installing extensions and you're forced to restart. The only drawback is that scrolling through bookmarks is too slow, but if you use your scroll wheel it speeds up considerably. That's a trick I didn't figure out until just last month.
Re:Bookmarks are better (Score:1)
Re:Bookmarks are better (Score:4, Interesting)
I think that abandoning bookmarks altogether is a bad idea.
Search, while useful, only works if you can find the exact keywords necessary to bring up a certain page. Search merely complements, rather than replaces, bookmarks.
Looking through my bookmark lists, I see many websites which I would never have known how to search for (they're mostly websites I stumbled upon from other websites). Some of these sites are hard to find because:
1) they don't have enough Statistically Improbable Words. e.g. try searching for software that describes biology of a python.
2) the page doesn't contain words associated with its significance to me (yes, it can happen). e.g. let's say you come across a page that has a nice layout that you want to revisit later -- if you ever forget the keywords on that page, you may never find it again. Whereas if I were to file it under "Nice websites" in my bookmark folder, I'd be able to find it again.
3) I can't remember any of the keywords associated with the page.
4) I forget that I've ever visited those webpages. Some search engines (e.g. a9.com) have histories that you can revisit, but they're no use unless you can classify them. And if you classify them, they're basically bookmarks.
I think the reason people dislike bookmarks is because they're a hassle to organize. We need some sort of tool to auto-organize bookmarks.
There are two basic requirements:
1) Multiple hierarchy - a bookmark must be able to belong to more than one category. An example of this is GMail's labels [g04.com] -- each email can belong to more than one label.
2) Automatic classification - the proper term for this is automatic taxonomy. This can be accomplished using a Bayesian algorithm (like the one POPmail is using). In fact, DEVONthink already does this [devon-technologies.com].
When a user makes a bookmark, the program should come up with a list of category folders (sorted from likeliest to least likely) to file that bookmark under, and the user must be allowed to select more than one folder.
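As a rough idea of how such a ranked folder list could be produced, here is a small naive Bayes sketch in Python. It is not how DEVONthink or any mail filter works internally; the class `FolderSuggester`, the folder names and the training titles are invented for the example, and it classifies on page titles only.

```python
import math
from collections import Counter, defaultdict

class FolderSuggester:
    """Toy naive Bayes classifier over bookmark titles (illustrative only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # folder -> word -> count
        self.doc_counts = Counter()              # folder -> bookmarks filed there
        self.vocab = set()

    def train(self, folder, title):
        """Record that a bookmark with this title was filed under `folder`."""
        words = title.lower().split()
        self.word_counts[folder].update(words)
        self.doc_counts[folder] += 1
        self.vocab.update(words)

    def suggest(self, title):
        """Return known folders, likeliest first, for a new bookmark title."""
        words = title.lower().split()
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for folder, n_docs in self.doc_counts.items():
            counts = self.word_counts[folder]
            total_words = sum(counts.values())
            score = math.log(n_docs / total_docs)  # log prior
            for w in words:
                # Laplace smoothing so unseen words do not zero out a folder.
                score += math.log((counts[w] + 1) / (total_words + len(self.vocab)))
            scores[folder] = score
        return sorted(scores, key=scores.get, reverse=True)

s = FolderSuggester()
s.train("Programming", "python tutorial for programmers")
s.train("Programming", "java search library")
s.train("Biology", "python snake biology facts")
s.train("Biology", "reptile habitat guide")

print(s.suggest("biology of a python snake"))
# prints ['Biology', 'Programming']
```

Because `suggest` returns every folder in ranked order, a bookmark dialog could show the top few and let the user tick more than one, which covers the multiple-hierarchy requirement as well.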
Re:Bookmarks are better (Score:3, Insightful)
I like your automatic classification ideas.
A complaint about Firefox: when I choose "bookmark this page" it comes up with a little dialog. This dialog has a one-line selector for where I want to create the bookmark (default being the folder named "bookmarks") and a little button to expand this one line into a scree
Re:Bookmarks are better (Score:1)
(I agree, it should really be default behavior.)
OpenBook Firefox Extension (Score:2)
Bookmarks toolbar folder is better! (Score:3, Interesting)
What I've begun doing is using the "Bookmarks Toolbar Folder" for all of my bookmarks. I've got "Essentials" with links to Gmail, Adsense, my website, Distributed.net stats and so forth, basically all of the sites that I try to visit daily. Then I've got "Favorite sites" that holds Slashdot (even though now it's "home"), Woot, Craigslist, Free6.com (hehe), Assambassador.com, Myspleen, demonoid, you get the point.
Then I've got the essential one: "Functions" - that
Re:Bookmarks toolbar folder is better! (Score:3, Funny)
Re:Bookmarks are better (Score:1)
del.icio.us is better (Score:3, Informative)
I find bookmarks slow to navigate, and it's hard for me to remember my own hierarchy when I've got enough bookmarks to organize. The problems with search have been expanded on by others in this thread.
So here's the solution: http://del.icio.us/ [icios.us].
You can create, edit, tag, describe, and search your own personal bookmarks. When you've done that, the world can see your links too. Subscribing to an RSS feed of some tags you're interested in ("pytho
DAMMIT - link is wrong, see update (Score:2)
To make matters worse, there appears to be a copycat typosquatter site at the link I put in there. Oh well, if you get all your vital information from Slashdot you deserve what you get
Re:del.icio.us is better (Score:2)
I run Simpy (see the signature or use the demo/demo [simpy.com] account), which has some notable advantages over delicious [simpy.com], especially in the search area (surprise, surprise).
Re:Bookmarks are better (Score:2)
You will need to do that only one more time. That is when you install the Firefox extension called SessionSaver. From the website [mozilla.org]:
Algorithms (Score:1)
I've heard of Lucene through my fav. computer magazine (http://www.heise.de/ct [heise.de]), but I was more interested in indexing algorithms at that time.
So how much weight does the book give to algorithms? Is there anyone out there who's as mathematically/scientifi
Re:Algorithms (Score:2)
It also talks about the indexing format: how the indexes are stored and searched. If that ain't enough, well, the source is up on apache.org
Benefits? (Score:2)
Re:Benefits? (Score:2)
Re:Benefits? (Score:2)
That said, for most projects you are better off just using a Google search, but there are times when knowing the structural properties of your da
Re:Benefits? (Score:4, Informative)
Some examples of customizable features are that you can index database entries and achieve quantum leaps in performance over that offered by Oracle, MySQL, PostGres, Firebird, etc. indexing. You can index formats that are not supported by the major search engines.
It may not offer quite the performance of Google, Alta Vista, etc., but it's a FREE product, well supported by the folks at Apache, and many open source J2EE frameworks support it as well.
Re:Benefits? (Score:2)
Better Memory Than I (Score:3, Interesting)
You have a better memory than I, my friend. Many times I only barely remember something I want to find again. Maybe I remember it was humorous, or maybe I remember it was an online game with pigs in it. Unless it's popular I doubt 'pig game' is gonna get me far. So bookmarks aren't so useless to those of us who don't keep everything in RAM.
Bookmarks, and a good hierarchy, also leverage the Associative aspect of our minds. Skim through your high level bookmark folders and you'll probably find what you were thinking of pretty quick. Additionally it reminds you of things you may have bookmarked yet forgotten.
Re:Better Memory Than I (Score:1)
I doubt 'pig game' is gonna get me far.
Maybe not, but I'll bet it would make an interesting Google Image Search with Safe-search turned off.
*dares not try at work*
Re:Better Memory Than I (Score:1)
And by the way, it looks like that pig game was pretty popular, as it is the first link in a normal search and a screenshot of a game is the first link in an image search.
Re: (Score:2)
Re:Better Memory Than I (Score:2)
Re:Better Memory Than I (Score:2)
That's my idea, anywho.
Re:Better Memory Than I (Score:3, Interesting)
His solution, using a search engine, is a much better method as you might even come across something new and even MORE useful than what you had originally bookmarked.
I check a handful of websites daily. Mostly Google News, slashdot, MNspeak, geocaching.com, mngca.org, and usually some others. While having them set up in a hierarchy might leverage the association aspect, typing them in every time e
Re:Better Memory Than I (Score:2)
Re:Better Memory Than I (Score:1)
1. Visit google.
2. Type in "Qt 3.4 documentation"
3. Hit submit
4. Find and click on the link
OR
1. Click on the bookmark.
Yeah NOT using bookmarks is so efficient.
Re:Better Memory Than I (Score:2)
Like a phone directory (Score:2)
Exactly!
My cellphone has only work contacts programmed into it, because the only time I'm going to need these numbers is when I'm on the clock, and when I'm carrying the cellphone.
But personal contacts? I've learned over the years to deliberately NOT program these in - forced repetition of typing in the numbers means I commit them to memory. Extremely handy for when I don't have my cellphone on me, or its battery dies.
Personally, I found bookmarks were almost harm
Re:Better Memory Than I (Score:1)
Re:Better Memory Than I (Score:2)
Like em, keepin em (Score:3, Insightful)
Anyhow, I simply build to critical mass before I sort them into their respective folders. Some things are automatically tossed into temporary bookmark folders that are going to get washed away after they are no longer useful. (Think auction links)
Now I'll tell you why using a search engine as a replacement for bookmarks is a bad idea. Page ranking changes. That particular combination of words you can remember... might just not produce the same results next time. Wonder why? The Internet changes! It is not AOL keyword search...
That said, I did something as foolish as to rely on google to get back to some website regarding video sync signals. It was an excellent page and then I went back to search for it again and I could not find it. (Eventually I did though)
Bookmarks good, search engine good... not mutually exclusive.
Re:Like em, keepin em (Score:2)
Ha (Score:1)
Oh wait.. wrong chappelle
RSS (Score:2, Interesting)
While search engines are great, bookmarks are not obsolete. I use RSS feeds to keep up on anything that is serialized that I might care about. FF is great for that.
I still use a few regular bookmarks (like the URL that logs me into /.). Or for development servers with obscene URLs. That is the kind of thing that a search engine won't find. Especially if you have to deploy to a few web servers (this is the WebLogic machine, this is the OAS machine, etc). I have even bookmarked LDAP strings for testing.
Re:RSS (Score:1)
Yes, but...
The distribution contains some demo applications that you can point to a filesystem. One app will index the text, another will index HTML (or maybe one does both, I can't remember). Then you execute another app to query the index.
The hard part is to get Lucene to index non-text files such as Office files. The version of Lucene I've used is the Java version. Third-party libraries exist for Word and Excel docs (on a Windows filesystem), but none
Lucene is Apache Licensed (Score:2)
The nice thing about Lucene is it adds indexing and searching to anything you want. Some search plugin for Outlook (blech) is built on Lucene.Net; imagine an equivalent for the Unix mail systems: Thunderbird, Evolution or Emacs, for example.
Lucene providing search engine for Hula (Score:3, Interesting)
Reminds me of Mac System 7 (Score:2)
Google anyone ? (Score:3, Interesting)
MOD PARENT UP (Score:2)
Re:MOD PARENT UP (Score:2)
Take a look at http://www.theserverside.com/ [theserverside.com] - the enterprise java community. Their search is powered by Lucene. It's pretty fast and a very capable site search.
You also have open source projects, such as Beagle (the desktop search for Gnome), that uses the
Re:Google anyone ? (Score:4, Informative)
The best reason is that it's very, very easy to set up a Google search... all you have to do is add site:your_site to the search query, and Bam! instant search.
Lucene takes some work to set up, and is best used where normal Web crawling doesn't work. For example, I work on an eCommerce Web App where all our products are stored in the database, and you reach them by setting a CGI parameter in the URL. Not all products have links to them on our site. We use Lucene because we can pull all the products out of the database and index them, and get hits that crawling would have missed. We can also customize things like redirecting a search for "help" to the help page, set up synonym lists, etc.
So long story short, their search needs are not complex enough to justify the effort of setting up a Lucene-based application.
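The kind of customization described above (indexing rows straight from a database, a "help" redirect, synonym lists) might look roughly like this in Python. It is only a sketch under assumptions: the `PRODUCTS` dictionary stands in for database rows, and the `REDIRECTS` and `SYNONYMS` tables, the `/help` path and both function names are hypothetical, not taken from the poster's application or from Lucene.

```python
# Hypothetical stand-in for rows pulled out of a product database.
PRODUCTS = {101: "red leather sofa", 102: "oak coffee table"}

# Hypothetical site-specific rules.
REDIRECTS = {"help": "/help"}
SYNONYMS = {"couch": {"sofa"}, "sofa": {"couch"}}

def build_index(products):
    """Map each term to the set of product ids whose description contains it."""
    index = {}
    for pid, description in products.items():
        for term in description.lower().split():
            index.setdefault(term, set()).add(pid)
    return index

def search(index, query):
    """Return ('redirect', url) for special queries, else ('hits', product ids)."""
    q = query.strip().lower()
    if q in REDIRECTS:
        return ("redirect", REDIRECTS[q])
    hits = set()
    for term in q.split():
        # Expand each query term with its synonyms before looking it up.
        for variant in {term} | SYNONYMS.get(term, set()):
            hits |= index.get(variant, set())
    return ("hits", sorted(hits))

index = build_index(PRODUCTS)
print(search(index, "couch"))  # prints ('hits', [101])
print(search(index, "help"))   # prints ('redirect', '/help')
```

Every product row is indexed whether or not any page links to it, which is the advantage over crawling that the comment describes.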
Control-D is the death of bookmarks (Score:1)
Re:Control-D is the death of bookmarks (Score:2)
I was focused on another SCRT.
I just wanted to thank you for closing screen, a tail with a lengthy grep, mysql, bind (I was running in the foreground for debugging), and god knows what else.
That is all.
We use Lucene ... (Score:1)
Good article (Score:3, Informative)
This is also a good link for all of you slashdotters who have no idea what Lucene is for and are posting rants wondering why people don't just use Google instead.
my problem solver (Score:2, Informative)
Bookmarks and chronology (Score:1)
Sitebar saved my sanity... (Score:3, Informative)
Usual disclaimer: I have nothing to do with Sitebar or its development, just a majorly satisfied user.
Try delicious? (Score:2, Informative)
If you use Firefox, there are extensions that allow you to view your bookmarks in a sidebar [mozdev.org] and sync your online bookmarks [ganx4.com] with your browser bookmarks.
But what about Fishy? (Score:1)
But other times, the search engine screws you over. Case in Point - fishy
The other day at work I wanted to play Fishy, so I typed it into Google, went to the top link, and started playing. What??? They changed FISHY?? NO, they didn't, the top link was some bizarro version o
What I want (for bookmarks) (Score:2)
PPS (Score:1)
And if you've ever wanted to create a personal proxy server that gives you a searchable database of your history and bookmarks, then you can do that too, just like I did: http://www.suttree.com/code/pps/ [suttree.com]
Re:PyLucene (Score:2, Informative)
Re:PyLucene (Score:2)
There are also Python wrappers for CLucene. I was not able to get gcj to work on my system (FreeBSD - even "hello world" did not run and I don't know why), while CLucene works. Well, sort of: it is clearly alpha, but good enough.
Re:PPS (Score:1)
Lucene implementations can parse MS Office files (Score:2)
Lucene is great! I use it all the time (Score:4, Informative)
http://www.devx.com/Java/Article/27728/0 [devx.com]
Lucene is so well documented and simple to use that I am surprised that this subject would fill an entire book.
Lucene can be used as is, or you can extend it with your own document type handlers, etc.
As a programmer, I way prefer dynamic languages like Common Lisp, Ruby, Python, Smalltalk, etc. However, one of the things that keeps me firmly in the "Java camp" is the great free infrastructure software tools (like Lucene, Tomcat, JBoss, etc.). As a programming language, Java is kind-of weak.
Re:Lucene is great! I use it all the time (Score:2, Interesting)
Java is anything but weak.
Re:Lucene is great! I use it all the time (Score:1)
Lucene rocks! (Score:2, Informative)
And you can index pretty much anything you want, so long as you can
DotLucene (Score:1)
I used it to index a fairly complicated ASP.NET portal site on which there was little or no static content and all content was secured using a custom implementation of ACLs. The ASP.NET application allowed you to run mini-applications within it, called Portlets.
These portlets had very complex security rules. For instance, you could say that certain users could click this button, while others could not. Certain users can view this portlet p
Xapian (Score:2)
Bookmarks are personal (Score:1)
del.icio.us, the abstraction, is half the answer. Apple's "iDrive" is the technological half. What we all need is a *follow-me* resource available anywhere, anytime that is totally abstracted above the hardware layer.
iMarks, personal bookmarks, that load on launch. An open stand
Bookmarks solution from the book author (Score:2)
Anyhow, I just wanted to connect these 3 islands - Lucene in Action + bookmark problem + Simpy. I'l
Just thought I'd pipe in... (Score:3, Informative)
- Good ways of doing batch indexing operations
- The purpose of the compound document format
- How to generate explainers for searches
- Field-specific handling, and how to do it well
- Ideas like metaphone replacement (soundex) and use of WordNet to integrate a synonym database into search queries
- When to use CachedFilters to remember complex filters
- Ideas for how to build "Things Like This" lists
- Ideas on autocategorisation and geographic searches
- Named Entities and LingPipe - making the search system recognize "proper names" for things
- NGRAM recording to gauge word frequency and search terms to detect misspellings and offer alternate searches ("Did you mean XXX"?)
etc.
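The "Did you mean XXX" idea from the list above can be illustrated with character n-grams. This Python sketch is one simple way to do it, not the technique from the book: it compares character trigrams with Jaccard similarity, and the function names and the tiny vocabulary are assumptions for the example.

```python
def ngrams(word, n=3):
    """Set of character n-grams of `word`, padded so word edges count too."""
    padded = f"${word.lower()}$"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def did_you_mean(query, vocabulary, n=3):
    """Return the vocabulary word sharing the most n-grams with `query`, or None."""
    q = ngrams(query, n)

    def similarity(word):
        w = ngrams(word, n)
        return len(q & w) / len(q | w)  # Jaccard similarity of the n-gram sets

    best = max(vocabulary, key=similarity)
    return best if similarity(best) > 0 else None

# Hypothetical list of terms already known to the index.
vocabulary = ["lucene", "search", "index", "analyzer"]

print(did_you_mean("lucnee", vocabulary))  # prints lucene
print(did_you_mean("zzz", vocabulary))     # prints None
```

In a real system the vocabulary would come from the terms in the index, and candidates would usually be weighted by how often each term occurs.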
If you're building a search engine, this isn't just a useful resource on implementation - you probably don't need a book for that. What it is brilliant at is providing a lot of ideas that can take you to the next stage - how to build something really cool with your information, and not just a dumb text field search.
For that alone, the book was worth the purchase price, for me. It's now well annotated, and the back pages are full of references to ideas that can be used in our own implementation, and the page numbers to use to get there.
Highly recommended for anyone who needs something more than what a Google search of your site provides.
Lucene good, databases bad (Score:1)