Lucene in Action

Posted by timothy on Wednesday August 24, 2005 @03:51PM from the lucene-and-ricking dept.

Simon P. Chappell writes "I don't know about you, but I hardly bother with browser bookmarks any more. I used to have so many bookmarks, back in the early days of Netscape's 4 series, that I would have to regularly trim and edit my bookmark file to prevent my browser from crashing on startup -- that's a lot of bookmarks, folks! Now, I go to my favourite web search engine, enter a couple of appropriate search terms and voila, there's my page! Search engines are so ubiquitous that we rarely give much thought to the technology that powers them. Lucene in Action by Otis Gospodnetic and Erik Hatcher, both committers on the Lucene project, goes behind the HTML and takes you on a guided tour of Lucene, one of a generation of powerful Free and Open-Source search engines now available." Read on for the rest of Chappell's review.

Lucene in Action
author	Gospodnetic and Hatcher
pages	421 (7 pages of index)
publisher	Manning
rating	9
reviewer	Simon P. Chappell
ISBN	1932394281
summary	Solid introduction to Lucene

Who's it for?

Lucene is a library and framework, rather than a complete application. It truly is an engine, around which you are expected to build and extend your own application. Like Lucene, the book is targeted at those who are looking for a tool to build their own search facility application rather than just "download and go." The book does include a number of case studies of Lucene usage (including at least one download and go search engine) but those are included to show how to use and adapt Lucene to fit differing environments rather than as ends in themselves.

The Structure

The book is sensibly divided into two parts. The first part looks at "Core Lucene" functionality, while the second part addresses "Applied Lucene".

Part one has six chapters, covering the central components and inner workings of Lucene. It's here that the book starts with a tutorial introduction, familiarising the reader with the concepts of Lucene as a search engine around which you wrap your own code. The other five chapters move steadily through good search engine fare, with indexing getting the whole of chapter two to itself. The discussion of how to retrieve text from the documents being indexed is mentioned here but postponed until chapter seven, where it is dealt with exhaustively. Chapter three covers searching, and especially how Lucene ranks documents.

Chapter four examines analysis. In its chapter introduction, the book explains that "Analysis, in Lucene, is the process of converting field text into its most fundamental indexed representation, terms." This process is performed by an analyser, which tokenises text according to its own built-in rules. Each analyser has a different emphasis: some want only dictionary words, others might explicitly include acronyms, and sometimes you'll want an analyser that blocks stop words (those words in a language that are part of its structure, but that add nothing to the information being conveyed by the text; classic examples of stop words in English include "a", "and" and "the").
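To make the idea of analysis concrete, here is a minimal sketch in Python of what an analyser does: lower-case the text, split it into tokens, and drop stop words. It is not Lucene's API (Lucene is a Java library with its own analyser classes); the function name `analyse` and the three-word stop list are assumptions made up for illustration.

```python
import re

# Hypothetical stop-word list for illustration only; not Lucene's actual list.
STOP_WORDS = {"a", "and", "the"}


def analyse(text, stop_words=STOP_WORDS):
    """Turn raw field text into a list of index terms.

    Steps: lower-case the text, split it into runs of letters and digits,
    then drop any token that appears in the stop-word set.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in stop_words]


terms = analyse("The quick brown fox and a lazy dog")
print(terms)  # ['quick', 'brown', 'fox', 'lazy', 'dog']
```

A different analyser would simply make different choices at each step, for example keeping acronyms intact or using a longer stop list.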

Chapter five looks at advanced search techniques; everything from sorting search results, searching on multiple fields to filtering searches. Many free or open source software tools are extensible, and Lucene is no exception. Chapter six addresses creating and using custom components within Lucene, everything from custom sort methods to custom filters.

Part two, the final four chapters, covers Applied Lucene. It is dedicated to practical uses of Lucene and answers the question "So, what can I do with a search engine?" Chapter seven covers ways and means to parse common, non-plain text document formats. The primary formats covered are RTF, XML, PDF, HTML and Microsoft Word. The ability to parse and index these file formats will cover the search engine needs of the majority of Lucene users. Chapter eight looks at a number of Lucene tools and extensions that are available, many of them free and open source software. Chapter nine covers ports of Lucene. While for many users, Lucene being a Java library is not a problem, some users want its functionality in environments that do not have Java. The chapter looks at ports written in C++, C#, Perl and Python. Lastly, chapter ten takes a thorough look at seven Lucene case studies. Perhaps the "star" case study is the one about Nutch, a download and go search engine written by Doug Cutting, the original author of Lucene.

There are three appendices. The first offers installation advice for Lucene; a useful addition that those newer to working with Java libraries will surely appreciate. The second appendix has a very well explained description of the Lucene index format. This is the kind of information that can be hard to find, so it is welcome in a book of this sort. The last appendix contains a number of categorised resource references. The number and breadth of the resources listed could provide quite an incredible education in information retrieval theory if the reader were inclined to read them all.

What's to Like?

There are several things to like about this book. Let's start with the fact that the authors are part of the core development team of Lucene. This gives them both credibility and an excellent understanding of the internal workings of Lucene. Co-author Erik Hatcher is a fantastic writer, having previously been a co-author of the only Ant book worth bothering with, Manning's Java Development with Ant. (Full disclosure: I do know Erik personally.)

The structure of the book is well thought out and each chapter does seem to move your understanding forward when combined with what you learned from the preceding ones. The division into core and applied Lucene is also helpful. While you'd hope that this was the case, it often isn't; hence I note it as a positive.

I especially appreciate that this book does not fill up page after page with API documentation. The authors appear to have grasped that if you have Internet access to download the software, you might just be able to access the documentation online; rather, they concentrate on the way to use the software. What a concept!

As a part of Manning's "in Action" series, the book has excellent layout and has obviously been thoroughly edited by both technical evaluators and copyeditors. This might seem to be a small thing to some, but a well-edited book stands out clearly from the crowd.

What's to consider?

If you are looking for a book on using and configuring a download and go style of search engine, this book would be less suitable. While the case study on Nutch is of good length, it would be too short to be useful as a configuration guide.

Conclusion

I enjoyed reading this book. If you have any text searching needs, this book will be more than sufficient equipment to guide you to successful completion. Even if you are just looking to download a pre-written search engine, this book will provide a good background to the nature of information retrieval in general and text indexing and searching specifically.

You can purchase Lucene in Action from bn.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.

This discussion has been archived. No new comments can be posted.

My solution (Score:3, Interesting)

by Neil Blender ( 555885 ) writes: <neilblender@gmail.com> on Wednesday August 24, 2005 @03:55PM (#13392035)

My home page is a nicely sorted webpage with all my frequently visited links in a password protected section of my web site. If something gets used enough in my bookmarks, it gets put on that page and gets deleted from my bookmarks. Then, no matter where I am or what computer I am on, I can access my links.

Wow, open source search engines. (Score:2, Interesting)

by keilinw ( 663210 ) * writes: on Wednesday August 24, 2005 @03:56PM (#13392038) Homepage Journal

Thanks! I was looking for a good book on Open Source search engines. While I have never heard of "Lucene" I will definitely be looking into it now. It's probably a good opportunity to learn all about Search Engine Heuristics, methods, etc...

Also, I agree with the author that bookmark functionality has gone the way of the dinosaurs... with the exception of the "open all tabs" feature found in many browsers today... that is about the only one that I use often.

I'm just wondering how the "search" functionality will actually play out in the future. Apple has "Spotlight" and Microsoft is supposedly incorporating magick folders or something like that into Vista. Can anyone tell me more about Lucene and how it differs from say Google or other search engines?

Thanks,

keilinw.

Bookmarks are better (Score:4, Interesting)

by saskboy ( 600063 ) writes: on Wednesday August 24, 2005 @03:58PM (#13392061) Homepage Journal

Bookmarks are more secure than a search engine, since a search engine could provide a poisoned link, and if you're typing in the URL by hand, if you make a spelling mistake, you could find yourself at a pharming site, or someplace you didn't want to go.

I tend to use bookmarks in Firefox and the autocomplete about equally, and make use of the Quick Links toolbar for my most popular sites.

The Firefox bookmark all tabs feature is a breakthrough, since you can close your browser, and reopen it to the same set of tabs as before, which is great when installing extensions and you're forced to restart. The only drawback is that scrolling through bookmarks is too slow, but if you use your scroll wheel it speeds up considerably. That's a trick I didn't figure out until just last month.

Better Memory Than I (Score:3, Interesting)

by Flamesplash ( 469287 ) writes: on Wednesday August 24, 2005 @04:03PM (#13392107) Homepage Journal

Now, I go to my favourite web search engine, enter a couple of appropriate search terms and voila, there's my page!

You have a better memory than I, my friend. Many times I only barely remember something I want to find again. Maybe I remember it was humorous, or maybe I remember it was an online game with pigs in it. Unless it's popular I doubt 'pig game' is gonna get me far. So bookmarks aren't so useless to those of us who don't keep everything in RAM.

Bookmarks, and a good hierarchy, also leverage the Associative aspect of our minds. Skim through your high level bookmark folders and you'll probably find what you were thinking of pretty quick. Additionally it reminds you of things you may have bookmarked yet forgotten.

RSS (Score:2, Interesting)

by ezweave ( 584517 ) writes: on Wednesday August 24, 2005 @04:07PM (#13392143) Homepage

While search engines are great, bookmarks are not obsolete. I use RSS feeds to keep up on anything that is serialized that I might care about. FF is great for that.

I still use a few regular bookmarks (like the URL that logs me into /.). Or for development servers with obscene URLs. That is the kind of thing that a search engine won't find. Especially if you have to deploy to a few web servers (this is the WebLogic machine, this is the OAS machine, etc). I have even bookmarked LDAP strings for testing.

More to the point of TFR, I would be interested in learning more about OSS search engines. It would be great to set one up on my own net... hmmm. As an aside, can Lucene be used for local searches? It would be cool to make my own desktop search. What kind of licensing does it have?

Lucene providing search engine for Hula (Score:3, Interesting)

by bad_outlook ( 868902 ) writes: on Wednesday August 24, 2005 @04:11PM (#13392171) Homepage

The Lucene (http://jakarta.apache.org/lucene [apache.org]) indexer will be implemented within Hula, the web and cal application (http://hula-project.org/Hula_Server [hula-project.org]) made from open sourced Novell NetMail code. Samples of the search engine have been committed and should start functioning within weeks, just in time for the new cal UI, which you can now view a demo of here: http://nat.org/2005/august/hula.html [nat.org] That's looking to be an amazing app...

Google anyone ? (Score:3, Interesting)

by Potatomasher ( 798018 ) writes: on Wednesday August 24, 2005 @04:14PM (#13392198)

Does anyone find it a little funny that on the main lucene.apache.com webpage, there is a "Search this site with Google" textbox? Kind of makes you NOT want to use their search engine if they don't even trust it enough to work on their own site....

Re:Bookmarks are better (Score:4, Interesting)

by zhiwenchong ( 155773 ) writes: on Wednesday August 24, 2005 @04:26PM (#13392275)

Quite... I just want to say something:

I think that abandoning bookmarks altogether is a bad idea.

Search, while useful, only works if you can find the exact keywords necessary to bring up a certain page. Search merely complements, rather than replaces, bookmarks.

Looking through my bookmark lists, I see many websites which I would never have known how to search for (they're mostly websites I stumbled upon from other websites). Some of these sites are hard to find because:

1) they don't have enough Statistically Improbable Words. e.g. try searching for software that describes biology of a python.

2) the page doesn't contain words associated with its significance to me (yes, it can happen). e.g. let's say you come across a page that has a nice layout that you want to revisit later -- if you ever forget the keywords on that page, you may never find it again. Whereas if I were to file it under "Nice websites" in my bookmark folder, I'd be able to find it again.

3) I can't remember any of the keywords associated with the page.

4) I forget that I've ever visited those webpages. Some search engines (e.g. a9.com) have histories that you can revisit, but they're no use unless you can classify them. And if you classify them, they're basically bookmarks.

I think the reason people dislike bookmarks is because they're a hassle to organize. We need some sort of tool to auto-organize bookmarks.

There are two basic requirements:
1) Multiple hierarchy - a bookmark must be able to belong to more than one category. Example of this is GMail's labels [g04.com] -- each email can belong to more than one label.

2) Automatic classification - the proper term for this is automatic taxonomy. This can be accomplished using a Bayesian algorithm (like the one POPmail is using). In fact, DEVONthink already does this [devon-technologies.com].

When a user makes a bookmark, the program should come up with a list of category folders (sorted from likeliest to least likely) to file that bookmark under, and the user must be allowed to select more than one folder.
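As a rough illustration of the two requirements above, here is a small Python sketch of a folder suggester. It assumes a plain multinomial naive Bayes model with add-one smoothing over the words in a bookmark's title, which is one common way to do Bayesian classification and not necessarily what POPmail or DEVONthink do. The class name `FolderSuggester`, its methods, and the sample bookmarks are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split it into runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


class FolderSuggester:
    """Hypothetical naive Bayes ranker for bookmark folders (illustration only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # folder -> word -> count
        self.doc_counts = Counter()              # folder -> bookmarks filed there
        self.vocab = set()                       # every word seen in training

    def add_bookmark(self, title, folders):
        # Requirement 1: one bookmark may be filed under several folders.
        for folder in folders:
            self.doc_counts[folder] += 1
            for word in tokenize(title):
                self.word_counts[folder][word] += 1
                self.vocab.add(word)

    def suggest(self, title):
        """Requirement 2: return folders sorted from likeliest to least likely."""
        words = tokenize(title)
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for folder, n_docs in self.doc_counts.items():
            counts = self.word_counts[folder]
            total_words = sum(counts.values())
            # Log prior plus add-one smoothed log likelihood of each word.
            score = math.log(n_docs / total_docs)
            for word in words:
                score += math.log((counts[word] + 1) / (total_words + len(self.vocab)))
            scores[folder] = score
        return sorted(scores, key=scores.get, reverse=True)


suggester = FolderSuggester()
suggester.add_bookmark("Lucene search engine tutorial", ["Search", "Java"])
suggester.add_bookmark("Java servlet guide", ["Java"])
suggester.add_bookmark("Pasta recipes and cooking tips", ["Cooking"])
print(suggester.suggest("pasta cooking tips"))
```

The user would then tick one or more of the suggested folders, and each confirmed choice becomes another `add_bookmark` call that improves later suggestions.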

Re:Better Memory Than I (Score:3, Interesting)

by garcia ( 6573 ) * writes: on Wednesday August 24, 2005 @04:26PM (#13392277)

I haven't used Bookmarks since 1998 or 1999. Too much of a hassle finding stuff when the links are dead anyway.

His solution, using a search engine, is a much better method as you might even come across something new and even MORE useful than what you had originally bookmarked.

I check a handful of websites daily. Mostly Google News, slashdot, MNspeak, geocaching.com, mngca.org, and usually some others. While having them set up in a hierarchy might leverage the association aspect, typing them in every time exercises my memory and my typing. I guess we each have our own separate areas we'd prefer to work on.

YMMV.

Bookmarks toolbar folder is better! (Score:3, Interesting)

by ImaLamer ( 260199 ) writes: on Wednesday August 24, 2005 @04:39PM (#13392376) Homepage Journal

Scroll wheel? Thanks, that is a major helper.

What I've begun doing is using the "Bookmarks Toolbar Folder" for all of my bookmarks. I've got "Essentials" with links to Gmail, Adsense, my website, Distributed.net stats and so forth, basically all of the sites that I try to visit daily. Then I've got "Favorite sites" that holds Slashdot (even though now it's "home"), Woot, Craigslist, Free6.com (hehe), Assambassador.com, Myspleen, demonoid, you get the point.

Then I've got the essential one: "Functions" - that holds mostly Javascript links but other things like TinyURL, @nonymous, Wordpress Press-It, BlogThis!, post to del.icio.us, Ping-o-matic, Send SMS message, mailto: and whatever. Then there is "Junk" which isn't really used any more because del.icio.us is so sweet. I generally dump something I might want to read later there and categorize it later (like every 8 months). Then of course I've got a few drop-down RSS feeds, but since they are torrent sites I'll keep them to myself. (Oh, I almost forgot - a huge drop down of del.icio.us bookmarks with the help of Foxylicious [mozilla.org])

This works well, and generally reminds me of a filing system. Since I never use the File, Edit, etc. menus, this has become my new menu.

Re:My solution (Score:1, Interesting)

by Anonymous Coward writes: on Wednesday August 24, 2005 @05:40PM (#13392788)

Well, exactly.

The "home" button on a browser is supposed to take you to YOUR OWN web space, maintained by you - i.e. your home. Some bits might be your front garden, visible to others, others private.

People who use the "home" button as just another bookmark to a search engine are missing the point of the web.

It isn't helped by the fact that current browsers aren't actually good as editors (unlike the original web browser vision) - Your web site should be a WYSIWYG-editable personal/private pseudowiki. Many people have the wiki part down now, but are stuck typing markup into text boxes when it should be a matter of pointing your browser at the site and making the change.

Where are the WebDAV compliant WYSIWYG wikis?

Microsoft kindof-sortof tried with FrontPage, and Amaya and Mozilla Composer are both in existence, but the problem is most people on the net are now brainwashed into drooling "consumers" of corporate media instead of being active participants in society i.e. "citizens".

Re:Lucene is great! I use it all the time (Score:2, Interesting)

by bmalia ( 583394 ) writes: on Wednesday August 24, 2005 @06:16PM (#13392989) Journal

As a programming language, Java is kind-of weak.

Java is anything but weak.
