
Mastering Regular Expressions

Posted by samzenpus
from the programmers-handbook dept.
Simon P. Chappell writes "Classics are funny things, especially in the world of books. There are books that people say "should" be classics (I'll refrain from mentioning names to protect the pretentious) and then there are books that people are too busy actually using to get around to listing as classics. Mastering Regular Expressions, now in its third edition, is in the second group. It's one of those books that you see on desks in computer departments the world over. This is a real "doers" book." Read the rest of Simon's review.
Mastering Regular Expressions
author Jeffrey E.F. Friedl
pages 515 (31 page index)
publisher O'Reilly
rating 11 out of 10
reviewer Simon P. Chappell
ISBN 0596528124
summary A classic of modern computer literature.


This is a book for programmers; managers, project managers and architects need not apply. If you talk about code instead of writing it and have teams of programmers report to you, then consider buying this book and giving it to them. If you're a technical lead or lead programmer, then shame on you if an earlier edition of this book isn't already on your shelves! The majority of examples are written using Perl, but if you can read basic Perl (Pidgin Perl, perhaps?) then you'll be fine with the examples. Programmers in PHP, Java, .NET and Ruby also have dedicated sections of the book, so it's very inclusive and almost platform agnostic.

The book has ten chapters divided into two parts. Chapters one through six are what Mr. Friedl calls the "story" of regular expressions. Chapters seven through ten are an examination of the specific regular expression capabilities of Perl, Java, .NET and PHP.

Chapter one is an introduction to regular expressions. At only 33 pages, you might think that it would be shallow, but rather, it is knowledge dense. The examples in the first chapter use egrep extensively. This makes a lot of sense as it's an advanced tool, easy to use and freely available for most modern operating systems.

Chapter two builds on this introduction with extended introductory examples. These are written in Perl (again, simple and easy to follow), but there is no doubt that the regular expressions are the stars of the show around here. The examples are small Perl programs, but their benefit is that Mr. Friedl talks the reader through the process of creating each of them. This is more useful than just presenting example programs, because with just pure examples, you are out of luck if your specific problem is not covered. With this approach, you're coached towards thinking in regular expressions and are more equipped to address your personal regular expression needs.

Chapter three provides an overview of regular expression features and flavors. It starts with a historical view of the development of regular expressions, including a few asides about the influence that the earlier versions of the book have had on that development. After that, the chapter uses a search and replace example to demonstrate some of the differences between flavors of regular expression capabilities provided by different programming languages. Strings, Unicode and metacharacters round out this overview.

Strap yourself in for chapter four; it's time to talk about the computer science that makes all of that matching work. If you didn't know the difference between an NFA and a DFA regular expression engine before you start this chapter, you most certainly will by the end of it. At first sight, it might seem that this is a chapter for the pure propeller heads amongst us. While there is much theory here, it's all presented in the light of how your regular expression engine is trying to do what you asked. By understanding the approaches to regular expression processing, we can learn to help ourselves. We help ourselves when we write regular expressions that run faster and use less memory. We write better regular expressions when we understand the consequences of what we write. For example, the oft-written ".*" (dot star) seems like a great way to ignore a bunch of stuff in the middle of an expression, but such simplistic use is just waiting to bite you. This chapter explains why, how to deal with the situations where you'd be tempted to use simplistic expressions, and how just a little extra thought can bring you the behavior you want.
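
To make the dot-star trap concrete, here is a minimal sketch. It is written in Python rather than the book's Perl, and the sample sentence is invented for illustration; it is not an example taken from the book. It compares the greedy ".*" with two common alternatives, a lazy quantifier and a negated character class.

```python
import re

# Invented sample text with two separate quoted strings.
text = 'He said "yes" and then "no" before leaving.'

# Greedy: .* grabs as much as it can, then backs off only as far as
# the LAST quote on the line.
greedy = re.search(r'".*"', text).group()

# Lazy: .*? grows one character at a time and stops at the FIRST closing quote.
lazy = re.search(r'".*?"', text).group()

# Negated class: [^"]* can never step over a quote, so it also stops early.
negated = re.search(r'"[^"]*"', text).group()

print(greedy)   # "yes" and then "no"
print(lazy)     # "yes"
print(negated)  # "yes"
```

The lazy and negated-class versions agree here, but they are not interchangeable in general: the negated class says outright what the middle may not contain, while the lazy form still lets the dot match a quote if that is the only way for the rest of a longer pattern to succeed.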

Chapter five is a practical counterpoint to the previous theory chapter. Here, Mr. Friedl discusses practical regular expression techniques. There are a number of short examples, before he works through medium-sized HTML processing examples and finishes up with a look at processing Comma Separated Value (CSV) data.

Chapter six covers efficiency. Your regular expression can be as correct as you like, but if it takes what seems like an eternity to run, then it's of little use. This chapter mostly addresses NFA-based engines, because their performance varies the most with how the regular expression is written.
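
As a rough illustration of that variability, the sketch below (Python, not from the book; the patterns and input are made up for this purpose) times two patterns that accept exactly the same strings. Python's re module is a backtracking engine, so the version with nested quantifiers has many more ways to fail on a near miss. Actual timings depend on the machine and the Python version, so none are quoted here.

```python
import re
import time

# Two patterns that accept the same strings: one or more "a", then "b".
NESTED = re.compile(r"(?:a+)+b")  # nested quantifiers: many ways to split the a's
FLAT = re.compile(r"a+b")         # single quantifier: one way to split them


def time_search(pattern, text):
    """Return (match_found, elapsed_seconds) for a single search."""
    start = time.perf_counter()
    found = pattern.search(text) is not None
    return found, time.perf_counter() - start


# A near miss: plenty of a's but no "b", so the engine has to try
# every alternative before it can report failure.
near_miss = "a" * 18 + "c"

for name, pattern in (("nested", NESTED), ("flat", FLAT)):
    found, elapsed = time_search(pattern, near_miss)
    print(f"{name:6s} found={found} elapsed={elapsed:.6f}s")
```

Both searches fail, as they should; only the amount of work differs. Lengthening the run of a's makes the gap grow quickly for the nested pattern, so it is worth increasing the length in small steps.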

Chapters seven through ten cover the specifics of using regular expressions in Perl, Java, .NET and PHP. They're well written and cover everything you need to apply the content of the first six chapters to your programming language of choice.

Everything about this book is great. This is the kind of book that O'Reilly built its reputation with. A master of the subject matter, writing in a clear, easily understood manner, leaving the reader educated and able to operate comfortably with the subject matter. I may not be a regular expression guru, but I feel that I have a much better grasp of the fundamentals that I would need if I did want to be such a guru.

Mr. Friedl is to be commended for his clear explanations of what is, in all reality, much more complex computer science than many of us are used to dealing with. The fact that his explanations are highly readable and enjoyable is a significant bonus.

There is a website for the book, regex.info and a blog at regex.info/blog, where Mr. Friedl has some wonderful photographs of Japanese gardens with their autumn colors. (Nothing to do with regular expressions, but they appealed to my inner photographer.)

Lastly, while the book is not intended to be an encyclopedia of regular expressions, all of the examples are very relevant to programmers' needs and this book can easily serve that reference role.

At the risk of sounding like some kind of O'Reilly shill or a relative of Mr. Friedl, I must report that I don't think that I found a single thing I didn't like about this book.

This is a classic of the first order. Nail it to your desk unless you want to be constantly retrieving it from your co-workers. If I might be permitted a Spinal Tap reference, this one goes to eleven. If you ever use regular expressions, are thinking of using regular expressions or are in the same room as a regular expression, then you need this book.


You can purchase Mastering Regular Expressions from bn.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.
This discussion has been archived. No new comments can be posted.


  • ...but it seems funny that someone signing himself as Simon P. Chappell would worry about "protect[ing] the pretentious".

  • The first two editions were also great books. An indispensable resource for sure and mandatory reading for my devs.
    • Re: (Score:2, Interesting)

      by Anonymous Coward
      Guess what - in Silicon Valley, a bunch of sometimes arrogant, more often brilliant, unrepentant commercialists made a system for the Macintosh called MPW. I used their proprietary system for years. I never wanted to deal with four uses of * and / and . and all the others. With a few Greek characters, the expressions for Position before A, and Selection between A and B and a bunch of others worked really, really well.

      Now that NeXT acquired Apple, the web is indispensable, and the BSD that drives the Mac is s
      • Here [wikipedia.org] ya go, sonny. Read up on Alonzo Church's grad student. "*" isn't "splat" in this context, it's the Kleene star [wikipedia.org].
      • Re:Agreed! (Score:4, Informative)

        by orasio (188021) on Wednesday September 13, 2006 @04:20PM (#16098800) Homepage
        Regular expressions have academic books behind them, and computer science books are written about them.
        Maybe what you talk about is nice, but REs (with extensions) are kind of ultimate solutions to the problem they try to solve (describing an automaton in a string of characters).

        The only thing that is needed to use another complete system is a theorem that proves there is a two way conversion between the system you like and REs, and then it would be fairly easy to implement everywhere.

      • by nickos (91443)
        I've heard interesting things about MPW (Macintosh Programmer's Workbench), but never used it myself.

        Anyone got any ideas where I can find a copy and how I can play with it?
  • Knock Knock (Score:5, Funny)

    by neonprimetime (528653) on Wednesday September 13, 2006 @03:36PM (#16098438)
    What did one regex say to the other?

    .+

  • by Jester998 (156179) on Wednesday September 13, 2006 @03:37PM (#16098447) Homepage
    I bought this (along with a few other O'Reilly titles) a couple months back, and I highly recommend Mastering Regular Expressions. Even though it's a dry technical topic, the presentation is awesome.

    I read through the whole thing as if it were a novel, and picked up more than a few new things about regexes.

    Very handy book, both to read through to really learn how regexes work, and as a day-to-day reference. The score of 11/10 given by the reviewer is bang on.
    • This book also goes by another name: Cliff's Notes: Interviewing With Steve Yegge Style Interviewers. It's like all the guy talks about.
  • by Anonymous Coward on Wednesday September 13, 2006 @03:38PM (#16098460)
    are always sad
  • by 140Mandak262Jamuna (970587) on Wednesday September 13, 2006 @03:43PM (#16098495) Journal
    The author and the reviewer are blatantly biased in favour of the regular expressions, ignoring the plight of the millions of downtrodden irregular expressions who are not able to get a platform to voice their grievances. All because they are viewed as somehow deviant or deficient. It is time for the irregular expressions to come out of the closet and assume their role as legitimate members of the syntax.
  • I own an older version of this book and it really rocks.

    As usual, Amazon [amazon.com] has it cheaper than BN ($29.69 vs $35.99).
  • Personally... (Score:5, Informative)

    by rainman_bc (735332) on Wednesday September 13, 2006 @03:46PM (#16098523)
    I just like to go to http://www.regular-expressions.info/ [regular-expressions.info] myself - I seem to find all the stuff I forget from time to time there...
    • Re: (Score:3, Insightful)

      by owlstead (636356)
      Note that I immediately found errors in the Java section of this site. E.g., according to the site, the default Java regexp support does not include searching for case insensitive strings, which it does. Beware.
    • I seem to find all the stuff I forget from time to time there...

      That has got to be a memory leak. Oh well, at least you have garbage collection.
  • by Sebastopol (189276) on Wednesday September 13, 2006 @03:50PM (#16098553) Homepage
    When I read the 2nd edition of this book I was floored by how much richness I was missing in the regex language (well, in Perl regex, that is).

    Like a kid at Christmas, I immediately went nuts on my next project with \G and the lookaround operator(s).

    Sadly, when a big bundle of code I wrote was delivered to a team in a city on another very large eastern continent, no one could understand what I had written, so they deleted my nifty \G loops and replaced it all with a crappy first-year-college-grad-non-indented parsing state machine using gotos. The complaint was not that I went nuts with regex, but that I was using a NONSTANDARD Perl version which supported them (instead of their ancient version!), and that it was my duty to deliver a tool using standard versions. I was most angry at the fact that they just replaced the code with a buggy state machine, and then asked me to debug another problem caused by their mess because it was my tool originally. Ugh!

    Anyway, my point is: (perl) regex are a far richer tool than meets the eye, but beware The Boneheads: the people who refuse to learn something new that could make their life easier and cling to the old way. Gawd forbid someone learn something new on the job.

    Sigh. I was hoping at least ONE programmer over there would have shared my enthusiasm for \G. /endrant

  • This is slightly offtopic, but it's regex-related. Where are the regex training programs for windows/linux? Or even regex tools to parse data and help you design your expressions?

    Seems like a typical thing that's always overlooked. I saw RegexBuddy for PC, but it is missing awk/sed/bash regex.

    While reading a book helps, a tool for the inexperienced would help train and get the job done.
    • by doti (966971)
      Especially if this tool could work with the various regexp formats around (sed, vim, perl, etc).
      Like, you type a regexp in one format, and it also shows the equivalent regexp in the other formats.
    • by Otter (3800)
      Where are the regex training programs for windows/linux? Or even regex tools to parse data and help you design your expressions?

      Check out KRegExpEditor in KDE...

    • by prostoalex (308614) * on Wednesday September 13, 2006 @04:27PM (#16098889) Homepage Journal
      The Regex Coach [weitz.de] - The Regex Coach is a graphical application for Windows and Linux/x86 (also usable on FreeBSD) which can be used to experiment with (Perl-compatible) regular expressions interactively.

      The Regulator [osherove.com] - The Regulator is an advanced, free regular expressions testing and learning tool written by Roy Osherove. It allows you to build and verify a regular expression against any text input, file or web, and displays matching, splitting or replacement results within an easy to understand, hierarchical tree.
    • by gojomo (53369) on Wednesday September 13, 2006 @05:24PM (#16099465) Homepage

      Give a try to my web-based tool, Regex Powertoy [powertoy.org]. Its interface is all DHTML/CSS/Javascript, but requires a hidden Java (1.5) applet for the advanced and steppable regex engine.

      Given that Java core, there are options for adding/removing usual Java literal escaping, which in Java code means lotsa backslashes. Not all Perl advanced features are supported.

      I hadn't considered a pick for awk/sed/bash syntax limits/conversion but will consider it. Any handy reference to how their syntax differs from Perl/Java? (The thing that usu. bites me with sed is escaping of parentheses.)

    • If you wanted to learn or develop some regexes, you sat down with regex(7) open in one terminal and an interactive perl in another window to test them out.

      It never occurred to me that I would need or want a tool to generate them. It's not like they're that hard to comprehend. (Although they can be a pain to document... thankfully Perl allows you to add whitespace and comments to a regular expression so it can make sense to a third party)
    • by Chrax (782154)
      "Where are the regex training programs for [Linux]?" /usr/bin/perl -w
  • Does anyone know what is new in the 3rd edition? This is missing from the review.
    • Re: (Score:2, Funny)

      by Skiron (735617)
      Yes, they missed a . on page 102, paragraph 14.
    • by c0rr1n (992967) on Wednesday September 13, 2006 @04:27PM (#16098883)
      Mastering Regular Expressions, Third Edition, now includes a full chapter devoted to PHP and its powerful and expressive suite of regular expression functions, in addition to enhanced PHP coverage in the central "core" chapters. Furthermore, this edition has been updated throughout to reflect advances in other languages, including expanded in-depth coverage of Sun's java.util.regex package, which has emerged as the standard Java regex implementation. The languages covered in Mastering Regular Expressions include Perl, Python, Ruby, Java, VB.NET and C# (and any language using the .NET Framework), PHP, and MySQL.
  • by cptgrudge (177113) <cptgrudge&gmail,com> on Wednesday September 13, 2006 @03:51PM (#16098561) Journal

    There are books that people say "should" be classics (I'll refrain from mentioning names to protect the pretentious)

    I'm not going to refrain.

    The Three Musketeers, Alexandre Dumas
    Pride and Prejudice, Jane Austen
    David Copperfield, Charles Dickens

    Look at me, I'm being pretentious!

    • by Gulthek (12570)
      You missed it, those books you list are already classics.

      The post author was referring to books that only pretentious people know about and think *should* be classics. Stuff like "Attack of the Bacon Robots" or something.
      • Attack of the Bacon Robots

        I have signed copy #353 of 1500 of that book. It should be a classic, which would make my signed copy worth even more.
  • A book on regular expressions? What, is the Internet broken?
  • So, why is ".*" bad, other than that you sometimes want Perl's non-greedy ".*?" instead?

    Now I'm curious (but still too cheap to buy the book).
    • by tehshen (794722)
      It's because some people didn't know about the greedy thing.

      The greedy thing goes thus: If you have a string like %{"Attack of the Bacon Robots" is better than "Pride and Prejudice"}, and you want to extract whatever's inside the quotes, the obvious thing for regex younglings to do is to use one like /"(.*)"/. Starts with a quote, stuff in the middle, ends with a quote.

      This is expected to catch "Attack of the Bacon Robots"; but because * is greedy, it eats up the entire string, all the way from Attack to Pr
  • by MarkByers (770551) on Wednesday September 13, 2006 @04:09PM (#16098713) Homepage Journal
    Some people, when confronted with a problem, think I know, I'll use regular expressions. Now they have two problems.
  • by trigeek (662294) on Wednesday September 13, 2006 @04:13PM (#16098735)
    To quote: "Sometimes a hacker has a problem, and he thinks to himself 'I know, I'll solve it with a regular expression!'. Now he has two problems." -- Jamie Zawinski
  • I'm already quite proficient at regexen (people at work come to me for help etc). How much do I stand to gain from this book?
    • Re: (Score:2, Informative)

      by gerbercj (267098)
      This book is not really meant to teach you how to write regular expressions. This book teaches you to understand how your regular expressions will be parsed so that you can understand the impact of your approach and start creating expressions that are much more efficient, or that handle special cases more elegantly. It's the book that, in my case, took my skills to the next level. I still refer to it a few times a year, and am glad that it's a part of my library.
    • by LuckyStarr (12445)
      The theoretical part would perhaps further your insight into regexen. Hard to tell how good you really are. This book really was an eye-opener to me.
    • by teslar (706653)
      Honestly?
      If you have to ask, probably enough to warrant buying the book.
  • by pyrrho (167252) on Wednesday September 13, 2006 @04:19PM (#16098779) Journal
    .... wordprocessor and email program with a regular expression!

    PS: not really but wouldn't that be feckin' awesome! it was emacs... if I really had done it I mean.

  • We've got all three editions of this book in our office and they keep getting better. As the review says, this book will teach you the difference between a DFA and an NFA engine if you want to learn that, or just how to do some simple capturing if that's all you need. Friedl's writing is very approachable and the book's notation for showing what part of a string a regex will select is very helpful.

    And this stuff comes up over and over - if you ever need to tweak a JavaCC [java.net] grammar knowing how to specify a DF
  • by MobyDisk (75490) on Wednesday September 13, 2006 @05:19PM (#16099417) Homepage
    I am glad to see this on Slashdot since regular expressions is an area that geeks could really use help in.

    For example, instead of saying the common geek expression "Greetings Program!" try a more regular expression such as "Hello Sir" or the more casual "Wassup?" IRL, Tron references are not considered cool. Another common faux pas is using the expression "Hey n00b, what's your function?" instead of something more regular like "Hey dog, what's your problem?" If someone tries to threaten you, think about their technical skills before saying "Close your port before I pwn j00!" Life is not an FPS. "Shut up before I kick your ass" works very well.
  • "A classic is something that everybody wants to have read and nobody wants to read."

    One of my favorite Mark Twain quotes...
  • by teslar (706653) on Wednesday September 13, 2006 @05:52PM (#16099680)
    It's a good review and the book's great and all that, but I still had to cringe when I read this:
    there is no doubt that the regular expressions are the stars of the show around here.
    You don't say... in a book called 'Mastering Regular Expressions', that must have come as a real surprise...
  • by schlick (73861) on Wednesday September 13, 2006 @05:54PM (#16099700)
    Years ago I was calling around to bookstores looking for this book. A few bookstore employees asked me if it had a lot of pictures. They thought it was a book for people who have trouble communicating. Like knowing when to say 'hi' vs. 'hello' or something. Sheesh. Now granted, many people who read this book may be socially challenged, but this book won't help that.
  • I'll say it again (Score:4, Interesting)

    by pvera (250260) <pedro.vera@gmail.com> on Wednesday September 13, 2006 @06:43PM (#16100034) Homepage Journal
    I bought this book years ago and still can't STFU about it, sorry.

    At my previous job (web-based custom market research) we did hundreds of web surveys which had on the average some 400 data points per survey. These had distinct variable names, etc. and were built 100% by hand when I was hired in the company some time in 2002. My first survey project was a disaster, it took me about 20 hours from the final approved survey document to the dynamic version. The process was riddled with manual steps that created an infinite amount of room for errors.

    Enter regular expressions.

    While fiddling with BBEdit Pro I finally decided to take a shot at regular expressions. After an hour or so of experimenting I started writing a few filters that allowed me to cut down the turnaround from 20 hours per survey to a little over 10 hours. When I got to the point in which I wasn't able to figure things out from the BBEdit documentation and the web, I convinced the boss to buy me Mastering Regular Expressions.

    Within the first 50 pages, I had picked up on additional regular expressions concepts that allowed me to eventually cut down the turnaround per survey to less than 8 hours. That's not 8 hours programming, that's 8 hours from the moment the approved survey is handed over to programming to the moment it passes QA checks and is considered ready to go live.

    This was a $50 or so book, and it saved us thousands of dollars over the four years I worked at that company. Of course, my reward for saving the company all that money was to lay me off, and I "forgot" to leave instructions on how to use the text filters, so I imagine my replacement is right now writing surveys by hand.

    Some of the things that proved to be killer uses for regular expressions within that context:

    1. The approved survey would have specific variables that the analysts would need to keep for importing into SPSS later down the process. A text filter picks up those variables and generates a unique list of every variable needed for the survey. The variables are named with specific patterns, so you know which ones are strings, integers, etc.

    2. Now that we have a list of variables, it means we can quickly generate the CREATE TABLE statement for the survey data. What used to be done by copying and pasting 400 times is (was?) now done by highlighting the text and running a macro. The output is the SQL command you need.

    3. Since you already have the list of variables, you can generate the 400 statements needed to read each form variable into its proper variable in the ASP code.

    4. The same way you can generate the hidden form fields that you need.

    5. The same way you can generate the INSERT statement to send your data to the database.

    Little things like that. Eliminating all that copying and pasting really cut down on the QA overhead per project.
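
The workflow in that list can be sketched roughly as follows. This is a Python illustration, not the poster's BBEdit filters. The naming convention (a `str_` or `int_` prefix), the SQL column types, the table name, and the sample survey text are all hypothetical, since the post does not give the actual patterns.

```python
import re

# Hypothetical naming convention (not from the original post):
#   str_<name> holds text, int_<name> holds a whole number.
VARIABLE = re.compile(r"\b(str|int)_[A-Za-z0-9]+\b")
SQL_TYPES = {"str": "VARCHAR(255)", "int": "INTEGER"}  # assumed column types


def unique_variables(survey_text):
    """Map each variable name to its type prefix, in order of first appearance."""
    seen = {}
    for match in VARIABLE.finditer(survey_text):
        seen.setdefault(match.group(0), match.group(1))
    return seen


def create_table(table_name, variables):
    """Build a CREATE TABLE statement with one column per variable."""
    columns = ",\n".join(
        f"    {name} {SQL_TYPES[prefix]}" for name, prefix in variables.items()
    )
    return f"CREATE TABLE {table_name} (\n{columns}\n);"


def insert_statement(table_name, variables):
    """Build a parameterized INSERT covering every variable."""
    names = ", ".join(variables)
    placeholders = ", ".join("?" for _ in variables)
    return f"INSERT INTO {table_name} ({names}) VALUES ({placeholders});"


# Invented survey fragment; repeated variables should be listed only once.
survey = """
Q1. What is your age? [int_age]
Q2. Which brand do you prefer? [str_brand]
Q3. If int_age is over 30, why str_brand?
"""

variables = unique_variables(survey)
print(create_table("survey_42", variables))
print(insert_statement("survey_42", variables))
```

One regular expression produces the de-duplicated variable list, and everything else is generated from that list, which is where the saving over copying and pasting 400 times comes from.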
  • One of my favorite articles on the web about regular expressions is How Regexes Work [plover.com] by Mark-Jason Dominus. It's a great article if you're at the point where you already have some experience using regular expressions, but you want to gain some insight into how they do what they do. I found that after I read this article it was easier for me to come up with cleaner regexps more quickly.

    I haven't read the book being discussed. It probably covers the same stuff, but I found M-J D's article easy to read,

  • "Some people, when confronted with a problem, think I know, I'll use regular expressions. Now they have two problems."
  • Honest. I'd learned HTML and Dreamweaver 4 had a search and replace facility for using these weird hieroglyphics for specifying patterns. "Dreamweaver 4 Bible" referred to them as regular expressions, citing the 1st edition of Jeffrey Friedl's book and I found a copy in the local (Islington/London) library. I was fascinated by the book and read it day after day. Since Perl seemed to be THE regex language I soon developed a fascination with Perl through Larry Wall's "Programming Perl". My "web design" career
