Mastering Regular Expressions
Mastering Regular Expressions, 2nd edition | |
author | Jeffrey E. Friedl |
pages | 460 |
publisher | O'Reilly |
rating | 9.5 |
reviewer | Gianluca Insolvibile |
ISBN | 0596002890 |
summary | An in-depth guide to lead the apprentice to mastering regular expressions' wizardry |
My first suspicion, I admit, was that I was facing one of the countless "man page reprints" that you find these days. It was only after reading the book that I eventually understood: before then, I had had no idea of what regexes were really about.
What it's about
The book is logically divided into three parts. The first (Chapters 1, 2 and 3) introduces the reader to the basic concepts of regexes, building a common ground upon which the subsequent chapters are based. The introduction is clear and straightforward, and lets readers quickly grasp the key points of the regex business. This part is more or less a good summary, presenting information that can also be found in existing manual pages (albeit in a distilled form, which shows that the author has very clear ideas about the matter). If you already know something about regexes, you could skip this part entirely, although reading it is a nice occasion to brush up and overhaul your knowledge.
The second part (Chapters 4, 5 and 6) is the one that struck me most for the depth of the information provided and the richness of thought. Rather than throwing usage dictates for one regex flavour or another at the reader, the author explains in great detail the inner mechanisms that make regexes run, and how you can exploit that knowledge to write better expressions.
Chapter 4 presents the different families of regex processing engines (namely, DFA, traditional and POSIX NFA), whose internal behavior differs so greatly that writing a regex in the appropriate way can make a substantial difference in both efficacy and efficiency. If you thought you knew it all about greedy and lazy regex operators, possessive quantifiers, backreferences and lookaround, you'd better think again: I was pleasantly surprised to discover how ignorant I was (to be honest, I had never heard of lookaround operators before!).
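To make the operators named above concrete, here is a minimal Python sketch. It is not taken from the book: the sample strings are made up, and Python's re module is just one backtracking (NFA-style) engine among many. It contrasts a greedy quantifier with its lazy counterpart and shows a lookahead and a negative lookbehind.

```python
import re

text = '<b>bold</b> and <i>italic</i>'

# Greedy: .* grabs as much as possible, then backtracks to the last '>'.
greedy = re.search(r'<.*>', text).group()

# Lazy: .*? grabs as little as possible and stops at the first '>'.
lazy = re.search(r'<.*?>', text).group()

# Lookahead: match digits only when ' USD' follows, without consuming it.
prices = re.findall(r'\d+(?= USD)', '10 USD, 20 EUR, 30 USD')

# Negative lookbehind: match 'cat' only when it is not preceded by 'bob'.
cats = [m.start() for m in re.finditer(r'(?<!bob)cat', 'bobcat cat tomcat')]

print(greedy)   # the whole string
print(lazy)     # just the first tag
print(prices)
print(cats)
```

Lookaround assertions test the text around the current position but never become part of the match, which is why the prices come back without the currency suffix.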
Chapter 5 slows down a little bit to let the reader absorb the massive previous chapter. Some simple (but still tricky) examples are presented, showing how to apply the techniques explained up to this point. A couple of examples are perhaps too contrived (ever needed to match aligned groups of 5 digits in an unspaced stream of characters?), but it is instructive anyway to follow the reasoning behind the construction of a complex regex.
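For readers wondering what the aligned-digits problem looks like, here is a small Python sketch. The digit stream is invented for illustration, and the approach shown (chunk first, then filter) is a simplification rather than the single-regex solution the book builds up to.

```python
import re

# Hypothetical stream of five-digit groups with no separators:
# 12344 41000 44123 99944 44001
stream = '1234441000441239994444001'

# Naive search ignores alignment, so it also finds '44' straddling
# the boundary between two groups.
naive = re.findall(r'44\d{3}', stream)

# Aligned approach: consume the stream five digits at a time, then
# keep only the groups that start with '44'.
groups = re.findall(r'\d{5}', stream)
aligned = [g for g in groups if g.startswith('44')]

print(naive)
print(aligned)
```

The naive pattern reports groups that do not exist in the data, which is the trap the chapter's example is meant to expose.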
Chapter 6 focuses on efficiency, considering how backtracking and matching can drive your regex engine to exponential complexities. Optimization techniques are then presented, first by explaining the automatic optimizations performed by the most common regex engines and then by giving a practical list of hints that you can follow to be sure that your expression will run as fast as possible. Again, I was quite surprised to find out how small changes in a regex can make such a big difference to the engine (and give rise to noticeable performance penalties if ignored).
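A rough way to see the effect for yourself, assuming a backtracking engine such as Python's re module: the two patterns below accept the same strings, but the nested quantifier gives the engine many more ways to split a run of a's before it can give up. The patterns and the helper name time_failure are illustrative choices, not examples from the book, and the timings will vary by machine and engine version.

```python
import re
import time

nested = re.compile(r'(a+)+b')   # nested quantifiers: many ways to split the a's
flat = re.compile(r'a+b')        # same language, only one way to split

def time_failure(pattern, n):
    """Return (match result, seconds) for a run of n 'a's with no trailing 'b'."""
    subject = 'a' * n
    start = time.perf_counter()
    result = pattern.search(subject)
    return result, time.perf_counter() - start

if __name__ == '__main__':
    for n in (8, 12, 16):
        _, slow = time_failure(nested, n)
        _, fast = time_failure(flat, n)
        print(f'n={n}: nested {slow:.5f}s, flat {fast:.6f}s')
```

Keep n small when experimenting: on engines without protection against this pattern, each extra character can roughly double the time the nested version takes to fail.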
What I absolutely liked most was that the author explains exactly why a certain optimization works, based on the information given in Chapter 4 (and provided that you have been able to assimilate it in the first pass). Finally, a paragraph entitled "Unrolling the loop" really put me in a good mood, reminding me of the past times of "old school" asm programming.
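As a sketch of what unrolling looks like in practice, here are two Python patterns for a double-quoted string with backslash escapes. The first uses alternation inside a star; the second is the unrolled shape, normal* (special normal*)*. The sample strings are made up, and whether the unrolled form is actually faster depends on the engine.

```python
import re

# Alternation inside a star: the engine picks a branch at every character.
alternation = re.compile(r'"(?:\\.|[^"\\])*"')

# Unrolled form: normal* (special normal*)*
#   normal  = [^"\\]   any character that is not a quote or a backslash
#   special = \\.      a backslash followed by any character
unrolled = re.compile(r'"[^"\\]*(?:\\.[^"\\]*)*"')

samples = [
    r'say "hello" now',
    r'a "quoted \"inner\" string" here',
    r'no quotes at all',
    r'"unterminated \" string',
]

# Both patterns should agree on every sample.
results = [(alternation.search(s), unrolled.search(s)) for s in samples]
for s, (a, u) in zip(samples, results):
    print(s, '->', a.group() if a else None, '|', u.group() if u else None)
```

The unrolled pattern works because "normal" and "special" can never match at the same position, so the engine has only one way to walk through the string.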
The third part of the book devotes three chapters to PERL, Java and .NET, respectively. Each chapter goes through the syntax and features of regexes for each language: while the information provided on Java and (VB).NET is quite commonplace, in the case of PERL the author deals with aspects rarely covered elsewhere, like dynamic regexes, embedded-code constructs, regex-literal overloading and specific optimization techniques.
What's to like
In one word: insight. The author is definitely knowledgeable about regular expressions, and the whole book is filled with thoughtful suggestions and hints. Still, a friendly and straightforward writing style makes reading pleasant and seldom boring (well, you wanted details, didn't you?) while you learn internal regex mechanics rarely covered elsewhere.
A further nice point is the broad view offered to the reader, starting from regexes in general and focusing on specific flavours only in the final part of the book. The second edition also offers up-to-date information, covering the .NET framework and the latest versions of PERL (5.8) and Java (1.4).
What's to consider
Despite the book's reassuring conversational tone, dealing with such a specific topic with so many in-depth details might sometimes become boring, especially if you do not have a strong interest in getting the most out of regular expressions or in knowing how they internally work. If you are just an occasional regex user and dwell in manual pages, you can probably live without this book. Also, it is a pity that specific sections on Tcl, emacs and awk have disappeared in the second edition (maybe they were not as current as the .NET framework?) and that pcre (a C regex library) is barely mentioned.
The summary
Regular expressions are tied so strongly to the *nix culture that everyone who has been exposed to that culture has come to use them in a more or less conscious way. Still, most of the documentation around lingers on basic features and presents only the most common regex operators. Mastering Regular Expressions is the book to read if you want to go further and get serious about regexes: even if extreme optimization might not be a big concern today, understanding how regex engines work under the hood also helps greatly in writing small everyday expressions.
Table of Contents
Preface
Chapter 1. Introduction to Regular Expressions
Chapter 2. Extended Introductory Examples
Chapter 3. Overview of Regular Expression Features and Flavors
Chapter 4. The Mechanics of Expression Processing
Chapter 5. Practical Regex Techniques
Chapter 6. Crafting an Efficient Expression
Chapter 7. Perl
Chapter 8. Java
Chapter 9. .NET
You can purchase Mastering Regular Expressions, 2nd edition from bn.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.
i mastered regular expressions (Score:5, Funny)
Perl, Java, .NET.. oh my! (Score:4, Interesting)
Regexp's almost consistent across languages (Score:2)
Re:Regexp's almost consistent across languages (Score:2)
In basic regular expressions the metacharacters ?, +, {, |, (, and ) lose their special meaning; instead use the backslashed versions \?, \+, \{, \|, \(, and \).
Re:Regexp's almost consistent across languages (Score:4, Insightful)
Damn those Perl people and their innovations. Why can't they just be happy doing everything the familiar, crappy way? Why must they push the envelope to make things easier and better? I hate that.
PS. I hope you haven't seen this yet [perl.org]. It'll really boil your blood.
Re:Regexp's almost consistent across languages (Score:3, Insightful)
Can I just say that I really like Larry Wall? I mean, reading that document, I realize that he is sooo good for Perl culture. You won't hear "that's how it has always been done" from him. His focus is on how to build a better system, not politics, not grandstanding. I would be very happy to see this kind of openness and disarmingly reasonable attitude influence certain other people in the Perl community.
Of course, I could be extrapolatin
Re:tend to be the de-facto standard - dream on! (Score:3, Informative)
True, but things get tricky quickly -- plain-old Unix awk predates perl. But GNU-awk (gawk) does not, so it has some perl-style regexp features, like \w, which are missing from Unix awk.
Re:Perl, Java, .NET.. oh my! (Score:2)
Re:Perl, Java, .NET.. oh my! (Score:2)
Re:Perl, Java, .NET.. oh my! (Score:2, Informative)
Re:Perl, Java, .NET.. oh my! (Score:2)
The big problem with PHP and regexes is that the C-like syntax makes no concessions to the needs of regular expressions. I ported some regexes from Perl to PHP using preg a while back, and while the regexes themselves didn't change, the guff around them was a lot more opaque in PHP. I guess this is the price PHP users pay for a 'consistent' language: pity the syntax was designed for writing operating systems at quasi-assembler level, not applications...
Funny you should say that... (Score:3, Informative)
...about switching programming environments. Right now there's some discussion about regex-engine problems that follow you around as you switch environments.
Current versions of glibc (apparently) made some inefficient design choices in their regex engine. When other tools such as sed switched to using glibc's version, their performance dropped quite a bit, leading to a couple [debian.org] of bug reports [debian.org].
The interesting thing is, one of the messages in the bug report mentions thi
Don't go overboard (Score:3, Interesting)
Re:Don't go overboard (Score:5, Insightful)
Re:Don't go overboard (Score:5, Funny)
And I'm pissed that it's NOT in the second edition (at least it couldn't easily be found). I was trying to impress this chick at B&N the other day by showing her how I understood that longass expression and lo and behold, the back page where it's SUPPOSED to be is filled with a 3 line regex - not very impressive after you've made a huge deal about a full-page regex. Fortunately it all worked out since I had the original at home, and I was like "well, you'll just have to come over to MY place to check out the big regex".
Re:Don't go overboard (Score:3, Funny)
Regular Expressions (Score:2, Insightful)
Re:Regular Expressions (Score:3, Informative)
Re:Regular Expressions (Score:4, Informative)
It's Caldera's portable C++ regex lib.
Re:Regular Expressions (Score:3, Funny)
Don't! It's probably got a Unix kernel in it. Beware the lawyers.
Re:Regular Expressions (Score:2)
pcre (Score:2)
On most systems, use `man regcomp` to see how to use regcomp, regexec, regerror, and regfree.
Essentially, you first compile the regular expression into a binary format with regcomp(), then use regexec() to match it against a string. It's all a little awkward to use until you get used to it.
C++ Regular Expressions (Score:5, Informative)
Different than 1st Edition? (Score:3, Interesting)
Re:Different than 1st Edition? (Score:5, Informative)
I was going to read this (Score:5, Funny)
I just can't fathom this (Score:4, Funny)
Re:I just can't fathom this (Score:2)
But you owe me.
Re:I just can't fathom this (Score:2)
Re:I was going to read this (Score:3, Funny)
Re:I was going to read this (Score:2, Insightful)
Re:I was going to read this (Score:2)
OK, then please specify what version of Perl you are talking about. Version 6 regexps default to using the
Cheap prices on Half.com (Score:5, Informative)
that's the first edition (Score:4, Informative)
Mastering Regular Expressions [oreilly.com] is now in its second edition. Mr. Friedl has posted a nice writeup [oreillynet.com] about what's different in the second edition.
Re:Cheap prices on Half.com (Score:2, Informative)
At $15 compared to $30, I'm not going to cancel my order even if it is just 1st edition. The only parts I'll miss are the extra info on new Perl 5.8 features, and maybe the Unicode stuff. Guess I'll be reading perldoc.com for that.
The best place for buying technical books is... (Score:3, Informative)
Mastering Regular Expressions, 2nd Edition
Our Price: $24.50
Bookpool is consistently the cheapest place to buy technical books. And no, I am not affiliated with them in any way.
Obligatory crap regexp joke (Score:5, Funny)
Re:Obligatory crap regexp joke (Score:2)
Probably something more like:
Re:Obligatory crap regexp joke (Score:2)
Re:Obligatory crap regexp joke (Score:2)
Re:Obligatory crap regexp joke (Score:2)
What's new in this edition? (Score:2)
My only complaint about the book is that non-techies looked at the title when I was reading and said, "Aren't 'Hi there' and 'How are you?' regular expressions?"
Perl, not "PERL" (Score:5, Informative)
It's always surprised me when I see intelligent people write "PERL" when they refer to Larry Wall's programming language.
From the Perl FAQ, General Questions About Perl:
What's the difference between "perl" and "Perl"? One bit. Oh, you weren't talking ASCII? :-) Larry now uses ``Perl'' to signify the language proper and ``perl'' the implementation of it, i.e. the current interpreter. Hence Tom's quip that ``Nothing but perl can parse Perl.'' You may or may not choose to follow this usage. For example, parallelism means ``awk and perl'' and ``Python and Perl'' look ok, while ``awk and Perl'' and ``Python and perl'' do not. But never write ``PERL'', because perl isn't really an acronym, apocryphal folklore and post-facto expansions notwithstanding.
You can read the entire FAQ [perl.com] if you like.
Re:Perl, not "PERL" (Score:5, Informative)
Marjorie: Well, that certainly answered the question fully. I must admit I didn't expect you to go back as far as the beginning of the Universe.
Larry: I wanted a short name with positive connotations. (I would never name a language ``Scheme'' or ``Python'', for instance.) I actually looked at every three- and four-letter word in the dictionary and rejected them all. I briefly toyed with the idea of naming it after my wife, Gloria, but that promised to be confusing on the domestic front. Eventually I came up with the name ``pearl'', with the gloss Practical Extraction and Report Language. The ``a'' was still in the name when I made that one up. But I heard rumors of some obscure graphics language named ``pearl'', so I shortened it to ``perl''. (The ``a'' had already disappeared by the time I gave Perl its alternate gloss, Pathologically Eclectic Rubbish Lister.)
Another interesting tidbit is that the name ``perl'' wasn't capitalized at first. UNIX was still very much a lower-case-only OS at the time. In fact, I think you could call it an anti-upper-case OS. It's a bit like the folks who start posting on the Net and affect not to capitalize anything. Eventually, most of them come back to the point where they realize occasional capitalization is useful for efficient communication. In Perl's case, we realized about the time of Perl 4 that it was useful to distinguish between ``perl'' the program and ``Perl'' the language. If you find a first edition of the Camel Book, you'll see that the title was Programming perl, with a small ``p''. Nowadays, the title is Programming Perl.
Re:WHAT?!?! (Score:2)
There's your history lesson for the day, folks.
I concur (Score:5, Insightful)
This is no "Learn Regex in 21 Days" or "Regex for Dummies" book with lots of tips on page 400 about how the | is useful for finding Jones OR Smith. If you haven't gotten that down yet, this book's not for you.
As the reviewer says, this is a very worthwhile cover-to-cover read which will turn your empirical experiences with regex into a more structured understanding of the science and engineering of advanced regex. As a reference on my shelf, it sits comfortably next to Knuth's TAOCP and Foley & van Dam.
netLibrary (Score:5, Informative)
To those at universities, see if your school offers netLibrary-based books. It's easy to read and it's free.
Soviet Russia Regex (Score:5, Funny)
They can be hard (Score:5, Informative)
Re:They can be hard (Score:2)
Re:They can be hard (Score:4, Insightful)
printf("Comments in C are written like /* this */ although I prefer the // C++ style");
That's why we use parsers to write compilers and not regexps. I came back from Perl after a few months using it, being very disillusioned by its read-onlyness.
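The printf line above is the classic trap: comment markers inside a string literal are not comments. One common workaround, sketched here in Python under simplifying assumptions (no character literals, no preprocessor tricks, no line continuations), is to match string literals and comments in a single alternation and keep only the strings. The names TOKEN and strip_comments are made up for this sketch.

```python
import re

# One alternation, tried left to right at each position:
#   group 1: a double-quoted string literal (kept as-is)
#   otherwise: a /* block */ comment or a // line comment (dropped)
TOKEN = re.compile(
    r'("(?:\\.|[^"\\])*")'      # string literal with backslash escapes
    r'|/\*.*?\*/'               # block comment, lazy so it stops at the first */
    r'|//[^\n]*',               # line comment, up to the end of the line
    re.DOTALL,
)

def strip_comments(source):
    """Remove C-style comments while leaving string literals untouched."""
    return TOKEN.sub(lambda m: m.group(1) or '', source)

code = 'printf("like /* this */ or // that"); /* real comment */ x = 1; // tail'
stripped = strip_comments(code)
print(stripped)
```

Because the string-literal branch comes first and consumes the whole literal, the comment branches never get a chance to start inside it. This is still a lexer-level trick, not a substitute for a real parser.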
Re:They can be hard (Score:3, Informative)
Parsers are, however, based on regular expressions. I originally wrote this regular expression when I was writing a lexer (using JFlex [jflex.de]) for Java. The examples that I saw used a state machine and I wanted to do it with a regex. When combined with regular expression to find string literals (and all the regular expressions for other junk), it does the r
Re:They can be hard (Score:4, Informative)
Re:They can be hard (Score:2)
Re:They can be hard (Score:2)
It's difficult to know if your regex is really correct for the stuff you're parsing.
I mean, it might work for the 9 cases of input you have. But for the 10th case, bam! your regex doesn't parse the 10th input properly. And reading regexes is worse than reading assembly IMO, when you want to fix a bug in some regex 6 months after you've written it.
But if you know your regex is correct, you can reduce 100 lines down to a mere two lines of code. So it's beautiful o
Re:They can be hard (Score:2, Informative)
Re:They can be hard (Score:3, Informative)
Assuming you wanted to capture "/* hello */" out of "/* hello */ hello */"
You see what you are missing is the '?' modifier that will cause the "(.*\r?\n)*" to not be greedy. Same with the ".*".
I think you are just missing some of the functionality of regexes. You might want to pick up this book.
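Here is that difference as a minimal Python sketch, using the sample string from this comment. The pattern is simplified to a single-line comment, so the multi-line "(.*\r?\n)*" part is left out.

```python
import re

text = '/* hello */ hello */'

# Greedy: .* runs to the end of the line, then backtracks to the LAST '*/'.
greedy = re.search(r'/\*.*\*/', text).group()

# Lazy: .*? expands one character at a time and stops at the FIRST '*/'.
lazy = re.search(r'/\*.*?\*/', text).group()

print(greedy)
print(lazy)
```

The only change between the two patterns is the '?' after the star, which is exactly the modifier being described.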
Re:They can be hard (Score:2)
Hmmm, I just tried it in my text editor and it worked. Maybe it's more widely implemented than I thought. I could have sworn this didn't work last time I tried it there. Maybe it was added in a recent version. :-)
Re:They can be hard (Score:2)
Re:They can be hard (Score:2)
Re:They can be hard (Score:2)
Regex Learning Tool (Score:5, Informative)
Online resource (Score:4, Informative)
Re:Online resource (Score:2)
From Windows (Score:4, Insightful)
All i have to say is: (Score:5, Funny)
stupid filter wouldn't let me paste the regex here XD
Re:All i have to say is: (Score:2)
Re:All i have to say is: (Score:2)
Re:All i have to say is: (Score:2, Interesting)
REGEX for Brazilians (Score:2, Informative)
Here's my challenge... (Score:2)
Re:Here's my challenge... (Score:2)
Codebits used to have a lot on this, but the page has since moved [codebits.com] and seems to be having permissions errors at this moment.
If you want, you could always email me (phormix at phormix dot com) and I can attempt to help you with your regexp woes - I've used a lot of multiline perl regexps for HTML processors, etc.
Re:Here's my challenge... (Score:2)
Regex rant (Score:5, Insightful)
statement -> (command params)
statement -> (command)
params -> params params
params -> constant
params -> variable
params -> statement
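The grammar above is recursive (a statement can contain statements), which is the kind of nesting a classical regular expression cannot track. A minimal recursive-descent sketch in Python, assuming whitespace-separated tokens and treating every bare word as a command, constant or variable without distinguishing them; the function names are made up for illustration:

```python
import re

def parse_statement(tokens, pos=0):
    """statement -> '(' command params? ')'. Returns (tree, next position)."""
    if pos >= len(tokens) or tokens[pos] != '(':
        raise SyntaxError(f"expected '(' at token {pos}")
    pos += 1
    if pos >= len(tokens) or tokens[pos] in ('(', ')'):
        raise SyntaxError(f'expected a command at token {pos}')
    tree = [tokens[pos]]
    pos += 1
    # params -> params params | constant | variable | statement
    while pos < len(tokens) and tokens[pos] != ')':
        if tokens[pos] == '(':
            sub, pos = parse_statement(tokens, pos)   # nested statement
            tree.append(sub)
        else:
            tree.append(tokens[pos])                  # constant or variable
            pos += 1
    if pos >= len(tokens):
        raise SyntaxError("missing ')'")
    return tree, pos + 1

def parse(text):
    """Tokenize with a regex, then parse exactly one statement."""
    tokens = re.findall(r'[()]|[^\s()]+', text)
    tree, pos = parse_statement(tokens)
    if pos != len(tokens):
        raise SyntaxError('trailing tokens after statement')
    return tree

tree = parse('(add 1 (mul x 2) (neg (sub y 3)))')
print(tree)
```

Note the division of labour: the regex only splits the input into tokens, and the recursion in parse_statement handles the nesting.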
Newbie review (Score:3, Informative)
Even having not dealt with regexes pretty much at all, the book was very easy to get into. The first few chapters go through the basic matching structures, along with requisite history. All of the points are done with understandable real life examples, with diagrams and [a small amount] of actual code. The later chapters go through individual languages, covering which features are there, what the nuances are, and a few of the gotchas. I must admit that I probably learned more useful things about perl from this book than from any other source. There is also a large section [which I did not read, and cannot comment on] which actually details the nuts and guts of regexes.
All in all, it's easily the best instructional [as opposed to reference] text I've ever purchased.
in a nutshell (Score:2)
No, that would be O'Reilly's In a Nutshell series of books..
errata (Score:4, Informative)
And he's Qualified to review this book???? (Score:4, Funny)
(to be honest, I had never heard of lookaround operators before!).
Gezzzz, This guy hasn't even heard of lookaround operators before? What a clueless fool! He should be driven from /. after being tarred and feathered!
Everyone knows that a lookaround operator is that guy that goes into the bank first to make sure that there aren't any armed guards or policemen/women getting their paychecks deposited.
Interpreting parser (Score:3, Informative)
definitely a good read (Score:2)
Or without a book... (Score:3, Informative)
re-builder for Emacs (Score:3, Informative)
You actually liked this book? (Score:2, Informative)
Re:You actually liked this book? (Score:4, Interesting)
I loved the first edition, probably for the reasons you didn't. I'd read several short overviews of regexes, including Larry Wall's one in the Camel book, and, while they got me doing simple stuff, they left me with lots of unanswered questions, and the more I experimented the more my "why doesn't that work?" list grew. The Friedl book is totally thorough, and, I thought, aggressively pedagogical, if you want to learn about how a regex engine works rather than pick up stuff in a cookbook fashion.
That said, I do wonder about the guy. The colophon was astounding: he wrote half the book using regexes on a computer on the other side of the world, using a 37.5 bit/hour connection by the sound of it, and then he proceeded to write his own typesetting system so he could produce a phonetically alphabetical index in English, Japanese and probably some other languages that I missed. I think he ought to get out more...
There are no ".NET Framework" languages (Score:3, Informative)
There are no ".NET framework" languages. There are languages that target the Common Language Runtime, or the CLR. The .NET Framework is merely a class library like the JDK/JRE. If he doesn't even know that, why should I trust his book review?
My Version... (Score:5, Funny)
Repeat after me:
"I'm so hungry, I could eat a horse."
"It's been raining cats and dogs."
"I'll sleep with you when Hell freezes over."
And my personal favourite:
"Oh look, Hell just froze over!"
I've read the first edition and... (Score:5, Interesting)
The very fact that both vi and emacs support regular expressions must mean they are a best-in-breed tool, because if there was a way for those two communities to disagree, they would have done it.
I love the fact that I can use the same expressions with grep, sed, vim, Perl, and Java. That being said, however, the critics who warn that regexes can be overused are correct: regexes are difficult to debug and to maintain, so don't go overboard.
contrived examples? (Score:5, Interesting)
Yes, actually. Older FORTRAN codes (that have been slowly added to/modified over time) especially exhibit this kind of behavior thanks to formats that allow you to specify columns for output. The numbers actually run into each other on the line, and the only way to read the file is to know which column the data you want is in. I would never discount any regular expression example as contrived. Somewhere, someone has developed a program that uses that formatting in an input or output file, and someone else might need to be able to speak its language in an automated fashion.
what did one regular expression say to the other? (Score:3, Funny)
Sample Chapters (Score:3, Informative)
Re:My problem with regular expressions... (Score:5, Interesting)
I think a lot of the people who use RE's a lot would be well-served by brushing up on their recursive-descent parser writing skills. For only a little more time than it takes to write a regular expression, you can (if you know how) write a simple recursive-descent parser that:
Parsers and regex (Score:2, Interesting)
There is a middle way between overly complex REs which mere mortals can neither read nor safely modify, and overly complex parsers that never take advantage of anything more functional than getc().
Re:Why is it that people think regexps are hard? (Score:4, Insightful)
If you need a 500 page book on regexps, you might want to have a look at a good compiler book (red dragon, etc.) first.
And why would I want to learn about all the various automata (finite state machines, push-down automata, and Turing machines) not to mention all that language parsing crap (top-down versus bottom-up parsing, parse trees, etc, etc), when all I really want to learn is how to exploit a regular expression engine efficiently so I can solve real world problems?
Full non-CFG languages are so much more powerful than any regexp could ever dream of being, and more importantly they can have state.
Yeah, that's called a programming language. And yeah, I could implement any regular expression using a standard programming language, but why would I bother when a regular expression is far more concise and better suited to the job?
Geez, give someone a hammer...
Re:Why is it that people think regexps are hard? (Score:4, Funny)
Just like the bozo who just finished a Formal Computation course, yet mixed up the meanings of "+" and "*" ? ;-)
From man grep:
I hear they're serving humble pie at the school cafeteria today. ;-)
Re:Why is it that people think regexps are hard? (Score:2)
Re:Why is it that people think regexps are hard? (Score:2)
LOL! That compiler course is BASED on formal language theory, so I think you probably have the relationship a little confused there. In fact, a compiler course really only gives you a light dusting of the real theory behind formal languages (I would know, I've taken a course in both... I'm certainly no expert in either, but at least I have a little perspective). A class in formal languages not only discusses Chomsky's hierarchy and their associat
Re:Why is it that people think regexps are hard? (Score:2, Insightful)
Re:Why is it that people think regexps are hard? (Score:3, Insightful)
But then, I'm not a compiler god, just a network guy who happens to have to use the fscking things once in a while.
Post is from a troll template (see below) (Score:2, Informative)
To be honest, that this exact same post template has been moderated highly again and again in recent book reviews is becoming more humorous than anything. Unfortunately, and this is addressed to certain moderators, I believe it would be correct to say the laughing is 'at you' and your misfortune rather than 'with you'.
If you would like to confirm that you are being 'taken in', cli
Re:+4 Informative? He doesn't even have to own... (Score:2, Funny)