Mastering Regular Expressions

Posted by timothy on Tuesday June 24, 2003 @12:15PM from the bondage-and-domination dept.

gianluca writes "Having always been a heedful guy, I always duly did my homework, going through the lengthy manual pages of a number of regular expression (regex) crunching tools. You name it: be it PERL, awk, emacs, sed or even one of the .NET framework languages -- any such program provides support for the same regexes (or at least, so they seem to the occasional observer). After some years of regex practice with these tools, I had the pretentious conviction that I knew my way through the intricacies of patterns, grouping, greediness, and the like. When I first stepped into Mastering Regular Expressions, looking at the nearly 500 pages which make up Friedl's book, I wondered what someone could ever have to say about regexes to fill so many pages." Gianluca ended up finding plenty of worthwhile content; read below for his review.

Mastering Regular Expressions, 2nd edition
author	Jeffrey E. Friedl
pages	460
publisher	O'Reilly
rating	9.5
reviewer	Gianluca Insolvibile
ISBN	0596002890
summary	An in-depth guide to lead the apprentice to mastering regular expressions' wizardry

My first suspicion, I admit, was that I was facing one of the countless "man page reprints" that you find these days. It was only after reading the book that I eventually understood: before then, I had had no idea of what regexes were really about.

What it's about

The book is logically divided into three parts: the first one (Chapters 1, 2 and 3) introduces the reader to the basic concepts of regexes, building a common ground upon which the subsequent chapters will be based. The introduction is clear and straightforward, and lets the readers quickly grasp the key points in the regex business. This part is more or less a good summary, presenting information that can be found also in existing manual pages (albeit presented in a distilled form, which lets you perceive that the author has very clear ideas about the matter). If you already know something about regexes, you could skip this part entirely -- even if reading it turns out to be a nice occasion to brush up and overhaul your knowledge.

The second part (Chapters 4, 5 and 6) is the one that struck me most for the depth of the information provided and the richness of thought. Rather than throwing usage dictates for one regex flavour or another at the reader, the author explains in great detail the inner mechanisms that make regexes run and how you can exploit such knowledge to write better expressions.

Chapter 4 presents the different families of regex processing engines (namely, DFA, traditional and POSIX NFA), whose internal behavior differs so greatly that writing a regex in the appropriate way can make a substantial difference in both efficacy and efficiency. If you thought you knew it all about greedy and lazy regex operators, possessive quantifiers, backreferences and lookaround, you'd better think again: I was pleasantly surprised to discover how ignorant I was (to be honest, I had never heard of lookaround operators before!).
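
For readers who, like the reviewer, have not met these operators before, here is a minimal sketch of greedy versus lazy matching and of lookaround. It assumes Python's re module rather than any engine discussed in the book, and the sample strings are made up purely for illustration.

```python
import re

text = 'say "one" and "two"'

# Greedy: .* grabs as much as possible, then backs off to the last quote.
greedy = re.search(r'".*"', text).group()

# Lazy: .*? grabs as little as possible, stopping at the first closing quote.
lazy = re.search(r'".*?"', text).group()

# Lookahead: match digits only when followed by "px", without consuming "px".
widths = re.findall(r'\d+(?=px)', 'width: 120px; height: 80em; margin: 4px')

# Lookbehind: match digits only when preceded by a dollar sign.
prices = re.findall(r'(?<=\$)\d+', 'cost $15, list $30, 24 pages')

print(greedy)  # "one" and "two"
print(lazy)    # "one"
print(widths)  # ['120', '4']
print(prices)  # ['15', '30']
```

The lookaround assertions test what surrounds a position without becoming part of the match, which is why "px" and "$" do not appear in the results.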

Chapter 5 slows down a little bit to let the reader absorb the massive previous chapter. Some simple (but still tricky) examples are presented, showing how to apply the techniques explained up to this point. A couple of examples are perhaps too contrived (ever needed to match aligned groups of 5 digits in an unspaced stream of characters?), but it is instructive anyway to follow the reasoning behind the construction of a complex regex.

Chapter 6 focuses on efficiency, considering how backtracking and matching can drive your regex engine to exponential complexities. Optimization techniques are then presented, first by explaining the automatic optimizations performed by the most common regex engines and then by giving a practical list of hints that you can follow to be sure that your expression will run as fast as possible. Again, I was quite surprised to find out how small changes in a regex can make such a big difference to the engine (and give rise to noticeable performance penalties if ignored).

What I absolutely liked most was that the author explains exactly why a certain optimization works, based on the information given in Chapter 4 (and provided that you have been able to assimilate it in the first pass). Finally, a paragraph entitled "Unrolling the loop" really put me in a good mood, reminding me of the past times of "old school" asm programming.

The third part of the book devotes three chapters to PERL, Java and .NET, respectively. Each chapter goes through the syntax and features of regexes for each language: while the information provided on Java and (VB).NET is quite commonplace, in the case of PERL the author deals with aspects rarely covered elsewhere, like dynamic regexes, embedded-code constructs, regex-literal overloading and specific optimization techniques.

What's to like

In one word: insight. The author is definitely knowledgeable about regular expressions and the whole book is filled with thoughtful suggestions and hints. Still, a friendly and straightforward writing style makes reading pleasant and seldom boring (well, you wanted details, didn't you?) while you learn internal regex mechanics rarely available elsewhere.

A further nice point is the broad view offered to the reader, starting from regexes in general and focusing on specific flavours only in the final part of the book. The second edition also offers up-to-date information, covering the .NET framework and the latest versions of PERL (5.8) and Java (1.4).

What's to consider

Despite the book's reassuring conversational tone, dealing with such a specific topic in so much depth might sometimes become boring, especially if you do not have a strong interest in getting the most out of regular expressions or in knowing how they work internally. If you are just an occasional regex user and dwell in manual pages, you can probably live without this book. Also, it is a pity that specific sections on Tcl, emacs and awk have disappeared in the second edition (maybe they were not as current as the .NET framework?) and that pcre (a C regex library) is barely mentioned.

The summary

Regular expressions are tied so strongly to the *nix culture that everyone who has been exposed to that culture has come to use them in a more or less conscious way. Still, most of the documentation around lingers on basic features and presents only the most common regex operators. Mastering Regular Expressions is the book to read if you want to go further and get serious about regexes: even if extreme optimization might not be a big concern today, understanding how regex engines work under the hood also helps greatly in writing small everyday expressions.

Preface
Chapter 1. Introduction to Regular Expressions
Chapter 2. Extended Introductory Examples
Chapter 3. Overview of Regular Expression Features and Flavors
Chapter 4. The Mechanics of Expression Processing
Chapter 5. Practical Regex Techniques
Chapter 6. Crafting a Regular Expression
Chapter 7. Perl
Chapter 8. Java
Chapter 9. .NET

You can purchase Mastering Regular Expressions, 2nd edition from bn.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.

This discussion has been archived. No new comments can be posted.

i mastered regular expressions (Score:5, Funny)

by Anonymous Coward writes: on Tuesday June 24, 2003 @12:16PM (#6285266)

when figuring out the lameness filter

Perl, Java, .NET.. oh my! (Score:4, Interesting)

by Gortbusters.org ( 637314 ) writes: on Tuesday June 24, 2003 @12:19PM (#6285302) Homepage Journal

This sounds like a nifty tool for those who have to switch programming environments quite often. I always find myself going back to the books when I either have to write a regex myself or decipher someone else's crazy-looking expression.

- Regexp's almost consistent across languages (Score:2)
  
  by GGardner ( 97375 ) writes:
  
  But what drives me nuts about using regexps is how they differ slightly from implementation to implementation. Even though the perl regexp's tend to be the de-facto standard, the perl people are frequently adding stuff to their regexps. Some regexp implementations require you to escape open-paren to get the special meaning, and not escaped to match an open paren. Others require just the opposite. Madness!
  - Re:Regexp's almost consistent across languages (Score:2)
    
    by akeru ( 15942 ) writes:
    
    While it may not help in the confusion, what you're seeing with the escape-open-paren vs. not is the difference between Basic and Extended regular expressions in POSIX parlance. Or, to quote the GNU grep man page:
    In basic regular expressions the metacharacters ?, +, {, |, (, and ) lose their special meaning; instead use the backslashed versions \?, \+, \{, \|, \(, and \).
  - Re:Regexp's almost consistent across languages (Score:4, Insightful)
    
    by IpalindromeI ( 515070 ) writes: on Tuesday June 24, 2003 @02:57PM (#6287009) Journal
    
    Even though the perl regexp's tend to be the de-facto standard, the perl people are frequently adding stuff to their regexps.
    
    Damn those Perl people and their innovations. Why can't they just be happy doing everything the familiar, crappy way? Why must they push the envelope to make things easier and better? I hate that.
    
    PS. I hope you haven't seen this yet [perl.org]. It'll really boil your blood.
    
    - Re:Regexp's almost consistent across languages (Score:3, Insightful)
      
      by Anthony Boyd ( 242971 ) writes:
      
      I hope you haven't seen this yet [perl.org]. It'll really boil your blood.
      
      Can I just say that I really like Larry Wall? I mean, reading that document, I realize that he is sooo good for Perl culture. You won't hear "that's how it has always been done" from him. His focus is on how to build a better system, not politics, not grandstanding. I would be very happy to see this kind of openness and disarmingly reasonable attitude influence certain other people in the Perl community.
      Of course, I could be extrapolatin
  - - - Re:tend to be the de-facto standard - dream on! (Score:3, Informative)
        
        by GGardner ( 97375 ) writes:
        
        Perl-style regexps tend to be used on things that post-date perl.
        True, but things get tricky quickly -- plain-old Unix awk predates perl. But GNU-awk (gawk) does not, so it has some perl-style regexp features, like \w, which are missing from Unix awk.
- Re:Perl, Java, .NET.. oh my! (Score:2)
  
  by ErikZ ( 55491 ) writes:
  
  I'm surprised PHP isn't in there. I guess you can just use perl compatible regular expression functions.
  - Re:Perl, Java, .NET.. oh my! (Score:2)
    
    by mackstann ( 586043 ) writes:
    
    Why use PHP's ereg* functions when preg* functions are faster and more powerful?
  - Re:Perl, Java, .NET.. oh my! (Score:2, Informative)
    
    by sketerpot ( 454020 ) writes:
    The big part of regular expressions is learning how to read and write them well. After that, just find some documentation for your language of choice.
    
    PHP Perl-compatible [phpfreaks.com]
    Python re module [python.org]
    Python re howto [www.amk.ca]
  - Re:Perl, Java, .NET.. oh my! (Score:2)
    
    by melonman ( 608440 ) writes:
    
    The big problem with PHP and regexes is that the C-like syntax makes no concessions to the needs of regular expressions. I ported some regexes from Perl to PHP using preg a while back, and while the regexes themselves didn't change, the guff around them was a lot more opaque in PHP. I guess this is the price PHP users pay for a 'consistent' language: pity the syntax was designed for writing operating systems at quasi-assembler level, not applications...
- Funny you should say that... (Score:3, Informative)
  
  by devphil ( 51341 ) writes:
  
  ...about switching programming environments. Right now there's some discussion about problems in regex engines which follow you around as you switch environments, due to problems in the engines.
  Current versions of glibc (apparently) made some inefficient design choices in their regex engine. When other tools such as sed switched to using glibc's version, their performance dropped quite a bit, leading to a couple [debian.org] of bug reports [debian.org].
  The interesting thing is, one of the messages in the bug report mentions thi
Don't go overboard (Score:3, Interesting)

by apsmith ( 17989 ) * writes: on Tuesday June 24, 2003 @12:20PM (#6285311) Homepage

I read the first edition of this book - it was great, and completely changed the way I handled (and understood) perl regular expressions. It's tempting, after reading this book, to try to apply regex's to everything! Friedl had an example of a huge, horrible (but efficient) regex to parse mail headers in the first edition - my advice on that is, don't try that at home! Interspersing procedural logic with the regex's tends to make much cleaner and more readable code...

- Re:Don't go overboard (Score:5, Insightful)
  
  by sharlskdy ( 460886 ) writes: <scottman@t[ ]s.net ['elu' in gap]> on Tuesday June 24, 2003 @12:27PM (#6285395) Homepage
  
  When all you have is a hammer, everything looks like a nail. And, REGEX is one HUGE hammer!
  
- Re:Don't go overboard (Score:5, Funny)
  
  by tshak ( 173364 ) writes: on Tuesday June 24, 2003 @03:46PM (#6287584) Homepage
  
  Friedl had an example of a huge, horrible (but efficient) regex to parse mail headers in the first edition
  
  And I'm pissed that it's NOT in the second edition (at least it couldn't easily be found). I was trying to impress this chick at B&N the other day by showing her how I understood that longass expression and lo and behold, the back page where it's SUPPOSED to be is filled with a 3 line regex - not very impressive after you've made a huge deal about a full-page regex. Fortunately it all worked out since I had the original at home, and I was like "well, you'll just have to come over to MY place to check out the big regex". ;-)
  
- Re:Don't go overboard (Score:3, Funny)
  
  by kmellis ( 442405 ) writes:
  
  And I'm pissed that it's NOT in the second edition (at least it couldn't easily be found). I was trying to impress this chick at B&N the other day by showing her how I understood that longass expression and lo and behold, the back page where it's SUPPOSED to be is filled with a 3 line regex - not very impressive after you've made a huge deal about a full-page regex. Fortunately it all worked out since I had the original at home, and I was like "well, you'll just have to come over to MY place to check o
Regular Expressions (Score:2, Insightful)

by $calar ( 590356 ) writes:

I am so happy that this book is out. I love regular expressions (first saw them in Perl and JavaScript), and I considered buying the first edition from O'Reilly last year, but I thought that it would be best to wait and get the next edition (plus I had about 5 other O'Reilly titles to read at the time). I wish that there was better support for regular expressions in languages like C/C++. Does anyone know of a good library for it because there is no support for it in the language that I know of? Thanks.
- Re:Regular Expressions (Score:3, Informative)
  
  by qorkfiend ( 550713 ) writes:
  
  GNU regex [gnu.org]
- Re:Regular Expressions (Score:4, Informative)
  
  by rkz ( 667993 ) writes: on Tuesday June 24, 2003 @12:25PM (#6285368) Homepage Journal
  
  try this [caldera.com]
  
  It's Caldera's C++ portable regex lib.
  
  - Re:Regular Expressions (Score:3, Funny)
    
    by pi_rules ( 123171 ) writes:
    
    It's Caldera's C++ portable regex lib.
    
    Don't! It's probably got a Unix kernel in it. Beware the lawyers.
  - - Re:Regular Expressions (Score:2)
      
      by Marc2k ( 221814 ) writes:
      
      Really? Curt never took loose money from me before, he used to be such a nice boy..
- pcre (Score:2)
  
  by crow ( 16139 ) writes:
  
  The article mentions pcre (that's the Perl Compatible Regular Expressions library, a separate C library). The POSIX regex functions in the C library are another option.
  
  On most systems, use `man regcomp` to see how to use regcomp, regexec, regerror, and regfree.
  
  Essentially, you first compile the regular expression into a binary format with regcomp(), then use regexec() to match it against a string. It's all a little awkward to use until you get used to it.
- C++ Regular Expressions (Score:5, Informative)
  
  by TheOldBear ( 681288 ) writes: on Tuesday June 24, 2003 @12:48PM (#6285619)
  
  The Boost C++ libraries have a regular expression package. Take a look at http://www.boost.org/libs/regex/index.htm
  
Different than 1st Edition? (Score:3, Interesting)

by khef ( 681832 ) writes: on Tuesday June 24, 2003 @12:22PM (#6285332)

Can anyone that's read this describe what's changed from the first edition? Is it worth shelling out the cash if you already have the first one?

- Re:Different than 1st Edition? (Score:5, Informative)
  
  by sharlskdy ( 460886 ) writes: <scottman@t[ ]s.net ['elu' in gap]> on Tuesday June 24, 2003 @12:31PM (#6285443) Homepage
  
  You can read about the differences by clicking here [oreillynet.com], which is an article by the author outlining the differences.
  
I was going to read this (Score:5, Funny)

by L. VeGas ( 580015 ) writes: on Tuesday June 24, 2003 @12:25PM (#6285367) Homepage Journal

but instead I *

- I just can't fathom this (Score:4, Funny)
  
  by Anonymous Coward writes: on Tuesday June 24, 2003 @12:31PM (#6285444)
  
  Now, I thought I was reading a simple article about a programming book review. And here I come across this thread of epic mirth. Somehow you have single-handedly crafted a finely-tuned piece of fun-joy from what was a rather mundane topic. I just have to page my boss back to the office to see this! Gather round the water cooler old salts and let me spin a comedic yarn I saw this day on Slashdot. Using an asterisk to finish a sentence we would have all seen as being finished in a different manner? Well sir, someone set you up the bomb. You have taken that bomb, added the asterisk into the mix and exploded laugh-shrapnel into Slashdot proper. I couldn't even scroll down without getting struck in the eye with a piece of your fun-bomb. Mods, mod this man's excursion into the comedy arena as +5 StopItHurts. Here we sit, emotionally spent and basking in the aftermath of your comedic genius. Thank you kind sir, thank you.
  
  - Re:I just can't fathom this (Score:2)
    
    by L. VeGas ( 580015 ) writes:
    
    You're welcome.
    
    But you owe me.
  - - Re:I just can't fathom this (Score:2)
      
      by Requiem ( 12551 ) writes:
      
      Kleene star. Back to your algorithms class, whelp.
- Re:I was going to read this (Score:3, Funny)
  
  by nick_urbanik ( 534101 ) writes:
  
  but instead I *
  ...read spaces to the end of the line, or the next non-space character :-)
  - - Re:I was going to read this (Score:2, Insightful)
      
      by nick_urbanik ( 534101 ) writes:
      
      In Perl, no need to escape spaces. You just added the requirement that there must be at least one space. If you want to be pedantic, at least please be correct!
      - Re:I was going to read this (Score:2)
        
        by hackstraw ( 262471 ) * writes:
        
        In Perl, no need to escape spaces. You just added the requirement that there must be at least one space. If you want to be pedantic, at least please be correct!
        
        OK, then please specify what version of Perl you are talking about. Version 6 regexps default to using the /x option, so you would need to escape the whitespace.
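
The point about /x can be illustrated with a minimal sketch. It assumes Python's re module, whose re.VERBOSE flag behaves much like Perl's /x for this purpose; no Perl 6 behaviour is tested here, and the sample strings are hypothetical.

```python
import re

# Hypothetical inputs chosen only for illustration.
spaced = "a b"
joined = "ab"

# Default mode: a literal space in the pattern must match a space.
default_spaced = re.search(r"a b", spaced) is not None   # True
default_joined = re.search(r"a b", joined) is not None   # False

# Verbose mode (re.VERBOSE, similar to Perl's /x): unescaped whitespace in the
# pattern is ignored, so the pattern is effectively "ab".
verbose_spaced = re.search(r"a b", spaced, re.VERBOSE) is not None  # False
verbose_joined = re.search(r"a b", joined, re.VERBOSE) is not None  # True

# Escaping the space makes it significant again in verbose mode.
escaped_spaced = re.search(r"a\ b", spaced, re.VERBOSE) is not None  # True

print(default_spaced, default_joined, verbose_spaced, verbose_joined, escaped_spaced)
```

So whether a space needs escaping depends on whether the engine is in an extended-whitespace mode, which is the disagreement in the two comments above.
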
Cheap prices on Half.com (Score:5, Informative)

by cybermint ( 255744 ) writes: on Tuesday June 24, 2003 @12:26PM (#6285381)

I just purchased an almost new copy on Half.com for under $15 including shipping. There are still a few left at prices far lower than amazon.com or bn.com. Here is the half/ebay link [ebay.com].

- that's the first edition (Score:4, Informative)
  
  by SweetAndSourJesus ( 555410 ) writes: <JesusAndTheRobot ... m ['o.c' in gap]> on Tuesday June 24, 2003 @12:33PM (#6285454)
  
  Which isn't a big deal, I guess.
  
  Mastering Regular Expressions [oreilly.com] is now in its second edition. Mr. Friedl has posted a nice writeup [oreillynet.com] about what's different in the second edition.
  
- Re:Cheap prices on Half.com (Score:2, Informative)
  
  by cybermint ( 255744 ) writes:
  
  DOH! I didn't notice. I wish slashdot would let you edit posts.
  
  At $15 compared to $30, I'm not going to cancel my order even if it is just 1st edition. The only parts I'll miss are the extra info on new Perl 5.8 features, and maybe the Unicode stuff. Guess I'll be reading perldoc.com for that.
- The best place for buying technical books is... (Score:3, Informative)
  
  by Draxinusom ( 82930 ) writes:
  
  www.bookpool.com
  
  Mastering Regular Expressions, 2nd Edition
  Our Price: $24.50
  
  Bookpool is consistently the cheapest place to buy technical books. And no, I am not affiliated with them in any way.
Obligatory crap regexp joke (Score:5, Funny)

by BabyDave ( 575083 ) writes: on Tuesday June 24, 2003 @12:26PM (#6285385)

Regular expressions are tied so strongly to the *nix culture

Shouldn't that be .*nix instead?

- Re:Obligatory crap regexp joke (Score:2)
  
  by simetra ( 155655 ) writes:
  
  No
  Probably something more like:
  .\{1,5\}[nN]\{1\}[aeiouAEIOU]\{1\}[xX]
- Re:Obligatory crap regexp joke (Score:2)
  
  by GoRK ( 10018 ) * writes:
  
  how about /([a-z]?[a-z](ni|i|nu)x|[a-z]*bsd)/i
  - Re:Obligatory crap regexp joke (Score:2)
    
    by RevMike ( 632002 ) writes:
    
    So HP-UX and Minix are out?
- - - Re:Obligatory crap regexp joke (Score:2)
      
      by iabervon ( 1971 ) writes:
      
      I'll say. Nobody uses globbing for any serious work any more...
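
GoRK's pattern and RevMike's question above can be checked with a quick sketch. It assumes Python's re module, with Perl's /i written as re.IGNORECASE and an unanchored search as in Perl; the list of names is just an illustrative sample.

```python
import re

# GoRK's pattern, with Perl's /i flag expressed as re.IGNORECASE.
NIX = re.compile(r"([a-z]?[a-z](ni|i|nu)x|[a-z]*bsd)", re.IGNORECASE)

# Sample names chosen for illustration only.
names = ["Unix", "Linux", "Minix", "AIX", "FreeBSD", "HP-UX", "Solaris"]

# re.search mirrors Perl's unanchored match: the pattern may hit anywhere.
results = {name: bool(NIX.search(name)) for name in names}

for name, matched in results.items():
    print(f"{name:8} {'match' if matched else 'no match'}")
```

Under these assumptions Minix does match (M, i, ni, x), while HP-UX does not, since its "x" is not preceded by "ni", "i" or "nu".
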
What's new in this edition? (Score:2)

by kbeer ( 21963 ) writes:

I read the first edition and loved it. Can anyone who has read both editions say if it's worth buying the second edition?

My only complaint about the book is that non-techies looked at the title when I was reading and said, "Aren't 'Hi there' and 'How are you?' regular expressions?"
Perl, not "PERL" (Score:5, Informative)

by carl67lp ( 465321 ) writes: on Tuesday June 24, 2003 @12:28PM (#6285399) Journal

It's always surprised me when I see intelligent people write "PERL" when they refer to Larry Wall's programming language.

From the Perl FAQ, General Questions About Perl:

What's the difference between "perl" and "Perl"?
One bit. Oh, you weren't talking ASCII? :-) Larry now uses ``Perl'' to signify the language proper and ``perl'' the implementation of it, i.e. the current interpreter. Hence Tom's quip that ``Nothing but perl can parse Perl.'' You may or may not choose to follow this usage. For example, parallelism means ``awk and perl'' and ``Python and Perl'' look ok, while ``awk and Perl'' and ``Python and perl'' do not. But never write ``PERL'', because perl isn't really an acronym, apocryphal folklore and post-facto expansions notwithstanding.

You can read the entire FAQ [perl.com] if you like.

- Re:Perl, not "PERL" (Score:5, Informative)
  
  by br0ck ( 237309 ) writes: on Tuesday June 24, 2003 @01:26PM (#6286045)
  
  From an interesting interview with Larry Wall [linuxjournal.com] - 1999..
  
  Marjorie: Well, that certainly answered the question fully. I must admit I didn't expect you to go back as far as the beginning of the Universe. :-) How'd you come up with that name?
  
  Larry: I wanted a short name with positive connotations. (I would never name a language ``Scheme'' or ``Python'', for instance.) I actually looked at every three- and four-letter word in the dictionary and rejected them all. I briefly toyed with the idea of naming it after my wife, Gloria, but that promised to be confusing on the domestic front. Eventually I came up with the name ``pearl'', with the gloss Practical Extraction and Report Language. The ``a'' was still in the name when I made that one up. But I heard rumors of some obscure graphics language named ``pearl'', so I shortened it to ``perl''. (The ``a'' had already disappeared by the time I gave Perl its alternate gloss, Pathologically Eclectic Rubbish Lister.)
  
  Another interesting tidbit is that the name ``perl'' wasn't capitalized at first. UNIX was still very much a lower-case-only OS at the time. In fact, I think you could call it an anti-upper-case OS. It's a bit like the folks who start posting on the Net and affect not to capitalize anything. Eventually, most of them come back to the point where they realize occasional capitalization is useful for efficient communication. In Perl's case, we realized about the time of Perl 4 that it was useful to distinguish between ``perl'' the program and ``Perl'' the language. If you find a first edition of the Camel Book, you'll see that the title was Programming perl, with a small ``p''. Nowadays, the title is Programming Perl.
  
- - Re:WHAT?!?! (Score:2)
    
    by carl67lp ( 465321 ) writes:
    
    Nope. From what I've read, someone thought up that expansion and pegged Perl as an acronym.
    
    There's your history lesson for the day, folks.
I concur (Score:5, Insightful)

by Speare ( 84249 ) writes: on Tuesday June 24, 2003 @12:29PM (#6285413) Homepage Journal

I completely concur with the poster's prejudices and pleasant surprise at the scope of the book. Having learned and used regex since 1986, and having worked on the internals of a couple lightweight C regex engines, I figured I knew all I needed to know. Having seen how many people just get hung up on the basic concept and syntax of regex, I assumed this was going to be a rehash.
This is no "Learn Regex in 21 Days" or "Regex for Dummies" book with lots of tips on page 400 about how the | is useful for finding Jones OR Smith. If you haven't gotten that down yet, this book's not for you.
As the reviewer says, this is a very worthwhile cover-to-cover read which will turn your empirical experiences with regex into a more structured understanding of the science and engineering of advanced regex. As a reference on my shelf, it sits comfortably next to Knuth's TAoCP and Foley & van Dam.

netLibrary (Score:5, Informative)

by dboyles ( 65512 ) writes: on Tuesday June 24, 2003 @12:29PM (#6285415) Homepage

I first started reading this book via netLibrary [netlibrary.com] through my school's library. Just the first two chapters are enough to explain regular expressions to the point where one can use them effectively in programs. The remaining chapters expand on this information and discuss language specifics. I bought a paper copy to have on my shelf, and I constantly find myself referencing it.

To those at universities, see if your school offers netLibrary-based books. It's easy to read and it's free.

Soviet Russia Regex (Score:5, Funny)

by TheFlyingGoat ( 161967 ) writes: on Tuesday June 24, 2003 @12:37PM (#6285503) Homepage Journal

s/\A(.*?)\s+(.*)\Z/In soviet Russia, $2 $1s you!/i;
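
A rough Python rendering of the substitution above, as a sketch only: it assumes Python's re module, drops the /i flag (the pattern contains no letters, so it has no effect), and uses a made-up input phrase. Note that Python's \Z matches only at the very end of the string, whereas Perl's \Z also matches before a trailing newline.

```python
import re

def soviet(phrase: str) -> str:
    """Swap the first word with the rest, mirroring the Perl s/// above."""
    return re.sub(
        r"\A(.*?)\s+(.*)\Z",               # lazy first word, whitespace, the rest
        r"In soviet Russia, \2 \1s you!",  # \1 and \2 play the role of $1 and $2
        phrase,
    )

print(soviet("read book"))  # In soviet Russia, book reads you!
print(soviet("regex"))      # regex (no whitespace, so nothing is replaced)
```

The lazy (.*?) is what makes the first group stop at the first run of whitespace rather than the last.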

They can be hard (Score:5, Informative)

by DeadSea ( 69598 ) * writes: on Tuesday June 24, 2003 @12:37PM (#6285508) Homepage Journal

I know from my own experiences that writing a regular expression to describe something is not always as easy as it would seem at first glance. I found it difficult to write a regular expression to define a c-style comment: /* comment */ Well, not impossible, just more difficult than I thought it would be. I posted my thought process about how I constructed a regular expression to pick out a c-style comment [ostermiller.org] on my website. It's the kind of thing I like to ask interview candidates.
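
To give a feel for why this is trickier than it looks, here is a minimal sketch comparing three attempts on a made-up snippet. It assumes Python's re module; the third pattern is the longer one quoted further down this thread, copied as-is, and none of this is taken from the linked page itself.

```python
import re

# Made-up C snippet with two comments, one spanning a line break.
source = "int a; /* first */ int b; /* second\n   line */ int c;"

# Greedy: .* runs to the LAST "*/", swallowing the code in between.
greedy = re.findall(r"/\*.*\*/", source, re.DOTALL)

# Lazy: .*? stops at the first "*/" after each opener.
lazy = re.findall(r"/\*.*?\*/", source, re.DOTALL)

# No lazy quantifier: the body is spelled out as "non-stars, or stars that are
# not followed by a slash". finditer is used because the pattern has groups.
EXPLICIT = re.compile(r"/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/")
explicit = [m.group(0) for m in EXPLICIT.finditer(source)]

print(greedy)    # one oversized match
print(lazy)      # two separate comments
print(explicit)  # same two comments, without needing .*?
```

None of the three knows about string literals, so a comment-like sequence inside a quoted string would still be matched.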

- Re:They can be hard (Score:2)
  
  by jandrese ( 485 ) * writes:
  
  Your perl example is far too complicated. Why not just say something like: m#(/\*.*?\*/)#s; to grab the comment?
- Re:They can be hard (Score:4, Insightful)
  
  by dargaud ( 518470 ) * writes: <slashdot2 AT gdargaud DOT net> on Tuesday June 24, 2003 @01:04PM (#6285771) Homepage
  
  Not to nitpick too much, but I think your regexp finds the following when it's actually not a comment:
  printf("Comments in C are written like /* this */ although I prefer the // C++ style");
  That's why we use parsers to write compilers and not regexps. I came back from Perl after a few months of using it, being very disillusioned by its read-onlyness.
  
  - Re:They can be hard (Score:3, Informative)
    
    by DeadSea ( 69598 ) * writes:
    
    You make an excellent point. The regular expression I came up with would not do the right thing in that situation when finding comments in your text editor.
    Parsers are, however, based on regular expressions. I originally wrote this regular expression when I was writing a lexer (using JFlex [jflex.de]) for Java. The examples that I saw used a state machine and I wanted to do it with a regex. When combined with a regular expression to find string literals (and all the regular expressions for other junk), it does the r
- Re:They can be hard (Score:4, Informative)
  
  by Otter ( 3800 ) writes: on Tuesday June 24, 2003 @01:05PM (#6285779) Journal
  
  It's probably worth mentioning: KDE comes with a GUI regexp constructor [blackie.dk]. Googling for alternatives shows a similar Windows app [riesterer.free.fr].
  
- Re:They can be hard (Score:2)
  
  by stefanb ( 21140 ) * writes:
  
  Looks like the book is for you :-)
  
  /\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/
  
  Using a non-greedy quantifier, this could be as easy as /\*.*?\*/, if you can make your implementation have . match newlines.
- Re:They can be hard (Score:2)
  
  by Kashif Shaikh ( 575991 ) writes:
  
  And your website proved a point of mine:
  
  It's difficult to know if your regex is really correct for the stuff you're parsing.
  
  I mean, it might work for the 9 cases of input you have. But for the 10th case, bam! your regex doesn't parse the 10th input properly. And reading regexs is worse than reading assembly IMO, when you want to fix a bug in some regex 6 months after you've written it.
  
  But if you know your regex is correct, you can reduce 100 lines down to a mere two lines of code. So it's beautiful o
  - Re:They can be hard (Score:2, Informative)
    
    by Mr. Droopy Drawers ( 215436 ) writes:
    
    Technically, all regexes are lexers, not parsers. Parsers must be able to be recursive.
- - - Re:They can be hard (Score:3, Informative)
      
      by FroMan ( 111520 ) writes:
      
      Nope, it wouldn't. Give it a try. I don't have access to a unix box here right now. But at least the little java app I put together works correctly.
      
      Assuming you wanted to capture "/* hello */" out of "/* hello */ hello */"
      
      You see what you are missing is the '?' modifier that will cause the "(.*\r?\n)*" to not be greedy. Same with the ".*".
      
      I think you are just missing some of the functionality of regexes. You might want to pick up this book. ;-)
      - Re:They can be hard (Score:2)
        
        by DeadSea ( 69598 ) * writes:
        
        Ah yes, if you have that feature of regular expressions, that is true. However, non-greedy matching is a regex feature that I found is not reliably implemented everywhere that I need it.
        Hmmm, I just tried it in my text editor and it worked. Maybe it's more widely implemented than I thought. I could have sworn this didn't work last time I tried it there. Maybe it was added in a recent version. :-)
        
        Re:They can be hard (Score:2)
        
        by DeadSea ( 69598 ) * writes:
        
        I was just looking over the documentation for JFlex [jflex.de] and non-greedy regular expression matching isn't mentioned. I'll bet that means that it isn't implemented, but I haven't tested it. Since I write parsers using JFlex, it's a good thing I can come up with the non-greedy syntax when I need to.
- - Re:They can be hard (Score:2)
    
    by DeadSea ( 69598 ) * writes:
    
    You are very welcome. I used regular expressions to build the parser, so you aren't too far offtopic. ;-)
- - Re:They can be hard (Score:2)
    
    by DeadSea ( 69598 ) * writes:
    
    In some regular expression packages (not all) [\x00-\x7f] can be written as [^]. That is, all the characters that are not in the empty set of characters. Very nice shorthand. I also like that [A]|[^A] does the same thing.
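As a quick check of the "not all" caveat, here is how these three spellings behave in one particular package, Python's `re`. Other engines may differ, so this is a sketch of one implementation and not a general rule.

```python
import re

ascii_class = re.compile(r"[\x00-\x7f]")
either_way = re.compile(r"[A]|[^A]")

ascii_chars = [chr(i) for i in range(128)]

# Both patterns accept every ASCII character, newline included.
all_ascii = all(ascii_class.fullmatch(c) for c in ascii_chars)
all_either = all(either_way.fullmatch(c) for c in ascii_chars)

# [A]|[^A] also accepts non-ASCII characters; the explicit range does not.
beyond_ascii = (either_way.fullmatch("é") is not None
                and ascii_class.fullmatch("é") is None)

# Python's re is one of the packages that rejects the [^] shorthand.
try:
    re.compile(r"[^]")
    empty_negation_ok = True
except re.error:
    empty_negation_ok = False

print(all_ascii, all_either, beyond_ascii, empty_negation_ok)
```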
Regex Learning Tool (Score:5, Informative)

by johndiii ( 229824 ) * writes: on Tuesday June 24, 2003 @12:39PM (#6285522) Journal

Regex Coach [weitz.de] is a great free tool for learning about regular expressions and constructing them interactively. Both Linux and Windows versions are available.

Online resource (Score:4, Informative)

by dema ( 103780 ) writes: on Tuesday June 24, 2003 @12:42PM (#6285548) Homepage

I'd be interested to check that book out as I use reg expressions a lot in PHP. But for those of you looking for a resource online check out RegExLib [regexlib.com]. I use it often when I'm having trouble putting an expression together and have found it extremely helpful.

- - Re:Online resource (Score:2)
    
    by tarquin_fim_bim ( 649994 ) writes:
    
    I'd guess that as you can use both POSIX and Perl type regex style in PHP, this would be unnecessary duplication.
From Windows (Score:4, Insightful)

by Quill_28 ( 553921 ) writes: on Tuesday June 24, 2003 @12:42PM (#6285549) Journal

Going from Windows to Unix, one of the things I liked most about Unix was the widespread usage of regex in various applications. Quite powerful.

All i have to say is: (Score:5, Funny)

by jdew ( 644405 ) writes: on Tuesday June 24, 2003 @12:46PM (#6285594)

That's a big regex [ex-parrot.com]
stupid filter wouldn't let me paste the regex here XD

- Re:All i have to say is: (Score:2)
  
  by Suppafly ( 179830 ) writes:
  
  whoa.. the mail checking regexes I write generally look for an @ and the presence of a . followed by 3 letters.. I don't think I'd want to try and recreate that one..
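A check of the kind described might look like the sketch below (Python; the exact pattern and the name `looks_like_email` are guesses at what is meant, not anyone's actual code). It also shows the obvious cost of the shortcut: anything whose last label is not exactly three letters is turned away.

```python
import re

# Something, an @, something, then a dot followed by exactly 3 letters.
NAIVE_EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[A-Za-z]{3}$")


def looks_like_email(addr):
    """Rough plausibility check only; nowhere near a full address grammar."""
    return NAIVE_EMAIL.match(addr) is not None


for addr in ["user@example.com", "user@example.info", "user@example.uk"]:
    print(addr, looks_like_email(addr))
```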
  - - Re:All i have to say is: (Score:2)
      
      by elemental23 ( 322479 ) writes:
      
      Well I, for one, have never seen mail from a *.info address that wasn't spam, so I say throw them all in the bit bucket.
- Re:All i have to say is: (Score:2, Interesting)
  
  by Bedrock ( 660829 ) writes:
  
  Another fun one is the REX shallow XML parser algorithm that's been around for some time. Check out http://www.cs.sfu.ca/~cameron/REX.html and scroll to appendix A for a Perl implementation. I recently had to reverse-engineer this approach and write a stack-based parser to run in an environment where Perl's :?$foo construct was broken. Much fun...
REGEX for Brazilians (Score:2, Informative)

by maizena ( 640458 ) writes:

Regex rules, but I wouldn't know anything if it wasn't for this book in Portuguese: http://guia-er.sourceforge.net/ [sourceforge.net]. The printed version is always with me wherever I go.
Here's my challenge... (Score:2)

by Equuleus42 ( 723 ) writes:

Does anyone here know how to do multi-line regexes in perl? I've seen the notation on how to do it (Mastering Regular Expressions has one paragraph for it), but nothing seems to work...
- Re:Here's my challenge... (Score:2)
  
  by phorm ( 591458 ) writes:
  
  I usually use s/old/new/gs or m/expr/gs in perl
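For readers without Perl at hand, the effect of the /s flag can be sketched in Python, where `re.DOTALL` plays the same role (the dot also matches newlines) and `re.MULTILINE` corresponds to Perl's /m (`^` and `$` match at every line). The HTML snippet is made up, and the sketch assumes the whole text sits in one string rather than being read line by line.

```python
import re

html = "<p>first\nline</p>\n<p>second</p>"

# Without DOTALL the dot stops at a newline, so the first paragraph is missed.
without_s = re.findall(r"<p>(.*?)</p>", html)

# With DOTALL (Perl's /s) the match may span lines.
with_s = re.findall(r"<p>(.*?)</p>", html, re.DOTALL)

# MULTILINE (Perl's /m) changes only the anchors: ^ matches at each line start.
line_starts = re.findall(r"^<p>", html, re.MULTILINE)

print(without_s)
print(with_s)
print(len(line_starts))
```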
  
  Codebits used to have a lot on this, but the page has since moved [codebits.com] and seems to be having permissions errors at this moment.
  
  If you want, you could always email me (phormix at phormix dot com) and I can attempt to help you with your regexp woes - I've used a lot of multiline perl regexps for HTML processors, etc.
- - Re:Here's my challenge... (Score:2)
    
    by Equuleus42 ( 723 ) writes:
    
    Um, RTFM?
    
    I have read the perlre pod file and Mastering Regular Expressions for sections pertaining to multi-line regular expressions -- and none of the approaches given work for the files I work with. Programming Perl doesn't touch the topic. While the logical answer would be to switch files, sadly that is not an option for me.
Regex rant (Score:5, Insightful)

by Tablizer ( 95088 ) writes: on Tuesday June 24, 2003 @12:52PM (#6285664) Journal

The problem with regexes is that if you don't use them often, you forget a lot of the finer details. They are not self-documenting at all. I think something like "generators" used in some of the compiler tools floating around are more intuitive. For example, you can define a "LISP-lite" language like this:

statement -> (command params)
statement -> (command)
params -> params params
params -> constant
params -> variable
params -> statement
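To make the grammar concrete, here is a hand-written sketch of a parser for it in Python. Several things are assumptions that the rules above do not spell out: a command is any bare word, a constant is an integer, a variable is any other bare word, and the ambiguous rule "params -> params params" is treated as "one or more parameters in a row". All function names are invented for the example.

```python
import re

# A token is a parenthesis or a run of non-space, non-parenthesis characters.
TOKEN = re.compile(r"\s*([()]|[^\s()]+)")


def tokenize(src):
    src = src.rstrip()
    tokens, pos = [], 0
    while pos < len(src):
        m = TOKEN.match(src, pos)
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


def parse_atom(tok):
    # Assumption: integers are constants, everything else is a variable name.
    return int(tok) if re.fullmatch(r"-?\d+", tok) else tok


def parse_statement(tokens, i=0):
    """statement -> (command params) | (command). Returns (tree, next index)."""
    if i >= len(tokens) or tokens[i] != "(":
        raise SyntaxError(f"expected '(' at token {i}")
    i += 1
    if i >= len(tokens) or tokens[i] in ("(", ")"):
        raise SyntaxError(f"expected a command name at token {i}")
    node = [tokens[i]]
    i += 1
    # params: one or more of constant, variable, or nested statement.
    while i < len(tokens) and tokens[i] != ")":
        if tokens[i] == "(":
            child, i = parse_statement(tokens, i)
        else:
            child, i = parse_atom(tokens[i]), i + 1
        node.append(child)
    if i >= len(tokens):
        raise SyntaxError("missing ')' before end of input")
    return node, i + 1


def parse(src):
    tokens = tokenize(src)
    tree, i = parse_statement(tokens)
    if i != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[i]!r} after statement")
    return tree


print(parse("(add 1 (mul x 2))"))
```

Each grammar rule maps to one function or one branch, which is where the readability argument comes from.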

Newbie review (Score:3, Informative)

by Telastyn ( 206146 ) writes: on Tuesday June 24, 2003 @12:55PM (#6285702)

I also have this book [actually right next to me]. I'd put off learning Perl [and indirectly regexes] for some time, because... well, I was a Windows admin by trade. Now that I do other [actual] work, time came to pick up on some other tools.

Even having not dealt with regexes pretty much at all, the book was very easy to get into. The first few chapters go through the basic matching structures, along with requisite history. All of the points are made with understandable real-life examples, with diagrams and [a small amount of] actual code. The later chapters go through individual languages, covering which features are there, what the nuances are, and a few of the gotchas. I must admit that I probably learned more useful things about Perl from this book than from any other source. There is also a large section [which I did not read, and cannot comment on] which actually details the nuts and guts of regexes.

All in all, it's easily the best instructional [as opposed to reference] text I've ever purchased.

in a nutshell (Score:2)

by Suppafly ( 179830 ) writes:

My first suspicion, I admit, was that I was facing one of the countless "man page reprints" that you find these days.

No, that would be O'Reilly's in a nutshell series of books..
errata (Score:4, Informative)

by Anonymous Coward writes: on Tuesday June 24, 2003 @01:09PM (#6285839)

The reviewer forgot to mention the wonderful errata list of the book! Can be found here [regex.info].

And he's Qualified to review this book???? (Score:4, Funny)

by CSG_SurferDude ( 96615 ) writes: <`wedaa' `at' `wedaa.com'> on Tuesday June 24, 2003 @01:11PM (#6285859) Homepage Journal

(to be honest, I had never heard of lookaround operators before!).
Gezzzz, This guy hasn't even heard of lookaround operators before? What a clueless fool! He should be driven from /. after being tarred and feathered!
Everyone knows that a lookaround operator is that guy that goes into the bank first to make sure that there aren't any armed guards or policemen/women getting their paychecks deposited.
/me runs and hides now! ;-)

Interpreting parser (Score:3, Informative)

by Frans Faase ( 648933 ) writes: on Tuesday June 24, 2003 @01:21PM (#6285976) Homepage

If you want to have something more powerful than regexps, and still have it as an interpreter, you might have a look at an interpreting parser that I wrote: IParse [planet.nl].

definitely a good read (Score:2)

by cheesyfru ( 99893 ) writes:

I never really thought you could fill a book about regular expressions, but this one manages to accomplish this while at the same time being very interesting. This is absolutely required reading if you know "enough to get by" with regular expressions. Chances are, until you read this, you're making a ton of common mistakes and you don't even know about it.
Or without a book... (Score:3, Informative)

by Iscariot_ ( 166362 ) writes: on Tuesday June 24, 2003 @01:31PM (#6286096)

For those who don't want to buy a book, here's a nice page with pre-built regexps for doing all sorts of things: RegexLib [regexlib.com].

re-builder for Emacs (Score:3, Informative)

by David Ishee ( 6015 ) writes: on Tuesday June 24, 2003 @01:41PM (#6286211) Homepage

The re-builder mode is great for debugging regexps in Emacs. This is the latest version as far as I can tell: re-builder 1.2 [google.com]

You actually liked this book? (Score:2, Informative)

by Forgery ( 613737 ) writes:

I have a previous version of Friedl's book and found it needlessly confusing. The author's examples often leave much to be desired. I have no doubt that all of the information about regex is somewhere in the book, but it takes an extraordinary amount of work on the reader's part to extract it.
- Re:You actually liked this book? (Score:4, Interesting)
  
  by melonman ( 608440 ) writes: on Tuesday June 24, 2003 @03:15PM (#6287254) Journal
  
  I loved the first edition, probably for the reasons you didn't. I'd read several short overviews of regexes, including Larry Wall's one in the Camel book, and, while they got me doing simple stuff, they left me with lots of unanswered questions, and the more I experimented the more my "why doesn't that work?" list grew. The Friedl book is totally thorough, and, I thought, aggressively pedagogical, if you want to learn about how a regex engine works rather than pick up stuff in a cookbook fashion.
  
  That said, I do wonder about the guy. The colophon was astounding: he wrote half the book using regexes on a computer on the other side of the world, using a 37.5 bit/hour connection by the sound of it, and then he proceeded to write his own typesetting system so he could produce a phonetically alphabetical index in English, Japanese and probably some other languages that I missed. I think he ought to get out more...
  
There are no ".NET Framework" languages (Score:3, Informative)

by ClubStew ( 113954 ) writes: on Tuesday June 24, 2003 @02:09PM (#6286499) Homepage

...or even one of the .NET framework languages
There are no ".NET framework" languages. There are languages that target the Common Language Runtime, or the CLR. The .NET Framework is merely a class library like the JDK/JRE. If he doesn't even know that, why should I trust his book review?

My Version... (Score:5, Funny)

by BinaryCodedDecimal ( 646968 ) writes: on Tuesday June 24, 2003 @03:03PM (#6287084)

Mastering Regular Expressions:

Repeat after me:

"I'm so hungry, I could eat a horse."

"It's been raining cats and dogs."

"I'll sleep with you when Hell freezes over."

And my personal favourite:

"Oh look, Hell just froze over!"

I've read the first edition and... (Score:5, Interesting)

by RevMike ( 632002 ) writes: <[revMike] [at] [gmail.com]> on Tuesday June 24, 2003 @03:14PM (#6287235) Journal

I have to agree that this is a book that should be on everyone's shelf.
The very fact that both vi and emacs support regular expressions must mean they are a best-in-breed tool, because if there was a way for those two communities to disagree, they would have done it.
I love the fact that I can use the same expressions with grep, sed, vim, Perl, and Java. That being said, however, the critics who warn that regexes can be overused are correct: regexes are difficult to debug and to maintain, so don't go overboard.

contrived examples? (Score:5, Interesting)

by anonymous loser ( 58627 ) writes: on Tuesday June 24, 2003 @03:25PM (#6287368)

(ever needed to match aligned groups of 5 digits in an unspaced stream of characters?)

Yes, actually. Older FORTRAN codes (that have been slowly added to/modified over time) especially exhibit this kind of behavior thanks to formats that allow you to specify columns for output. The numbers actually run into each other on the line, and the only way to read the file is to know which column the data you want is in. I would never discount any regular expression example as contrived. Somewhere, someone has developed a program that uses that formatting in an input or output file, and someone else might need to be able to speak its language in an automated fashion.
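A small sketch of that situation in Python, with made-up data: three 5-digit fields written with no separators. The point is that a plain search can land in the middle of a column, while a pattern that first skips whole 5-digit groups stays aligned. The variable names and the "field starting with 44" target are invented for the illustration.

```python
import re

stream = "123445544144123"  # made-up data: 12344 | 55441 | 44123

# Cutting the line back into its 5-character columns.
columns = re.findall(r"\d{5}", stream)

# A plain search for "44 plus three digits" straddles two columns.
unaligned = re.search(r"44\d{3}", stream).group(0)

# Skipping whole groups of five first keeps the match on a column boundary.
aligned = re.match(r"(?:\d{5})*?(44\d{3})", stream).group(1)

print(columns)
print(unaligned)
print(aligned)
```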

what did one regular expression say to the other? (Score:3, Funny)

by jdew ( 644405 ) writes: on Tuesday June 24, 2003 @03:33PM (#6287451)

what did one regular expression say to the other?
.*

Sample Chapters (Score:3, Informative)

by darkpurpleblob ( 180550 ) writes: on Tuesday June 24, 2003 @04:11PM (#6287893)

Sample chapters from the book, on Java and .NET, are available in PDF format from the book page on O'Reilly's site [oreilly.com].

- Re:My problem with regular expressions... (Score:5, Interesting)
  
  by Gabe Garza ( 535203 ) writes: on Tuesday June 24, 2003 @01:06PM (#6285794)
  Amen!
  I think a lot of the people who use REs a lot would be well-served by brushing up on their recursive-descent parser writing skills. For only a little more time than it takes to write a regular expression, you can (if you know how) write a simple recursive-descent parser that:
  
  Is more readable (and thus maintainable)
  
  Is more efficient
  
  Has the potential to have much better error handling (e.g., a descriptive message instead of just "RE doesn't match! Ack!")
  
  Is much more scalable: recursive descent parsers can easily scale up to parsing an entire language (witness g++, which uses one to parse C++)
  
  Is likely to be a great deal more correct, because it forces you to actually define a language, instead of just iteratively building up an RE
  - Parsers and regex (Score:2, Interesting)
    
    by Beltway Prophet ( 453247 ) writes:
    
    And recursion is lots of fun, but I use REs to recognize and extract tokens and boundaries, because it's so easy to write and change simple REs.
    
    There is a middle way between overly complex REs which mere mortals can neither read nor safely modify, and overly complex parsers that never take advantage of anything more functional than getc().
- Re:Why is it that people think regexps are hard? (Score:4, Insightful)
  
  by Abcd1234 ( 188840 ) writes: on Tuesday June 24, 2003 @01:27PM (#6286053) Homepage
  
  Someone just took a course on formal languages...
  
  If you need a 500 page book on regexps, you might want to have a look at a good compiler book (red dragon, etc.) first.
  
  And why would I want to learn about all the various automata (finite state machines, push-down automata, and Turing machines) not to mention all that language parsing crap (top-down versus bottom-up parsing, parse trees, etc, etc), when all I really want to learn is how to exploit a regular expression engine efficiently so I can solve real world problems?
  
  Full non-CFG languages are so much more powerful than any regexp could ever dream of being, and more importantly they can have state.
  
  Yeah, that's called a programming language. And yeah, I could implement any regular expression using a standard programming language, but why would I bother when a regular expression is far more concise and better suited to the job?
  
  Geez, give someone a hammer...
  
- Re:Why is it that people think regexps are hard? (Score:4, Funny)
  
  by muonzoo ( 106581 ) writes: on Tuesday June 24, 2003 @02:00PM (#6286402)
  
  SkewlD00d writes:
  
  Why is it that people think regexps are hard
  
  All you have are zero-or-more "+", one-or-more "*", conditional "? or sometimes ...
  
  ...these bozos that think "regexp" sounds cool...
  
  Just like the bozo who just finished a Formal Computation course, yet mixed up the meanings of "+" and "*" ? ;-)
  
  From man grep:
  
  A regular expression may be followed by one of several repetition operators:
  ? The preceding item is optional and matched at most once.
  * The preceding item will be matched zero or more times.
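The three operators are easy to tell apart with a two-character pattern. The sketch below uses Python's `re`, which agrees with grep on these operators; the helper `matches` is made up for the example.

```python
import re


def matches(pattern, text):
    """True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None


# ?  zero or one "b"
print(matches(r"ab?", "a"), matches(r"ab?", "ab"), matches(r"ab?", "abb"))

# *  zero or more "b"
print(matches(r"ab*", "a"), matches(r"ab*", "abbb"))

# +  one or more "b"
print(matches(r"ab+", "a"), matches(r"ab+", "ab"))
```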
  
  I hear they're serving humble pie at the school cafeteria today. ;-)
  
  - Re:Why is it that people think regexps are hard? (Score:2)
    
    by SkewlD00d ( 314017 ) writes:
    
    That's what revision 0.1 is supposed to fix. Geez, gimme a break. You think i proof-read anything I post? Nawh. I didnt need a formal lang course though, no... I took a compiler course, the superset of all that shit. " Implement a C compiler in hardware." We talked about it. Though C is not a CFG. ;) Humble pie? You mean, I have foot-in-mouth disease? I always have known that. It's funny getting modded down to 0 when I know what I'm talking about but transpose a few chars because I have lysdexia. M
    - Re:Why is it that people think regexps are hard? (Score:2)
      
      by Abcd1234 ( 188840 ) writes:
      
      I took a compiler course, the superset of all that shit.
      
      LOL! That compiler course is BASED on formal language theory, so I think you probably have the relationship a little confused there. In fact, a compiler course really only gives you a light dusting of the real theory behind formal languages (I would know, I've taken a course in both... I'm certainly no expert in either, but at least I have a little perspective). A class in formal languages not only discusses Chomsky's hierarchy and their associat
- Re:Why is it that people think regexps are hard? (Score:2, Insightful)
  
  by sk8king ( 573108 ) writes:
  
  And you got the 'zero-or-more' and the 'one-or-more' mixed up [in Perl anyway]....that's why they're not as easy as you claim.
- Re:Why is it that people think regexps are hard? (Score:3, Insightful)
  
  by BigBadBri ( 595126 ) writes:
  
  In my case, they're hard because I only use them once in a blue moon, and it's nice to have a simple look-up and a few examples.
  But then, I'm not a compiler god, just a network guy who happens to have to use the fscking things once in a while.
- Post is from a troll template (see below) (Score:2, Informative)
  
  by Chad E Dirks ( 681955 ) writes:
  
  "Today I got roughly 4 first posts but then slashdot wouldn't let me post anymore. So thats enough trolling for one day." - rkz
  
  To be honest, that this exact same post template has been moderated highly again and again in recent book reviews is becoming more humorous than anything. Unfortunately, and this is addressed to certain moderators, I believe it would be correct to say the laughing is 'at you' and your misfortune rather than 'with you'.
  
  If you would like to confirm that you are being 'taken in', cli
- - - Re:+4 Informative? He doesn't even have to own... (Score:2, Funny)
      
      by carlos_benj ( 140796 ) writes:
      
      I suppose on /. that would be considered a regular expression....
