Data Crunching 94

Posted by timothy on Monday June 20, 2005 @04:15PM from the constructive-critique dept.

Vern Ceder writes "I really expected to love Data Crunching. The Pragmatic Bookshelf has come up with some very good and, well, "pragmatic" texts in the past so I was looking for more of the same. Even better, the subject of the book was the routine data extraction, massaging and formatting that I (and a lot of other coders) spend so much time on. I was really looking forward to adding a couple more pragmatic tools to my coding toolbox. Unfortunately (as you may have guessed), I really can't say I love Data Crunching. It's a good book, but there are several minor points that keep it from being a truly great book." Read on for the rest of Ceder's review.

Data Crunching: Solve Everyday Problems Using Java, Python, and more.
author	Greg Wilson
pages	176
publisher	Pragmatic Bookshelf
rating	7
reviewer	Vern Ceder
ISBN	0974514071
summary	A good introduction to data crunching, but watch the examples.

On the positive side, there is a lot of good stuff in this book. I would even go so far as to recommend it to everyone who writes code to extract or manipulate data, particularly those less experienced. Greg Wilson should be praised for taking the idea of data crunching seriously and for systematically dealing with its patterns and pitfalls. A lot of important work gets done every day with one-off programs and behind-the-scenes scripts, and Wilson is right that the techniques that go into this sort of coding are different from, but just as important as, those that go into full-blown application development.

The strength of this book is that it offers useful approaches and patterns for dealing with a variety of common programming situations and types of data, while also pointing out their common traps and pitfalls. Wilson starts with techniques for crunching text data, moves on to the use of regular expressions, XML, binary data, and SQL databases before concluding with a special section on "horseshoe nails," various little techniques which just might help save the day. Quite often he uses examples in both Python, which he calls an "agile" language, and Java, a "sturdy" language. The basic advice offered is sound, if not shocking -- keep things simple, test as you develop, don't duplicate code, use existing scripts and utilities when possible, and so on. The combination of such sound advice with a wealth of practical examples makes for a very effective handbook, particularly for someone new to data crunching.

So is Data Crunching a good book? Definitely. Should you read it if you regularly do routine data manipulation and extraction? Absolutely. And yet...

And yet there are a number of things that just aren't quite right. The text and binary sections are the best, while I would say that the XML and SQL sections are the weakest, partly because those topics are too broad to cover in a single slim chapter. If you already have an idea of how you might want to use XML or how to extract data from a SQL database, you're likely to find something handy in those chapters. On the other hand, if you're unfamiliar with them, this book probably doesn't have enough detail to get you writing useful code. Or rather, I should say it doesn't have enough detail to get you writing useful code while knowing what you're doing. And data crunching without knowing what you're doing is a bad idea. Trust me on that one.

I have another problem with the section on SQL. Several of the slicker SQL recipes rely on nested queries (pages 147-151). MySQL, clearly a very popular SQL database, has nested queries only in its latest versions, so many, if not the majority, of MySQL installations do not yet have that capability. Yet the text carries on as if nested queries were universal, without so much as a parenthetical mention that some things might not work on all SQL implementations. It seems to me that this is exactly the sort of pitfall a book like this should inform the reader of.

There are also several coding examples that bother me. Since I tend to both learn and teach by paying close attention to examples, I get uncomfortable with examples that seem to suggest something other than what they should.

For instance, the very first pieces of sample code (pages 9-10) in the text chapter are Python and Java programs to reverse the order of lines in a text file. I don't have a problem with the exercise itself; I've often assigned it to beginning programmers. However, this book is about quick and reliable solutions to common data handling problems, not leading people through basic programming exercises. Ironically, the very same chapter discusses the advantages of using the Unix command-line and its wealth of little tools. So wouldn't it be reasonable to expect at least a brief note or example showing that the REALLY easy way to solve the problem is with a single line: $ tac filename > filename2? Yet tac is not even in the list of "Useful Commands" on page 24. If reversing lines is just a programming example, it shouldn't be the lead example in a book like this, and if it is important, then the book should mention that the problem has already been solved.
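For readers who have not seen the exercise, here is a minimal Python sketch of what reversing the lines of a text file involves. It is not the book's code from pages 9-10; the function name reverse_lines is made up for illustration, and the handling of a missing final newline is an assumption about what you would want.

```python
def reverse_lines(text):
    """Return text with its lines in reverse order, roughly what tac does."""
    lines = text.splitlines(keepends=True)
    # If the last line has no trailing newline, add one so that it does not
    # run into the following line once the order is reversed.
    if lines and not lines[-1].endswith("\n"):
        lines[-1] += "\n"
    return "".join(reversed(lines))


if __name__ == "__main__":
    print(reverse_lines("first\nsecond\nthird\n"), end="")
```

On a system that has GNU tac installed, the one-liner quoted above does the same job without any code at all.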

In the same vein, Wilson spends a fair amount of time in the text chapter illustrating code to parse command-line parameters, before admitting that libraries for the task abound in most languages. Granted, being able to snag a parameter or two off of the command-line without using a library can sometimes be handy; but implementing a more involved command-line parser is a problem that has already been abundantly solved.
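As an illustration of the "already solved" point, here is a sketch using argparse from the Python standard library. The option names (infile, --output, --verbose) are hypothetical and are not taken from the book.

```python
import argparse


def parse_args(argv):
    """Parse a small, made-up set of options for a line-reversing script."""
    parser = argparse.ArgumentParser(description="Reverse the lines of a file.")
    parser.add_argument("infile", help="file to read")
    parser.add_argument("-o", "--output", default="-",
                        help="file to write (default: standard output)")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="report progress on standard error")
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args(["-v", "-o", "out.txt", "in.txt"])
    print(args.infile, args.output, args.verbose)
```

The library also generates the usage message and the error handling for bad options, which is most of the work in a hand-rolled parser.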

Similarly, one of the examples in the chapter on regular expressions uses a regular expression to check to see if a string contains a valid IP address (pages 65-66). After showing how to use a regular expression to scan a dotted quad of digits, the text then admits that using a regular expression alone would lead to too much complexity, since it's hard to use a regular expression to check to see if a 1 to 3 digit number is less than 255 (or 127, which is what he uses in his code). So the example on page 66 ends up compiling and matching a regular expression like this:

pat = re.compile("(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})")
. . .
m = pat.match(text)
for g in m.groups():
. . .
when a Python coder would more naturally just use:

quads = text.split('.')
for number in quads:

Sure, it's a good example of how to extract matched items, but the implication is that using a regular expression is the best way to extract numbers separated by dots, when in fact Python has a simpler, easier and more reliable way to deal with it. Again a quick mention of the "easy" way to solve the problem would have been appropriate.
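To make the comparison concrete, here is a sketch of both approaches carried through to a full validity check. Neither function is the book's code: the names are made up, the upper bound of 255 is the usual one rather than the 127 the book is said to use, and fullmatch is used so that trailing junk is rejected. Note that the split version only stays "simple" until it has to reject malformed input.

```python
import re

# Four groups of one to three ASCII digits, separated by literal dots.
QUAD = re.compile(r"(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})", re.ASCII)


def valid_ip_regex(text):
    """Regex approach: match the overall shape, then range-check each group."""
    m = QUAD.fullmatch(text)
    return m is not None and all(int(g) <= 255 for g in m.groups())


def valid_ip_split(text):
    """Split approach: every shape check has to be written out by hand."""
    quads = text.split(".")
    return len(quads) == 4 and all(
        q.isascii() and q.isdigit() and len(q) <= 3 and int(q) <= 255
        for q in quads
    )


if __name__ == "__main__":
    for candidate in ["192.168.0.1", "1.2.3", "U.S.A.", "256.1.1.1"]:
        print(candidate, valid_ip_regex(candidate), valid_ip_split(candidate))
```

Either version works; the choice is mostly about whether the shape check reads better as a pattern or as a list of explicit conditions.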

These kinds of issues are what keep Data Crunching from being a great book. In spite of them, it is still a very good and useful book and Greg Wilson has done a good job with a topic all too often ignored. The general idea is great, and the principles, problems and solutions are well-explained and relevant. If data crunching is something you do, I would certainly recommend that you read this book, but with a somewhat critical eye.

Reviewer catches himself. (Score:5, Insightful)

by juuri ( 7678 ) writes: on Monday June 20, 2005 @04:21PM (#12866360) Homepage

Don't berate the author for his examples using nested SQL when a paragraph later you call him out for not using "tac" because you assumed it is universal.

Like nested queries, tac isn't standard across all Unix platforms.

quads = text.split('.') (Score:5, Insightful)

by Evro ( 18923 ) * writes: <evandhoffman&gmail,com> on Monday June 20, 2005 @04:25PM (#12866398) Homepage Journal

quads = text.split('.')

This assumes valid data and not something mangled like "1.2.3" or "U.S.A.". Using the numeric regex match that the book's author suggested would be more reliable in matching IP addresses only.

nested queries are a problem? (Score:5, Insightful)

by stoolpigeon ( 454276 ) * writes: <bittercode@gmail> on Monday June 20, 2005 @04:26PM (#12866401) Homepage Journal

If a book uses nested queries and some rdbms doesn't -- the problem lies with the rdbms. I've never used mysql and I've avoided the flames about it not being a real database.... but come on. That is weak.

Re:nested queries are a problem? (Score:2, Insightful)

by jthayden ( 811997 ) writes: on Monday June 20, 2005 @04:30PM (#12866447)

Granted the ANSI SQL standard isn't followed as closely as perhaps other standards are, but if Nested Queries are in the standard, then I would have to say the RDBMS is at fault and not the book.

Regex method is better (Score:2, Insightful)

by Anonymous Coward writes: on Monday June 20, 2005 @04:32PM (#12866478)

Your oversimplification of his solution for validating ip addresses is a fine example of a poor review by someone who thinks he knows more than the author.

Try passing in a string such as "I.like puppies!!!". A regex like the one the author provided will easily reject this, so there's no need to worry about checking for numericness, or any other strange characters at all. The regex in fact filters out EVERYthing so that all that has to be done is to check the actual numeric values for the right value range. I would not like to see the remainder of the alternate example (I'm sure it wouldn't be simple)

I'm all for KISS but there definitely is such a thing as too simple.

Reviewing the book or showing off geekiness? (Score:4, Insightful)

by zanderredux ( 564003 ) * writes: on Monday June 20, 2005 @04:35PM (#12866496)

Similarly, one of the examples in the chapter on regular expressions uses a regular expression to check to see if a string contains a valid IP address (pages 65-66). After showing how to use a regular expression to scan a dotted quad of digits, the text then admits that using a regular expression alone would lead to too much complexity, since it's hard to use a regular expression to check to see if a 1 to 3 digit number is less than 255 (or 127, which is what he uses in his code). So the example on page 66 ends up compiling and matching a regular expression like this:
pat = re.compile("(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})")

Actually, that example is safer than just invoking text.split, as that long regex can shield you from injection attacks and help you enforce numeric IPs in one single command.
In the end, it is a matter of style, but just invoking text.split and trusting user input is... naive?!

Re:nested queries are a problem? (Score:3, Insightful)

by poot_rootbeer ( 188613 ) writes: on Monday June 20, 2005 @05:26PM (#12866959)

I may be wrong, but I believe that an RDBMS must support nested subqueries to be conformant to the ANSI SQL92 Entry-Level specification (maybe even SQL89?).

Not to fan the flames of another advocacy flamewar, but if MySQL hasn't caught up to a 13-year-old standard yet, it shouldn't be treated as a fully-functional SQL RDBMS.

If you're running MySQL you should be aware of its limitations yourself; it's not the book's job to bring them to your attention for you.

Not mentioning tac is not a dealbreaker (Score:4, Insightful)

by illumin8 ( 148082 ) writes: on Monday June 20, 2005 @05:28PM (#12866977) Journal

I don't fault the author for not mentioning tac. It is part of the GNU textutils package, and although it might be standard on every Linux distro, it's most likely not in ANY enterprise Unix. I just checked my Sun boxes and it's not installed there, except for the ones that I've installed GNU textutils on.

I really wish a lot of Open Source developers would stop assuming that all of us have every GNU utility ever invented on our system. I can't tell you how difficult it is to get the average GNU autoconf program to compile correctly on Solaris or any flavor of enterprise Unix, simply because most authors assume they're writing platform-independent code, without realizing that GNU's M4 is different from System V M4. Also, differences between lex, flex, tar, and GNU tar abound. Please, for the love of god, don't assume that the tools you know and love on your Linux box at home are available or even installable on enterprise kit at work. Most company policies prevent the installation of these types of tools.

Re:MySQL (Score:3, Insightful)

by DogDude ( 805747 ) writes: on Monday June 20, 2005 @06:24PM (#12867385)

What I can't believe (and I'm replying more to myself than anything else, because I just realized...) is that if MySQL hasn't been supporting something as basic as sub-queries until recently that means that there have been tons and tons of complex applications written without subqueries! Holy mother of christ... How would something as simple as even Slashdot get written without subqueries? There must be thousands upon thousands of apps out there that were written with almost -no- understanding of what a modern RDBMS is designed to do even though they're manipulating data. I can only imagine the middle layer of all of these apps doing many, many, many, many unnecessary database connections and queries. Wow. There are truly a LOT of bad programmers out there.

MySQL and data crunching (Score:3, Insightful)

by angio ( 33504 ) writes: on Monday June 20, 2005 @11:10PM (#12869111) Homepage

MySQL's lack of support for some of the ANSI SQL features is annoying. But, that said, I do a lot of data crunching on a terabyte or so of Internet measurement data, and MySQL remains my database of choice. In a data-mining application like mine, I need speed and a compact on-disk representation of the data and the indices before anything. Our inserts are batched a couple of times a day; having them fast is important, but having them run concurrently with queries isn't. I don't need transactions, I can deal with table-level locking, and I'm willing to give up a couple of things like nested selects to get that speed.
Given that MySQL is the best fit for some types of data crunching applications, the earlier comment about assuming nested queries has merit.

My requirements arise in a research setting, so perhaps they're less common. Companies like wal-mart can afford big iron on which to do their data mining. Smaller data crunching tasks don't make the same kind of performance demands on their RDBMS. Of course, one thing to consider is that the standard RDBMS model isn't all that well suited to huge-scale data-mining in general, so there may be no silver bullet here for any of us to get religious about yet.

Re:MySQL (Score:3, Insightful)

by quasi_steller ( 539538 ) writes: <Benjamin DOT Cutler AT gmail DOT com> on Monday June 20, 2005 @11:48PM (#12869311)

"DBAs and database developers do not consider MySQL a database."

You have got to be kidding me. Of course MySQL is a database. A database is simply a collection of data organized so that a computer program can access pieces of that data, something a MySQL database certainly does. This would make MySQL as a whole a DBMS (DataBase Management System), as it is a collection of programs used for managing a database. Now, is MySQL a RDBMS (Relational DBMS)? Well, that depends on your definition of RDBMS. If you define a RDBMS as a DBMS that stores its data in the form of related tables, then MySQL is most certainly a RDBMS. However, if you're a strict follower of Codd, then you might not consider MySQL a RDBMS, as it doesn't follow all of Codd's rules. However, under this strict definition, no SQL DBMS is a RDBMS, as SQL breaks some of Codd's rules.

Perhaps what you meant to say was: "DBAs don't consider MySQL a true SQL database." (Or at least didn't until very recently, as MySQL has gained a lot of functionality.)

Don't get me wrong, I don't disagree with you completely. While I believe MySQL has its uses, I also believe there are many applications where it just shouldn't be used. I just think that we need to be a little more careful when we choose our wording here, so we don't sound like we're trying to flame, or even worse troll. (By the way, I don't believe you were doing either. I'm sure that when you said database, you were thinking SQL.) MySQL is a database, it just is (was? I'm not sure about the newest version) not an SQL compliant database.

References:
- http://en.wikipedia.org/wiki/RDBMS
- http://www.webopedia.com/TERM/R/RDBMS.html

Re:MySQL (Score:3, Insightful)

by Matje ( 183300 ) writes: on Tuesday June 21, 2005 @02:23AM (#12869931)

So from the fact that MySQL lacked subquery support you derive that there are a lot of bad programmers? Methinks there is only evidence here that you're a bad logician. Now that is a skill a good programmer must have ;). A couple of remarks:

- if you're building a simple website, chances are you won't need any subqueries. Websites were (are?) the bread and butter of MySQL.

- the fact that the dbms lacks subquery support does not imply that the programmer lacks knowledge about them, nor does it imply that programmers generally use unnecessary db connections or queries!

- The MySQL manual states, correctly in my opinion, that in many situations subqueries can be rewritten to joins. Could it be possible that all those bad programmers out there were aware of this and you weren't?
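The rewrite mentioned in the last point can be sketched like this. It is only an illustration: it runs against an in-memory SQLite database through Python's sqlite3 module as a stand-in (not MySQL), and the authors and books tables and their contents are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books (author_id INTEGER, title TEXT, pages INTEGER);
    INSERT INTO authors VALUES (1, 'Ada'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO books VALUES (1, 'X', 200), (1, 'Y', 300), (2, 'Z', 100);
""")

# Nested form: authors who have written at least one book over 150 pages.
SUBQUERY = """
    SELECT name FROM authors
    WHERE id IN (SELECT author_id FROM books WHERE pages > 150)
    ORDER BY name
"""

# Join form of the same question. DISTINCT is needed because an author
# with several long books would otherwise appear once per book.
JOIN = """
    SELECT DISTINCT a.name
    FROM authors AS a JOIN books AS b ON b.author_id = a.id
    WHERE b.pages > 150
    ORDER BY a.name
"""

subquery_rows = conn.execute(SUBQUERY).fetchall()
join_rows = conn.execute(JOIN).fetchall()
print(subquery_rows == join_rows, join_rows)
```

Not every subquery has such a direct join equivalent, which is presumably why the point above says "many situations" rather than "all".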
