
Ending Spam

Shalendra Chhabra writes "Jonathan Zdziarski has been fighting spam since before the first MIT spam conference in 2003, and has now released a full-on technical book, Ending Spam, on spam filtering. Ending Spam covers how the current and near-future crop of heuristic and statistical filters actually work under the hood, and how you can most effectively use such filters to protect your inbox." Read on for the rest of Chhabra's review.
Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification
Author: Jonathan A. Zdziarski
Pages: 312
Publisher: No Starch Press
Rating: 8
Reviewer: Shalendra Chhabra
ISBN: 1593270526
Summary: Very good book covering statistical models and techniques implemented in current spam filters


Spam (unsolicited commercial email) and phishing (fraudulent emails) are causing losses of billions of dollars to businesses. Many initiatives are currently underway for fighting this challenge. On the legal front, a Virginia court recently sentenced a prolific spammer, Jeremy Jaynes, to nine years in prison, and a Nigerian court sentenced a woman to two and a half years for phishing. Michigan and Utah have both passed laws creating "do-not-contact" registries in July/August 2005, covering e-mail addresses, instant messaging addresses and telephone numbers. Technical initiatives to fight spam include server- or client-side spam filtering, using Lists (Blacklists, Whitelists, Greylists), Email Authentication Standards (IIM, DK, DKIM, SPF, SenderID), and emerging sender reputation and accreditation services.

Ending Spam is the first book explaining the fine details of the theoretical models and machine-learning algorithms implemented in these filters. The book is divided into three parts: introduction to spam filtering, fundamentals of statistical filtering, and advanced concepts of statistical filtering.

The first section of the book discusses the history of spam, the spam kings, and different approaches to fighting spam, such as blacklisting, whitelisting, heuristic filtering, challenge-response, throttling, collaborative filtering, Authenticated SMTP, Sender Policy Framework and SenderID, and spammer fingerprinting. However, the author omits any mention of locality-sensitive hash functions (such as the Nilsimsa hash) to counter spammers' random insertion of words, the use of CAPTCHA (Completely Automated Public Turing Test to Tell Computers and Humans Apart), Greylisting, Identified Internet Mail, and Domain Keys (now Domain Keys Identified Mail).

In the next chapter, the author clearly explains the various components of a Language Classifier Pipeline, including the Historical Dataset (aka wordlist, database, dictionary, filter memory), the Tokenizer, and the Analysis Engine with its feedback loop. However, the process flow of a language classifier could have been more generalized, e.g. by incorporating an initial text-to-text transformer. This chapter also covers the advantages and disadvantages of various training modes for filters, such as Train Everything (TEFT), Train-on-Error (TOE), and Train Until No Errors (TUNE). This part concludes with a description of Paul Graham's famous spam-filtering technique using Bayesian classification (as described in "A Plan for Spam"), Gary Robinson's Geometric Mean Test, the Fisher-Robinson Inverse Chi-Square (including the source code for the inversion function), and some other tricks for optimizing spam-filtering accuracy.
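
To make the combining step concrete, here is a minimal Python sketch of two of the techniques named above: the combination formula from "A Plan for Spam" and a Fisher-Robinson style inverse chi-square score. It is not the source code printed in the book. The function names (graham_combine, chi2q, fisher_score) are hypothetical, and the sketch assumes each token already has a spam probability strictly between 0 and 1.

```python
import math


def graham_combine(probs):
    """Combine per-token spam probabilities as in "A Plan for Spam":
    P = prod(p) / (prod(p) + prod(1 - p))."""
    spam = math.prod(probs)
    ham = math.prod(1.0 - p for p in probs)
    return spam / (spam + ham)


def chi2q(x2, v):
    """Probability that a chi-square variable with v degrees of freedom
    is >= x2. This closed-form series is only valid for even v."""
    if v % 2 != 0:
        raise ValueError("v must be even")
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def fisher_score(probs):
    """Fisher-Robinson style score in [0, 1]; values near 1 look like spam,
    values near 0 look like ham, values near 0.5 are uncertain."""
    n = len(probs)
    # s is close to 1 when the tokens are strongly spammy.
    s = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    # h is close to 1 when the tokens are strongly hammy.
    h = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (1.0 + s - h) / 2.0


if __name__ == "__main__":
    tokens = [0.99, 0.95, 0.20, 0.90]  # made-up token probabilities
    print(graham_combine(tokens), fisher_score(tokens))
```

One practical difference between the two: the Graham formula tends to push scores toward 0 or 1, while the chi-square approach leaves a middle band that a filter can treat as "unsure".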

The second part of this book deals with the fundamentals of statistical filtering. The author explains HTML and Base64 encoding, followed by a detailed description of tokenization techniques (e.g. Sparse Binary Polynomial Hashing). Then there's a discussion of the various tricks that spammers use for penetrating filters. Although these tactics are mentioned in John Graham-Cumming's "Spammers Compendium," Jonathan has very elegantly explained why some tricks work for spammers and some don't. This part concludes by addressing some of the resource, storage and scaling concerns raised by the large number of features generated from tokenization techniques.

The third part of this book deals with advanced concepts of statistical filtering. This includes the testing criteria for measuring accuracy of an email filter, and some advanced tokenization concepts, e.g. chained tokens (taking word-pairs and phrases into account, instead of individual words) generated using a sliding 5-byte window as mentioned in Sparse Binary Polynomial Hashing. The next chapter describes the Markovian Model implemented in the CRM114 Discriminator, but the author fails to describe different weighting schemes for features implemented in the Markovian-based version of CRM114. The author then describes the Bayesian Noise Reduction Technique for purging "out of context" data from the mail text. This chapter concludes with a very nice summary of collaborative algorithms and techniques, such as Message Inoculation, Streamlined Blackhole List, Fingerprinting, Automatic Whitelisting, URL Blacklisting, and Honeypot email addresses for snaring spammers' address-harvesting bots.
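
As a rough illustration of the tokenization ideas above, the following Python sketch builds chained tokens (adjacent word pairs) and a sparse sliding-window feature set in the spirit of Sparse Binary Polynomial Hashing. It is a simplification under stated assumptions: it works on whitespace-separated word tokens, emits readable strings instead of hashes, and uses a "<skip>" placeholder for dropped positions. The function names and the default window size are illustrative, not taken from the book or from CRM114.

```python
from itertools import product


def chained_tokens(tokens):
    """Adjacent word pairs, so "cheap pills" is learned as one feature."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def sparse_window_features(tokens, window=5):
    """For each position, combine the newest token with every subset of the
    preceding (window - 1) tokens; dropped tokens become "<skip>"."""
    features = []
    for end, newest in enumerate(tokens):
        older = tokens[max(0, end - window + 1):end]
        # Each mask decides which of the older tokens are kept.
        for mask in product((False, True), repeat=len(older)):
            parts = [tok if keep else "<skip>" for tok, keep in zip(older, mask)]
            features.append(" ".join(parts + [newest]))
    return features


if __name__ == "__main__":
    words = "buy cheap pills now".lower().split()
    print(chained_tokens(words))
    print(len(sparse_window_features(words)))
```

A full window of five tokens yields 16 features per position, which is why the storage and scaling concerns raised in the second part of the book matter for this family of tokenizers.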

The most interesting part of this book is the appendix, where the author presents interviews with John Graham-Cumming of POPFile, Brian Burton of SpamProbe, Marty Lamb of TarProxy, Bill Yerazunis of CRM114 Discriminator, and Jonathan Zdziarski of DSPAM (himself). I loved this section.

The salient points of the book: it's very easy to read; each chapter begins with a very thought-provoking introduction, and concludes with a crisp "final thoughts" section. There are very few technical errors in this printing, and the illustrations are of good quality. Since the book is geared more toward the Bayesian and statistical generation of spam filters, the absence of certain spam-busting technologies is acceptable. However, a noticeable omission is the lack of discussion about measuring spam-filter accuracy, and what impact this has on setting filtration thresholds. A section on the economics of tradeoffs, and the use of a Receiver Operating Characteristic (ROC) curve, would have been very helpful.
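
For readers unfamiliar with the ROC idea, here is a small Python sketch of how filter scores translate into a threshold tradeoff. It assumes a filter that outputs a score per message (higher means more spam-like) and a labelled test set; the function names and the sample data are made up for illustration.

```python
def roc_points(scores, labels):
    """Sweep every observed score as a threshold and return a list of
    (threshold, false_positive_rate, true_positive_rate) tuples.
    labels[i] is True when message i is spam."""
    n_spam = sum(1 for y in labels if y)
    n_ham = len(labels) - n_spam
    points = []
    for t in sorted(set(scores), reverse=True):
        # A message is flagged as spam when its score is >= t.
        tp = sum(1 for s, y in zip(scores, labels) if y and s >= t)
        fp = sum(1 for s, y in zip(scores, labels) if not y and s >= t)
        points.append((t, fp / n_ham, tp / n_spam))
    return points


def pick_threshold(points, max_fpr):
    """Best spam catch rate among thresholds whose false-positive rate
    stays within max_fpr; None if no threshold qualifies."""
    allowed = [p for p in points if p[1] <= max_fpr]
    return max(allowed, key=lambda p: p[2]) if allowed else None


if __name__ == "__main__":
    scores = [0.95, 0.9, 0.6, 0.4, 0.2, 0.1]           # made-up filter output
    labels = [True, True, False, True, False, False]   # True = spam
    print(pick_threshold(roc_points(scores, labels), max_fpr=0.0))
```

Because losing a legitimate message usually costs far more than letting one spam through, a site would typically cap the false-positive rate first and then accept whatever catch rate the curve allows at that point.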

Overall, by putting together Ending Spam, Jonathan Zdziarski has made another significant contribution (after DSPAM) to the anti-spam community. Whether you are a system administrator, anti-spam researcher, engineer or a newbie interested in fighting spam, this book is a great reference.


William S. Yerazunis and Richard Jowsey also contributed to this review. Shalendra Chhabra is a graduate student in the Department of Computer Science and Engineering at the University of California, Riverside. He is on the development team of the CRM114 Discriminator and has presented his work at the MIT Spam Conference 2005, Cisco Systems, and Stanford University. You can purchase Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification from bn.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.

  • fantastic advice (Score:2, Interesting)

    by Anonymous Spammer ( 700974 ) on Monday August 15, 2005 @05:40PM (#13325376)
    We spammers love you idiots who use spam filters. You were never going to buy from us or fall for our schemes anyway, so you do extra work to filter your e-mail and that way we are not bothered by you reporting us or attacking us. We are free to continue to waste your bandwidth and overflow your inbox, but you never see the spam and you leave us alone, to keep spamming those too ignorant to protect themselves. The complaints die down and we get what we want, the unknowing victims. What a great system.

    Heck, our lobby group even points out to Congress how spam laws are not really needed, since people who really don't want the spam are free to filter it. That and a little payola and we are free to phish for more victims.

    Yea, keep "fighting spam" with lame filters, we love it. Thanks!

  • by mcrbids ( 148650 ) on Monday August 15, 2005 @05:42PM (#13325397) Journal
    Email, as a system, is fundamentally broken. It's this broken design that allows SPAM to happen in the first place.

    Current anti-spam solutions are to email what an Antivirus package is to Windows - a hack add-on that increases complexity and costs without solving the underlying problem(s).

    Rather than fight viruses, we should be engineering an O/S that's inherently resistant to them. How many of you Linux/BSD/MacOS users EVER use antivirus, or need to?

    Rather than build ever-better antispam filters for Email, we should be engineering an email solution that's inherently resistant to SPAM.

    The answer lies in authentication - who is sending the email. Some of the best technologies now available use degrees of authentication without actually *saying* it outright. Examples are: refusing invalid domains, greylisting, challenge-response, SenderID - all of these are some form of authentication.

    As these are bypassed one by one by the spammers, the need for authentication of senders will continue to increase, until the dolts who will invariably reply with that "your solution will not work because... (check the options)" are shown to simply be.... wrong.

    Give it time. It's already happening whatever the originators of the SMTP protocol desired.
  • by Some Random Username ( 873177 ) on Monday August 15, 2005 @05:44PM (#13325419) Journal
    Read some of his essays. He genuinely believes that all evidence clearly shows that the earth cannot possibly be more than 10,000 years old.

    The contrast between being a logically minded person like a programmer, and being so easily brainwashed into believing complete nonsense is startling.
  • by MightyMartian ( 840721 ) on Monday August 15, 2005 @05:46PM (#13325437) Journal
    The root problem is with SMTP. We can try to patch it up with SPF and SenderID, we can try to find ways of putting identifiers on emails, but at the end of the day the protocol itself was built in a simpler age.

    The ultimate solution will come when we move to a new-generation mail delivery system. But the day is a long ways off, because the sheer cost of implementing such a system and the necessity of having it integrate with older SMTP systems for the years required for large-scale adoption means that spammers have a healthy length of time to irritate us.
  • by MightyMartian ( 840721 ) on Monday August 15, 2005 @05:53PM (#13325498) Journal
    The problem with these is that they're all duct-tape jobs on the SMTP protocol. The SMTP protocol has fundamental problems in that it essentially has no sender verification and has been configured as much by tradition as anything else to allow MTAs and MUAs to be effective equivalents. To some extent SPF and SenderID try to overcome the verification problems, but at least SPF has serious problems when it comes to forwarding unless header rewriting is done.

    I suppose the "legitimate" spam (not generated by zombies through various sorts of attacks) may always be around, because I can think of no efficient and streamlined means of allowing a user to configure automatic settings saying "Don't send me commercial spam". With a properly designed transport system, at least it should be possible to easily blacklist abusive domains.

  • by -brazil- ( 111867 ) on Monday August 15, 2005 @05:55PM (#13325512) Homepage
    Bad analogy. Spam is not an organism or infection. It is a business model. It does not "survive" in computers, but in a combination of economical, technical and legal conditions. Once those conditions become strongly unfavorable to the business model, there isn't really much that adaption can do. Selling "snake-oil" wonder cures used to be a really big, widespread business model. Better-informed consumers and increased regulation of the market for medicine have all but eradicated this practice. It survives, but in a much-changed and diminished form.
  • No good publisher (Score:2, Interesting)

    by SW6 ( 140530 ) <abuse@cabal.org.uk> on Monday August 15, 2005 @06:00PM (#13325554) Homepage
    It's by "No Starch Press" who seem to churn out books that look good on initial inspection, but don't seem to deliver on content.

    If this was published by O'Reilly, I'd have bought it on sight as they bother to edit their books. As it is, I'll give it a wide berth.

  • by plover ( 150551 ) * on Monday August 15, 2005 @06:07PM (#13325614) Homepage Journal
    You've missed the last two years in spammer technology, haven't you?

    Spam is no longer simply the domain of a giant server with a huge database. It's increasingly being sent out by zombie PCs, infected with viruses or trojans. Spammers pay the zombie-farmers to send their crap. Zombies send the email masquerading as the PC owner, using their credentials. Sender-ID? No problem, he's got one. SMTP? Sure, use the victim's server.

    Zombies mean that no matter what technology is used for sending validated, signed, pre-paid, whatever email, the zombies will have access to those resources and will still spew their crap. No anti-spam server technologies are going to prevent Windows machines from getting infested.

  • by WillAffleckUW ( 858324 ) on Monday August 15, 2005 @06:13PM (#13325651) Homepage Journal
    Bad analogy. Spam is not an organism or infection. It is a business model. It does not "survive" in computers, but in a combination of economical, technical and legal conditions.

    True and False.

    Spam acts like a parasitic organism, due to the favorable conditions for the business model. It does, in some cases, actually "survive" in certain computers, which are spam zombies that spew out spam from a spam source - in fact, there are a few at the other UW (in Wisconsin) which utilize the identified computers there to get thru the filters here (in Seattle).

    Informing consumers is highly unlikely to stop this behaviour - or else AIDS/HIV would have been halted. Some consumers are highly resistant to changing their behaviour, don't think it's important, or it's such a good deal what would it hurt.

    And, like the malarial mosquito, spam uses those responders (infected persons) to download more spam zombie software, since they tend not to be technical enough to remove the infection.
  • by billstewart ( 78916 ) on Monday August 15, 2005 @06:15PM (#13325670) Journal
    Sure, some details will change, and spammers and anti-spammers will pick up new tricks and abandon old ones, and the percentages of email that are spam will keep changing (normally up, but I saw one recent article saying it had dropped significantly in the last year.) But most of the fundamentals don't change much, or at least not very fast. Filtering techniques, Bayesian analysis, collaborative filtering, etc. are a solid core of knowledge that will continue to be useful.

    Rule 1 (Spammers always lie) won't change, though occasionally they'll think of new things to lie about. Rule 2 (Spammers are Stupid) won't change, though of course some spammers violate this rule, and some spammers can hire smart people to work for them, and enough of them are sufficiently persistent skr1pt k1dd13z that it sometimes makes up for stupidity.

    The latest and greatest spam-blocking technique will last a while before spammers find a way around it - it's somewhat of a losing game, because if it works well enough to be widely popular, it becomes a target for spammers to work around, though if it's effective and obscure, it'll work for you and your friends for a lot longer.

    PC users will continue to run insecure operating systems without administering them well, so there'll always be zombies for spammers to abuse. Windows automatic updates will gradually help this, but not only will new OS bugs get discovered frequently, but users will insist on running trojan horses that pretend to be new amusing programs, breaking any semblance of security.

  • by MightyMartian ( 840721 ) on Monday August 15, 2005 @06:17PM (#13325684) Journal
    I'm well aware of the zombie problem (having been the recipient of very nasty distributed dictionary attacks). The way that mail ought to work is that any system without an MX record ought not to be permitted to send email to an MTA. Unfortunately, for a variety of reasons (from legitimate to pure incompetence or laziness), many mail servers do not have MX or reverse records, and because sufficient amounts of legitimate email come from such servers, and because there is no line drawn between MTA and MUA (all go through TCP port 25), zombies can quite happily spread havoc.

    The first step to a new mail system is to assure that only legitimate and properly configured mail servers honoring MX records on outgoing mail (or whatever ends up replacing MX records) can expect delivery. Mail admins' hands are tied by stealth systems or badly configured ones, but if we do try to implement the no-MX rule, which would eliminate the zombie attacks, we end up shutting out systems that, for whatever reason, don't publish an MX record for outgoing servers.

    Zombies ought to be the easiest thing to shut down by a) not permitting non-MTA machines to push anything beyond the network via port 25 and b) publishing both incoming and outgoing mail servers.

  • by -brazil- ( 111867 ) on Monday August 15, 2005 @06:27PM (#13325773) Homepage
    It does, in some cases, actually "survive" in certain computers, which are spam zombies that spew out spam from a spam source

    That's not survival in the "organism" analogy, since a zombie will not send spam without a source, which will be gone when the business model is not workable, and especially not cause new source to appear.

    like the malarial mosquito, spam uses those responders (infected persons) to download more spam zombie software, since they tend not to be technical enough to remove the infection.

    You're mixing up the spreading of "zombie" software that is used to send spam with the spreading of spam itself.

    I totally agree that computer worms/viruses work very much like an infectious disease. But they are merely one tool that spammers use, not identical with the phenomenon of spam as such.
  • by WillAffleckUW ( 858324 ) on Monday August 15, 2005 @06:40PM (#13325871) Homepage Journal
    I totally agree that computer worms/viruses work very much like an infectious disease. But they are merely one tool that spammers use, not identical with the phenomenon of spam as such.

    Just as a mosquito is merely a tool the malarial parasite uses to spread itself.

    Let's say we knock out something that permits mosquitos to infect human hosts. Chances are that it might only partially impact malarial infections of non-human hosts. The impacted malarial bug, provided it survives and breeds, may then decide to use another vector to complete the infection.

    Same with spam - we can knock out the zombies. We can knock out the spam kingpins. We can make the email transmission more secure - it migrates to cell phones or text messages or video messages. Unless we go for species extinction, it is likely that it won't die, but will instead change.

    Nowadays I rarely see pop-under ads any more - due to using different browsers - but now ads show up that are movies, which really burn up my bandwidth. To kill off those ads, I would have to disable the very useful site portions that I do want.

    So long as the evolutionary niche exists that permits spamsters to make a buck or two from sending spam, so long as people don't turn in most spam, so long as some people buy from spamsters, and so long as most spamsters don't serve long jail sentences and are never caught, it is highly unlikely that spam will cease to exist.
  • by jonbryce ( 703250 ) on Monday August 15, 2005 @06:44PM (#13325896) Homepage
    Spam may not be an organism or an infection, but the people who send it are. So I think it is a perfect analogy.
  • by Anonymous Coward on Monday August 15, 2005 @07:28PM (#13326228)
    While at defcon I found this book called "Spam Cartel" which is very very interesting and revealing.

    I also know an acquaintance who developed a very unique and effective program to "finger" every Spam bot infected PC and with a "secret" program under trial, it shut down more than 550,000 spam sending infected PCs.

    Reports from the SPAM CHAT Channels indicate it was very effective in nailing down and eliminating Spam bots.

    The experiment was ongoing for about 4 months last year, and WOW! I had no idea there were that many spam bots...

    Word I've gotten is that a few "Checks and Balances" need to be deployed to prevent abuse... but I can imagine what would happen if more mail servers deployed such a system.

    J
