Ending Spam

Posted by timothy on Monday August 15, 2005 @05:25PM from the overdue dept.

Shalendra Chhabra writes "Jonathan Zdziarski has been fighting spam since before the first MIT spam conference in 2003, and has now released a full-on technical book, Ending Spam, on spam filtering. Ending Spam covers how the current and near-future crop of heuristic and statistical filters actually work under the hood, and how you can most effectively use such filters to protect your inbox." Read on for the rest of Chhabra's review.

Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification
author	Jonathan A. Zdziarski
pages	312
publisher	No Starch Press
rating	8
reviewer	Shalendra Chhabra
ISBN	1593270526
summary	Very Good Book Covering Statistical Models and Techniques Implemented in Current Spam Filters

Spam (unsolicited commercial email) and phishing (fraudulent emails) are causing losses of billions of dollars to businesses. Many initiatives are currently underway for fighting this challenge. On the legal front, a Virginia court recently sentenced a prolific spammer, Jeremy Jaynes, to nine years in prison, and a Nigerian court sentenced a woman to two and a half years for phishing. Michigan and Utah have both passed laws creating "do-not-contact" registries in July/August 2005, covering e-mail addresses, instant messaging addresses and telephone numbers. Technical initiatives to fight spam include server- or client-side spam filtering, using Lists (Blacklists, Whitelists, Greylists), Email Authentication Standards (IIM, DK, DKIM, SPF, SenderID), and emerging sender reputation and accreditation services.

Ending Spam is the first book explaining the fine details of the theoretical models and machine-learning algorithms implemented in these filters. The book is divided into three parts: introduction to spam filtering, fundamentals of statistical filtering, and advanced concepts of statistical filtering.

The first section of the book discusses the history of spam, spam kings, and different approaches for fighting spam, such as blacklisting, whitelisting, heuristic filtering, challenge response, throttling, collaborative filtering, Authenticated SMTP, Sender Policy Framework and SenderID, and spammer fingerprinting. However, the author omitted any mention of locality-sensitive hash functions (such as Nilsimsa Hash) to counter spammers' random insertion of words, the use of CAPTCHA (Completely Automated Public Turing Test to Tell Computers and Humans Apart), Greylisting, Identified Internet Mail, and Domain Keys (now Domain Keys Identified Mail).

In the next chapter, the author clearly explains various components of a Language Classifier Pipeline, including the Historical Dataset (aka wordlist, database, dictionary, filter memory), Tokenizer, and the Analysis Engine with its feedback loop. However, the process flow of a language classifier could have been more generalized, e.g. incorporating an initial text-to-text transformer. This chapter also covers the advantages and disadvantages of various training modes for filters, such as Train Everything (TEFT), Train-on-Error (TOE), and Train Until No Errors (TUNE). This part concludes with a description of Paul Graham's famous spam-filtering technique using Bayesian classification (as described in "A Plan for Spam"), Gary Robinson's Geometric Mean Test, Fisher-Robinson's Inverse Chi-Square (including the source code for the inversion function), and some other tricks for optimizing spam-filtering accuracy.
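
For readers who have not seen the inverse chi-square combination, the following is a minimal Python sketch of the general technique, not the code printed in the book. The function names chi2q and fisher_robinson_score are made up for this illustration, and the per-token spam probabilities are assumed to have been computed already and to lie strictly between 0 and 1.

```python
import math

def chi2q(x2: float, df: int) -> float:
    """Probability that a chi-square variable with an even number of
    degrees of freedom (df) exceeds x2. Closed-form series, even df only."""
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def fisher_robinson_score(probs: list[float]) -> float:
    """Combine per-token spam probabilities (each strictly between 0 and 1)
    into one score: near 1.0 means spam, near 0.0 means ham, and values
    around 0.5 mean the evidence is weak or conflicting."""
    n = len(probs)
    # Evidence for spam: are the (1 - p) values collectively too small to be chance?
    spam_evidence = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    # Evidence for ham: are the p values collectively too small to be chance?
    ham_evidence = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (1.0 + spam_evidence - ham_evidence) / 2.0

if __name__ == "__main__":
    print(fisher_robinson_score([0.99, 0.98, 0.97]))  # strongly spammy tokens
    print(fisher_robinson_score([0.01, 0.02, 0.03]))  # strongly hammy tokens
    print(fisher_robinson_score([0.5, 0.5]))          # no evidence either way
```

A practical consequence of this combination is that a message full of both strong spam and strong ham tokens lands near the middle of the range, which gives a filter a natural "unsure" band instead of a forced yes/no answer.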

The second part of this book deals with the fundamentals of statistical filtering. The author explains HTML and Base64 encoding, followed by a detailed description of tokenization techniques (e.g. Sparse Binary Polynomial Hashing). Then there's a discussion of the various tricks that spammers use for penetrating filters. Although these tactics are mentioned in John Graham-Cumming's "Spammers Compendium," Jonathan has very elegantly explained why some tricks work for spammers and some don't. This part concludes by addressing some of the resource, storage and scaling concerns raised by the large number of features generated from tokenization techniques.
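
To illustrate why decoding has to happen before tokenization, here is a small Python sketch using only the standard library. It is not taken from the book; the function name decode_body and the sample message are hypothetical, and a real filter would also have to handle quoted-printable bodies, character sets and HTML.

```python
import base64

def decode_body(payload: str, transfer_encoding: str) -> str:
    """Undo a Base64 Content-Transfer-Encoding so the tokenizer sees text.
    Any other encoding is passed through unchanged in this sketch."""
    if transfer_encoding.lower() == "base64":
        return base64.b64decode(payload).decode("utf-8", errors="replace")
    return payload

if __name__ == "__main__":
    encoded = base64.b64encode(b"Cheap pills, buy now").decode("ascii")
    # Tokenizing the raw payload yields a single opaque token.
    print(encoded.split())
    # Tokenizing the decoded payload yields the words the filter has learned.
    print(decode_body(encoded, "base64").split())
```

Without the decoding step, every Base64-encoded message looks like a unique string of gibberish, so the filter's word statistics never get a chance to apply.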

The third part of this book deals with advanced concepts of statistical filtering. This includes the testing criteria for measuring accuracy of an email filter, and some advanced tokenization concepts, e.g. chained tokens (taking word-pairs and phrases into account, instead of individual words) generated using a sliding 5-byte window as mentioned in Sparse Binary Polynomial Hashing. The next chapter describes the Markovian Model implemented in the CRM114 Discriminator, but the author fails to describe different weighting schemes for features implemented in the Markovian-based version of CRM114. The author then describes the Bayesian Noise Reduction Technique for purging "out of context" data from the mail text. This chapter concludes with a very nice summary of collaborative algorithms and techniques, such as Message Inoculation, Streamlined Blackhole List, Fingerprinting, Automatic Whitelisting, URL Blacklisting, and Honeypot email addresses for snaring spammers' address harvesting bots.
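
As a rough illustration of chained tokens, the Python sketch below emits adjacent word pairs alongside the single words. It is a simplification written for this review, not the book's or DSPAM's tokenizer: the function name chained_tokens, the "+" joiner and the word-splitting regular expression are all assumptions, and Sparse Binary Polynomial Hashing itself generates many more features per window than this.

```python
import re

def chained_tokens(text: str, max_chain: int = 2) -> list[str]:
    """Return single-word tokens followed by chains of up to max_chain
    adjacent words, joined with '+'."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    tokens = list(words)
    for n in range(2, max_chain + 1):
        for i in range(len(words) - n + 1):
            tokens.append("+".join(words[i:i + n]))
    return tokens

if __name__ == "__main__":
    print(chained_tokens("Buy cheap pills"))
```

The point of the pairs is context: "buy" and "cheap" may each be fairly neutral on their own, while "buy+cheap" can carry a much stronger spam signal. The price is a larger token database, which is the storage concern the book raises.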

The most interesting part of this book is the appendix, where the author presents interviews with John Graham-Cumming of POPFile, Brian Burton of SpamProbe, Marty Lamb of TarProxy, Bill Yerazunis of CRM114 Discriminator, and Jonathan Zdziarski of DSPAM (himself). I loved this section.

The salient points of the book: it's very easy to read; each chapter begins with a very thought-provoking introduction, and concludes with a crisp "final thoughts" section. There are very few technical errors in this printing, and the illustrations are of good quality. Since the book is geared more toward the Bayesian and statistical generation of spam filters, the absence of certain spam-busting technologies is acceptable. However, a noticeable omission is the lack of discussion about measuring spam-filter accuracy, and what impact this has on setting filtration thresholds. A section on the economics of tradeoffs, and the use of a Receiver Operating Characteristic curve (ROC) would have been very helpful.
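
To make the ROC point concrete, here is a minimal Python sketch of how such a curve is traced from a filter's output. It is an illustration only: the function name roc_points and the sample scores are invented, the test set is assumed to contain at least one spam and one ham message, and tied scores are not given any special handling.

```python
def roc_points(scores: list[float], is_spam: list[bool]) -> list[tuple[float, float]]:
    """Return (false positive rate, true positive rate) pairs obtained by
    lowering the spam threshold past one message at a time."""
    n_spam = sum(is_spam)
    n_ham = len(is_spam) - n_spam
    points = [(0.0, 0.0)]  # threshold above every score: nothing is flagged
    caught = 0             # spam correctly flagged so far
    false_alarms = 0       # ham wrongly flagged so far
    for _, spam in sorted(zip(scores, is_spam), reverse=True):
        if spam:
            caught += 1
        else:
            false_alarms += 1
        points.append((false_alarms / n_ham, caught / n_spam))
    return points

if __name__ == "__main__":
    scores = [0.9, 0.8, 0.3, 0.1]          # hypothetical filter scores
    labels = [True, False, True, False]    # True means the message was spam
    for fpr, tpr in roc_points(scores, labels):
        print(f"false positive rate {fpr:.2f}, spam caught {tpr:.2f}")
```

Each point corresponds to one possible threshold setting, so the curve shows directly how much legitimate mail must be sacrificed to catch a given fraction of spam, which is the tradeoff an administrator is really choosing.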

Overall, by putting together Ending Spam, Jonathan Zdziarski has made another significant contribution (after DSPAM) to the anti-spam community. Whether you are a system administrator, anti-spam researcher, engineer or a newbie interested in fighting spam, this book is a great reference.

William S. Yerazunis and Richard Jowsey also contributed to this review. Shalendra Chhabra is a graduate student in the Department of Computer Science and Engineering at the University of California, Riverside. He is on the development team of the CRM114 Discriminator and has presented his work at the MIT Spam Conference 2005, Cisco Systems, and Stanford University. You can purchase Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification from bn.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.

This discussion has been archived. No new comments can be posted.

Re:Sorry for the flamebait but (Score:5, Informative)

by Stanistani ( 808333 ) writes: on Monday August 15, 2005 @05:37PM (#13325354)

From castlecops.com:

"ABOUT THE AUTHOR:
Jonathan A. Zdziarski has been fighting spam for eight years, and has spent a significant portion of the past two years working on the next generation spam filter DSPAM. His research in algorithmic theory and neural networking has led to the development of many new approaches in language classification, and he has played a key role in designing some popular algorithms in use today, including Message Inoculation, Bayesian Noise Reduction, and the first functional Neural Networking algorithm for spam filters. Zdziarski lectures widely on the topic of spam and was a speaker at the 2004 and 2005 MIT Spam Conference."

Re:Email is mostly broken (Score:4, Informative)

by MrAnnoyanceToYou ( 654053 ) writes: <dylan AT dylanbrams DOT com> on Monday August 15, 2005 @06:15PM (#13325667)

You asked for it, Here It Is. You have officially scored the lowest I have ever personally seen, and I had to actually ADD negative things to the checklist just for you.

Yes, it's a possibility. Unfortunately, in this case the 'dolts who invariably reply with the survey' are actually right. The survey is funny, but it serves a very important purpose in this case - it shows that completely re-engineering the entire e-mail system means that the problems we have are masked temporarily and then reemerge. Identity, no identity, in the end the 'stopgaps' are actually better than the 'build it from the ground up' solution.

You Personally advocate a

(x) technical (x) legislative (x) market-based ( ) vigilante

approach to fighting spam. Your idea will not work. Here is why it won't work. (One or more of the following may apply to your particular idea, and it may have other flaws which used to vary from state to state before a bad federal law was passed.)

(x) Spammers can easily use it to harvest email addresses
(x) Mailing lists and other legitimate email uses would be affected
(x) No one will be able to find the guy or collect the money
( ) It is defenseless against brute force attacks
(x) It will stop spam for two weeks and then we'll be stuck with it
(x) Users of email will not put up with it
( ) Microsoft will not put up with it
( ) The police will not put up with it
(x) Requires too much cooperation from spammers
(x) Requires immediate total cooperation from everybody at once
(x) Many email users cannot afford to lose business or alienate potential employers
(x) Spammers don't care about invalid addresses in their lists
(x) Anyone could anonymously destroy anyone else's career or business

Specifically, your plan fails to account for

( ) Laws expressly prohibiting it
(N/A) Lack of centrally controlling authority for email
(x) Open relays in foreign countries
( ) Ease of searching tiny alphanumeric address space of all email addresses
(x) Asshats
(x) Jurisdictional problems
(x) Unpopularity of weird new taxes
( ) Public reluctance to accept weird new forms of money
(x) Huge existing software investment in SMTP
(x) Susceptibility of protocols other than SMTP to attack
(x) Willingness of users to install OS patches received by email
(x) Armies of worm riddled broadband-connected Windows boxes
(x) Eternal arms race involved in all filtering approaches
(x) Extreme profitability of spam
( ) Joe jobs and/or identity theft
(x) Technically illiterate politicians
(x) Extreme stupidity on the part of people who do business with spammers
(x) Extreme stupidity on the part of people who do business with Microsoft
(x) Extreme stupidity on the part of people who do business with Yahoo
(x) Dishonesty on the part of spammers themselves
( ) Bandwidth costs that are unaffected by client filtering
( ) Outlook

and the following philosophical objections may also apply:

(x) Ideas similar to yours are easy to come up with, yet none have ever been shown practical
(x) Any scheme based on opt-out is unacceptable
(x) SMTP headers should not be the subject of legislation
(x) Blacklists suck
( ) Whitelists suck
( ) We should be able to talk about Viagra without being censored
(x) Countermeasures should not involve wire fraud or credit card fraud
(x) Countermeasures should not involve sabotage of public networks
( ) Countermeasures must work if phased in gradually
(x) Sending email should be free
(x) Why should we have to trust you and your servers?
(x) Incompatibility with open source or open source licenses
( ) Feel-good measures do nothing to solve the problem
( ) Temporary/one-time email addresses are cumbersome
(x) I don't want the government reading my email
( ) Killing them that way is not slow and painful enough

Furthermore, this is what I think about you:

( ) Sorry dude, but I don't think it would work.
(x) This is a stupid idea, and you're a fascist for suggesting it.
( ) Nice try, assh0le! I'm going to find out where you live and burn your house down!

Re:Gotta use it right (Score:3, Informative)

by Antique Geekmeister ( 740220 ) writes: on Monday August 15, 2005 @08:32PM (#13326635)

No, SenderID tags have to be purchased from Microsoft, and can only be parsed by mail software from Microsoft due to the encumbering XML patents it uses. Take a look at the patent issues surrounding the RFCs for SPF, which Microsoft tried to "embrace and extend" into patented and proprietary uselessness. The current result is that the SenderID keys are not purchased by spammers: they're usually stolen by using the SenderID key's machine as a spam zombie, and it serves the admins of Microsoft mail servers right for believing in such a stupid approach.

Re:Fundamentals Don't Change Much/Fast (Score:2, Informative)

by PeeCee ( 678651 ) writes: on Monday August 15, 2005 @08:32PM (#13326636)

The other email is your personal email, never put that email anywhere on the net.
Right... except you don't need to. If you ever actually use your account to, well, email people, it means that address is out there somewhere. And it will get out as soon as your aunt sends you your next "FREE" birthday e-card, or some virus/worm takes over her computer and harvests her address book.
Note that this is not wild speculation, I have followed this same technique, and while it is undoubtedly one of the most effective ones available, I still have gotten a bunch of spam on addresses which were nowhere near "public". As a matter of fact, some messages I have sent only to close friends have ended up on random places around the web, with my address on them, because it got forwarded many many times by people who won't even bother to remove the headers.
And that's not to mention other possibilities, like your ISP's customer list getting stolen, their boxes getting hacked into, or simple dictionary attacks which can get you without you realizing or even moving a finger.
- PeeCee

Greylisting solves 95% for me (Score:2, Informative)

by bad_outlook ( 868902 ) writes: on Monday August 15, 2005 @09:32PM (#13326931)

Greylisting solves 95% for me - seriously. Try Postgrey for an easy, built-in solution to use with Postfix - it works like crazy.
