Ending Spam 184

Posted by timothy on Monday August 15, 2005 @05:25PM from the overdue dept.

Shalendra Chhabra writes "Jonathan Zdziarski has been fighting spam since before the first MIT spam conference in 2003, and has now released a full-on technical book, Ending Spam, on spam filtering. Ending Spam covers how the current and near-future crop of heuristic and statistical filters actually work under the hood, and how you can most effectively use such filters to protect your inbox." Read on for the rest of Chhabra's review.

Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification
author	Jonathan A. Zdziarski
pages	312
publisher	No Starch Press
rating	8
reviewer	Shalendra Chhabra
ISBN	1593270526
summary	Very Good Book Covering Statistical Models and Techniques Implemented in Current Spam Filters

Spam (unsolicited commercial email) and phishing (fraudulent emails) are causing losses of billions of dollars to businesses. Many initiatives are currently underway for fighting this challenge. On the legal front, a Virginia court recently sentenced a prolific spammer, Jeremy Jaynes, to nine years in prison, and a Nigerian court sentenced a woman to two and a half years for phishing. Michigan and Utah have both passed laws creating "do-not-contact" registries in July/August 2005, covering e-mail addresses, instant messaging addresses and telephone numbers. Technical initiatives to fight spam include server- or client-side spam filtering, using Lists (Blacklists, Whitelists, Greylists), Email Authentication Standards (IIM, DK, DKIM, SPF, SenderID), and emerging sender reputation and accreditation services.

Ending Spam is the first book explaining the fine details of the theoretical models and machine-learning algorithms implemented in these filters. The book is divided into three parts: introduction to spam filtering, fundamentals of statistical filtering, and advanced concepts of statistical filtering.

The first section of the book discusses the history of spam, the spam kings, and the different approaches to fighting spam, such as blacklisting, whitelisting, heuristic filtering, challenge-response, throttling, collaborative filtering, Authenticated SMTP, Sender Policy Framework and SenderID, spammer fingerprinting, etc. However, the author omitted any mention of locality-sensitive hash functions (such as the Nilsimsa hash) to counter spammers' random insertion of words, the use of CAPTCHA (Completely Automated Public Turing Test to Tell Computers and Humans Apart), Greylisting, Identified Internet Mail, and Domain Keys (now Domain Keys Identified Mail).

In the next chapter, the author clearly explains the various components of a Language Classifier Pipeline, including the Historical Dataset (aka wordlist, database, dictionary, filter memory), the Tokenizer, and the Analysis Engine with its feedback loop. However, the process flow of a language classifier could have been more generalized, e.g. by incorporating an initial text-to-text transformer. This chapter also covers the advantages and disadvantages of various training modes for filters, such as Train Everything (TEFT), Train-on-Error (TOE), and Train Until No Errors (TUNE). This part concludes with a description of Paul Graham's famous spam-filtering technique using Bayesian classification (as described in "A Plan for Spam"), Gary Robinson's Geometric Mean Test, Fisher-Robinson's Inverse Chi-Square (including the source code for the inversion function), and some other tricks for optimizing spam-filtering accuracy.
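
To make the Fisher-Robinson combining step concrete, here is a minimal sketch in Python. It is not the book's source code, which this review does not reproduce; the function names `chi2q` and `fisher_robinson_score` are hypothetical, and the sketch assumes each per-token spam probability lies strictly between 0 and 1.

```python
import math


def chi2q(x2, v):
    """Upper-tail probability of a chi-square value x2 with v degrees of
    freedom. This closed-form series only holds when v is even."""
    if v % 2 != 0:
        raise ValueError("v must be even")
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, v // 2):
        term *= m / i
        total += term
    # Rounding can push the sum slightly above 1.0, so clamp it.
    return min(total, 1.0)


def fisher_robinson_score(token_probs):
    """Combine per-token spam probabilities (each strictly between 0 and 1)
    into one score: near 1.0 is spammy, near 0.0 is hammy, and values
    around 0.5 mean the evidence is weak or conflicting."""
    n = len(token_probs)
    spamminess = 1.0 - chi2q(
        -2.0 * sum(math.log(1.0 - p) for p in token_probs), 2 * n)
    hamminess = 1.0 - chi2q(
        -2.0 * sum(math.log(p) for p in token_probs), 2 * n)
    return (1.0 + spamminess - hamminess) / 2.0
```

Under these assumptions, a message whose tokens all sit at 0.5 scores exactly 0.5, while a message dominated by strongly spammy tokens scores close to 1. For very long messages the `math.exp` call can underflow to zero, so a production filter would need to guard against that.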

The second part of this book deals with the fundamentals of statistical filtering. The author explains HTML and Base64 encoding, followed by a detailed description of tokenization techniques (e.g. Sparse Binary Polynomial Hashing). Then there's a discussion of the various tricks that spammers use for penetrating filters. Although these tactics are mentioned in John Graham-Cumming's "Spammers Compendium," Jonathan has very elegantly explained why some tricks work for spammers and some don't. This part concludes by addressing some of the resource, storage and scaling concerns raised by the large number of features generated from tokenization techniques.
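
As a rough illustration of why encoding matters to a tokenizer, the sketch below decodes a Base64 body before splitting it into tokens. The function name `decode_and_tokenize` and the set of characters treated as part of a token are assumptions made for illustration, not the book's tokenizer.

```python
import base64
import re


def decode_and_tokenize(encoded_body):
    """Decode a Base64-encoded message body, then split it into
    lowercase tokens. Without the decode step the filter would only
    see an opaque run of Base64 characters."""
    text = base64.b64decode(encoded_body).decode("utf-8", errors="replace")
    # Assumed token alphabet: letters, digits, and a few punctuation
    # marks that often carry signal in spam ($, ', !, -).
    return re.findall(r"[a-z0-9$'!-]+", text.lower())
```

A spammer who Base64-encodes a message gains nothing against a filter that performs this kind of decoding first, since the tokens that reach the classifier are the same as for the plain-text version.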

The third part of this book deals with advanced concepts of statistical filtering. This includes the testing criteria for measuring accuracy of an email filter, and some advanced tokenization concepts, e.g. chained tokens (taking word-pairs and phrases into account, instead of individual words) generated using a sliding 5-byte window as mentioned in Sparse Binary Polynomial Hashing. The next chapter describes the Markovian Model implemented in the CRM114 Discriminator, but the author fails to describe the different weighting schemes for features implemented in the Markovian-based version of CRM114. The author then describes the Bayesian Noise Reduction Technique for purging "out of context" data from the mail text. This chapter concludes with a very nice summary of collaborative algorithms and techniques, such as Message Inoculation, Streamlined Blackhole List, Fingerprinting, Automatic Whitelisting, URL Blacklisting, and Honeypot email addresses for snaring spammers' address-harvesting bots.
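
The two tokenization ideas mentioned above can be sketched as follows. This is a simplified illustration, not CRM114's implementation: it slides the window over word tokens for readability, skips the hashing step, and uses the hypothetical names `chained_tokens`, `sbph_features`, and the `<skip>` placeholder.

```python
from itertools import product


def chained_tokens(words):
    """Pair each word with its successor, so that "buy cheap" becomes a
    feature in its own right rather than two unrelated words."""
    return [f"{a} {b}" for a, b in zip(words, words[1:])]


def sbph_features(words, window=5):
    """SBPH-style features: for every position, always keep the newest
    word and emit every keep/skip combination of the (up to window - 1)
    words before it, marking skipped slots with a placeholder."""
    features = []
    for end in range(len(words)):
        start = max(0, end - window + 1)
        older = words[start:end]
        newest = words[end]
        for mask in product((False, True), repeat=len(older)):
            parts = [w if keep else "<skip>" for w, keep in zip(older, mask)]
            features.append(" ".join(parts + [newest]))
    return features
```

With a full five-word window, each new word contributes 16 features, which shows why the review raises resource, storage and scaling concerns for these techniques.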

The most interesting part of this book is the appendix, where the author presents interviews with John Graham-Cumming of POPFile, Brian Burton of SpamProbe, Marty Lamb of TarProxy, Bill Yerazunis of CRM114 Discriminator, and Jonathan Zdziarski of DSPAM (himself). I loved this section.

The salient points of the book: it's very easy to read; each chapter begins with a very thought-provoking introduction, and concludes with a crisp "final thoughts" section. There are very few technical errors in this printing, and the illustrations are of good quality. Since the book is geared more toward the Bayesian and statistical generation of spam filters, the absence of certain spam-busting technologies is acceptable. However, a noticeable omission is the lack of discussion about measuring spam-filter accuracy, and what impact this has on setting filtration thresholds. A section on the economics of tradeoffs, and the use of a Receiver Operating Characteristic (ROC) curve, would have been very helpful.
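
For readers unfamiliar with the ROC idea, the sketch below shows what such a section might compute. It is an illustration only: the function name `roc_points` is hypothetical, and it assumes the filter emits a numeric score per message, with higher meaning more spammy, alongside a known spam/ham label.

```python
def roc_points(scores, labels):
    """Return (threshold, false_positive_rate, true_positive_rate) for
    every distinct score used as a cutoff. labels[i] is True when
    message i is spam; a message is flagged when its score >= threshold.
    Assumes at least one spam and one ham message are present."""
    spam_total = sum(1 for y in labels if y)
    ham_total = len(labels) - spam_total
    points = []
    for threshold in sorted(set(scores), reverse=True):
        caught_spam = sum(
            1 for s, y in zip(scores, labels) if s >= threshold and y)
        flagged_ham = sum(
            1 for s, y in zip(scores, labels) if s >= threshold and not y)
        points.append(
            (threshold, flagged_ham / ham_total, caught_spam / spam_total))
    return points
```

Lowering the threshold catches more spam but flags more legitimate mail; plotting the second value against the third across all thresholds gives the ROC curve, and the cost of a lost legitimate message relative to a delivered spam decides where on that curve to operate.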

Overall, by putting together Ending Spam, Jonathan Zdziarski has made another significant contribution (after DSPAM) to the anti-spam community. Whether you are a system administrator, anti-spam researcher, engineer or a newbie interested in fighting spam, this book is a great reference.

William S. Yerazunis and Richard Jowsey also contributed to this review. Shalendra Chhabra is a graduate student in the Department of Computer Science and Engineering at the University of California, Riverside. He is on the development team of the CRM114 Discriminator and has presented his work at the MIT Spam Conference 2005, Cisco Systems, and Stanford University. You can purchase Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification from bn.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.

This discussion has been archived. No new comments can be posted.

You can't have both... (Score:3, Insightful)

by TarryTops ( 888130 ) writes: on Monday August 15, 2005 @05:29PM (#13325276) Homepage

The openness will have to pay its cost, and spam is one such pest. You can develop better strategies for pest control, but in the end it's a trade-off.

Score -5 Outdated. (Score:2, Insightful)

by Anonymous Coward writes: on Monday August 15, 2005 @05:35PM (#13325332)

As with any book of this type, it is outdated by the time it reaches the shelves. The spam battlefield changes on a daily basis, and the tools used to fight the battle change with it daily.

By the time a book has been written, edited, proofread (though many publishers skip this part), typeset, printed, distributed and sold, it no longer resembles the technology.

You can't catch it all (Score:2, Insightful)

by solodex2151 ( 700977 ) writes: on Monday August 15, 2005 @05:37PM (#13325350)

Spam will continue to disguise itself as legit email. You can try to filter it out and set more strict filters but catching legitimate mail is far more likely to happen. In the end, you have to make a trade off and practically accept some spam.

Ending Spam? (Score:5, Insightful)

by demonbug ( 309515 ) writes: on Monday August 15, 2005 @05:39PM (#13325371) Journal

Does anyone else find it funny that a book called "Ending Spam" talks about spam filtering? Maybe I'll go write a book; "Ending World Hunger: How To Filter Sally Struthers From Your Television".
If you can't see it, it ain't there?

Is spam a parasitic malady and, if so, what next? (Score:4, Insightful)

by WillAffleckUW ( 858324 ) writes: on Monday August 15, 2005 @05:41PM (#13325390) Homepage Journal

I'm wondering... will UCE (Spam) be like malaria... controllable in most areas but impossible to eradicate?

Or will these dedicated folks and others be able to eliminate it, perhaps by changes to the mail protocols?

Interesting question that, considering my work involves malaria.

My guess is that, like malaria and most parasitic infestations, we will at some point develop a "cure". The "cure" will work for a few years, after which the parasite (spam) will have adapted, surviving until then in different hosts (old windows machines donated to Africa, who knows). Then, having developed a new trick, it will come back as strong as ever.

Biology teaches us that organisms adapt to changing environments through (natural) selective breeding, point mutations, and unforeseen combinations (see the H5N1 avian influenza). We can develop cures, but once we do so, we can be fairly sure that, barring species extinction, it will develop methods to cope with our cures.

An easy solution would be to move to IPv6 - but this, like authentication, will only kill off the spam which doesn't use "trusted email clients that are identified" while the spam that can survive will be encouraged to spread like wildfire.

So long as the fiscal, legal, and societal penalties for spamming are fairly low and the rewards are high, and while most people do nothing about it, it will spread.

Re:Email is mostly broken (Score:5, Insightful)

by MichaelSmith ( 789609 ) writes: on Monday August 15, 2005 @05:47PM (#13325452) Homepage Journal

The answer lies in authentication
And it requires central control. Is this what you want?

Re:Ending Spam? (Score:3, Insightful)

by DogDude ( 805747 ) writes: on Monday August 15, 2005 @05:51PM (#13325484)

Well, I think that most rational people would understand the title to mean "Ending spam as it pertains to ME". In which case, as far as most people are concerned, if they don't see spam, then the spam problem is solved. I really don't think that that is an inordinate amount of literary license.

And yes, if you don't see it, then unless you're a system administrator (can't be more than 0.001% of the population), the problem IS solved. The problem isn't spam per se, but that spam clogs up MY inbox.

It's just like anything else. Nobody is going to end spam altogether... that's just naive. But if you don't see it any more, then the problem (again, spam filling up MY inbox) is fixed. I don't give two shits as to what some upstream sysadmin has to do to stop it. I have my own problems, and that's part of his job. Just stop spam from getting to ME, and I'm all good.

Effective filtering will end spam (Score:5, Insightful)

by Sycraft-fu ( 314770 ) writes: on Monday August 15, 2005 @05:52PM (#13325491)

The reason spammers do it is that their message reaches people, enough of them to make it worthwhile. So, the more effective and widespread the filters, the fewer messages reach people, and the less it's worth. If the filters were really effective, nearly 100%, it would simply not be worth it to spam; you wouldn't make any money because no one would see your message.

I don't think we'll ever get there, but yes filtering really could end spam.

Re:Ending Spam? (Score:4, Insightful)

by pomo monster ( 873962 ) writes: on Monday August 15, 2005 @05:56PM (#13325523)

Well, in a way, and I don't mean philosophically. If nobody can see the spam, then it really will dry up--spammers won't even bother.

There's no such thing as a perfect filtering system, but for every message blocked, that's extra effort for the spammer to get through, making it less and less worthwhile to spam at all.

Or maybe they'll just send more and more, hoping at least one gets through.

Re:Jonathan Zdziarski is out of his mind. (Score:5, Insightful)

by david.given ( 6740 ) writes: <dg@cowlark.com> on Monday August 15, 2005 @06:03PM (#13325582) Homepage Journal

Read some of his essays. He genuinely believes that all evidence clearly shows that the earth cannot possibly be more than 10,000 years old.
This may be the case; however, that doesn't invalidate his work on spam. Remember, Sir Isaac Newton was a firm believer in the more exotic aspects of mystical alchemy, and the vast bulk of his 'research' was complete gibberish. That doesn't make his work on gravity any less valuable.

I know it's a cliché movie, but I can't help (Score:3, Insightful)

by Idealius ( 688975 ) writes: on Monday August 15, 2005 @06:03PM (#13325586) Journal

Reminds me of the conversation at the end of Batman Begins with Gordon and the Bat:

Gordon: "Batman making a stand as he has will only escalate the problem."

If suddenly the masses are educated on spam filtering, wouldn't spammers just adopt tactics to avoid them?

I mean, it is after all a "spammers' market". They have increased resources because they're getting all the money. I'm sure the spammers are much smarter than most techies who use filters; they just don't care. They think, "If this techie is going to use a filter to stop my spam, so be it; there's a hundred people for each one of him that won't."

No, we need to think of new techniques outside of filtering. Filtering is mostly nonsense, manual work. We need something philosophically different from filtering, something that affects how spam comes through in transit, or something that affects the financial backing of spammers.

We should be breaking down their lines of communications, etc - not expecting granny to take up spam filtering techniques.

This should really be entitled "Hiding Spam" (Score:3, Insightful)

by wernst ( 536414 ) writes: on Monday August 15, 2005 @06:13PM (#13325646) Homepage

Not to quibble, but even the best filters don't "end" spam.
Even a manservant reading all of my mail and hand-carrying printouts of nothing but personal messages to my Jamaican bungalow doesn't "end" spam.
It would seem that These Guys [slashdot.org] are actually making an attempt to "end" spam.
All this guy is talking about is hiding it from view. Big deal...

Gotta use it right (Score:3, Insightful)

by jfengel ( 409917 ) writes: on Monday August 15, 2005 @06:47PM (#13325917) Homepage Journal

If they're adopting SenderID, it makes it easy to filter them. You can't filter just on the existence of SenderID; you need to check who the sender is and ignore email from known spammers.

That's a good thing. It lets them spew all of the email they want; let's call it freedom of speech (since I don't want any legal limitations on spam also being used to prevent legitimate speech). And I get to ignore them; I can filter them at the SMTP layer even before they get to send the whole message.

It may not be successful yet, if people are misusing the technology by trusting the existence of a Sender ID record to mean it's not spam. But don't blame the technology for being misused.

Re:Is spam a parasitic malady and, if so, what nex (Score:2, Insightful)

by -brazil- ( 111867 ) writes: on Monday August 15, 2005 @06:59PM (#13326011) Homepage

No, because the anti-spam measures do not aim to kill those people, only to make them stop sending spam. Furthermore, spammers are not a separate species and do not reproduce (as spammers).

Re:If it's a business model, where's the underwear (Score:2, Insightful)

by -brazil- ( 111867 ) writes: on Monday August 15, 2005 @07:10PM (#13326098) Homepage

Just as a mosquito is merely a tool the malarial parasite uses to spread itself.

Except that spam does not use zombies to spread itself, SPAMMERS use zombies to spread spam.

Your analogy is simply flawed. Spam is NOT an organism. It does NOT "survive" somewhere, adapt and spread from the places where it survived.

And we certainly DO go for "species extinction", by eliminating the conditions that make spam practicable and profitable. You enumerate some of those conditions yourself in the end.

Re:Gotta use it right (Score:3, Insightful)

by jfengel ( 409917 ) writes: on Monday August 15, 2005 @07:21PM (#13326170) Homepage Journal

We'll probably still end up with some IP-based blacklists. You can imagine a spammer who spews out an infinite number of verified IDs. You can't blacklist just the IDs because they're one-shots. Instead, eventually you'll end up saying, "Hey, this server seems perfectly willing to grant IDs to any jackass; let's blacklist the IPs and encourage non-jackasses on that server to get a new one."

Basically, there will have to be layers of responsibility, and we can encourage the various layers to be responsible for the layers below them. Otherwise, a layer which mixes legitimate and asinine uses will risk having its legitimate users tarred with the same brush. The legitimate users will flee, and the spammers will no longer be able to hide among them.

Re:If it's a business model, where's the underwear (Score:2, Insightful)

by WillAffleckUW ( 858324 ) writes: on Monday August 15, 2005 @07:29PM (#13326236) Homepage Journal

Except that spam does not use zombies to spread itself, SPAMMERS use zombies to spread spam.

Your analogy is simply flawed. Spam is NOT an organism. It does NOT "survive" somewhere, adapt and spread from the places where it survived.

And we certainly DO go for "species extinction", by eliminating the conditions that make spam practicable and profitable. You enumerate some of those conditions yourself in the end.

If it looks like a duck, and it quacks like a duck, and it paddles like a duck, you want me to check to see if it's a robotic assembly of nanobots pretending to be a duck.

Nah. My point is/was - not that I brought up the biological equivalency of spam to malaria (someone else did, and I said it isn't, but it could be thought of that way) - that even should we find a "cure" for spam, it would come back so long as the underlying model rewarded the spamsters in some way for continuing.

So long as up to half the population won't report spam - in fact, it's more like 99 percent;

So long as enough people buy from spamsters to make it economically rewarding - which it is;

So long as the penalty is remote enough or far enough in the future to be ignored - which it is;

And so long as society encourages the pursuit of wealth above moral/ethical standards - which it does;

This won't change.

Sure, you can plug up a hole in the dike. I can - and do - turn in spamsters. But they will migrate and adapt.

Are they infectious diseases? Sometimes, see the use of zombies.

Can we truly eradicate them - no, because people will replace the prior spamsters so long as the afore-mentioned conditions perpetuate.

Want to cut down malaria? First, find easy methods of improving sanitation that allows it to perpetuate. Then find ways to interfere with the malarial infection of humans. If you do it backwards, it's likely that many places will still spread it. Because not everyone is rich like we are.

Same goes for spam - find ways to make it unrewarding for people to buy from spamsters (e.g. sell Viagra etc cheap, offer open source versions of office cheap - that's what they sell), find ways to make it bad to be a spamster, and then batten down the hatches with new protocols.

Easy Solution to Spam (Score:2, Insightful)

by VonSkippy ( 892467 ) writes: on Monday August 15, 2005 @07:29PM (#13326242) Homepage

Blacklist everyone, then whitelist only those people who you really want to communicate with. I've been doing it for years and get ZERO spam. People argue that they will miss important messages - nope, I never have. Email is not the only form of communication. All my family, friends, business clients know how to use the phone if their emails bounce. I have a web form (and phone number) for new clients (and once verified they are whitelisted), and I don't give a shit about the few messages that might not make it (although after several years of using this method I have no evidence that I've missed even one).

Re:You can't catch it all (Score:3, Insightful)

by farnz ( 625056 ) writes: <slashdot@fa r n z . o r g . uk> on Monday August 15, 2005 @08:30PM (#13326624) Homepage Journal

Trouble is that a zombie has access to the user's legitimate mail system, which they can abuse.
In the end, no technical solution is really going to solve it; you're using "is this machine meant to send mail?" as a heuristic for "is this mail junk mail?". As you can't define junk mail objectively, in computer-friendly criteria, any filter is inevitably going to make mistakes. The only question is whether your filter tends towards false positives or false negatives.

Re:Esprit d'Corps (Score:2, Insightful)

by DavidTC ( 10147 ) writes: <slas45dxsvadiv,vadiv&neverbox,com> on Monday August 15, 2005 @10:58PM (#13327421) Homepage

Don't be silly.
Mobs attacking spammers should only be armed with plastic spoons. All fourteen million of them.
Remember, if you only poke them once, it's not only not murder, it's not even assault, and perfectly legal under the CAN-POKE-SPAMMERS act, as long as they have a 'business relationship' with you, which they obviously created by spamming you.
And, to make it fair, they are allowed to opt out of any member of the mob poking them. One at a time, in writing, and we'll even waive the 48 hours it can traditionally take to process. (Of course, that person is free to go out and get some more people to stand in line, or even get back in line under another name.)
