
Handling the Loads

Posted by CmdrTaco
from the when-it-all-hits-the-fan dept.
On Tuesday, something terrible happened. The effects rippled through the world. And Slashdot was hit with more traffic than ever before as people grabbed at any open line of communication. When many news sites collapsed under the load, we managed to keep stumbling along. Countless people have asked me questions about how Slashdot handled the gigantic load spike. I'm going to try to answer a few of these questions now. Keep reading if you're interested.

I woke up and it seemed like a normal day. Around 8:30 I got to the office and made a pot of coffee. I hopped on IRC, started rummaging through the submissions bin, and of course, began reading my mail. Within minutes, just moments after the impact of the first plane, someone on IRC told me what had happened. A minute or two later, submissions started streaming into the bin. And at 9:12 a.m. Eastern Time, I made the decision to cancel Slashdot's normal daily coverage of "News for Nerds, Stuff that Matters," and instead focus on something more important than anything we had ever covered.

I couldn't get to CNN, and MSNBC loaded only enough to show me my first picture of the tragedy. I posted whatever facts we had: these were coming from random links over the net, and from Howard Stern, who syndicates live from NY, even to my town. Over the next hour I updated the story as events happened. I updated when the towers collapsed. And the number of comments exploded as readers expressed their outrage, sadness, and confusion following the tragedy.

Not surprisingly, the load on Slashdot began to swell dramatically. Normally at 9:30 a.m., Slashdot is serving 18-20 pages a second. By 10 we were up to 30 and spiking to 40. This is when we started having problems.

At this point Jamie and Pudge were online and we started trying to sort out what we could do. The database crashed and Jamie went into action bringing it back up. I called Krow: he's on Western time, but he knows the DB best, and I had to wake him up. But worst of all, I had to tell him what had happened in New York. It was one of the strangest things I've ever done: it still hadn't settled in. I had seen a few grainy photos but I don't have a TV in my office and hadn't yet seen any of the footage. After I hung up the phone I almost broke down. It was the first time, but not the last.

The DB problem was a known bug and the decision was made to switch to the backup box. This machine was a replicated mirror of Slashdot, but running a newer version of MySQL. We hadn't switched the live box simply because it meant taking the site down for a few minutes. Well we were down anyway, and the box was a complete replica of the live DB, so we quickly moved.

At this point the DB stopped being a bottleneck, and we started to notice new limits on the performance of the 6 web servers themselves. Recently we fixed a glitch with Apache::SizeLimit: functionally, it kills httpd processes that use more than a certain amount of memory, but the size limit was too low and processes were dying after serving just a few requests. This was complicated by the fact that the first story quickly swelled to more than a thousand comments ... we've tuned our caching to Slashdot's normal traffic: 5000-6000 comments a day, with stories having 200-500 comments. And this was definitely not the normal story. Our cache simply wasn't ready to handle this.

Our httpd processes cache a lot of data: this reduces hits to the database and just generally makes everything better. We turned down the number of httpd processes (from 60 on each machine to 40) and increased the RAM that each process could use (from 30 to 40 and later 45 megs). We also turned off reverse hostname lookups, which we use for geotargeting ads: the time required to do the rDNS is fine under normal load, but under huge loads we need that extra second to keep up with the primary job: spitting out pages as fast as possible.
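
As a rough check on that trade-off, the process counts and per-process limits quoted above can be multiplied out. This is only a back-of-the-envelope sketch: it assumes every httpd process grows all the way to its limit and ignores memory shared between processes, and the function name is made up for illustration.

```python
# Worst-case memory footprint per web server, using the figures quoted above.
# Assumption: footprint = processes x per-process limit (no shared memory).

def worst_case_mb(processes: int, limit_mb: int) -> int:
    """Upper bound on memory used if every httpd process reaches its limit."""
    return processes * limit_mb

before = worst_case_mb(60, 30)       # original settings
after_first = worst_case_mb(40, 40)  # first adjustment
after_final = worst_case_mb(40, 45)  # final adjustment

print(before, after_first, after_final)
```

Under those assumptions the final settings (40 processes at 45 megs) have the same worst-case footprint as the original ones (60 processes at 30 megs): fewer processes, each with more room for cache.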

This was around noon or so. I was keeping a close eye on the DB and we noticed a few queries that were taking a little too long. Jamie went in and switched our search from our own internal search to hitting Google: search is a somewhat expensive call on our end right now, and this was necessary just to make sure that we could keep up. We were serving 40-50 pages/second ... twice our usual peak loads of around "just" 25 pages a second. I drove the 10 minutes to get home so I could watch CNN and keep up better with what was happening.

We trimmed a few minor functions out temporarily just to reduce the number of updates going to frequently read tables. But it was just not enough: The database was now beginning to be overworked and page views were slowing down. The homepage was full of discussions that were 3-4x the average size. The solution was to drop a few boxes from generating dynamic pages to serving static ones.

Let me explain: most people (around 60-70%) view the same content. They read the homepage and the 15 or so stories on the homepage. And they never mess with thresholds and filters and logins. In fact, when we have technical problems, we serve static pages. They don't require any database load, and the apache processes use very little memory. So for the next few hours, we ran with 4 of our boxes serving dynamic pages, and 2 serving static. This meant that 60-70% of people would never notice, and the others would only be affected when they tried to save something ... and then they would only notice if they hit a static box, which would happen only one time in 3. It's not the ideal solution, but at this point we were serving 60-70 pages a second: 3x our usual traffic, and twice what we designed the system for. We got a lot of good data and found a lot of bottlenecks, so next time something causes our traffic to triple, we'll be much more prepared.
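
The "one in 3" figure and the share of readers who could notice the static fallback follow from the numbers above. The sketch below works it out; the 60-70% share and the 4 dynamic / 2 static split come from the text, while uniform load balancing across the six boxes and the function names are assumptions.

```python
# Rough estimate of who could notice the static fallback.
# Assumption: requests are spread evenly across all boxes.

def share_hitting_static(static_boxes: int, total_boxes: int) -> float:
    """Chance that a request lands on a box serving static pages."""
    return static_boxes / total_boxes

def share_noticing(same_content_share: float, static_boxes: int, total_boxes: int) -> float:
    """Readers who want dynamic features AND land on a static box."""
    wants_dynamic = 1.0 - same_content_share
    return wants_dynamic * share_hitting_static(static_boxes, total_boxes)

low = share_noticing(0.70, 2, 6)   # 30% want dynamic pages
high = share_noticing(0.60, 2, 6)  # 40% want dynamic pages
print(f"{low:.1%} to {high:.1%}")
```

Under those assumptions, only about 10-13% of readers would have been in a position to notice at all, and then only when they tried something a static page can't do.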

At the end of the day we had served nearly 3 million pages -- almost twice our previous record of 1.6M, and far more than our daily average of 1.4M. During the peak hours, average page serving time slowed by just 2 seconds per page ... and over 8000 comments were posted in about 12 hours, and 15,000 in 48 hours.
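
To put the daily totals next to the per-second figures quoted earlier, they can be converted to an average rate. This is plain arithmetic on the numbers above; spreading the pages evenly over 24 hours is a simplification, and the names are illustrative.

```python
# Convert pages per day into an average pages-per-second rate.
SECONDS_PER_DAY = 24 * 60 * 60

def avg_pages_per_second(pages_per_day: float) -> float:
    return pages_per_day / SECONDS_PER_DAY

tuesday = avg_pages_per_second(3_000_000)  # "nearly 3 million pages"
typical = avg_pages_per_second(1_400_000)  # daily average of 1.4M

print(round(tuesday, 1), round(typical, 1))
```

The all-day average for Tuesday comes out at roughly half the 60-70 pages a second seen at peak, which shows how concentrated the load was in the peak hours.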

On Wed. we started to put additional web servers into the pool, but that ended up not being necessary. We stayed dynamic and had no real problems on all 6 boxes all day. We peaked at around 35-40 pages/second. We served about 2 million pages. Thursday traffic loads were high, but relatively normal.

Summary

So here is what we learned from the experience.

  • We have great readers. I had only one single flame emailed to me in 24 hours, and countless notes of thanks and appreciation. We were all frazzled over here and your words of encouragement meant so much. You'll never know.
  • Slashteam kicks butt. Jamie, Pudge, Krow, Yazz, Cliff, Michael, Timothy, CowboyNeal, you guys all rocked. From collecting links to monitoring servers, to fixing bits of code in real time. It was good seeing the team function together so well ... I can't begin to describe the strangeness of seeing 2 separate discussions in our channel: one about keeping servers working, and another about bombs, terrorists, and war. But through it all these guys each did their part.
  • Slash is getting really excellent. With tweaks that we learned from this, I think that our setup will soon be able to handle a quarter million pages an hour. In other words, it should handle 3x Slashdot's usual load, without any additional hardware. And with a more monstrous database, who knows how far it could scale.
  • Watch out for Apache::SizeLimit if you are doing Caching.
  • Writing and reading to the same InnoDB MySQL tables can be done since it does row-level locking. But as load increases, it can start being less than desirable.
  • A layer of proxy is desirable so we could send static requests to a box tuned for static pages. For a long time now we've known that this was important, but it's a tricky task. But it is super necessary for us to increase the size of caches in order to ease DB load and speed up page generation time ... but along with that we need to make sure that pages that don't use those caches don't hog precious apache forks that have them. Currently only images are served separately, but anonymous homepages, xml, rdf, and many other pages could easily be handled by a stripped down process.
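
The proxy idea in the last point can be sketched as a routing decision. This is a minimal illustration, not how Slash does it: the pool names, the path rules, and the `Request` type are all hypothetical, and a real setup would live in the proxy's own configuration.

```python
from dataclasses import dataclass

# Hypothetical pool names and path rules, for illustration only.
STATIC_POOL = "static"
DYNAMIC_POOL = "dynamic"
STATIC_SUFFIXES = (".xml", ".rdf", ".gif", ".jpg", ".png")

@dataclass(frozen=True)
class Request:
    path: str
    logged_in: bool = False
    method: str = "GET"

def choose_pool(req: Request) -> str:
    """Send cacheable, anonymous requests to the stripped-down static pool."""
    if req.method != "GET":
        return DYNAMIC_POOL   # posting comments and the like needs the database
    if req.path.lower().endswith(STATIC_SUFFIXES):
        return STATIC_POOL    # images and feeds
    if req.path in ("/", "/index.html") and not req.logged_in:
        return STATIC_POOL    # anonymous homepage
    return DYNAMIC_POOL       # everything else keeps a cache-heavy process

print(choose_pool(Request("/")), choose_pool(Request("/", logged_in=True)))
```

The point of the split is that the large, cache-heavy processes only ever see requests that actually benefit from their caches.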

What happened on Tuesday was a terrible tragedy. I'm not a very emotional person but I still keep getting choked up when I see some new heartbreaking photo, or a new camera angle, learn some new bit of heartbreaking information, or read about something wonderful that somebody has done. This whole thing has shaken me like nothing I can remember. But I'm proud of everyone involved with Slashdot for working together to keep a line of communication open for a lot of people during a crisis. I'm not kidding myself by thinking that what we did is as important as participating in the rescue effort, but I think our contribution was still important. And thanks to the countless readers who have written me over the last few days to thank us for providing them with what, for many, was their only source of news during this whole thing. And thanks to the whole team who made it happen. I'm proud of all of you.

This discussion has been archived. No new comments can be posted.

  • Enemies of the USA (Score:0, Informative)

    by CmdrTaco on (468152) on Friday September 14, 2001 @01:07PM (#2299091) Homepage
    The following list is of slashdot users who are terrorist sympathisers and enemies of freedom and democracy:

    greenrd
    abdousi
    IntlHarvester
    Angry White Guy
    delmoi
    Cederic

  • Great Job, Cmdr Taco (Score:2, Informative)

    by Petrus (17053) on Friday September 14, 2001 @01:10PM (#2299107)
    I was getting all my early information and initial links to working news sites from Slashdot. Everybody in the office was surprised and asked where I got a working connection, since they could not get through to any major news channels.

    Regards,

    Petrus
  • Some Good News (Score:5, Informative)

    by bahtama (252146) on Friday September 14, 2001 @01:18PM (#2299161) Homepage
    A reminder and fyi, the current totals at Amazon.com [amazon.com] are:

    Total Collected: $4,528,374.96
    # of Payments: 124408

    I think that is truly amazing and by the time you go there it will be even more. I donated my $100, did you? Even 10 dollars could help buy all these guys [time.com] a cup of coffee, what's a couple bucks compared to the cause.

  • by bodin (2097) on Friday September 14, 2001 @01:21PM (#2299189) Homepage

    There's no need to do a reverse lookup just to be able to geotarget ads. Build up a reverse database, and you are all set.

    See http://www.ipindex.net/ [ipindex.net] for an updated index.

    You just need country or so location anyway, right? I mean there are a lot of .com domains in Europe now, and that's where reverse lookups go WRONG instead of looking at where the actual nets are allocated.

  • by tcyun (80828) on Friday September 14, 2001 @01:22PM (#2299194) Journal
    One of the things that I noticed was that many of the major sites reduced the content size on their home pages to the smallest size possible. I know that the NYT, MS-NBC and other sites removed most all of their images and went to fairly small home pages with a few lines of text.

    I think it is a credit to Slashcode, the Slash coders and great up-front planning that Slashdot was able to handle the load as well as it did. I know that Slashdot was one of the few sites where I could get a collection of information when many of the other sites were down.

    Kudos to all of you.

  • DNS & mod_gzip (Score:4, Informative)

    by drwho (4190) on Friday September 14, 2001 @01:24PM (#2299206) Homepage Journal
    Everyone knows that you should turn off hostname lookups. I was wondering why Slashdot would often be so damned slow first thing in the morning -- well, there's why: the PTR record had expired overnight. Another way we suffer for advertisers. Oh well.

    Static content can be stored and transmitted in gzip format, to be uncompressed by the browser (all modern browsers support this). HTML compresses very well -- pages here end up averaging 28% of their original size! This not only saves Slashdot bandwidth, but saves it for the end user as well. Some people out there are still using crufty old 28.8 modems, and need every bit of help they can get. Anyhow, do a search for apache mod_gzip and you'll find all you need to know.
  • Re:Some Good News (Score:2, Informative)

    by wurp (51446) on Friday September 14, 2001 @01:31PM (#2299267) Homepage
    And this is but a fraction of the money donated to the Red Cross. I donated directly [yahoo.com] ($250 - so there!) as I'm sure many others did. Note that this link is to their Yahoo store; you can get there even when www.redcross.org is overloaded.

    I'll repeat that more plainly...

    EVEN IF THE RED CROSS WEB SITE IS DOWN, YOU CAN DONATE MONEY HERE [yahoo.com] (http://store.yahoo.com/redcross-wtc/).
  • Re:DNS & mod_gzip (Score:2, Informative)

    by syates21 (78378) on Friday September 14, 2001 @01:45PM (#2299355)
    Uh, gzip is all fine and good, except in this case it probably would have made the problem *worse*.

    I don't recall them saying that bandwidth was ever a bottleneck. Causing the slash servers to do even *more* processing (ie compression) doesn't seem like it would have helped much.
  • Re:CNN's problems (Score:3, Informative)

    by crow (16139) on Friday September 14, 2001 @01:59PM (#2299450) Homepage Journal
    Yes, I am sure about this.

    CNN re-akamaized Tuesday; that's why they were up again Tuesday afternoon. I read the internal email sent out to Akamai employees asking certain groups to stay on to help with the process if they could.
  • by Anonymous Coward on Friday September 14, 2001 @02:13PM (#2299550)
    Finally someone with some sense. I do not work for CNN or MSNBC, however, I do work for a large news media company with many sites regularly linked to from slashdot. These sites (each) on a regular day receive about the same traffic as slashdot. What happened on Tuesday was by far the largest immediate spike in internet traffic ever. Things that you never expect to go wrong, like actually filling up all your (significant amount of) bandwidth, are things that are hard to anticipate.

    30/40 pg/s are very good numbers, but you can be sure that CNN and MSNBC were (and still are if the load here is any indication) doing at least 4-5 times that amount of traffic and probably more. For example, one site we have peaked at ~100,000 page views in 15 minutes with a sustained rate of ~90,000 for 6 hours straight! I can only imagine what a national, well known, news site like CNN was faced with.
  • by Anonymous Coward on Friday September 14, 2001 @02:18PM (#2299582)
    I work at CNN, and while I give kudos to Slashdot for "staying" up, their traffic was not even close to ours.

    It's pretty obvious from posts here why that is.

    Statements like:
    I tried CNN, MSNBC then to Slashdot or BBC, and you wonder why those sites could stay up? Imagine every single person with an internet connection at least in the United States hitting CNN to see what is going on.

    I am not going to give numbers, but let's say we serve 50 pages / sec on a regular "slow" news day at our lowest peak. Now multiply that by 1000 and you might get close to how many pages we served per second.
  • Correction (Score:4, Informative)

    by tedd (30053) <slashdot.deathcult@com> on Friday September 14, 2001 @02:21PM (#2299611) Homepage

    "I'm a loving guy. And I am also someone, however, who's got a job to do and I intend to do it. And this is a terrible moment," Bush said.

    http://www.foxnews.com/story/0,2933,34322,00.html [foxnews.com]

    I hope you're wrong about the nuke.
  • Re:A request (Score:3, Informative)

    by AugstWest (79042) on Friday September 14, 2001 @02:38PM (#2299691)
    There's nothing quite as pagan as dressing up in robes, then chanting as a group to turn a wafer into the body of a man who died around 2000 years ago.

    Here, drink some of his blood.

    If God has reason to be angry with this country, it's because we continue to support people like Falwell and Robertson.
  • by NMerriam (15122) <NMerriam@artboy.org> on Friday September 14, 2001 @03:24PM (#2299953) Homepage
    The government of the United States of America has been bullying and harassing nations for a very long time, flaunting themselves as a superpower which is untouchable. They've stuck their noses in other nations' business too many times and someone had decided to cut it off.

    I barely know where to begin when I read crap like this. The simple truth is that people hate us because we're the biggest kid on the block.

    Yes, they care about individual policy decisions, but there isn't a nation on earth that doesn't make the exact same decisions every day. We're criticized for butting our noses into foreign affairs, then criticized for being isolationist if we DON'T get involved in foreign affairs.

    We're criticized for supporting side A against B, but if we switch sides, we're criticized for supporting B against A.

    There is no policy we could possibly have that would make other nations happy with us. If we withdraw, we're "ignoring our responsibilities" but if we get involved we're "flaunting our power".

    Well fuck you all very much, planet earth. We didn't ask to be the only superpower. We're not itching to feed your hungry or shelter your homeless or finance your economic devastations, but we're the ones you call on first when you need those things done.

    You complain because american hegemony is destroying your cultures, then you go out and buy coca-cola and watch Friends on TV. You complain about our imperialism while ignoring the fact that Germany and Japan are our biggest competitors exactly BECAUSE we rebuilt them at OUR EXPENSE after we could have conquered them.

    We're damned if we do and damned if we don't, so don't give me any shit that we had it coming because of our policies. NO FUCKING POLICY WILL MAKE EVERYONE HAPPY!

    We're like the prettiest girl at a party -- all the women want to be her, all the men want to fuck her. There is not a country on earth that wouldn't trade places with us in a second, and on days like today i'd almost be happy to do it.
  • by namespan (225296) <namespan@elitema i l . org> on Friday September 14, 2001 @03:44PM (#2300058) Journal
    I barely know where to begin when I read crap like this. The simple truth is that people hate us because we're the biggest kid on the block.

    Yes, they care about individual policy decisions, but there isn't a nation on earth that doesn't make the exact same decisions every day. We're criticized for butting our noses into foreign affairs, then criticized for being isolationist if we DON'T get involved in foreign affairs.


    There's some degree of truth to your statement -- not all the reasons america is disliked are legitimate, and the prettiest girl at the party metaphor was great. We've done some great things in the world.

    But indications are that we've been responsible for some pretty terrible things at times in our nation's history. We've supported and trained regimes who've used terrorism and torture because it suited us. Israel is a prime example. Several Central American governments are an example.

    There's a book called "What Uncle Sam Really Wants" by Noam Chomsky with some details. I'm still trying to figure out how much of it to swallow, but he paints a grim picture, and if even a fraction of it is true, we have a lot to own up to here in the US.

    That said, attacking such a large civilian target as the WTC in peacetime is unprecedented and completely wrong. The people who did/would do that need to be fought.
  • Re:A request (Score:2, Informative)

    by ArticulateArne (139558) on Friday September 14, 2001 @04:18PM (#2300215)
    Yikes. I would agree with most of your post, but I have to say that comparing Jerry Falwell with Osama bin Laden is incredibly irresponsible. I too am a white Christian, and while I don't agree with everything Jerry Falwell says and does, he has never done anything even CLOSE to what Osama bin Laden has done, even if bin Laden wasn't responsible for the destruction of the WTC. To my knowledge, Falwell isn't even responsible for one person having died, ever. Correct me if I'm wrong, but I've never heard anything to that effect. (And no, calling homosexuality a sin != creating a climate of hate != responsibility for someone getting killed). Falwell participates within the realm of ideas, and while you may vehemently disagree with his ideas, he stays within that realm. Osama bin Laden moves out of the realm of the ideas and murders those with whom he disagrees. There's absolutely no comparison.

    Well, there goes all my karma.

  • by dszd0g (127522) on Friday September 14, 2001 @05:22PM (#2300396) Homepage
    I find it humorous that some people actually believe our foreign policy is based on helping others.

    I am not saying that the US does not help other countries. We quite often come to the support of our allies and troubled nations.

    However, one can not ignore the horrors we commit. The 150,000 deaths we are responsible for in Guatemala. Supporting De Beers and the diamond cartel and the slavery and thousands of deaths involved there. The whole cold war and how the USSR and the US used entire other nations as pawns in the war against each other. Our intelligence leaking information to US companies and helping to eliminate foreign competition.

    There are a number of books on these subjects. A couple good reads are "The Good Citizen" and "Corporate Media and the Threat to Democracy."

    A lot of Americans seem to believe that others get pissed off at us just because we are doing well. What people are pissed off at us for are the horrid things we do that aren't covered by American media.
  • Apache::SizeLimit (Score:2, Informative)

    by consumer (9588) on Friday September 14, 2001 @05:58PM (#2300579)
    I'm the maintainer of Apache::SizeLimit. I suggest you use the MAX_UNSHARED_SIZE setting. It's the most effective for heavily loaded sites. If you have suggestions or questions about usage, send them to the mod_perl mailing list. I monitor it and will see them and respond.
  • by bodin (2097) on Saturday September 15, 2001 @02:12AM (#2301991) Homepage
    But hey. They ARE ALREADY hitting another database today as they do the DNS-reverselookup. Making that lookup more simple by reducing and grouping nets into a small .cdb or other LOCAL FIXED DATABASE, will speed up that process indeed. It's just a matter of interpreting the ipindex in an intelligent way in respect to what the result is used for. Not name-mapping but actual geotargeting.
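
The local fixed-database idea raised in the comments above can be sketched as a sorted table of allocated networks searched with a binary search, so no DNS query is needed per request. This is a minimal sketch under stated assumptions: the table entries are reserved documentation ranges with made-up country codes, not real allocations, the function names are hypothetical, and ranges are assumed not to overlap.

```python
import bisect
import ipaddress

# Hypothetical allocation table: (network, country code). The addresses are
# reserved documentation ranges, not real allocations.
ALLOCATIONS = [
    ("192.0.2.0/24", "SE"),
    ("198.51.100.0/24", "US"),
    ("203.0.113.0/24", "FI"),
]

def build_index(allocations):
    """Turn CIDR blocks into sorted (first, last, country) integer ranges."""
    rows = sorted(
        (int(net.network_address), int(net.broadcast_address), country)
        for net, country in ((ipaddress.ip_network(n), c) for n, c in allocations)
    )
    starts = [row[0] for row in rows]
    return starts, rows

def lookup_country(index, address: str, default: str = "??") -> str:
    """Binary-search the range table; assumes ranges do not overlap."""
    starts, rows = index
    ip = int(ipaddress.ip_address(address))
    pos = bisect.bisect_right(starts, ip) - 1
    if pos >= 0 and rows[pos][0] <= ip <= rows[pos][1]:
        return rows[pos][2]
    return default

INDEX = build_index(ALLOCATIONS)
print(lookup_country(INDEX, "198.51.100.7"), lookup_country(INDEX, "8.8.8.8"))
```

Each lookup is an in-memory search with no network round trip, which is the property that matters when the goal is spitting out pages as fast as possible.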
