Handling the Loads
I woke up and it seemed like a normal day. Around 8:30 I got to the office and made a pot of coffee. I hopped on IRC, started rummaging through the submissions bin, and of course, began reading my mail. Within minutes someone told me on IRC what had happened, just moments after the impact of the first plane. A minute or two later, submissions started streaming into the bin. And at 9:12 a.m. Eastern Time, I made the decision to cancel Slashdot's normal daily coverage of "News for Nerds, Stuff that Matters," and instead focus on something more important than anything we had ever covered.
I couldn't get to CNN, and MSNBC loaded only enough to show me my first picture of the tragedy. I posted whatever facts we had: these were coming from random links over the net, and from Howard Stern, who is syndicated live from NY, even to my town. Over the next hour I updated the story as events happened. I updated when the towers collapsed. And the number of comments exploded as readers expressed their outrage, sadness, and confusion following the tragedy.
Not surprisingly, the load on Slashdot began to swell dramatically. Normally at 9:30 a.m., Slashdot is serving 18-20 pages a second. By 10 we were up to 30 and spiking to 40. This is when we started having problems.
At this point Jamie and Pudge were online and we started trying to sort out what we could do. The database crashed and Jamie went into action bringing it back up. I called Krow: he's on Western time, but he knows the DB best, and I had to wake him up. But worst of all, I had to tell him what had happened in New York. It was one of the strangest things I've ever done: it still hadn't settled in. I had seen a few grainy photos but I don't have a TV in my office and hadn't yet seen any of the footage. After I hung up the phone I almost broke down. It was the first time, but not the last.
The DB problem was a known bug and the decision was made to switch to the backup box. This machine was a replicated mirror of Slashdot, but running a newer version of MySQL. We hadn't switched the live box simply because it meant taking the site down for a few minutes. Well we were down anyway, and the box was a complete replica of the live DB, so we quickly moved.
At this point the DB stopped being a bottleneck, and we started to notice new limits on the performance of the 6 web servers themselves. Recently we fixed a glitch with Apache::SizeLimit: functionally, it kills httpd processes that use more than a certain amount of memory, but the size limit was too low and processes were dying after serving just a few requests. This was complicated by the fact that the first story quickly swelled to more than a thousand comments ... we've tuned our caching to Slashdot's normal traffic: 5000-6000 comments a day, with stories having 200-500 comments. And this was definitely not the normal story. Our cache simply wasn't ready to handle this.
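To see why a size limit that is too low hurts a caching process, here is a toy model in Python. It is only an illustration of the effect described above, not how Apache::SizeLimit is implemented: the function name and every number in it (starting size, growth per request) are hypothetical, and real processes do not grow by a fixed amount per request.

```python
def requests_before_kill(base_mb: float, growth_mb: float, limit_mb: float) -> int:
    """Count the requests a worker serves before it exceeds limit_mb.

    Toy model: the worker starts at base_mb, its cache grows by growth_mb
    per request, and its size is checked after each request completes.
    """
    if growth_mb <= 0:
        raise ValueError("growth_mb must be positive in this model")
    size = base_mb
    served = 0
    while size <= limit_mb:
        size += growth_mb   # serving a request fills a bit more cache
        served += 1
    return served


# Hypothetical figures: a 20 MB worker whose cache grows 2.5 MB per request.
print(requests_before_kill(20, 2.5, 30))  # low limit: killed quickly
print(requests_before_kill(20, 2.5, 45))  # higher limit: lives longer
```

In this model, every kill throws away a warm cache, so the replacement process has to hit the database again to rebuild it. That is the cost of a limit set too close to the working size of the cache.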
Our httpd processes cache a lot of data: this reduces hits to the database and just generally makes everything better. We turned down the number of httpd processes (from 60 on each machine to 40) and increased the RAM that each process could use (from 30 to 40 and later 45 megs). We also turned off reverse hostname lookups, which we use for geotargeting ads: the time required to do the rDNS is fine under normal load, but under huge loads we need that extra second to keep up with the primary job: spitting out pages as fast as possible.
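The trade between process count and per-process memory can be checked with simple arithmetic. The Python sketch below assumes the worst case, where every httpd process grows all the way to its size limit; the function name is made up for illustration, and the total RAM per machine is not given above, so no headroom is computed.

```python
def pool_memory_mb(processes: int, limit_mb: int) -> int:
    """Worst-case RAM for one machine if every process reaches its limit."""
    return processes * limit_mb


original = pool_memory_mb(60, 30)   # 60 processes at 30 MB each
first_cut = pool_memory_mb(40, 40)  # 40 processes at 40 MB each
later = pool_memory_mb(40, 45)      # 40 processes at 45 MB each

print(original, first_cut, later)
```

Under that assumption, 40 processes at 45 megs have the same worst-case footprint as 60 processes at 30 megs. The same memory is spent on fewer processes, each with room for a larger cache.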
This was around noon or so. I was keeping a close eye on the DB and we noticed a few queries that were taking a little too long. Jamie went in and switched our search from our own internal search to hitting Google: search is a somewhat expensive call on our end right now, and this was necessary just to make sure that we could keep up. We were serving 40-50 pages/second ... twice our usual peak loads of around "just" 25 pages a second. I drove the 10 minutes to get home so I could watch CNN and keep up better with what was happening.
We trimmed a few minor functions out temporarily just to reduce the number of updates going to frequently read tables. But it was just not enough: The database was now beginning to be overworked and page views were slowing down. The homepage was full of discussions that were 3-4x the average size. The solution was to drop a few boxes from generating dynamic pages to serving static ones.
Let me explain: most people (around 60-70%) view the same content. They read the homepage and the 15 or so stories on the homepage. And they never mess with thresholds and filters and logins. In fact, when we have technical problems, we serve static pages. They don't require any database load, and the apache processes use very little memory. So for the next few hours, we ran with 4 of our boxes serving dynamic pages, and 2 serving static. This meant that 60-70% of people would never notice, and the others would only be affected when they tried to save something ... and then they would only notice if they hit a static box, which would happen only one in 3 times. It's not the ideal solution, but at this point we were serving 60-70 pages a second: 3x our usual traffic, and twice what we designed the system for. We got a lot of good data and found a lot of bottlenecks, so the next time something causes our traffic to triple, we'll be much more prepared.
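The "one in 3" figure and the share of readers who could notice follow from the box counts. The Python sketch below works it out under one assumption not stated above: that requests are spread evenly across the 6 boxes. The function name is hypothetical, and the result is an upper bound, since the affected readers only noticed when they tried to save something.

```python
from fractions import Fraction


def degraded_share(static_boxes: int, total_boxes: int,
                   static_ok_share: Fraction) -> Fraction:
    """Upper bound on the share of readers who could notice the static boxes.

    static_ok_share is the fraction of readers who only view content
    that a static page serves just as well (60-70% in the text).
    """
    hit_static = Fraction(static_boxes, total_boxes)  # chance of landing on a static box
    return (1 - static_ok_share) * hit_static


print(Fraction(2, 6))                              # 1/3: two static boxes out of six
print(degraded_share(2, 6, Fraction(60, 100)))     # if 60% never notice
print(degraded_share(2, 6, Fraction(70, 100)))     # if 70% never notice
```

So with this split, roughly 10-13% of readers at most could land on a static box while needing a dynamic page.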
At the end of the day we had served nearly 3 million pages -- almost twice our previous record of 1.6M, and far more than our daily average of 1.4M. During the peak hours, average page serving time slowed by just 2 seconds per page ... and over 8000 comments were posted in about 12 hours, and 15,000 in 48 hours.
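To put the daily totals next to the pages-per-second figures, the Python snippet below converts them to a flat average over 24 hours. It treats "nearly 3 million" as exactly 3 million, which is a rounding assumption; the function name is only illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def avg_pages_per_second(pages_per_day: float) -> float:
    """Average serving rate if the day's pages were spread evenly."""
    return pages_per_day / SECONDS_PER_DAY


for label, pages in [("Tuesday (approx.)", 3_000_000),
                     ("previous record", 1_600_000),
                     ("daily average", 1_400_000)]:
    print(f"{label}: {avg_pages_per_second(pages):.1f} pages/second")
```

Averaged over the whole day, Tuesday works out to roughly 35 pages a second, so the 60-70 pages a second seen at peak was about double the day's mean rate.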
On Wed. we started to put additional web servers into the pool, but that ended up not being necessary. We stayed dynamic and had no real problems on all 6 boxes all day. We peaked at around 35-40 pages/second. We served about 2 million pages. Thursday traffic loads were high, but relatively normal.
Summary
So here is what we learned from the experience.
- We have great readers. I had only one single flame emailed to me in 24 hours, and countless notes of thanks and appreciation. We were all frazzled over here and your words of encouragement meant so much. You'll never know.
- Slashteam kicks butt. Jamie, Pudge, Krow, Yazz, Cliff, Michael, Jamie, Timothy, CowboyNeal, you guys all rocked. From collecting links to monitoring servers, to fixing bits of code in real time. It was good seeing the team function together so well ... I can't begin to describe the strangeness of seeing 2 separate discussions in our channel: one about keeping servers working, and another about bombs, terrorists, and war. But through it all these guys each did their part.
- Slash is getting really excellent. With tweaks that we learned from this, I think that our setup will soon be able to handle a quarter million pages an hour. In other words, it should handle 3x Slashdot's usual load, without any additional hardware. And with a more monstrous database, who knows how far it could scale.
- Watch out for Apache::SizeLimit if you are doing caching.
- Writing and reading to the same InnoDB MySQL tables can be done since it does row-level locking. But as load increases, it can start being less than desirable.
- A layer of proxy is desirable so we could send static requests to a box tuned for static pages. For a long time now we've known that this was important, but it's a tricky task. But it is super necessary for us to increase the size of caches in order to ease DB load and speed up page generation time ... but along with that we need to make sure that pages that don't use those caches don't hog precious apache forks that have them. Currently only images are served separately, but anonymous homepages, xml, rdf, and many other pages could easily be handled by a stripped down process.
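The proxy idea in the last point can be sketched as a routing decision. The Python below is a minimal illustration and not Slash code: the suffix list, the paths, and the `choose_pool` name are all assumptions, picked to match the kinds of pages named above (images, anonymous homepages, xml, rdf).

```python
from urllib.parse import urlsplit

# Hypothetical suffixes for content that needs no per-user cache.
STATIC_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".xml", ".rdf")
# Hypothetical homepage paths.
HOMEPAGE_PATHS = ("/", "/index.html")


def choose_pool(url: str, logged_in: bool) -> str:
    """Return 'static' for requests a stripped-down process could serve,
    and 'dynamic' for requests that need the big cached httpd processes."""
    parts = urlsplit(url)
    if parts.path.lower().endswith(STATIC_SUFFIXES):
        return "static"
    # Anonymous homepage with no query string: everyone sees the same page.
    if parts.path in HOMEPAGE_PATHS and not logged_in and not parts.query:
        return "static"
    return "dynamic"


print(choose_pool("/images/logo.gif", logged_in=True))
print(choose_pool("/", logged_in=False))
print(choose_pool("/comments.pl?sid=1", logged_in=False))
```

The point of the split is that the cheap requests never occupy a fork carrying a large cache, so the cache size can be raised without wasting memory on pages that don't use it.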
What happened on Tuesday was a terrible tragedy. I'm not a very emotional person but I still keep getting choked up when I see some new heartbreaking photo, or a new camera angle, learn some new bit of heartbreaking information, or read about something wonderful that somebody has done. This whole thing has shaken me like nothing I can remember. But I'm proud of everyone involved with Slashdot for working together to keep a line of communication open for a lot of people during a crisis. I'm not kidding myself by thinking that what we did is as important as participating in the rescue effort, but I think our contribution was still important. And thanks to the countless readers who have written me over the last few days to thank us for providing them with what, for many, was their only source of news during this whole thing. And thanks to the whole team who made it happen. I'm proud of all of you.
Enemies of the USA (Score:0, Informative)
greenrd
abdousi
IntlHarvester
Angry White Guy
delmoi
Cederic
Great Job, Cmdr Taco (Score:2, Informative)
Regards,
Petrus
Some Good News (Score:5, Informative)
Total Collected: $4,528,374.96
# of Payments: 124408
I think that is truly amazing, and by the time you go there it will be even more. I donated my $100, did you? Even 10 dollars could help buy all these guys [time.com] a cup of coffee. What's a couple of bucks compared to the cause?
No need to reverselookup hostnames for geotarget (Score:5, Informative)
There's no need to reverselookup just to be able to geotarget ads. Build up a reverse-database, and you are all set.
See http://www.ipindex.net/ [ipindex.net] for an updated index.
You just need country or so location anyway, right? I mean there are a lot of
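A reverse database of the kind suggested here can be as simple as a sorted table of address ranges searched with a binary search, which avoids any network lookup per request. The Python sketch below is an illustration only: the three ranges are reserved documentation blocks, and the country codes attached to them are placeholders, not real allocations. A real table would be built from registry allocation data.

```python
import bisect
import ipaddress

# Placeholder data: documentation-only networks with made-up country codes.
_SAMPLE = [
    ("192.0.2.0/24", "US"),
    ("198.51.100.0/24", "DE"),
    ("203.0.113.0/24", "JP"),
]

# Sorted list of (first_address, last_address, country) as integers.
RANGES = sorted(
    (int(net.network_address), int(net.broadcast_address), country)
    for net, country in ((ipaddress.ip_network(cidr), c) for cidr, c in _SAMPLE)
)
STARTS = [start for start, _, _ in RANGES]


def country_for(ip: str):
    """Return the country code for ip, or None if no range covers it."""
    n = int(ipaddress.ip_address(ip))
    i = bisect.bisect_right(STARTS, n) - 1   # last range starting at or before n
    if i >= 0 and n <= RANGES[i][1]:
        return RANGES[i][2]
    return None


print(country_for("192.0.2.17"))
print(country_for("10.0.0.1"))
```

The lookup is a few comparisons in memory, so it stays cheap under load in a way a reverse DNS query does not.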
Other sites went to stripped down initial pages (Score:2, Informative)
I think it is a credit to Slashcode, the Slash coders and great up-front planning that Slashdot was able to handle the load as well as it did. I know that Slashdot was one of the few sites where I could get a collection of information when many of the other sites were down.
Kudos to all of you.
DNS & mod_gzip (Score:4, Informative)
Static content can be stored and transmitted in gzip format, to be uncompressed by the browser (all modern browsers support this). HTML compresses very well -- pages here end up averaging 28% of their original size! This not only saves Slashdot bandwidth, but saves it for the end user as well. Some people out there are still using crufty old 28.8 modems, and need every bit of help they can get. Anyhow, do a search for apache mod_gzip and you'll find all you need to know.
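The effect is easy to try with any gzip implementation. The Python sketch below uses the standard `gzip` module rather than mod_gzip, and a made-up, highly repetitive page, so the ratio it prints will be much better than a real page would get and says nothing about the 28% figure quoted above.

```python
import gzip

# Hypothetical page: repeated markup compresses far better than real content.
html = (
    "<html><body>\n"
    + "<p>News for Nerds, Stuff that Matters</p>\n" * 200
    + "</body></html>\n"
).encode("utf-8")

compressed = gzip.compress(html)
ratio = len(compressed) / len(html)

# The browser side: decompressing gives back the exact original bytes.
restored = gzip.decompress(compressed)

print(f"original: {len(html)} bytes, gzipped: {len(compressed)} bytes")
print(f"compressed size is {ratio:.1%} of the original")
```

For static pages the compression can be done once and the result stored, so the CPU cost is paid per page change rather than per request.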
Re:Some Good News (Score:2, Informative)
I'll repeat that more plainly...
EVEN IF THE RED CROSS WEB SITE IS DOWN, YOU CAN DONATE MONEY HERE [yahoo.com] (http://store.yahoo.com/redcross-wtc/).
Re:DNS & mod_gzip (Score:2, Informative)
I don't recall them saying that bandwidth was ever a bottleneck. Causing the slash servers to do even *more* processing (i.e. compression) doesn't seem like it would have helped much.
Re:CNN's problems (Score:3, Informative)
CNN re-akamaized Tuesday; that's why they were up again Tuesday afternoon. I read the internal email sent out to Akamai employees asking certain groups to stay on to help with the process if they could.
Re:Good job to /., but forgive CNN and MSNBC (Score:1, Informative)
30/40 pg/s are very good numbers, but you can be sure that CNN and MSNBC were (and still are if the load here is any indication) doing at least 4-5 times that amount of traffic and probably more. For example, one site we have peaked at ~100,000 page views in 15 minutes with a sustained rate of ~90,000 for 6 hours straight! I can only imagine what a national, well known, news site like CNN was faced with.
Re:Good job to /., but forgive CNN and MSNBC (Score:1, Informative)
It's pretty obvious from posts here why that is.
Statements like:
I tried CNN, MSNBC then to Slashdot or BBC, and you wonder why those sites could stay up? Imagine every single person with an internet connection at least in the United States hitting CNN to see what is going on.
I am not going to give numbers, but let's say we serve 50 pages / sec on a regular "slow" news day at our lowest peak. Now multiply that by 1000 and you might get close to how many pages we served per second.
Correction (Score:4, Informative)
"I'm a loving guy. And I am also someone, however, who's got a job to do and I intend to do it. And this is a terrible moment," Bush said.
http://www.foxnews.com/story/0,2933,34322,00.html [foxnews.com]
I hope you're wrong about the nuke.
Re:A request (Score:3, Informative)
Here, drink some of his blood.
If God has reason to be angry with this country, it's because we continue to support people like Falwell and Robertson.
Re:Time for some highly unpopular opinion... (Score:4, Informative)
I barely know where to begin when I read crap like this. The simple truth is that people hate us because we're the biggest kid on the block.
Yes, they care about individual policy decisions, but there isn't a nation on earth that doesn't make the exact same decisions every day. We're criticized for butting our noses into foreign affairs, then criticized for being isolationist if we DON'T get involved in foreign affairs.
We're criticized for supporting side A against B, but if we switch sides, we're criticized for supporting B against A.
There is no policy we could possibly have that would make other nations happy with us. If we withdraw, we're "ignoring our responsibilities" but if we get involved we're "flaunting our power".
Well fuck you all very much, planet earth. We didn't ask to be the only superpower. We're not itching to feed your hungry or shelter your homeless or finance your economic devastations, but we're the ones you call on first when you need those things done.
You complain because American hegemony is destroying your cultures, then you go out and buy Coca-Cola and watch Friends on TV. You complain about our imperialism while ignoring the fact that Germany and Japan are our biggest competitors exactly BECAUSE we rebuilt them at OUR EXPENSE after we could have conquered them.
We're damned if we do and damned if we don't, so don't give me any shit that we had it coming because of our policies. NO FUCKING POLICY WILL MAKE EVERYONE HAPPY!
We're like the prettiest girl at a party -- all the women want to be her, all the men want to fuck her. There is not a country on earth that wouldn't trade places with us in a second, and on days like today I'd almost be happy to do it.
Re:Time for some highly unpopular opinion... (Score:4, Informative)
Yes, they care about individual policy decisions, but there isn't a nation on earth that doesn't make the exact same decisions every day. We're criticized for butting our noses into foreign affairs, then criticized for being isolationist if we DON'T get involved in foreign affairs.
There's some degree of truth to your statement -- not all the reasons America is disliked are legitimate, and the prettiest girl at the party metaphor was great. We've done some great things in the world.
But indications are that we've been responsible for some pretty terrible things at times in our nation's history. We've supported and trained regimes who've used terrorism and torture because it suited us. Israel is a prime example. Several Central American governments are an example.
There's a book called "What Uncle Sam Really Wants" by Noam Chomsky with some details. I'm still trying to figure out how much of it to swallow, but he paints a grim picture, and if even a fraction of it is true, we have a lot to own up to here in the US.
That said, attacking such a large civilian target as the WTC in peacetime is unprecedented and completely wrong. The people who did/would do that need to be fought.
Re:A request (Score:2, Informative)
Well, there goes all my karma.
Re:Time for some highly unpopular opinion... (Score:2, Informative)
I am not saying that the US does not help other countries. We quite often come to the support of our allies and troubled nations.
However, one can not ignore the horrors we commit. The 150,000 deaths we are responsible for in Guatemala. Supporting De Beers and the diamond cartel and the slavery and thousands of deaths involved there. The whole cold war and how the USSR and the US used entire other nations as pawns in the war against each other. Our intelligence leaking information to US companies and helping to eliminate foreign competition.
There are a number of books on these subjects. A couple good reads are "The Good Citizen" and "Corporate Media and the Threat to Democracy."
A lot of Americans seem to believe that others get pissed off at us just because we are doing well. What people are pissed off at us for are the horrid things we do that aren't covered by American media.
Apache::SizeLimit (Score:2, Informative)
Re:No need to reverselookup hostnames for geotarge (Score:2, Informative)