Amazon Explains Why S3 Went Down

Angostura writes "Amazon has provided a decent write-up of the problems that caused its S3 storage service to fail for around 8 hours last Sunday. It provides a timeline of events, the immediate action taken to fix it (they pulled the big red switch), and what the company is doing to prevent a recurrence. In summary: a random bit got flipped in one of the server state messages that the S3 machines continuously pass back and forth. There was no checksum on these messages, so the erroneous information propagated across the cloud, causing so much inter-server chatter that no customer work got done."
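
To make the summary concrete: the fix for this class of failure is to checksum each state message and verify it on receipt, so a flipped bit is rejected at the receiver instead of being gossiped onward. The following is a minimal Python sketch of that idea, with hypothetical message contents and function names; it is not Amazon's actual protocol or code.

```python
import json
import zlib

def encode_state(state: dict) -> bytes:
    """Serialize a (hypothetical) server state message and prepend a CRC32 checksum."""
    payload = json.dumps(state, sort_keys=True).encode()
    return zlib.crc32(payload).to_bytes(4, "big") + payload

def decode_state(message: bytes) -> dict:
    """Verify the checksum before trusting the message; reject corrupted gossip."""
    expected, payload = int.from_bytes(message[:4], "big"), message[4:]
    if zlib.crc32(payload) != expected:
        raise ValueError("corrupted state message: drop it and ask the peer to resend")
    return json.loads(payload)

# A single flipped bit is now caught at the receiver instead of being passed along.
msg = bytearray(encode_state({"node": "storage-host-42", "status": "healthy"}))
msg[10] ^= 0x01                      # simulate a one-bit corruption in transit
try:
    decode_state(bytes(msg))
except ValueError as err:
    print("rejected:", err)
```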

  • Simple (Score:3, Insightful)

    by gardyloo ( 512791 ) on Saturday July 26, 2008 @05:46PM (#24351511)

    S3 is a total slut.

  • by nacturation ( 646836 ) * <nacturation AT gmail DOT com> on Saturday July 26, 2008 @05:51PM (#24351555) Journal

    They need to start using Erlang more. It's designed specifically for building highly distributed, concurrent systems that must scale to millions of transactions per minute. So Erlang is a natural fit for what Amazon is trying to offer with its S3 service.

    I think Erlang's cool and all, but it's not the magic bullet that will solve this. It's still possible for information to get corrupted during message passing between Erlang processes (say, as the result of an intermittently failing network switch), just as it is in any language.
     

  • by Manip ( 656104 ) on Saturday July 26, 2008 @06:10PM (#24351753)

    Other large businesses could learn a lot from Amazon's example.

    How often do you get the problem actually explained to you, along with an apology and a reasonable set of changes to stop it from occurring again?

    Most businesses would never explain the root of any problem. They simply list "hardware issues." And they NEVER say sorry anymore - supposedly it opens them up to more liability or something.

    If I were an Amazon customer, I would be happy with their explanation and apology, even though the downtime is obviously still an issue.

  • by erc ( 38443 ) <erc AT pobox DOT com> on Saturday July 26, 2008 @06:16PM (#24351785) Homepage

    Or they could checksum their UDP packets. The entire packet, not just the customer payload. Duh.
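
UDP does already carry its own 16-bit checksum (optional over IPv4), but the broader point in the comment above, covering everything you care about rather than just the payload, can be shown with a toy Python sketch. The framing and field layout here are made up for illustration and are not S3's actual wire format.

```python
import zlib

def build_packet(header: bytes, payload: bytes, cover_header: bool) -> bytes:
    """Append a CRC32 over either the whole packet or only the payload (toy framing)."""
    covered = header + payload if cover_header else payload
    return header + payload + zlib.crc32(covered).to_bytes(4, "big")

def verify(packet: bytes, header_len: int, cover_header: bool) -> bool:
    body, crc = packet[:-4], int.from_bytes(packet[-4:], "big")
    covered = body if cover_header else body[header_len:]
    return zlib.crc32(covered) == crc

header, payload = b"STATEMSG", b'{"node": 42, "status": "ok"}'

payload_only = bytearray(build_packet(header, payload, cover_header=False))
whole_packet = bytearray(build_packet(header, payload, cover_header=True))
payload_only[0] ^= 0x01              # corrupt a header byte in each variant
whole_packet[0] ^= 0x01

print(verify(bytes(payload_only), len(header), cover_header=False))  # True: corruption missed
print(verify(bytes(whole_packet), len(header), cover_header=True))   # False: corruption caught
```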

  • by FalcDot ( 1224920 ) on Saturday July 26, 2008 @06:33PM (#24351957)

    Looking back, I feel that the one thing all our technological progress has given us more than anything else is more and better means of communication.

    You can talk to people on the other side of the world, heck, on the other side of the solar system if you don't mind the delay. Video feeds, planes that'll actually get you there in less than a day, ...

    And yet, with all of this, it seems that we're not actually doing it. A company explaining what went wrong is the exception. Internet forums without flamers and trolls? Exception.

    "Anything you say can and will be used against you."

  • by edsousa ( 1201831 ) on Saturday July 26, 2008 @07:17PM (#24352353) Journal
    This message is written by someone who writes real parallel, distributed, and concurrent code (they are not all the same thing):
    Erlang or any other functional language will not make up for a lack of good design. If you have a good design that addresses the right concerns, you can implement it in Java, C, Fortran, or ASM, and if done right, it will work.
    I'm sick of hearing "Erlang is THE solution". It is not. Good design and implementation practices are.
  • by TheRaven64 ( 641858 ) on Saturday July 26, 2008 @07:48PM (#24352627) Journal

    True, but not particularly informative. The point is to detect errors. Error correction is nice, but error detection is enough if the sender can then retransmit. Oh, and your 'proof' is flawed, since you are completely ignoring the fact that any correction scheme contains redundant information, so while n bits might work instead of n+1 bits, n-1 might not.

    If the last bit is missing, then your receiver knows that there is an error. If the last bit is flipped, then it knows that there is an error. A checksum can be very simple and just give the count of the number of bits that are set in the message. This will protect from single-bit errors, but an error in both the message and the checksum can cause an erroneous packet to pass. It's basically a hash, and the idea is to make sure that hash collisions are as infrequent as possible.

    You can usually guarantee a maximum number of errors in the network and it's possible to design correction schemes which will detect any n-bit error.

    In some cases, it's possible to accurately detect all errors because all failures will be in one direction (0 to 1 or 1 to 0, but not both). In this case, an XOR'd copy of the message will work, because any bit in the message flipping from 0 to 1 needs a corresponding bit in the check flipping from 1 to 0 (there are shorter encoding schemes that work in this case too, but this is a very simple example).
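
A short Python sketch of the bit-count checksum described in the comment above, showing the single-bit error it catches and the compensating two-bit error that slips past it (which is why practical schemes use CRCs or stronger hashes); the message bytes are arbitrary examples.

```python
def popcount_checksum(data: bytes) -> int:
    """The simple checksum described above: the total number of set bits in the message."""
    return sum(bin(b).count("1") for b in data)

msg = b"\x0f\xf0"                                    # 8 bits set in total
chk = popcount_checksum(msg)

# A single flipped bit changes the count, so the receiver notices.
single_flip = bytes([msg[0] ^ 0x01, msg[1]])
print(popcount_checksum(single_flip) == chk)         # False: error detected

# One bit set plus one bit cleared leaves the count unchanged, so it slips through.
compensating = bytes([msg[0] ^ 0x10, msg[1] ^ 0x10])
print(popcount_checksum(compensating) == chk)        # True: error undetected
```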

  • Well, technically speaking, there isn't an apology there:

    Finally, we want you to know that we are passionate about providing the best storage service at the best price so that you can spend more time thinking about your business rather than having to focus on building scalable, reliable infrastructure. Though we're proud of our operational performance in operating Amazon S3 for almost 2.5 years, we know that any downtime is unacceptable and we won't be satisfied until performance is statistically indistinguishable from perfect.

    Allow me to translate:

    We screwed up. We'll do better next time.

    Nowhere in the document do the words "I'm sorry" appear. The apology is left entirely implied.

  • by spinkham ( 56603 ) on Saturday July 26, 2008 @10:43PM (#24354047)

    No, I mean favoring speed and computational simplicity over error detection.
    It is often a valid trade-off. For example, most filesystems do not validate the stored data at all, for size and computational reasons. As hard drives and arrays get bigger, that trade-off no longer makes much sense, and almost all new filesystems being designed have hash-based error detection built in at some level.
    Good design takes experience. There aren't that many systems like S3 that have been built in the past, and there are many tricky decisions to be made. No system gets it all correct out of the gate.
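
As a rough illustration of the hash-based detection mentioned above, here is a toy in-memory block store in Python that writes each block together with a SHA-256 digest and re-verifies it on read. It is not modeled on any particular filesystem's on-disk layout.

```python
import hashlib

class CheckedStore:
    """Toy block store that keeps a SHA-256 digest alongside each block."""

    def __init__(self):
        self._blocks = {}                      # block_id -> (data, digest)

    def write(self, block_id: str, data: bytes) -> None:
        self._blocks[block_id] = (data, hashlib.sha256(data).digest())

    def read(self, block_id: str) -> bytes:
        data, digest = self._blocks[block_id]
        if hashlib.sha256(data).digest() != digest:
            raise IOError(f"block {block_id} failed its integrity check")
        return data

store = CheckedStore()
store.write("blk-0", b"hello world")

# Simulate silent corruption of the stored bytes; a plain read would never notice.
_, digest = store._blocks["blk-0"]
store._blocks["blk-0"] = (b"hellp world", digest)

try:
    store.read("blk-0")
except IOError as err:
    print(err)
```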

  • by Alpha830RulZ ( 939527 ) on Saturday July 26, 2008 @11:41PM (#24354497)

    The words may not be there, but there is a pretty clear message that they are not happy or smug about this event, that they agree with the customer that this shouldn't have happened, and that it won't happen again if they can help it. That's enough for me.

    And I'm actually one of their customers, unlike some of the dilettantes here. We use S3 and EC2 to manage training and demo instances of our software, and we are pretty pleased so far.

