
Amazon Explains Why S3 Went Down

Angostura writes "Amazon has provided a decent write-up of the problems that caused its S3 storage service to fail for around 8 hours last Sunday. It provides a timeline of events, the immediate action taken to fix it (they pulled the big red switch) and what the company is doing to prevent a recurrence. In summary: A random bit got flipped in one of the server state messages that the S3 machines continuously pass back and forth. There was no checksum on these messages, and the erroneous information was propagated across the cloud, causing so much inter-server chatter that no customer work got done."
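
For illustration only (the write-up doesn't describe the message format, so the fields and framing below are made up), a minimal Python sketch of the kind of per-message checksum that would have let a receiver drop the corrupted state instead of gossiping it onward:

    import json
    import zlib

    def encode_state_message(state: dict) -> bytes:
        # Serialize the (hypothetical) server-state message and append a CRC-32.
        payload = json.dumps(state, sort_keys=True).encode("utf-8")
        return payload + zlib.crc32(payload).to_bytes(4, "big")

    def decode_state_message(raw: bytes) -> dict:
        # Verify the checksum before trusting the message; reject it on mismatch.
        payload, crc = raw[:-4], int.from_bytes(raw[-4:], "big")
        if zlib.crc32(payload) != crc:
            raise ValueError("corrupted state message dropped")
        return json.loads(payload)

    # A single flipped bit is now caught at the receiver.
    msg = bytearray(encode_state_message({"server": "node-17", "status": "healthy"}))
    msg[5] ^= 0x01                      # flip one bit in the payload
    decode_state_message(bytes(msg))    # raises ValueError
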
This discussion has been archived. No new comments can be posted.

Comments Filter:
  • by thrillseeker ( 518224 ) on Saturday July 26, 2008 @05:34PM (#24351385)
    a single bit?! I think there are some serious design deficiencies ...
  • Re:Lost time? (Score:1, Interesting)

    by Anonymous Coward on Saturday July 26, 2008 @06:17PM (#24351793)

    That's how long it took things to come back online.

    Well that explains everything. I suppose I should expect that since it goes to 11....

    Four hours is a Longgg time... and it was five hours for the US.

    Why did it take four hours? Why not 40 minutes or four days?

    There is a big hole in the story.

  • by Anonymous Coward on Saturday July 26, 2008 @06:25PM (#24351869)

    I remember at least one similar incident:

    In the early (or earlyish) ARPANET, the network controllers did not implement checksums on messages. A bit flip caused the whole shebang to keep on forwarding the same message over and over, to the point that nothing else would flow, not even a reset command. Someone had to be sent to the culprit box to manually reset it.

    I remember this being discussed in some technical circles for a while (I think Software Engineering Notes from the ACM had a go at it).

  • Re:...make lemonade. (Score:3, Interesting)

    by spinkham ( 56603 ) on Saturday July 26, 2008 @06:52PM (#24352129)

    There's probably information that changes as the packets move around, and they probably wanted to avoid the overhead. I'm guessing it was a deliberate design decision, but it turned out to be the wrong one. It's easy to see that after a failure, but it's hard to design large distributed systems and foresee every possible way things can break, and where the computation overhead is worth it. The number of interactions between servers here makes any small design flaw a big thing.
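
As a rough back-of-the-envelope check on that overhead argument (the message size below is a pure guess), the per-message cost of a checksum is easy to measure:

    import hashlib
    import timeit
    import zlib

    message = b"x" * 1024   # stand-in for a ~1 KB server-state message (assumed size)

    n = 100_000
    crc = timeit.timeit(lambda: zlib.crc32(message), number=n)
    md5 = timeit.timeit(lambda: hashlib.md5(message).digest(), number=n)

    print(f"CRC-32: {crc / n * 1e6:.2f} microseconds per message")
    print(f"MD5:    {md5 / n * 1e6:.2f} microseconds per message")

On commodity hardware both tend to come out in the low microseconds per kilobyte, so the saving is real but small; whether it was worth the risk is exactly the judgment call described above.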

  • by James Youngman ( 3732 ) <jay.gnu@org> on Saturday July 26, 2008 @07:17PM (#24352351) Homepage

    Adler-32 wouldn't be a great choice. It's fast but it's weak for short messages [ietf.org] and I've seen it fooled by multi-bit errors on large messages too.

    See Koopman's paper 32-bit cyclic redundancy codes for Internet applications [ieee.org] for some better ideas.
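
To illustrate the short-message weakness (this is independent of anything Amazon runs): for tiny inputs Adler-32's two internal sums stay small, so most of the 32-bit output space goes unused, while CRC-32 spreads the same inputs across the full range:

    import zlib

    # For a single-byte message both Adler-32 sums are at most 256, so the
    # checksum's high bits are almost all zero; CRC-32 shows no such clustering.
    for msg in (b"\x00", b"A", b"z", b"\xff"):
        print(f"{msg!r:8} adler32={zlib.adler32(msg):#010x}  crc32={zlib.crc32(msg):#010x}")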

  • by anti-NAT ( 709310 ) on Saturday July 26, 2008 @07:18PM (#24352367) Homepage

    Vulnerabilities of Network Control Protocols: An Example [ietf.org], published in January 1981.

    What do they say about those who ignore history?

  • Re:...make lemonade. (Score:5, Interesting)

    by Sique ( 173459 ) on Saturday July 26, 2008 @08:05PM (#24352803) Homepage

    I see you completely miss the point of the proof. I know that you can minimize the impact of a bit error with checksums, and that you can improve reliability by adding redundancy. But what is the consequence of error detection? Normally the protocol then asks for the message to be resent. But how do you (as the sender) know that the message finally arrived correctly? You wait for an acknowledgement. But what if the acknowledgement gets lost or scrambled? You add redundancy to the handshaking. But how does redundancy help reliability? You can ask for a resend due to detected errors, etc.

    Your protocol never finds an end because it has to secure the correctness of the security of the correctness of the security...

  • by James Youngman ( 3732 ) <jay.gnu@org> on Saturday July 26, 2008 @08:22PM (#24352981) Homepage

    I've seen Adler-32 fail twice, so your assumption of "a few times per millennium" doesn't seem to work for me (I'm under 40). In fact in those cases it was even stacked on top of at least one other checksum mechanism, too.

    One of the problems with it is poor spreading of the input bits into s2. There are other algorithms which don't have that weakness but don't (IIRC) cost any more to compute.

  • by SanityInAnarchy ( 655584 ) <ninja@slaphack.com> on Saturday July 26, 2008 @08:33PM (#24353079) Journal

    I think the real lesson here is simply that input over the network cannot ever be trusted.

    It is their network, and S3 is built on tech which is explicitly designed to not have any kind of security built-in. The security is applied at the API level, but any misbehaving machine within the S3 cluster could cause some serious damage.

    I actually agree with this philosophy, to an extent. After all, this is essentially a large number of computers acting as a hard disk. How would you approach talking to a hard disk in your own machine? Do you assume that everything is corrupt, untrusted, or wrong?
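
If the internal fabric really can't be trusted, accidental corruption and deliberate tampering can be rejected the same way. A minimal sketch (the shared key and message format here are invented, not anything S3 actually does):

    import hashlib
    import hmac

    SHARED_KEY = b"cluster-secret"      # assumed to be provisioned out of band

    def sign(payload: bytes) -> bytes:
        # Append an HMAC-SHA256 tag so receivers can authenticate the sender.
        return payload + hmac.new(SHARED_KEY, payload, hashlib.sha256).digest()

    def verify(raw: bytes) -> bytes:
        # Reject anything whose tag doesn't match: corrupted or forged alike.
        payload, tag = raw[:-32], raw[-32:]
        expected = hmac.new(SHARED_KEY, payload, hashlib.sha256).digest()
        if not hmac.compare_digest(tag, expected):
            raise ValueError("rejected: corrupted or unauthenticated message")
        return payload

    msg = sign(b'{"server": "node-17", "status": "healthy"}')
    verify(msg)                          # accepted
    tampered = bytearray(msg)
    tampered[3] ^= 0x01
    # verify(bytes(tampered))            # raises ValueError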

  • Re:...make lemonade. (Score:1, Interesting)

    by Anonymous Coward on Saturday July 26, 2008 @08:39PM (#24353137)

    That there's nothing you can do to get the correct message to the other end if the lower-level connection doesn't send anything (e.g. someone unplugged the Ethernet cable) or corrupts everything sent is spectacularly uninteresting.

    What people actually care about is that, given a connection with some statistical level of unreliability, you can build a reliable connection out of it. The less reliable the real connection, the more overhead you need. Obviously the "reliable connection" isn't actually perfectly reliable if the underlying connection doesn't meet the reliability level that was assumed.
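
A back-of-the-envelope version of that trade-off (the loss rates are arbitrary examples): if each transmission is lost independently with probability p, the message gets through at least once in k attempts with probability 1 - p^k, so driving the residual failure rate down costs more retransmissions as the link degrades:

    import math

    def attempts_needed(p_loss: float, target_failure: float) -> int:
        # Smallest k with p_loss ** k <= target_failure.
        return math.ceil(math.log(target_failure) / math.log(p_loss))

    for p_loss in (0.01, 0.1, 0.5):
        k = attempts_needed(p_loss, target_failure=1e-9)
        print(f"loss rate {p_loss}: {k} attempts for a one-in-a-billion residual failure")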

  • by sebastiengiroux ( 1333579 ) on Saturday July 26, 2008 @09:05PM (#24353345) Homepage
    Anybody else noticed that these major problems always occur when no one is around to fix them?
  • You have no idea... (Score:1, Interesting)

    by Anonymous Coward on Saturday July 26, 2008 @09:37PM (#24353579)

    As someone who worked at Amazon as a software engineer for over three years in various backend areas, I can say without a doubt that Amazon's code and production quality are so horrible that it's hard to believe.

    Engineers carry pagers and, in many groups, are constantly paged. The only thing that keeps the systems running is a bunch of junior engineers responding in the middle of the night, fixing databases, bouncing services, etc., etc. Engineers are rarely, if ever, given the chance to actually *fix* things, they're just supposed to band-aid them up.

    And, here's a big secret for you: When I left Amazon a little over a year ago, no development groups internally were even using EC2, S3, SQS, or any of the other web services they sell to you. They make it sound like you're using the same high-end services they use to satisfy tens of millions of customers. They're not.

  • by Ctrl-Z ( 28806 ) <tim&timcoleman,com> on Saturday July 26, 2008 @10:13PM (#24353811) Homepage Journal
    Actually, that should have been Northeast Blackout of 1965 [wikipedia.org]. But you already knew that.
  • ECC memory, anyone? (Score:4, Interesting)

    by Maxmin ( 921568 ) on Saturday July 26, 2008 @10:35PM (#24353981)

    I hafta wonder if the bit flipped due to a bad RAM stick?

    We use MD5 checksums throughout the system, for example, to prevent, detect, and recover from corruption that can occur during receipt, storage, and retrieval of customers' objects. However, we didn't have the same protection in place to detect whether this particular internal state information had been corrupted.

    Nothing specific about *what* caused the bit to flip.

    This comes to mind only because bad RAM on a new server at work caused installation of a stock Perl module to throw excessive errors during the XS compile phase - the same package installed without error on an identical machine 20 minutes earlier. Took over an hour before we realized it was probably hardware. Memtest86 [memtest86.com] quickly turned up the problem.

    Would hashes and the like protect against RAM suddenly going south? Wouldn't any piece of data that passes through main memory be vulnerable to corruption? Makes me wonder why ECC memory [wikipedia.org] isn't being used much anymore... we have various flavors of RAID to protect slow memory from corruption, but not many machines I see have ECC anymore.
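
For what it's worth, the single-bit correction that ECC DIMMs do in hardware is easy to sketch in software. A toy Hamming(7,4) example (real ECC memory uses wider SECDED codes, so this only illustrates the principle):

    def hamming74_encode(d: list[int]) -> list[int]:
        # Four data bits in, seven code bits out: [p1, p2, d1, p3, d2, d3, d4].
        d1, d2, d3, d4 = d
        p1 = d1 ^ d2 ^ d4
        p2 = d1 ^ d3 ^ d4
        p3 = d2 ^ d3 ^ d4
        return [p1, p2, d1, p3, d2, d3, d4]

    def hamming74_decode(c: list[int]) -> list[int]:
        # The syndrome gives the 1-based position of a single flipped bit (0 = none).
        s1 = c[0] ^ c[2] ^ c[4] ^ c[6]      # parity over positions 1, 3, 5, 7
        s2 = c[1] ^ c[2] ^ c[5] ^ c[6]      # parity over positions 2, 3, 6, 7
        s3 = c[3] ^ c[4] ^ c[5] ^ c[6]      # parity over positions 4, 5, 6, 7
        syndrome = s1 + 2 * s2 + 4 * s3
        if syndrome:
            c = c.copy()
            c[syndrome - 1] ^= 1            # correct the flipped bit
        return [c[2], c[4], c[5], c[6]]

    code = hamming74_encode([1, 0, 1, 1])
    code[4] ^= 1                            # simulate a single bit flip in "RAM"
    print(hamming74_decode(code))           # -> [1, 0, 1, 1], corrected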

  • Byzantine failure (Score:3, Interesting)

    by ge ( 12698 ) on Sunday July 27, 2008 @10:15AM (#24357655)

    So the whole cloud is in trouble if one node starts spewing nonsense? So much for redundancy. Amazon developers would be well advised to read up on the "Byzantine Generals" problem.
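
A toy illustration of why that matters (node counts and values here are made up): masking f arbitrarily misbehaving nodes by voting needs n >= 3f + 1 replicas, with quorums big enough that any two of them overlap in at least one correct node:

    from collections import Counter

    def agree(reports: list[str], f: int) -> str | None:
        # Accept a value only if it reaches a quorum of ceil((n + f + 1) / 2) votes.
        value, votes = Counter(reports).most_common(1)[0]
        quorum = (len(reports) + f) // 2 + 1
        return value if votes >= quorum else None

    # Seven nodes tolerate two nonsense-spewing nodes (7 >= 3*2 + 1) ...
    print(agree(["healthy"] * 5 + ["failed"] * 2, f=2))     # -> 'healthy'
    # ... but four nodes cannot.
    print(agree(["healthy"] * 2 + ["failed"] * 2, f=2))     # -> None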

"What man has done, man can aspire to do." -- Jerry Pournelle, about space flight

Working...