Amazon Explains Why S3 Went Down

Angostura writes "Amazon has provided a decent write-up of the problems that caused its S3 storage service to fail for around 8 hours last Sunday. It provides a timeline of events, the immediate action taken to fix it (they pulled the big red switch), and what the company is doing to prevent a recurrence. In summary: a random bit got flipped in one of the server state messages that the S3 machines continuously pass back and forth. There was no checksum on these messages, so the erroneous information propagated across the cloud, causing so much inter-server chatter that no customer work got done."
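
To make the summary concrete: the fix for this class of failure is to checksum each state message and verify it on receipt, so a flipped bit is rejected at the receiver instead of being gossiped onward. The following is a minimal Python sketch of that idea, with hypothetical message contents and function names; it is not Amazon's actual protocol or code.

```python
import json
import zlib

def encode_state(state: dict) -> bytes:
    """Serialize a (hypothetical) server state message and prepend a CRC32 checksum."""
    payload = json.dumps(state, sort_keys=True).encode()
    return zlib.crc32(payload).to_bytes(4, "big") + payload

def decode_state(message: bytes) -> dict:
    """Verify the checksum before trusting the message; reject corrupted gossip."""
    expected, payload = int.from_bytes(message[:4], "big"), message[4:]
    if zlib.crc32(payload) != expected:
        raise ValueError("corrupted state message: drop it and ask the peer to resend")
    return json.loads(payload)

# A single flipped bit is now caught at the receiver instead of being passed along.
msg = bytearray(encode_state({"node": "storage-host-42", "status": "healthy"}))
msg[10] ^= 0x01                      # simulate a one-bit corruption in transit
try:
    decode_state(bytes(msg))
except ValueError as err:
    print("rejected:", err)
```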

  • Simple (Score:3, Insightful)

    by gardyloo ( 512791 ) on Saturday July 26, 2008 @05:46PM (#24351511)

    S3 is a total slut.

  • by nacturation ( 646836 ) * <nacturation AT gmail DOT com> on Saturday July 26, 2008 @05:51PM (#24351555) Journal

    They need to start using Erlang more. It's designed specifically for building highly distributed, concurrent systems that must scale to millions of transactions per minute. So Erlang is a natural fit for what Amazon is trying to offer with its S3 service.

    I think Erlang's cool and all, but it's not the magic bullet that will solve this. It's still possible for information to get corrupted during message passing between Erlang processes (say, as the result of an intermittently failing network switch), just as it is in any language.
     

  • by Manip ( 656104 ) on Saturday July 26, 2008 @06:10PM (#24351753)

    Other large businesses could learn a lot from Amazon's example.

    How often do you get the problem actually explained to you, along with an apology and a reasonable set of changes to stop it from occurring again?

    Most businesses would never explain the root of any problem. They simply list "hardware issues." And they NEVER say sorry anymore - supposedly it opens them up to more liability or something.

    If I were an Amazon customer, I would be happy with their explanation and apology, even though the downtime is obviously still an issue.

  • by erc ( 38443 ) <erc AT pobox DOT com> on Saturday July 26, 2008 @06:16PM (#24351785) Homepage

    Or they could checksum their UDP packets. The entire packet, not just the customer payload. Duh.
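
UDP does already carry its own 16-bit checksum (optional over IPv4), but the broader point in the comment above, covering everything you care about rather than just the payload, can be shown with a toy Python sketch. The framing and field layout here are made up for illustration and are not S3's actual wire format.

```python
import zlib

def build_packet(header: bytes, payload: bytes, cover_header: bool) -> bytes:
    """Append a CRC32 over either the whole packet or only the payload (toy framing)."""
    covered = header + payload if cover_header else payload
    return header + payload + zlib.crc32(covered).to_bytes(4, "big")

def verify(packet: bytes, header_len: int, cover_header: bool) -> bool:
    body, crc = packet[:-4], int.from_bytes(packet[-4:], "big")
    covered = body if cover_header else body[header_len:]
    return zlib.crc32(covered) == crc

header, payload = b"STATEMSG", b'{"node": 42, "status": "ok"}'

payload_only = bytearray(build_packet(header, payload, cover_header=False))
whole_packet = bytearray(build_packet(header, payload, cover_header=True))
payload_only[0] ^= 0x01              # corrupt a header byte in each variant
whole_packet[0] ^= 0x01

print(verify(bytes(payload_only), len(header), cover_header=False))  # True: corruption missed
print(verify(bytes(whole_packet), len(header), cover_header=True))   # False: corruption caught
```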

  • by FalcDot ( 1224920 ) on Saturday July 26, 2008 @06:33PM (#24351957)

    Looking back, I feel that the one thing all our technological progress has given us more than anything else is more and better means of communication.

    You can talk to people on the other side of the world, heck, on the other side of the solar system if you don't mind the delay. Video feeds, planes that'll actually get you there in less than a day, ...

    And yet, with all of this, it seems that we're not actually doing it. A company explaining what went wrong is the exception. Internet forums without flamers and trolls? Exception.

    "Anything you say can and will be used against you."

  • by edsousa ( 1201831 ) on Saturday July 26, 2008 @07:17PM (#24352353) Journal
    This message is written by someone who writes real parallel, distributed, and concurrent code (they are not all the same thing):
    Erlang or any other functional language will not make up for a lack of good design. If you have a good design that addresses the right concerns, you can implement it in Java, C, Fortran, or ASM, and if done right, it will work.
    I'm sick of hearing "Erlang is THE solution". It is not. Good design and implementation practices are.
  • by TheRaven64 ( 641858 ) on Saturday July 26, 2008 @07:48PM (#24352627) Journal

    True, but not particularly informative. The point is to detect errors. Error correction is nice, but error detection is enough if the sender can then retransmit. Oh, and your 'proof' is flawed, since you are completely ignoring the fact that any correction scheme contains redundant information, so while n bits might work instead of n+1 bits, n-1 might not.

    If the last bit is missing, then your receiver knows that there is an error. If the last bit is flipped, then it knows that there is an error. A checksum can be very simple and just give the count of the number of bits that are set in the message. This will protect from single-bit errors, but an error in both the message and the checksum can cause an erroneous packet to pass. It's basically a hash, and the idea is to make sure that hash collisions are as infrequent as possible.

    You can usually guarantee a maximum number of errors in the network and it's possible to design correction schemes which will detect any n-bit error.

    In some cases, it's possible to accurately detect all errors because all failures will be in one direction (0 to 1 or 1 to 0, but not both). In this case, an XOR'd copy of the message will work, because any bit in the message flipping from 0 to 1 needs a corresponding bit in the check flipping from 1 to 0 (there are shorter encoding schemes that work in this case too, but this is a very simple example).
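
A short Python sketch of the bit-count checksum described in the comment above, showing the single-bit error it catches and the compensating two-bit error that slips past it (which is why practical schemes use CRCs or stronger hashes); the message bytes are arbitrary examples.

```python
def popcount_checksum(data: bytes) -> int:
    """The simple checksum described above: the total number of set bits in the message."""
    return sum(bin(b).count("1") for b in data)

msg = b"\x0f\xf0"                                    # 8 bits set in total
chk = popcount_checksum(msg)

# A single flipped bit changes the count, so the receiver notices.
single_flip = bytes([msg[0] ^ 0x01, msg[1]])
print(popcount_checksum(single_flip) == chk)         # False: error detected

# One bit set plus one bit cleared leaves the count unchanged, so it slips through.
compensating = bytes([msg[0] ^ 0x10, msg[1] ^ 0x10])
print(popcount_checksum(compensating) == chk)        # True: error undetected
```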

  • Well, technically speaking, there isn't an apology there:

    Finally, we want you to know that we are passionate about providing the best storage service at the best price so that you can spend more time thinking about your business rather than having to focus on building scalable, reliable infrastructure. Though we're proud of our operational performance in operating Amazon S3 for almost 2.5 years, we know that any downtime is unacceptable and we won't be satisfied until performance is statistically indistinguishable from perfect.

    Allow me to translate:

    We screwed up. We'll do better next time.

    Nowhere in the document do the words "I'm sorry" appear. The apology is left entirely implied.

  • by spinkham ( 56603 ) on Saturday July 26, 2008 @10:43PM (#24354047)

    No, I mean favoring speed and computational simplicity over error detection.
    It is often a valid trade-off. For example, most filesystems do not validate the stored data at all, for size and computational reasons. As hard drives and arrays get bigger, that trade-off no longer makes much sense, and almost all new filesystems being designed have hash-based error detection built in at some level.
    Good design takes experience. There aren't that many systems like S3 that have been built in the past, and there are many tricky decisions to be made. No system gets it all correct out of the gate.
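
As a rough illustration of the hash-based detection mentioned above, here is a toy in-memory block store in Python that writes each block together with a SHA-256 digest and re-verifies it on read. It is not modeled on any particular filesystem's on-disk layout.

```python
import hashlib

class CheckedStore:
    """Toy block store that keeps a SHA-256 digest alongside each block."""

    def __init__(self):
        self._blocks = {}                      # block_id -> (data, digest)

    def write(self, block_id: str, data: bytes) -> None:
        self._blocks[block_id] = (data, hashlib.sha256(data).digest())

    def read(self, block_id: str) -> bytes:
        data, digest = self._blocks[block_id]
        if hashlib.sha256(data).digest() != digest:
            raise IOError(f"block {block_id} failed its integrity check")
        return data

store = CheckedStore()
store.write("blk-0", b"hello world")

# Simulate silent corruption of the stored bytes; a plain read would never notice.
_, digest = store._blocks["blk-0"]
store._blocks["blk-0"] = (b"hellp world", digest)

try:
    store.read("blk-0")
except IOError as err:
    print(err)
```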

  • by Alpha830RulZ ( 939527 ) on Saturday July 26, 2008 @11:41PM (#24354497)

    The words may not be there, but there is a pretty clear message that they are not happy or smug about this event, that they agree with the customer that this shouldn't have happened, and that it won't happen again if they can help it. That's enough for me.

    And I'm actually one of their customers, unlike some of the dilettantes here. We use S3 and EC2 to manage training and demo instances of our software, and we are pretty pleased so far.

