
Amazon Explains Why S3 Went Down

Angostura writes "Amazon has provided a decent write-up of the problems that caused its S3 storage service to fail for around 8 hours last Sunday. It provides a timeline of events, the immediate action taken to fix it (they pulled the big red switch) and what the company is doing to prevent a recurrence. In summary: a random bit got flipped in one of the server state messages that the S3 machines continuously pass back and forth. There was no checksum on these messages, and the erroneous information was propagated across the cloud, causing so much inter-server chatter that no customer work got done."
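The failure mode in the summary is easy to reproduce in miniature. Below is a toy Python sketch (the message layout is invented for illustration; Amazon's actual gossip format isn't public) showing how a single flipped bit in an unchecksummed state message still parses cleanly and simply becomes the new, wrong truth:

    import struct

    def pack_state(node_id, healthy, shard_count):
        # Hypothetical fixed layout: u32 node id, u8 health flag, u32 shard count.
        return bytearray(struct.pack("!IBI", node_id, healthy, shard_count))

    def unpack_state(msg):
        node_id, healthy, shard_count = struct.unpack("!IBI", bytes(msg))
        return {"node_id": node_id, "healthy": bool(healthy), "shard_count": shard_count}

    msg = pack_state(node_id=17, healthy=1, shard_count=64)
    msg[4] ^= 0x01                    # one bit flips in transit or in memory
    print(unpack_state(msg))          # parses fine, but node 17 now looks unhealthy

With no integrity check on the message, a receiver has no way to tell this corrupted state from a legitimate update, which is how the bad information spread.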
  • by Ctrl-Z ( 28806 ) <timNO@SPAMtimcoleman.com> on Saturday July 26, 2008 @05:43PM (#24351465) Homepage Journal
    Thank you Capt. Obvious. A single bit is enough to cause a cascading failure, and someone overlooked this instance. It's not the first time, nor will it be the last. See New York City blackout of 1977 [slashdot.org], The Crash of the AT&T Network in 1990 [dmine.com], et al.
  • haha (Score:5, Informative)

    by msauve ( 701917 ) on Saturday July 26, 2008 @05:56PM (#24351603)
    For those who don't know what you're referring to, like the AC who commented: search in this for "evil bit" [faqs.org].
  • Re:Lost time? (Score:5, Informative)

    by Anpheus ( 908711 ) on Saturday July 26, 2008 @06:14PM (#24351771)

    Read "the system's state cleared" as "we turned everything off" and they proceeded to turn every server on one by one until around 3PM when the EU location was complete and not showing any symptoms.

  • Re:Lost time? (Score:3, Informative)

    by Anonymous Coward on Saturday July 26, 2008 @06:21PM (#24351825)

    Sun Enterprise and most other enterprise servers take 30-45 minutes to pass the PROM tests after you start them up. That's per server, before you even boot the OS. My guess is they brought up servers in large batches of 25% at a time and took about an hour per batch.

  • by j. andrew rogers ( 774820 ) on Saturday July 26, 2008 @06:44PM (#24352053)

    It has been generally well-known for a number of years now that any time you have a large cluster you cannot count on hardware checksums to catch every bit flip that may occur during copies and transmission, particularly with consumer hardware, which has many internal paths with no checksums at all. Google learned this the hard way, like the supercomputing people before them, and now Amazon has learned it after Google. Some of the better database engines also do their own internal software checksums to catch errors that the hardware misses as data gets copied across the silicon, disks, and network -- it is one way they get their very high uptime and low failure rate.

    It does not reflect well on the software community that most people *still* do not know to do this for very large scale system designs. The performance cost of doing a software CRC on your data every time it is moved around is low enough that it is generally worth it these days. If your system is large enough, the probability of getting bitten by this approaches unity. Very fast implementations of Adler-32 and other high-performance checksum algorithms are widely available online.
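    As a rough illustration of how cheap that protection is, here is a minimal Python sketch that wraps each payload in an Adler-32 checksum via zlib and verifies it on receipt; the 4-byte framing is invented for the example, not any particular system's wire format:

        import struct
        import zlib

        def wrap(payload: bytes) -> bytes:
            # Prepend a 4-byte Adler-32 checksum of the payload.
            return struct.pack("!I", zlib.adler32(payload) & 0xFFFFFFFF) + payload

        def unwrap(message: bytes) -> bytes:
            (expected,) = struct.unpack("!I", message[:4])
            payload = message[4:]
            if zlib.adler32(payload) & 0xFFFFFFFF != expected:
                raise ValueError("checksum mismatch: drop the message and re-request it")
            return payload

        msg = bytearray(wrap(b"node 17: healthy, 64 shards"))
        msg[10] ^= 0x20                          # simulate a single flipped bit
        try:
            unwrap(bytes(msg))
        except ValueError as err:
            print(err)                           # corruption is detected, not propagated

    zlib.crc32 drops in the same way if a CRC is preferred; as the parent notes, either check is cheap relative to moving the data in the first place.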

  • Re:haha (Score:3, Informative)

    by ivoras ( 455934 ) <ivoras AT fer DOT hr> on Saturday July 26, 2008 @07:22PM (#24352391) Homepage
    Not widely known, but the RFC was actually implemented, at least once: http://lists.freebsd.org/pipermail/cvs-all/2003-April/001098.html [freebsd.org] :)
  • by innerweb ( 721995 ) on Saturday July 26, 2008 @10:41PM (#24354031)

    After working in a production environment as a developer, I can assure you that the correct interpretation is "Anybody else noticed that these major problems always occur when no one is around to catch them (before they get out of hand)?"

    InnerWeb

  • by this great guy ( 922511 ) on Saturday July 26, 2008 @11:26PM (#24354399)
    Even ECC memory isn't a panacea. ECC can only correct 1-bit errors. It can't correct 2-bit errors (only detect them) and can't even detect or correct 3-bit (or more) errors. To the poster who seemed to think that a 1-bit error causing downtime is a sign of a defective design: the truth is that 99.9% of the software out there doesn't even try to work around data corruption issues. One can easily introduce 1-bit errors capable of crashing virtually any app. For example, flipping 1 bit of the first byte of the MBR (master boot record) can make an OS unbootable, because it changes the opcode of what is usually a JMP instruction to something else.
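    A minimal sketch of where the "correct one bit, detect two bits" behaviour comes from, using a textbook Hamming(7,4) code plus an overall parity bit (SECDED). Real ECC DIMMs implement wider codes in hardware, so this Python version only illustrates the principle:

        def encode(nibble):
            # Hamming(7,4): data bits at positions 3, 5, 6, 7; parity bits at 1, 2, 4.
            d = [(nibble >> i) & 1 for i in range(4)]
            p1 = d[0] ^ d[1] ^ d[3]
            p2 = d[0] ^ d[2] ^ d[3]
            p4 = d[1] ^ d[2] ^ d[3]
            code = [p1, p2, d[0], p4, d[1], d[2], d[3]]
            overall = 0
            for bit in code:
                overall ^= bit
            return code + [overall]              # 8th bit: parity over the whole word

        def decode(code):
            # Returns (data, status): status is "ok", "corrected", or "double-error".
            code = list(code)
            syndrome = 0
            for pos in range(1, 8):
                if code[pos - 1]:
                    syndrome ^= pos              # XOR of the positions of set bits
            overall = 0
            for bit in code:
                overall ^= bit
            if syndrome == 0 and overall == 0:
                status = "ok"
            elif overall == 1:                   # overall parity broken: single-bit error
                if syndrome:
                    code[syndrome - 1] ^= 1      # the syndrome points at the bad bit
                else:
                    code[7] ^= 1                 # the overall parity bit itself flipped
                status = "corrected"
            else:                                # parity looks fine but syndrome isn't zero
                status = "double-error"          # two flips: detectable, not correctable
            data = code[2] | (code[4] << 1) | (code[5] << 2) | (code[6] << 3)
            return data, status

        word = encode(0b1011)
        word[5] ^= 1                             # one flipped bit gets corrected
        print(decode(word))                      # -> (11, 'corrected')
        word[5] ^= 1                             # restore, then flip two different bits
        word[1] ^= 1
        word[6] ^= 1
        print(decode(word))                      # -> (3, 'double-error'): flagged, not silently wrong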
  • by Maxmin ( 921568 ) on Sunday July 27, 2008 @01:13AM (#24355029)

    ECC can only correct 1-bit errors. It can't correct 2-bit errors (only detect them) and can't even detect or correct 3-bit (or more) errors.

    No, that's just one kind of memory system. There are a number of designs, and recovery also depends on the kind of error. IIRC, one design is somewhat similar to the CD Red Book spec, in that the bits for a given byte are distributed around - a physical byte is composed of bits all from different memory locations. If part or all of one byte goes bad, the rest of the bits and the parity code are unchanged, and the affected bytes can be reconstructed.

    Also like Red Book CDs are multiply-redundant memory systems, with (just what it sounds like) multiple copies of each byte, where the memory controller arbitrates differences. CDs effectively contain three copies of the data, striped and parity-encoded. That's how scratched CDs can still operate error-free (sometimes). The space shuttle's computer systems are relatively fault-tolerant: multiple redundant computers all run the same programs and data, with a fourth computer evaluating the output of the other computers, looking for failures (see the voting sketch at the end of this comment).

    Where there's a will, there's a way, but the will in the mainstream x86 server industry to build truly fault-tolerant computers is slim. It's a specialty, and that makes it very expensive. Stratus [stratus.com], for example, makes a line of fault-tolerant servers [stratus.com], with some of the fail-over in hardware, so they can make their 99.999% uptime claim (about 5 minutes of downtime per year).

    "Five nines [wikipedia.org]" is a claim I've heard from most top-dollar *nix hosting companies, but have *never* experienced - it's generally been hours of downtime per year. Not even their network infrastructure gets close to 99.999% uptime! Cadillac prices, but downtime contingency planning is all up to the client, even with "managed hosting." They all suck.

  • by iluvcapra ( 782887 ) on Sunday July 27, 2008 @02:53AM (#24355643)

    No, that's just one kind of memory system. There are a number of designs, and recovery also depends on the kind of error. IIRC, one design is somewhat similar to the CD Red Book spec, in that the bits for a given byte are distributed around - a physical byte is composed of bits all from different memory locations.

    Red Book audio CDs, Sony MiniDiscs and DATs all use a form of Cross-Interleaved Reed-Solomon [wikipedia.org] coding, which has the nice characteristic of being able to exploit the fact that a piece of information is known to be missing when reconstructing the original signal, whereas other systems can't necessarily be improved by being told the difference between an "error" and an "erasure." Side information about "known-bad" media areas is a natural fit for physical media, but not necessarily for serial data or other things.

    CDs also have parity bits on every (EFM-encoded) byte on the media, which can contribute to the "erasure" side information along with tracking data from the laser. Also working in the CD's favor is the fact that it carries relatively low-information PCM data, so if a sample or two is completely lost, the decoding device can just do a first-order interpolation between the surrounding known-good samples (see the sketch at the end of this comment). This is why a CD can sound excellent until one day it just won't play at all, without any significant period of declining quality: errors accumulate until you reach a critical point where the player can't spackle over them anymore, and it just gives up.

    only OT, FYI :)
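    Since the interpolation trick is simple to show concretely, here is a toy Python version of that concealment step; real players do this on the decoded PCM stream in hardware, so this is only the idea:

        def conceal(samples):
            # Replace samples flagged as lost (None) with a first-order interpolation
            # between the nearest known-good neighbours.
            out = list(samples)
            for i, s in enumerate(out):
                if s is None:
                    prev = next(out[j] for j in range(i - 1, -1, -1) if out[j] is not None)
                    nxt = next(out[j] for j in range(i + 1, len(out)) if out[j] is not None)
                    out[i] = (prev + nxt) // 2
            return out

        pcm = [100, 220, None, 460, 580]         # one sample lost to an uncorrectable error
        print(conceal(pcm))                      # -> [100, 220, 340, 460, 580]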

  • by afidel ( 530433 ) on Monday July 28, 2008 @02:48AM (#24364891)
    All decent servers use multibit ECC, and the better ones use IBM's Chipkill technology, which is basically RAID for RAM: it uses an extra memory chip to do parity calculations.
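    A toy sketch of that "RAID for RAM" intuition: stripe the data across several chips, keep one extra chip holding the XOR parity, and rebuild any single failed chip from the survivors. Real Chipkill ECC uses symbol-based codes rather than plain XOR, so this Python version is only the concept:

        from functools import reduce

        def xor_bytes(chunks):
            # Byte-wise XOR of equally sized byte strings.
            return bytes(reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), chunks))

        data_chips = [b"\x01\x02", b"\x03\x04", b"\x05\x06", b"\x07\x08"]   # 4 data chips
        parity_chip = xor_bytes(data_chips)                                 # 1 parity chip

        def rebuild(failed_index):
            # Recover the failed chip's contents from the surviving chips plus parity.
            survivors = [c for i, c in enumerate(data_chips) if i != failed_index]
            return xor_bytes(survivors + [parity_chip])

        print(rebuild(2))    # -> b'\x05\x06', the lost chip's bytes come back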
