Amazon Explains Why S3 Went Down

Posted by timothy on Saturday July 26, 2008 @05:33PM from the not-mere-sluttiness dept.

Angostura writes "Amazon has provided a decent write-up of the problems that caused its S3 storage service to fail for around 8 hours last Sunday. It providers a timeline of events, the immediate action take to fix it (they pulled the big red switch) and what the company is doing to prevent re-occurrence. In summary: A random bit got flipped in one of the server state messages that the S3 machines continuously pass back and forth. There was no checksum on these messages, and the erroneous information was propagated across the cloud, causing so much inter-server chatter that no customer work got done."

for want of a nail ... (Score:5, Interesting)

by thrillseeker ( 518224 ) writes: on Saturday July 26, 2008 @05:34PM (#24351385)

a single bit?! I think there are some serious design deficiencies ...

- Re:for want of a nail ... (Score:5, Funny)
  
  by Daimanta ( 1140543 ) writes: on Saturday July 26, 2008 @05:37PM (#24351415) Journal
  
  It was the evil bit...
  
  - haha (Score:5, Informative)
    
    by msauve ( 701917 ) writes: on Saturday July 26, 2008 @05:56PM (#24351603)
    
    For those who don't know what you're referring to, like the AC who commented: search in this for "evil bit" [faqs.org].
    
    - Re:haha (Score:1)
      
      by Warll ( 1211492 ) writes: on Saturday July 26, 2008 @07:03PM (#24352235) Homepage
      
      That's just silly! Everyone with any byte of knowledge in this area would know that this is clearly a case of the lazy bit as first discovered by Sir Simon, Innocence, BOFH during his groundbreaking work on lazy atoms.
      
    - Re:haha (Score:3, Informative)
      
      by ivoras ( 455934 ) writes: <ivoras@@@fer...hr> on Saturday July 26, 2008 @07:22PM (#24352391) Homepage
      
      Not widely known, but the RFC was actually implemented, at least once: http://lists.freebsd.org/pipermail/cvs-all/2003-April/001098.html [freebsd.org] :)
      
  - Re:for want of a nail ... (Score:0)
    
    by Anonymous Coward writes: on Saturday July 26, 2008 @06:06PM (#24351709)
    
    You know, I understand when submitters or posters who may not speak English as their native tongue mess up some of the finer points of English grammar, but holy fucking sheepshit, "timothy!" Way to not edit, stud. I am sure that taco or somebody will change this gem of a summary soon, so here is a copy of this story *as posted* for posterity:
    "Amazon has provided a decent write-up of the problems that caused its S3 storage service to fail for around 8 hours last Sunday. It providers a timeline of events, the immediate action take to fix it (they pulled the big red switch) and what the company is doing to prevent re-occurrence. In summary: A random bit got flipped in one of the server state messages that the S3 machines continuously pass back and forth. There was no checksum on these messages, and the erroneous information was propagated across the cloud, causing so much inter-server chatter, that no customer work got done."
    I wish I could be a /. editor and get paid for not doing jack shit all day.
    
  - Re:for want of a nail ... (Score:2)
    
    by CalSolt ( 999365 ) writes: on Saturday July 26, 2008 @07:52PM (#24352667)
    
    It's like a self-replicating virus that arose from the result of a random mutation.
    "Ever since the first computers,
    there have always been
    ghosts in the machine.
    Random segments of code that
    have grouped together to
    form unexpected protocols."
    
    - Re:for want of a nail ... (Score:3, Funny)
      
      by mrmeval ( 662166 ) writes: <jcmeval&yahoo,com> on Saturday July 26, 2008 @09:57PM (#24353727) Journal
      
      1 million code monkeys typing out Aleister Crowley?
      
  - Re:for want of a nail ... (Score:1)
    
    by herdingcats ( 21219 ) writes: on Saturday July 26, 2008 @09:04PM (#24353339)
    
    BOFH, obviously. duh. and quite cleverly disguised.
    
  - evil bit defined (Score:1)
    
    by hicksw ( 716194 ) writes: on Sunday July 27, 2008 @07:48AM (#24356841)
    
    See RFC 3514 http://www.ietf.org/rfc/rfc3514.txt?number=3514 [ietf.org]
    
- Re:for want of a nail ... (Score:0)
  
  by Anonymous Coward writes: on Saturday July 26, 2008 @05:38PM (#24351431)
  
  Computer Networks by Andrew S. Tanenbaum [amazon.com]
  Well there is some reading for them.
  
- Re:for want of a nail ... (Score:5, Informative)
  
  by Ctrl-Z ( 28806 ) writes: <tim@timcole[ ].com ['man' in gap]> on Saturday July 26, 2008 @05:43PM (#24351465) Homepage Journal
  
  Thank you Capt. Obvious. A single bit is enough to cause a cascading failure, and someone overlooked this instance. It's not the first time, nor will it be the last. See New York City blackout of 1977 [slashdot.org], The Crash of the AT&T Network in 1990 [dmine.com], et al.
  
  - Re:for want of a nail ... (Score:4, Interesting)
    
    by Ctrl-Z ( 28806 ) writes: <tim@timcole[ ].com ['man' in gap]> on Saturday July 26, 2008 @10:13PM (#24353811) Homepage Journal
    
    Actually, that should have been Northeast Blackout of 1965 [wikipedia.org]. But you already knew that.
    
    - re:Northeast Blackout of 1965 (Score:2)
      
      by mblase ( 200735 ) writes: on Sunday July 27, 2008 @06:52PM (#24361833)
      
      I thought that was caused by the bouncy-ball "gift" from the Great Collector. (He thought it was funny as hell....)
      
      - Re:Northeast Blackout of 1965 (Score:1)
        
        by cizoozic ( 1196001 ) writes: on Tuesday July 29, 2008 @02:01AM (#24381109)
        
        I thought that was caused by the bouncy-ball "gift" from the Great Collector. (He thought it was funny as hell....)
        He apparently thought we WERE hosting an intergalactic kegger down here.
        
  - ECC memory, anyone? (Score:4, Interesting)
    
    by Maxmin ( 921568 ) writes: on Saturday July 26, 2008 @10:35PM (#24353981)
    
    I hafta wonder if the bit flipped due to a bad RAM stick?
    We use MD5 checksums throughout the system, for example, to prevent, detect, and recover from corruption that can occur during receipt, storage, and retrieval of customers' objects. However, we didn't have the same protection in place to detect whether this particular internal state information had been corrupted.
    Nothing specific about *what* caused the bit to flip.
    This comes to mind only because bad RAM on a new server at work caused installation of a stock Perl module to throw excessive errors during the XS compile phase - the same package installed without error on an identical machine 20 minutes earlier. Took over an hour before we realized it was probably hardware. Memtest86 [memtest86.com] quickly turned up the problem.
    Would hashes and the like protect against RAM suddenly going south? Wouldn't any piece of data that passes through main memory be vulnerable to corruption? Makes me wonder why ECC memory [wikipedia.org] isn't being used much anymore... we have various flavors of RAID to protect slow memory from corruption, but not many machines I see have ECC anymore.
    
    - Re:ECC memory, anyone? (Score:4, Informative)
      
      by this great guy ( 922511 ) writes: on Saturday July 26, 2008 @11:26PM (#24354399)
      
      Even ECC memory isn't a panacea. ECC can only correct 1-bit errors. It can't correct 2-bit errors (only detect them) and can't even detect nor correct 3-bit (or more) errors. To the poster that seemed to think a 1-bit error causing a downtime is a sign of a defective design: the truth is that 99.9% of the software out there doesn't even try to work around data corruption issues. One can easily introduce 1-bit errors capable of crashing virtually any app. For example by flipping 1 bit of the first byte of the MBR (master boot record) of an OS, it can make it unbootable (it changes the opcode of what is usually a JMP instruction to something else).
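      
      The correct-one, detect-two behaviour described above can be tried out on a toy code. This is a minimal sketch, assuming an extended Hamming(8,4) code over a single nibble; real ECC memory protects much wider words, and the names `encode` and `decode` are made up for illustration:
      
      ```python
      def encode(nibble):
          """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits)."""
          d = [(nibble >> i) & 1 for i in range(4)]
          code = [0] * 8                      # index 0 holds the overall parity bit
          code[3], code[5], code[6], code[7] = d
          code[1] = code[3] ^ code[5] ^ code[7]   # parity over positions 1,3,5,7
          code[2] = code[3] ^ code[6] ^ code[7]   # parity over positions 2,3,6,7
          code[4] = code[5] ^ code[6] ^ code[7]   # parity over positions 4,5,6,7
          code[0] = sum(code[1:]) % 2             # makes the total number of 1s even
          return code
      
      
      def decode(code):
          """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
          code = list(code)
          syndrome = 0
          for pos in range(1, 8):
              if code[pos]:
                  syndrome ^= pos             # XOR of the positions of all set bits
          overall = sum(code) % 2             # 1 means an odd number of bits flipped
          if syndrome == 0 and overall == 0:
              status = "ok"
          elif overall == 1:
              # Treated as a single flip; the syndrome is its position
              # (0 means the overall parity bit itself was hit).
              code[syndrome] ^= 1
              status = "corrected"
          else:
              # Parity looks even but the syndrome is non-zero: two bits flipped.
              return "uncorrectable", None
          nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
          return status, nibble
      ```
      
      One flipped bit is repaired and two flipped bits are reported, but three flipped bits look exactly like a single flip to this decoder, so it "corrects" the word into the wrong value without noticing.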
      
      - Re:ECC memory, anyone? (Score:5, Informative)
        
        by Maxmin ( 921568 ) writes: on Sunday July 27, 2008 @01:13AM (#24355029)
        
        ECC can only correct 1-bit errors. It can't correct 2-bit errors (only detect them) and can't even detect nor correct 3-bit (or more) errors.
        No, that's just one kind of memory system. There are a number of designs, and recovery also depends on the kind of error. IIRC, one design is somewhat similar to the CD Red Book spec, in that the bits for a given byte are distributed around - a physical byte is composed of bits all from different memory locations. If part or all of one byte goes bad, the rest of the bits and the parity code are unchanged, and the affected bytes can be reconstructed.
        Also similar to Red Book CDs are multiply redundant memory systems, which keep (just what it sounds like) multiple copies of each byte and let the memory controller arbitrate differences. CDs effectively contain three copies of the data, striped and parity encoded. That's how scratched CDs can still operate error-free (sometimes.) The space shuttle's computer systems are relatively fault-tolerant - multiple redundant computers all running the same programs and data, with a fourth computer evaluating the output of the other computers, looking for failures.
        Where there's a will, there's a way, but the will in the mainstream x86 server industry to build truly fault-tolerant computers is slim. It's a specialty, and that makes it very expensive. Stratus [stratus.com], for example, makes a line of fault-tolerant servers [stratus.com], with some of the fail-over in hardware, so they make their 99.999% uptime claim (about 5 minutes downtime per year.)
        "Five nines [wikipedia.org]" is a claim I've heard from most top-dollar *nix hosting companies, but have *never* experienced - it's generally been hours of downtime per year. Not even their network infrastructure gets close to 99.999% uptime! Cadillac prices, but downtime contingency planning is all up to the client, even with "managed hosting." They all suck.
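        
        The "about 5 minutes per year" figure is plain arithmetic, and the same sum shows what hours of downtime do to the number of nines. A minimal sketch, assuming a 365.25-day year; the function names are made up for illustration:
        
        ```python
        MINUTES_PER_YEAR = 365.25 * 24 * 60   # 525,960 minutes
        
        
        def downtime_minutes_per_year(availability):
            """Minutes of downtime per year allowed by an availability fraction."""
            return (1.0 - availability) * MINUTES_PER_YEAR
        
        
        def availability_from_downtime(hours_down_per_year):
            """Availability fraction implied by a number of hours down per year."""
            return 1.0 - (hours_down_per_year * 60) / MINUTES_PER_YEAR
        ```
        
        `downtime_minutes_per_year(0.99999)` comes to a little over 5 minutes, while a single outage of around 8 hours (the length quoted in the summary) already puts the year at roughly 99.9%.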
        
        
        Re:ECC memory, anyone? (Score:3, Informative)
        
        by iluvcapra ( 782887 ) writes: on Sunday July 27, 2008 @02:53AM (#24355643)
        
        No, that's just one kind of memory system. There are a number of designs, and recovery also depends on the kind of error. IIRC, one design is somewhat similar to the CD Red Book spec, in that the bits for a given byte are distributed around - a physical byte is composed of bits all from different memory locations.
        Red Book audio CDs, Sony MiniDiscs and DATs all use a form of Cross-Interleaved Reed-Solomon [wikipedia.org] coding, which has the nice characteristic of being able to use the fact that a piece of information is known to be missing when reconstructing the original signal, whereas other systems can't necessarily be improved by being informed of the difference between an "error" and an "erasure." Side information about "known-bad" media areas is a natural fit for physical media, not necessarily for serial data or other things.
        CDs also have parity bits on every (EFM-encoded) byte on the media, which can contribute to the "erasure" side-information along with tracking data from the laser. Also working in the CD's favor is the fact that they carry relatively low-information PCM data, and if there is a complete loss of a sample or two, the decoding device can just do a 1st order interpolation between the surrounding known good samples. This is why CDs can sound excellent until one day it might just not play at all, without a significant period of declining quality, because errors were accumulating until you reached a critical point where your player couldn't spackle over the errors anymore, and it just gives up.
        only ot fyi :)
        
        
        Re:ECC memory, anyone? (Score:2)
        
        by Maxmin ( 921568 ) writes: on Sunday July 27, 2008 @05:36AM (#24356247)
        
        That's a really interesting distinction. If I'm reading right at this hour, you're saying that the read device is able to distinguish between on, off and bad media? Something a RAM chip or memory controller can't do, I imagine, because it's receiving a signal from deep inside ... good or bad, the media's read indirectly.
        FSM-damn, there are knowledgeable folk here on /.
        
      - Re:ECC memory, anyone? (Score:3, Informative)
        
        by afidel ( 530433 ) writes: on Monday July 28, 2008 @02:48AM (#24364891)
        
        All decent servers use multibit ECC and the better ones are using IBM's chipkill technology which is basically RAID for ram, it uses an extra memory chip to do parity calculations.
        
- Re:for want of a nail ... (Score:0)
  
  by Anonymous Coward writes: on Saturday July 26, 2008 @05:46PM (#24351505)
  
  It's called a "flag", in case you didn't know that.
  
  - Re:for want of a nail ... (Score:2)
    
    by somersault ( 912633 ) writes: on Saturday July 26, 2008 @08:07PM (#24352823) Homepage Journal
    
    A flag can be stored as a bit, but not all bits are used as flags..
    
- Re:for want of a nail ... (Score:0)
  
  by Anonymous Coward writes: on Saturday July 26, 2008 @06:51PM (#24352127)
  
  I guess a little bit goes a long way.
  
- Re:for want of a nail ... (Score:0)
  
  by Anonymous Coward writes: on Saturday July 26, 2008 @06:57PM (#24352169)
  
  with at least 2 wrong 'bits' just in the summary imagine the horror of the S3 codebase!
  
- Re:for want of a nail ... (Score:1)
  
  by kencf0618 ( 1172441 ) writes: on Saturday July 26, 2008 @07:30PM (#24352451) Homepage
  
  For want of an unflipped bit a server was lost.
  For want of a server gossip was lost.
  For want of gossip clusters were lost.
  For want of clusters revenues ceased.
  All for want of an unflipped bit.
  
- You have no idea... (Score:1, Interesting)
  
  by Anonymous Coward writes: on Saturday July 26, 2008 @09:37PM (#24353579)
  
  As someone who worked at Amazon as a software engineer for over three years in various backend areas, I can say that without a doubt, Amazon's code and production quality is so horrible, that it's hard to believe.
  Engineers carry pagers and, in many groups, are constantly paged. The only thing that keeps the systems running is a bunch of junior engineers responding in the middle of the night, fixing databases, bouncing services, etc., etc. Engineers are rarely, if ever, given the chance to actually *fix* things, they're just supposed to band-aid them up.
  And, here's a big secret for you: When I left Amazon a little over a year ago, no development groups internally were even using EC2, S3, SQS, or any of the other web services they sell to you. They make it sound like you're using the same high-end services they use to satisfy tens of millions of customers. They're not.
  
- Re:for want of a nail ... (Score:5, Funny)
  
  by iamhassi ( 659463 ) writes: on Saturday July 26, 2008 @09:38PM (#24353593) Journal
  
  FTA:
  "On Sunday, we saw a large number of servers that were spending almost all of their time gossiping and a disproportionate amount of servers that had failed while gossiping. With a large number of servers gossiping and failing while gossiping, Amazon S3 wasn't able to successfully process many customer requests."
  
  sounds like a restaurant, gossiping servers were failing to process customer requests
  
They need more Erlang. (Score:0)

by Anonymous Coward writes: on Saturday July 26, 2008 @05:38PM (#24351423)

They need to start using Erlang more. It's designed specifically for building highly-distributed, concurrent systems that must scale to millions of transactions per minute. So it's a natural fit between Erlang and what Amazon is trying to offer with their S3 service.

- Re:They need more Erlang. (Score:5, Insightful)
  
  by nacturation ( 646836 ) * writes: <nacturation.gmail@com> on Saturday July 26, 2008 @05:51PM (#24351555) Journal
  
  They need to start using Erlang more. It's designed specifically for building highly-distributed, concurrent systems that must scale to millions of transactions per minute. So it's a natural fit between Erlang and what Amazon is trying to offer with their S3 service.
  I think Erlang's cool and all, but it's not the magical bullet that will solve this. It's still possible to have information corrupted during message passing between processes in Erlang (say, as the result of an intermittently failing network switch) as it is in any language.
  
- Re:They need more Erlang. (Score:0)
  
  by Anonymous Coward writes: on Saturday July 26, 2008 @06:06PM (#24351721)
  
  So, are you proposing they rewrite the S3 service in Erlang, rather than adding checksums to their messaging protocol? Don't get me wrong, Erlang might be the way to go if you or I were creating a service from scratch.
  
- Re:They need more Erlang. (Score:5, Insightful)
  
  by edsousa ( 1201831 ) writes: on Saturday July 26, 2008 @07:17PM (#24352353) Journal
  
  This message is written by one that writes real parallel, distributed and concurrent code (they are not all the same):
  Erlang or any other functional language will not account for lack of good design. If you have a good design with the right concerns you can implement in Java, C, Fortran, ASM and if done right, it will work.
  I'm sick of hearing "Erlang is THE solution". It is not. Good design and implementation practices are.
  
Simple (Score:3, Insightful)

by gardyloo ( 512791 ) writes: on Saturday July 26, 2008 @05:46PM (#24351511)

S3 is a total slut.

- Re:Simple (Score:2)
  
  by gardyloo ( 512791 ) writes: on Saturday July 26, 2008 @05:49PM (#24351541)
  
  Aw. And it appears that this has been posted from the not-mere-sluttiness dept. It appears that slashdot editors already stooped to the lowest form of humor in posting the headline, and I've just fallen into their redundancy trap. Tricky, tricky!
  
It's not the first time... (Score:0)

by Anonymous Coward writes: on Saturday July 26, 2008 @05:51PM (#24351559)

A random bit got flipped
...that a bit on the side has caused problems.
Look at Max Mosley, for example.

...make lemonade. (Score:4, Funny)

by fahrbot-bot ( 874524 ) writes: on Saturday July 26, 2008 @06:03PM (#24351681)

A random bit got flipped in one of the server state messages...

Cosmic Rays perhaps? I guess they could line the room with lead, or simply re-market S3 as a Neutrino detector [wikipedia.org]. :-)

- Re:...make lemonade. (Score:3, Insightful)
  
  by erc ( 38443 ) writes: <erc AT pobox DOT com> on Saturday July 26, 2008 @06:16PM (#24351785) Homepage
  
  Or they could checksum their UDP packets. The entire packet, not just the customer payload. Duh.
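  
  Checksumming the whole message rather than just the customer payload could look something like the following. This is only a sketch under assumptions: the JSON layout and the field names are invented, nothing here reflects Amazon's actual gossip format, and MD5 is used simply because the write-up mentions it for object data:
  
  ```python
  import hashlib
  import json
  
  
  def pack_state(state):
      """Serialize a state dict and prepend a 16-byte MD5 digest of the body."""
      body = json.dumps(state, sort_keys=True).encode("utf-8")
      return hashlib.md5(body).digest() + body
  
  
  def unpack_state(packet):
      """Verify the digest before trusting the contents; raise if they disagree."""
      digest, body = packet[:16], packet[16:]
      if hashlib.md5(body).digest() != digest:
          raise ValueError("corrupt state message; dropping it")
      return json.loads(body.decode("utf-8"))
  ```
  
  A receiver that drops the message on `ValueError` instead of acting on it (and passing it along) keeps a flipped bit from spreading. Note that this only catches corruption that happens after `pack_state` runs; data that was already wrong when it was packed gets a perfectly valid digest.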
  
  - Re:...make lemonade. (Score:3, Interesting)
    
    by spinkham ( 56603 ) writes: on Saturday July 26, 2008 @06:52PM (#24352129)
    
    There's probably information that changes as the packets move around, and they probably wanted to avoid the overhead. I'm guessing it was a deliberate design decision, but it turned out to be the wrong one. It's easy to see that after a failure, but it's hard to design large distributed systems and foresee every possible way things can break, and where the computation overhead is worth it. The number of interactions between servers here makes any small design flaw a big thing.
    
    - Re:...make lemonade. (Score:1)
      
      by superdana ( 1211758 ) writes: on Saturday July 26, 2008 @09:52PM (#24353673)
      
      If by "deliberate design decision" you mean "laziness," yeah, I'll buy that.
      
      - Re:...make lemonade. (Score:5, Insightful)
        
        by spinkham ( 56603 ) writes: on Saturday July 26, 2008 @10:43PM (#24354047)
        
        No, I mean favoring speed and computational simplicity over error detection.
        It is often a valid trade-off. For example, most filesystems do not validate the stored data at all for size and computational reasons. As hard drives and arrays get bigger, that trade-off no longer makes much sense, and most all new filesystems being designed have hash based error detection built in at some level.
        Good design takes experience. There aren't that many systems like S3 that have been built in the past, and there are many tricky decisions to be made. No system gets it all correct out of the gate.
        
  - Re:...make lemonade. (Score:1)
    
    by xRizen ( 319121 ) writes: on Saturday July 26, 2008 @06:53PM (#24352147)
    
    Or they could use TCP.
    
  - Re:...make lemonade. (Score:2)
    
    by Sique ( 173459 ) writes: on Saturday July 26, 2008 @06:59PM (#24352185) Homepage
    
    It is quite simple to prove mathematically that it is in general impossible to devise a protocol that guarantees 100 percent correctness of a signal over an unreliable line.
    (You simply do it by recursive reduction: If the protocol works even if the last bit is missing, then you could send the message without the last bit anyway. If the second last bit can go missing without damaging the integrity of the message, you can leave out that one also... etc.pp. until you don't need to send any bit at all to deliver your message ;) )
    
    - Re:...make lemonade. (Score:3, Insightful)
      
      by TheRaven64 ( 641858 ) writes: on Saturday July 26, 2008 @07:48PM (#24352627) Journal
      
      True, but not particularly informative. The point is to detect errors. Error correction is nice, but error detection is enough if the sender can then retransmit. Oh, and your 'proof' is flawed, since you are completely ignoring the fact that any correction scheme contains redundant information, so while n bits might work instead of n+1 bits, n-1 might not.
      If the last bit is missing, then your receiver knows that there is an error. If the last bit is flipped, then it knows that there is an error. A checksum can be very simple and just give the count of the number of bits that are set in the message. This will protect from single-bit errors, but an error in both the message and the checksum can cause an erroneous packet to pass. It's basically a hash, and the idea is to make sure that hash collisions are as infrequent as possible.
      You can usually guarantee a maximum number of errors in the network and it's possible to design correction schemes which will detect any n-bit error.
      In some cases, it's possible to accurately detect all errors because all failures will be in one direction (0 to 1 or 1 to 0, but not both). In this case, an XOR'd copy of the message will work, because any bit in the message flipping from 0 to 1 needs a corresponding bit in the check flipping from 1 to 0 (there are shorter encoding schemes that work in this case too, but this is a very simple example).
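      
      Both schemes above fit in a few lines. A minimal sketch, with made-up helper names, and reading "XOR'd copy" as the bitwise complement of the message (that reading is an assumption):
      
      ```python
      def popcount_checksum(data: bytes) -> int:
          """Count of set bits in the message: the 'very simple' checksum."""
          return sum(bin(b).count("1") for b in data)
      
      
      def flip_bit(data: bytes, index: int) -> bytes:
          """Return a copy of data with bit number `index` (LSB-first per byte) flipped."""
          out = bytearray(data)
          out[index // 8] ^= 1 << (index % 8)
          return bytes(out)
      
      
      def with_complement(data: bytes) -> bytes:
          """Append the bitwise complement of the message as its check."""
          return data + bytes(b ^ 0xFF for b in data)
      
      
      def complement_ok(packet: bytes) -> bool:
          """True if the second half is still the exact complement of the first."""
          half = len(packet) // 2
          return len(packet) % 2 == 0 and all(
              a ^ b == 0xFF for a, b in zip(packet[:half], packet[half:])
          )
      ```
      
      Any single flipped bit changes the popcount, so it is caught. A 1-to-0 flip paired with a 0-to-1 flip elsewhere leaves the count unchanged and slips through, which is the collision case. With the complement check, fooling it needs a matching flip in the opposite direction in the copy, which cannot happen if every failure goes the same way.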
      
      - Re:...make lemonade. (Score:5, Interesting)
        
        by Sique ( 173459 ) writes: on Saturday July 26, 2008 @08:05PM (#24352803) Homepage
        
        I see you completely miss the point of the proof. I know that you can minimize the impact of a bit error by checksums, and that you can improve reliability by adding redundancy. But what is the consequence of error detection? Normally the protocol then asks for resending the message. But how do you (as the sender) know that the message finally arrived correctly? You wait for an acknowledgement. But what if the acknowledgement gets lost or scrambled? You add redundancy in the handshaking. But how does redundancy help reliability? You can ask for a resend due to detected errors etc.pp.
        Your protocol never finds an end because it has to secure the correctness of the security of the correctness of the security...
        
        
        Re:...make lemonade. (Score:2)
        
        by jlp2097 ( 223651 ) writes: on Sunday July 27, 2008 @03:57AM (#24355943) Homepage Journal
        
        It's called the Two Generals' Problem [wikipedia.org].
        
    - Re:...make lemonade. (Score:1, Interesting)
      
      by Anonymous Coward writes: on Saturday July 26, 2008 @08:39PM (#24353137)
      
      That there's nothing you can do to get the correct message to the other end if the lower level connection doesn't send anything (e.g. someone unplugged the ethernet cable) or corrupts everything sent, is spectacularly uninteresting.
      What people actually care about is that, given a connection with some statistical level of unreliability, you can build a reliable connection out of it. The less reliable the real connection, the more overhead you need. Obviously the "reliable connection" isn't actually perfectly reliable if the underlying connection doesn't meet the reliability level that was assumed.
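      
      The overhead-versus-reliability trade can be put in numbers. A rough sketch, assuming each transmission is lost independently with the same probability (real links tend to lose packets in bursts, so treat this as optimistic); the function names are made up:
      
      ```python
      import math
      
      
      def delivery_probability(loss_rate, attempts):
          """Chance that at least one of `attempts` independent sends gets through."""
          return 1.0 - loss_rate ** attempts
      
      
      def attempts_needed(loss_rate, target):
          """Smallest number of sends whose delivery probability reaches `target`."""
          return math.ceil(math.log(1.0 - target) / math.log(loss_rate))
      ```
      
      On a link that drops 20% of messages, four sends still fall short of 99.9% delivery and a fifth is needed. No finite number of sends reaches exactly 100%, which is the other poster's point; the practical question is how many nines you are willing to pay for.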
      
  - Re:...make lemonade. (Score:2)
    
    by Z34107 ( 925136 ) writes: on Sunday July 27, 2008 @04:31AM (#24356041)
    
    Or they could checksum their UDP packets. The entire packet, not just the customer payload. Duh.
    Not a network engineer, but I believe that UDP packets contain the source MAC address. When a router receives that packet, it will blow away that MAC address, replace it with its own, and forward it out the right interface. (This is assuming they're using UDP, TCP/IP, or something else entirely to transmit whatever state fields were corrupted.)
    If they did checksum the entire packet, they would have to rebuild the sum at every node to account for the changing MAC address, and who knows what else is modified in the packet header between nodes. This smacks of a lot of wasted processing time, times, like, a metric internet or two.
    If it did get corrupted in transmission (not due to faulty memory, as some speculated, or due to evil gremlins or something else) then maybe adding a checksum just to the state bits would be worthwhile.
    
    - Re:...make lemonade. (Score:2)
      
      by jandrese ( 485 ) writes: <kensama@vt.edu> on Sunday July 27, 2008 @12:48PM (#24358985) Homepage Journal
      
      The MAC address is stored down at the ethernet frame level. The UDP checksum covers the UDP header and payload (plus a pseudo-header built from the IP addresses), so it won't change due to changes at the ethernet frame level. That said, the UDP checksum is notoriously weak and relying on it at all is just asking for failure.
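      
      To see why the checksum is considered weak, here is a sketch of the 16-bit ones' complement sum it is built on (the RFC 1071 style of Internet checksum). It sums only the bytes it is handed; a real UDP checksum also sums a pseudo-header, the UDP header and the data, and the function name is made up:
      
      ```python
      def internet_checksum(data: bytes) -> int:
          """16-bit ones' complement of the ones' complement sum of 16-bit words."""
          if len(data) % 2:
              data += b"\x00"                 # pad odd-length input with a zero byte
          total = 0
          for i in range(0, len(data), 2):
              total += (data[i] << 8) | data[i + 1]
          while total >> 16:                  # fold carries back into the low 16 bits
              total = (total & 0xFFFF) + (total >> 16)
          return ~total & 0xFFFF
      ```
      
      Because it is just addition, swapping two 16-bit words gives the same checksum, and so does raising a bit in one word while clearing the same bit in another. A single flipped bit is caught, but compensating errors and reordered data are not.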
      
      - Re:...make lemonade. (Score:2)
        
        by Z34107 ( 925136 ) writes: on Sunday July 27, 2008 @04:02PM (#24360665)
        
        Interesting. And who knows how the failed bit got toggled anyway? Perhaps the checksum would have been built around the bad information anyway, and all the other machines would have picked it up as valid.
        
Lost time? (Score:2)

by freelunch ( 258011 ) writes: on Saturday July 26, 2008 @06:05PM (#24351705)

By 11:05am PDT, ..., the system's state cleared.
At 2:57pm PDT, Amazon S3's EU location began successfully completing customer requests.
So WTF happened during the four hours between 11:05AM and 2:57PM?
And... have they learned nothing from the TIBCO fiasco?

- Re:Lost time? (Score:0)
  
  by Anonymous Coward writes: on Saturday July 26, 2008 @06:09PM (#24351741)
  
  That's how long it took things to come back online.
  
  - Re:Lost time? (Score:1, Interesting)
    
    by Anonymous Coward writes: on Saturday July 26, 2008 @06:17PM (#24351793)
    
    That's how long it took things to come back online.
    Well that explains everything. I suppose I should expect that since it goes to 11....
    Four hours is a Longgg time... and it was five hours for the US.
    Why did it take four hours? Why not 40 minutes or four days?
    There is a big hole in the story.
    
    Parent Share
    twitter facebook
    - Re:Lost time? (Score:3, Informative)
      
      by Anonymous Coward writes: on Saturday July 26, 2008 @06:21PM (#24351825)
      
      sun enterprise and most other enterprise servers take 30-45 minutes to pass the prom tests after you start them up. that's per server, before you even boot the os. my guess is they brought up servers in large batches of 25% at a time and took about an hour per batch.
      
- Re:Lost time? (Score:5, Informative)
  
  by Anpheus ( 908711 ) writes: on Saturday July 26, 2008 @06:14PM (#24351771)
  
  Read "the system's state cleared" as "we turned everything off" and they proceeded to turn every server on one by one until around 3PM when the EU location was complete and not showing any symptoms.
  
- Re:Lost time? (Score:1)
  
  by Annymouse Cowherd ( 1037080 ) writes: on Saturday July 26, 2008 @09:04PM (#24353327) Homepage
  
  They programmed a checksum into their message-passing?
  
Other companies could learn from this... (Score:5, Insightful)

by Manip ( 656104 ) writes: on Saturday July 26, 2008 @06:10PM (#24351753)

Other large businesses could learn a lot from Amazon's example.
How often do you have the problem really explained to you, an apology, and a reasonable set of changes to stop it occurring again?
Most businesses would never explain the root of any problem. They simply list "hardware issues." And they NEVER say sorry anymore - supposedly it opens them up to more liability or something.
If I was an Amazon customer I would be happy with their explanation and apology even if obviously the downtime is still an issue.

- Re:Other companies could learn from this... (Score:2, Insightful)
  
  by FalcDot ( 1224920 ) writes: on Saturday July 26, 2008 @06:33PM (#24351957)
  
  Looking back, I feel that the one thing all our technological progress has given us more than anything else, is more and better means of communication.
  You can talk to people on the other side of the world, heck, on the other side of the solar system if you don't mind the delay. Video feeds, planes that'll actually get you there in less than a day, ...
  And yet, with all of this, it seems that we're not actually doing it. A company explaining what went wrong is the exception. Internet forums without flamers and trolls? Exception.
  "Anything you say can and will be used against you."
  
- Re:Other companies could learn from this... (Score:5, Funny)
  
  by Anonymous Coward writes: on Saturday July 26, 2008 @06:34PM (#24351969)
  
  Other companies could learn something from this, unfortunately they won't be able to do anything similar as Amazon has patented the process of explaining technological problems to customers.
  
- Re:Other companies could learn from this... (Score:0, Offtopic)
  
  by ddrichardson ( 869910 ) writes: on Saturday July 26, 2008 @07:06PM (#24352255)
  
  If I was an Amazon customer I would be happy with their explanation and apology even if obviously the downtime is still an issue.
  Since their drive towards Amazon Prime, their deliveries have been appalling (at least in the area of the UK I'm in), YMMV. As much as I appreciate their candour, the delivery and delay problems are what is driving me away at the moment.
  
  - Comment removed (Score:2)
    
    by account_deleted ( 4530225 ) * writes: on Saturday July 26, 2008 @07:49PM (#24352633)
    
    Comment removed based on user account deletion
    
    - Re:Other companies could learn from this... (Score:2)
      
      by ddrichardson ( 869910 ) writes: on Saturday July 26, 2008 @09:33PM (#24353561)
      
      That as a customer their server uptime is not as big a factor as poor delivery. I'd go further to say that I wish their service was as good as their uptime. I would have thought the post was pretty straightforward.
      
- It's quite an old story - see RFC789 (Score:5, Interesting)
  
  by anti-NAT ( 709310 ) writes: on Saturday July 26, 2008 @07:18PM (#24352367) Homepage
  
  Vulnerabilities of Network Control Protocols: An Example [ietf.org], published in January 1981.
  What do they say about those who ignore history?
  
  - Re:It's quite an old story - see RFC789 (Score:5, Funny)
    
    by Gazzonyx ( 982402 ) writes: <scott.lovenbergNO@SPAMgmail.com> on Saturday July 26, 2008 @11:17PM (#24354321)
    
    [...]
    What do they say about those who ignore history?
    I think it was, they're doomed to reimplement it... poorly. Or was that Unix? ;)
    
- Re:Other companies could learn from this... (Score:3, Insightful)
  
  by SanityInAnarchy ( 655584 ) writes: <ninja@slaphack.com> on Saturday July 26, 2008 @08:30PM (#24353043) Journal
  
  Well, technically speaking, there isn't an apology there:
  Finally, we want you to know that we are passionate about providing the best storage service at the best price so that you can spend more time thinking about your business rather than having to focus on building scalable, reliable infrastructure. Though we're proud of our operational performance in operating Amazon S3 for almost 2.5 years, we know that any downtime is unacceptable and we won't be satisfied until performance is statistically indistinguishable from perfect.
  Allow me to translate:
  We screwed up. We'll do better next time.
  Nowhere in the document do the words "I'm sorry" appear. That's entirely implied.
  
  - Re:Other companies could learn from this... (Score:5, Insightful)
    
    by Alpha830RulZ ( 939527 ) writes: on Saturday July 26, 2008 @11:41PM (#24354497)
    
    The words may not be there, but there is a pretty clear message there for me to see that they are not happy or smug about this event, and are agreeing with the consumer that this shouldn't have happened, and won't happen again if they can help it. That's enough for me.
    And I'm actually one of their consumers, compared to some of the dilettantes here. We use S3 and EC2 to manage training and demo instances of our software, and are pretty pleased so far.
    
    - Re:Other companies could learn from this... (Score:2)
      
      by SanityInAnarchy ( 655584 ) writes: <ninja@slaphack.com> on Monday July 28, 2008 @12:26AM (#24364165) Journal
      
      I am a customer, and this does reassure me.
      It is sad, though, that it has to go through legal first.
      
- Re:Other companies could learn from this... (Score:2)
  
  by brxndxn ( 461473 ) writes: on Sunday July 27, 2008 @02:12AM (#24355399)
  
  Ya, seriously.. Honesty!! In this post-9/11 day and age, we have honesty!
  I'm gonna go buy something from Amazon.com now.
  
- Re:Other companies could learn from this... (Score:1)
  
  by emaname ( 1014225 ) writes: on Sunday July 27, 2008 @03:13PM (#24360279)
  
  I agree. A little transparency goes a long way toward reinforcing credibility. This is a very smart business move. It demonstrates two things. Amazon respects its customers enough not to insult them with some lame, vague 'hardware problem' excuse. And it validates that they actually troubleshoot their problems and understand them before they apply a solution. The combination of these two responses makes the apology even more sincere.
  
It was drunk, had father issues, and... (Score:1, Funny)

by SensitiveMale ( 155605 ) writes: on Saturday July 26, 2008 @06:21PM (#24351823)

was trying to hold onto a man?
I'm just guessing here.

Sigh. Tempus fugit, and experience is lost (Score:1, Interesting)

by Anonymous Coward writes: on Saturday July 26, 2008 @06:25PM (#24351869)

I remember at least one similar incident:
Early (or earlish) ARPANET, the network controllers did not implement checksums on messages. A bit flip caused the whole shebang to keep on forwarding the same message over and over, to the point that nothing else would flow, not even a reset command. Someone had to be sent to the culprit box to reset it manually.
I remember this being discussed in some technical circles for a while (I think Software Engineering Notes from the ACM had a go at it).

Programmers never learn... (Score:0)

by Anonymous Coward writes: on Saturday July 26, 2008 @06:27PM (#24351891)

Stop making protocols whereby one server can crash [soft.com] another already. Especially when they talk to each other constantly and there's a lot of them. Cascade failures, FTW.

- Re:Programmers never learn... (Score:2)
  
  by Manip ( 656104 ) writes: on Saturday July 26, 2008 @06:46PM (#24352077)
  
  In 99.9% of cases it isn't the protocol that causes a server crash but rather the way that the protocol is implemented.
  This story is another example of that. Although they're fixing it by changing the protocol spec, that is really just a much cheaper way of resolving the core issue (i.e. that no input, corrupt or otherwise, should ever cause a crash).
  Programmers need to learn something but I think the real lesson here is simply that input over the network cannot ever be trusted. You should assume that it is corrupt, untrusted, or wrong.
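
To make the "never trust network input" point concrete, here is a minimal Python sketch of a receiver that validates a state message before acting on it. The wire format (a 1-byte state code plus a 4-byte server id) and the names `parse_state_message` and `VALID_STATES` are assumptions made up for illustration; nothing here is taken from Amazon's write-up.

```python
import struct

# Hypothetical wire format: 1-byte state code + 4-byte big-endian server id.
MESSAGE_FORMAT = ">BI"
MESSAGE_SIZE = struct.calcsize(MESSAGE_FORMAT)  # 5 bytes
VALID_STATES = {0: "healthy", 1: "degraded", 2: "offline"}


def parse_state_message(raw: bytes):
    """Return (state_name, server_id), or None if the input is not plausible."""
    if len(raw) != MESSAGE_SIZE:
        return None  # truncated or padded message: drop it, do not crash
    state_code, server_id = struct.unpack(MESSAGE_FORMAT, raw)
    if state_code not in VALID_STATES:
        return None  # unknown state code, e.g. after a flipped bit
    return VALID_STATES[state_code], server_id


good = struct.pack(MESSAGE_FORMAT, 1, 42)
corrupt = bytes([good[0] ^ 0x80]) + good[1:]  # flip the top bit of the state code

print(parse_state_message(good))     # ('degraded', 42)
print(parse_state_message(corrupt))  # None
print(parse_state_message(b"\x01"))  # None
```

Range checks like this only catch values that are obviously invalid. A flip that turns one valid code into another valid code would still get through, which is where a checksum comes in.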
  
  - Re:Programmers never learn... (Score:0)
    
    by Anonymous Coward writes: on Saturday July 26, 2008 @07:11PM (#24352307)
    
    "I think the real lesson here is simply that input over the network cannot ever be trusted. You should assume that it is corrupt, untrusted, or wrong."
    Isn't this Microsoft's answer?
    Deny or Allow?
    Oh wait. Replace input over the network with end user?
    
    - Re:Programmers never learn... (Score:2)
      
      by somersault ( 912633 ) writes: on Saturday July 26, 2008 @08:22PM (#24352989) Homepage Journal
      
      No. Even if the end user is not malicious or stupid, the network itself will not always transmit accurately. Interference or signal degradation on the lines, or a dodgy switch can screw things up.
      
  - Re:Programmers never learn... (Score:2)
    
    by caluml ( 551744 ) writes: <slashdot@NoSPaM.spamgoeshere.calum.org> on Saturday July 26, 2008 @08:14PM (#24352877) Homepage
    
    Programmers need to learn something but I think the real lesson here is simply that input over the network cannot ever be trusted. You should assume that it is corrupt, untrusted, or wrong.
    Well, I have to say I'm guilty. I just assume that all the checksums at the various levels of TCP/IP make sure the data that comes over a socket is the same as what went in.
    Of course, if I was working on a critical/must-never-go-down system, I'd maybe be a little more paranoid.
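
One reason not to lean on the TCP checksum alone: it is a 16-bit ones' complement sum (RFC 1071), so it does not depend on the order of the 16-bit words it covers. The Python sketch below illustrates that; the function name `internet_checksum` is chosen here for illustration, and this is a simplified version of the algorithm rather than a full TCP implementation (no pseudo-header).

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


original = b"\x12\x34\x56\x78"
reordered = b"\x56\x78\x12\x34"  # same 16-bit words, swapped

print(hex(internet_checksum(original)))
print(internet_checksum(original) == internet_checksum(reordered))  # True
```

The two payloads differ, yet the checksum cannot tell them apart, so an application-level check on top is not as paranoid as it sounds.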
    
  - Re:Programmers never learn... (Score:4, Interesting)
    
    by SanityInAnarchy ( 655584 ) writes: <ninja@slaphack.com> on Saturday July 26, 2008 @08:33PM (#24353079) Journal
    
    I think the real lesson here is simply that input over the network cannot ever be trusted.
    It is their network, and S3 is built on tech which is explicitly designed to not have any kind of security built-in. The security is applied at the API level, but any misbehaving machine within the S3 cluster could cause some serious damage.
    I actually agree with this philosophy, to an extent. After all, this is essentially a large number of computers acting as a hard disk. How would you approach talking to a hard disk in your own machine? Do you assume that everything is corrupt, untrusted, or wrong?
    
    - Re:Programmers never learn... (Score:2)
      
      by afidel ( 530433 ) writes: on Monday July 28, 2008 @04:12PM (#24374203)
      
      Yes, or at least we should. Both ZFS and ANSI T10-DIF (hardware) assume that there will be silent errors at the storage level and make changes to accommodate them. The funny thing to me is that ANSI T10-DIF takes the same approach the mainframe took 28+ years ago and uses 520B blocks instead of 512B blocks.
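
A rough Python sketch of the 520-byte idea, as I understand the T10-DIF layout: 512 data bytes followed by 8 bytes of protection information (a 2-byte CRC guard tag, a 2-byte application tag, and a 4-byte reference tag derived from the block address). The helper names `protect_sector` and `verify_sector` are hypothetical, and real hardware does this in the controller and drive, not in Python.

```python
import struct

SECTOR_DATA_SIZE = 512
T10_DIF_POLY = 0x8BB7  # CRC-16 polynomial used for the T10-DIF guard tag


def crc16_t10dif(data: bytes) -> int:
    """Bitwise CRC-16 with polynomial 0x8BB7, initial value 0, no reflection."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ T10_DIF_POLY) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def protect_sector(data: bytes, lba: int, app_tag: int = 0) -> bytes:
    """Append guard tag, application tag and reference tag to 512 data bytes."""
    assert len(data) == SECTOR_DATA_SIZE
    trailer = struct.pack(">HHI", crc16_t10dif(data), app_tag, lba & 0xFFFFFFFF)
    return data + trailer  # 520 bytes in total


def verify_sector(sector: bytes, expected_lba: int) -> bool:
    """Check the guard CRC and that the sector is the one we asked for."""
    data, trailer = sector[:SECTOR_DATA_SIZE], sector[SECTOR_DATA_SIZE:]
    guard, _app_tag, ref_tag = struct.unpack(">HHI", trailer)
    return guard == crc16_t10dif(data) and ref_tag == (expected_lba & 0xFFFFFFFF)


sector = protect_sector(bytes(512), lba=7)
damaged = bytes([sector[0] ^ 0x01]) + sector[1:]  # flip one data bit

print(len(sector))                # 520
print(verify_sector(sector, 7))   # True
print(verify_sector(damaged, 7))  # False
print(verify_sector(sector, 8))   # False: right data, wrong location
```

The reference tag is what catches a block that is intact but was read from or written to the wrong address, something a plain data checksum would miss.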
      
      - Re:Programmers never learn... (Score:2)
        
        by SanityInAnarchy ( 655584 ) writes: <ninja@slaphack.com> on Tuesday July 29, 2008 @01:54AM (#24381059) Journal
        
        Yes, or at least we should.
        Let me be more specific:
        ZFS assumes everything could be corrupt. It makes no assumptions about wrong or untrusted. ("Wrong" being defined here as human error, so that it's not just "corrupt".) And I think that's about all a storage layer should do.
        ZFS also isn't the only way to do this. It's perhaps the slickest, but there are still things like the bad-block relocation layer in Linux, and there are higher-level software layers, like Git.
        Now, yes, S3 dropped the ball -- there was a point where they missed "corrupt" on that checklist. But I can see how they might have made the simple leap from "trusted" to "assumed non-corrupt".
        Also: Consider that this is not the data. This is the communication layer. You might well assume that your data is fine, but what about your northbridge? How do you know the data in RAM is good? (Does ZFS make any allowances for that?)
        
It was a design defect (Score:5, Informative)

by j. andrew rogers ( 774820 ) writes: on Saturday July 26, 2008 @06:44PM (#24352053)

It has been generally well-known for a number of years now that any time you have a large cluster you cannot count on hardware checksums to catch every bit flip that may occur during copies and transmission, particularly with consumer hardware which has many internal paths with no checksums at all. Google learned this the hard way, like the supercomputing people before them, and now like Amazon after Google. And some of the better database engines also do their own internal software checksums to catch uncaught errors introduced as the data gets copied across the silicon, disks, and network -- it is one way they get their very high uptime and low failure rate.
It does not reflect well on the software community that most people *still* do not know to do this for very large scale system designs. The performance cost of doing a software CRC on your data every time it is moved around is low enough that it is generally worth it these days. If your system is large enough, the probability of getting bitten by this approaches unity. Very fast implementations of Adler-32 and other high-performance checksum algorithms are widely available online.
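
As a minimal sketch of what "a software CRC on your data every time it is moved" can look like, the Python below appends a CRC-32 to each message and verifies it on receipt. The framing (payload followed by a 4-byte big-endian CRC) and the names `frame`, `unframe` and `flip_bit` are assumptions for illustration; CRC-32 from the standard `zlib` module is used here simply because it is readily available, not because it is what Amazon or anyone else in this thread deployed.

```python
import struct
import zlib


def frame(payload: bytes) -> bytes:
    """Append a CRC-32 of the payload so the receiver can detect corruption."""
    return payload + struct.pack(">I", zlib.crc32(payload))


def unframe(message: bytes) -> bytes:
    """Return the payload, or raise ValueError if the checksum does not match."""
    if len(message) < 4:
        raise ValueError("message too short to carry a checksum")
    payload, (expected,) = message[:-4], struct.unpack(">I", message[-4:])
    if zlib.crc32(payload) != expected:
        raise ValueError("checksum mismatch: dropping corrupt message")
    return payload


def flip_bit(message: bytes, bit_index: int) -> bytes:
    """Simulate a single-bit error at the given bit position."""
    damaged = bytearray(message)
    damaged[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(damaged)


message = frame(b"server-17 state=healthy")
print(unframe(message))  # b'server-17 state=healthy'

try:
    unframe(flip_bit(message, 13))
except ValueError as error:
    print(error)  # checksum mismatch: dropping corrupt message
```

A receiver that drops the message on mismatch, instead of gossiping it onward, is the behaviour that keeps one flipped bit from spreading across the cluster.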

- Re:It was a design defect (Score:5, Interesting)
  
  by James Youngman ( 3732 ) writes: <jay@nOspam.gnu.org> on Saturday July 26, 2008 @07:17PM (#24352351) Homepage
  
  Adler-32 wouldn't be a great choice. It's fast but it's weak for short messages [ietf.org] and I've seen it fooled by multi-bit errors on large messages too.
  See Koopman's paper 32-bit cyclic redundancy codes for Internet applications [ieee.org] for some better ideas.
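
The short-message weakness is easy to see for yourself. The low 16 bits of Adler-32 are just 1 plus the sum of the bytes (mod 65521), so for a short message they can never get large. The Python sketch below uses `zlib.adler32` on random 16-byte messages; the message length and trial count are arbitrary choices for illustration.

```python
import os
import zlib

MESSAGE_LENGTH = 16  # a short control message

largest_low_half = 0
for _ in range(10_000):
    checksum = zlib.adler32(os.urandom(MESSAGE_LENGTH))
    largest_low_half = max(largest_low_half, checksum & 0xFFFF)

# The low half is 1 + sum(bytes) mod 65521, so it cannot exceed 1 + 255 * 16.
print(largest_low_half, "<=", 1 + 255 * MESSAGE_LENGTH)
print(zlib.adler32(b"\xff" * MESSAGE_LENGTH) & 0xFFFF)  # 4081, the worst case
```

For 16-byte messages the top four bits of that half are always zero, so part of the 32-bit checksum carries no information at all.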
  
  - Re:It was a design defect (Score:2)
    
    by j. andrew rogers ( 774820 ) writes: on Saturday July 26, 2008 @08:00PM (#24352753)
    
    For these purposes, a CRC that is fast and weak is generally superior to one that is slow and strong because the CPU load of the implementation does have a measurable impact on system performance. Remember, the software CRC is supposed to catch failures in the other layers of CRC and error detection, so it does not have to be perfect. If it reduces the probability of an uncaught problem from a few times a year to a few times per millennium, that may be sufficient if using a stronger CRC means burning 10% of your total CPU time. Adler-32 is an example of an algorithm that is used in practice in preference to stronger, slower algorithms, hence why I mentioned it. If there is something as fast or faster that is also stronger, we should be using that instead.
    Many of the software checksums used are intentionally weak and fast to minimize the performance impact, though as systems get bigger they may need to use something stronger to keep the probability of an uncaught error within the acceptable safety margins. Fortunately, people are working on that problem.
    
    - Re:It was a design defect (Score:5, Interesting)
      
      by James Youngman ( 3732 ) writes: <jay@nOspam.gnu.org> on Saturday July 26, 2008 @08:22PM (#24352981) Homepage
      
      I've seen Adler-32 fail twice, so your assumption of "a few times per millennium" doesn't seem to work for me (I'm under 40). In fact in those cases it was even stacked on top of at least one other checksum mechanism, too.
      One of the problems with it is poor spreading of the input bits into s2. There are other algorithms which don't have that weakness but don't (IIRC) cost any more to compute.
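
For readers who have not looked inside Adler-32, here is a plain Python version that exposes the two running sums, s1 and s2, and checks itself against `zlib.adler32`. The function names are made up for this sketch, and the example at the end is only one reading of the "poor spreading" remark: a small change near the end of the input moves both sums by a small, predictable amount.

```python
import zlib

MOD_ADLER = 65521  # largest prime below 2**16


def adler32_parts(data: bytes):
    """Return the two running sums (s1, s2) that make up Adler-32."""
    s1, s2 = 1, 0
    for byte in data:
        s1 = (s1 + byte) % MOD_ADLER  # plain sum of the bytes
        s2 = (s2 + s1) % MOD_ADLER    # sum of the running s1 values
    return s1, s2


def adler32(data: bytes) -> int:
    """Combine the two sums the way Adler-32 does: s2 high, s1 low."""
    s1, s2 = adler32_parts(data)
    return (s2 << 16) | s1


data = b"state=ok"
print(adler32(data) == zlib.adler32(data))  # True

# Changing the last byte by +1 moves both sums by exactly 1.
print(adler32_parts(b"state=ok"))
print(adler32_parts(b"state=ol"))
```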
      
      - Re:It was a design defect (Score:2)
        
        by j. andrew rogers ( 774820 ) writes: on Saturday July 26, 2008 @08:41PM (#24353147)
        
        It does not surprise me at all that there are better algorithms out there or that Adler-32 failed, but the failure probability is compound. It only has to work a few times a year in many cases, and Adler-32 *is* used in some respectable systems as the error trap of last resort. Out of curiosity, not being a topical expert, what would be the best CRC algorithm to use on modern silicon? It is also worth noting that some places are starting to use cryptographic hashes instead of CRC for software checksum purposes, so perhaps this is all moot.
        
        
        - Re:It was a design defect (Score:2)
          
          by James Youngman ( 3732 ) writes: <jay@nOspam.gnu.org> on Sunday July 27, 2008 @05:49PM (#24361391) Homepage
          
          Honestly I'm not the best person to ask about that. But if I were choosing one without external help, I'd start by reading the Koopman paper. However, it contains no implementations AFAIK.
        
  - Re:It was a design defect (Score:2)
    
    by v1 ( 525388 ) writes: on Sunday July 27, 2008 @05:52PM (#24361409) Homepage Journal
    
    For those not familiar with checksums, the "ideal" checksum algorithm will toggle approximately 50% of the bits in the checksum for any one bit in the source material toggled. This makes it highly unlikely for several errors in an otherwise correct message to combine to produce the same checksum. Ideally, two messages that produce the same checksum should be wildly different, not nearly identical.
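
The "about 50% of the bits" property can be measured directly. The Python sketch below flips every input bit of a 32-byte message in turn and reports the average fraction of output bits that change, for Adler-32 and for SHA-256 (a cryptographic hash, included only as a reference point for near-ideal behaviour). The helper names and the test message are arbitrary choices for illustration.

```python
import hashlib
import zlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with one bit inverted."""
    damaged = bytearray(data)
    damaged[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(damaged)


def adler32_digest(data: bytes) -> bytes:
    return zlib.adler32(data).to_bytes(4, "big")


def sha256_digest(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def mean_toggled_fraction(digest, message: bytes) -> float:
    """Average share of output bits that change when one input bit flips."""
    baseline = int.from_bytes(digest(message), "big")
    output_bits = len(digest(message)) * 8
    total = 0
    for bit_index in range(len(message) * 8):
        changed = int.from_bytes(digest(flip_bit(message, bit_index)), "big")
        total += bin(baseline ^ changed).count("1")
    return total / (len(message) * 8 * output_bits)


message = bytes(range(32))
print("adler32:", mean_toggled_fraction(adler32_digest, message))
print("sha256: ", mean_toggled_fraction(sha256_digest, message))
```

On a short message like this, SHA-256 should land close to one half while Adler-32 comes in well below it, which lines up with the complaints about Adler-32 elsewhere in this thread.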
    
- Re:It was a design defect (Score:2)
  
  by ToasterMonkey ( 467067 ) writes: on Sunday July 27, 2008 @12:03AM (#24354647) Homepage
  
  I agree with you,
  particularly with consumer hardware which has many internal paths with no checksums at all.
  but, are there any hardware checksums AT ALL in the Intel PC architecture, aside from add-on cards such as HBAs which only protect external data?
  PCIe, and ECC ram at least do some kind of error correction, what, two and one bits worth respectively? I think that's it.
  This is one of the biggest differences between cheapo Intel hardware and proprietary stuff from Sun, Fujitsu, IBM, HP, etc.
  I can't really feel sorry for people who saved a few bucks implementing critical systems on cheap hardware and didn't bother implementing the proper levels of software error correction. They got what they paid for.
  
  - Re:It was a design defect (Score:2)
    
    by j. andrew rogers ( 774820 ) writes: on Sunday July 27, 2008 @12:40AM (#24354861)
    
    You are correct, Intel processors and chipsets have limited CRC in the internal pathways. In fact, if this matters to you, it is one of the areas where AMD's silicon is better, having more comprehensive error detection. HyperTransport, for example, has CRC.
    
- Re:It was a design defect (Score:2)
  
  by merreborn ( 853723 ) writes: on Sunday July 27, 2008 @10:34AM (#24357803) Journal
  
  It does not reflect well on the software community that most people *still* do not know to do this for very large scale system designs.
  This sort of knowledge gap exists in many arenas. A classic error in client-server software: never trust the client. So many "hacks" in online games boil down to the game server trusting the game client to obey the rules, when it's really the server's responsibility to handle all the rule enforcement.
  So, yes, software developers as a group frequently fail to learn from those who came before them. In part, this probably has to do with existing education -- it seems like most college programs are always years behind the curve. There also seems to be a large disconnect between the academic community and the rest; scientists can keep up to date on progress in the field reading academic journals and the like. Programmers, on the other hand, seem to disappear into corporations after college, where much of their knowledge becomes proprietary.
  The lack of efficient knowledge sharing in our field is causing us to repeat the same mistakes for decades on end.
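
A minimal Python sketch of the "never trust the client" rule for the game example: the server keeps the authoritative position and accepts a client-reported move only if it is possible under the rules. The speed limit, the `Player` class and `apply_move` are all hypothetical, invented for this illustration.

```python
from dataclasses import dataclass
import math

MAX_SPEED = 5.0  # hypothetical game rule: units a player may move per second


@dataclass
class Player:
    x: float
    y: float


def apply_move(player: Player, new_x: float, new_y: float, elapsed: float) -> bool:
    """Accept a client-reported position only if the rules allow it."""
    if elapsed <= 0:
        return False
    distance = math.hypot(new_x - player.x, new_y - player.y)
    if distance > MAX_SPEED * elapsed:
        return False  # client claims an impossible move: keep server state
    player.x, player.y = new_x, new_y
    return True


player = Player(0.0, 0.0)
print(apply_move(player, 3.0, 4.0, elapsed=1.0))    # True: 5 units in 1 s
print(apply_move(player, 50.0, 50.0, elapsed=1.0))  # False: teleport rejected
print(player)                                       # Player(x=3.0, y=4.0)
```

The client can still predict its own movement for responsiveness; the point is only that the server's copy of the state is the one that counts.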
  
  - Re:It was a design defect (Score:2)
    
    by petermgreen ( 876956 ) writes: <plugwashNO@SPAMp10link.net> on Sunday July 27, 2008 @04:42PM (#24360911) Homepage
    
    This sort of knowledge gap exists in many arenas. A classic error in client-server software: never trust the client. So many "hacks" in online games boil down to the game server trusting the game client to obey the rules, when it's really the server's responsibility to handle all the rule enforcement.
    Unfortunately for a game to be popular it needs to be responsive. Often that means the client doing some of the movement and hit calculations and then the server adjusting those to keep the world somewhat in sync rather than the game doing all the calculations.
    And of course it would be crazy to do all the rendering on the server so the client ends of
    OTOH more serious software certainly should not be trusting clients without a very good reason.
    
  - Re:It was a design defect (Score:1)
    
    by DiamondMX ( 1147759 ) writes: on Tuesday July 29, 2008 @03:45AM (#24381625)
    
    Modern games have rather a lot of rules, enough that it's inefficient to have the server run all of them.
    It's not a knowledge gap - everyone knows it already - it's a compromise for performance in a field (unlike online shopping) where performance > security.
    
It was the flood of disgruntled me.com users... (Score:1)

by jpellino ( 202698 ) writes: on Saturday July 26, 2008 @06:49PM (#24352099)

heading for s3.

It providers a timeline of events (Score:1, Funny)

by Anonymous Coward writes: on Saturday July 26, 2008 @07:22PM (#24352383)

It providers a timeline of events
It provideRS? PROVIDERS?!?
I'TS PROVIDED!

S3 (Score:1)

by sharperguy ( 1065162 ) writes: on Saturday July 26, 2008 @09:01PM (#24353307)

What Solid Snake Simulation?

Always happens on a Sunday (Score:2, Interesting)

by sebastiengiroux ( 1333579 ) writes: on Saturday July 26, 2008 @09:05PM (#24353345) Homepage

Anybody else noticed that these major problems always occur when no one is around to fix them?

- Re:Always happens on a Sunday (Score:4, Informative)
  
  by innerweb ( 721995 ) writes: on Saturday July 26, 2008 @10:41PM (#24354031)
  
  After working in a production environment as a developer, I can assure you that the correct interpretation is "Anybody else noticed that these major problems always occur when no one is around to catch them (before they get out of hand)?"
  InnerWeb
  
    - Re:Always happens on a Sunday (Score:2)
      
      by innerweb ( 721995 ) writes: on Sunday July 27, 2008 @10:35PM (#24363529)
      
      That's funny, I did not catch that double meaning. It was a production environment. Document production, and I was one of the developers who wrote software to keep it going along. So, I developed in a production environment.
      InnerWeb
      
Byzantine failure (Score:3, Interesting)

by ge ( 12698 ) writes: on Sunday July 27, 2008 @10:15AM (#24357655)

So the whole cloud is in trouble if one node starts spewing nonsense? So much for redundancy. Amazon developers would be well advised to read up on the "Byzantine Generals" problem.

- Re:Byzantine failure (Score:2)
  
  by hansamurai ( 907719 ) writes: <hansamurai@gmail.com> on Monday July 28, 2008 @01:40PM (#24371849) Homepage Journal
  
  Yeah, they should have bought a book on it from half.com or something.
  
A customer's response (Score:2)

by Revvy ( 617529 ) writes: on Sunday July 27, 2008 @03:06PM (#24360207) Homepage

I work for a company that uses Amazon S3 for our customers' data storage for the same reasons that many other companies do - they're reliable and inexpensive. We have a couple hundred terabytes of data stored on Amazon's servers and, aside from this one instance, we haven't had a major problem in three years.

Because we're in Seattle and a few blocks from Amazon's headquarters, we got a personal visit last week from one of the senior managers of Amazon's hosted platforms group. In addition to being able to ask him all kinds of great questions about how they do their business and what technologies they employ that we could also use, we got to ask him about what happened.

He was completely open and honest about it. He knew that we, like every other Amazon S3 customer, had suffered and that some of us had lost a Sunday to dealing with customer complaints. He apologized and told us that they were taking steps to make sure it wouldn't happen again.

Amazon has handled this very well and we will continue to be a customer of theirs.

---
Three nines allows for over eight hours of downtime a year.
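
For reference, the downtime budget for a given number of nines is simple arithmetic. The Python sketch below (the function name is made up for illustration, and a 365-day year is assumed) suggests that "over eight hours" matches three nines (99.9%), while five nines works out to roughly five minutes a year.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(nines: int) -> float:
    """Allowed downtime per year for an availability of 'nines' nines."""
    unavailability = 10.0 ** -nines  # e.g. 3 nines -> 0.001
    return HOURS_PER_YEAR * unavailability


for nines in range(2, 6):
    hours = downtime_hours_per_year(nines)
    print(f"{nines} nines: {hours:.3f} hours ({hours * 60:.1f} minutes)")
```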

Gossip protocol decoded (Score:1)

by csb ( 23046 ) writes: on Sunday July 27, 2008 @07:50PM (#24362287)

"The server is down, purple monkey dishwasher"

