Lessons From Recent Power Outages

Wed Jul 18 16:55:01 CDT 2012

Uh, isn't blaming Amazon for one part of their "cloud" going away missing the point?  Some who provide cloud-hosted applications have chosen to do so without architecting sufficient diversity and robustness into their platforms.  Setting aside the specific power system problems in this incident, what would happen if an earthquake, tornado, bombing, plane crash or other catastrophic event took that data center offline?  This isn't a "cloud" problem; data center customers with their own dedicated servers still have to deal with interruptions and outages.  In fact, in the Amazon EC2 "cloud" case, Amazon already partitions its cloud resources into multiple availability zones so that application developers and operators can get compute resources deployed in diverse sets of facilities.

It could be that the applications in question have availability goals that can tolerate occasional outages rather than demanding 99.999% availability.  But higher availability carries a big cost, either in $$$ or in design and architecture.  I know this from the work we did at Vonage, where the architecture is highly distributed geographically.  It's nice to know that the loss of a data center isn't going to put you out of business (or kill people because your E911 platform stopped working).  The sphincter doesn't clench near as tight, though now you have to worry about losing ANOTHER facility.
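
To put a number on what 99.999% actually demands, here is a quick back-of-the-envelope sketch.  The function name and the list of targets are just for illustration, and it assumes a 365-day year:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; ignores leap years


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Allowed downtime per year, in minutes, for an availability target."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


# Compare a few common targets, from "two nines" to "five nines".
for target in (99.0, 99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(target)
    print(f"{target}% -> {minutes:.1f} minutes/year")
```

Five nines leaves you a bit over five minutes of downtime a year, which is why an outage measured in days only fits a much looser availability goal.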

I'm sure there will be many lessons learned by the guys running the power infrastructure for that Amazon data center.  But the more important lessons ought to be learned by those who didn't give serious thought to failure modes and availability, even though the tools are there to give you the diversity knobs you would want, if you were interested in actually using them.  The really paranoid would probably deploy across multiple cloud vendors to avoid a certain set of common-mode failures, and arrange for diverse connectivity for the same reasons.

Amazon will surely get spanked, but there's plenty of blame to go around.  It used to be that system architectures for robust systems could be somewhat simpler: the "HOT" site and the "STANDBY" site that you synced data to, and failed over to when Something Bad happened.  Elastic cloud infrastructure brings you more degrees of freedom and more building blocks.  The building blocks themselves are different from having your own dedicated infrastructure (which you can still go and buy), but they enable a different cost profile that demands another look at how you achieve robustness.

This isn't new.  At Vonage, we built a telephone infrastructure out of commodity-grade Linux boxes, as compared to the very expensive, 99.999%-availability platforms that the Phone Company used before.  You can build really reliable platforms out of unreliable components by having the right architecture.  We didn't need a Linux box with 99.999% availability because there was a pool of boxes, and if 2 out of 4 were running at any given time, everything was fine.  The old approach required ONE REALLY RELIABLE, REALLY EXPENSIVE box that never went down.  It's a case of choosing to solve the problem with different tools that were not available years ago.
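
The 2-out-of-4 arithmetic is easy to check.  This is only a sketch: the 99% per-box figure is a made-up number for illustration (not a measured Vonage value), and it assumes the boxes fail independently of one another:

```python
from math import comb


def pool_availability(n: int, k: int, p: float) -> float:
    """Probability that at least k of n boxes are up.

    Assumes each box is up with probability p, independently of the others.
    """
    return sum(
        comb(n, up) * p**up * (1 - p) ** (n - up)
        for up in range(k, n + 1)
    )


# Hypothetical: four commodity boxes, each only 99% available.
print(f"{pool_availability(4, 2, 0.99):.6%}")
```

Under those assumptions the pool comes out at roughly 99.9996%, better than five nines, from boxes that are individually only two nines.  The independence assumption is the catch: put all four boxes in one data center and a single facility failure takes out the whole pool at once.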

This wasn't a case of the elastic stretching too far.  It's really closer to having 6 redundant components and choosing to deploy into only one of them because it's easier.  That's not using the tools available to you.  We have all sorts of tools now, with the ability to deploy into half a dozen data centers around the world without having to get on an airplane or even ship boxes.  What excuse do application providers have when the one and only instance goes down?  What happens when an airplane falls out of the sky into your data center?  Ever seen the office park in Ashburn where this stuff is?  As I recall, it's pretty much under the pattern for IAD.  It's not just power that could have failed.
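
To make the one-facility-versus-several point concrete, here is a toy sketch.  The zone names and helper functions are hypothetical; nothing here calls a real cloud API, it just counts what survives when the worst-placed facility disappears:

```python
from collections import Counter
from itertools import cycle


def spread(instance_count, zones):
    """Round-robin placement: one zone name per instance."""
    zone_cycle = cycle(zones)
    return [next(zone_cycle) for _ in range(instance_count)]


def worst_case_survivors(placement):
    """Instances still running after losing the most heavily loaded zone."""
    if not placement:
        return 0
    per_zone = Counter(placement)
    return len(placement) - max(per_zone.values())


zones = ["zone-a", "zone-b", "zone-c"]  # hypothetical zone names

all_in_one = spread(6, zones[:1])  # the easy way: everything in one place
diverse = spread(6, zones)         # two instances in each of three zones

print("single zone:", worst_case_survivors(all_in_one))
print("three zones:", worst_case_survivors(diverse))
```

Same six instances either way.  Spread across three zones, losing any one facility still leaves four running; parked in a single zone, losing that facility leaves none.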

louie
wa3ymh

On Jul 18, 2012, at 5:13 PM, Terry Fox wrote:

> Um, isn’t blaming the grid kind of burying your head in the sand?  If they are supposed to have such a fantastic and reliable Cloud, shouldn’t their cloud support include on-site backup power facilities?  And there should not be any “secret” or “unknown” closets of fiber that nobody knows about until Virginia Power/Pepco goes bye-bye.
>  
> This reminds me of the story about the American University (or was it Howard?) facility that installed a new generator.  When Pepco died, their generator promptly came on, ran for a few hours, then died.  Somehow the pump that fills the day tank from the main tank was wired to Pepco instead of the generator, and the small day tank ran dry.  Oops.
>  
> I guess the Elastic got stretched too thin, and things did not compute anymore.  All this because the virtual cloud got bumped offline by a few very real clouds.
>  
> In the old days, this would be a true “teaching moment”, bordering on a firing offense.  Being offline for an hour or two could be excusable, but for days?
> Terry, WB4JFI
>  
>  
> From: Andre Kesteloot
> Sent: Wednesday, July 18, 2012 2:54 PM
> To: Tacos
> Subject: Lessons From Recent Power Outages
>  
>  
> 
> 
> 
>> From: "IEEE 
>> Subject: Lessons From Recent Power Outages
>> 
>> 
>> News and opinions on sustainable power, cars, and climate	July 18, 2012
>> 
>> Lessons From Recent Power Outages
>> by Bill Sweet
>> The cloud might be above it all, but the stuff upon which it rests clearly isn’t. On 6 July, a fast-moving band of severe thunderstorms left 750 000 people without power in Northern Virginia and took out Amazon’s Elastic Compute Cloud server facility. This local weather event left Amazon cloud customers such as Netflix, Instagram, Pinterest, and Heroku without access to their databases for days, and made these services unavailable to Web users around the globe. Observers have rightly asked what can be done to improve the grid so that virtual systems aren’t so vulnerable to real-world events. Researchers are already on the case, with software for improved monitoring of transmission lines and a big smart-grid pilot project that will test networking, communication, and distribution-management tools in an effort to speed up identification of problems and the dispatch of technicians to trouble spots.
>> 
>> ENERGY NEWS
>> 
>> Outage Recovery and Market Manipulation Are Still Problems
>> The thin-stretched grid is still stretched pretty thin
>> 
>> 
>> Smart Conservation for the Lazy Consumer
>> People aren't conserving energy for love or money--you have to trick them into it
>> 
>> 
>> Jim Rogers, Duke and the Future of Nuclear
>> Several Southeast reactor projects hang in the balance
>> 
>> 
>> Japan Restarts Nuclear Reactor as Report Lays Blame for Fukushima
>> Commission calls the nuclear crisis a "profoundly manmade disaster."
>> 
>> 
>> Fukushima Nuclear Accident: The Earthquake Question
>> Government report questions TEPCO's assertions that tsunami caused all damage
>> 
>> 
>> U.S. Weather Extremes and Climate Change
>> "We're going outside the realm of conditions previously experienced"
>> 
>> This email was sent by IEEE  |  3 Park Avenue New York, NY 10016 USA
>> 
>> © 2012 IEEE - All rights reserved.
> 
> _______________________________________________
> Tacos mailing list
> Tacos at amrad.org
> https://amrad.org/mailman/listinfo/tacos
