Design for Failure

Reprint from George Reese (O’Reilly)

So many cloud pundits are piling on to the misfortunes of Amazon Web Services this week as a response to the massive failures in the AWS Virginia region. If you think this week exposed weakness in the cloud, you don’t get it: it was the cloud’s shining moment, exposing the strength of cloud computing.

In short, if your systems failed in the Amazon cloud this week, it wasn’t Amazon’s fault. You either deemed an outage of this nature an acceptable risk or you failed to design for Amazon’s cloud computing model. The strength of cloud computing is that it puts control over application availability in the hands of the application developer and not in the hands of your IT staff, data center limitations, or a managed services provider.

The AWS outage highlighted the fact that, in the cloud, you control your SLA in the cloud—not AWS.

The Dueling Models of Cloud Computing

Until this past week, there’s been a mostly silent war ranging out there between two dueling architectural models of cloud computing applications: “design for failure” and traditional. This battle is about how we ultimately handle availability in the context of cloud computing.

The Amazon model is the “design for failure” model. Under the “design for failure” model, combinations of your software and management tools take responsibility for application availability. The actual infrastructure availability is entirely irrelevant to your application availability. 100% uptime should be achievable even when your cloud provider has a massive, data-center-wide outage.

Most cloud providers follow some variant of the “design for failure” model. A handful of providers, however, follow the traditional model in which the underlying infrastructure takes ultimate responsibility for availability. It doesn’t matter how dumb your application is, the infrastructure will provide the redundancy necessary to keep it running in the face of failure. The clouds that tend to follow this model are vCloud-based clouds that leverage the capabilities of VMware to provide this level of infrastructural support.

The advantage of the traditional model is that any application can be deployed into it and assigned the level of redundancy appropriate to its function. The downside is that the traditional model is heavily constrained by geography. It would not have helped you survive this level of cloud provider (public or private) outage.

The advantage of the “design for failure” model is that the application developer has total control of their availability with only their data model and volume imposing geographical limitations. The downside of the “design for failure” model is that you must “design for failure” up front.

The Five Levels of Redundancy

In a cloud computing environment, there are five possible levels of redundancy:

  • Physical
  • Virtual resource
  • Availability zone
  • Region
  • Cloud

When I talk about redundancy, I’m talking about a level of redundancy that enables you to survive failures with zero downtime. You have the redundancy that simply lets the system keep moving when faced with failures.

Physical redundancy encompasses all traditional “n+1” concepts: redundant hardware, data center redundancy, the ability to do vMotion or equivalents, and the ability to replicate an entire network topology in the face of massive infrastructural failure.

Traditional models end at physical redundancy. “Design for failure” doesn’t care about physical redundancy. Instead, it allocates redundant virtual resources like virtual machines so that the failure of the underlying infrastructure supporting one virtual machine doesn’t impact the operations of the other unless they are sharing the failed infrastructural component.

The fault tolerance of virtual redundancy generally ends at the cluster/cabinet/data center level (depending on your virtualization topology). To achieve better redundancy, you spread your virtualization resources across multiple availability zones. At this time, I believe only Amazon gives you full control over your availability zone deployments. When you have redundant resources across multiple availability zones, you can survive the complete loss of (n-1) availability zones (where n is the number of availability zones in which you are redundant).

Until this week, no one has needed anything more than availability zone redundancy. If you had redundancy across availability zones, you would have survived every outage suffered to date in the Amazon cloud. As we noted this week, however, an outage can take out an entire cloud region.

Regional redundancy enables you to survive the loss of an entire cloud region. If you had regional redundancy in place, you would have come through the recent outage without any problems except maybe an increased workload for your surviving virtual resources. Of course, regional redundancy won’t let you survive business failures of your cloud provider.

Cloud redundancy enables you to survive the complete loss of a cloud provider.

Applied “Design for Failure”

In presentations, I refer to the “design for failure” model as the AWS model. AWS doesn’t have any particular monopoly on this model, but their lack of persistent virtual machines pushes this model to its extreme. Actually, best practices for building greenfield applications in most clouds fit under this model.

The fundamental principle of “design for failure” is that the application is responsible for its own availability, regardless of the reliability of the underlying cloud infrastructure. In other word, you should be able to deploy a “design for failure” application and achieve 99.9999% uptime (really, 100%) leveraging any cloud infrastructure. It doesn’t matter if the underlying infrastructural components have only a 90% uptime rating. It doesn’t matter if the cloud has a complete data center meltdown that takes it entirely off the Internet.

There are several requirements for “design for failure”:

  • Each application component must be deployed across redundant cloud components, ideally with minimal or no common points of failure
  • Each application component must make no assumptions about the underlying infrastructure—it must be able to adapt to changes in the infrastructure without downtime
  • Each application component should be partition tolerant—in other words, it should be able to survive network latency (or loss of communication) among the nodes that support that component
  • Automation tools must be in place to orchestrate application responses to failures or other changes in the infrastructure (full disclosure, I am CTO of a company that sells such automation tools, enStratus)

Applications built with “design for failure” in mind don’t need SLAs. They don’t care about the lack of control associated with deploying in someone else’s infrastructure. By their very nature, they will achieve uptimes you can’t dream of with other architectures and survive extreme failures in the cloud infrastructure.

Let’s look at a design for failure model that would have come through the AWS outage in flying colors:

  • Dynamic DNS pointing to elastic load balancers in Virginia and California
  • Load balancers routing to web applications in at least two zones in each region
  • A NoSQL data store with the ring spread across all web application availability zones in both Virginia and California
  • A cloud management tool (running outside the cloud!) monitoring this infrastructure for failures and handling reconfiguration

Upon failure, your California systems and the management tool take over. The management tool reconfigures DNS to remove the Virginia load balancer from the mix. All traffic is now going to California. The web applications in California are stupid and don’t care about Virginia under any circumstance, and your NoSQL system is able to deal with the lost Virginia systems. Your cloud management tool attempts to kill off all Virginia resources and bring up resources in California to replace the load.

Voila, no humans, no 2am calls, and no outage! Extra bonus points for “bursting” into Singapore, Japan, Ireland, or another cloud! When Virginia comes back up, the system may or may not attempt to rebalance back into Virginia.

Relational Databases

OK, so I neatly sidestepped the issue of relational databases. Things are obviously not so clean with relational database systems and the NoSQL system almost certainly would have lost some minimal amounts of data in the cut-over. If that data loss is acceptable, you better not be running a relational database system. If it is not acceptable, then you need to be running a relational database system.

A NoSQL database (and I hate the term NoSQL with the passion of a billion white hot suns) trades off data consistency for something called partition tolerance. The layman’s description of partition tolerance is basically the ability to split your data across multiple, geographically distinct partitions. A relational system can’t give you that. A NoSQL system can’t give you data consistency. Pick your poison.

Sometimes that poison must be a relational database. And that means we can’t easily partition our data across California and Virginia. You now need to look at several different options:

  • Master/slave across regions with automated slave promotion using your cloud management tool
  • Master/slave across regions with manual slave promotion
  • Regional data segmentation with a master/master configuration and automated failover

There are likely a number of other options depending on your data model and DBA skillset. All of them involve potential data loss when you recover systems to the California region, as well as some basic level of downtime. All, however, protect your data consistency during normal operations—something the NoSQL option doesn’t provide you. The choice of automated vs. manual depends on whether you want a human making data loss acceptance decisions. You may particularly want a human involved in that decision in a scenario like what happened this week because only a human really can judge, “How confident am I that AWS will have the system up in the next (INSERT AN INTERVAL HERE)?”

The Traditional Model

As its name implies, the “design for failure” requires you to design for failure. It therefore significantly constrains your application architecture. While most of these constraints are things you should be doing anyways, most legacy applications just aren’t built that way. Of course, “design for failure” is also heavily biased towards NoSQL databases which often are not appropriate in an enterprise application context.

The traditional model will support any kind of application, even a “design for failure” application. The problem is that it’s often harder to build “design for failure” systems on top of the traditional model because most current implementations of the traditional model simply lack the flexibility and tools that make “design for failure” work in other clouds.

Control, SLAs, Cloud Models, and You

When you make the move into the cloud, you are doing so exactly because you want to give up control over the infrastructure level. The knee-jerk reaction is to look for an SLA from your cloud provider to cover this lack of control. The better reaction is to deploy applications in the cloud designed to make your lack of control irrelevant. It’s not simply an availability issue; it also extends to other aspects of cloud computing like security and governance. You don’t need no stinking SLA.

As I stated earlier, this outage highlights the power of cloud computing. What about Netflix, an AWS customer that kept on going because they had proper “design for failure”? Try doing that in your private IT infrastructure with the complete loss of a data center. What about another AWS/enStratus startup customer who did not design for failure, but took advantage of the cloud DR capabilities to rapidly move their systems to California? What startup would ever have been able to relocate their entire application across country within a few hours of the loss of their entire data center without already paying through the nose for it?

These kinds of failures don’t expose the weaknesses of the cloud—they expose why the cloud is so important.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s