Netflix on AWS Outage

Courtesy:  Netflix

Lessons Learned
This outage gave us some valuable experience and helped us to identify and prioritize several ways that we should improve our systems to make it more resilient to failures.

Create More Failures
Currently, Netflix uses a service called “Chaos Monkey” to simulate service failure. Basically, Chaos Monkey is a service that kills other services. We run this service because we want engineering teams to be used to a constant level of failure in the cloud. Services should automatically recover without any manual intervention. We don’t however, simulate what happens when an entire AZ goes down and therefore we haven’t engineered our systems to automatically deal with those sorts of failures. Internally we are having discussions about doing that and people are already starting to call this service “Chaos Gorilla”.

Automate Zone Fail Over and Recovery
Relying on multiple teams using manual intervention to fail over from a failing AZ simply doesn’t scale. We need to automate this process, making it a “one click” operation, that can be invoked if required.

Multiple Region Support
We are currently re-engineering our systems to work across multiple AWS regions as part of the Netflix drive to support streaming in global markets. As AWS launches more regions around the world, we want to be able to migrate the support for a country to a newly launched AWS region that is closer to our customers. This is essentially the same operation as a disaster recovery migration from one region to another, so we would have the ability to completely vacate a region if a large scale outage occurred.

Avoid EBS Dependencies
We had already decided that EBS performance was an issue, so (with one exception) we have avoided it as a primary data store. With two outages so far this year we are taking steps to further reduce dependencies by switching our stateful instances to S3 backed AMIs. These take a lot longer to create in our build process, but will make our Cassandra storage services more resilient, by depending purely on internal disks. We already have plans to migrate our one large MySQL EBS based service to Cassandra.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s