Courtesy: Adrian Cockcroft
There has been a lot of discussion in the last few days about EBS since it was implicated in a long outage at reddit.com.
Rule of Thumb
The benchmarking Netflix did when we started on AWS highlighted some inconsistent behavior in EBS. The conclusion we reached is a rule of thumb for EBS: if you sustain less than 100 iops (I/O operations per second) as a long term average, it works fine. Short term bursts can reach 1000 iops. By short term I mean less than a minute, and by long term more than 10 minutes. YMMV.
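To make the rule of thumb concrete, here is a minimal Python sketch that checks a series of one-second iops samples against it. The 100 and 1000 iops limits and the one minute and ten minute boundaries come from the text above. The function names, the sliding-window approach, and treating any run above 100 iops lasting a minute or more as "too long" are my own assumptions about how to turn the rule into code.

```python
# Limits from the rule of thumb; the window handling is an assumption.
LONG_TERM_LIMIT_IOPS = 100   # sustained average should stay below this
BURST_LIMIT_IOPS = 1000      # short bursts can reach this
LONG_WINDOW_S = 600          # "long term": more than 10 minutes
SHORT_WINDOW_S = 60          # "short term": less than a minute


def rolling_means(samples, window):
    """Yield the mean of every full sliding window over the samples."""
    if len(samples) < window:
        return
    total = sum(samples[:window])
    yield total / window
    for i in range(window, len(samples)):
        total += samples[i] - samples[i - window]
        yield total / window


def longest_run_above(samples, threshold):
    """Length of the longest run of consecutive samples above threshold."""
    longest = current = 0
    for value in samples:
        current = current + 1 if value > threshold else 0
        longest = max(longest, current)
    return longest


def check_rule_of_thumb(samples):
    """Return the list of ways a one-second iops series breaks the rule."""
    problems = []
    means = rolling_means(samples, LONG_WINDOW_S)
    if any(m >= LONG_TERM_LIMIT_IOPS for m in means):
        problems.append("sustained")
    if max(samples, default=0) > BURST_LIMIT_IOPS:
        problems.append("burst_too_high")
    if longest_run_above(samples, LONG_TERM_LIMIT_IOPS) >= SHORT_WINDOW_S:
        problems.append("burst_too_long")
    return problems
```

A workload of 50 iops with an occasional 30 second burst at 900 iops passes, while a steady 150 iops is flagged as both sustained and too long.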
If you are doing benchmarks like this, collect response time and throughput and plot your data over time. You need to run long enough that the performance shows steady state behavior. The problem with EBS is that it doesn’t have a particularly steady state. To explain why, we need to look at the underlying architecture. I don’t know the details of how EBS is implemented, but there is enough information available to explain how it behaves.
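As a rough illustration of collecting response time and throughput over time, the Python sketch below buckets raw benchmark samples into fixed windows and applies a simple steadiness test. The input format (timestamp in seconds, latency in milliseconds), the function names, and the coefficient-of-variation threshold are all assumptions for illustration, not how Netflix ran its benchmarks.

```python
from statistics import mean, pstdev


def window_summaries(samples, window_s):
    """Group (timestamp_s, latency_ms) samples into fixed-size windows.

    Returns a list of (window_start_s, ops_per_second, mean_latency_ms),
    sorted by time, ready to plot.
    """
    buckets = {}
    for timestamp_s, latency_ms in samples:
        start = int(timestamp_s // window_s) * window_s
        buckets.setdefault(start, []).append(latency_ms)
    return [
        (start, len(latencies) / window_s, mean(latencies))
        for start, latencies in sorted(buckets.items())
    ]


def looks_steady(summaries, last_n=5, max_cv=0.1):
    """Heuristic: the last few windows have similar mean latency.

    Uses the coefficient of variation (stddev / mean) of the per-window
    mean latency. Both last_n and max_cv are arbitrary assumed defaults.
    """
    if len(summaries) < last_n:
        return False
    recent = [latency for _, _, latency in summaries[-last_n:]]
    average = mean(recent)
    return average > 0 and pstdev(recent) / average <= max_cv
```

If looks_steady never returns True however long you run, that is itself the result: the volume does not have a steady state under that load.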
The AWS EC2 architecture is built out of commodity low cost servers that have a single 1 Gbit network, a few CPUs, a few disks and a few GBytes of RAM. Over time the models have changed, and EC2 does have a 10Gbit network option now, but for the purposes of this discussion we will concentrate on the 1Gbit network models. Individual servers are virtualized into the familiar EC2 models by slicing up the RAM, CPUs and disk space, and sharing the network bandwidth and disk iops. When EC2 instances break or are de-configured, any data on the internal disks is lost.
Elastic Block Store http://aws.amazon.com/ebs/
The AWS EBS service provides a reliable place to store data that doesn’t go away when EC2 instances are dropped, but it provides the same mounted filesystem capability as the internal disks. If you need more disk space or iops you can mount more EBS volumes on a single EC2 instance and spread out the load. The EBS volume is connected to the EC2 instance over the same 1Gbit network as everything else. In a datacenter this would normally be built using commercially available high end storage from NetApp, EMC or whoever; it would be quite expensive (costing much more than the EC2 instance itself) and be fast and reliable up to the limits of the network. To build a low cost cloud, the alternative is to use RAIN (Redundant Array of Inexpensive Nodes), which could be based on standard EC2 instances or on variants that have more disks per CPU. Software is then used to coordinate the RAIN systems and provide an EBS service that will be slower than high end storage, but still be very reliable and be limited by the 1Gbit network.
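The interaction between spreading load over several volumes and the shared 1Gbit link can be sketched with some back-of-envelope arithmetic. The Python below is a simplification under stated assumptions: it ignores protocol overhead and all the other traffic on the link, and the function names and the per-volume iops figure passed in are illustrative, not measured EBS limits.

```python
def network_ceiling_iops(link_gbit, io_size_bytes):
    """Upper bound on iops if the whole link carried nothing but this I/O.

    Assumes 1 Gbit = 1e9 bits and ignores protocol overhead, so the real
    ceiling is lower.
    """
    bytes_per_second = link_gbit * 1e9 / 8
    return bytes_per_second / io_size_bytes


def striped_iops(volumes, per_volume_iops, link_gbit, io_size_bytes):
    """Aggregate iops from several volumes, capped by the network link."""
    ceiling = network_ceiling_iops(link_gbit, io_size_bytes)
    return min(volumes * per_volume_iops, ceiling)


# Four volumes at a sustained 100 iops each with 16 KiB I/Os are nowhere
# near the 1Gbit ceiling, so adding volumes adds iops.
small_io = striped_iops(4, 100, 1, 16 * 1024)

# With large 1 MiB I/Os the link, not the volume count, is the limit.
large_io = striped_iops(4, 100, 1, 1024 * 1024)
```

With small I/Os the per-volume behavior dominates, which is why mounting more volumes helps. With large sequential I/Os the link saturates first and extra volumes buy nothing.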
S3 and Availability Zones
AWS also has an S3 storage service that behaves like a key/value store accessed via HTTP requests and a REST API rather than a directly mounted filesystem. It is possible to rapidly snapshot an EBS volume to and from S3, including incremental backups and restores that fill as they go so you don’t have to wait before using them. This implies to me that they share a common back-end infrastructure to some extent. The main difference is that EBS volumes only exist in a single AWS Availability Zone, and S3 data is replicated across two or three Availability Zones. It takes longer to replicate the data for S3, so it is slower, but it is very robust and it is almost impossible to lose data. You can think of an Availability Zone as a complete datacenter. All the zones in a region are separate datacenters that are close enough together to support a high bandwidth and low latency network between them, but they have separate power sources and connections to the Internet.
The most efficient chunk of compute and storage resource to buy and deploy when building a cloud is either too big or too small for the actual use cases of real applications. Virtualization is used to sub-divide the chunks, but then each individual machine is supporting several independent tenants. For local disks, the space is divided between the tenants, and for network, everyone is sharing the same 1Gbit interface. This works well on average, because most use cases aren’t network or disk bound, but you cannot control who you are sharing with and some of the time you will be impacted by the other tenants, increasing variance within each EC2 instance. You can minimize the variance by running on the biggest instance type, e.g. m1.xlarge, or m2.4xlarge. In this case there isn’t room for another big tenant, so you get as much as possible of the disk space and network bandwidth to yourself. The virtualization layer reserves some of the capacity. It’s possible to tell that another tenant is keeping the CPU busy by looking at the “stolen time”, but there are no metrics for stolen iops or network bandwidth.
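On a Linux guest, the "stolen time" mentioned above shows up as the steal counter on the aggregate cpu line of /proc/stat, the eighth value after the "cpu" label. The Python sketch below computes the steal percentage between two snapshots of that line. The function names are my own, and the sample lines in the usage are made-up numbers rather than real measurements.

```python
# Order of the first eight counters on the "cpu" line of /proc/stat.
FIELDS = ("user", "nice", "system", "idle",
          "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a dict of ticks."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    if len(values) < len(FIELDS):
        raise ValueError("kernel does not report a steal counter")
    return dict(zip(FIELDS, values))


def steal_percent(before, after):
    """Percentage of CPU time stolen between two parsed snapshots."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    return 100.0 * deltas["steal"] / total if total else 0.0


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat (Linux only)."""
    with open(path) as stat_file:
        return stat_file.readline()


# Usage with two invented snapshots taken some seconds apart.
first = parse_cpu_line("cpu  100 0 50 800 10 0 0 40 0 0")
second = parse_cpu_line("cpu  150 0 70 1500 20 0 0 260 0 0")
stolen = steal_percent(first, second)
```

There is no equivalent counter to read for disk iops or network bandwidth, which is why that contention is harder to spot.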
The EBS service is also multi-tenant. Many clients mount disk space from a common backend pool of EBS disks. You don’t get to see how the disk space is allocated, or how data is replicated over more than one disk or instance for durability, but it is limited to that availability zone. A busy client can slow down other clients that share the same EBS service resources. EBS volumes are between 1GB and 1TB in size. If you allocate a 1TB volume, you reduce the amount of multi-tenant sharing that is going on for the resources you use, and you get more consistent performance. Netflix uses this technique: our high traffic EBS volumes are mostly 1TB, although we don’t need that much space.
This is actually no different in principle to large shared storage area network (SAN) backends (from companies like EMC or NetApp) that are in common datacenter use. Those also have unpredictable performance when pushed hard, and they mask this issue with lots of battery backed memory. The difference is cost. EBS is 10c per Gbyte per month. If you build a competing public cloud service using high end storage, you could get better performance but your cost base would be far higher.
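The cost of allocating a full 1TB volume for consistency rather than for space can be worked out from the 10c per Gbyte per month figure. The Python sketch below does the arithmetic in whole cents. Treating 1TB as 1024 GB, the 100 GB "actually needed" figure, and the function names are assumptions for illustration, and only the storage charge is counted.

```python
PRICE_CENTS_PER_GB_MONTH = 10   # 10c per Gbyte per month, from the text
MAX_VOLUME_GB = 1024            # assumes 1TB = 1024 GB


def monthly_cost_cents(size_gb):
    """Monthly storage charge in cents for a volume of the given size."""
    return size_gb * PRICE_CENTS_PER_GB_MONTH


def consistency_premium_cents(needed_gb, allocated_gb=MAX_VOLUME_GB):
    """Extra monthly cost of allocating more space than the data needs."""
    return monthly_cost_cents(allocated_gb) - monthly_cost_cents(needed_gb)


# A hypothetical workload that only needs 100 GB but uses a 1TB volume.
full_volume = monthly_cost_cents(MAX_VOLUME_GB)   # 10240 cents
premium = consistency_premium_cents(100)          # 9240 cents
```

Under these assumptions a 1TB volume costs about $102 a month, of which about $92 buys reduced sharing rather than space.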