There are three major lessons about IaaS we’ve learned from this experience:
1) Spreading across multiple availability zones in a single region does not provide as much partitioning as we thought. Therefore, we’ll be taking a hard look at spreading to multiple regions. We’ve explored this option many times in the past – not for availability reasons, but for customers wishing to have their infrastructure physically closer to them for latency or legal reasons. We’ve always prioritized it below other ways we could spend our time. It’s a big project, and it will inescapably require pushing more configuration options out to users (for example, pointing your DNS at a router chosen by geographic homing) and to add-on providers (latency-sensitive services will need to run in all the regions we support, and find some way to propagate region information between the app and the services). These are non-trivial concerns, but now that we have such dramatic evidence of multi-region’s impact on availability, we’ll be treating it as a much higher priority.
2) Block storage is not a cloud-friendly technology. EC2, S3, and other AWS services have grown much more stable, reliable, and performant over the four years we’ve been using them. EBS, unfortunately, has not improved much, and in fact has possibly gotten worse. Amazon employs some of the best infrastructure engineers in the world: if they can’t make it work, then probably no one can. Block storage has physical locality that can’t easily be transferred, which is fundamentally at odds with the cloud model. With this information in hand, we’ll be taking a hard look at how to reduce our dependence on EBS.
3) Continuous database backups for all. One reason we were able to recover the dedicated databases more quickly is the way we back them up. The new Heroku PostgreSQL service has a continuous backup mechanism that allows for automated recovery of databases. Once we were able to provision new instances, we took advantage of this to quickly recover the dedicated databases that were down with EBS problems.
We have been porting this continuous backup system to our shared database servers for some time and were finishing up testing at the time of the outage. We previously relied on point-in-time backups of individual databases in the event of a failure, rather than the continuous full-server backups that the new system uses. We are in the process of rolling out this updated backup system to all of our shared database servers; it’s already running on some of them, and we are aiming to have it deployed to the remainder of our fleet in the next two weeks.
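The post doesn’t describe the mechanism in detail, but continuous backup for PostgreSQL is typically built on write-ahead-log (WAL) archiving: every completed WAL segment is copied off the server as it is produced, so a lost database can be restored from the most recent base backup and rolled forward to the moment of failure. A minimal sketch of that setup, using PostgreSQL’s standard archiving settings rather than Heroku’s actual tooling (paths are illustrative):

```
# postgresql.conf -- archive every completed WAL segment as it is written
archive_mode = on
archive_command = 'cp %p /backups/wal/%f'   # illustrative; production would copy to durable, off-host storage

# recovery.conf -- on a replacement server, restore the latest base backup,
# then replay the archived WAL up to the point of failure
restore_command = 'cp /backups/wal/%f %p'
```

Because the WAL is shipped continuously, recovery needs only a recent base backup plus the archived segments, rather than a fresh per-database dump taken at failure time – which is what makes automated recovery onto newly provisioned instances practical.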