An interesting day at work
Today was an exciting day for the internet! As you might have seen when you tried to use your favorite services today, many of them spent much of the day on the floor.
VentureBeat captured a partial list of the thousands of services that went down — everything from Flipboard and Netflix to Airbnb and Slack and even many airlines’ apps were down or partially down for much of today. Even Apple’s app store, Apple Music, and FaceTime were down for many people.
What do all of these thousands of services have in common? They are all built on Amazon Web Services (AWS), the predominant cloud provider, which provides the backbone on which many of the modern services and apps you use are built.
Unfortunately, Textio was not immune to today’s chaos. We had our only substantial downtime in the history of the company today — though our web site remained up, our service was down for nearly 5 hours.
From the moment our alerts went off at 9:42AM PST, letting us know that something was amiss, our team worked ceaselessly to bring back the service, and we finally got it back online around 2:20PM.
Though we have a lot to learn from today’s downtime, I want to apologize for the interruption in service and walk you through the technical details of what happened as best we understand them.
If you are not interested in the technical details, please accept our apology and know that all of us will be working extremely hard to try to make sure nothing like this happens again.
What happened?
Around 9:42am Pacific Time this morning, an AWS service called S3 went down in the us-east-1 region in Northern Virginia. S3 is the service used for platforms like Textio to store a vast amount of data, and it also is the base storage layer upon which most of the rest of the core AWS services are built.
Amazon hasn’t yet released any details about what actually happened within AWS, but we can expect that they will over the coming days and weeks. An outage of this magnitude is extremely rare, happening on average every few years. When all is said and done, this will likely have been one of the two biggest service outages in the history of AWS.
How did this affect Textio?
An outage in Northern Virginia seems like it might be isolated (hey, isn’t Textio in Seattle?). Unfortunately, much of the world’s data is stored in this us-east-1 datacenter. It is centrally located within the US, and it is Amazon’s “default” region for S3, meaning that whether you are using Textio in Nova Scotia or Sydney, your data ends up passing through Virginia.
The S3 failure created a cascade of other failures. Many of Amazon’s services, including parts of EC2, Lambda, and ELB (all of which Textio needs in order to run its live service), went down. Even RDS, which we use to run our databases, was, to use a technical term, going haywire.
Within a few minutes, Textio customers started to report the outage directly to us, which was helpful because it correlated with the alerts we were getting from our automated monitoring systems. We immediately dropped everything and started working on fixing the problems.
Every time you type in Textio, your edits go into S3. Every time you load a document you’ve created before, that loads from S3. Every import saves to S3 and then loads into Textio. Our document storage was not available and, as a result, our web applications were failing and showing people error messages when they tried to use Textio.
How did Textio react?
Once we figured out that this was not an isolated situation, we quickly put up a red notification bar on our site letting customers know of the outage and the latest information we had. Amazon’s own service status page was down, and even the console with which you can manage their services was showing an error message… so we knew we weren’t the only ones skipping lunch today.
With our primary S3 storage location down, we looked at cutting over to use our secondary platform in us-west-2 (Oregon) but discovered that the management infrastructure we use to deploy and manage our service is also dependent on S3 to allow us to make that cutover.
In fact, something we learned the hard way today was that so much of the infrastructure we use, from GitHub to CircleCI to CloudFormation, is reliant on us-east-1, so our entire build/test/deploy/manage stack was dealing with the same issues. Even though we had “redundancy” built in to our service, our ability to use that redundancy was greatly reduced.
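One way to catch this kind of hidden coupling ahead of time is to write down what a cutover actually needs and check it against a failed region. The following is only a minimal sketch of that idea: the `Dependency` type, the function names, and the sample inventory are all hypothetical, and the region assignments are illustrative rather than a record of our real setup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dependency:
    """One thing the cutover procedure needs (hypothetical model)."""
    name: str
    regions: frozenset  # every region that must be healthy for this to work


def blocked_dependencies(deps, failed_region):
    """Return the names of dependencies that cannot work while failed_region is down."""
    return sorted(d.name for d in deps if failed_region in d.regions)


def can_fail_over(deps, failed_region):
    """A cutover is only possible if none of its dependencies are blocked."""
    return not blocked_dependencies(deps, failed_region)


# Illustrative inventory for a cutover to us-west-2 (assumed, not real data).
CUTOVER_DEPS = [
    Dependency("secondary-storage", frozenset({"us-west-2"})),
    Dependency("deploy-tooling", frozenset({"us-east-1"})),
    Dependency("ci-pipeline", frozenset({"us-east-1"})),
]

if __name__ == "__main__":
    print(blocked_dependencies(CUTOVER_DEPS, "us-east-1"))
    print(can_fail_over(CUTOVER_DEPS, "us-east-1"))
```

In this toy inventory the secondary storage is fine, but the tooling needed to reach it is not, which is the shape of the problem we ran into.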
Our frontend web app is also based on S3, but we were able to keep it up during the outage. We have a secondary infrastructure built on a different stack, and we can manually cut over to it right away with only a DNS change. Unfortunately, cutting over our backend service wasn’t as easy.
By 1:12PM, Amazon had partially restored the S3 service… you could list documents, read documents, even delete them… everything except write new documents to S3.
This left us with a tough decision. On one hand, it allowed our backend services to start mostly working again. You could sign in, see your list of documents, type and get predictions. As you typed, new predictions came back and everything seemed normal. Yay!
Except… when Textio tried to actually write those words you were typing to S3, the save failed, meaning that you would think Textio was working and then come back later to find your work gone.
So, I made the tough choice to switch Textio into what we call “maintenance mode.” This is a mode we can cut into at a moment’s notice which prevents the Textio service from loading entirely. (You get a nice web page with some friendly text instead, but let’s be honest — that wasn’t what you were looking for.) The web site continued to be up, but anyone trying to sign in would be redirected instead to the error page.
My thinking was that it was better for the service to remain down than for you to lose a word of your hard work. I can’t imagine something more disheartening than to lose something you were working on, and we decided that we’d rather keep the service offline until it worked fully.
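The rule behind that decision can be sketched as a health check that only reports "serving" when a full write-then-read round trip succeeds, so a store that can read but not write still counts as down. This is a minimal illustration under stated assumptions, not our actual implementation: `InMemoryStore` is a hypothetical stand-in for document storage (it is not an S3 client), and `probe` and `choose_mode` are made-up names.

```python
class InMemoryStore:
    """Hypothetical stand-in for a document store; not a real S3 client."""

    def __init__(self, readable=True, writable=True):
        self.readable = readable
        self.writable = writable
        self._docs = {}

    def get(self, key):
        if not self.readable:
            raise IOError("read failed")
        return self._docs[key]

    def put(self, key, body):
        if not self.writable:
            raise IOError("write failed")
        self._docs[key] = body


def probe(store, key="healthcheck/probe"):
    """Write a marker, then read it back. True only if both steps succeed."""
    try:
        store.put(key, b"ok")
        return store.get(key) == b"ok"
    except (IOError, KeyError):
        return False


def choose_mode(store):
    """Stay in maintenance mode unless saves are known to work."""
    return "serving" if probe(store) else "maintenance"


if __name__ == "__main__":
    # Reads work but writes fail, as in the partial recovery described above.
    print(choose_mode(InMemoryStore(readable=True, writable=False)))
```

The design choice is deliberate: a read-only check would have reported healthy during the partial recovery, which is exactly the state in which people could lose work.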
At 2:08PM, AWS re-enabled saving documents to S3, and we started the process of validating that things were working properly and bringing the site back online. By 2:20PM, we fully re-enabled our backend services and turned off maintenance mode, allowing Textio to work normally again.
No data was lost during the outage, and everything is back to operating at 100%.
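For anyone who wants the timeline in one place, here is a small sketch that turns the times mentioned in this post into minutes elapsed since our first alert. The times of day come from the post; the calendar date in the code is only a placeholder, and the labels and function name are hypothetical.

```python
from datetime import datetime


def at(hour, minute):
    """Build a timestamp on a placeholder date; only the time of day matters here."""
    return datetime(2000, 1, 1, hour, minute)


# Times of day as described in this post (Pacific Time, 24-hour clock).
TIMELINE = [
    ("alerts fire", at(9, 42)),
    ("S3 reads restored", at(13, 12)),
    ("S3 writes restored", at(14, 8)),
    ("Textio back online", at(14, 20)),
]


def minutes_since_start(timeline):
    """Map each event label to whole minutes elapsed since the first event."""
    start = timeline[0][1]
    return {
        label: int((when - start).total_seconds() // 60)
        for label, when in timeline
    }


if __name__ == "__main__":
    for label, minutes in minutes_since_start(TIMELINE).items():
        print(f"{label}: +{minutes // 60}h{minutes % 60:02d}m")
```

The last entry works out to 4 hours and 38 minutes, which is where the "nearly 5 hours" figure comes from.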
What did we learn?
We learn from our mistakes at Textio.
In the coming days, we will be doing a series of post mortem discussions within the engineering team to understand better how we could have handled the interruption and how to make our service more resilient to future service provider outages.
While a catastrophic AWS outage like today’s may only happen once every couple of years, we want to be the service you trust with all your documents, all of the time. And so we will redouble our efforts to ensure that even if the world falls down around us, Textio will stay up and continue to earn your trust.
Jensen Harris is CTO and co-founder of Textio.