The AWS Outage: Lessons on Worst-Case Scenarios

How bad is the worst-case scenario of your systems? I suspect it’s not as bad as the DynamoDB system.

Welcome to the Scarlet Ink newsletter. I’m Dave Anderson, an ex-Amazon Tech Director and GM. Each week I write a newsletter article on tech industry careers and tactical leadership advice.

Amazon had a bit of an operational issue recently. You probably heard about it if you use the internet.

I spent 12+ years operating software at Amazon. Amazon does plenty of things poorly, but in my experience, they’re the top of the class when it comes to operating software. Recent events might trick you into thinking otherwise.

Amazon released some details of what happened during their AWS outage. In reading through their report, I think it’s clear that a large part of what happened came down to edge-case design flaws in services. Unlikely race conditions. Timeouts that caused weird effects in services. Unexpected and unhandled inconsistent states.

These weren’t the result of downsizing, as some media excitedly tried to proclaim. This wasn’t a problem where a junior engineer made a mistake. These were fundamental design issues with multiple systems. There’s a good chance that these design decisions were literally made years ago but only surfaced when a very specific situation occurred.

Even more interesting, what you have are very complex systems (like DynamoDB) that interact with other very complex systems (like EC2). What you have in aggregate is essentially a much larger system, with even more complexity. Which is why a small timeout can cascade into the entire system failing.

What I find fascinating is that I’m positive that anyone who has operated large-scale software has seen all of these individual patterns.
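
To make one of those familiar patterns concrete, here is a minimal sketch of a stale-write race: a slow worker finishes late and overwrites newer state. It is purely illustrative and is not taken from Amazon's report. The `Store` class, the plan versions, and the endpoint strings are all hypothetical names invented for this example.

```python
# Hypothetical illustration of a stale-write race. Nothing here comes from
# the AWS report; every name below is made up for the sketch.

class Store:
    """Shared state that remembers which plan version was applied last."""

    def __init__(self):
        self.version = 0
        self.value = None


def apply_unguarded(store, version, value):
    # Last writer wins, even if it is carrying an older plan.
    store.version = version
    store.value = value
    return True


def apply_guarded(store, version, value):
    # Refuse to apply a plan that is not newer than what is already in place.
    if version <= store.version:
        return False
    store.version = version
    store.value = value
    return True


def run(apply):
    store = Store()
    # Worker A picks up plan 1 but stalls before writing (a slow retry, say).
    stalled_plan = (1, "old-endpoints")
    # Worker B picks up plan 2 and applies it right away.
    apply(store, 2, "new-endpoints")
    # Worker A finally wakes up and writes its stale plan.
    apply(store, *stalled_plan)
    return store.value


unguarded_result = run(apply_unguarded)  # stale plan overwrites the new one
guarded_result = run(apply_guarded)      # stale plan is rejected
```

The interleaving is simulated sequentially so the outcome is deterministic. In a real concurrent system, the version check and the write in `apply_guarded` would also need to happen atomically (for example under a lock or with a compare-and-swap), otherwise the guard itself can race.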
As I read through that outage summary, I could immediately picture past operational events caused by extremely similar patterns. It’s not like “race condition” is an unheard-of event. Here’s what that logic chain means to me:
You can build a rock-solid, simple website. Your static blog website won’t necessarily have any race conditions, won’t have timeout issues, and your DNS won’t stop pointing at your load balancer. Simple sites can only fail in simple ways.

But when you have a complex system? Let’s look back at point two above. Regardless of your intellect and your attention to detail, your system can and will fail eventually. And I’ve seen brilliant people repeatedly fail to identify how bad things can get.

What is a worst-case scenario?

Years ago, I decided to learn how to sail. I spoke to some of my Amazon employee friends at the rock climbing gym, and they agreed that they also wanted to take sailing lessons. I shopped around and found out how much it would cost our group to take those lessons. I also looked at sailboats on Craigslist (because that’s a thing you do) and stumbled across a lovely 23-foot sailboat for almost exactly the same amount of money.

Paying a few thousand for what is essentially a one-time training expense or paying the same amount for a capital expense? It felt like the only responsible fiscal decision was to buy a boat. Thankfully, my friends are just as insane as I am, and we collectively agreed that teaching ourselves to sail on our own boat was the best idea.

Fast-forward a few months, and our sailing skills had improved to “not terrible.” We were reasonably competent at pulling out of the dock, and had taken many trips across Lake Union and Lake Washington. We felt competent enough at this point to be able to relax.

One day, we were out on Lake Union, and it was a breezy day. In fact, as the day progressed, it went from a convenient breeze to some really aggressive wind. Thankfully our boat had a pretty heavy keel, which meant we were able to keep our sail up while other boats were dropping theirs. We were cruising at our top speed, tipping further with each lake crossing.
We zigged and zagged back and forth across Lake Union (an awfully narrow lake) as the wind grew in strength.

While laughing and having a great time, one of the guys asked, “So this is fun, but also a bit scary. What’s the worst-case scenario here?”

We were all Amazon employees, and familiar with the change management terminology. Part of that process requires defining the worst-case scenario. We all understood the inside joke. In fact, part of what we understood was that engineers regularly underestimate the worst-case scenario for situations.

I looked around, and then famously responded, “Well, the worst-case scenario is that our sail gets stuck up. And then the wind continues to pick up more, and we are totally hosed.”

Everyone on the boat laughed, and collectively agreed that this would indeed be the worst-case scenario.

The wind picked up. We decided it was time to lower the sail, because our deck was starting to tilt at an unnerving angle. One of us went to lower it. You do this by loosening a tie-down on the deck, which lets you pull the sail down. Except after loosening the tie-down, the sail didn’t move. It stayed up regardless of how hard it was pulled.

When we all immediately looked up, the problem was fairly obvious. A loose line had wrapped around the top of the mast and gotten tangled. This held our sail and prevented it from being lowered.

We cracked up, in the way you crack up when a disaster is happening, but it’s so incredibly ironic that you just have to laugh.

This story is 100% true. With this group of friends, the “What’s the worst-case scenario?” question always gets a laugh. And we can laugh because (spoiler alert) we didn’t actually die.

Over the next couple of hours, we sailed frantically back and forth across the lake, repeatedly coming to the brink of swamping. Everyone tightened their life jackets. This was in the winter, and the water was freezing.
We were concerned about our ability to make it to land if we tipped in the middle of the lake.

We strongly considered ramming our boat onto the shore. It would surely destroy the boat, but it felt like a way to avoid risking our lives. We also seriously discussed the merits of someone climbing the mast to untangle the line. As we were all rock climbers, this felt both possible and dangerous.

By a stroke of luck, as we debated these options, someone shook the loose line for the 27th time during a tack, and it came unstuck. We were able to pull the sail down. Crisis averted. Heart rates returned to normal.

While building and maintaining technical systems doesn’t carry the same physical risk, the idea of underestimating worst-case scenarios still applies. We laughed at the idea of the sail being stuck up, because we never dreamed it could happen. It would be the weirdest, worst luck if it happened to get stuck during the strongest wind we’d ever run into.

Yet it happened. And that’s precisely what we see with technical systems. The oddest and unluckiest things happen at the worst time...