A few weeks ago, SignalLeaf crashed. Hard. The site was down and my error reporting system was showing connectivity failures between the app and my message queue service. My house seemed to be crumbling all around me, and I was in full panic mode.
At this point, SignalLeaf is averaging around 4,400 podcast episode downloads (“listens”) per day, with spikes near 20,000 a day and peak traffic during daylight hours in the U.S. To have it go down during those peak hours would be … less than ideal. Or, a disaster. But here’s the thing: in spite of what turned out to be two weeks of intermittent crashes and problems, I didn’t lose any important data or have any outage in serving podcast episodes. All thanks to a somewhat robust architecture.
A Series Of Unfortunate Events
It started with a network problem between my server and my message queueing service. I spent almost an entire working day babysitting SignalLeaf, restarting it every time it crashed. Sometimes the network between the two services would glitch every 10 minutes. Sometimes it would go for a couple of hours just fine. Eventually, the network issues were resolved and everything became stable again.
But that’s only where the problems started. A few days later, I started getting more crash reports with a different issue… this time it was database related. Once again, my services were crashing all around me – except for the media service that delivered podcast episodes to people. It was still humming along and serving episodes, in spite of the main website being down.
This time around, it was a database problem. I had forgotten about a small database plan that I was using, and that database plan hit its limits. Any attempt to write data to it failed, meaning no new episodes could be added, no new accounts created, etc. It was, once again, a catastrophic failure for part of the service, but not all of it. The database problem was solved in a few hours, and things seemed to be stable once again.
Except “stable” was more like “frozen in time”.
A few days later, one of the podcasters that uses SignalLeaf contacted me to let me know that the reports were not showing any new data for the last few days… the same number of days since the database issue started. Uh-oh. After a bit of digging around, I figured out that there was a series of cascading failures caused by the database being down for a bit.
I have backup code in place for the times when the queue service is interrupted. This code saves all messages to my database before sending them through the queue. Once the data is processed on the other end of the queue, it’s marked as such in the database and I know I can remove the data. But when the database was having issues, none of these records were being written. The queueing service did its job and held on to the messages – nearly 12,000 of them – but my code was failing when it tried to find the corresponding database record for the message it was processing. There was no database record for the message. Bam, crash. Ouch.
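The save-to-database-then-send pattern can be sketched roughly like this. This is a minimal, hypothetical version, not SignalLeaf’s actual code: the `Backlog` class, a `Map` standing in for the database table, and an array standing in for the message queue are all illustrative.

```typescript
// Hypothetical sketch: persist a backup copy of each message before
// sending it to the queue, and delete the copy once it's processed.
type TrackingEvent = { id: string; episodeId: string; at: number };

class Backlog {
  private db = new Map<string, TrackingEvent>(); // stands in for the database table
  private queue: TrackingEvent[] = [];           // stands in for the message queue
  queueUp = true;                                // toggle to simulate a queue outage

  // Save the event to the database first, then try the queue.
  // If the queue send fails, the database copy survives.
  publish(event: TrackingEvent): void {
    this.db.set(event.id, event);
    if (this.queueUp) this.queue.push(event);
  }

  // Called once the consumer has processed the event: the backup
  // copy is no longer needed, so remove it.
  markProcessed(id: string): void {
    this.db.delete(id);
  }

  pendingBackups(): number { return this.db.size; }
  queuedCount(): number { return this.queue.length; }
}
```

The key ordering is that the database write happens before the queue send, so a queue outage never loses a message, only delays it.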
A few minutes later, I had the queue processing code fixed up to handle that scenario, and the 12,000 messages stuck in the queue began working their way through. It took about 30 minutes for them to finish processing, and all of the reports on SignalLeaf were back up to date again.
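The fix on the consumer side amounts to treating a missing database record as “nothing left to clean up” instead of a fatal error. A sketch of that idea, with all names and the in-memory `Map` database being hypothetical:

```typescript
// Hypothetical sketch: a queue consumer that tolerates a missing
// backup record (e.g. because the database was down when the
// message was originally queued).
type QueueMessage = { id: string; payload: string };

class Consumer {
  processed: string[] = [];

  constructor(private db: Map<string, string>) {}

  handle(msg: QueueMessage): void {
    this.processed.push(msg.payload);   // do the real work first

    const record = this.db.get(msg.id); // look up the backup record
    if (record === undefined) {
      // No record was ever written. That's fine -- there is
      // nothing to mark or remove, so just move on.
      return;
    }
    this.db.delete(msg.id);             // normal path: remove the backup copy
  }
}
```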
Did You Notice The Trend, Here?
There were a lot of problems in a short period of time. It was not fun. I spent far too much time babysitting things, but in the process I learned a lot about my code and where it is and is not robust. I also improved the code’s robustness quite a bit in the places that needed it. But through this all, with every failure and every solution, there was one thing that never stopped working: media delivery for podcast listens.
Quite a while back, I decided to split my code into multiple services, one of which is what I call my media service. This is the service / website that is responsible for serving all “media” (podcast episodes, RSS, etc.) to browsers, podcast apps, etc. And while the rest of my services were crashing all around me, this core part of the app was trucking along just fine – at least on the surface, where the browsers and the podcast apps were concerned. I continued to serve a few thousand episodes per day, even during the downtime that the rest of SignalLeaf experienced.
This was by design, not by accident. It was unfortunate that I had to test this design in production with the rest of my app crashing. But it did prove the design worked as intended.
Robust, Critical Features
Back when I split apart the code that served episodes from the code that managed everything else, I decided to put a message queue in place. This queue was part of keeping things fast, simple and cleanly separated. It also provided the core of the robustness that I built into the media services.
With the media services being critical to a podcast media host, I wanted to make sure they would still work as long as the database could be read.
During the message queue outage, the media services still worked. They sent episodes to the browser, even though they couldn’t send tracking events to the message queue. The code to send a message to the queue happened asynchronously, after the request for the episode was fulfilled. Serve the episode first, and then try to send the message queue message.
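That “serve first, track later” ordering can be sketched like this. The deferred task list here is a stand-in for whatever async mechanism the real code uses (a promise, `process.nextTick`, etc.), and every name is illustrative:

```typescript
// Hypothetical sketch: fulfill the episode request first, then send
// the tracking message asynchronously. A queue failure during the
// deferred work cannot block serving the episode.
type Task = () => void;
const deferred: Task[] = [];
const defer = (t: Task) => { deferred.push(t); };

const served: string[] = [];
const tracked: string[] = [];
let queueUp = true;

function serveEpisode(episodeId: string): string {
  served.push(episodeId);               // fulfill the request first

  defer(() => {                         // tracking happens after the response
    if (!queueUp) throw new Error("queue unreachable");
    tracked.push(episodeId);
  });

  return `audio bytes for ${episodeId}`; // the listener gets the episode either way
}

// Later, the deferred work runs; failures are logged and swallowed,
// because the episode has already been served.
function flushDeferred(): void {
  for (const t of deferred.splice(0)) {
    try { t(); } catch { /* log and move on */ }
  }
}
```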
During the database outage, the media services still worked… I could still read data from the database; I just couldn’t write data. The media services don’t rely on being able to write data. I explicitly decided, quite some time ago, to allow the database write to fail and still serve the podcast episode. As long as the database can be read, the files can be served.
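The read-required / write-optional split in the critical path boils down to one try/catch. A minimal sketch, with a hypothetical in-memory `MediaDb` standing in for the real storage layer:

```typescript
// Hypothetical sketch: a failed read fails the request, but a failed
// tracking write is swallowed and the file is served anyway.
class MediaDb {
  episodes = new Map<string, string>([["ep-1", "file-1.mp3"]]);
  writable = true;

  read(id: string): string | undefined {
    return this.episodes.get(id);
  }

  writeListen(id: string): void {
    if (!this.writable) throw new Error("database write failed");
    // (a real implementation would record the listen here)
  }
}

function serveFile(db: MediaDb, episodeId: string): string {
  const file = db.read(episodeId);      // reads are required...
  if (file === undefined) throw new Error("unknown episode");

  try {
    db.writeListen(episodeId);          // ...writes are best-effort
  } catch {
    // Tracking failed; log it, but still serve the file.
  }
  return file;
}
```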
By keeping the critical part of my service as simple as possible – I just need read access to the database to serve episodes – I was able to keep the most important parts of the system up and running while the rest of it crashed and burned around me. The additional features that I want in my system happen outside the critical path. They either have try/catch blocks around them, handling the error and serving the file anyway, or they happen asynchronously, allowing the file to be served before the code even runs.
Improving What Needs It, When It Needs It
Having a good architecture in place allows you to do things like this – to have code that allows failures in some places while keeping other critical things alive. I’m sure my customers were a little less than happy about the main website being down so much, but I’m also sure that they were more than happy to see episodes still being served during that time. Of course I want to improve my code, and I have been. I’ve made improvements every time I’ve run into a problem, and I will continue to improve things over time.
But here’s the thing: I don’t think I could have predicted how the failures I ran into would affect SignalLeaf, prior to seeing them happen. There are only so many types of failures and specific problems that you can account for up-front. Yes, experience with problems will give you a bit of a spidey-sense for bad code and architecture, allowing you to prevent some problems in the future. But there will always be new and interesting ways in which your code will fail. Always.
The question, then, is not how robust you made the code before you shipped it, but how quickly you improved the code to prevent that same failure again. How robust is the code now, having faced and survived that particular failure? How much of your system can go down while still allowing the most critical part of your system to operate?
I learned a lot about these questions and answers in the last few weeks of running SignalLeaf, and SignalLeaf is better for it. The code is more robust – it handles an additional set of failures that it previously didn’t handle. The code is now able to handle these failure types without bringing the entire system down.
And having gone through this, survived and improved my system because of the failures that I experienced, I await the next set of unexpected and panic-inducing problems with a little more confidence in my system’s ability to remain active and in my ability to improve the robustness of the system.