In my upcoming book on RabbitMQ Layout (part of the RabbitMQ For Developers bundle, to be released on June 15th), I tell a story about a system that uses an analytics service. In this system, the analytics service isn’t reliable, so the developers keep a backup of all the events in a local database.
As the story unfolds, the developers make some significant changes to the way the application uses RabbitMQ. The results are significant, but the story focuses only on the RabbitMQ side of things and leaves a lot of questions open about the database side of the system, where events are stored as a backup.
I chose not to elaborate on that side of the system in the book, because it didn’t fit. It wasn’t the right context, it would have added too much length to the chapter, and frankly, I didn’t have a complete answer for the problem when I wrote the story. But I think I have an answer, now, and I want to share my thoughts on the potential solution.
Names Have Been Changed To Protect … Me
The story that I tell in the book is based in part on my own experiences in building SignalLeaf and using RabbitMQ to send event data over to Keen.io. While Keen has never been unstable like the analytics service in my story, code has been broken and other services have been down.
A few weeks ago, for example, my RabbitMQ instance became unresponsive. I use shared RabbitMQ hosting with CloudAMQP, and another application on the same physical servers ate all available disk space. I created a new instance on a new cluster and everything started working again. But if I didn’t have that backup database of event entries, as mentioned in the story in my book, I would have lost all the data during the downtime.
Then, this last weekend, I updated some SignalLeaf code and misconfigured my exchange and queue bindings in RabbitMQ. After deploying, my code was sending messages to RabbitMQ with a routing key that was not handled. Because of this, the messages were being lost and nothing was being published to Keen.io. I fixed the configuration, but not before 24 hours of analytics data had been missed. Again, the database of event data means I haven’t completely lost that data.
Having the database backup for the events was incredibly important in these cases. But even with the database, I have a problem with the way my code is set up and the way the data is stored.
The Database Event Design Problem
In the book, I talk about having the database backup for the same kinds of reasons that I point out above. I also talk about having a nightly process that checks for events that haven’t been published.
In the real world, I have my code for SignalLeaf set up to create a single database record for each event. Along with the event data, I include a “status” field: processed or unprocessed. When a request is made for a podcast episode, a new database entry is made with a status of “unprocessed”. Then a message is sent across RabbitMQ. The code on the other side publishes the event to Keen.io and updates the database entry to be “processed”.
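For illustration, here is a rough sketch of that flow. The helper names (`db.insertEvent`, `publishEvent`, `sendEpisodeFile`) are hypothetical stand-ins for the real SignalLeaf code; only the shape of the flow matters here.

```typescript
// Hypothetical stand-ins for the real SignalLeaf code; only the shape matters here.
declare const db: { insertEvent(record: object): Promise<void> };
declare function publishEvent(routingKey: string, body: object): Promise<void>;
declare function sendEpisodeFile(episodeId: string): Promise<void>;

// The current flow: two extra network calls happen inline, before the
// listener gets their episode file.
async function handleEpisodeRequest(episodeId: string): Promise<void> {
  const event = { episodeId, requestedAt: new Date().toISOString() };

  // 1. write the backup record with a status of "unprocessed"
  await db.insertEvent({ ...event, status: "unprocessed" });

  // 2. send the event across RabbitMQ; the consumer on the other side
  //    publishes it to Keen.io and updates this record to "processed"
  await publishEvent("episode.requested", event);

  // 3. only now does the episode file get sent to the listener
  await sendEpisodeFile(episodeId);
}
```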
There are several problems with this design.
- Too many network calls from web server
- Errors unrelated to the file download prevent the file download
- Duplicate messages may occur
To start, there are too many network calls made from the web server: one for the database and one for RabbitMQ. A podcast listener just wants an episode, and I’m slowing them down with more network calls than I should be using. The extra latency makes it take longer to get the episode to them.
An error writing to the database or RabbitMQ from the web server means the process stops and the file is not sent to the user. Why should a database or RabbitMQ failure prevent the file from downloading? These things shouldn’t be tied together so tightly.
If the code on the other side of RabbitMQ works, the event is published to Keen.io. But if the database update fails after that, the record still says “unprocessed” and I won’t know that the Keen.io call worked. Any code that looks for “unprocessed” entries could then re-process the event, duplicating the entry in Keen.io. Multiply this by a few thousand, and suddenly the stats in Keen.io are very wrong.
So… how do we fix this?
In the book, the developers alleviate these problems by separating the database and analytics service calls. In the real world, I wasn’t sure how I could make this work. But, thanks to the interviews for the RabbitMQ For Developers bundle, I have a better understanding of the situation and solution.
Allow The Code To Fail
The first thing I’m going to do is drop the database call for the event entry from the web server code. Instead, I will send a single message through RabbitMQ and route it to multiple queues: one for the database, one for publishing to Keen.io. Each queue will have purpose-specific code that doesn’t tie the database to Keen.io.
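Here’s a minimal sketch of what that topology could look like, using the amqplib Node.js client. The exchange, queue, and routing key names are made up for illustration; the real SignalLeaf code may use a different wrapper and different names.

```typescript
import * as amqp from "amqplib";

// One exchange, multiple queues bound to the same routing key, so a single
// published message is delivered to each queue independently.
async function setupTopology(): Promise<amqp.Channel> {
  const connection = await amqp.connect("amqp://localhost");
  const channel = await connection.createChannel();

  // the exchange the web server publishes to
  await channel.assertExchange("signalleaf.events", "topic", { durable: true });

  // one queue per concern
  await channel.assertQueue("events.database", { durable: true });
  await channel.bindQueue("events.database", "signalleaf.events", "episode.requested");

  await channel.assertQueue("events.keen", { durable: true });
  await channel.bindQueue("events.keen", "signalleaf.events", "episode.requested");

  return channel;
}

// The web server now makes exactly one call: publish the event.
function publishEpisodeRequested(channel: amqp.Channel, episodeId: string): void {
  const event = { episodeId, requestedAt: new Date().toISOString() };
  channel.publish(
    "signalleaf.events",
    "episode.requested",
    Buffer.from(JSON.stringify(event)),
    { persistent: true }
  );
}
```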
Eliminating the database call from the web server will speed up the response to the HTTP request for the file. It will also allow the database and Keen.io calls to fail or succeed independently of each other. This is the “let it fail” mentality that Aria Stewart talks about in the “Design For Failure” interview in the RMQ For Devs bundle.
I can have the database call fail, and I don’t care. I can have the Keen.io call fail, and I don’t care. I’ll “nack” the messages and put them back on the queue. They’ll be picked up and processed later, when the network hiccup, the bad code, or whatever else is resolved.
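A sketch of what the Keen.io consumer might look like under that approach; `publishToKeen` is a hypothetical wrapper for whatever Keen.io client call is really in use.

```typescript
import * as amqp from "amqplib";

// Hypothetical wrapper around the real Keen.io client call.
declare function publishToKeen(event: object): Promise<void>;

// If the Keen.io call fails, nack the message and requeue it so it can be
// retried once the problem is resolved.
async function consumeKeenQueue(channel: amqp.Channel): Promise<void> {
  await channel.consume("events.keen", async (msg) => {
    if (msg === null) {
      return;
    }

    try {
      const event = JSON.parse(msg.content.toString());
      await publishToKeen(event);
      channel.ack(msg);
    } catch (err) {
      // let it fail: put the message back on the queue and try again later
      channel.nack(msg, false, true);
    }
  });
}
```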
Now for the other end of the code, after the event is published to Keen.io.
Don’t Update Database Records
Once the code has published the event to Keen.io, I’ll send a “success!” message through RabbitMQ. That message will get routed to the code that writes to the database again. But there’s a potential problem here: if the record with the original “unprocessed” status has not been added to the database yet, the message to update that record to “processed” will fail.
If there’s no record, there isn’t anything to update. So… how do I fix that? Don’t try to update any records. Just write new records. This is something I picked up in my interview with Anders Ljusberg, where we talked about Event Sourcing in relation to CQRS.
Event Sourcing is facilitated by an append-only data model: a collection of state changes for a given entity or data object. There’s more to it than this, but that is the part I care about right now. Instead of trying to update a record that might not exist, then, I’m going to write a new record with the “processed” status. If the “unprocessed” record doesn’t exist yet, I don’t care. It will show up eventually, when that message is handled by my queue handling code.
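In code, the append-only approach might look something like this. The record shape and the `db.insertEvent` helper are hypothetical; the point is that both sides only ever insert, never update.

```typescript
// Hypothetical stand-in for the real data access code.
declare const db: { insertEvent(record: object): Promise<void> };

// Append-only writes: never update an existing record, just record each
// state change as it happens, keyed by the event's id.
async function recordUnprocessed(eventId: string, event: object): Promise<void> {
  await db.insertEvent({ eventId, ...event, status: "unprocessed", recordedAt: new Date() });
}

async function recordProcessed(eventId: string): Promise<void> {
  // It doesn't matter whether the "unprocessed" record exists yet;
  // this write succeeds on its own either way.
  await db.insertEvent({ eventId, status: "processed", recordedAt: new Date() });
}
```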
With that done, I can have a process check for events that have an “unprocessed” record but no associated “processed” record. If these records are older than some time frame (a day, maybe?), I’ll re-process them, knowing they won’t be duplicates.
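A sketch of that nightly check, again with hypothetical data access helpers standing in for the real queries:

```typescript
// Hypothetical data access helpers for the reconciliation check.
declare const db: {
  findUnprocessedOlderThan(cutoff: Date): Promise<Array<{ eventId: string }>>;
  hasProcessedRecord(eventId: string): Promise<boolean>;
};
declare function republishEvent(eventId: string): Promise<void>;

// Find events with an "unprocessed" record older than a day and no matching
// "processed" record, and send them back through the pipeline.
async function reprocessStaleEvents(): Promise<void> {
  const cutoff = new Date(Date.now() - 24 * 60 * 60 * 1000);
  const candidates = await db.findUnprocessedOlderThan(cutoff);

  for (const { eventId } of candidates) {
    if (!(await db.hasProcessedRecord(eventId))) {
      await republishEvent(eventId);
    }
  }
}
```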
Tighten Up That Workflow
The solution I just outlined seems pretty solid off-hand. It combines a few things that I learned from the interviews I’ve done for the RabbitMQ For Developers bundle, and puts them to good use. I expect there to be some edge cases and potential issues in implementing this, though. I haven’t yet worked through the real code to make this happen, so there’s bound to be some bumps along the way.
One of the potential bumps I already see is having a larger workflow coded into the low-level details. This is something I already fight against in my code architecture, and I can see it happening in this situation. Fortunately, I have yet another interview from which I can draw a solution. In my interview with Jimmy Bogard, we talk about “Sagas” and workflow.
The idea here is to code a higher-level workflow management object: one that knows when the final state of things has been reached. I think it’s possible to overwhelm my current system with this, since I deal with thousands of requests per minute… but it will be interesting to try this and see what happens.
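As a very rough sketch (not the actual Saga implementation from the interview, just my first guess at the shape of it), the workflow object might simply track which steps it has seen for each event and report when the final state is reached. In a real system, this state would need to live in a durable store rather than in memory.

```typescript
// A rough sketch of a workflow ("saga") tracker. In production this state
// would live in a durable store, not in an in-memory Map.
type Step = "recorded-in-database" | "published-to-keen";

class EventWorkflow {
  private completedSteps = new Map<string, Set<Step>>();

  markComplete(eventId: string, step: Step): void {
    const steps = this.completedSteps.get(eventId) ?? new Set<Step>();
    steps.add(step);
    this.completedSteps.set(eventId, steps);
  }

  // The workflow is done once the event has been both recorded and published.
  isFinished(eventId: string): boolean {
    const steps = this.completedSteps.get(eventId);
    if (!steps) {
      return false;
    }
    return steps.has("recorded-in-database") && steps.has("published-to-keen");
  }
}
```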
Want To Learn More From These Interviews?
I’ve learned far more from the interviews I’ve done for the RabbitMQ For Developers bundle than I ever expected. There’s a wealth of knowledge in these discussions, just waiting to be unleashed on the world, and I’m already taking advantage of it in my daily development!
Pick up your copy of The RabbitMQ For Developers bundle, and get all of these interviews, 12 screencasts, an eBook and so much more!