A question was asked on StackOverflow about managing long-running, resource intensive processes in a way that does not hog up all resources for a given request. That is, in a scenario where a lot of work must be done and it will take a long time, how can you have multiple users served and handled at the same time, without one of them having to wait for their work to begin?
This can be a difficult question to answer, at times. Sometimes a set of requests must be run serially, and there’s nothing you can do about it. In the case of the StackOverflow question, however, there was a specific scenario listed that can be managed in a “fair” way, with limited resources for handling the requests.
The Scenario: Emptying Trashcans
The original question and scenario are as follows:
I’m looking to solve a problem that I have with the FIFO nature of messaging severs and queues. In some cases, I’d like to distribute the messages in a queue to the pool of consumers on a criteria other than the message order it was delivered in. Ideally, this would prevent users from hogging shared resources in the system. Take this overly simplified scenario:
- There is a feature within an application where a user can empty their trash can
- This event dispatches a DELETE message for each item in trash can
- The consumers for this queue invoke a web service that has a rate limited API
Given that each user can have very large volumes of messages in their trash can, what options do we have to allow concurrent processing of each trash can without regard to the enqueue time? It seems to me that there are a few obvious solutions:
- Create a separate queue and pool of consumers for each user
- Randomize the message delivery from a single queue to a single pool of consumers
In our case, creating a separate queue and managing the consumers for each user really isn’t practical. It can be done but I think I really prefer the second option if it’s reasonable. We’re using RabbitMQ but not necessarily tied to it if there is a technology more suited to this task.
Message Priorities And Timeouts?
As a first suggestion or idea to consider, the person asking the question talks about using message priorities and TTL (time-to-live) settings:
I’m entertaining the idea of using Rabbit’s message priorities to help randomize delivery. By randomly assigning a message a priority between 1 and 10, this should help distribute the messages. The problem with this method is that the messages with the lowest priority may be stuck in the queue forever if the queue is never completely emptied. I thought I could use a TTL on the message and then re-queue the message with an escalated priority but I noticed this in the docs:
Messages which should expire will still only expire from the head of the queue. This means that unlike with normal queues, even per-queue TTL can lead to expired lower-priority messages getting stuck behind non-expired higher priority ones. These messages will never be delivered, but they will appear in queue statistics.
This should generally be avoided, for the reasons mentioned in the docs. There’s too much potential for problems with messages never being delivered.
Timeout is meant to tell RabbitMQ that a message doesn’t need to be processed – or that it needs to be routed to a dead letter queue so that it can be processed by some other code.
Priorities may solve part of the problem, but would introduce a scenario where files never get processed. If you have a priority 1 message sitting back in the queue somewhere, and you keep putting priority 2, 3, 5, 10, etc. into the queue, the 1 might not be processed.
For my money, I would suggest a different approach: sending delete requests serially, for a single file.
Single Delete-File Requests
In this scenario, I would suggest taking a multi-step approach to the problem using the idea of a “saga” (aka a long-running workflow object).
When a user wants to delete their trashcan, you send a single message through RabbitMQ to a service that can handle the delete process. That service would create an instance of the DeleteTrashcan saga for that user’s trashcan.
The DeleteTrashcan saga would gather a list of all files in the trashcan that need to be deleted. Then it would start doing the real work by sending a single request to delete the first file in the list.
With each request to delete a single file, the saga waits for a response to say the file was deleted.
When the saga receives the response message to say the previous file has been deleted, it sends out the next request to delete the next file.
Once all the files are deleted, the saga updates itself (and any other part of the system, as needed) to say the trash can is empty.
How This Handles Multiple Users
When you have a single user requesting a delete, things will happen fairly quickly for them. They will get their trash emptied soon, as all of the delete-file requests in the queue will belong to that user:
u1 = User 1 Trashcan Delete Request
When there are multiple users requesting a delete, the process of sending one delete-file request at a time means each user will have an equal chance of having their request handled next.
For example, two users could have their requests intermingled, like this:
u1 = User 1 Trashcan Delete Request
u2 = User 2 Trashcan Delete Request
Note that the requests for the 2nd user don’t start until some time after the requests from the first user. As soon as the second user makes the request to delete the trashcan, though, the individual delete-file requests for that user start to show up in the queue. This happens because u1 is only sending 1 delete-file request at a time, allowing u2 to have requests show up as needed.
Over-all, it will take a little longer for each person’s trashcan to be emptied, but they will see progress sooner rather than later. That’s an important aspect of people thinking the system is fast / responsive to their request.
But, this setup isn’t without potential problems.
Optimizing: Small File Set vs Large File Set
In a scenario where you have a small number of users with a small number of files, the above solution may prove to be slower than if you deleted all the files at once.
After all, there will be more messages sent across RabbitMQ – at least 2 for every file that needs to be deleted (one delete request, one delete confirmation response)
To optimize this, you could do a couple of things:
- Have a minimum trashcan size before you split up the work like this. below that minimum, you just delete it all at once
- Chunk the work into groups of files, instead of one at a time (maybe 10 or 100 files would be a better group size, than 1 file at a time)
Either (or both) of these solutions would help to improve the over-all performance of the process by batching the work and reducing the number of messages being sent. You would need to do some testing in your actual scenario to see which of these (or maybe both) would help and at what settings, though.
Beyond the batching optimization, there is at least one additional problem you may face – too many users.
The Too-Many-Users Problem
If you have 2 or 3 users requesting deletes, it won’t be a big deal. Each will have their files deleted in a fairly short amount of time.
But if you have 100 or 1000 users requesting deletes, it could take a very long time for an individual to get their trashcan emptied – or even started!
In this situation, you may need to have a introduce level controlling process where all requests to empty trashcans would be managed.
This would be yet another Saga to rate-limit the number of active DeleteTrashcan sagas.
For example, if you have 100 active requests for deleting trashcans, the rate-limiting saga may only start 10 of them. With 10 running, it would wait for one to finish before starting the next one.
This would introduce some delay to processing the requests. But it has the potential to reduce the over-all time it takes to delete an individual trashcan.
Again, you would need to test your actual scenario to see if this is needed and see what the limits should be, for performance reasons.
Scaling Up and Out
One additional consideration in the performance of these resource-intensive requests is that of scaling up and out.
Scaling up is the idea of buying “larger” (more resources) servers to handle the requests. Instead of having a server with 64GB of memory and 8 CPU cores, perhaps a box with 256GB of memory and 32 CPU cores would perform better and allow you to increase the number of concurrent requests.
While scaling up can increase the effectiveness of an individual process instance, it becomes expensive and has limitations in the amount of memory, CPU, etc.
Scaling out may be a preferable method in some cases, then.
This is the idea that you buy many smaller boxes and load balance between all of them, instead of relying on one box to do all the work.
For example, instead of buying one box with 256GB memory and 32 cores, you could buy 10 boxes with 32GB of memory and 4 cores each. This would create a total processing capacity that is slightly larger than the one box, while providing the ability to add more processing capacity by adding more boxes.
Scaling your system up and/or out may help to alleviate the resource sharing problem, but likely won’t eliminate it. If you’re dealing with a situation like this, a good setup to divide and batch the processing between users will likely reap rewards in the long-run.
Many Considerations, Many Possible Solutions
There are many things that need to be considered in resource intensive requests.
- The number of users, the number of steps that need to be taken
- The resources used by each of those steps
- Whether or not individual steps can be done in parallel or must be done serially
- … and so much more
Ultimately, the solution I describe here was not suitable for the person that asked, due to some additional constraints. They did, however, find a good solution using networking “fair queueing” patterns.
Because each scenario is going to have different needs and different challenges, you will likely end up with a different solution, again. That’s ok.
Design patterns, such as Sagas and fair queueing, are not meant to be all-in-one solutions. They are meant to be options that should be considered.
Having tools in your toolbox, like Sagas and batch processing, however, will give you options to consider when looking at these situations.