I recently spent a few days tracking down a memory leak in a production NodeJS system. I found the leak, but the journey to get there was not exactly scientific, repeatable, or one that I would wish on anyone else.
If tracking down memory leaks sounds difficult to start with, the reality of doing it tends to be even more difficult than most developers imagine.
Noticing The Leak
At my client's office, I have a small virtual machine for a server. Several months ago, my coworker noticed that we were using memory beyond what was allotted to the machine – there was a huge paging file.
We restarted our services and things cleaned up a little, but we knew that we needed more memory. After some conversation with the virtualization team, we got the memory increased and life was good for a while.
A month later, we were checking to see how the system was doing and noticed that we had almost eaten up all of the available memory again. Out of the 12GB allocated, we had used about 10GB. There was no paging file yet, but memory usage had started at around 2GB when we first turned everything on and got it up and running.
We watched it for a few weeks and memory usage kept climbing, eventually getting to 11GB. I restarted all of the services and it immediately dropped to less than 2GB of usage again.
It appeared to be a slow but steady leak.
Confirming The Leak
Confirmation of the leak, in this case, was easy. The system in question is a job scheduling service – it runs nightly batch processes according to the schedules that are configured.
I stood up some test schedules in my development environment and ran them 10, 20, 100+ times to see what would happen to the memory usage.
As expected, memory use grew with each schedule execution. I was able to jump from 55MB of memory for the main process to well over 500MB of memory usage with only a few dozen schedules executed.
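A run like this can be scripted in a few lines. This is a minimal sketch of the idea, where `runSchedule` is a hypothetical stand-in for the real schedule execution:

```javascript
// A rough harness for watching heap growth across repeated runs.
// `runSchedule` is a hypothetical stand-in for the real job execution.
function heapUsedMB() {
  return Math.round(process.memoryUsage().heapUsed / 1048576);
}

async function measureGrowth(runSchedule, iterations) {
  const before = heapUsedMB();
  for (let i = 0; i < iterations; i++) {
    await runSchedule();
  }
  if (global.gc) global.gc(); // only available when run with node --expose-gc
  return { before, after: heapUsedMB() };
}
```

Running with `--expose-gc` lets you force a collection before the final reading, so growth that survives GC is a real leak rather than lazily collected garbage.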
The problem was not the memory usage, though. The problem was the memory never being released – a memory leak somewhere, preventing the system from cleaning up all the old garbage and reducing the memory footprint.
Now that we had confirmed the leak, it was time to look for the cause.
Trying To Identify The Leak
The system in question is built in NodeJS. It has many moving parts – about 10 different node processes on the server, all coordinated through RabbitMQ. The main process for the system was using the most memory – about 1.3GB. The rest of the various processes made up a fair amount of memory usage, but nothing came close to that 1.3GB.
Given that, I dove into Google and looked up some great tutorials on tracking down memory leaks in Node. I found a lot of good articles and read through all of them, trying out many of the tools suggested.
- Debugging Memory Leaks in Node.js – A Walkthrough
- Tracking Down Memory Leaks in Node.js – A Node.JS Holiday Season
- How To Self-Detect A Memory Leak In Node
- Debugging Memory Leaks In Node.js Applications
… and still other articles that largely said the same things as these.
The end result of all this reading and research was my attempt to use Google Chrome's built-in memory profiling tools to track down the bug. Node is built on V8, after all, and I could easily import a heapdump or three into Chrome to examine the output.
Sounded like a great idea, so let’s give it a try!
Heapdumps Sound Good In Theory
The majority of articles I read did the usual tutorial thing: they purposely set up a memory leak that was known up front. With that in place, using the heapdump tool and Google Chrome, they could diff the snapshots and easily identify the leak.
In my case, the heapdump results read more like a foreign language, full of (system) and (array) entries, rather than anything useful like the demos showed.
I couldn’t find anything useful in this mess, and couldn’t make heads or tails out of the gigantic object graphs and thousands of array instances that were being created.
None of it made sense to me, even after two straight days of reading more and more about what the memory profiler was actually telling me.
I was ready to give up and just add a restart script that ran on a schedule, to reduce the memory usage on the server. After all, it was a very slow memory leak. It would take more than a month of schedule executions to eat up the memory, so even a once-a-week restart would work, right?
… *sigh* … but alas, I can’t live with a workaround like that. So I dug in further… and dropped heapdump.
Divide And Conquer
Since I couldn’t use the heapdump profiles and Chrome DevTools to find the problem, I decided to take the hard way: divide and conquer. I already knew the problem was in my main process, so that part was easy.
I started by dumping 90% of the code from the service, and having an empty shell that responded to a few RabbitMQ messages. As expected, the memory leak pretty much went away because no real work was being done.
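The empty shell looked roughly like this, sketched with amqplib (the queue name is hypothetical, and the require is deferred so the sketch loads without a broker present):

```javascript
// An empty shell: consume and ack, but do no real work. If memory still
// grows under load, the leak lives in the plumbing, not the job logic.
async function startShell(url) {
  const amqp = require('amqplib'); // deferred: only needed when connecting
  const conn = await amqp.connect(url);
  const ch = await conn.createChannel();
  await ch.assertQueue('schedule.commands'); // hypothetical queue name
  await ch.consume('schedule.commands', (msg) => {
    ch.ack(msg); // acknowledge and drop -- no work performed
  });
}
```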
From there, it was a matter of trial and error, putting code back in and taking code back out. With more hours of this under my belt than I wanted, I was starting to wonder if the problem was in my RabbitMQ messaging layers.
I didn’t have very strong evidence for this, but it was a gut feeling based on the memory leak always being present when I had more messaging than not. There were a lot of other possibilities, still, but I liked that idea the most.
An Accidental Discovery
After a day of divide and conquer, I still had not isolated the problem.
It was during another discussion with my coworker, and examining the memory used on the production server, that I found further evidence of where the problem was.
We both noticed that some of the secondary processes were also growing in memory usage. One of them in particular – one used by the majority of our jobs in the main schedule – was getting close to 1GB of memory usage. This process was nothing more than a child-process launcher for SSH commands. It shouldn’t have been leaking like that.
After restarting all of the processes again, and seeing all memory drop back down – including the child process launcher – I thought through the similarities between the main process and this sub-process.
The one major component that was shared between the two processes was my messaging layer.
Tracking Down The Problem
With further evidence of the messaging layer being the problem, I created two sample applications – one that used my messaging abstraction layer, and one that used the underlying driver.
The results were not what I wanted.
Initially, I showed that I could easily eat up 500MB+ of memory by pumping 10,000 messages through RabbitMQ with my messaging layer. However, the memory usage would not grow beyond that. It would clean up, dropping back down to around 150MB, and then climb again the next time we ran more messages.
To see if this were a slow leak, I set up the process to send 10,000 messages every 20 seconds in the hopes that memory would build up. I went to lunch and came back to have my hopes dashed. Memory usage had not gone up at all.
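The soak test was essentially a timer around a send loop. Here is a sketch, where `send` is a hypothetical stand-in for publishing through the messaging layer:

```javascript
// Pump a batch of messages on an interval and log heap usage after each
// batch, so growth (or the lack of it) shows up over a long run.
function startPump(send, { batch = 10000, everyMs = 20000 } = {}) {
  const timer = setInterval(() => {
    for (let i = 0; i < batch; i++) {
      send({ seq: i, body: 'soak-test' });
    }
    const mb = Math.round(process.memoryUsage().heapUsed / 1048576);
    console.log(`sent ${batch} messages, heapUsed ~${mb}MB`);
  }, everyMs);
  return () => clearInterval(timer); // call the returned function to stop
}
```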
Isolating The Cause
Another discussion was had regarding the results of this test. We wanted to ensure that the test reproduced the code from the real app as closely as possible, for the messaging system.
I thought it did. I wrote both the real code and test code after all.
But, I was wrong.
The sample code was creating one instance of my message abstraction object and re-using it every time I sent a message. In the real app, a new object instance was created every time a message was sent.
I adjusted the sample code to work the same way and the results were immediate. I was able to increase memory consumption consistently, with it never dropping!
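Boiled down, the difference looked like this. `Publisher` is a hypothetical stand-in for my messaging abstraction, and the `retained` array simulates the kind of per-instance state the underlying driver was holding onto:

```javascript
// Simulates the retention bug: each Publisher pins itself in driver-level
// state that is never released. `retained` stands in for that state.
const retained = [];

class Publisher {
  constructor() {
    retained.push(this); // every instance is kept alive forever
  }
  send(msg) { /* ...deliver msg... */ }
}

// Leaky: the real app's pattern -- a fresh instance per message.
function sendLeaky(msg) {
  new Publisher().send(msg);
}

// Stable: the original test harness's pattern -- one shared instance.
const shared = new Publisher();
function sendStable(msg) {
  shared.send(msg);
}
```

With the per-message pattern, every send adds another permanently retained object; with the shared instance, memory stays flat no matter how many messages go through.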
It didn’t take long, after that, for me to isolate the problem to the layer underneath my messaging abstractions. I submitted an issue to the project with my complete test code and a detailed description of what was happening. A few days later a fix was available.
A Plea For Better Memory Profiling In Node
This was not a fun week for me. I found myself wanting better tooling, with more instruction on how to use it in Node. It was dumb luck and intuition, based on observation of the running code, that allowed me to find and resolve the issue – not the tooling I was told to use. And that makes me sad.
So, I’m pleading with the Node community and maintainers: someone show me that I’m wrong here. Show me the power of great tooling in Node, available to the general public (and not hidden behind a paywall of “enterprise node”).