The last post addressed a handle leak in a managed console application. The profiler seemed unable to give us enough information about the source of the leak, so we’re going to turn to our faithful friend WinDbg to get to the bottom of the problem.
First let’s start the console process and let it get to a stable state, then attach to the it by opening WinDbg and pressing F6, which brings up the Attach to Process Window.
Once we’re attached the process will be paused, allowing us to load SOS. We’re going to take the same tack and dump all of the SafeHandle-derived objects on the heap first via !dumpheap. The arguments “–stat” and “–type SafeHandle” tell !dumpheap to dump statistics rather than all the actual addresses on the heap (the default behavior) and to filter the results to type names containing “SafeHandle”, respectively.
Notice that we don’t need a * anywhere to make this work. That’s because !dumpheap’s text filter is more permissive than is the profiler’s from the first part of this series.
Here are the results:
The chief culprit is circled, but sticks out like a sore thumb, anyway. SOS also nicely sorts in ascending order by the TotalSize column so the suspects are usually arrayed near the bottom of the output from the !dumpheap command.
Now that we know what we need to investigate, we can start to investigate who is allocating these objects. Ideally we’ll be able to find a path to some user code within short order. The assumption when investigating these issues is that it’s probably in user code, not in the BCL. In order to get more information on the actual SafeWaitHandle objects on the heap we’ll need to use !dumpheap again, but with different arguments. This time we’ll leverage the –mt <methodTableAddr> argument which will dump the addresses of all objects with a given method table. The method table address is taken from the output of the previous command, highlighted in the below image.
The output is a large string of addresses of SafeWaitHandle objects.
We’ll choose a few at random and see if we can find out who owns it using the !gcroot <objectAddr> command.
The output from the command is a list of all the roots of the object whose address was supplied to the command. Roots are typically objects on the stack, statics or objects which were explicitly marked as GC roots. An arrow-delimited list in the form root->child->child->etc shows how the object is reachable from one or more roots.
This object is a part of a ManualResetEvent which is itself part of a SingleThreadedOperation object. SingleThreadedOperation is user code. Why is it allocating ManualResetEvents? We need to be able to answer this question in order to confirm that this is or is not the source of the issue at hand. When we run !gcroot on a few more of the SafeWaitHandles we see some slightly different output.
The ownership path on two of the other objects are not present. While this might seem like a bug with !gcroot it’s actually telling us something very important. No one owns these objects and they’re candidates for garbage collection. If we continue to !gcroot objects off the heap we find a vast majority of them are unrooted. This explains one of the oddities of our profiler session. Why couldn’t we see all these objects in the snapshots when the handle’s process count clearly indicated that many were being allocated? The answer appears to be that they were being allocated and then quickly became unreachable. The profiler performs full collections before a snapshot which renders it unable to find the objects whose entire lifetime occurred in the period between two snapshots. Luckily WinDbg does not have this limitation!
So, now what? We can’t trace the ownership of the now-unreachable SafeWaitHandles, but we can set a breakpoint in the class’ constructor and figure out who is allocating them. Not quite as good, but it will likely yield good information. To set breakpoints we will use SOSEX. Loading it is very simple. In this case it’s located in the same directory as WinDbg so the relative path supplied to .load will suffice (we can omit the .dll).
SafeWaitHandles are a little too low-level for our purposes so let’s see if we can find out what higher-level construct is actually creating them. Knowing that xResetEvents are implemented using WaitHandles we can go ahead and dump information about the ResetEvents on the heap in order to set a breakpoint in the ManualResetEvent’s constructor.
The output contains the method table address of the types in question, we’ll use this to set our breakpoint.
The number of ManualResetEvents is smaller than the number of SafeWaitHandles above because between a few of these commands a GC was triggered and swept up a large number of them.
In order to actually set a breakpoint using !mbm we need the full name of the method in question. We can get this from a list of all the method descriptions for the type of interest. Using !dumpmt with the –md argument we can see a list of all the methods defined on ManualResetEvent.
The output looks like this:
Setting a breakpoint is very easy, we use the !mbm command from SOSEX to do so, supplying it with the full method name.
There will be a message indicating success or failure after issuing the mbm command.
The breakpoint has been successfully set, continuing (using g or F5) we get hit very quickly. A quick look at the stack trace tells us who is allocating the ManualResetEvent. It’s our old friend: SingleThreadedOperation.
Allowing the app to continue a few more times this stack trace is repeated over and over again. SingleThreadedOperation was a person of interest before, now it’s a prime suspect.
A quick dump of all the SingleThreadedOperations on the heap yields an instance count suspiciously similar to the number of ManualResetEvents on the heap.
Looking at the code we can see that this class aggregates a ManualResetEvent and implements IDisposable. Everything looks good, so far. What’s this, though? A timer is allocating 2-3 instances per second and not disposing them! A look around the codebase confirms that no one is disposing the objects, and they go out of scope very quickly. So while the garbage collector finds them very quickly we can eliminate the performance problem by inserting a using block to dispose of the object before it goes out of scope. With that fix in place the leak is no more, and our handle count is nice and flat once again.
This is very cool.
ReplyDelete