Wherein we identify the culprit behind a resource leak of hundreds of uncollected, dead threads.
The following issue presented itself as an unusually high GUI memory footprint on a QA machine. Things then began to flake out in seemingly random ways, so naturally QA captured a memory dump and delivered it to development for investigation.
Running !threads, as is typical, immediately reveals something alarming: an astounding number of dead threads. Considering how little concurrent work the GUI actually does, 12 threads would be high, but 125 is astronomical.
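The summary header of !threads makes the problem obvious at a glance. The counts below are illustrative, reconstructed from the description above (the original screenshot is not reproduced here):

```
0:000> !threads
ThreadCount:      137
UnstartedThread:  0
BackgroundThread: 10
PendingThread:    0
DeadThread:       125
Hosted Runtime:   no
```

A healthy GUI process would show a DeadThread count near zero; dead threads that linger by the hundreds mean something is rooting their managed Thread objects.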
An “XXXX” in the managed ID column indicates that the thread has finished running yet its managed Thread object still lives on the heap. Unfortunately the ThreadOBJ addresses are native CLR thread structures, not valid managed objects, so !dumpobject will not work on them. The solution is to use !dumpheap -type System.Threading.Thread to find all the managed thread objects to inspect.
With this information, running !gcroot on the thread objects will show what is keeping them alive. Running this command on many of the thread objects reveals a distinct pattern.
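In practice the two commands chain together like this (addresses are illustrative, not from the actual dump):

```
0:000> !dumpheap -type System.Threading.Thread
 Address       MT     Size
022c1afc 79b9adc8       52
022c5d38 79b9adc8       52
...

0:000> !gcroot 022c1afc
HandleTable:
    001813fc (async pinned handle)
    -> 022c4e88 System.Threading.OverlappedData
    -> ...
    -> 022c1afc System.Threading.Thread
...
```

Repeating !gcroot across a sample of the dead Thread objects is what exposes the pattern described next.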
There are 3 GC handles pointed directly or indirectly at the thread object. One is a “WeakSh” handle. This cryptic abbreviation indicates a “short” weak reference, meaning that it does not track the thread object after finalization. It is reasonable to conclude that managed threads create weak references to themselves upon creation so that they can be easily accessed without being kept alive once they have finished running. Since weak references do not prevent garbage collection, they are a side effect here, not the cause.
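The short/long distinction maps directly onto the WeakReference constructor. A minimal sketch (the Thread instance here is just a stand-in target):

```csharp
using System;
using System.Threading;

var thread = new Thread(() => { });

// A "short" weak reference -- shown as WeakSh by SOS's !gchandles.
// It stops tracking the target once the object becomes eligible
// for finalization.
var shortRef = new WeakReference(thread, trackResurrection: false);

// A "long" weak reference (WeakLn) keeps tracking the target even
// through finalization/resurrection.
var longRef = new WeakReference(thread, trackResurrection: true);

// Neither variety roots the target: once `thread` finishes and is
// collected, shortRef.IsAlive becomes false.
```

This is why the WeakSh handle can be ruled out immediately: it cannot, by definition, be what keeps the dead threads on the heap.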
What of the other handles, then? They are not pointed directly at the thread object, but are keeping it alive indirectly via the OverlappedData class. The OverlappedData structure has been “async” pinned, which is like regular pinning but specific to asynchronous I/O in the CLR. When performing async I/O via a BeginXXX method, user code supplies a buffer into which the framework writes the results of the operation. Ultimately Windows writes data directly into this buffer when the async I/O completes, so the garbage collector cannot be allowed to move or collect the buffer before then, necessitating a pin. When EndXXX is called on the IAsyncResult returned by BeginXXX, the buffer is unpinned and eventually collected.
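The Begin/End contract just described looks roughly like this in user code (a sketch; `socket` is assumed to be a connected System.Net.Sockets.Socket):

```csharp
using System;
using System.Net.Sockets;

byte[] buffer = new byte[4096];

// BeginReceive hands the buffer to the OS. The CLR "async-pins" it
// through an OverlappedData so the GC can neither move nor collect
// it while the native I/O is outstanding.
socket.BeginReceive(buffer, 0, buffer.Length, SocketFlags.None,
    ar =>
    {
        var s = (Socket)ar.AsyncState;
        // EndReceive completes the operation; only now is the buffer
        // unpinned and the OverlappedData released for collection.
        int bytesRead = s.EndReceive(ar);
    },
    state: socket);
```

Skip the EndReceive call and the pin never comes off, which is exactly the failure mode this investigation is closing in on.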
There are a large number of these OverlappedData objects on the heap, and they alone are responsible for keeping the threads alive. Understanding that they are all necessarily part of async I/O leads naturally to the question: what async operation is this, and who initiated it?
Looking at the !gcroot output, the OverlappedData objects all appear to own a SocketAsyncEventArgs. A quick !dumpheap shows a count suspiciously close to the number of dead threads. Since nothing else appears to be allocating SocketAsyncEventArgs, they should be investigated for clues to the user code that initiated the async I/O.
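A statistics-only dump is enough to see the correlation; the counts below are illustrative, chosen to match the "suspiciously close to 125" observation:

```
0:000> !dumpheap -stat -type System.Net.Sockets.SocketAsyncEventArgs
      MT    Count    TotalSize Class Name
79c0e123      123        15744 System.Net.Sockets.SocketAsyncEventArgs
```

One SocketAsyncEventArgs per leaked thread is too tidy to be coincidence.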
The APM pattern dictates that every BeginXXX call must be supplied with a callback the framework can use to notify user code that the operation is complete. This callback is usually where user code calls the corresponding EndXXX method. Logically the callback should be stored somewhere in the SocketAsyncEventArgs object itself. Using !do to dig deeper into the SocketAsyncEventArgs, it is not immediately obvious where this callback might be stored. Perusing the field names, nearly everything seems related to the socket itself, the state of the operation, or the buffer used to hold the data. There is one notable exception, however: the m_UserToken field.
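The relevant slice of the !do output looks roughly like this (addresses and offsets illustrative; the full field list is elided):

```
0:000> !do 022c6f80
Name:        System.Net.Sockets.SocketAsyncEventArgs
MethodTable: 79c0e123
...
      MT    Field   Offset             Type VT     Attr    Value Name
...
79b56d60  4000a7e       24    System.Object  0 instance 022c7010 m_UserToken
...
```

Every other field is plumbing; m_UserToken is the one slot that can point back at arbitrary user state.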
The mysterious m_UserToken holds a pointer to an object somewhere on the heap. Dumping the object at that address might reveal how the SocketAsyncEventArgs encapsulates a callback to user code.
Success! Now things are beginning to take shape. The object in question is a DoubleTake.Gateway.Server+ConnectSocketResult. The + indicates that the ConnectSocketResult class is nested inside of DoubleTake.Gateway.Server. DoubleTake.Gateway.dll is a middleware library which does a lot of raw socket operations. However, since it is middleware and reasonably well-tested, it’s not likely to be the source of the errors at hand. Far more likely is that someone up the chain is using it incorrectly. Happily, the ConnectSocketResult has an AsyncCallback which can be examined to determine exactly which method is the callback for this async I/O.
Notice the _target field of the AsyncCallback. Dumping it gives half the picture, namely the target object, a GatewayClientController, but not the method. Getting the actual method requires a little more work.
The _methodPtr field on the AsyncCallback is simply an IntPtr to somewhere in memory. This is probably extremely efficient for the CLR but not very helpful when debugging a crash dump.
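For contrast, in a live process the same two pieces of information are exposed directly on the delegate; only in a dump must they be dug out by hand. A sketch (the `Callback` accessor on ConnectSocketResult is hypothetical, invented for illustration):

```csharp
using System.Reflection;

// Given any delegate instance, e.g. the AsyncCallback stored on the
// hypothetical ConnectSocketResult:
AsyncCallback cb = connectSocketResult.Callback; // hypothetical accessor

object target  = cb.Target; // what _target points at: the GatewayClientController
MethodInfo m   = cb.Method; // the method that _methodPtr ultimately resolves to
```

In the dump, _target is directly dumpable but _methodPtr is just a raw address, hence the disassembly detour that follows.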
The name of the field gives a clue that this IntPtr likely points to a method. SOS has a !u command for disassembling a managed method at a given address; using it could yield some useful information.
It turns out that the code at 11f66c8 is not even managed! That will not help by itself, but it cannot be far from the target; it is probably a stub created by the JIT compiler. If so, it likely calls another method which is managed. Happily, the very first assembly instruction is a jump to another memory location: 012547b8. Feeding that address to !u yields something much more interesting:
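The two !u invocations look roughly like this. The callback's name appeared in the original screenshot and is elided here rather than guessed at:

```
0:000> !u 11f66c8
Unmanaged code
011f66c8 e95be10d00      jmp     012547b8
...

0:000> !u 012547b8
Normal JIT generated code
GatewayClientController.<callback method elided>(System.IAsyncResult)
Begin 012547b8, size 4a
...
```

The second dump confirms the stub theory: one jump away sits real JIT-compiled managed code, and SOS prints its fully qualified name.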
Another way to get the same information is to dump the method table for the _target object’s type (GatewayClientController) via !dumpmt -md:
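A trimmed !dumpmt -md session looks roughly like this (addresses illustrative, method name elided as before):

```
0:000> !dumpmt -md 79d1e234
EEClass:      ...
Module:       ...
Name:         GatewayClientController
...
MethodDesc Table
   Entry MethodDesc    JIT Name
...
012547b8 012f39a0      JIT GatewayClientController.<callback method elided>(System.IAsyncResult)
...
```

Cross-referencing the stub's jump target against the Entry column is a handy sanity check when disassembly alone feels shaky.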
The method address from the stub appears in the Entry column of the method table and identifies the callback in question.
The chain is finally complete. From leaked threads to socket event args to async callbacks all the way up to the offending user code. A quick search of the codebase yields the location of the BeginXXX call that started all the trouble. After reviewing the code it becomes clear that under certain circumstances it would fail to call EndXXX on the returned IAsyncResult. By not calling EndXXX, the code leaked resources and fragmented memory because the framework was never able to unpin and collect the buffers it was creating to hold the socket data.
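The bug pattern can be sketched as follows. This is a hypothetical reconstruction of the shape of the defect, not the actual offending code; all names are invented:

```csharp
using System;
using System.Net.Sockets;

class GatewayClientController // name from the dump; body hypothetical
{
    private Socket _socket;
    private volatile bool _running;

    private void OnConnectComplete(IAsyncResult ar)
    {
        if (!_running)
            return; // BUG: the early exit skips EndConnect, so the
                    // async-pinned OverlappedData (and with it the
                    // dead thread) is never released.

        _socket.EndConnect(ar);
    }

    private void OnConnectCompleteFixed(IAsyncResult ar)
    {
        // FIX: every BeginXXX must be paired with EndXXX on every
        // path, even when the result will be discarded.
        _socket.EndConnect(ar);

        if (!_running)
            return;
        // ... continue normal processing ...
    }
}
```

The moral generalizes beyond sockets: under APM, EndXXX is not optional cleanup, it is the release half of the Begin/End resource pair.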