Monday, April 2, 2012

Debugging a Managed Thread Leak

Wherein we identify the culprit behind a resource leak of hundreds of uncollected, dead threads.

The following issue presented itself as an unusually high GUI memory footprint on a QA machine.  Things then began to flake out in seemingly random ways, so, naturally, QA captured a memory dump and delivered it to development for investigation.

Running !threads, as is typical, something alarming presents itself immediately.  Notice the astounding number of dead threads.  Considering how little concurrent work the GUI is actually doing, 12 threads is high, but 125 is astronomical.

image

An “XXXX” in the managed ID column indicates that the thread has finished running yet still lives on the managed heap.  Unfortunately the ThreadOBJ objects are not valid managed objects, so !dumpobject will not work on them.  The solution is to use !dumpheap -type System.Threading.Thread to find all the managed thread objects to inspect.

image

With this information, running !gcroot on the thread objects will show what is keeping them alive.  Running this command on many of the thread objects reveals a distinct pattern.
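The command sequence so far looks roughly like the following, with placeholder addresses standing in for the real values from the dump:

```
0:000> !threads
   ...many rows whose managed ID column shows XXXX (dead)...
0:000> !dumpheap -type System.Threading.Thread
   ...one System.Threading.Thread object per leaked thread...
0:000> !gcroot <thread object address>
   ...handle table roots, repeating the same pattern for each dead thread...
```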

image

There are 3 GC handles pointed directly or indirectly at the thread object.  One is a “WeakSh” handle.  This cryptic abbreviation indicates a “short-lived” weak reference, meaning that it does not track the thread object after finalization.  It is reasonable to conclude that upon creation managed threads create weak references to themselves so that they can be easily accessed but do not prevent garbage collection once they’re finished running.  Weak references do not prevent garbage collection, so they are a side-effect, not the cause.
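The short/long distinction is visible from managed code as well, via the trackResurrection flag on WeakReference.  A minimal sketch (the class name is illustrative; SOS reports the short variety as “WeakSh” and the long as “WeakLn”):

```csharp
using System;

class WeakRefDemo
{
    static void Main()
    {
        // "Short" weak reference (WeakSh in SOS): stops tracking its target
        // once the target has been finalized.
        var shortRef = new WeakReference(new object(), trackResurrection: false);

        // "Long" weak reference (WeakLn in SOS): continues to track the target
        // even through resurrection during finalization.
        var longRef = new WeakReference(new object(), trackResurrection: true);

        GC.Collect();
        GC.WaitForPendingFinalizers();

        // Neither weak reference kept its target alive; after a full GC the
        // short reference reports its target as dead.
        Console.WriteLine(shortRef.IsAlive);
    }
}
```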

What of the other handles, then?  They are not pointed directly at the thread object, but are keeping it alive indirectly via the OverlappedData class.  The OverlappedData structure has been “async” pinned, which must be like regular pinning but peculiar to asynchronous I/O in the CLR.  When performing async I/O via a BeginXXX method, user code supplies a buffer to the framework into which it can write the contents of the async I/O.  Ultimately Windows will write data directly to this buffer when the async I/O completes.  The garbage collector cannot be allowed to move or collect the buffer before this happens, necessitating a pin.  When EndXXX is called on the AsyncResult from BeginXXX, the buffer will be unpinned and eventually collected.
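As a sketch of the Begin/End pattern being described (the class, method names, and buffer size here are illustrative, not taken from the dump):

```csharp
using System;
using System.Net.Sockets;

class ApmSketch
{
    private readonly byte[] buffer = new byte[4096];

    public void StartReceive(Socket socket)
    {
        // The CLR async-pins 'buffer' here so that Windows can write into it
        // when the I/O completes; it stays pinned until EndReceive runs.
        socket.BeginReceive(buffer, 0, buffer.Length, SocketFlags.None,
                            OnReceiveComplete, socket);
    }

    private void OnReceiveComplete(IAsyncResult ar)
    {
        var socket = (Socket)ar.AsyncState;

        // EndReceive unpins the buffer and surfaces any error from the
        // operation.  Skipping this call leaks the pinned OverlappedData.
        int bytesRead = socket.EndReceive(ar);
    }
}
```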

There are a large number of these OverlappedData objects on the heap, and they are the ones solely responsible for keeping the threads alive.  Understanding that they are all necessarily part of async I/O leads naturally to the question: “what async operation is this and who initiated it?”.

Looking at the !gcroot output, OverlappedData objects all appear to own SocketAsyncEventArgs.  A quick !dumpheap shows a number suspiciously close to the number of dead threads.  Since no one else appears to be allocating SocketAsyncEventArgs, they should be investigated for clues to the user code which initiated the async I/O.

image

The APM pattern designates that every BeginXXX call must be supplied with a callback the framework can use to notify user code that the operation is complete.  This callback is usually where user code calls the corresponding EndXXX method. Logically the callback should be stored somewhere in the SocketAsyncEventArgs object itself.  Using !do to dig deeper into the SocketAsyncEventArgs, it is not immediately obvious where this callback might be stored.  Perusing the field names, most everything seems related to the socket itself, the state of the operation or the buffer used to hold the data.  There is one notable exception, however, underlined below:

image

The mysterious m_UserToken has a pointer to an object somewhere on the heap.  Dumping out that address might yield a clue to how the SocketAsyncEventArgs could encapsulate a callback to user code.

image

Success!  Now things are beginning to take shape.  The object in question is a DoubleTake.Gateway.Server+ConnectSocketResult.  The + indicates that the ConnectSocketResult class is nested inside of DoubleTake.Gateway.Server.  DoubleTake.Gateway.dll is a middleware library which does a lot of raw socket operations.  However, since it is middleware and reasonably well-tested, it’s not likely to be the source of the errors at hand.  Far more likely is that someone up the chain is using it incorrectly.  Happily, the ConnectSocketResult has an AsyncCallback which can be examined to determine exactly which method is the callback for this async I/O.

Notice above the underlined _target field of AsyncCallback.  Dumping this object will give half the picture, namely the target object: GatewayClientController, but not the method.  To get the actual method requires a little more work.

The _methodPtr field on the AsyncCallback is simply an IntPtr to somewhere in memory.  This is probably extremely efficient for the CLR but not very helpful when debugging a crash dump.

image

The name of the field gives a clue that this IntPtr is likely pointing to a method.  SOS has a !u command for disassembling a managed method at a given address.  Using it could yield some useful information.

image

It turns out that the method at 11f66c8 is not even managed code!  That’s not helpful by itself, but it can’t be too far from our target.  It’s probably a stub created by the JIT compiler.  If so, then it likely calls another method which is managed.  Happily, the very first assembly instruction is a jump to another memory location: 012547b8.  Feeding that address to !u results in something much more interesting:

image

Another way to get the same information is to dump the method table for the _target object’s type (GatewayClientController) via !dumpmt -md:

image

The method address from the stub appears in the Entry column of the method table and identifies the callback in question:

image

The chain is finally complete.  From leaked threads to socket event args to async callbacks all the way up to the offending user code.  A quick search of the codebase yields the location of the BeginXXX call that started all the trouble.  After reviewing the code it becomes clear that under certain circumstances it would fail to call EndXXX on the returned AsyncResult.  By not calling EndXXX, the code was leaking resources and fragmenting memory because the framework was unable to unpin and collect all the buffers it was creating to hold the socket data.
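The offending code itself is not shown, but the shape of the bug, and its fix, was roughly the following (class, field, and method names are illustrative):

```csharp
using System;
using System.Net.Sockets;

class ConnectHandler
{
    private volatile bool shuttingDown; // illustrative state flag

    // BUG: an early-exit path skips EndConnect, leaking the OverlappedData,
    // its pinned buffer, and (indirectly) the dead thread object.
    private void OnConnectComplete(IAsyncResult ar)
    {
        if (shuttingDown)
            return; // EndConnect never runs on this path

        ((Socket)ar.AsyncState).EndConnect(ar);
    }

    // FIX: call EndConnect unconditionally, then decide what to do.
    private void OnConnectCompleteFixed(IAsyncResult ar)
    {
        var socket = (Socket)ar.AsyncState;
        try
        {
            socket.EndConnect(ar); // always observe the result first
        }
        catch (SocketException)
        {
            return; // the connect failed, but resources are still released
        }

        if (shuttingDown)
        {
            socket.Close();
            return;
        }
        // ...continue using the connected socket...
    }
}
```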

Monday, February 6, 2012

Disassembling a Managed Method

Wherein we dig into the native code for a JITed lambda method to determine what object caused an obscure null reference that crashed our console application.

It’s time for another post-mortem crash dump analysis.  This one comes direct from Windows Error Reporting (WER).  WER is a service provided by Microsoft which allows vendors to download dump files collected from their users’ machines that are encountering crashes or other problems.  This issue presents itself as a .CAB file provided by a manager.  The manager is wondering if this crash bug is the same as one already fixed by the latest patch.

To get started, load up Windbg as usual, and select File->Open Crash Dump.  Happily, Windbg supports loading crash dumps from .CAB files directly.  Indeed that feature was probably created specifically for this scenario.  It automatically loads and stitches the minidump and memory dump together so that it can be debugged the same as any full dump file.

image

Don’t be alarmed by the message which shows up when loading SOS.  Everything works fine despite the misleading warning.

image

As usual, the first step is to run !Threads to determine which thread brought down the application.  From the output below it’s clear that the cause of this crash was a System.AggregateException on the finalizer thread.

image

This was almost certainly caused by an unobserved faulted Task somewhere in the application.  A prior post goes into depth about Tasks which go bad and crash applications.  The delayed nature of unobserved Task crashes means that the thread which actually encountered the error has likely returned to the pool and been dispatched to do something else.  In any case the stack is long gone, as are the locals and any other state data that might help to pinpoint exactly what’s causing the error. 

The AggregateException itself doesn’t have much to say, beyond that a Task brought the application down.  In order to get to the root of things, the inner exceptions must be examined.  Once again, Windbg has thought of this and provides a handy prompt (underlined below in red) informing its user of what to do next.

image

After getting to the bottom of all the inner exceptions, it appears that the original NullReferenceException originated in an anonymous delegate located on the JobOptionsImplementor class.  To facilitate local variable capture, each anonymous delegate gets its own class.  The compiler creates these classes automatically and tends to name them <>c__DisplayClassX, where X is a hex value.  The method name for the delegate often takes the form <ContainingMethodName>b__X, where ContainingMethodName is the name of the method in which the anonymous delegate is defined.  <GetRecommendedJobOptionsAsync>b__6() is the faulting method in this case, and one can reasonably conclude it’s a lambda contained inside the GetRecommendedJobOptionsAsync method on the JobOptionsImplementor class.
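For reference, a lambda that captures a local produces roughly this transformation (the exact generated names vary by compiler version; these are illustrative):

```csharp
using System;

class DisplayClassExample
{
    public Action GetWorker()
    {
        int capturedCount = 0; // a captured local forces a display class

        // The compiler hoists 'capturedCount' into a generated class named
        // something like '<>c__DisplayClass1' and emits this lambda as a
        // method on it named roughly '<GetWorker>b__0'.  Those are the
        // mangled names that show up in a crash dump's stack trace.
        return () => Console.WriteLine(capturedCount++);
    }
}
```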

Based on this information it’s sometimes a simple matter to go to the code and identify which anonymous delegate caused the problem, and sometimes that alone is enough to figure out where the null reference came from.  If the containing class had only one anonymous method, or at least had them all defined in different methods, it would be easy to associate a lambda in code with its compiler-generated method in the dump file.  Unfortunately, that is not the case in this scenario: the code shows fully four anonymous delegates in the same method.  That’s no help in finding the source of the null reference, so the investigation continues.

Reflector is very useful for determining which compiler-generated class corresponds to which lambda.  Excerpted below is the information gleaned from running Reflector on the dll in which the faulting method is defined.  Finally, a link from the mangled name to some actual code.

image

Having cleared that roadblock, another immediately presents itself.  It is not at all clear which of these lines could have produced the NullReferenceException.  More specifically, there are many lines which could throw due to a null, and several of them have chained property calls like “Object.Property.Property”, so that even a line number (if one were available) wouldn’t necessarily identify the culprit definitively.  Nothing is going right at this point.  There is one final lead to follow; hopefully it can shed some light on the situation.

image

In the image of the stack trace above, there is a mysterious red arrow pointing to a hex value.  This value is actually an offset pointing to the machine code instruction which caused the NullReferenceException.  Figuring out which instruction caused the exception and looking at the surrounding instructions might be just the information needed to close out this case.  Conveniently, SOS has an extension, !U, which can dump out the executable code associated with any managed method.  Methods that have not been JITed are stubs, and will always look the same, but once the JIT compiler hits a managed method it generates the native code and places it in memory.

In order to disassemble the method, !U requires a method description.  It’s a two-step process to get from method name to method description via SOS.  First, use SOS’ !name2ee with the class name to get a pointer to the compiler-generated class’ method table.  The class name in question is the name of the class from the NullReferenceException’s stack trace.  With the method table address in hand, pass it to SOS’ !dumpmt -md command to get a list of all the methods for the class and their method descriptions.  Each method in this table corresponds to an anonymous delegate in the JobOptionsImplementor class.  Highlighted below is the method of interest.
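With placeholder addresses standing in for the real values from the dump, the two-step sequence looks something like this:

```
0:000> !name2ee MyAssembly.dll JobOptionsImplementor+<>c__DisplayClass5
Module:      <module address>
MethodTable: <method table address>
...
0:000> !dumpmt -md <method table address>
           Entry       MethodDesc   JIT  Name
<entry address> <methoddesc address>  JIT  <GetRecommendedJobOptionsAsync>b__6()
...
```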

image

With the method description in hand, !U will dump out the assembly code.  Notice at the beginning it points out where the method begins in memory: 000006428273de00, as well as its size: 1ed.  The offset from the stack trace is 0x2e; adding that to the method’s begin address yields 000006428273de2e.  The instruction at that address takes the memory pointed to by the rcx register and compares it to zero.

image

Here’s where things take a turn for the theoretical.  Notice that the result of this comparison is not used anywhere; the next instruction moves rcx elsewhere in memory regardless of the result of the previous compare.  Strange, unless the point of the comparison is not its result at all!  Consider the case when rcx is 0: dereferencing ptr[0] will cause an access violation because 0x0 is protected memory.  It is probable that the purpose of the comparison is to generate an access violation when the pointer represented by rcx is null.  The CLR can then trap this access violation and generate a more meaningful NullReferenceException.

Accepting this hypothesis, the question remains: which object was null?  The next assembly command of interest appears to be a method call:

image

Specifically, this is a method call to address 00000642`801b35c0.  SOS, always helpful, shows the name of the managed method which lives at this address.  Using !U to sanity-check this output shows that SOS is, in fact, telling the truth.

image

This is important because it gives an idea of what C# code the assembly represents.  In the image below, an underlined portion of code calls GetRole() via the Engine property, which appears to have been inlined by the compiler.

image

Following from this, the assembly instruction preceding the method call must be null-checking the object on which GetRole() is called.  The object, in this case, is the TargetServer property of the this.jobConfiguration object.  That, then, is almost certainly the source of the null reference which started all this.  Comparing this information to the bug list confirms that this issue was indeed fixed in the most recent patch.  Disaster: averted!

With .NET’s extensive metadata and some very handy tools, even very obscure issues in far-flung locations fall to some diligent investigation and a little guesswork.

Thursday, January 12, 2012

Unobserved Task Exceptions

Wherein we examine a delayed crash due to an unobserved exception in a System.Threading.Task.

Recently we experienced a service crash when a Task we had spun up to do some work threw an exception and no one was there to catch it.  Normally we’d expect an immediate crash when an exception escapes from a ThreadStart routine because the AppDomain immediately unloads, but that’s not how Tasks work.  Contrary to expectations, not only will the crash almost certainly be delayed, it may not happen at all.  To demonstrate this behavior we have a simple application below which creates a task that waits a few seconds then tries to crash the application by throwing an unobserved exception.  Doubtless the task is disappointed to find that the expected crash does not materialize.
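The demo application only survives as an image below, but it likely looked something like this (a sketch targeting .NET 4.0, where an unobserved task exception is fatal at finalization; delays and messages are illustrative):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    static void Main()
    {
        // Fire-and-forget task that faults a few seconds in.  No one ever
        // calls Wait() or reads Exception, so the fault goes unobserved.
        Task.Factory.StartNew(() =>
        {
            Thread.Sleep(3000);
            throw new InvalidOperationException("trying to crash the app");
        });

        Thread.Sleep(5000); // the task has faulted by now, yet no crash

        Console.WriteLine("Press any key to crash...");
        Console.ReadKey();

        // Forcing a collection finalizes the faulted Task; on .NET 4.0 the
        // finalizer rethrows the unobserved exception and kills the process.
        GC.Collect();
        GC.WaitForPendingFinalizers();
    }
}
```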

image

What’s going on?  Clearly the process is having a crisis of identity.  The reality of the situation is that the framework has caught the exception and will hold on to it until the task is finalized.  Luckily the prompt informs us that we can still crash if only we’ll press any key.

image

Pressing a key as instructed triggers a full GC collection and crashes the process as expected.  The reason is that the collection forced the Task finalizer to run; indeed, the Task was the only object in the finalization queue here.  Task’s finalizer throws if there was an unobserved exception on the task, which is the whole point of the exercise.

image

Let’s pretend that we’re debugging this issue postmortem.  With a dump file from the suicidal process we’ve got to get to the bottom of what went wrong.  Dumping all the threads with !Threads (always a good first step) shows us that there is an exception on the finalizer thread.  This is a big neon “DANGER” sign waving a red flag.  Unhandled exceptions on the finalizer thread spell disaster for the process, so we’ll investigate that one right away.

image

Switching over to the finalizer thread and running !ClrStack shows us we are indeed finalizing a Task object.

image

Let’s print out the exception that’s causing us so much grief.  It turns out to be an AggregateException which contains the originating exception inside it.  Tasks always wrap the original exception in an AggregateException.  !PrintException is kind enough to point out that there is an inner exception and even gives the command to dump it.
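The same wrapping is visible when the exception is observed in code rather than postmortem; the original exception comes back inside an AggregateException.  A minimal sketch:

```csharp
using System;
using System.Threading.Tasks;

class AggregateDemo
{
    static void Main()
    {
        var task = Task.Factory.StartNew(
            () => { throw new InvalidOperationException("original failure"); });

        try
        {
            task.Wait(); // observing the task surfaces its exception
        }
        catch (AggregateException ae)
        {
            // The real culprit is the inner exception, just as
            // !PrintException shows in the dump.
            Console.WriteLine(ae.InnerException.GetType().Name);
        }
    }
}
```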

image

Once we’ve printed out the inner exception things are much more clear.  The AggregateException contains the real culprit, which was the intentional InvalidOperation thrown by a lambda inside of Program.Main.  Who would write code like this?

image

We’ve seen above via a contrived example how an unobserved exception in a task can bring down an entire process.  Since the timing of the crash is subject to the whims of the finalizer, there’s no way to know exactly when or if it will crash.  However, with diligence we can reconstruct what went wrong in the Task and hopefully fix it.

Wednesday, January 11, 2012

When !PrintException Isn’t Quite Enough

Wherein we’ll discuss a handy technique to get a little bit more information out of !PrintException.  Sometimes it’s all that’s needed to turn a seemingly impossible issue into a trivial one!

Another customer reporting crashes, this time we’ve got the logs and a crash dump to work from.  Here’s the last thing we logged before dying:

image

 

Everything is pretty straightforward so far.  Digging a little deeper into the logs from the crashed service instance we see a stack trace to help us pinpoint the source of the issue:

image

 

We’d like to verify that the log we’re seeing lines up with the crash dump, so we’ll load it up in WinDbg to do a quick sanity check.  Inspecting the active threads in the crash dump shows a thread with an IOException on it.  This lines up with the output from the log.  Let’s switch over to thread 21 and continue investigating.

image

 

Out of curiosity, let’s dump the stack trace and see where the thread was when this dump file was created:

image

 

Unfortunately the stack trace doesn’t contain much information about where the error occurred.  That’s because things have progressed past that point.  The service is crashing because there was an unhandled exception on one of the worker threads, but in our AppDomain we’ve got an unhandled exception handler for just this eventuality.  It gets called before shutdown so that we can write the dump.  That’s what this thread stack is showing.

 

So far everything is looking good, the last thing to confirm is that the exception is the same one we saw in the logs.  A quick !PrintException (!pe) will show us the exception on the thread, but something doesn’t look right:

image

 

Where is the rest of the stack trace?  The stack trace here looks like only the bottom half of the stack trace from the log, and we might hastily conclude that this exception is not the same as the one in the log.  The key to resolving the disparity is to delve a little deeper into the exception object itself.

image

 

Since an exception is an object like any other, we can employ !DumpObject (!do) on the address of the exception to inspect it more carefully.

image

 

The thing to notice is the _remoteStackTraceString.  When an exception is thrown it records the stack trace in the _stackTrace field on the exception object.  If the framework needs to marshal an already-thrown exception across a thread or remoting boundary, it must rethrow it on a new thread in a new context.  Rethrowing the exception would replace the original stack trace with the stack from the new throw location.  In order to avoid losing this vital information, the framework “prep”s the exception by storing the original stack trace string in _remoteStackTraceString.  Exception.ToString() prints out the remote stack trace prefixed by “Server stack trace:”, which we saw in the log excerpt at the beginning.  Unfortunately !PrintException does not output this information.  Luckily, dumping the original stack trace is as simple as a !DumpObject on the field’s address.
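With placeholder addresses, recovering the hidden stack looks like:

```
0:021> !DumpObj <exception address>
...
  ...  _remoteStackTraceString  ...  <string address>  ...
...
0:021> !DumpObj <string address>
String: at Server.DoWork() ...   (the missing top half of the stack trace)
```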

image

 

And there we have the rest of the stack trace.  This technique can come in handy anytime the exception at hand just doesn’t seem to have all the information we need.  There’s always the possibility that the key to the whole issue is hidden there just beneath the surface, all we need is the proper tool to bring it to light.