Wednesday, December 14, 2011

Using !Syncblk to Debug a Hung Operation

 

A customer has reported that when he attempts to delete one of his Jobs, the operation gets stuck partway through.  He let it sit there for an hour or two before giving up and calling tech support.  A quick look at the logs confirms his suspicions: it’s definitely stuck, but it’s unclear where, and even less clear why.  The customer was then directed to collect a process dump of the service while the job was stuck and send it back for analysis.

The service log isn’t going to help much in this scenario, because logging generally requires the developer to anticipate a potential problem and add logging during development to aid post-mortem analysis.  This is an unusual problem; developers don’t expect execution to stop seemingly at random, so there’s not much to go on in the logs.

WinDbg will come in handy here, as you probably guessed already.  It can provide a full picture of the state of the system while it is hung, which is precisely what this issue calls for.  First things first: a picture of what the threads are doing in the stuck state.  SOS’s !threads will do the job nicely:

image

 

The sheer number of threads sets off alarm bells immediately.  Additionally, two threads own locks, making them prime candidates for analysis.  Switching over to the first lock-owning thread and dumping its managed stack will help figure out what the code is doing:

image

 

The stack is a cause for concern, specifically the red underlined portion.  Thread 9 is waiting on a kernel event handle while holding a lock.  Obviously this is not an ideal situation, and based on the customer’s description of the issue the threads in question are likely in one of two states:

  1. A real, live deadlock
  2. A “duck” deadlock: it walks and quacks like a deadlock, but is not technically deadlocked

!SyncBlk can shed some light on what’s going on with the monitor objects.  Remember that a lock(object) statement is just shorthand for Monitor.Enter(object) plus a try…finally that exits the monitor, so !SyncBlk is effectively dumping information about anyone using lock() in their code.
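As a refresher, here is roughly what the compiler generates for a lock statement (a sketch of the classic expansion; newer compilers use a Monitor.Enter overload with a lock-taken flag, but the shape is the same):

```csharp
using System;
using System.Threading;

class Worker
{
    private readonly object _gate = new object();
    public int Count { get; private set; }

    public void DoWork()
    {
        // lock (_gate) { Count++; } expands to approximately:
        Monitor.Enter(_gate);
        try
        {
            Count++; // critical section
        }
        finally
        {
            Monitor.Exit(_gate); // released even if the body throws
        }
    }
}
```

Because every lock() is a Monitor underneath, any contended lock in the program will surface in !SyncBlk’s output.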

image

 

I’ve underlined what I believe to be the important information in !SyncBlk’s output.  Namely, that there is only one lock currently held in the system (I cannot explain why !threads seemed to indicate that there would be two) and that it is held by thread 9, which matches the output from !threads.  The MonitorHeld column gives an idea of how many threads are currently involved with this lock.  The owner adds 1 to the count, and each waiting thread adds 2 more.  So there should be 17 threads waiting and one that currently owns it (17*2+1 = 35).
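The arithmetic is easy to get backwards, so here it is spelled out (35 is the MonitorHeld value implied by the numbers above):

```csharp
using System;

static class SyncBlkMath
{
    // MonitorHeld = 1 (owner) + 2 * (number of waiting threads),
    // so waiters can be recovered from the column directly.
    public static int WaitersFromMonitorHeld(int monitorHeld)
        => (monitorHeld - 1) / 2;
}
```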

 

To print out all the thread stacks, use the thread prefix ~ with * to indicate all threads, then the e (execute) command to run !clrstack on each of them: ~*e !clrstack.

image

 

Looking at the output, there is a distinct pattern to the stack traces:

image

 

Below I’ve excerpted just the relevant portion that shows up in all of the waiting threads.

image

 

At this point the culprit is clear and it’s time to look at the code.  It turns out to be an infinite wait on a ManualResetEvent while holding a global lock.  The reset event represents the status of a socket operation, and this operation is known to occasionally fail to complete.  Unfortunately the code does not use a timeout, so if the event is never set, WaitOne() will never return and the global lock will never be released.  A number of threads then back up trying to acquire the global lock, and they will likely never make any forward progress.

 

The simplest solution, in this case, is to add a timeout to the WaitOne() call so that threads cannot get stuck forever waiting on an operation that will never complete.  With the failsafe timeout in place the socket operation can no longer monopolize the global lock.  There is still a potential performance problem, since the lock will be held until the operation times out, but at least nothing gets stuck forever.
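A sketch of the before-and-after, with hypothetical names (the real code wraps a socket operation whose completion sets the event):

```csharp
using System;
using System.Threading;

class SocketOperation
{
    private static readonly object GlobalLock = new object();
    private readonly ManualResetEvent _completed = new ManualResetEvent(false);

    // Before: _completed.WaitOne() with no timeout. If the socket
    // operation never signals the event, WaitOne never returns and
    // GlobalLock is never released -- every other thread piles up.
    public bool WaitForCompletion(TimeSpan timeout)
    {
        lock (GlobalLock)
        {
            // After: a failsafe timeout guarantees the lock is
            // eventually released even if the operation hangs.
            return _completed.WaitOne(timeout);
        }
    }
}
```

WaitOne returns false on timeout, so callers can distinguish a completed operation from one that was abandoned.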

Wednesday, December 7, 2011

Managed Console Handle Leak, Pt. 2

 

The last post addressed a handle leak in a managed console application.  The profiler couldn’t give us enough information about the source of the leak, so we’re going to turn to our faithful friend WinDbg to get to the bottom of the problem.

First let’s start the console process and let it reach a stable state, then attach to it by opening WinDbg and pressing F6, which brings up the Attach to Process window.

image

Once we’re attached the process will be paused, allowing us to load SOS.  We’re going to take the same tack as before and first dump all of the SafeHandle-derived objects on the heap via !dumpheap.  The arguments -stat and -type SafeHandle tell !dumpheap to dump statistics rather than all the actual addresses on the heap (the default behavior) and to filter the results to type names containing “SafeHandle”, respectively.
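The full command looks like this (the 0:000> prefix is just WinDbg’s current-thread prompt):

```
0:000> !dumpheap -stat -type SafeHandle
```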

image

Notice that we don’t need a * anywhere to make this work.  That’s because !dumpheap’s text filter is more permissive than the profiler’s from the first part of this series.

Here are the results:

image

The chief culprit is circled, though it sticks out like a sore thumb anyway.  SOS also helpfully sorts in ascending order by the TotalSize column, so the suspects are usually arrayed near the bottom of the output from the !dumpheap command.

Now that we know what to look for, we can start to investigate who is allocating these objects.  Ideally we’ll be able to find a path to some user code in short order; the working assumption when investigating these issues is that the problem is probably in user code, not in the BCL.  To get more information about the actual SafeWaitHandle objects on the heap we’ll need to use !dumpheap again, but with different arguments.  This time we’ll leverage the -mt <methodTableAddr> argument, which dumps the addresses of all objects with a given method table.  The method table address is taken from the output of the previous command, highlighted in the image below.

image

The output is a long list of addresses of SafeWaitHandle objects.

image

We’ll choose a few at random and see if we can find out who owns them using the !gcroot <objectAddr> command.

image

The output from the command is a list of all the roots of the object whose address was supplied.  Roots are typically objects on the stack, static fields, or objects which were explicitly marked as GC roots.  An arrow-delimited list of the form root->child->child->etc. shows how the object is reachable from one or more roots.

image

This object is part of a ManualResetEvent which is itself part of a SingleThreadedOperation object.  SingleThreadedOperation is user code.  Why is it allocating ManualResetEvents?  We need to answer this question in order to confirm whether this is the source of the issue at hand.  When we run !gcroot on a few more of the SafeWaitHandles we see some slightly different output.
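The ownership chain looks something like this (addresses and frame details invented for illustration; real SOS output includes more per-thread detail):

```
0:000> !gcroot 02f4a1b8
Thread 4:
    ESP:0443ef28:
        -> 02f49e70 SingleThreadedOperation
        -> 02f4a19c System.Threading.ManualResetEvent
        -> 02f4a1b8 Microsoft.Win32.SafeHandles.SafeWaitHandle
```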

image

The ownership paths for two of the other objects are not present.  While this might seem like a bug in !gcroot, it’s actually telling us something very important: no one owns these objects, and they’re candidates for garbage collection.  If we continue running !gcroot on objects from the heap we find that the vast majority are unrooted.  This explains one of the oddities of our profiler session.  Why couldn’t we see all these objects in the snapshots when the process’s handle count clearly indicated that many were being allocated?  The answer appears to be that they were allocated and then quickly became unreachable.  The profiler performs full collections before taking a snapshot, which renders it unable to find objects whose entire lifetime fell between two snapshots.  Luckily WinDbg does not have this limitation!

So, now what?  We can’t trace the ownership of the now-unreachable SafeWaitHandles, but we can set a breakpoint in the class’s constructor and figure out who is allocating them.  Not quite as good, but it will likely yield useful information.  To set breakpoints we will use SOSEX.  Loading it is very simple; in this case it’s located in the same directory as WinDbg, so the relative path supplied to .load will suffice (we can omit the .dll).

image

SafeWaitHandles are a little too low-level for our purposes, so let’s see if we can find out what higher-level construct is actually creating them.  Knowing that the xResetEvents are implemented using WaitHandles, we can go ahead and dump information about the ResetEvents on the heap in order to set a breakpoint in ManualResetEvent’s constructor.

image

The output contains the method table addresses of the types in question; we’ll use this to set our breakpoint.

image

The number of ManualResetEvents is smaller than the number of SafeWaitHandles above because a GC was triggered between a few of these commands and swept up a large number of them.

In order to actually set a breakpoint using !mbm we need the full name of the method in question.  We can get this from a list of all the method descriptors for the type of interest.  Using !dumpmt with the -md argument, we can see a list of all the methods defined on ManualResetEvent.

image

The output looks like this:

image

Setting a breakpoint is very easy: we use the !mbm command from SOSEX, supplying it with the full method name.
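For this case the command likely takes a form along these lines (the exact method-name syntax is my best guess; SOSEX’s !mbm also accepts wildcard patterns, which are handy when the full name is unclear):

```
0:000> !mbm System.Threading.ManualResetEvent..ctor
```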

image

A message indicating success or failure appears after issuing the !mbm command.

image

The breakpoint has been set successfully; continuing (using g or F5), it gets hit very quickly.  A quick look at the stack trace tells us who is allocating the ManualResetEvent.  It’s our old friend: SingleThreadedOperation.

image

Allowing the app to continue a few more times, the same stack trace repeats over and over again.  SingleThreadedOperation was a person of interest before; now it’s a prime suspect.

A quick dump of all the SingleThreadedOperations on the heap yields an instance count suspiciously similar to the number of ManualResetEvents on the heap.

image

Looking at the code we can see that this class aggregates a ManualResetEvent and implements IDisposable.  Everything looks good so far.  What’s this, though?  A timer is allocating 2-3 instances per second and not disposing them!  A look around the codebase confirms that no one is disposing the objects, and they go out of scope very quickly.  So while the garbage collector finds them very quickly, we can eliminate the performance problem by inserting a using block to dispose of each object before it goes out of scope.  With that fix in place the leak is no more, and our handle count is nice and flat once again.
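A sketch of the fix; the shape of SingleThreadedOperation here is assumed from the description above, since the real class’s members aren’t shown:

```csharp
using System;
using System.Threading;

class SingleThreadedOperation : IDisposable
{
    private readonly ManualResetEvent _done = new ManualResetEvent(false);
    public bool IsDisposed { get; private set; }

    public void Run()
    {
        // ...do the work, then signal completion...
        _done.Set();
    }

    public void Dispose()
    {
        _done.Dispose();   // releases the kernel event handle now,
        IsDisposed = true; // instead of waiting for the finalizer
    }
}

static class TimerCallbacks
{
    // Before the fix the timer did roughly new SingleThreadedOperation().Run()
    // with no dispose, so each handle lingered until a GC ran its finalizer.
    public static void OnTimer(object state)
    {
        using (var op = new SingleThreadedOperation())
        {
            op.Run();
        } // handle closed deterministically here
    }
}
```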

Managed Console Handle Leak

We were seeing a minor handle leak in our console the other day, so let’s turn on Performance Monitor, set it up to watch handle usage, and point it at the console process.  Here’s what we see:

image

This is not good, but it exhibits a pattern that should be familiar to C# developers.  It’s not a true “leak” in that the handle count drops back down after a short amount of time.  So what’s going on?

The first thing that comes to mind is that we’re regularly allocating SafeHandles, either directly (that would be weird) or, much more likely, indirectly through a helper class like a FileStream or an xResetEvent.  That would explain why the handle count steadily climbs over the lifetime of the application.  The periodic drops can be explained by the garbage collector: whenever it performs a collection, the GC sweeps up a bunch of unreachable SafeHandles and runs their finalizers, releasing many kernel handles at once.  The frequency and regularity of the drops suggest the objects being swept up are in Generation 0, which is collected at the highest rate.
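A minimal demonstration of the lifecycle described above: each ManualResetEvent wraps a kernel event handle in a SafeWaitHandle, and an undisposed instance keeps that handle open until its finalizer runs after a collection, which is exactly what produces the sawtooth in the graph:

```csharp
using System;
using System.Threading;

var mre = new ManualResetEvent(false);

// While the object is alive, the kernel event handle is open
// and counts against the process handle count.
bool openWhileAlive = !mre.SafeWaitHandle.IsClosed;

// Disposing releases the handle immediately; without it, the
// handle count only drops when the GC runs the finalizer.
mre.Dispose();
bool closedAfterDispose = mre.SafeWaitHandle.IsClosed;
```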

Now that we’ve got an idea of what to look for, we can run the process under a profiler and try to identify who is allocating all these handles.  SciTech’s .NET Memory Profiler is very good at diagnosing handle leaks (we’ve had a few!), so let’s start it up and tell it to profile an application.

 

image

We’ll start up the wizard by clicking on “Profile Application”.

image

We’ll go through the wizard, selecting all the defaults until reaching the “unmanaged resource tracker” page.  Enabling this will allow us to determine whether we’ve got native code leaking handles.  Our console is a mixed-mode process, so this can help point us toward the culprit if it’s not in managed code.

image

So here are the settings at the end of the wizard.  We’ll click Start and begin profiling.

image

 

By and large, we’re not interested in the vagaries of process bootstrapping, so we’ll wait until the console process is “stable” before taking our first snapshot.

It’s worth pointing out that resource usage is a little higher than usual while the process runs inside the profiler.  Don’t panic!  While the absolute numbers are a little higher, the pattern we’re investigating remains.

When investigating out-of-control handle usage, we can leverage some knowledge of how the .NET BCL is designed to find the handles quickly.  All kernel handles in .NET are wrapped in SafeHandles, which have finalizers to protect against leaking system resources in case a developer forgets to dispose of them.  They’re all defined in the Microsoft.Win32.SafeHandles namespace, so we can use a simple filter, *SafeHandle, to show all the live SafeHandle objects in the system:

image

 

The results are surprising: the total number of live SafeHandle objects is only 32?  Looking at the process’s handle count we see numbers over 1000, so these objects obviously do not account for all the handles.  Let’s look at the unmanaged resources to see if we can find the rest of the process’s handles.  To do so we simply remove the namespace filter and enter a Name/Resource filter.  The unmanaged resource tracker we turned on earlier names all the handles in the process with the suffix (HANDLE), so our filter will be *(HANDLE).  These are the results:

image

Again the numbers don’t add up.  There seem to be an awful lot of event handles, but not nearly enough to account for all of them.  Let’s take another snapshot and see what shows up.  We’ll apply our *SafeHandle filter and look at how many new handles were allocated between the two snapshots:

image

Now there are only 9 SafeHandles?  So we lost three between snapshots!  This does not track with what we’re seeing in Performance Monitor.  Let’s check the unmanaged resources:

image

This, too, has not changed from the previous snapshot.  How, then, do we explain the steady climb and drop in the handle count if the count is steady between snapshots?  The profiler doesn’t tell us enough, but we’ll find the answer in part 2!