A customer reported that when he attempts to delete one of his Jobs, the operation gets stuck partway through. He let it sit for an hour or two before giving up and calling tech support. A quick look at the logs confirmed his suspicion: it’s definitely stuck, but it’s unclear where, and even less clear why. The customer was directed to collect a process dump of the service while the job was stuck and send it back for analysis.
The service log isn’t going to help much in this scenario, because logging generally requires the developer to anticipate a problem and add log statements during development that will aid post-mortem analysis. This is an unusual problem; developers don’t expect execution to stop seemingly at random, so there’s not much to go on in the logs.
WinDbg will come in handy here, as you probably guessed. It can provide a full picture of the state of the system while it is hung, which is precisely what this issue calls for. First things first: a picture of what the threads are doing in the stuck state. SOS’s !threads will do the job nicely:
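For reference, loading SOS and listing the managed threads looks roughly like this (the exact load command depends on the runtime version in the dump):

```
.loadby sos clr  $$ load SOS from the loaded CLR module (use "coreclr" on .NET Core)
!threads         $$ list managed threads, their lock counts, and any exceptions
```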
The sheer number of threads sets off alarm bells immediately. Additionally, two threads own locks and are prime candidates for analysis. Switching over to the first thread that owns a lock and dumping its managed stack will help figure out what the code is doing:
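Switching context and dumping the stack is a two-step affair in WinDbg (thread 9 here is the lock owner from the !threads output; in practice, use whatever debugger thread index !threads reports):

```
~9s        $$ switch the debugger context to thread 9
!clrstack  $$ dump the managed call stack for the current thread
```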
The stack is a cause for concern, specifically the red-underlined portion. Thread 9 is waiting on a kernel event handle while holding a lock. Obviously this is not an ideal situation, and based on the customer’s description of the issue, the threads in question are likely in one of two states:
- A real, live deadlock
- A “duck” deadlock: it walks and quacks like a deadlock, but is not technically deadlocked
!SyncBlk can shed some light on what’s going on with the monitor objects. Remember that lock(object) is just shorthand for Monitor.Enter(object) plus a try…finally that exits the monitor, so !SyncBlk is effectively dumping information about everyone using lock() in their code.
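To make that equivalence concrete, the C# compiler expands a lock statement into roughly the following (this is the C# 4.0+ pattern; `_syncRoot` is an illustrative name):

```csharp
bool lockTaken = false;
try
{
    // Enter sets lockTaken to true only if the monitor was acquired,
    // so the finally block knows whether it is safe to exit.
    Monitor.Enter(_syncRoot, ref lockTaken);

    // ... body of the lock statement ...
}
finally
{
    if (lockTaken)
    {
        Monitor.Exit(_syncRoot);
    }
}
```

Every monitor acquired this way shows up in !SyncBlk’s output, which is what makes it such a useful tool for lock-contention problems.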
I’ve underlined what I believe to be the important information in !SyncBlk’s output. Namely, that there is only one lock currently held in the system (I cannot explain why !threads seemed to indicate that there would be two) and that it is held by thread 9, which matches the output from !threads. The MonitorHeld column gives an idea of how many threads are currently involved with this lock: the owner adds 1 to the count, and each waiting thread adds 2 more. So with a MonitorHeld value of 35 (17 × 2 + 1), there should be one thread that currently owns the lock and 17 threads waiting on it.
To print out all the thread stacks, use the thread identifier `~` with `*` to target every thread, then the `e` (execute) command to run !clrstack on each of them: `~*e !clrstack`.
Looking at the output, there is a distinct pattern to the stack traces:
Below I’ve excerpted just the relevant portion that shows up in all of the waiting threads.
At this point the culprit is clear, and it’s time to look at the code. It turns out to be an infinite wait on a ManualResetEvent while holding a global lock. The reset event represents the status of a socket operation, and this operation is known to occasionally fail to complete. Unfortunately, the code does not use a timeout, so if the event is never set, WaitOne() never returns and the global lock is never released. Meanwhile, a number of threads back up trying to acquire the global lock, and none of them will ever make forward progress.
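The buggy pattern looks roughly like this (a reconstruction; the field and method names are illustrative, not from the actual codebase):

```csharp
private static readonly object _globalLock = new object();
private static readonly ManualResetEvent _operationComplete =
    new ManualResetEvent(false);

public void DeleteJob()
{
    lock (_globalLock)
    {
        // Kicks off a socket operation that signals _operationComplete
        // when it finishes -- except sometimes it never does.
        BeginSocketOperation();

        // No timeout: if the event is never set, this blocks forever
        // while still holding _globalLock, and every other thread
        // piles up on the lock statement above.
        _operationComplete.WaitOne();
    }
}
```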
The simplest solution, in this case, is to add a timeout to the WaitOne() call so that threads cannot wait forever on an operation that will never complete. With the failsafe timeout in place, the socket operation can no longer monopolize the global lock indefinitely. There is still a potential performance problem, since the lock is held until the operation times out, but at least nothing gets stuck forever.
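A sketch of the fix, using the same illustrative names as above (the 30-second timeout is an assumption; the right value depends on how long the socket operation is normally expected to take):

```csharp
lock (_globalLock)
{
    BeginSocketOperation();

    // Failsafe: WaitOne returns false if the timeout elapses before
    // the event is signaled, so the lock is always released eventually.
    if (!_operationComplete.WaitOne(TimeSpan.FromSeconds(30)))
    {
        throw new TimeoutException(
            "The socket operation did not complete in time.");
    }
}
```

Note that `WaitOne(TimeSpan)` returns a bool rather than throwing on timeout, so the caller has to check the return value; silently ignoring a false result would reintroduce a subtler version of the same bug.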