Wherein we dig into the native code for a JITed lambda method to determine what object caused an obscure null reference that crashed our console.
It’s time for another post-mortem crash dump analysis. This one comes direct from Windows Error Reporting (WER). WER is a service provided by Microsoft which allows vendors to download dump files collected from their users’ machines that are encountering crashes or other problems. This issue presents itself as a .CAB file provided by a manager. The manager is wondering if this crash bug is the same as one already fixed by the latest patch.
To get started, load up Windbg as usual, and select File->Open Crash Dump. Happily, Windbg supports loading crash dumps from .CAB files directly. Indeed that feature was probably created specifically for this scenario. It automatically loads and stitches the minidump and memory dump together so that it can be debugged the same as any full dump file.
Don’t be alarmed by the message which shows up when loading SOS. Everything works fine despite the misleading warning.
As usual, the first step is to run !Threads to determine which thread brought down the application. From the output below it’s clear that the cause of this crash was a System.AggregateException on the finalizer thread.
This was almost certainly caused by an unobserved faulted Task somewhere in the application. A prior post goes into depth about Tasks which go bad and crash applications. The delayed nature of unobserved Task crashes means that the thread which actually encountered the error has likely returned to the pool and been dispatched to do something else. In any case the stack is long gone, as are the locals and any other state data that might help to pinpoint exactly what’s causing the error.
The AggregateException itself doesn’t have much to say, beyond that a Task brought the application down. In order to get to the root of things, the inner exceptions must be examined. Once again, Windbg has thought of this and provides a handy prompt (underlined below in red) informing its user of what to do next.
After getting to the bottom of all the inner exceptions, it appears that the original NullReferenceException originated in an anonymous delegate located on the JobOptionsImplementor class. To facilitate local variable capture, the compiler moves anonymous delegates into classes that it generates automatically and tends to name <>c__DisplayClassX, where X is a hex value. The method name for the delegate often takes the form <ContainingMethodName>b__N, where ContainingMethodName is the name of the method in which the anonymous delegate is defined and N is a compiler-assigned index. <GetRecommendedJobOptionsAsync>b__6() is the faulting method in this case, and one can reasonably conclude it’s a lambda contained inside the GetRecommendedJobOptionsAsync method on the JobOptionsImplementor class.
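The naming convention described above is regular enough to pick apart mechanically. Below is a minimal Python sketch, assuming frame names follow the <ContainingMethodName>b__N pattern seen in this dump; the helper name containing_method is hypothetical, and other compiler versions may mangle names differently.

```python
import re

# Matches compiler-generated lambda method names such as
# "<GetRecommendedJobOptionsAsync>b__6". The suffix after "b__" is kept as
# an opaque string because its exact format is compiler-specific.
LAMBDA_NAME = re.compile(r"^<(?P<method>[^<>]+)>b__(?P<index>\w+)$")


def containing_method(frame_name):
    """Return the name of the method that defines the lambda, or None.

    Hypothetical helper: it assumes the mangling shown in the stack trace
    and tolerates a trailing "()" copied from the trace.
    """
    if frame_name.endswith("()"):
        frame_name = frame_name[:-2]
    match = LAMBDA_NAME.match(frame_name)
    return match.group("method") if match else None


print(containing_method("<GetRecommendedJobOptionsAsync>b__6()"))
# prints: GetRecommendedJobOptionsAsync
```

This only recovers the containing method. It cannot say which of several lambdas in that method the index refers to, which is why a decompiler is still needed.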
Based on this information it’s sometimes a simple matter to go to the code and identify which anonymous delegate caused the problem, and sometimes that alone is enough to figure out where the null reference came from. If the containing class had only one anonymous method, or if its anonymous methods were at least all defined in different methods, it would be easy to associate a lambda in code with its compiler-generated method in the dump file. Unfortunately, that is not the case in this scenario. The code shows that there are no fewer than four anonymous delegates in the same method. That’s no help in finding the source of the null reference, so the investigation continues.
Reflector is very useful for determining which compiler-generated class corresponds to which lambda. Excerpted below is the information gleaned from running Reflector on the dll in which the faulting method is defined. Finally, a link from the mangled name to some actual code.
Having cleared that roadblock, another immediately presents itself. It is not at all clear which of these lines could have produced the NullReferenceException. More specifically, there are many lines which could throw due to a null, and several of them have chained property calls like “Object.Property.Property”, so even a line number (if there were one available) wouldn’t necessarily identify the culprit definitively. Nothing is going right at this point. There is one final lead to follow, and hopefully it can shed some light on this situation.
In the image of the stack trace above, there is a mysterious red arrow pointing to a hex value. This value is actually an offset, relative to the start of the method’s native code, pointing to the machine code instruction which caused the NullReferenceException. Figuring out what instruction caused the exception and looking at the surrounding instructions might be just the information needed to close out this case. Conveniently, SOS has a command, !U, which can dump out the executable code associated with any managed method. Methods that have not been JITed are stubs, and will always look the same, but once the JIT compiler hits a managed method it generates the machine code and stores it in the process’s memory.
In order to disassemble the method, !U requires a method description (a MethodDesc, in SOS terms). It’s a two-step process to get from method name to method description via SOS. First, use SOS’ !name2ee with the module name and the class name to get a pointer to the compiler-generated class’ method table. The class name in question is the name of the class from the NullReferenceException’s stack trace. With the method table address in hand, pass it to SOS’ !dumpmt -md command to get a list of all the methods for the class and their method descriptions. Each method in this table corresponds to an anonymous delegate in the JobOptionsImplementor class. Highlighted below is the method of interest.
With the method description in hand, !U will dump out the assembly code. Notice at the beginning it points out where the method begins in memory: 000006428273de00, as well as its size: 1ed. The offset from the stack trace is 0x2e; adding that to the method’s start address yields 000006428273de2e. The instruction at that address takes the memory pointed to by the rcx register and compares it to zero.
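The address arithmetic above is easy to get wrong by hand, so here is a minimal Python sketch that repeats it using the start address, size, and offset quoted from this dump. The function names are hypothetical, and the bounds check is an added sanity test, not something !U performs.

```python
def faulting_address(method_start, method_size, offset):
    """Add the stack-trace offset to the start address of the JITed method.

    The offset must fall inside the method body, otherwise it belongs to a
    different method (or was misread from the trace).
    """
    if not 0 <= offset < method_size:
        raise ValueError("offset lies outside the method body")
    return method_start + offset


def windbg_format(address):
    """Render a 64-bit address the way WinDbg prints it: 8 digits, a backtick, 8 digits."""
    text = f"{address:016x}"
    return f"{text[:8]}`{text[8:]}"


# Values taken from the !U output discussed above.
address = faulting_address(0x000006428273DE00, 0x1ED, 0x2E)
print(windbg_format(address))
# prints: 00000642`8273de2e
```

Searching the !U listing for the printed address locates the faulting instruction directly.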
Here’s where things take a turn for the theoretical. Notice that the result of this comparison is not used anywhere: the next instruction moves rcx elsewhere in memory regardless of the result of the previous compare. Strange, unless the value of the comparison does not come from its result! Consider the case when rcx is 0: reading from address 0 will cause an access violation because 0x0 is protected memory. It is probable that the purpose of the comparison is to generate an access violation when the pointer held in rcx is null. The CLR can then trap this access violation and generate a more meaningful NullReferenceException.
Accepting this hypothesis, the question remains: which object was null? The next instruction of interest appears to be a method call:
Specifically, this is a method call to address 00000642`801b35c0. SOS, always helpful, shows the name of the managed method which lives at this address. Using !U to sanity-check this output shows that SOS is, in fact, telling the truth.
This is important because it gives an idea of what C# code the assembly code represents. Underlined in the image below is a portion of code which calls GetRole() via the Engine property, which appears to have been inlined by the compiler.
Following from this, the assembly instruction preceding the method call must be null-checking the object on which GetRole() is called. The object, in this case, is the TargetServer property of the this.jobConfiguration object. That, then, is almost certainly the source of the null reference which started all this. Comparing this information to the bug list confirms that this issue was indeed fixed in the most recent patch. Disaster: averted!
With .NET’s extensive metadata and some very handy tools, even very obscure issues in far-flung locations fall to some diligent investigation and a little guesswork.