NMI Crash Dumps

If you have a hung server, the obvious resolution is usually to reboot it. This may be both quick and easy, but it also robs you of the chance to find out why that server is in a hung state to begin with.

Non-Maskable Interrupts

Rather than power-cycling the computer, you should send it a diagnostic non-maskable interrupt (NMI) if possible. When configured appropriately (we'll come to this in a minute), Windows will crash with a bugcheck code of 0x80 NMI_HARDWARE_FAILURE and will write a memory dump to disk. Why is this better than just rebooting the server? Well, this way you have a memory dump to analyse, instead of hoping that something relevant got written to the event logs.

Configuring for NMI

On Windows Server 2012, the OS will respond to a diagnostic NMI with a bugcheck. On Windows Server 2008 and 2008 R2, the bugcheck will be generated so long as the NMICrashDump registry value under HKLM\System\CurrentControlSet\Control\CrashControl is left at the default value of 1. On Windows 2000 and Windows Server 2003, you will have to create and set this value manually, and it will require a reboot before it takes effect.
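As a rough sketch of the manual step on the older systems, the snippet below builds a reg.exe command line that sets NMICrashDump. The function name nmi_crash_dump_command is made up for illustration; the registry path and value name are the ones given above, and the REG_DWORD type is an assumption based on the value being a simple 0/1 flag. It only prints the command, so you can inspect it before running it from an elevated prompt on the target server.

```python
# Registry key that holds the crash-control settings (path taken from the text above).
CRASH_CONTROL_KEY = r"HKLM\System\CurrentControlSet\Control\CrashControl"


def nmi_crash_dump_command(enabled=True):
    """Return a reg.exe argument list that sets the NMICrashDump value.

    Hypothetical helper: it only builds the command, it does not run it.
    """
    return [
        "reg", "add", CRASH_CONTROL_KEY,
        "/v", "NMICrashDump",       # value name
        "/t", "REG_DWORD",          # assumed type for a 0/1 flag
        "/d", "1" if enabled else "0",
        "/f",                       # overwrite without prompting
    ]


if __name__ == "__main__":
    # Print the command so it can be reviewed before being run by hand.
    print(" ".join(nmi_crash_dump_command()))
```

Remember that on Windows 2000 and Windows Server 2003 the change still needs a reboot before it takes effect.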

Configuring for Crash Dumps

Don't forget that even if you have managed to configure Windows to bugcheck when a diagnostic NMI is issued, you also need to have it configured to write a memory dump when it bugchecks - or you'll end up with a blue screen crash and nothing of value to show for it. If you're troubleshooting a server hang, you will want at least a kernel memory dump to look at. A minidump will be of very limited use to you.
Briefly, this means you need:

If I had a penny for every time somebody has configured NMI crash dumps, but forgotten to enable crash dumps, I'd have... probably about enough to buy a small pack of gum.
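To check which kind of dump a server is set to write, you can query the CrashDumpEnabled value in the same CrashControl key (for example with reg query HKLM\System\CurrentControlSet\Control\CrashControl /v CrashDumpEnabled). The sketch below parses that output and tells you whether the setting is good enough for hang analysis. The function names are hypothetical, the sample output in the usage section is illustrative rather than captured from a real server, and treating the automatic memory dump setting as "good enough" is my own assumption, on the basis that it captures kernel memory.

```python
# CrashDumpEnabled values and what they mean.
DUMP_TYPES = {
    0: "none",
    1: "complete memory dump",
    2: "kernel memory dump",
    3: "small memory dump (minidump)",
    7: "automatic memory dump",
}

# Settings that give at least a kernel dump (7 included by assumption).
USEFUL_FOR_HANGS = {1, 2, 7}


def parse_crash_dump_enabled(reg_query_output):
    """Extract the CrashDumpEnabled number from 'reg query' text, or None."""
    for line in reg_query_output.splitlines():
        fields = line.split()
        # Expected shape: <name> <type> <hex value>
        if len(fields) == 3 and fields[0] == "CrashDumpEnabled":
            return int(fields[2], 16)
    return None


def describe_dump_setting(value):
    """Return (human-readable name, True if usable for hang analysis)."""
    return DUMP_TYPES.get(value, "unknown"), value in USEFUL_FOR_HANGS


if __name__ == "__main__":
    # Illustrative sample only, not output from a real machine.
    sample = (
        "HKEY_LOCAL_MACHINE\\System\\CurrentControlSet\\Control\\CrashControl\n"
        "    CrashDumpEnabled    REG_DWORD    0x3\n"
    )
    name, usable = describe_dump_setting(parse_crash_dump_enabled(sample))
    print(name, "- usable for hang analysis:", usable)
```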

Generating the NMI

OK - so you have your system configured to write memory dumps when it crashes, and to crash when it gets a diagnostic NMI. So how do you give it a diagnostic NMI?
Well, that all depends on the hardware platform or hypervisor you are using.

HP ProLiant iLO

ProLiant servers usually have an iLO (Integrated Lights-Out) device or an OA (Onboard Administrator), which allow you to control the power to the server, to view and interact with the console - and to generate an NMI. You will find the NMI option under the diagnostics page, where there is a big friendly button marked Generate NMI to system. Click it... but... carefully... if all is configured correctly, it will cause a bugcheck, which can be disconcerting for people not expecting this sort of thing (of course, being acquainted with the proper methods, *you* know better).

Oh - and don't forget to disable ASR (Automatic Server Recovery) on HP servers, else you'll likely find that ASR has rebooted the server long before you get a chance to NMI it. I have asked HP if they would be so kind as to alter ASR, so that it tries to NMI a server before it power-cycles it, but apparently there's no demand for such a feature. So, please, feel free to raise a support ticket with HP to ask them if there's a way to tell ASR to issue an NMI. It can only help in the long run.

VMware ESX

If you have a virtual machine which is running on a VMware ESX host, you need to SSH to the service console to generate the NMI. Run vm-support -x to get a list of the running VMs and their world IDs. You need the world ID to tell the hypervisor where to send the NMI. Once you have the world ID (let's say it's 1234), you can run /usr/lib/vmware/bin/vmdumper 1234 nmi to generate an NMI on whichever poor server is running in world ID 1234.
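If you script this, it is worth making sure that what you pass in really is a world ID before handing it to vmdumper. Here is a minimal sketch of that check; the function name vmdumper_nmi_command is made up for illustration, and it only builds the command shown above rather than running it.

```python
# Path to vmdumper on the ESX service console (as given in the text above).
VMDUMPER = "/usr/lib/vmware/bin/vmdumper"


def vmdumper_nmi_command(world_id):
    """Return the argument list that asks vmdumper to NMI the given world.

    Hypothetical helper: rejects anything that is not a positive integer,
    so a typo cannot turn into an unexpected vmdumper invocation.
    """
    wid = int(world_id)  # raises ValueError for non-numeric input
    if wid <= 0:
        raise ValueError("world ID must be a positive integer")
    return [VMDUMPER, str(wid), "nmi"]


if __name__ == "__main__":
    # 1234 is the example world ID used in the text.
    print(" ".join(vmdumper_nmi_command("1234")))
```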

VMware ESXi

ESXi is a little different to ESX, in that you don't get a full shell over an SSH connection. You can, however, use the vSphere Client to initiate an NMI.

To do this, locate the guest VM in the tree structure, and then choose File... Export... Export System Logs. You will then be taken to the Export System Logs window.

Expand the HungVM node, and check the box next to Send_NMI_To_Guest. Do not tick the Coredump_VM or Suspend_VM options, as on a hung VM, these will likely time out, and you will then go to retry the process and end up with two NMIs, the second of which will overwrite the useful memory dump with a useless one. Also - having the Suspend_VM option checked may cause the VM to be suspended before the memory dump has finished being written to disk.

You will be prompted for a location to store the core dump and performance data, even if you haven't selected those options. Just choose any writable location on your computer, and delete the empty tar/gzip file afterwards.

Egenera pServer

You will need to SSH to the control blade for the frame in which your target pServer is running. Once you have done this, you need to identify which pServer it is. Let's say the frame is called frame1.local and the pServer is in p10. You would run blade -n frame1.local/p10 to generate the NMI.

Dell DRAC

On a Dell iDRAC 7, you can issue an NMI from the Power Control screen. To get to this, expand Server in the left-hand pane and click on Power/Thermal, then choose Power Configuration from the menu bar across the top. Then choose NMI (Non-Masking (sic) Interrupt) from the radio buttons and click on Apply.

Unfortunately, Dell have omitted to include the NMI functionality in their racadm command line tool.

Other IPMI-based BMCs

Diagnostic NMI is part of the IPMI standard, so whichever baseboard management controller you have, there should be a way to get it to send an NMI to the server. But you'll have to dig out the manuals/user guides to look it up.
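One generic route worth trying is the open-source ipmitool utility, whose chassis power diag subcommand asks the BMC to pulse a diagnostic interrupt. Whether your particular BMC honours it is something you will have to test. The sketch below just assembles the command line; the function name, host name and user name are placeholders I have made up, and the -E option tells ipmitool to read the password from the IPMI_PASSWORD environment variable so that it does not end up in your shell history.

```python
def ipmi_diag_command(host, user, interface="lanplus"):
    """Return an ipmitool argument list that requests a diagnostic interrupt.

    Hypothetical helper: it builds the command but does not run it.
    """
    return [
        "ipmitool",
        "-I", interface,   # IPMI-over-LAN interface to use
        "-H", host,        # BMC host name or address
        "-U", user,        # BMC user name
        "-E",              # take the password from IPMI_PASSWORD
        "chassis", "power", "diag",
    ]


if __name__ == "__main__":
    # Placeholder BMC address and account, for illustration only.
    print(" ".join(ipmi_diag_command("bmc.example.local", "admin")))
```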

Other hardware

Many motherboard manufacturers, even for desktop machines, have a way to generate an NMI - usually by setting a jumper switch or shorting two pins. Check your motherboard's documentation to find out if you can do this. If you have to go unscrewing the case in order to do it, it probably isn't going to be much use unless it's a chronic, recurring system hang that you want to investigate.

What Now?

Now, we analyse the memory dump which has been generated. The way to do this is with WinDBG.

WinDBG

WinDBG is part of the Debugging Tools for Windows, which can be installed as part of the Windows SDK. If you are experiencing blue screens or hangs, or have a misbehaving application, WinDBG is an invaluable weapon in your arsenal. To paraphrase Raymond Chen - don't theorise over why your process is crying - plug in the debugger and find out why it's crying.

Once you have WinDBG installed and configured with symbols/symbol servers, it's time to open your memory dump.

As soon as you do this, WinDBG (sometimes affectionately called Windbag) will instantly start to run the !analyze macro against it. Ignore this, as you're troubleshooting a hang, not a crash.

Instead - run !analyze -v -hang, which will help you find driver deadlocks and suchlike. It might be that you are running low on system resources, in which case !vm will be of more use to you.
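If you have a lot of dumps to get through, the same commands can be run unattended with kd, the console debugger that ships alongside WinDBG in the Debugging Tools for Windows. The sketch below builds such a command line. The function names, the C:\symbols cache directory and the dump path are assumptions for illustration; the symbol path points at Microsoft's public symbol server.

```python
# Microsoft's public symbol server.
MS_SYMBOL_SERVER = "https://msdl.microsoft.com/download/symbols"


def symbol_path(cache_dir):
    """Return a srv* symbol path that caches downloaded symbols locally."""
    return f"srv*{cache_dir}*{MS_SYMBOL_SERVER}"


def hang_triage_command(dump_path, cache_dir=r"C:\symbols"):
    """Return a kd argument list that runs the hang analysis on a dump.

    Hypothetical helper: builds the command without running it.
    """
    # Run the hang analysis, show memory usage, then quit the debugger.
    commands = "!analyze -v -hang; !vm; q"
    return [
        "kd",
        "-z", dump_path,                # open a dump file
        "-y", symbol_path(cache_dir),   # where to find symbols
        "-c", commands,                 # commands to run on startup
    ]


if __name__ == "__main__":
    # Example dump location only; use wherever your server wrote its dump.
    print(hang_triage_command(r"C:\Windows\MEMORY.DMP"))
```

Redirecting kd's output to a text file gives you something you can search through or attach to a support case.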