Purple Screen of Death – PSOD

By | December 2, 2017

What is Purple Screen of Death – PSOD

The Purple Screen of Death (PSOD) is a fatal crash of a VMware ESX/ESXi host that kills all active virtual machines and displays a diagnostic screen with white type on a purple background. The term is a play on the Blue Screen of Death, the informal name users gave to the Windows stop error screen. Typically, the PSOD details the memory state at the time of the crash and includes other information such as the ESXi version and build, the exception type, a register dump, what was running on each CPU at the time of the crash, a backtrace, the server uptime, error messages and core dump information.

Interpreting the purple diagnostic screen

If the VMkernel experiences an error, the error appears in a purple diagnostic screen. The purple diagnostic screen looks similar to:

VMware ESX Server [Releasebuild-98103]
PCPU 1 locked up. Failed to ack TLB invalidate.
frame=0x3a37d98 ip=0x625e94 cr2=0x0 cr3=0x40c66000 cr4=0x16c
es=0xffffffff ds=0xffffffff fs=0xffffffff gs=0xffffffff
eax=0xffffffff ebx=0xffffffff ecx=0xffffffff edx=0xffffffff
ebp=0x3a37ef4 esi=0xffffffff edi=0xffffffff err=-1 eflags=0xffffffff
*0:1037/helper1-4 1:1107/vmm0:Fagi 2:1121/vmware-vm 3:1122/mks:Franc
0x3a37ef4:[0x625e94]Panic+0x17 stack: 0x833ab4, 0x3a37f10, 0x3a37f48
0x3a37f04:[0x625e94]Panic+0x17 stack: 0x833ab4, 0x1, 0x14a03a0
0x3a37f48:[0x64bfa4]TLBDoInvalidate+0x38f stack: 0x3a37f54, 0x40, 0x2
0x3a37f70:[0x66da4d]XMapForceFlush+0x64 stack: 0x0, 0x4d3a, 0x0
0x3a37fac:[0x652b8b]helpFunc+0x2d2 stack: 0x1, 0x14a4580, 0x0
0x3a37ffc:[0x750902]CpuSched_StartWorld+0x109 stack: 0x0, 0x0, 0x0
0x3a38000:[0x0]blk_dev+0xfd76461f stack: 0x0, 0x0, 0x0
VMK uptime: 7:05:43:45.014 TSC: 1751259712918392
Starting coredump to disk Starting coredump to disk Dumping using slot 1 of 1…using slot 1 of 1… log

Here is a breakdown of each section of the above purple diagnostic screen:

  • The Product and Build: VMware ESX Server [Releasebuild-98103]

    This section of the purple diagnostic screen identifies the product and build that has experienced the error. In this example, the product is VMware ESX Server build 98103.
  • The Error Message: PCPU 1 locked up. Failed to ack TLB invalidate

    This section of the purple diagnostic screen identifies the reported error message. There are only a finite number of error messages that can be reported. These error messages are discussed in this article.
  • The CPU Registers:
    frame=0x3a37d98 ip=0x625e94 cr2=0x0 cr3=0x40c66000 cr4=0x16c
    es=0xffffffff ds=0xffffffff fs=0xffffffff gs=0xffffffff
    eax=0xffffffff ebx=0xffffffff ecx=0xffffffff edx=0xffffffff
    ebp=0x3a37ef4 esi=0xffffffff edi=0xffffffff err=-1 eflags=0xffffffff

    These are the values that were in the physical CPU registers at the time of the error. The information in these registers may vary greatly between VMkernel errors. These registers can only be used internally when debugging a core dump of the VMkernel error. For more information about these registers, see http://www.intel.com/products/processor/manuals/ for Intel and http://support.amd.com/us/psearch/Pages/psearch.aspx for AMD. At the AMD site, search for the Architecture Programmer’s manual for your specific processor type.

    Note: The preceding links were correct as of March 28, 2013. If you find the links to be broken, provide feedback on the article and a VMware employee will update the article as necessary.
  • The Physical CPU: *0:1037/helper1-4 1:1107/vmm0:Fagi 2:1121/vmware-vm 3:1122/mks:Franc

    This section of the purple diagnostic screen identifies the physical CPU that was running instructions during the VMkernel error. In the example, the * beside the 0 indicates that physical CPU 0 was running an operation at the time of the failure. In newer versions of ESX, the letters CPU precede the number instead of an *. For example, if the same error were to occur in a newer version of VMware ESX, the same line appears as:

    CPU0:1037/helper1-4 cpu1:1107/vmm0:Fagi cpu2:1121/vmware-vm cpu3:1122/mks:Franc

    This section of the purple diagnostic screen also describes the world (process) that was running on the CPU at the time of the error. In the above example, the userworld running was helper1-4.

    Note: The name of the process may be truncated.
  • The Stack Trace:
    0x3a37ef4:[0x625e94]Panic+0x17 stack: 0x833ab4, 0x3a37f10, 0x3a37f48
    0x3a37f04:[0x625e94]Panic+0x17 stack: 0x833ab4, 0x1, 0x14a03a0
    0x3a37f48:[0x64bfa4]TLBDoInvalidate+0x38f stack: 0x3a37f54, 0x40, 0x2
    0x3a37f70:[0x66da4d]XMapForceFlush+0x64 stack: 0x0, 0x4d3a, 0x0
    0x3a37fac:[0x652b8b]helpFunc+0x2d2 stack: 0x1, 0x14a4580, 0x0
    0x3a37ffc:[0x750902]CpuSched_StartWorld+0x109 stack: 0x0, 0x0, 0x0
    0x3a38000:[0x0]blk_dev+0xfd76461f stack: 0x0, 0x0, 0x0

    The stack represents what the VMkernel was doing at the time of the error. In this example, it was trying to invalidate entries in the Translation Lookaside Buffer (TLB), the CPU's cache of page-table translations. Evaluating what the kernel was doing at the time of the error makes this information a vital tool for diagnosing purple screen errors.
  • The Uptime: VMK uptime: 7:05:43:45.014 TSC: 1751259712918392

    This section indicates how long the server has been running since the last boot. In this example, the ESX host had been running for 7 days, 5 hours, 43 minutes and 45.014 seconds. The TSC value is the number of CPU clock cycles that have elapsed since the server was started.
  • The Core Dump: Starting coredump to disk Starting coredump to disk Dumping using slot 1 of 1...using slot 1 of 1... log

    This section of the purple diagnostic screen indicates that the contents of the VMkernel memory are being copied to the vmkcore partition.
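
If the screen text has been saved to a plain-text file, two of the fields described above can be pulled out with standard shell tools. This is only a rough sketch: the function names `psod_backtrace` and `psod_uptime` are made up for illustration, and it assumes the backtrace and uptime lines look like the ones in the example screen.

```shell
#!/bin/sh
# Hypothetical helpers for a PSOD saved as plain text; not VMware tools.

# Print the function name from each backtrace line, top line first.
# Assumes lines shaped like:
#   0x3a37f48:[0x64bfa4]TLBDoInvalidate+0x38f stack: 0x3a37f54, 0x40, 0x2
psod_backtrace() {
    sed -n 's/^ *0x[0-9a-fA-F]*:\[0x[0-9a-fA-F]*]\([A-Za-z_][A-Za-z0-9_]*\)+0x.*/\1/p' "$1"
}

# Turn the "VMK uptime: D:HH:MM:SS.mmm" field into words.
psod_uptime() {
    up=$(sed -n 's/^ *VMK uptime: \([0-9:.]*\).*/\1/p' "$1" | head -n 1)
    [ -n "$up" ] || return 1
    printf '%s\n' "$up" |
        awk -F: '{ printf "%d days, %d hours, %d minutes, %s seconds\n", $1, $2, $3, $4 }'
}
```

Calling `psod_backtrace psod.txt` on the example screen would list Panic, TLBDoInvalidate, XMapForceFlush and so on, which is often enough to search the Knowledge Base with.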

Using the error message of the purple diagnostic screen to troubleshoot a VMkernel error

The VMkernel error message shown on the purple screen can be used to identify the cause of the issue. The number of error messages that can be produced is finite. This is a list of known VMkernel error messages.

  • Type: Console Oops
    Example Error: COS Error: Oops
    Description: An ESX host fails and causes a purple screen when there is a Service Console oops. Unlike most purple screen errors, it is not triggered by the VMkernel. Instead the error is triggered by the Service Console and occurs at the Linux level. These purple screen errors contain additional information from the Linux kernel. For more information about Console Oops, see Understanding an “Oops” purple diagnostic screen (1006802).
  • Type: Lost Heartbeat
    Example Error: Lost Heartbeat
    Description: The ESX VMkernel and the Service Console Linux kernel run at the same time on ESX. The Service Console Linux kernel runs a process called vmnixhbd, which heartbeats the VMkernel as long as it is able to allocate and free a page of memory. If no heartbeats are received before a timeout period of 30 minutes, the VMkernel triggers a COS Panic and a purple diagnostic screen that mentions a Lost Heartbeat. For more information on Lost Heartbeats, see Understanding a “Lost Heartbeat” purple diagnostic screen (1009525).
  • Type: Assert
    Example Error: ASSERT bora/vmkernel/main/pframe_int.h:527 
    Description: Assert errors are software errors, because they are related to assumptions on which the program is based. This type of purple screen error is primarily caused by software issues. For more information on the assert error message, see Understanding ASSERT and NOT_IMPLEMENTED purple diagnostic screens (1019956).
  • Type: Not Implemented
    Example Error: NOT_IMPLEMENTED /build/mts/release/bora-84374/bora/vmkernel/main/util.c:83
    Description: A not implemented error message occurs when the code encounters a situation that it was not designed to handle. For more information, see Understanding ASSERT and NOT_IMPLEMENTED purple diagnostic screens (1019956).
  • Type: Spin count exceeded / Possible deadlock
    Example Error: Spin count exceeded (iplLock) - possible deadlock
    Description: A VMware ESX host may report Spin count exceeded and a possible deadlock in a purple diagnostic screen when a thread is attempting to execute a critical section of code. To enter the critical section, the thread must first acquire a lock, which it does by polling a mutex in a spinlock operation. The thread keeps polling the mutex during the spinlock operation, but there is a limit on how many times it polls; this error is reported when that limit is exceeded. For more information on Spin count exceeded errors, see Understanding a “Spin count exceeded” purple diagnostic screen (1020105).
  • Type: Failed to ack TLB invalidate
    Example Error: PCPU 1 locked up. Failed to ack TLB invalidate.
    Description: Physical CPUs fail when trying to clear memory page tables. For more information, see Understanding a Failed to ack TLB invalidate purple diagnostic screen (1020214).
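
Because the set of messages is small, a first-pass triage can be scripted. The sketch below maps an error line to the types listed above by plain substring matching on the example strings from this list. `classify_psod` is a made-up name, and real messages may be worded differently from these examples.

```shell
#!/bin/sh
# Hypothetical helper: map a PSOD error line to one of the types above.
# The patterns are taken from the example errors in this article only.
classify_psod() {
    case "$1" in
        *"COS Error: Oops"*)              echo "Console Oops" ;;
        *"Lost Heartbeat"*)               echo "Lost Heartbeat" ;;
        *NOT_IMPLEMENTED*)                echo "Not Implemented" ;;
        *ASSERT*)                         echo "Assert" ;;
        *"Spin count exceeded"*)          echo "Spin count exceeded / Possible deadlock" ;;
        *"Failed to ack TLB invalidate"*) echo "Failed to ack TLB invalidate" ;;
        *)                                echo "Unknown" ;;
    esac
}
```

For example, `classify_psod "PCPU 1 locked up. Failed to ack TLB invalidate."` prints the last type in the list, which tells you which Knowledge Base article to start with.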

A purple diagnostic screen can also come in the form of an Exception. An Exception Handler is a computer hardware mechanism designed to handle some condition that changes the normal flow of execution (division by zero, page fault, and so on). Handlers leave no trace, so you need logging (or single-step debugging) to determine whether a handler faulted. This is a list of common exceptions:

  • Type: Exception 13 (General Protection Fault)
    Example Error: #GP Exception(13) in world 4130:helper13-0 @ 0x41803399e303
    Description: A general protection fault (Exception 13) occurs under one of the following circumstances: the page being requested does not belong to the program requesting it (and not mapped in program memory), or the program does not have rights to perform a read or write operation on the page. For more information on Exception 13 or Page Fault, see Understanding Exception 13 and Exception 14 purple diagnostic screen events (1020181).
  • Type: Exception 14 (Page Fault)
    Example Error: #PF Exception type 14 in world 136:helper0-0 @ 0x4a8e6e
    Description: A page fault (Exception 14) occurs when the page being requested has not been successfully loaded into memory. For more information on Exception 14 or Page Fault, see Understanding Exception 13 and Exception 14 purple diagnostic screen events (1020181).
  • Type: Exception 18 (Machine Check Exception)
    Example Error: Machine Check Exception: Unable to continue
    Example Error: Hardware (Machine) Error
    Description: A Machine Check Exception (MCE) is generated by the hardware and reported by the host. Consult your hardware vendor in the event of an MCE. By evaluating the information presented, it is possible to identify the individual component reporting the error. For more information on MCE, see Decoding Machine Check Exception (MCE) output after a purple screen error (1005184).
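
The two example lines for Exception 13 and 14 carry the exception number, the world and the faulting address. As a rough sketch, they can be split apart with sed. `parse_exception` is a hypothetical name, and the pattern only covers the two layouts shown in the examples above; other builds may print the line differently.

```shell
#!/bin/sh
# Hypothetical helper: print "<number> <world id> <world name> <address>"
# for lines shaped like either of these examples:
#   #GP Exception(13) in world 4130:helper13-0 @ 0x41803399e303
#   #PF Exception type 14 in world 136:helper0-0 @ 0x4a8e6e
# Prints nothing when the line does not match.
parse_exception() {
    printf '%s\n' "$1" |
        sed -n 's/^.*Exception[ (]*\(type \)*\([0-9][0-9]*\))* in world \([0-9][0-9]*\):\([^ ]*\) @ \(0x[0-9a-fA-F]*\).*$/\2 \3 \4 \5/p'
}
```

The world name shows which process was on the CPU, and the address can be compared with the backtrace lines on the same screen.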

Complete list of exceptions:

  • Exception Type 0 #DE: Divide Error
  • Exception Type 1 #DB: Debug Exception
  • Exception Type 2 NMI: Non-Maskable Interrupt
  • Exception Type 3 #BP: Breakpoint Exception
  • Exception Type 4 #OF: Overflow (INTO instruction)
  • Exception Type 5 #BR: Bounds check (BOUND instruction)
  • Exception Type 6 #UD: Invalid Opcode
  • Exception Type 7 #NM: Coprocessor not available
  • Exception Type 8 #DF: Double Fault
  • Exception Type 10 #TS: Invalid TSS
  • Exception Type 11 #NP: Segment Not Present
  • Exception Type 12 #SS: Stack Segment Fault
  • Exception Type 13 #GP: General Protection Fault
  • Exception Type 14 #PF: Page Fault
  • Exception Type 16 #MF: Coprocessor error
  • Exception Type 17 #AC: Alignment Check
  • Exception Type 18 #MC: Machine Check Exception
  • Exception Type 19 #XF: SIMD Floating-Point Exception
  • Exception Type 20-31: Reserved
  • Exception Type 32-255: User-defined (clock scheduler)
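
For quick lookups while reading a log, the table above can be turned into a small shell function. This is only a convenience sketch: `exception_name` is a made-up name, and numbers that the list does not mention (such as 9 and 15) are reported as not listed rather than guessed.

```shell
#!/bin/sh
# Hypothetical lookup built from the exception list in this article.
exception_name() {
    case "$1" in
        0)  echo "#DE: Divide Error" ;;
        1)  echo "#DB: Debug Exception" ;;
        2)  echo "NMI: Non-Maskable Interrupt" ;;
        3)  echo "#BP: Breakpoint Exception" ;;
        4)  echo "#OF: Overflow (INTO instruction)" ;;
        5)  echo "#BR: Bounds check (BOUND instruction)" ;;
        6)  echo "#UD: Invalid Opcode" ;;
        7)  echo "#NM: Coprocessor not available" ;;
        8)  echo "#DF: Double Fault" ;;
        10) echo "#TS: Invalid TSS" ;;
        11) echo "#NP: Segment Not Present" ;;
        12) echo "#SS: Stack Segment Fault" ;;
        13) echo "#GP: General Protection Fault" ;;
        14) echo "#PF: Page Fault" ;;
        16) echo "#MF: Coprocessor error" ;;
        17) echo "#AC: Alignment Check" ;;
        18) echo "#MC: Machine Check Exception" ;;
        19) echo "#XF: SIMD Floating-Point Exception" ;;
        2[0-9]|3[01]) echo "Reserved" ;;
        *)
            # 32-255 are user-defined; anything else is not in the list.
            if [ "$1" -ge 32 ] 2>/dev/null && [ "$1" -le 255 ]; then
                echo "User-defined"
            else
                echo "Not listed"
                return 1
            fi
            ;;
    esac
}
```

`exception_name 14` prints the Page Fault entry, and a non-zero return status signals a number that the list does not cover.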

If your VMware ESX or ESXi host experiences an error similar to one of these that does not point you to a general article, search for the error message and stack trace information within the Knowledge Base. If the error has not been documented within the Knowledge Base, collect the diagnostic information from the VMware ESX host and submit a support request.

Prevent:

  • Make sure patches for vCenter and ESXi are applied
  • Keep drivers and firmware up to date
  • Check that the hardware is on the Hardware Compatibility List
  • Use Runecast Analyzer to scan systems for known bugs, driver issues or configurations that have led to PSODs

Be prepared:

  1. Leverage vSphere HA / FT
  2. Configure dump locations for troubleshooting
  3. Have remote console to ESXi (iLO, iDRAC, IMM)
  4. Configure ESXi to restart after PSOD
  5. Know your enemy! (research what scenarios led to PSOD – logs, syslogs using Runecast Log Analysis)
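
Items 2 and 4 above can be checked from the ESXi shell. The following is a hedged sketch rather than a complete procedure: the wrapper function names are made up, the esxcli namespaces shown are the ones I believe recent ESXi releases provide (older ESX versions may lack some of them), and the assumption that a `/Misc/BlueScreenTimeout` value of 0 means "stay on the purple screen" should be confirmed against the documentation for your build.

```shell
#!/bin/sh
# Run on the ESXi host itself. Function names are illustrative only.

# Read-only: show where a core dump would be written (partition, file, network).
show_coredump_targets() {
    esxcli system coredump partition get
    esxcli system coredump file list
    esxcli system coredump network get
}

# Read-only: show the current automatic-restart delay after a PSOD.
show_psod_restart_timeout() {
    esxcli system settings advanced list -o /Misc/BlueScreenTimeout
}

# Change the delay, in seconds, before the host restarts after a PSOD.
# Assumption: 0 leaves the host sitting on the purple screen.
set_psod_restart_timeout() {
    esxcli system settings advanced set -o /Misc/BlueScreenTimeout -i "$1"
}
```

A short delay gets the host back sooner but leaves less time to photograph the screen through the remote console, so it is worth confirming that a dump target is active before enabling the restart.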

 

Being proactive like this will greatly help you avoid future critical PSOD-related service outages. Runecast Analyzer was designed to minimize and even completely eliminate PSOD crashes of ESXi hosts. Many of the root causes behind PSODs are not easy to detect manually because a PSOD typically results from a combination of several factors, not just a software problem. Automating this process is the most viable way to ensure your datacentres are as reliable as possible.

Troubleshooting:

Step 1: Sometimes the PSOD screenshot is missed. That is fine: if a core dump is configured for the ESXi host, the crash logs can be extracted from the dump files.

Once the host is back up after the unplanned reboot that follows the PSOD, log in to the host over SSH (for example with PuTTY) and go to the core directory, which is where the PSOD dump is written.

The most important item is the “core” directory, which contains the kernel dump. The PSOD writes what was in memory to a file called vmkernel-zdump.1, or something to that effect, and places it in that directory.

So go to

# cd /var/core

Then list the files with ls -ltr. You will see the file below.

vmkernel-zdump.1

Step 2: How do we extract it?

There is a handy extraction tool that does the whole job: “vmkdump_extract”. Run it against the zdump.1 file, like this:

# vmkdump_extract vmkernel-zdump.1

It creates several files.

Note: All we require for analysis is the vmkernel-log.1 file.
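
Steps 1 and 2 can be combined into one small script. This is a sketch under assumptions: the function names are made up, dumps are assumed to sit in /var/core under names starting with vmkernel-zdump. as described above, and vmkdump_extract is invoked exactly as shown in Step 2 with no extra options.

```shell
#!/bin/sh
# Hypothetical helpers; run on the ESXi host after it has rebooted.

# Print the most recently modified vmkernel-zdump file in a directory.
# Defaults to /var/core; prints nothing when no dump exists.
newest_zdump() {
    dir=${1:-/var/core}
    ls -t "$dir"/vmkernel-zdump.* 2>/dev/null | head -n 1
}

# Extract the newest dump in place, as in Step 2.
extract_newest_zdump() {
    dump=$(newest_zdump "$1")
    if [ -z "$dump" ]; then
        echo "no vmkernel-zdump file found" >&2
        return 1
    fi
    # Subshell so the caller's working directory is left unchanged.
    ( cd "$(dirname "$dump")" && vmkdump_extract "$(basename "$dump")" )
}
```

Picking the newest file by modification time matters on hosts that have crashed more than once, where several numbered dumps may sit side by side.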

Step 3: Open the vmkernel-log.1 file using one of the methods below:

  1. WinSCP (GUI)
  2. less vmkernel-log.1   (Command line)

I am a Windows and VMware support engineer, so I definitely prefer the GUI method for analyzing the log file.

Let’s use WinSCP:

Step 4. Connect to your ESXi host using WinSCP, browse to /var/core, and copy vmkernel-log.1 to your local machine.

Step 5. Now that vmkernel-log.1 is on your local machine, open it in an editor such as Notepad++ (right-click the file and choose to edit it with Notepad++) and search for the keyword “BlueScreen”. The search takes you to the crash events described below.

The first line, @BlueScreen:, states the crash exception, such as Exception 13 or 14. In my case it reads “LINT1/NMI (motherboard nonmaskable interrupt), undiagnosed. This may be a hardware problem; please contact your hardware vendor”, which points to a hardware issue.

The VMK uptime line gives the kernel uptime before the crash.

The logging after that is the information to look at: it shows why the crash occurred.

Note: The crash dump varies for every crash. The causes can range from hardware errors to driver issues, issues with the ESXi build, and a lot more.

With the second (command-line) method, skip to the end of the file by pressing Shift+G, then work back up with Page Up until you reach a line that says @BlueScreen: <event>. From there you know exactly what needs checking.
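
The same backwards search can be done without paging by hand. The sketch below finds the last @BlueScreen line in the extracted log and prints it with the lines that follow. `bluescreen_context` is a made-up name, and the default of 20 following lines is an arbitrary choice.

```shell
#!/bin/sh
# Hypothetical helper: print the last "@BlueScreen" line in a log file
# together with the next N lines (default 20).
bluescreen_context() {
    file=$1
    count=${2:-20}
    # Line number of the last match, mirroring a search from the end of the file.
    start=$(grep -n '@BlueScreen' "$file" | tail -n 1 | cut -d: -f1)
    if [ -z "$start" ]; then
        echo "no @BlueScreen line in $file" >&2
        return 1
    fi
    sed -n "${start},$((start + count))p" "$file"
}
```

Running `bluescreen_context vmkernel-log.1 40` gives the exception line plus the backtrace and uptime that normally follow it, ready to paste into a support request.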

Each dump analysis is different, but the fundamentals are the same.
