On System Reliability and Why the (Conceptual) Design of the Blue Screen on Windows Is Actually a Good Thing

As many people, unfortunately, experienced recently (hello, CrowdStrike), getting repetitive Blue Screen of Death (BSOD) crashes on Windows machines is an utterly unpleasant experience, especially at that scale. This distress led to public expressions of frustration, ranging from “Windows would've been better without blue screens” to “a good OS would automatically recover from such repetitive reboots.” In this blog post, we want to discuss the conceptual design principles behind system reliability in the face of faulty software and explain why we think the OS recovery mechanisms behaved reasonably, given the circumstances.

Sharing in Isolation

Before we venture into the driver territory, we need to introduce the core concepts that define Windows architecture. Most applications that we deal with on an everyday basis operate in so-called user mode. User mode is an unprivileged environment designed to limit what programs can do, protecting the essential OS components from tampering and isolating programs from each other. Running application instances exist in logical containers called processes that offer an environment where a program thinks it has exclusive access to memory (see virtual address space) and processing power (see preemptive multitasking). If it all sounds like an early virtualization/containerization technology, you are correct; conceptually, it is.

Figure: Process isolation vs. kernel sharing.


Living in their separate worlds, user-mode programs can perform arbitrary computations, but the moment they need to draw on the screen, read a file, or talk over the network, they need to ask the operating system for help. This design allows Windows to enforce restrictions that, crucially for our discussion, help it preserve integrity. For compatibility/usability reasons, Windows authorizes programs running with Administrator permissions to perform countless potentially dangerous operations, making it possible for them to cause system instability. However, if you discover that an unprivileged program can cause a BSOD (without using the shutdown privilege), feel free to report it to Microsoft and collect their bug bounty for a Denial-of-Service vulnerability.

Process isolation also implies that one program instance cannot directly interfere with another. A buggy implementation that overwrites random memory will likely wreak havoc within the affected process, but it shouldn’t matter for anybody else. The operating system, of course, offers a variety of primitives that break process isolation and allow sharing resources (which any of the sharing parties can, therefore, corrupt), but these are all opt-in features. In other words, the consequences of sloppy programming in user mode are limited and depend on the privilege level and the ability to access shared resources.

The discussion so far promotes the idea that a bug in one component should not affect and crash others. However, such isolation is not always possible or desirable. One example where sharing is inherent is when loading library code. The default Windows implementation – Dynamic Link Libraries – allows making reusable modules that programs can load and ask for various services. DLLs run inside existing processes and, hence, share all runtime resources with the hosting application. This complete sharing implies higher expectations of code quality, as buggy library code can accidentally corrupt the data belonging to the program or other loaded libraries, causing them to misbehave.

How does it all relate to the topic of our discussion? In short, antivirus driver developers must consider the caveats of running in a shared environment and the impact of having high privileges, as both are inherent to the kernel.

In the Kernel We Trust

As another security researcher accurately summarized: “The kernel is like the backstage crew in a theater production; they handle all the important, behind-the-scenes work that makes sure the show runs smoothly.” This superior position offers kernel-mode code practically unlimited power and guarantees isolation from regular applications. But most importantly for our discussion, the kernel is a shared environment: there can be many user-mode processes yet there is a single OS kernel. So, what exactly runs there? In addition to hosting crucial Windows components, kernel mode supports special programs called drivers. Drivers are, effectively, plugins for the core of the operating system. They can integrate custom hardware, enforce additional security policies, offer better software troubleshooting capabilities, etc.

Drivers are the kernel-mode analog of DLLs. They even share the file format, with the primary technical difference being the dependencies each can bring. From the practical perspective (dictated by the unique position and surroundings), their development demands even higher quality and reliability standards. Finally, from the organizational perspective, while anyone can develop drivers, Microsoft limits who can distribute them. Loading code into the kernel requires the executables to be signed with a dedicated type of certificate.
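Since drivers and DLLs share the Portable Executable (PE) file format, the same basic structural checks apply to both. Below is a minimal sketch (not actual loader code; the function name is ours) of how one might verify that a blob of bytes looks like a PE image. The offsets follow the PE specification: the "MZ" magic at the start, and a field at offset 0x3C pointing to the "PE\0\0" signature.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch: does this byte blob pass the first checks a PE loader performs? */
int looks_like_pe(const uint8_t *image, size_t size)
{
    if (size < 0x40)
        return 0;
    if (image[0] != 'M' || image[1] != 'Z')   /* IMAGE_DOS_SIGNATURE ("MZ") */
        return 0;

    /* e_lfanew at offset 0x3C: little-endian offset of the NT headers. */
    uint32_t e_lfanew = image[0x3C] | (image[0x3D] << 8)
                      | (image[0x3E] << 16) | ((uint32_t)image[0x3F] << 24);
    if (e_lfanew + 4 > size)
        return 0;

    const uint8_t *sig = image + e_lfanew;    /* IMAGE_NT_SIGNATURE ("PE\0\0") */
    return sig[0] == 'P' && sig[1] == 'E' && sig[2] == 0 && sig[3] == 0;
}
```

Both a DLL and a driver binary would pass these checks; what differs is which subsystem and import dependencies the headers declare.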

What is the reason for this extra trouble? By design, the kernel is supposed to be completely trusted. Microsoft naturally wants to limit it to only essential OS components plus necessary hardware, antivirus, and other drivers. Ideally, unless something requires direct and unimpeded access to the operating system state, memory, hardware, or similar resources, it does not belong in the kernel.

In the past, Microsoft made the mistake of moving some non-essential components into kernel mode in violation of this principle (overruling it based on performance benefits or ease of implementation) and regretted it. One of the most famous examples (which happened before cybersecurity became a noticeable concern in the industry) is the win32k graphical subsystem. Microsoft has since been working on this problem on several fronts, moving features like font handling into user mode and introducing additional security mitigations that reduce the available attack surface. After all, complex legacy code that runs with the highest privileges is both a reliability and a security hazard. And so is redundant kernel code.

“Exceptional” Exceptions

Operating in a shared environment with direct access to sensitive information makes the cost of mistakes much higher. This challenge encourages developers to rethink how software should treat and handle erroneous conditions. The popular and, perhaps, the most effortless approach to error handling in programming is to rely on exceptions. An exception is an event triggered manually or by trying to perform an invalid operation. This event interrupts code execution, redirecting it to a different location (the exception handler), and can abort an operation mid-function. Exception-centric error handling allows the developer to focus on the desired execution flow, delegating the task of recovering from problems to some other upstream code.

An alternative solution is to reserve exceptions for genuinely exceptional situations and report all non-fatal errors via other means. Windows's lowest-level (Native) API, which exposes the core OS services to user mode and kernel drivers, closely follows this principle and defines every function to return a status code. The caller must check this status for success/failure and, if necessary, can extract additional details about the kind of failure. Native API routines can still raise exceptions, but solely under circumstances that should not appear during careful programming. Some examples include division by zero, access to invalid memory, or attempts to close invalid handles.
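To make the convention concrete, here is a small user-mode sketch of the status-code style. The `NTSTATUS` type and the `NT_SUCCESS` macro mirror their real definitions from the Windows headers (failure codes have the high severity bits set, so they test as negative); `copy_device_name` itself is a hypothetical routine, not an actual Native API function.

```c
#include <stdint.h>
#include <string.h>

/* Mirrors of the real Windows definitions (ntdef.h / ntstatus.h). */
typedef int32_t NTSTATUS;
#define NT_SUCCESS(s)            (((NTSTATUS)(s)) >= 0)
#define STATUS_SUCCESS           ((NTSTATUS)0x00000000)
#define STATUS_BUFFER_TOO_SMALL  ((NTSTATUS)0xC0000023)

/* Hypothetical routine in the Native API style: a non-fatal problem is
   reported through the return status, never by raising an exception. */
NTSTATUS copy_device_name(char *buffer, size_t size)
{
    static const char name[] = "\\Device\\Example";
    if (size < sizeof(name))
        return STATUS_BUFFER_TOO_SMALL;  /* the caller decides how to recover */
    memcpy(buffer, name, sizeof(name));
    return STATUS_SUCCESS;
}
```

A caller checks `NT_SUCCESS(status)` after every call; forgetting to do so is a logic bug, but it cannot silently unwind the stack the way an uncaught exception would.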

Driver developers are welcome to use exceptions in their internal code (given some constraints like allowed IRQLs); yet, they cannot rely on anybody else to handle exceptions during cross-component interaction.

To BSOD or Not to BSOD

What happens during an exception when the upstream code is not exception-aware? This scenario typically results in a crash. While an application crash affects a single process (that the user can restart), a driver crash causes a Blue Screen of Death. BSODs are not limited to unhandled exceptions, though. Various OS components perform extra integrity and sanity checks aiming to detect memory corruption, deadlocks, code patching, and many other issues. The result is always the same: an immediate reboot. And that is an intentional design decision.

Figure: an example of a BSOD

People often misunderstand the purpose of the Blue Screen of Death. BSOD is not meant to be a developer's punishment for blunders (although it certainly can serve as one); it is an OS defense mechanism and a safety feature. That is why, internally, BSOD is also known as a bug check.

A problem with exceptions (especially in the paradigm described above) is that they can leave the data the code just operated on in an inconsistent state. An unhandled exception or an assertion failure means that something unexpected happened, and we have no idea how bad the situation is. Again, kernel mode is a trusted, highly privileged environment with direct access to memory and hardware. A BSOD shuts down the system because that is the best choice available in this case: masking the issue and operating on a potentially corrupted state can cause more damage than an immediate shutdown. Ah, and do not forget that the kernel is also shared, so the code of a faulty driver can accidentally overwrite the state of another driver or some data belonging to the core OS facilities.
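The bug-check idea can be sketched in a few lines. Everything here is hypothetical user-mode code: a shared structure carries a magic value, consumers validate it before use, and on a mismatch a stand-in `bugcheck` routine stops everything immediately, much like `KeBugCheckEx` does in the real kernel, because continuing on corrupted state is the greater risk.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define LIST_MAGIC 0x4C495354u  /* 'LIST': a marker that corruption tends to destroy */

struct shared_list {
    uint32_t magic;
    size_t count;
};

/* Stand-in for KeBugCheckEx: report and halt, never limp onward. */
static void bugcheck(const char *reason)
{
    fprintf(stderr, "BUGCHECK: %s\n", reason);
    abort();  /* immediate stop, the user-mode analog of the BSOD reboot */
}

/* Returns the element count, but only after proving the structure
   still looks intact. */
size_t checked_count(const struct shared_list *list)
{
    if (list == NULL || list->magic != LIST_MAGIC)
        bugcheck("shared_list corrupted or invalid");
    return list->count;
}
```

The check cannot repair anything; its only job is to detect trouble as early as possible and refuse to proceed.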

So yes, saying that Windows should not shut down if it encounters a BSOD is like saying that an automobile assembly line should not stop when a robotic manipulator faces an unexpected obstacle that blocks its movement.

Interestingly, Windows 3.1 offered an option to ignore general protection faults (effectively, memory access exceptions) in user-mode applications, which sort of worked. Doing so for the kernel in the modern world sounds completely crazy, though.

The Liar Paradox

Okay, even if rebooting is the best option, does it have to repeat? Why can't Windows unload or disable the faulting driver and continue without it? This approach might indeed be beneficial in some circumstances, but it might also be detrimental in others. In an ideal world (at least according to the design principles described above), a BSOD should occur at the moment a kernel component starts to misbehave. In practice, however, it only happens when the kernel discovers the misbehavior. Time might have passed since then, masking the underlying cause. Besides, asking an already misbehaving system to diagnose itself and find the faulting component is a somewhat paradoxical request.

Probably the best available automated solution for determining the cause of a crash is the WinDbg !analyze extension command. Aside from depending on debug symbols, it occasionally produces false results and blames the wrong component. It can happen for a multitude of reasons. For instance, say we see that driver A tries to access invalid memory. Does it mean that this driver is faulty? Not necessarily, because it might be that driver B happened to release or overwrite A’s memory by mistake, causing A to fail later. The Sysinternals NotMyFault tool can showcase this blame confusion in action.
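Here is a toy model of that blame confusion (all names are hypothetical, and the shared kernel address space is modeled as a single byte arena). Driver B owns the first eight bytes; driver A keeps a flag right after them. B's off-by-N bug tramples A's state, yet A is the component whose sanity check later fails:

```c
#include <stddef.h>
#include <string.h>

enum { B_BUFFER_SIZE = 8, A_FLAG_OFFSET = 8, ARENA_SIZE = 12 };

/* Toy stand-in for shared kernel memory: B's buffer, then A's state. */
struct kernel_pool {
    unsigned char arena[ARENA_SIZE];
};

void driver_b_store(struct kernel_pool *pool, const char *msg, size_t len)
{
    /* BUG in B: no check against B_BUFFER_SIZE, so a long message
       overruns the buffer and tramples driver A's adjacent flag. */
    memcpy(pool->arena, msg, len);
}

int driver_a_check(const struct kernel_pool *pool)
{
    /* A only ever reads its own state, yet A is the one that appears broken. */
    return pool->arena[A_FLAG_OFFSET] == 1;
}
```

A crash-dump analysis that starts from A's failing check would naturally suspect A; only tracing who last wrote that memory reveals B as the culprit.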

Figure: The NotMyFault demo

The Last Known Good Recovery

Nevertheless, Windows does have a recovery mechanism that can automatically disable incompatible or problematic drivers and roll back breaking configuration changes. The key to this feature is a registry location called HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet. Its content includes the entire Service Control Manager database (which dictates the list of drivers to load during startup with their parameters), various low-level system settings, and some aspects of device configuration. The CurrentControlSet key itself, however, is merely a symbolic link to one of its siblings, such as ControlSet001, ControlSet002, etc.

Figure: The Last Known Good registry keys

Windows supports creating control set checkpoints before applying potentially breaking changes, such as installing or reconfiguring drivers or modifying system startup settings. It also records which configurations result in successful and failed boots in the HKEY_LOCAL_MACHINE\SYSTEM\Select key. The most recent successful configuration is called Last Known Good. While the system can automatically classify boots, programs can also control this behavior via NtEnableLastKnownGood and NtDisableLastKnownGood.
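As a rough model of how these values fit together: the Select key stores control set numbers (so 1 means ControlSet001, and so on), and boot-time logic can prefer the Last Known Good set when the default one is known to have failed. The field names below follow a subset of the real Select values, but the selection function is our simplified sketch, not the actual boot code.

```c
#include <stdint.h>

/* Simplified model of HKLM\SYSTEM\Select (a subset of its real values). */
struct select_key {
    uint32_t Default;        /* control set to boot normally */
    uint32_t Failed;         /* control set whose boot last failed (0 = none) */
    uint32_t LastKnownGood;  /* most recent set that booted successfully (0 = none) */
};

/* Sketch: pick Default unless it is known-bad and an LKG checkpoint exists. */
uint32_t pick_control_set(const struct select_key *sel)
{
    if (sel->Default == sel->Failed && sel->LastKnownGood != 0)
        return sel->LastKnownGood;   /* roll back to the checkpoint */
    return sel->Default;
}
```

The important property is that the fallback only works when a checkpoint was recorded in the first place, which is exactly the weakness discussed next.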

This feature can be extremely helpful, but it has a crucial weakness that we are sure you already spotted: it cannot roll back to the last known good checkpoint if it does not exist. It can happen if the change that caused a kernel component to start misbehaving resides outside of CurrentControlSet (i.e., unrelated to installing or updating drivers, adjusting startup settings, or reconfiguring devices). The backup option for this scenario is to let the user fix the issue manually (usually via the safe mode) because, at this point, Windows doesn’t know how.

The Closing Words

So why did Windows behave the way it did after facing the faulty CrowdStrike driver? Hopefully, we now have enough background knowledge to judge whether that behavior was reasonable. According to the official CrowdStrike report, the update that triggered the issue changed one of the driver’s configuration files. This change caused an out-of-bounds memory read in CrowdStrike’s driver (CSAgent.sys), resulting in an unhandled exception. The system treated this exception as an unrecoverable error (as it should), causing a BSOD and a shutdown. Because CSAgent.sys is an ELAM (Early Launch AntiMalware) driver, Windows tried to load it after every reboot, failing repeatedly. And, ultimately, the Last Known Good mechanisms were unsuccessful in reverting this behavior because even if a given machine happened to have a control set checkpoint, it couldn’t undo the update of an arbitrary configuration file that Windows is not aware of. So, can you blame the operating system for not trying hard enough?

Having hindsight knowledge of this scenario and observing its impact might help Microsoft invent an improved recovery mechanism for similar issues. Yet, it’s not going to be a silver bullet. Kernel driver developers need to be incredibly careful in their programming due to the countless ways they can break the system, and it doesn’t seem to be changing anytime soon.
