CVE-2021-46961 - Nested NMIs and the GICv3 Spurious Interrupt Bug in Linux Kernel

---

In 2021, a subtle but dangerous bug was fixed in the Linux kernel; it is now tracked as CVE-2021-46961. It affects ARM systems that use the GICv3 interrupt controller. In simple terms, the bug could cause a kernel panic, crashing the system under specific conditions involving spurious (unexpected, meaningless) interrupts.

This post breaks down what happened, how it was triggered, what the patch does, and why this bug was so risky. Let’s get hands-on with code snippets and walk you through the internals of Linux’s interrupt handling.

What Is the Vulnerability?

The vulnerability occurs when the Linux kernel running on ARM with the GICv3 (Generic Interrupt Controller version 3) wrongly enables interrupts while handling what’s called a “spurious interrupt.” A spurious interrupt isn’t a real, meaningful interrupt—it’s kind of like a ghost signal. But in this bug, the kernel was handling these as if they were real, leading to a cascade of problems.

When a spurious interrupt happens, the GICv3 handler should ideally just ignore it and return to what it was doing. Instead, the kernel enabled interrupts again, allowing yet another (NMI-priority) interrupt to be raised—even inside Non-Maskable Interrupt (NMI) context. This led to nested NMIs, which the kernel does not support.
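
To make "spurious" concrete: in the GICv3 architecture, acknowledging an interrupt yields an interrupt ID (INTID), and the IDs 1020 to 1023 are reserved as special values, with 1023 meaning that nothing was actually pending. The sketch below is a userspace illustration of that range check, not the kernel's code; the names GIC_SPECIAL_INTID_FIRST, GIC_SPECIAL_INTID_LAST and intid_is_special() are made up for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for this sketch; GICv3 reserves INTIDs 1020-1023. */
#define GIC_SPECIAL_INTID_FIRST 1020u
#define GIC_SPECIAL_INTID_LAST  1023u

/*
 * Returns true when an acknowledged INTID is one of the special values
 * (1023 is the "nothing pending" / spurious ID) rather than a real
 * interrupt source.
 */
static bool intid_is_special(uint32_t intid)
{
    return intid >= GIC_SPECIAL_INTID_FIRST &&
           intid <= GIC_SPECIAL_INTID_LAST;
}
```

A handler would call intid_is_special() on the acknowledged ID and return immediately when it is true, since there is no real device to service.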

Here’s an example of the kind of error message you’d see:

[   14.816231] ------------[ cut here ]------------
[   14.816231] kernel BUG at irq.c:99!
[   14.816232] Internal error: Oops - BUG: 0 [#1] SMP
[...]
[   14.816233] Hardware name: evb (DT)
[   14.816234] pstate: 80400085 (Nzcv daIf +PAN -UAO)
[...]
[   14.816251] Call trace:
[   14.816251]  asm_nmi_enter+0x94/0x98
[   14.816251]  el1_irq+0x8c/0x180                    (IRQ C)
[   14.816252]  gic_handle_irq+0xbc/0x2e4
[   14.816252]  el1_irq+0xcc/0x180                    (IRQ B)
[   14.816253]  arch_timer_handler_virt+0x38/0x58
[...]
[   14.816255]  el1_irq+0xcc/0x180                    (IRQ A)
[...]
[   15.103093] Kernel panic - not syncing: Fatal exception in interrupt

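Read from the bottom up, the trace shows three interrupts stacked on top of each other. IRQ C arrived at NMI priority while the handler for IRQ B was still running, that is, while the kernel was already in NMI context.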

This leads to the kernel tripping over itself in nmi_enter(), hitting the infamous BUG_ON(in_nmi()).
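
The check that fires here can be pictured with a small userspace model. This is only a sketch of the idea, assuming a single "in NMI" flag per CPU; model_in_nmi, model_nmi_enter() and model_nmi_exit() are hypothetical names, and where the kernel would hit BUG_ON() the model returns false instead of crashing.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-CPU "we are in NMI context" state. */
static bool model_in_nmi = false;

/*
 * Models the entry check: entering NMI context while already inside it
 * is the situation the kernel treats as a fatal bug. The model returns
 * false in that case instead of panicking, so it stays testable.
 */
static bool model_nmi_enter(void)
{
    if (model_in_nmi)
        return false;   /* nested NMI: the kernel would BUG here */
    model_in_nmi = true;
    return true;
}

/* Leaves NMI context again. */
static void model_nmi_exit(void)
{
    model_in_nmi = false;
}
```

Calling model_nmi_enter() twice without a model_nmi_exit() in between corresponds to IRQ C arriving on top of IRQ B in the trace: the second call is the one that fails.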

The Root Cause

In Linux’s GICv3 interrupt handler, spurious interrupts were checked after enabling interrupts again:

// Pseudo-code based on the relevant gic-v3 handler

void gic_handle_irq(struct pt_regs *regs)
{
    ...
    // Enable IRQs (unsafe before checking)
    local_irq_enable();

    if (is_spurious_interrupt()) {
        // Oops: we're already in an unsafe state!
        return;
    }
    ...
}

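If is_spurious_interrupt() returns true, the function should just return, but by then IRQs have already been re-enabled. (In the actual driver the unmasking belongs to the pseudo-NMI priority-masking path; local_irq_enable() stands in for it here.) That window is enough for another NMI-priority interrupt to arrive, which produces the nested NMI seen in the trace.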

The Fix — Early Spurious Handling

The solution is straightforward and elegant: Handle spurious interrupts *before* enabling interrupts. That way, if it’s a spurious interrupt, the function bails out silently, and IRQs never get toggled.

Here’s what the fix looks like (simplified):

void gic_handle_irq(struct pt_regs *regs)
{
    ...
    // Check for spurious interrupt before touching IRQs!
    if (is_spurious_interrupt()) {
        return;
    }

    // Only enable IRQs after confirming it's a genuine interrupt
    local_irq_enable();
    ...
}

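This small reordering closes the window: a spurious interrupt now returns with interrupts still masked, so it can no longer open the door to a nested NMI, no matter how many spurious interrupts arrive.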
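
The difference between the two orderings can be checked with a toy model. The following is a userspace sketch, not the driver: struct cpu_model, handle_irq_buggy() and handle_irq_fixed() are hypothetical names, the counter unmask_count simply records how often a handler opened the interrupt window, and 1023 is used as the spurious INTID.

```c
#include <stdint.h>

#define SPURIOUS_INTID 1023u    /* GICv3 "nothing pending" ID */

/* Hypothetical per-CPU bookkeeping for this sketch. */
struct cpu_model {
    int unmask_count;   /* times the handler unmasked interrupts */
    int dispatched;     /* real interrupts passed on to a handler */
};

/* Pre-fix ordering: unmask first, then look at the INTID. */
static void handle_irq_buggy(struct cpu_model *cpu, uint32_t intid)
{
    cpu->unmask_count++;            /* window opens even if spurious */
    if (intid == SPURIOUS_INTID)
        return;
    cpu->dispatched++;
}

/* Post-fix ordering: bail out on a spurious INTID before unmasking. */
static void handle_irq_fixed(struct cpu_model *cpu, uint32_t intid)
{
    if (intid == SPURIOUS_INTID)
        return;                     /* interrupts were never unmasked */
    cpu->unmask_count++;
    cpu->dispatched++;
}
```

Feeding a spurious INTID to both functions leaves unmask_count at 1 for the buggy ordering and at 0 for the fixed one, while a real interrupt is dispatched identically by both.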

How Could Attackers Exploit This?

This is not a *classic* remote exploit. Still, a local attacker, or simply a buggy device driver or unusual hardware, that can provoke spurious or edge-case interrupts could push an affected kernel (for example, one with pseudo-NMI support backported) into a nested NMI and crash it on demand. The impact is denial of service (DoS): the machine becomes unstable or can be taken down at will.

The resulting call stack looks like this:

el1_irq (IRQ A)
 → gic_handle_irq
   → el1_irq (IRQ B, spurious & interrupts reenabled early)
     → gic_handle_irq
       → el1_irq (IRQ C, NMI, now nested)
         → ... crash ...

References

Original patch and commit: irqchip/gic-v3: Do not enable irqs when handling spurious interrupts

CVE entry: NVD - CVE-2021-46961

Related kernel source: arch/arm64/kernel/irq.c

Conclusion

CVE-2021-46961 might sound arcane, but it’s a crystal-clear example of why interrupt order and state must be handled very carefully in kernel programming. Even a tiny misplacement can turn theoretical bugs into real-world system crashes.

If you’re running your own Linux kernel on ARM, especially with pseudo-NMI features or GICv3 tweaks, it’s extra important to patch up. As always: stay safe, patch early, and keep an eye on the upstream!

If you want to dig deeper, check the original kernel patch and your distribution's advisories for full details.

Timeline

Published on: 02/27/2024 19:04:06 UTC
Last modified on: 12/11/2024 14:49:59 UTC