A new Linux kernel vulnerability, CVE-2024-26962, was recently resolved. This bug could leave dm-raid456 (the device-mapper RAID for RAID4, RAID5, and RAID6) stuck, causing your disk operations to *hang forever* during a reshape (when the layout or size of a RAID array changes) if active IO (input/output) overlapped with the reshaping process. In this article, we’ll walk through what happened, how it was fixed, what the danger was, and how it applies to anyone using Linux software RAID. We'll keep explanations simple, include all key technical details, and link to official kernel sources and patches.
- Component affected: Linux kernel’s dm-raid456 (device-mapper managed RAID arrays)
- Problem: IO operations that crossed the "reshape position" (the place in a RAID array where reshaping is happening) could wait forever under some conditions, causing system deadlocks.
- Consequence: Any disk activity (like reading or writing files) could hang—making the system seem frozen or unresponsive.
What Triggers the Deadlock?
When you reshape a RAID4/5/6 array, the kernel tracks the progress (called the "reshape position"). New IO requests that want to read or write data right where the reshape is happening must pause and wait. This ensures your data doesn’t get corrupted.
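To make "right where the reshape is happening" concrete, here is a minimal userspace sketch. It assumes the in-flight region can be modelled as a half-open sector window `[lo, hi)`. The struct and function names are hypothetical, and this is not the kernel's actual check, which also has to account for reshape direction and stripe geometry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: sectors in [lo, hi) are currently being reshaped. */
struct reshape_window {
    uint64_t lo;
    uint64_t hi;
};

/*
 * True when the IO range [start, start + len) overlaps the window.
 * Assumes start + len does not overflow uint64_t.
 */
static bool io_overlaps_reshape(const struct reshape_window *w,
                                uint64_t start, uint64_t len)
{
    assert(w->lo <= w->hi); /* window must be well-formed */
    uint64_t end = start + len;
    return start < w->hi && end > w->lo;
}
```

An IO that only touches sectors before or after the window proceeds immediately; only an overlapping IO has to pause.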
But: After kernel commit c467e97f079f, it was possible for an IO to *wait forever* at the reshape spot in these three situations:
1. The array is read-only
2. The MD_RECOVERY_WAIT flag is set
3. The MD_RECOVERY_FROZEN flag is set
While these states mean the array *should* be idle, the IO would never be processed, and your system could be stuck, especially in automated test environments or if external tools or scripts accidentally left the array in one of these states.
Example of the Hang – Kernel Stacktrace
The following stacktrace (from /proc/<pid>/stack) shows how a process can indefinitely hang because the reshape cannot make progress:
    wait_woken+0x7d/0x90
    raid5_make_request+0x929/0x1d70 [raid456]
    md_handle_request+0xc2/0x3b0 [md_mod]
    raid_map+0x2c/0x50 [dm_raid]
    __map_bio+0x251/0x380 [dm_mod]
    dm_submit_bio+0x1f0/0x760 [dm_mod]
    __submit_bio+0xc2/0x1c0
    submit_bio_noacct_nocheck+0x17f/0x450
    submit_bio_noacct+0x2bc/0x780
    submit_bio+0x70/0xc0
    mpage_readahead+0x169/0x1f0
    blkdev_readahead+0x18/0x30
    ...
This means a user process trying to read a file is stuck—waiting for the RAID reshape to move on, which never happens.
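If you want to check a suspicious process for this pattern, a small heuristic along these lines may help. It is only a sketch: the function name is hypothetical, and it simply looks for the two frames that stand out in the trace above (`wait_woken` and `raid5_make_request`) in text you have already read from `/proc/<pid>/stack`.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Heuristic only: returns true when a stack dump contains both frames
 * seen in the hang reported for this bug. A match does not prove a
 * deadlock; a healthy reshape can show the same frames briefly.
 */
static bool looks_like_reshape_hang(const char *stack_text)
{
    assert(stack_text != NULL);
    return strstr(stack_text, "wait_woken") != NULL &&
           strstr(stack_text, "raid5_make_request") != NULL;
}
```

Reading `/proc/<pid>/stack` normally requires root. Treat a match as a hint only when the same process shows it repeatedly over time.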
Why Did This Happen?
- Original fix: Commit c467e97f079f made IO that crosses the active reshape region wait for the reshape to make progress.
- Bug: If the reshape is *blocked* by one of the "no progress" states above, the wait becomes infinite, since nothing will ever restart the reshape in that state.
- Specific to dm-raid456: Unlike md/raid456 (the standard MD RAID subsystem), dm-raid456 did *not* have other mechanisms to unblock or reset the condition automatically.
The fix focused on two components:
1. Blocking certain admin messages: It prevents raid_message() (the dm-raid handler for admin messages) from changing the RAID sync thread state after presuspend, which might have allowed reconfiguration that left the array in a bad state.
2. Proactive detection: Before waiting for an IO in these "no forward progress" states, the RAID code checks if any of the trigger conditions (read-only, MD_RECOVERY_WAIT, MD_RECOVERY_FROZEN) are active. If so, instead of trying to wait, it requeues the IO request—avoiding the deadlock.
Patch Example (simplified)
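    /* Simplified sketch, not the literal upstream diff */
    if (is_read_only(mddev) ||  /* placeholder for the array's read-only check */
        test_bit(MD_RECOVERY_WAIT, &mddev->recovery) ||
        test_bit(MD_RECOVERY_FROZEN, &mddev->recovery)) {
            /* Reshape cannot progress: don't wait, let dm-raid requeue the IO */
            return false;
    }
    /* Otherwise it is safe to wait for the reshape to advance */
    wait_event(...);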
In English: If the RAID is in a state where reshape can never continue, don’t wait, just reschedule or fail the IO.
Exploit Scenario: How Could Attackers Abuse This Issue?
While this vulnerability is not a direct "remote code execution" exploit, it *is* a Denial of Service (DoS) risk. For example:
- An attacker with admin rights could force the array into read-only mode or freeze reshape, then trigger IO that overlaps the reshape position.
- All IO across those sectors would hang, potentially freezing backup jobs, databases, or other critical system functions.
- In automated environments, scripts could get stuck forever until someone manually recovers the system.
What Should System Administrators Do?
- Update your kernel: Make sure you’re running a kernel version that includes the CVE-2024-26962 fix (check your Linux distributor’s security updates).
- Be careful with RAID admin actions: Avoid switching an array to read-only or freezing its sync activity while a reshape is in progress, unless you know the consequences.
- Monitor: If you see unexplained system hangs involving RAID volumes during reshape, this bug may be the culprit if you haven’t updated yet.
References and Further Reading
- CVE-2024-26962 entry (Mitre/NIST)
- Linux Kernel Patch, commit
- dm-raid456 bug discussion (LKML)
- RAID reshape documentation
- man 4 md (Linux RAID documentation)
Final Words
CVE-2024-26962 is a textbook example of a subtle bug: it won’t let a hacker run code directly, but it can leave your RAID array *frozen* at the worst possible time—during a reshape. The fix is straightforward; it makes sure the system doesn’t wait for something that can never happen. It’s a good reminder: RAID administration is powerful, but a small edge-case mistake can have big consequences. Patch your servers if you use Linux RAID!
*Feel free to share this article—understanding RAID internals helps keep Linux systems safe and reliable.*
Timeline
Published on: 05/01/2024 06:15:12 UTC
Last modified on: 12/23/2024 13:39:33 UTC