In the Linux kernel, a vulnerability has been discovered that can lead to a deadlock situation. This vulnerability has been addressed by using a dedicated workqueue for cgroup bpf destruction. Let's dive into the details of the issue, how it was uncovered, and the solution to fix it.
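The issue was uncovered under a pressure test. One of its steps is to set watchdog_thresh repeatedly; as the analysis below shows, deleting a large number of cpusets and taking CPUs offline are also involved.
Please refer to LINK for the pressure test scripts that demonstrate this issue.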
This issue occurs due to the cgroup_mutex and cpu_hotplug_lock being acquired in different tasks, which may lead to a deadlock. The deadlock can happen through the following steps:
1. A large number of cpusets are deleted asynchronously, which puts a significant amount of cgroup_bpf_release works into the system_wq queue. This results in a situation where all active works are cgroup_bpf_release works, and many such works are put into the inactive queue.
2. Setting watchdog_thresh holds cpu_hotplug_lock.read and puts smp_call_on_cpu work into the system_wq. However, due to step 1 filling the system_wq, the 'sscs.work' is put into the inactive queue and will be blocked for a while.
3. CPU offline operations require cpu_hotplug_lock.write, which is blocked by the read lock held in step 2.
4. The cpusets deleted in step 1 also put cgroup_release works into cgroup_destroy_wq, where they constantly compete for cgroup_mutex. When the work running css_killed_work_fn acquires cgroup_mutex, it calls cpuset_css_offline, which needs cpu_hotplug_lock.read. That read acquisition is blocked behind the writer waiting in step 3.
5. At this point, the 256 works in the active queue of system_wq are cgroup_bpf_release works, and each of them is blocked trying to acquire cgroup_mutex. As a result, sscs.work can never be executed, which ultimately leads to a deadlock with four processes blocked.
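The five steps above form a circular wait. The following is a minimal sketch that writes them down as a wait-for graph and checks it for a cycle. The node names are informal labels taken from the steps, and the helper find_cycle is hypothetical, written only for this illustration; it is not kernel code.

```python
def find_cycle(waits_for):
    """Follow the single wait-for edge of each node; return the first cycle found, or None."""
    for start in waits_for:
        path = []
        node = start
        while node in waits_for and node not in path:
            path.append(node)
            node = waits_for[node]
        if node in path:
            # Everything from the first visit of `node` onward is the cycle.
            return path[path.index(node):]
    return None


# Each entry reads "key waits for value", following the steps in the text.
waits_for = {
    # Step 5: the active system_wq works need cgroup_mutex.
    "cgroup_bpf_release works": "css_killed_work_fn",
    # Step 4: the cgroup_mutex holder needs cpu_hotplug_lock.read.
    "css_killed_work_fn": "cpu offline",
    # Step 3: the writer waits for the read lock taken in step 2.
    "cpu offline": "watchdog_thresh writer",
    # Step 2: the read-lock holder waits for its smp_call_on_cpu work.
    "watchdog_thresh writer": "sscs.work",
    # Steps 1 and 5: sscs.work has no free slot in system_wq.
    "sscs.work": "cgroup_bpf_release works",
}

cycle = find_cycle(waits_for)
print(" -> ".join(cycle + cycle[:1]))
```

Every participant in the printed chain waits for the next one, and the last waits for the first, so none of them can make progress.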
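To address this issue, the solution is to place cgroup_bpf_release works on a dedicated workqueue. They then no longer occupy the active slots of system_wq, so sscs.work can run, the lock taken in step 2 is released, and the deadlock is broken.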
This fix was implemented by modifying the Linux kernel source code in the cgroup/bpf section, using a dedicated workqueue for cgroup BPF destruction. Doing so allows the system to avoid deadlocks that could otherwise be caused by the improper ordering of locks and work items.
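The effect of a dedicated queue can be illustrated with a userspace toy model. This is only a sketch under stated assumptions: Python thread pools stand in for kernel workqueues, a threading.Lock stands in for cgroup_mutex, the pool size of 4 replaces the 256 active slots, and the lock chain from the steps above is collapsed into a single holder that waits for the starved work item. The function name sscs_work_runs and the single-worker dedicated pool are choices made for this example, not details taken from the patch.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

POOL_SLOTS = 4  # stand-in for the limit on active works (256 in the text)


def sscs_work_runs(dedicated: bool) -> bool:
    """Return True if the sscs-style work item runs while the mutex is still held."""
    mutex = threading.Lock()  # stand-in for cgroup_mutex
    shared = ThreadPoolExecutor(max_workers=POOL_SLOTS)  # stand-in for system_wq
    # Assumption of this sketch: the dedicated pool has a single worker.
    release_pool = ThreadPoolExecutor(max_workers=1) if dedicated else shared

    def release_work():
        # Like cgroup_bpf_release in the text: blocks until the mutex is free.
        with mutex:
            pass

    mutex.acquire()  # the holder that transitively waits for the sscs work
    try:
        releases = [release_pool.submit(release_work) for _ in range(POOL_SLOTS * 2)]
        sscs = shared.submit(lambda: "done")  # stand-in for sscs.work
        try:
            sscs.result(timeout=0.5)
            ran = True
        except FutureTimeout:
            ran = False  # starved: every shared worker is stuck on the mutex
    finally:
        mutex.release()  # let the toy model drain instead of hanging forever

    for future in releases:
        future.result()
    shared.shutdown()
    if dedicated:
        release_pool.shutdown()
    return ran


print("shared pool, sscs work ran:", sscs_work_runs(dedicated=False))
print("dedicated pool, sscs work ran:", sscs_work_runs(dedicated=True))
```

In the shared case the release jobs are queued ahead of the sscs job and every worker that picks one up blocks on the mutex, so the sscs job cannot start until the timeout path releases the lock. With the dedicated pool the blocked release jobs stay out of the shared pool and the sscs job runs right away.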
For more details on the issue and the actual code changes, refer to the original references on the Linux Kernel Mailing List. If you are running a vulnerable version of the Linux kernel, apply this fix to prevent the deadlock described above.
Timeline
Published on: 11/19/2024 18:15:25 UTC
Last modified on: 11/22/2024 17:11:42 UTC