---
## Introduction
CVE-2025-21662 is a recently resolved issue in the Linux kernel's net/mlx5 driver that could cause processes to hang indefinitely under certain error conditions. The root cause was a missing complete() call: after a command entry allocation failed, tasks waiting on the command were never notified. In this post, we’ll break down how the vulnerability worked, show what was fixed with a code snippet, and give guidance for affected users.
## What is net/mlx5?
net/mlx5 is the Linux kernel driver (module mlx5_core) for Mellanox (NVIDIA) ConnectX-4 and later network adapters, including the ConnectX-5 and ConnectX-6 series. It's widely used in data centers and high-performance computing.
## The Root Cause
When the driver's cmd_alloc_index() call failed (often due to resource exhaustion), cmd_work_handler() returned early. On that path it never called complete() on ent->slotted, the completion object the driver uses to synchronize with the task that issued the command.
As a result, any task waiting on the command (e.g., via wait_for_completion()) would block, potentially forever, leading to stuck processes and possible system-wide performance issues.
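The hang can be reproduced outside the kernel. The following is a minimal user-space sketch, not driver code: it models a completion with a pthread mutex and condition variable, plus a worker whose error path may or may not signal it. The struct cmd_ent fields, work_handler(), waiter_was_woken(), and the 200 ms timeout are all assumptions made for illustration. The waiter in the kernel trace uses wait_for_completion(), which has no timeout, so there the wait simply never ends.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* User-space stand-in for the kernel's struct completion. */
struct completion {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    bool done;
};

static void init_completion(struct completion *c)
{
    pthread_mutex_init(&c->lock, NULL);
    pthread_cond_init(&c->cond, NULL);
    c->done = false;
}

/* Mark the completion as done and wake any waiter. */
static void complete(struct completion *c)
{
    pthread_mutex_lock(&c->lock);
    c->done = true;
    pthread_cond_broadcast(&c->cond);
    pthread_mutex_unlock(&c->lock);
}

/* Returns true if signalled, false if timeout_ms elapsed first. */
static bool wait_for_completion_timeout_ms(struct completion *c, long timeout_ms)
{
    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec += 1;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&c->lock);
    int rc = 0;
    while (!c->done && rc == 0)
        rc = pthread_cond_timedwait(&c->cond, &c->lock, &deadline);
    bool signalled = c->done;
    pthread_mutex_unlock(&c->lock);
    return signalled;
}

/* Hypothetical command entry: only the fields this sketch needs. */
struct cmd_ent {
    struct completion slotted;
    bool alloc_fails;       /* simulates cmd_alloc_index() < 0 */
    bool signal_on_failure; /* false = buggy path, true = signalling path */
};

/* Worker body modelled on the early return in cmd_work_handler(). */
static void *work_handler(void *arg)
{
    struct cmd_ent *ent = arg;

    if (ent->alloc_fails) {
        if (ent->signal_on_failure)
            complete(&ent->slotted);
        return NULL; /* early return */
    }
    complete(&ent->slotted);
    return NULL;
}

/* Runs one worker and reports whether the waiter was woken in time. */
static bool waiter_was_woken(bool alloc_fails, bool signal_on_failure)
{
    struct cmd_ent ent = { .alloc_fails = alloc_fails,
                           .signal_on_failure = signal_on_failure };
    pthread_t worker;

    init_completion(&ent.slotted);
    int rc = pthread_create(&worker, NULL, work_handler, &ent);
    assert(rc == 0);
    bool woken = wait_for_completion_timeout_ms(&ent.slotted, 200);
    pthread_join(worker, NULL);
    return woken;
}
```

Built with -pthread, waiter_was_woken(true, false) models the buggy error path and returns false once the timeout expires, while waiter_was_woken(true, true) returns true because the error path signals before returning.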
If your system was affected, you may have seen logs like:

    mlx5_core 0000:01:00.0: cmd_work_handler:877:(pid 3880418): failed to allocate command entry
    INFO: task kworker/13:2:4055883 blocked for more than 120 seconds.
          Not tainted 4.19.90-25.44.v2101.ky10.aarch64 #1
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    ...
## Stack Trace Example

    kworker/13:2    D    0 4055883      2 0x00000228
    Workqueue: events mlx5e_tx_dim_work [mlx5_core]
    Call trace:
     __switch_to+0xe8/0x150
     __schedule+0x2a8/0x9b8
     schedule+0x2c/0x88
     schedule_timeout+0x204/0x478
     wait_for_common+0x154/0x250
     wait_for_completion+0x28/0x38
     cmd_exec+0x7a/0xa00 [mlx5_core]
     mlx5_cmd_exec+0x54/0x80 [mlx5_core]
     mlx5_core_modify_cq+0x6c/0x80 [mlx5_core]
     mlx5_core_modify_cq_moderation+0xa/0xb8 [mlx5_core]
     mlx5e_tx_dim_work+0x54/0x68 [mlx5_core]
     process_one_work+0x1b/0x448
     worker_thread+0x54/0x468
     kthread+0x134/0x138
     ret_from_fork+0x10/0x18
Any process that called these routines could freeze, causing cascading issues.
Here’s a simplified version of the problematic logic:

    static void cmd_work_handler(struct work_struct *work)
    {
        ...
        if (cmd_alloc_index(...) < 0) {
            /* missing: complete(&ent->slotted); */
            return;
        }
        ...
    }
The function should have always completed ent->slotted (using complete(&ent->slotted)) regardless of allocation success.
## The Fixed Code
The fix calls complete() before the early return, so a task waiting on ent->slotted is no longer left hanging:

    static void cmd_work_handler(struct work_struct *work)
    {
        ...
        if (cmd_alloc_index(...) < 0) {
            complete(&ent->slotted); /* now signals waiting tasks */
            return;
        }
        ...
    }
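One way to make this class of bug harder to reintroduce is to route every exit through a single label that signals the completion. The sketch below is a user-space illustration of that pattern under stated assumptions, not the upstream patch, which adds the call at the early return. Here struct completion is reduced to a counter, and alloc_index(), work_handler(), and the cmd_ent fields are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-in for a completion: counts how often it was signalled. */
struct completion {
    unsigned int done;
};

static void complete(struct completion *c)
{
    c->done++;
}

/* Hypothetical command entry with only the fields this sketch needs. */
struct cmd_ent {
    struct completion slotted;
    int ret;
};

/* Stand-in for cmd_alloc_index(): a slot index, or -1 on failure. */
static int alloc_index(bool slots_available)
{
    return slots_available ? 0 : -1;
}

/*
 * Every return path leaves through the "out" label, so the waiter is
 * signalled exactly once whether or not the allocation succeeded.
 */
static void work_handler(struct cmd_ent *ent, bool slots_available)
{
    assert(ent != NULL);

    int idx = alloc_index(slots_available);
    if (idx < 0) {
        ent->ret = -1; /* record the failure for the caller */
        goto out;
    }

    ent->ret = 0; /* a real handler would submit the command here */

out:
    complete(&ent->slotted);
}
```

Calling work_handler() on a zero-initialized entry leaves slotted.done at 1 on both the success and the failure path. The trade-off is that the signal is sent at the end of the function rather than as soon as a slot is obtained, which may or may not suit a given driver.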
Reference Patch:
Linux kernel commit fixing CVE-2025-21662
## Exploit Details
This bug is more of a *denial of service* vulnerability than something an attacker could use to run unauthorized code. That said, a user with the ability to trigger allocation failures (e.g., by exhausting system resources or controlling network activity) could potentially cause parts of the system to freeze or hang.
For example:
- Malicious or erroneous activity could cause the network interface to hammer through commands, eventually hitting the allocation bug and locking up system work queues.
- In cloud environments, one guest's high resource usage might impact the host kernel, especially if Mellanox NICs are widely used.
Note: There's no remote code execution – only a local user with the right hardware and access could trigger hangs.
## How Can I Know If I’m Affected?
- You are running Linux kernel versions before the patch was applied (check your distribution's kernel change logs).
- You are using Mellanox/NVIDIA ConnectX-4/5/6 network cards.
- You’ve seen hung processes, especially with mlx5_core in their trace.
Check your logs (dmesg or /var/log/messages) for signs like:

    failed to allocate command entry
    task <name> blocked for more than 120 seconds
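The two log signatures above can also be checked programmatically. The following is a small sketch that classifies log lines by substring match; the function names, the log_hits struct, and the exact substrings are assumptions for illustration, and message wording may differ between kernel versions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True if the line looks like the mlx5 allocation-failure message. */
static bool is_alloc_failure(const char *line)
{
    assert(line != NULL);
    return strstr(line, "mlx5_core") != NULL &&
           strstr(line, "failed to allocate command entry") != NULL;
}

/* True if the line looks like a hung-task watchdog report. */
static bool is_hung_task(const char *line)
{
    assert(line != NULL);
    return strstr(line, "INFO: task ") != NULL &&
           strstr(line, " blocked for more than ") != NULL;
}

/* Per-signature counters for a batch of log lines. */
struct log_hits {
    size_t alloc_failures;
    size_t hung_tasks;
};

/* Counts both signatures across n log lines. */
static struct log_hits scan_lines(const char *const *lines, size_t n)
{
    struct log_hits hits = {0, 0};

    for (size_t i = 0; i < n; i++) {
        if (is_alloc_failure(lines[i]))
            hits.alloc_failures++;
        if (is_hung_task(lines[i]))
            hits.hung_tasks++;
    }
    return hits;
}
```

Seeing both counters rise together is a hint rather than proof: hung-task reports have many causes, so the stack trace should still be checked for mlx5_core frames.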
Mitigation:
If you can’t upgrade right away, reduce command allocation failures by monitoring resource usage closely and avoiding low-memory scenarios.
## More Information
- CVE-2025-21662 record at NIST (when available)
- Linux kernel mailing list discussion
- Mellanox (NVIDIA) Linux drivers and documentation
## Conclusion
CVE-2025-21662 is a subtle but impactful Linux kernel bug affecting high-performance networking. It mainly leads to system hangs instead of more dangerous attacks. The issue is fixed in mainline Linux; all users of Mellanox cards should check their systems and update as soon as possible to avoid stability problems.
## Timeline
Published on: 01/21/2025 13:15:09 UTC
Last modified on: 11/03/2025 21:19:03 UTC