---

## Introduction

CVE-2025-21662 is a resolved issue in the Linux kernel's net/mlx5 driver that could cause tasks to hang indefinitely under certain error conditions. The cause was a missing complete() call on an error path: after a command entry allocation failure, the task waiting on that command was never notified. In this post, we’ll break down how the vulnerability worked, what was fixed, and provide a code snippet and guidance for affected users.


## What is net/mlx5?

net/mlx5 (the mlx5_core module) is the Linux kernel driver for Mellanox (now NVIDIA) network adapters from ConnectX-4 onward, including the ConnectX-5 and ConnectX-6 series. It's widely used in data centers and high-performance computing.

## The Root Cause

Inside cmd_work_handler(), the driver calls cmd_alloc_index() to reserve a command entry. When that call failed (often due to resource exhaustion), cmd_work_handler() returned early without calling complete() on ent->slotted, a completion used to synchronize with the task that issued the command.

This meant any process waiting for the command (e.g., via wait_for_completion()) would hang, sometimes forever, leading to stuck processes and potential system-wide performance issues.

If your system was affected, you may have seen logs like:

mlx5_core 0000:01:00.0: cmd_work_handler:877:(pid 3880418): failed to allocate command entry
INFO: task kworker/13:2:4055883 blocked for more than 120 seconds.
Not tainted 4.19.90-25.44.v2101.ky10.aarch64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
...

## Stack Trace Example

kworker/13:2    D    0 4055883      2 0x00000228
Workqueue: events mlx5e_tx_dim_work [mlx5_core]
Call trace:
   __switch_to+0xe8/0x150
   __schedule+0x2a8/0x9b8
   schedule+0x2c/0x88
   schedule_timeout+0x204/0x478
   wait_for_common+0x154/0x250
   wait_for_completion+0x28/0x38
   cmd_exec+0x7a0/0xa00 [mlx5_core]
   mlx5_cmd_exec+0x54/0x80 [mlx5_core]
   mlx5_core_modify_cq+0x6c/0x80 [mlx5_core]
   mlx5_core_modify_cq_moderation+0xa0/0xb8 [mlx5_core]
   mlx5e_tx_dim_work+0x54/0x68 [mlx5_core]
   process_one_work+0x1b0/0x448
   worker_thread+0x54/0x468
   kthread+0x134/0x138
   ret_from_fork+0x10/0x18

Any process that called these routines could freeze, causing cascading issues.

Here’s a simplified version of the problematic logic:

static void cmd_work_handler(struct work_struct *work) {
    ...
    if (cmd_alloc_index(...) < 0) {
        // BUG: missing complete(&ent->slotted); the waiter is never woken
        return;
    }
    ...
}

The function should complete ent->slotted (using complete(&ent->slotted)) on every exit path, so the completion is signaled whether or not the allocation succeeds.

## The Fixed Code

The fix adds a complete(&ent->slotted) call to the allocation-failure path, so the early return no longer leaves the waiting task hanging:

static void cmd_work_handler(struct work_struct *work) {
    ...
    if (cmd_alloc_index(...) < 0) {
        complete(&ent->slotted); // Now properly signals waiting tasks
        return;
    }
    ...
}

Reference Patch:
Linux kernel commit fixing CVE-2025-21662

## Exploit Details

This bug is more of a *denial of service* vulnerability than something an attacker could use to run unauthorized code. That said, a user with the ability to trigger allocation failures (e.g., by exhausting system resources or controlling network activity) could potentially cause parts of the system to freeze or hang.

For example:

- Malicious or erroneous activity could cause the network interface to hammer through commands, eventually hitting the allocation bug and locking up system work queues.
- In cloud environments, one guest's high resource usage might impact the host kernel, especially if Mellanox NICs are widely used.

Note: There's no remote code execution – only a local user with the right hardware and access could trigger hangs.

## How Can I Know If I’m Affected?

- You are running a Linux kernel version from before the patch was applied (check your distribution's kernel changelogs).
- You are using Mellanox/NVIDIA ConnectX-4/5/6 network cards.
- You’ve seen hung processes, especially with mlx5_core in their stack trace.
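One quick way to check whether the driver is in use at all is to look for mlx5_core in /proc/modules, where each line starts with a module name followed by a space. The helper below is a small sketch of that check; the function name `module_listed` is made up for this post, and it only detects the driver when it is built as a loadable module (a driver compiled into the kernel does not appear in /proc/modules).

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Returns true if a line in the given /proc/modules-style stream starts
 * with exactly `name` followed by a space. `module_listed` is a
 * hypothetical helper name, not an existing API.
 */
static bool module_listed(FILE *modules, const char *name)
{
    char line[4096];
    size_t len = strlen(name);

    while (fgets(line, sizeof line, modules)) {
        /* Require the trailing space so "mlx5" does not match "mlx5_core". */
        if (strncmp(line, name, len) == 0 && line[len] == ' ')
            return true;
    }
    return false;
}
```

A caller would open /proc/modules with `fopen("/proc/modules", "r")`, pass the stream together with `"mlx5_core"`, and close it afterwards. A positive result only says the driver is loaded, not that the kernel is unpatched.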

Check your logs (dmesg or /var/log/messages) for signs like:
failed to allocate command entry
task <name> blocked for more than 120 seconds
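If you would rather automate that check than read the logs by eye, the two signatures can be matched with plain substring searches. This is a rough sketch under a few assumptions: the type and function names (`log_hit`, `classify_line`, `scan_log`) are invented for this post, the match strings are taken from the sample log above, and hung-task reports are counted for any task, so those hits still need a manual look at the stack trace for mlx5_core.

```c
#include <stdio.h>
#include <string.h>

enum log_hit { HIT_NONE = 0, HIT_ALLOC_FAILURE, HIT_HUNG_TASK };

/* Classify one log line against the two signatures from the sample log. */
static enum log_hit classify_line(const char *line)
{
    if (strstr(line, "mlx5_core") &&
        strstr(line, "failed to allocate command entry"))
        return HIT_ALLOC_FAILURE;
    if (strstr(line, "INFO: task ") &&
        strstr(line, "blocked for more than"))
        return HIT_HUNG_TASK;
    return HIT_NONE;
}

struct scan_result {
    unsigned long alloc_failures; /* mlx5 command entry allocation errors */
    unsigned long hung_tasks;     /* hung-task reports, from any task */
};

/* Read a log stream line by line and count both kinds of hits. */
static struct scan_result scan_log(FILE *in)
{
    struct scan_result r = { 0, 0 };
    char line[4096];

    while (fgets(line, sizeof line, in)) {
        switch (classify_line(line)) {
        case HIT_ALLOC_FAILURE:
            r.alloc_failures++;
            break;
        case HIT_HUNG_TASK:
            r.hung_tasks++;
            break;
        case HIT_NONE:
            break;
        }
    }
    return r;
}
```

Wrapped in a small main() that calls `scan_log(stdin)`, this can be fed from `dmesg` through a pipe. An allocation failure followed by hung-task reports is the pattern worth investigating; either signature alone is weaker evidence.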

Mitigation:

If you can’t upgrade right away, reduce command allocation failures by monitoring resource usage closely and avoiding low-memory scenarios.

## More Information

- CVE-2025-21662 record at NIST (when available)
- Linux kernel mailing list discussion
- Mellanox (NVIDIA) Linux drivers and documentation

## Conclusion

CVE-2025-21662 is a subtle but impactful Linux kernel bug affecting high-performance networking. It mainly leads to system hangs instead of more dangerous attacks. The issue is fixed in mainline Linux; all users of Mellanox cards should check their systems and update as soon as possible to avoid stability problems.


## Timeline

Published on: 01/21/2025 13:15:09 UTC
Last modified on: 11/03/2025 21:19:03 UTC