
NVMe over RoCE

NVMe over RoCE (RDMA over Converged Ethernet) is a transport option within the NVMe over Fabrics (NVMe-oF) specification that carries NVMe commands across an Ethernet network using RDMA — Remote Direct Memory Access. By bypassing the host TCP/IP stack entirely, RoCE allows NVMe I/O to move directly between application memory and storage device memory via the network, with the NIC handling the data transfer rather than the CPU. The result is end-to-end latency that approaches local PCIe NVMe performance over a network — typically 80–150 µs versus 20–70 µs for local NVMe.

Key Facts: NVMe over RoCE

  • Transport: RDMA over Converged Ethernet; kernel bypass, zero-copy DMA
  • Latency: 80–150 µs end-to-end; requires a lossless Ethernet fabric (DCB + PFC)
  • Hardware requirement: RDMA-capable NICs (RNICs) on both initiator and target
  • Versus NVMe/TCP: lower latency, higher infrastructure complexity and cost

RoCE is the highest-performance option in the NVMe-oF family for Ethernet environments, but that performance comes with infrastructure requirements that limit where it can practically be deployed. NVMe/TCP, which uses standard TCP/IP and works on any Ethernet network, trades some latency for significantly lower operational complexity and broader hardware compatibility.

Figure: in NVMe over RoCE, the application sends I/O via an RDMA NIC over a lossless Ethernet fabric to NVMe storage targets.

How NVMe over RoCE Works

In an NVMe/RoCE deployment, the NVMe initiator (the host running the application) and the NVMe target (the storage node) both have RDMA-capable NICs (RNICs). When an application issues a storage I/O:

  1. The initiator's RNIC transmits the NVMe command as a capsule and sets up the corresponding RDMA read or write operation for the data transfer.
  2. The frame is transmitted across the Ethernet fabric — without involving the host CPU or kernel networking stack for the data path.
  3. The RNIC on the storage target places the data directly into the NVMe device’s DMA buffer.
  4. A completion is returned to the initiator, again bypassing the kernel.

This zero-copy, kernel-bypass path is what eliminates most of the latency overhead present in TCP/IP-based protocols. The tradeoff is that RDMA is highly sensitive to packet loss: a single dropped packet triggers RDMA retransmission logic that causes latency spikes. This is why RoCE requires a lossless Ethernet fabric — enforced through Data Center Bridging (DCB), Priority Flow Control (PFC), and Enhanced Transmission Selection (ETS).

RoCE v1 operates at Layer 2 (within a single broadcast domain). RoCE v2 adds IP and UDP headers, enabling routing across Layer 3 boundaries — making it practical for larger data centers and multi-rack deployments.
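
To make the initiator side concrete, the sketch below walks through a typical host-side setup using the standard nvme-cli tooling, driven from Python for readability. The target address, port, and subsystem NQN are hypothetical placeholders; the sketch assumes nvme-cli, iproute2's rdma utility, and the nvme-rdma kernel module are available on the host and that it runs with root privileges. It is an illustration of the connection flow, not a production script.

```python
# Minimal sketch of initiator-side NVMe/RoCE setup, driven through nvme-cli.
# The target address, port, and subsystem NQN below are hypothetical placeholders;
# assumes nvme-cli, iproute2's rdma utility, and the nvme-rdma kernel module are
# installed, and that the script runs with root privileges.
import subprocess

TARGET_ADDR = "192.168.10.20"                   # hypothetical RoCE v2-reachable target IP
TARGET_PORT = "4420"                            # default NVMe-oF port
SUBSYS_NQN = "nqn.2024-01.io.example:subsys1"   # hypothetical subsystem NQN


def run(cmd):
    """Run a command, raise on failure, and return its stdout."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


# Confirm the host actually has an RDMA-capable link before attempting RoCE.
print(run(["rdma", "link", "show"]))

# Load the RDMA transport driver for the NVMe host stack.
run(["modprobe", "nvme-rdma"])

# Ask the target which subsystems it exposes over the RDMA transport.
print(run(["nvme", "discover", "-t", "rdma", "-a", TARGET_ADDR, "-s", TARGET_PORT]))

# Connect; the remote namespace then appears as a local block device (e.g. /dev/nvme1n1).
run(["nvme", "connect", "-t", "rdma",
     "-n", SUBSYS_NQN, "-a", TARGET_ADDR, "-s", TARGET_PORT])

# Confirm the fabric-attached namespace shows up alongside local NVMe devices.
print(run(["nvme", "list"]))
```

After the connect step, the remote namespace appears as a regular NVMe block device on the initiator and can be partitioned, formatted, and mounted like local storage.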

🚀 Need one platform for both NVMe/TCP and NVMe/RoCE? simplyblock supports both transports, so teams can keep NVMe/TCP as the broad default and enable NVMe/RoCE where RDMA fabrics are justified. 👉 Explore NVMe-oF Storage for Kubernetes and Private Cloud →

NVMe over RoCE vs. Other NVMe-oF Transports

Feature                | NVMe over RoCE          | NVMe over TCP           | NVMe over FC
Transport              | RDMA over Ethernet      | TCP/IP over Ethernet    | Fibre Channel
Latency                | ~80–150 µs              | ~300–500 µs             | ~100–300 µs
NIC requirement        | RDMA-capable (RNIC)     | Standard NIC            | FC HBA
Network requirement    | Lossless (DCB + PFC)    | Standard Ethernet       | FC fabric
CPU overhead           | Very low (NIC offload)  | Moderate (kernel path)  | Low (HBA offload)
Kubernetes fit         | Complex, limited        | Good (CSI-native)       | Legacy environments
Operational complexity | High                    | Low                     | Moderate–high

For a detailed performance comparison, see our NVMe/TCP vs NVMe/RoCE analysis.
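
From the initiator's point of view, switching between the two Ethernet transports is largely a matter of the transport type passed at connect time; the real divergence is in NIC and fabric requirements. The hedged sketch below illustrates this, again with hypothetical addresses and NQN, and assumes the target exports the subsystem over both transports and that the nvme-rdma and nvme-tcp kernel modules are loaded.

```python
# Sketch: on the initiator, choosing NVMe/RoCE vs. NVMe/TCP largely comes down to
# the transport argument passed to nvme-cli. Addresses and the NQN are hypothetical;
# assumes the target exports the same subsystem over both transports.
import subprocess


def connect(transport: str, addr: str, nqn: str, port: str = "4420") -> None:
    """Connect to an NVMe-oF subsystem over the given transport ('rdma' or 'tcp')."""
    subprocess.run(["nvme", "connect", "-t", transport,
                    "-a", addr, "-s", port, "-n", nqn], check=True)


NQN = "nqn.2024-01.io.example:subsys1"
connect("rdma", "192.168.10.20", NQN)   # needs RNICs and a lossless (PFC-tuned) fabric
connect("tcp", "192.168.20.20", NQN)    # works over standard Ethernet with any NIC
```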

Use Cases for NVMe over RoCE

RoCE is the right transport when the absolute lowest network storage latency is required and the infrastructure investment is justified:

  • High-frequency trading: Latency is a direct revenue factor; RDMA eliminates network stack jitter.
  • AI/ML training clusters: GPU-to-storage bandwidth for checkpoint writing and dataset loading; RDMA NICs are often already present in GPU clusters.
  • HPC environments: Parallel filesystem access patterns benefit from RDMA’s CPU efficiency at scale.
  • Tier-0 database storage: Where consistent sub-200 µs storage response is a hard requirement.

In each of these environments, the RDMA network infrastructure (RNIC on every node, lossless fabric configuration, PFC tuning) is typically already present or justified by the workload.

NVMe over RoCE in Kubernetes

RoCE is technically usable in Kubernetes, but it is rarely deployed there in practice. Kubernetes clusters are dynamic — pods are created, deleted, and rescheduled across nodes continuously. This creates challenges for RDMA fabrics:

  • RDMA transports are connection-oriented and state-heavy; dynamic Kubernetes scheduling breaks connection affinity.
  • PFC misconfiguration or fabric changes can trigger pause-frame storms and head-of-line blocking, causing latency spikes across the cluster.
  • Configuring lossless Ethernet for every node in a Kubernetes cluster requires consistent switch configuration that is difficult to maintain as the cluster scales.

For Kubernetes storage, NVMe/TCP is usually the more practical path because it works on standard Ethernet, integrates cleanly with CSI, and handles the dynamic connection patterns Kubernetes requires. RoCE still matters in specialized AI or latency-sensitive clusters where the RDMA fabric is already part of the design and the operational trade-off is intentional.

simplyblock: One Platform for NVMe/TCP and NVMe/RoCE

simplyblock can support both NVMe/TCP and NVMe/RoCE within the same software-defined storage platform. That matters for teams that do not want separate storage products for general-purpose Kubernetes workloads and low-latency RDMA clusters.

In practice, the split is usually straightforward:

  • NVMe/TCP is the broader default for Kubernetes, OpenShift, private cloud, and standard Ethernet networks.
  • NVMe/RoCE is the fit when a team already runs or intentionally designs a lossless RDMA fabric for AI, HPC, or other latency-critical environments.

The practical latency difference between NVMe/TCP and NVMe/RoCE is still real, but the operational gap matters just as much. simplyblock lets teams keep the same storage control plane, snapshots, multi-tenant QoS, and CSI integration while selecting the transport that fits each environment.

That means teams can standardize on one platform and choose protocol by workload and network design instead of treating NVMe/TCP and NVMe/RoCE as mutually exclusive infrastructure choices.

Related terms: RoCE v2 · NVMe over TCP · NVMe Latency · InfiniBand

Questions and Answers

Why use NVMe over RoCE instead of NVMe over TCP?

NVMe/RoCE delivers lower latency and lower CPU overhead by using RDMA instead of the full TCP/IP stack. That makes it the better fit when the extra network complexity is justified by the workload, such as specialized AI clusters, HPC, or other environments chasing the lowest possible storage latency.

How does NVMe over RoCE compare to NVMe over TCP?

NVMe/RoCE usually delivers lower latency and better CPU efficiency, but it requires RDMA-capable NICs and a lossless Ethernet fabric. NVMe/TCP is easier to deploy on standard Ethernet and is usually the operational default for Kubernetes and private-cloud environments.

Does simplyblock support both NVMe/TCP and NVMe/RoCE?

Yes. simplyblock can support both transports in the same software-defined storage platform. Teams can keep NVMe/TCP as the common choice for standard Ethernet environments and enable NVMe/RoCE where RDMA-backed clusters need lower latency.

Can NVMe over RoCE be used in Kubernetes environments?

Technically yes, but it is harder to operate than NVMe/TCP in dynamic Kubernetes environments. RoCE makes the most sense when the cluster already has a well-managed RDMA fabric and the application benefits justify the added network discipline.

What are the limitations of NVMe over RoCE?

Specialized RDMA NICs are required on every host and storage node. The network fabric must be lossless, which requires consistent PFC configuration on all switches. RoCE v1 operates at Layer 2 and cannot route across subnets; routed deployments require RoCE v2. And in dynamic environments like Kubernetes, the operational overhead of maintaining an RDMA fabric is significant compared with NVMe/TCP.