Choosing between NVMe/TCP and local NVMe is a fundamental architecture decision for teams building Kubernetes clusters, bare-metal database servers, and high-performance compute infrastructure. Local NVMe delivers the lowest absolute block latency, typically 50–150 µs, because there is no network hop between the CPU and the drive. NVMe/TCP disaggregates storage from compute by carrying NVMe commands over a standard TCP/IP network, adding 150–350 µs of network latency but enabling capabilities that local storage cannot provide: replication, live pod migration, independent scaling, and consistent PVC lifecycle management across a Kubernetes cluster.
The latency difference between local NVMe and NVMe/TCP is real but often smaller in practice than architecture diagrams suggest. Modern NVMe/TCP implementations over 25GbE Ethernet achieve 200–500 µs end-to-end latency for random 4K reads — a range that falls within the acceptable budget for most database and analytics workloads. The architecturally important question is not whether local NVMe is faster in a benchmark, but whether the operational constraints of local storage — pod pinning, manual data migration, no replication, no live failover — are acceptable for the workload and team.
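To make that concrete, here is a back-of-envelope sketch of what the transport delta costs per query. The per-read penalty is the midpoint of the ranges quoted above; all figures are illustrative, not benchmark results:

```python
# Per-query cost of the NVMe/TCP transport delta. DELTA_US is the midpoint
# NVMe/TCP read latency (350 us) minus the midpoint local read (100 us),
# taken from the ranges quoted above; illustrative, not a benchmark.

DELTA_US = 350 - 100

for uncached_reads in (1, 10, 100):
    added_ms = uncached_reads * DELTA_US / 1000
    print(f"{uncached_reads:>4} uncached reads -> +{added_ms:5.2f} ms per query")

# 1 -> +0.25 ms, 10 -> +2.50 ms, 100 -> +25.00 ms. Against a typical
# 50-100 ms query SLO, point lookups barely notice the delta; scan-heavy
# queries issuing many serial uncached reads feel it most.
```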
## Latency: Local NVMe vs. NVMe/TCP vs. NVMe/RoCE
Local NVMe latency is determined by PCIe bus speed, NVMe queue depth, and the drive controller firmware. Consumer-grade NVMe drives read at 70–120 µs; enterprise drives optimized for low latency achieve 50–80 µs. There is no network hop and no TCP stack involved, which is why local NVMe sets the latency floor.
NVMe/TCP adds a kernel TCP stack and a network round-trip. On a well-configured 25GbE network with low switch latency, the additional overhead is typically 150–350 µs, placing total read latency at 200–500 µs. This is the range where most workloads operate comfortably. Applications with sub-200 µs requirements — some in-memory databases, high-frequency trading, and real-time control systems — may find NVMe/TCP insufficient.
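One reason most workloads operate comfortably in this range is concurrency. A rough Little's-law sketch (again with illustrative midpoints from the ranges above) shows that a parallel workload can recover throughput over NVMe/TCP by keeping more I/Os in flight, while a serial chain of dependent reads pays the full per-I/O penalty:

```python
# Little's law sketch: IOPS ~= outstanding I/Os / mean latency, so an
# application that keeps more I/Os in flight recovers throughput despite
# higher per-I/O latency. Latencies are illustrative midpoints from above.

def iops(queue_depth: int, latency_us: float) -> float:
    """Steady-state IOPS for a given queue depth and mean latency."""
    return queue_depth / (latency_us / 1_000_000)

target = iops(8, 100)                    # local NVMe at QD8: 80,000 IOPS
qd_over_tcp = target * 350 / 1_000_000   # outstanding I/Os NVMe/TCP needs
print(f"target {target:,.0f} IOPS needs QD {qd_over_tcp:.0f} over NVMe/TCP")
# -> target 80,000 IOPS needs QD 28 over NVMe/TCP

# The delta bites hardest at queue depth 1 (serial, dependent reads),
# which is exactly the sub-200 us profile described above.
```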
NVMe/RoCE closes the gap considerably. RDMA bypasses the CPU’s TCP stack, reducing network overhead to 30–80 µs on a lossless Ethernet or InfiniBand fabric. NVMe/RoCE total read latency of 80–150 µs is competitive with local NVMe enterprise drives while still providing all the operability benefits of disaggregated storage. The trade-off is that NVMe/RoCE requires a lossless network fabric (Priority Flow Control, ECN tuning), which adds infrastructure complexity.
## Operational Differences: What Local NVMe Cannot Do
The latency advantage of local NVMe comes with operational constraints that compound as cluster scale and workload complexity increase:
- **No live pod migration:** a Kubernetes pod using a local NVMe volume (via hostPath or a local PersistentVolume) is pinned to a specific node. If that node needs maintenance, the pod must be stopped, the data manually moved or synced, and the pod restarted elsewhere. This breaks Kubernetes’ node drain workflow and requires operator intervention.
- **No replication:** a locally attached drive has no built-in replication mechanism, so a drive failure means application data loss unless the application handles its own replication (as some distributed databases do). Storage-level replication is simply absent.
- **No independent scaling:** adding storage capacity requires adding a node with local drives, and adding compute capacity without additional storage leaves storage unevenly distributed. The two resources cannot grow independently.
- **No thin provisioning:** local NVMe volumes consume their full declared capacity immediately. Disaggregated storage with thin provisioning allocates capacity as data arrives, reducing waste for workloads with variable growth patterns (see the capacity sketch after this list).
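A minimal sketch of the thin-provisioning point, with hypothetical volume sizes and fill levels:

```python
# Thin-provisioning sketch: declared vs. physically consumed capacity.
# Volume names, sizes, and fill levels below are hypothetical.

volumes = {              # declared size GiB -> actually written GiB
    "pg-data": (500, 120),
    "kafka-logs": (1000, 310),
    "ci-cache": (200, 15),
}

declared = sum(size for size, _ in volumes.values())
written = sum(used for _, used in volumes.values())

# Local NVMe (thick): each volume consumes its full declared size up front.
# Thin-provisioned disaggregated storage: capacity is allocated as written.
print(f"thick: {declared} GiB reserved, thin: {written} GiB consumed "
      f"({1 - written / declared:.0%} less)")
# -> thick: 1700 GiB reserved, thin: 445 GiB consumed (74% less)
```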
## When Local NVMe Is the Right Choice
Local NVMe remains appropriate when:
- The workload requires sub-200 µs block latency and NVMe/RoCE infrastructure is not available.
- The application manages its own replication (Apache Cassandra, ClickHouse native replication, some PostgreSQL configurations with logical replication).
- The cluster is small and static enough that manual node drain procedures are manageable.
- The workload is compute-and-storage co-located by design (analytics engines reading from local scratch space, build caches, ephemeral workloads).
For anything requiring Kubernetes-native PVC lifecycle management — StatefulSets with rolling upgrades, node drains, pod disruption budgets, CSI snapshots — disaggregated storage for Kubernetes is a substantially better operational fit.
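As an example of that lifecycle, here is a minimal sketch of online PVC expansion using the official `kubernetes` Python client. The PVC name, namespace, and target size are hypothetical, and the backing StorageClass must declare `allowVolumeExpansion: true`, which a local PersistentVolume cannot offer:

```python
# Minimal sketch: online PVC expansion, one of the CSI lifecycle operations
# local PersistentVolumes cannot do. Uses the official `kubernetes` Python
# client; the PVC name, namespace, and target size are hypothetical.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

# Request a larger size; the CSI driver grows the backing volume and, for
# most filesystems, resizes it online while the pod keeps running.
patch = {"spec": {"resources": {"requests": {"storage": "200Gi"}}}}
v1.patch_namespaced_persistent_volume_claim(
    name="data-postgres-0", namespace="databases", body=patch
)
```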
## NVMe/TCP vs. Local NVMe vs. NVMe/RoCE Compared
| Factor | Local NVMe | NVMe/TCP | NVMe/RoCE |
|---|---|---|---|
| Read latency (random 4K) | 50–150 µs | 200–500 µs | 80–150 µs |
| Storage replication | None (application-level only) | Storage-level, policy-driven | Storage-level, policy-driven |
| Live pod migration | Not supported | Full Kubernetes scheduling freedom | Full Kubernetes scheduling freedom |
| Independent scaling | Tied to compute nodes | Storage scales independently | Storage scales independently |
| Kubernetes PVC lifecycle | Limited (local PV, no expansion) | Full CSI lifecycle | Full CSI lifecycle |
| Hardware requirements | NVMe drives in compute nodes | Standard Ethernet (10/25GbE+) | RDMA-capable NICs, lossless fabric |
## How Simplyblock Bridges the Gap
Simplyblock offers disaggregated storage over both NVMe/TCP and NVMe/RoCE, allowing teams to choose the transport that matches their latency requirements and network infrastructure. For workloads with hard sub-200 µs read requirements, NVMe/RoCE closes the gap with local NVMe while retaining all the operability benefits of disaggregated storage: live migration, storage-level replication, CSI snapshots, thin provisioning, and independent scaling.
For teams on standard Ethernet without RDMA infrastructure, NVMe/TCP delivers 200–500 µs latency that is sufficient for the large majority of database and stateful Kubernetes workloads. The operational savings — no node pinning, automated failover, CSI-native snapshots and clones — typically outweigh the latency delta for any workload that does not have a hard sub-200 µs requirement.
See NVMe over TCP Latency Characteristics for detailed latency benchmarking data, and What Is NVMe over RoCE for RDMA fabric requirements and performance profiles.
## Related Terms
These entries cover the transport, architecture, and performance concepts central to the NVMe/TCP vs. local NVMe decision.
- What Is NVMe over TCP
- What Is NVMe over RoCE
- Disaggregated Storage for Kubernetes
- NVMe over TCP Latency Characteristics
- Software-defined Block Storage
## Questions and Answers
### Is NVMe/TCP slower than local NVMe?
Yes, NVMe/TCP adds network round-trip latency that does not exist with local NVMe. Local NVMe delivers 50–150 µs random read latency depending on drive quality; NVMe/TCP over 25GbE Ethernet typically adds 150–350 µs of network overhead, resulting in total latency of 200–500 µs. For workloads with hard sub-200 µs requirements, this difference matters. For the majority of database, analytics, and message queue workloads, 200–500 µs block latency is within the acceptable range, and the operational benefits of disaggregated storage outweigh the latency cost.
### When should I choose NVMe/TCP over local NVMe in Kubernetes?
Choose NVMe/TCP when your Kubernetes workloads need standard PVC lifecycle management (StatefulSet rolling upgrades, node drains, pod disruption budgets, CSI snapshots, PVC expansion) and when the workload can tolerate block latency above 200 µs. Local NVMe is appropriate only when the application manages its own data replication, pod pinning to specific nodes is acceptable, and the team can tolerate manual procedures for node maintenance. For most Kubernetes platforms running databases, caches, or message queues, NVMe/TCP disaggregated storage is the operationally correct choice.
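That guidance condenses into a rough rule of thumb. The sketch below simply encodes this article's illustrative thresholds; it is not a universal sizing tool:

```python
# Rule-of-thumb encoder for the guidance above; the 200 us threshold is the
# illustrative figure used throughout this article, not a universal constant.

def pick_transport(p99_read_budget_us: float,
                   app_replicates_itself: bool,
                   needs_pvc_lifecycle: bool,
                   rdma_fabric_available: bool) -> str:
    if needs_pvc_lifecycle:
        if p99_read_budget_us < 200 and rdma_fabric_available:
            return "NVMe/RoCE (disaggregated, low latency)"
        return "NVMe/TCP (disaggregated, standard Ethernet)"
    if app_replicates_itself and p99_read_budget_us < 200:
        return "local NVMe (app handles replication, latency-critical)"
    return "NVMe/TCP (disaggregated, standard Ethernet)"

print(pick_transport(500, False, True, False))
# -> NVMe/TCP (disaggregated, standard Ethernet)
```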
### How does NVMe/RoCE compare to local NVMe latency?
NVMe/RoCE uses RDMA to bypass the CPU’s TCP stack, reducing network overhead to 30–80 µs on a properly configured lossless fabric. Total NVMe/RoCE read latency is typically 80–150 µs, which is competitive with enterprise local NVMe drives. For workloads that need both low latency and the operational benefits of disaggregated storage — replication, live migration, CSI lifecycle management — NVMe/RoCE is the architecture that eliminates the trade-off. It requires RDMA-capable network adapters and a lossless Ethernet fabric (with PFC and ECN configured), which adds infrastructure prerequisites compared to NVMe/TCP.
### Can disaggregated NVMe/TCP replace local SSDs for databases?
For most production database workloads, yes. PostgreSQL, MySQL, Redis, and MongoDB running on NVMe/TCP storage with 200–500 µs block latency perform within a few percent of local NVMe configurations for typical query loads. The latency difference becomes significant primarily at very high concurrency with sub-millisecond query targets. Distributed databases like Cassandra and ClickHouse that manage their own data distribution often benefit less from disaggregated storage because they already handle replication at the application layer — local NVMe may be preferred for those specific systems. For single-instance databases on Kubernetes, NVMe/TCP disaggregated storage provides the Kubernetes operability that local NVMe cannot.
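A simple buffer-cache model shows why the delta often shrinks at the query level: most logical reads hit RAM and never reach storage. The hit ratios and latencies below are illustrative assumptions, not measurements:

```python
# Effective per-read latency under a buffer cache: most logical reads are
# served from RAM, so only cache misses pay the storage latency. All hit
# ratios and latencies are illustrative assumptions.

def effective_read_us(hit_ratio: float, storage_us: float,
                      ram_us: float = 1.0) -> float:
    """Expected latency per logical read given a cache hit ratio."""
    return hit_ratio * ram_us + (1 - hit_ratio) * storage_us

for hit in (0.90, 0.99):
    local = effective_read_us(hit, 100)   # local NVMe midpoint
    tcp = effective_read_us(hit, 350)     # NVMe/TCP midpoint
    print(f"hit ratio {hit:.0%}: local {local:.1f} us, "
          f"tcp {tcp:.1f} us per logical read")
# hit ratio 90%: local 10.9 us, tcp 35.9 us per logical read
# hit ratio 99%: local  2.0 us, tcp  4.5 us per logical read

# At high hit ratios the remaining per-read delta is small relative to the
# CPU cost of processing each row, which is why well-cached databases land
# within a few percent of local NVMe for typical query loads.
```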