
Chris Engelbert

SPDK Storage on Kubernetes: Performance vs CPU Efficiency

Mar 25, 2026  |  8 min read

Last edited: Mar 31, 2026


SPDK — the Storage Performance Development Kit — has become a serious consideration for platform teams running stateful workloads on Kubernetes. The framework enables user-space NVMe drivers and kernel bypass, removing large portions of the operating system’s IO path from the latency equation. For teams managing high-IO databases or analytics infrastructure, that matters. But SPDK is not a universal upgrade. The decision to build on SPDK-style storage requires a clear-eyed look at where the gains actually land and what they cost.

What SPDK Does — and Why the Kernel Path Matters

Traditional storage stacks in Linux route IO through the kernel: system calls, interrupt handling, block layer processing, driver scheduling. Each step adds latency, and under high concurrency that latency compounds. The kernel path is well-optimized for general workloads, but it was not designed around modern NVMe drives capable of sub-100 microsecond device latency.

SPDK addresses this by moving the NVMe driver entirely into user space. Applications interact with storage without crossing the kernel boundary for each IO. The result is a dramatically shorter software path, and in practice, this means storage latency that more closely reflects the hardware’s actual capability rather than the overhead layered on top of it.
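To make the kernel-path overhead concrete, here is a rough, self-contained timing sketch (not SPDK code): it measures the per-call cost of an ordinary read syscall against data that sits in the page cache, so what it times is almost entirely the software path that SPDK bypasses. The absolute numbers vary by machine; the point is that every kernel-path IO pays this toll before the device is even touched.

```python
# Rough illustration (not SPDK): time the per-call cost of a kernel-path
# read syscall. Data is served from the page cache, so the measured time
# is almost pure software-path overhead, not device latency.
import os
import tempfile
import time

fd, path = tempfile.mkstemp()
os.write(fd, b"x" * 4096)

N = 100_000
start = time.perf_counter()
for _ in range(N):
    os.pread(fd, 4096, 0)  # one full user -> kernel -> user crossing per call
elapsed = time.perf_counter() - start

print(f"avg kernel-path read cost: {elapsed / N * 1e6:.2f} us per call")

os.close(fd)
os.unlink(path)
```

Multiply that per-call cost by hundreds of thousands of IOPS and the motivation for a user-space driver becomes visible in plain arithmetic.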

For Kubernetes environments, where workloads share a distributed storage fabric, removing kernel-path overhead also reduces per-node CPU time spent processing IO interrupts. At high IO rates, that freed capacity is real — it shows up in benchmark results and, more importantly, in production workload headroom.

The Core Tradeoff: Polling vs Interrupts

The mechanism SPDK uses to avoid kernel overhead is polling. Instead of waiting for the hardware to signal completion via an interrupt, the SPDK thread continuously checks for IO completions in a tight loop. This is called busy-polling or spin-polling, and it is the source of both SPDK’s performance advantage and its resource cost.

Polling eliminates interrupt latency and jitter. Interrupt-driven IO has inherent variability: the time from IO completion to interrupt delivery to handler execution is not fixed. Under load, interrupt coalescing can delay notification further. Polling sidesteps all of this, which is why SPDK-based storage typically shows tighter p99 and p99.9 latency distributions than interrupt-driven alternatives — the tail is shorter and more predictable.
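The tail effect can be sketched with a toy model. The delay distributions below are invented for illustration (they are not measurements of any real system): the interrupt path gets Gaussian delivery jitter plus an occasional coalescing penalty, while the polling path pays at most one small poll interval.

```python
# Toy model of completion-notification delay. All numbers are assumed,
# illustrative values, not measurements.
import random

random.seed(42)

def interrupt_delay_us():
    base = random.gauss(8.0, 3.0)        # assumed delivery + handler jitter
    if random.random() < 0.02:           # assumed occasional coalescing delay
        base += random.uniform(20.0, 100.0)
    return max(base, 0.5)

def polling_delay_us():
    return random.uniform(0.1, 1.0)      # worst case: one poll interval

def p99(samples):
    return sorted(samples)[int(len(samples) * 0.99)]

irq = [interrupt_delay_us() for _ in range(100_000)]
poll = [polling_delay_us() for _ in range(100_000)]

print(f"interrupt p99: {p99(irq):.1f} us, polling p99: {p99(poll):.1f} us")
```

Even with generous assumptions for the interrupt path, the rare coalescing events dominate its p99, while the polling distribution stays tightly bounded.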

The cost is that polling consumes CPU continuously, regardless of whether IO is happening. A polling thread dedicated to storage processing is not available for application work. In configurations with dedicated storage CPU cores, this is an acceptable architectural choice. In CPU-constrained environments, or workloads with variable IO demand, that constant CPU draw becomes a real efficiency penalty.

When SPDK Architecture Wins

The strongest use cases for SPDK-based storage are those where storage latency has direct downstream effects on application performance and where IO rates are consistently high.

High-IO databases are the clearest example. PostgreSQL, MySQL, and distributed databases like Cassandra or ClickHouse all show measurable query latency improvements when the storage stack stops adding variable overhead. At p95 and p99, kernel-path jitter can be the dominant latency contributor — not the database engine itself. SPDK-style storage removes that variable from the equation.

Analytics workloads that scan large datasets also benefit. Sequential IO throughput improves with kernel bypass because the software path is not the bottleneck, and CPU overhead per IO unit drops, leaving more processing capacity for the analytical computation itself.

NVMe over Fabrics scenarios are another natural fit. NVMe-oF extends NVMe semantics over a network fabric, and SPDK is a common implementation choice for both the initiator and target sides. The combination produces a storage architecture where the network-attached device behaves much more like a local NVMe drive than a traditional SAN target would.
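For orientation, this is roughly what attaching an NVMe/TCP target looks like from the initiator side using the standard Linux kernel initiator and nvme-cli (the target here could be SPDK-based or not). The address, port, and NQN are placeholders, not values from any real deployment.

```shell
# Illustrative only: address, port, and NQN are placeholders.
modprobe nvme-tcp

# Discover subsystems exposed by the target
nvme discover -t tcp -a 192.0.2.10 -s 4420

# Connect; the namespace then appears as a local block device (e.g. /dev/nvme1n1)
nvme connect -t tcp -a 192.0.2.10 -s 4420 -n nqn.2024-01.io.example:subsystem1
```

From the application's point of view, the resulting device is just another NVMe block device, which is exactly the "behaves like a local drive" property described above.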

When It Adds Unnecessary Overhead

Not every Kubernetes workload justifies SPDK-style storage. For workloads with low or bursty IO demand — CI/CD artifact storage, log aggregation, lightweight web application backends — the polling CPU cost is paid constantly while the performance benefit is rarely exercised.

CPU-constrained environments are a particular concern. If a node is already running CPU-intensive application containers alongside storage processing, committing cores to polling can reduce overall workload density and force teams to over-provision hardware to compensate. The net result is worse platform economics, even if per-IO latency looks better in isolation.

Shared multi-tenant environments also require careful consideration. Polling CPU consumption is largely fixed, so a storage layer that is efficient for a single high-IO tenant becomes progressively less efficient when its cost is averaged across many low-IO tenants on shared infrastructure.

How to Evaluate: The Right Metrics

Platform teams evaluating SPDK-based storage should not stop at throughput numbers. The most revealing metrics are:

Latency percentiles under realistic load. Measure p50, p95, p99, and p99.9 under workload-representative IO patterns. Synthetic benchmarks at a single queue depth often show best-case numbers that do not reflect production contention behavior.

CPU per IO unit. Measure how many CPU cycles are consumed per thousand IOPS under target load. This reveals whether the SPDK polling cost is proportional to IO volume or whether it dominates at low loads.

Behavior under mixed workload contention. Run storage-intensive and CPU-intensive workloads simultaneously and observe how each is affected. This exposes interference patterns that single-workload benchmarks hide.

Recovery-time performance. Measure IO latency before, during, and after a simulated node failure or drive replacement. Consistency during recovery events is often more operationally important than peak performance under ideal conditions.
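The CPU-per-IO metric above has a simple structure worth internalizing: polling burns a roughly fixed number of cores regardless of load, while interrupt-driven processing scales with IO volume. The sketch below uses assumed cost constants (not measurements) to show where the crossover lands under such a model.

```python
# Back-of-envelope model with assumed, illustrative constants —
# not measured values for any real system.

POLL_CORES = 1.0            # a polling thread burns a full core regardless of load
IRQ_CORE_PER_KIOPS = 0.004  # assumed interrupt-path cost: cores per 1,000 IOPS

def cores_used(iops: int, polling: bool) -> float:
    if polling:
        return POLL_CORES
    return IRQ_CORE_PER_KIOPS * (iops / 1000)

for iops in (1_000, 50_000, 250_000, 1_000_000):
    p = cores_used(iops, polling=True)
    i = cores_used(iops, polling=False)
    winner = "polling" if p < i else "interrupts"
    print(f"{iops:>9} IOPS: polling {p:.2f} cores, interrupts {i:.2f} cores -> {winner} cheaper")
```

Whatever the real constants are on your hardware, the shape of the result is the same: below some IOPS threshold the interrupt path is cheaper per node, above it polling wins. Measuring that threshold for your workload mix is the point of the CPU-per-IO metric.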

How simplyblock Uses SPDK

simplyblock uses SPDK internally to deliver NVMe-grade storage performance over TCP networks. The architecture uses kernel bypass and user-space NVMe drivers on the storage node side, enabling high-throughput, low-latency IO delivery without requiring specialized network hardware like InfiniBand or RDMA-capable NICs.

For Kubernetes teams, this means accessing NVMe over Fabrics performance characteristics — tight latency distributions, high IOPS — over standard TCP/IP infrastructure. The CSI driver integration handles persistent volume provisioning and lifecycle through the standard Kubernetes API, so the storage architecture difference is largely transparent to application teams.
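From the application team's side, consuming such storage is ordinary Kubernetes plumbing. The manifest below is an illustrative sketch only: the provisioner name and parameters are placeholders, not the actual simplyblock CSI values, which should be taken from the driver's documentation.

```yaml
# Illustrative sketch — provisioner name and parameters are placeholders.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: simplyblock-nvme
provisioner: csi.simplyblock.example   # placeholder provisioner name
parameters:
  qos-class: high                      # placeholder parameter
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: simplyblock-nvme
  resources:
    requests:
      storage: 100Gi
```

Pods reference the claim as usual; the NVMe/TCP and SPDK machinery stays behind the CSI boundary.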

The polling CPU cost is managed through dedicated storage processing resources on the simplyblock storage nodes, keeping that overhead off the application node pool. Platform teams get the performance benefit without needing to allocate polling cores from their workload capacity.

A Workload-Driven Adoption Strategy

The most effective way to adopt SPDK-based storage is to start with the workload class where storage latency is a demonstrated bottleneck, not a theoretical concern. Identify workloads where p99 latency is actively limiting application throughput or service-level achievement. Benchmark those workloads against both architectures under production-representative traffic.

Once a high-IO workload class shows clear, measurable benefit, the expansion case becomes evidence-based rather than architectural preference. This avoids platform-wide decisions made on incomplete assumptions and reduces rollout risk significantly.

Common Measurement Mistakes

Teams evaluating SPDK storage frequently make a few repeatable mistakes. Running benchmarks at a single queue depth understates real-world throughput and overstates latency predictability. Testing only during low-contention periods misses how the stack behaves when compute and storage compete for CPU. Comparing throughput without normalizing for CPU cost makes SPDK look better than it is in CPU-constrained situations. And focusing on average latency instead of p99 latency misses the tail behavior that actually affects user-visible application performance.
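The single-queue-depth mistake follows directly from Little's law: sustained throughput is bounded by queue depth divided by per-IO latency. A quick calculation with an assumed 80-microsecond completion latency (real devices also see latency rise as depth grows, which this sketch ignores) shows how badly QD=1 understates a device's capability.

```python
# Little's law: IOPS = queue_depth / latency.
# LATENCY_S is an assumed, illustrative per-IO latency; real devices
# also see latency grow with queue depth, which this sketch ignores.
LATENCY_S = 80e-6  # 80 microseconds per IO

def iops_at(queue_depth: int, latency_s: float = LATENCY_S) -> float:
    return queue_depth / latency_s

for qd in (1, 8, 32):
    print(f"QD={qd:>2}: ~{iops_at(qd):,.0f} IOPS")
```

At QD=1 the modeled ceiling is 12,500 IOPS; at QD=32 it is 32 times higher. A benchmark pinned at QD=1 is measuring latency, not the throughput the stack can actually deliver, and the same reasoning explains why averages hide the tail behavior discussed above.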

The strategic objective is not maximum theoretical speed. It is predictable service performance with sustainable resource economics across the workload mix that actually runs in production.

Questions and Answers

Does SPDK automatically improve every Kubernetes workload?

No. SPDK-style storage is most valuable where IO path overhead is a genuine, measurable bottleneck and where performance consistency at high IO rates has direct business impact. For lightweight or bursty workloads, the CPU overhead of polling typically outweighs the performance benefit.

Why is CPU efficiency part of the SPDK evaluation?

Because polling-based IO processing consumes CPU continuously, independent of actual IO volume. In environments where application and storage workloads share CPU capacity, that constant draw reduces workload density and can push teams toward over-provisioned hardware. The performance gain must be weighed against this cost per workload class.

What should be measured beyond throughput?

Latency percentiles at p95, p99, and p99.9 under realistic load are more actionable than average throughput. CPU per IO unit reveals the true resource cost. Behavior under mixed workload contention shows how storage affects application CPU availability. Recovery-time latency reveals how consistent the stack is during the fault conditions that matter most in production.

How should teams roll out SPDK-oriented storage safely?

Start with the one or two workload classes where storage latency is a documented bottleneck and IO rates are consistently high. Run both architectures under representative traffic, measure the full metric set, and make the expansion decision based on that evidence. Avoid platform-wide adoption decisions based on single-workload benchmark results.
