Most Kubernetes deployments treat availability zones as scheduling boundaries, not as active-active storage domains. Stretched cluster storage goes further: it extends a single Kubernetes or storage cluster across two geographically separate sites or availability zones, replicating data synchronously or asynchronously between them so that a complete site failure causes neither data loss nor a manual recovery procedure. The goal is zero RPO for writes that complete before the failure, with automatic failover when one site goes down.
The practical constraint for synchronous stretched clusters is inter-site latency. Every write must be acknowledged by both sites before the application receives a completion response. A 4 ms round-trip adds 4 ms to every synchronous write — which is tolerable for many workloads but becomes noticeable for latency-sensitive databases operating at high IOPS. The commonly accepted upper limit for synchronous replication is 5 ms RTT, corresponding to roughly 500 km between sites on a direct fiber path.
How Stretched Cluster Storage Works
A stretched cluster presents a single Kubernetes control plane — one API server, one scheduler, one etcd quorum — with nodes distributed across two (or more) sites. Storage volumes are replicated between sites at the block level, so the data set at both sites is identical (with synchronous replication) or near-identical (with asynchronous replication).
The Kubernetes scheduler places workloads using topology labels (e.g., topology.kubernetes.io/zone) and affinity rules. In an active-active stretched cluster, workloads run at both sites simultaneously with their storage replicated underneath. In an active-passive configuration, workloads run at the primary site and fail over to the secondary when the primary becomes unavailable.
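As a minimal sketch of the active-passive pattern (the zone names and image are placeholders, not values from any specific deployment), a Deployment can prefer the primary zone through node affinity, letting the scheduler reschedule onto the secondary zone if the primary's nodes become unavailable:

```yaml
# Active-passive placement sketch. "zone-a"/"zone-b" and the image are
# placeholders; substitute your actual zone labels and workload.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: primary-site-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: primary-site-app
  template:
    metadata:
      labels:
        app: primary-site-app
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values: ["zone-a"]  # primary site; falls back to zone-b if unschedulable
      containers:
      - name: app
        image: example/app:latest  # placeholder image
```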
Quorum is critical for stretched clusters. With two sites, a network partition creates a split-brain scenario in which each site believes the other has failed. Most stretched cluster implementations therefore add a tiebreaker at a third site: a lightweight quorum node or arbitrator — often a cloud availability zone or a small VM — that casts the deciding vote during a partition event.
Synchronous vs Asynchronous Replication
The replication mode determines the RPO and the performance impact on the write path.
Synchronous replication writes data to both sites before acknowledging the write to the application. RPO is zero for any write that completes — the application cannot observe a write that is not durably stored at both sites. The cost is added write latency equal to the inter-site round-trip time. This mode requires low-latency links, typically under 5 ms RTT, which limits deployment to metro-area distances or colocation facilities with direct fiber interconnects.
Asynchronous replication acknowledges writes at the primary site and ships data to the secondary site in the background. This decouples performance from inter-site latency, allowing stretched clusters across larger distances. The RPO is non-zero — if the primary site fails, any writes buffered for async transfer but not yet committed at the secondary are lost. RPO is typically measured in seconds to minutes depending on replication lag.
Stretched Cluster vs Cross-Cluster Replication vs Backup
| Attribute | Stretched cluster | Cross-cluster replication | Backup |
|---|---|---|---|
| RPO | Zero (sync) or low seconds (async) | Seconds to minutes | Hours to days |
| RTO | Seconds to low minutes (automatic failover) | Minutes — requires cluster promotion | Hours — full restore required |
| Network requirement | Low latency (<5 ms RTT for sync); any for async | Tolerates WAN latency; bandwidth sized to change rate | Periodic bulk transfer only |
| Operational complexity | High — single cluster spanning two sites | Medium — two separate clusters | Low — restore only when needed |
| Failover type | Automatic, stateful | Manual or automated, may lose recent writes | Manual, point-in-time restore |
Table 1: Stretched cluster, cross-cluster replication, and backup compared
Planning a stretched cluster for Kubernetes storage? Simplyblock supports cross-zone and cross-cluster replication with NVMe/TCP or NVMe/RoCE data paths that keep replication overhead low on sub-5 ms RTT links. Explore simplyblock cross-cluster replication →
Kubernetes Patterns for Stretched Storage
Several patterns appear in production stretched Kubernetes storage deployments:
Pod anti-affinity across zones. StatefulSet replicas use podAntiAffinity to force each replica to a different zone. This distributes the application workload while the storage layer replicates the backing volumes between zones.
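A sketch of that stanza in a StatefulSet's pod template (the app label is illustrative):

```yaml
# Pod template fragment: no two replicas labeled app=db may share a zone.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: db  # illustrative label; match your StatefulSet's pod labels
      topologyKey: topology.kubernetes.io/zone
```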
Topology-spread constraints. topologySpreadConstraints distribute pods evenly across zones, ensuring that workloads continue running if one zone loses all its nodes.
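A sketch of the equivalent spread constraint in a pod spec, again with an illustrative label:

```yaml
# Pod spec fragment: spread app=db pods across zones with at most one pod
# of difference between any two zones; refuse to schedule otherwise.
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: db  # illustrative label
```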
Zone-aware StorageClasses. CSI drivers that support topology can provision volumes in the pod’s zone and replicate asynchronously to the other zone, reducing read latency while maintaining synchronous writes for the primary replica.
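A hypothetical topology-aware StorageClass might look like the following; the provisioner name and zone values are placeholders, not any specific driver's API:

```yaml
# Hypothetical StorageClass sketch: csi.example.com and the zone names are
# placeholders. WaitForFirstConsumer delays provisioning until the pod is
# scheduled, so the volume lands in the pod's zone.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: zone-local-replicated
provisioner: csi.example.com
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
- matchLabelExpressions:
  - key: topology.kubernetes.io/zone
    values: ["zone-a", "zone-b"]
```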
Tiebreaker quorum nodes. A small node in a third zone or region acts as an etcd voter and storage arbitrator to prevent split-brain during a network partition.
Latency Requirements for Synchronous Stretched Clusters
The 5 ms RTT guideline for synchronous replication is a practical limit, not a hard protocol boundary. The actual impact depends on the workload’s write pattern:
Consider a database that needs 10,000 synchronous writes per second. With one write outstanding at a time, each write waits a full round trip, so a 5 ms RTT caps the stream at 200 writes per second (1,000 ms ÷ 5 ms), far below target. Most production databases use group commit and write batching to amortize the RTT cost across multiple transactions. With group commit and batching enabled, the same 5 ms RTT may be invisible at the application level for moderate write rates.
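A back-of-the-envelope model makes the batching effect concrete: with one outstanding write per round trip, throughput is capped at the inverse of the RTT, and a batch of B writes per round trip multiplies that ceiling by B:

$$
\text{rate}_{\max} = \frac{1}{\text{RTT}} = \frac{1}{5\,\text{ms}} = 200\ \text{writes/s},
\qquad
\text{rate}_{\text{batched}} = \frac{B}{\text{RTT}}
$$

At a batch size of B = 50, the same 5 ms link sustains 10,000 writes per second, which is why group commit recovers the target rate in the example above.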
For high-IOPS NVMe workloads, even 1–2 ms of added replication latency can widen p99 tail latency significantly under concurrent load. This is why NVMe/TCP and NVMe/RoCE efficiency matters in stretched cluster storage: the replication path must add as little protocol overhead as possible to the inherent RTT cost.
How Simplyblock Supports Stretched Cluster Storage
Simplyblock supports cross-zone replication and cross-cluster replication for stretched storage configurations. The NVMe/TCP and NVMe/RoCE data paths add minimal protocol overhead, keeping the effective replication latency close to the raw inter-site RTT.
For stretched deployments, simplyblock replicates at the logical volume level, which means replication is transparent to the application and managed through standard Kubernetes StorageClass and PVC configuration. Teams configure replication factor and topology constraints in the StorageClass, and the simplyblock CSI driver enforces placement and replication across zones.
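As an illustration only, such a StorageClass could take a shape like the sketch below; the provisioner string and parameter keys are hypothetical stand-ins rather than simplyblock's documented names, so consult the simplyblock CSI driver documentation for the actual parameters:

```yaml
# Hypothetical sketch only: the provisioner string and parameter keys below are
# illustrative placeholders, not the simplyblock CSI driver's documented names.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: stretched-sync
provisioner: csi.simplyblock.io        # placeholder provisioner name
volumeBindingMode: WaitForFirstConsumer
parameters:
  replicationFactor: "2"               # illustrative: one replica per site
  replicationMode: "synchronous"       # illustrative: zero-RPO write path
```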
For compliance-driven zero-RPO requirements, simplyblock supports synchronous replication for metro-distance deployments. For wider geographic deployments where latency exceeds 5 ms, asynchronous replication with configurable lag monitoring is available. Teams evaluating high availability block storage design and zero-RPO requirements will find both modes supported.
Related Terms
These glossary pages cover the replication, high availability, and topology concepts related to stretched cluster storage.
- Cross-Cluster Replication
- Cross-Zone Replication
- What Is RPO
- High Availability Block Storage Design
- Asynchronous Storage Replication
Questions and Answers
What is a stretched cluster in Kubernetes storage?
A stretched cluster is a single Kubernetes cluster with nodes distributed across two or more geographically separate sites or availability zones, with storage data replicated between sites at the block level. Unlike a multi-cluster setup with separate control planes, a stretched cluster uses a single API server, etcd, and scheduler — pods can be scheduled to any node in either site, and the storage layer ensures that data written at one site is available at the other. This enables automatic failover if one site goes down, without requiring cluster promotion or manual DNS changes.
What latency is required for synchronous stretched cluster replication?
The widely accepted guideline is under 5 ms round-trip time (RTT) between sites for synchronous replication. This limit exists because synchronous replication adds one inter-site round trip to every write acknowledgment — a 4 ms RTT adds 4 ms to every write. At 5 ms RTT, most applications with batching and group commit can absorb the latency without visible performance degradation. Beyond 5 ms, write throughput drops and write latency increases enough to impact databases and other latency-sensitive workloads. Asynchronous replication removes this constraint by decoupling the write acknowledgment from the inter-site transfer, but introduces a non-zero RPO.
How does stretched cluster storage differ from backup?
Stretched cluster storage provides continuous, real-time data replication with RPO measured in seconds (async) or zero (sync) and RTO measured in seconds to low minutes with automatic failover. Backup is a periodic, point-in-time copy of data stored separately, with RPO typically measured in hours and RTO measured in hours — restore takes time proportional to data size. A backup cannot replace a stretched cluster for zero-RPO or fast-RTO requirements; it is a complementary practice for protecting against logical corruption, ransomware, or accidental deletion rather than site failures.
When should teams use stretched cluster storage vs cross-cluster replication?
Stretched cluster storage is appropriate when the priority is automatic failover with zero or near-zero RPO and minimal RTO, and when the inter-site network supports the required latency (under 5 ms for sync, any latency for async). It requires a shared Kubernetes control plane across both sites, which adds architectural complexity. Cross-cluster replication uses two independent Kubernetes clusters, each with its own control plane. Failover requires promoting the secondary cluster, which takes longer and typically involves manual steps or automation tooling. Cross-cluster is better for wide-area geographic distribution, independent cluster lifecycle management, and scenarios where the two sites need to operate completely independently under normal conditions.