Kubernetes was originally designed for containerized workloads, but many platform teams are now running virtual machines on the same cluster through KubeVirt. This convergence simplifies infrastructure operations — one control plane, one networking layer, one storage platform — but it also creates a new data protection problem. VM backup and container backup have historically been separate disciplines, and the tooling assumptions behind each do not map cleanly onto the other when both workload types live inside Kubernetes.
This post is aimed at platform engineering teams who need to design or harden backup and restore workflows for mixed KubeVirt and container environments. The goal is practical: what actually needs protecting, what tools are available, where the gaps are, and how to build recovery workflows that hold up under pressure.
What KubeVirt Changes About VM Backup
KubeVirt extends Kubernetes with custom resource definitions that represent virtual machines as native Kubernetes objects. A VirtualMachine object defines the desired state of a VM; a running VM is represented as a VirtualMachineInstance (VMI). The persistent disk storage for that VM is backed by a DataVolume, which is in turn backed by a PVC. The Containerized Data Importer (CDI) manages importing disk images into those volumes.
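A minimal sketch of how these objects fit together (the VM name, image URL, and sizes are illustrative, not taken from any real deployment):

```yaml
# Hypothetical example: a VirtualMachine whose root disk is
# provisioned by CDI from an HTTP source via a DataVolume template.
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: demo-vm                  # illustrative name
spec:
  running: true
  template:
    spec:
      domain:
        devices:
          disks:
            - name: rootdisk
              disk:
                bus: virtio
        resources:
          requests:
            memory: 2Gi
      volumes:
        - name: rootdisk
          dataVolume:
            name: demo-vm-rootdisk
  dataVolumeTemplates:
    - metadata:
        name: demo-vm-rootdisk
      spec:
        source:
          http:
            url: https://example.com/disk.qcow2   # illustrative source
        pvc:
          accessModes: [ReadWriteOnce]
          resources:
            requests:
              storage: 20Gi
```

Once CDI finishes the import, the DataVolume is backed by a standard PVC named `demo-vm-rootdisk` — the object a backup tool must snapshot alongside the VirtualMachine manifest above.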
This architecture is powerful because it allows VMs to be managed through the same GitOps workflows and namespace-level policies as containers. But it introduces backup complexity that traditional VM backup tools are not designed for. Legacy VM backup agents understand hypervisor APIs — they talk to vSphere or Hyper-V to freeze guests and capture consistent snapshots. In KubeVirt, there is no hypervisor API to call. The VM’s disk is just a PVC, and the VM’s configuration is a Kubernetes object in etcd.
If you only back up the PVC, you have the disk data but not the VM configuration. If you only export the Kubernetes manifest, you have the object but not the data. Neither is sufficient for recovery. Backup tooling for KubeVirt needs to handle both the Kubernetes API objects and the underlying storage volumes together.
How KubeVirt VM Storage Works
Understanding the storage model is a prerequisite to getting backup right. KubeVirt VMs typically use DataVolumes, which are managed by CDI. A DataVolume describes a desired disk state — it might import from a container image, a URL, or an existing PVC — and CDI handles the actual data movement. Once the import completes, the DataVolume is backed by a standard PVC, which a CSI driver provisions on the cluster.
For backup purposes, this means the disk state lives in a PVC, and the PVC can be snapshotted using standard Kubernetes VolumeSnapshot objects if the CSI driver supports it. Simplyblock’s storage driver supports CSI snapshots, which makes this path efficient — snapshots are copy-on-write at the storage layer and do not require copying the entire disk image out of the cluster.
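Taking a storage-layer snapshot of a VM disk is then a standard VolumeSnapshot against the DataVolume's PVC. A sketch, assuming a snapshot-capable CSI driver (the PVC and snapshot class names are illustrative):

```yaml
# Hypothetical example: snapshotting the PVC behind a DataVolume.
# By default the PVC carries the same name as its DataVolume; the
# snapshot class must belong to a CSI driver that supports snapshots.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: demo-vm-rootdisk-snap
spec:
  volumeSnapshotClassName: csi-snapclass      # illustrative class name
  source:
    persistentVolumeClaimName: demo-vm-rootdisk
```

This captures only the disk; the VirtualMachine manifest still has to be backed up separately, which is why coordinated tooling matters.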
The VM’s configuration — its CPU and memory allocation, network interfaces, boot order, and references to its data volumes — lives as a VirtualMachine object in the Kubernetes API. This is what needs to be captured alongside the PVC snapshot.
Backup Tooling for KubeVirt
The most practical tool for backing up KubeVirt workloads today is OADP (OpenShift API for Data Protection), which is a Red Hat-supported distribution of Velero with a KubeVirt-aware plugin. The KubeVirt Velero plugin understands the relationship between VirtualMachine, VirtualMachineInstance, DataVolume, and PVC objects. It coordinates the backup so that the Kubernetes manifest and the storage snapshot are captured together as a consistent unit.
For teams not using OpenShift, upstream Velero with the kubevirt-velero-plugin is an option, though it requires more manual configuration. The plugin adds support for freezing the guest filesystem before the snapshot (via the guest agent) and for restoring the full VM object graph rather than just individual PVCs.
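With the plugin installed, a backup is expressed as a regular Velero Backup object scoped to the namespace holding the VMs. A hedged sketch (namespace and names are assumptions for illustration):

```yaml
# Hypothetical example: a Velero Backup covering a namespace that
# contains KubeVirt VMs. With the kubevirt-velero-plugin installed,
# Velero captures the VirtualMachine object graph together with
# CSI snapshots of the backing PVCs.
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: vm-namespace-backup
  namespace: velero
spec:
  includedNamespaces:
    - vm-workloads               # illustrative VM namespace
  storageLocation: default       # pre-configured BackupStorageLocation
```

The important property is that VirtualMachine, DataVolume, and PVC objects plus the volume snapshots land in one backup, so a restore can rebuild the full object graph.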
Traditional VM backup approaches — agent-based tools that hook into hypervisor APIs — do not apply here. There is no vSphere API or Hyper-V VSS integration to call. Teams that try to adapt legacy VM backup tooling to KubeVirt inevitably end up with crash-consistent snapshots at best and missing manifest data at worst.
Application-Consistent VM Backups
Crash-consistent snapshots capture the disk at a point in time, but the guest OS may have in-flight writes that leave the filesystem in an inconsistent state. For many workloads this is recoverable — Linux filesystems replay journals on mount. But for database workloads running inside a KubeVirt VM, crash-consistent snapshots can lose in-flight transactions or leave data files torn and partially written.
Application-consistent backups require the guest OS to flush dirty writes and freeze I/O before the snapshot is taken. In KubeVirt, this is done through the QEMU guest agent running inside the VM. When the guest agent is present and the backup tool supports it, the tool issues a freeze/thaw cycle: the guest freezes its filesystems, the storage snapshot is taken, and then the guest is thawed. The result is a snapshot that looks to the guest OS like a clean shutdown, making recovery straightforward.
The guest agent must be installed and running inside the VM image. Teams often discover this gap during incident response rather than during planning. It belongs in the VM image build checklist, not the incident runbook.
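One way to bake the agent into the image build rather than the runbook is a cloud-init fragment like the following (a sketch, assuming a distribution where the agent package is named `qemu-guest-agent`):

```yaml
#cloud-config
# Hypothetical image-build fragment: install and enable the QEMU
# guest agent so backup tooling can issue freeze/thaw during snapshots.
packages:
  - qemu-guest-agent
runcmd:
  - systemctl enable --now qemu-guest-agent
```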
Container Workload Backups and Namespace-Level Coordination
Container workloads in the same cluster are backed up using Velero’s standard namespace-level backup capability. Kubernetes objects — Deployments, Services, ConfigMaps, Secrets — are captured from the API, and PVCs are snapshotted via CSI VolumeSnapshots.
The key coordination challenge in mixed environments is that business services often span namespace boundaries or combine VM and container components. A restore operation that brings back only the container side of a service without restoring the VM-hosted component leaves the service degraded at best.
Teams should map service dependencies before designing backup schedules. If a containerized API tier depends on a KubeVirt-hosted database VM, those two components need backup schedules that are close enough in time to produce a coherent restore point. Restoring the API namespace to a snapshot from 11:00 AM while restoring the VM to a snapshot from 9:45 AM creates a 75-minute data gap that the service will surface as corruption or errors.
RPO and RTO Design Across Workload Types
RPO and RTO targets should be defined per business service, not per runtime type. A service with a low RPO — say, 15 minutes — needs frequent coordinated snapshots of both its container namespaces and its KubeVirt VMs. A service with a looser RPO can use less frequent backups and simpler tooling.
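In Velero terms, "frequent coordinated snapshots" means putting both halves of the service on the same cron expression. A sketch for a 15-minute RPO (namespaces and schedule names are illustrative):

```yaml
# Hypothetical example: two Velero Schedules on the same cron
# expression, so the container and VM halves of one service are
# snapshotted close together in time.
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: api-tier-every-15m
  namespace: velero
spec:
  schedule: "*/15 * * * *"
  template:
    includedNamespaces:
      - api-tier                 # illustrative container namespace
---
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: db-vm-every-15m
  namespace: velero
spec:
  schedule: "*/15 * * * *"
  template:
    includedNamespaces:
      - db-vms                   # illustrative KubeVirt namespace
```

Aligned schedules do not guarantee perfectly simultaneous snapshots, but they keep the restore points of the two components within a window the service can tolerate.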
RTO drives tooling decisions differently. Fast restores require that backup data is accessible nearby — ideally as storage-layer snapshots that can be cloned rather than large backup archives that need to be transferred. CSI-level snapshots restore much faster than restoring from off-cluster object storage because the data never leaves the storage layer. For critical workloads, keeping recent snapshots close and shipping older ones off-cluster gives the best balance.
Document recovery sequences explicitly. Which components come up first? What health checks confirm readiness before dependent services start? These sequences need to be rehearsed — not just written down.
Recovery Validation
A backup that has never been restored is an assumption, not a guarantee. Recovery validation should be a scheduled, recurring activity for any service that matters.
For KubeVirt VMs, isolated restore means bringing the VM up in a separate namespace or cluster segment with network isolation, verifying that the guest OS boots cleanly, running application-level checks, and confirming that the restored data matches expectations. For containers, it means restoring the namespace into an isolated context and running smoke tests.
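Velero's `namespaceMapping` field supports this isolated-restore pattern directly. A hedged sketch (the backup name and namespaces are assumptions):

```yaml
# Hypothetical example: restoring a backup into a separate namespace
# for validation, so the restored objects do not collide with the
# live service.
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: db-vms-validation
  namespace: velero
spec:
  backupName: db-vm-backup-example   # illustrative backup name
  namespaceMapping:
    db-vms: db-vms-validate          # source -> isolated namespace
```

Network isolation for the validation namespace still has to be applied separately, for example with a default-deny NetworkPolicy, before booting the restored VM.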
Common failure modes to validate against: VM boots but the guest agent is not running (affects future consistent snapshots); restored database is missing recent transactions; containerized service cannot reach its restored VM-hosted dependency because network policies were not restored correctly.
Common Mistakes
Backing up PVCs without the VM manifest is the most frequent error. The disk data alone is not sufficient — the VM configuration, resource requests, network attachment definitions, and device configuration all need to be captured and restored together.
Skipping the guest agent installation is the second most common gap. Without it, snapshots of database or stateful workloads are crash-consistent only, which may not be recoverable cleanly.
Missing coordinated timing between VM and container backups produces inconsistent restore points for services that span both. This is a policy design problem, not a tooling problem — the tooling supports coordination, but teams have to configure it deliberately.
For more on KubeVirt storage architecture and how storage design decisions affect operational outcomes, see the KubeVirt storage use case.
Questions and Answers
Why can’t traditional VM backup tools be used for KubeVirt?
Traditional VM backup tools communicate with hypervisor APIs like vSphere or Hyper-V. KubeVirt does not expose those APIs. VMs in KubeVirt are represented as Kubernetes objects, and their disks are standard PVCs. Backup tooling needs to understand the Kubernetes object model and coordinate with CSI snapshots, not hypervisor hooks.
What happens if you restore a KubeVirt VM from a PVC snapshot without the VM manifest?
You end up with a disk volume but no VM definition. Kubernetes does not know what CPU, memory, network, or boot configuration to use. You cannot start the VM without manually recreating or restoring the VirtualMachine object. Always capture both the manifest and the PVC snapshot together.
How do you get application-consistent snapshots for a database running inside a KubeVirt VM?
Install the QEMU guest agent inside the VM image and use a backup tool that supports guest agent freeze/thaw coordination, such as OADP with the KubeVirt Velero plugin. Before the storage snapshot is taken, the tool issues a filesystem freeze through the guest agent, which flushes pending writes. The snapshot is taken in the frozen state, and the guest is then thawed. The result is a consistent snapshot that the database can recover from cleanly.
How should RPO targets be coordinated between VM and container components of the same service?
Define the RPO at the service level, not the component level. If a service requires a 15-minute RPO, both its VM and container components need backup schedules that satisfy that window and that are close enough in time to produce a coherent joint restore point. Misaligned backup schedules produce a data gap between VM and container state that the service will surface as inconsistencies during recovery.