OpenShift Virtualization can make virtual machines part of the OpenShift operating model, but storage determines whether that model feels production-ready. VM disks still need predictable latency, recovery, snapshot, clone, and migration behavior.
Use this runbook when planning KubeVirt storage, OpenShift Virtualization onboarding, a VMware migration to OpenShift, or a broader OpenShift HCI design.
Runbook Scope
This runbook is for the storage portion of an OpenShift Virtualization readiness review. It does not replace cluster hardening, networking, security, application modernization, or VM migration tooling. It helps answer a narrower question: can the storage layer support VM disks and stateful platform workloads in a way the operations team can trust?
Use it before:
- Moving production VMs onto OpenShift Virtualization.
- Replacing vSAN or SAN assumptions during a VMware-exit program.
- Standardizing VM disk StorageClasses.
- Running mixed VM and container workloads on the same OpenShift platform.
- Creating golden-image or clone-heavy workflows for VM onboarding.
- Defining production recovery procedures for VM disks and attached data volumes.
Step 1: Classify VM Disk Workloads
Do not treat every VM disk the same. Start by classifying workloads:
| Workload class | Storage concern | Runbook question |
|---|---|---|
| Boot disks | Provisioning, clone speed, image consistency | Can the team clone or rebuild VMs without manual storage tickets? |
| Database disks | p99 latency, write behavior, recovery | Does the storage class protect latency under shared load? |
| Application disks | Balanced performance and capacity | Is this class separate from latency-critical databases? |
| Migration candidates | Rollback and coexistence | Can old and new storage paths coexist during phased migration? |
| Temporary validation disks | Short-lived migration, QA, and staging work | Is there an owner and cleanup policy? |
The output of this step should be a small workload map, not a long taxonomy. If the team cannot explain why a disk belongs to a class, the class probably is not useful yet.
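A minimal way to make that workload map durable is to record the class on the VM object itself so it can be queried later. The sketch below is illustrative only: the label key, VM names, namespaces, and class names are placeholders, not a KubeVirt convention.

```bash
# Illustrative only: label key, VM names, and class values are placeholders.
# Record the workload class directly on each VirtualMachine.
oc label vm pg-prod-01   -n payments  workload-class=database
oc label vm app-frontend -n payments  workload-class=application
oc label vm win2019-test -n migration workload-class=temporary

# List VMs by class when reviewing StorageClass assignments in Step 2.
oc get vm -A -L workload-class
```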
Step 2: Validate StorageClass Design
Each VM disk StorageClass should have an explicit purpose. At minimum, define:
- A VM disk class for ordinary VM workloads.
- A latency-sensitive class for databases and critical data services.
- A capacity-efficient class for lower-pressure workloads.
- A snapshot/clone policy for image-based workflows.
- A failure-domain policy that matches the platform topology.
- A migration-staging class if the VMware exit plan uses temporary clone or test volumes.
If every VM lands on the default class, the platform team loses control over performance, recovery, and cost behavior. If there are too many classes, the platform becomes hard to support. The right target is usually a small set of named intents with clear acceptance criteria.
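As a sketch of what named intents can look like, the example below defines two classes with different recovery behavior. The class names, provisioner, and settings are placeholders; substitute the CSI driver and tuning options your platform actually uses.

```bash
# Sketch of two named-intent classes. The provisioner and parameters below are
# placeholders; substitute the CSI driver and options your platform actually uses.
cat <<'EOF' | oc apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: vm-disk-standard          # ordinary VM boot and application disks
provisioner: csi.example.com      # placeholder CSI driver
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: vm-disk-latency           # databases and latency-critical data services
provisioner: csi.example.com      # placeholder CSI driver
reclaimPolicy: Retain             # keep volumes around for manual recovery review
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
EOF
```

WaitForFirstConsumer binding keeps volume placement aligned with VM scheduling; whether that is the right choice depends on the topology decisions made in the failure-domain policy above.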
Step 3: Check Access and Migration Assumptions
Live migration is one of the first places where storage architecture becomes visible. Before committing to a design, confirm:
- Whether the volume mode and access modes of VM disks support the migration path the team expects.
- Whether shared storage behavior is required for the target VM workflow.
- Whether node drain, maintenance, and failover procedures are documented.
- Whether migration performance is acceptable under normal platform load.
- Whether the team knows which VM disks can tolerate shutdown migration instead of live migration.
- Whether storage behavior differs between boot disks and attached data disks.
Red Hat documents live migration behavior and requirements in the OpenShift Virtualization documentation. Treat that as platform-specific implementation guidance, not as a substitute for workload testing.
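A quick way to ground those questions, assuming CDI StorageProfiles are in use, is to check what access and volume modes a class defaults to and what existing VM disk PVCs actually carry. The class name below is the placeholder from Step 2.

```bash
# Check what access and volume modes CDI will default to for a given StorageClass.
# "vm-disk-standard" is the placeholder class name from Step 2.
oc get storageprofile vm-disk-standard -o yaml

# Live migration generally depends on shared (ReadWriteMany) disks; list the modes
# actually in use by existing VM disk PVCs before assuming migration will work.
oc get pvc -A -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,MODES:.spec.accessModes,VOLMODE:.spec.volumeMode
```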
Step 4: Validate Snapshots, Clones, and Recovery
For VM workloads, storage is part of the day-2 operating model. Test these workflows before production:
| Workflow | What to test | Pass condition |
|---|---|---|
| Snapshot | Create, restore, and delete snapshots for representative VM disks | Operators can recover without manual backend intervention. |
| Clone | Create a test VM from a source disk or template | Clone workflow is fast enough for actual platform use. |
| Rollback | Recover after a failed change | Recovery path is documented and repeatable. |
| Backup handoff | Confirm backup tooling and snapshot behavior | Ownership between platform and backup teams is clear. |
| Golden image | Provision multiple VMs from a known-good image | The image workflow does not create uncontrolled storage sprawl. |
| Deletion | Delete test VMs and attached disks | Volumes, snapshots, and orphaned objects are cleaned up predictably. |
Do not test these only with empty VMs. Use at least one VM that looks like a real production candidate: realistic disk size, write pressure, backup policy, and recovery expectation.
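The sketch below shows one way to exercise the snapshot and clone rows, assuming a compatible VolumeSnapshotClass and CDI are configured. All names, namespaces, and sizes are placeholders, and API versions may differ by OpenShift Virtualization release.

```bash
# Snapshot a VM with the KubeVirt snapshot API (requires a compatible
# VolumeSnapshotClass). Names and namespaces are placeholders.
cat <<'EOF' | oc apply -f -
apiVersion: snapshot.kubevirt.io/v1beta1
kind: VirtualMachineSnapshot
metadata:
  name: pg-prod-01-pre-change
  namespace: payments
spec:
  source:
    apiGroup: kubevirt.io
    kind: VirtualMachine
    name: pg-prod-01
EOF

# Clone an existing boot disk into another namespace via a CDI DataVolume.
# Time this step: clone speed is part of the pass condition above, and
# cross-namespace clones need the appropriate RBAC.
cat <<'EOF' | oc apply -f -
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: clone-of-golden-rhel9
  namespace: migration
spec:
  source:
    pvc:
      namespace: vm-images
      name: golden-rhel9
  storage:
    storageClassName: vm-disk-standard
    resources:
      requests:
        storage: 30Gi
EOF
```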
Step 5: Test Node Drain and Maintenance
OpenShift Virtualization storage has to survive the boring operational events, not only the dramatic failures.
Run these checks:
- Drain a worker node hosting VM workloads and observe volume behavior.
- Restart relevant storage components in a controlled test window.
- Validate what happens if a node becomes unreachable while VM disks are under write pressure.
- Confirm whether the storage layer rebuilds, reconnects, or remaps paths in the expected time.
- Check whether p99 latency returns to baseline after the event.
- Confirm that alerting fires on the right symptoms instead of only on total outage.
The output should be a runbook entry with owner, command sequence, expected symptoms, rollback procedure, and a pass/fail threshold.
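A minimal version of the drain check, with a placeholder node name, looks like this:

```bash
# Drain a worker that hosts VM workloads. The node name is a placeholder.
oc adm drain worker-03 --ignore-daemonsets --delete-emptydir-data

# Watch how KubeVirt reacts: live migrations triggered by the drain...
oc get virtualmachineinstancemigrations -A -w

# ...and whether volumes detach and reattach in the expected time.
oc get volumeattachments | grep worker-03

# When the test window is done, return the node to service.
oc adm uncordon worker-03
```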
Step 6: Measure Tail Latency
Do not accept “the storage is fast” based on average latency. For mixed VM and container platforms, measure:
- p95 and p99 latency during normal workload pressure.
- Latency during snapshot, clone, rebuild, and maintenance activity.
- VM disk behavior when databases and platform services share the cluster.
- Latency by StorageClass, not only by node or cluster.
- CPU and network saturation during the test.
- Whether latency changes after live migration, restart, or failover events.
| Metric | Why it matters |
|---|---|
| p99 latency | Shows whether VM disks and database workloads stay predictable under pressure. |
| attach time | Shows whether platform operations remain acceptable during rescheduling. |
| clone time | Shows whether image-based workflows are practical. |
| restore time | Shows whether recovery procedures are operationally usable. |
| time-to-stable-p99 | Shows how quickly storage returns to normal after failure or maintenance. |
This is where a storage platform like simplyblock matters. Simplyblock is designed for low-latency block storage with CSI-native operations, snapshots, clones, and deployment flexibility across HCI, hybrid, and disaggregated models.
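A minimal way to collect these numbers per class is a short fio run against a PVC provisioned from the class under test, repeated while snapshot, clone, or migration activity is in progress. The image, PVC name, namespace, and job parameters below are placeholders; read the p99 values from the clat percentiles in the JSON output rather than the averages.

```bash
# Run fio against a PVC from the class under test. PVC name, namespace, image,
# and sizes are placeholders; the PVC must already exist in the target class.
cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: fio-vm-disk-latency
  namespace: storage-review
spec:
  restartPolicy: Never
  containers:
  - name: fio
    image: quay.io/example/fio:latest   # placeholder image with fio installed
    command: ["fio"]
    args:
    - --name=vm-disk-p99
    - --filename=/data/testfile
    - --size=10G
    - --rw=randwrite
    - --bs=4k
    - --iodepth=16
    - --direct=1
    - --time_based
    - --runtime=300
    - --output-format=json
    volumeMounts:
    - name: test-volume
      mountPath: /data
  volumes:
  - name: test-volume
    persistentVolumeClaim:
      claimName: fio-test-vm-disk-latency   # PVC created from the class under test
EOF
```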
Step 7: Define Production Acceptance Criteria
Before onboarding production VMs, define acceptance criteria in writing:
- Which StorageClass each VM disk type uses.
- Which teams own restore, rollback, and snapshot approval.
- Which latency and recovery metrics count as failed tests.
- Which workloads are allowed to live migrate and which require maintenance windows.
- Which failure domains are acceptable for replicas or protection groups.
- Which logs and metrics operators should check during an incident.
- Which team can approve changes to VM disk storage policy.
This does not need to become a huge document. It should be specific enough that an on-call engineer can make the same decision the architecture team intended.
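One way to keep the criteria that specific is a short, version-controlled file that lives next to the StorageClass definitions. The format, class names, and thresholds below are illustrative only, not a standard.

```bash
# Illustrative acceptance-criteria file; thresholds come from the Step 6 results.
cat <<'EOF' > vm-storage-acceptance.yaml
classes:
  vm-disk-latency:
    workloads: [database]
    p99_write_latency_ms_max: 2        # placeholder threshold
    live_migration: maintenance-window-only
    restore_owner: platform-storage-team
  vm-disk-standard:
    workloads: [application, boot]
    p99_write_latency_ms_max: 10       # placeholder threshold
    live_migration: allowed
    restore_owner: platform-storage-team
EOF
```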
Reviewing OpenShift Virtualization storage?
Talk through your VM disk classes, live migration assumptions, and recovery model with a storage architect.
Questions and Answers
What is the most common OpenShift Virtualization storage mistake?
The most common mistake is using one default storage class for every VM disk. That hides different latency, recovery, and availability requirements until production workloads create pressure.
Does OpenShift Virtualization require shared storage for live migration?
Requirements depend on the exact OpenShift Virtualization configuration and storage mode. Teams should check the current Red Hat documentation and validate the expected behavior with their own VM workloads.
Should VM disks and container PVCs share one storage platform?
They can share one platform if storage policies are explicit. The benefit is simpler operations, but only if VM disks, databases, and general PVCs can use different classes and protection policies.
How does simplyblock fit OpenShift Virtualization?
Simplyblock provides Kubernetes-native block storage for OpenShift, with NVMe/TCP and NVMe/RoCE support, CSI workflows, snapshots, clones, and deployment flexibility for HCI, hybrid, and disaggregated models.
When should teams run this runbook?
Run it before production onboarding, before a VMware migration wave, and before standardizing StorageClasses for OpenShift Virtualization.