Why this follow-up tip exists
In Tip #1, vMotion, DRS, and Clock Skew… Why Distributed Databases Aren’t “Just Another VM”, we explained why clock skew appears in VMware environments that use DRS and vMotion… and why distributed databases are uniquely sensitive to it.
This follow-up tip focuses on practical, production-proven guidance for running YugabyteDB safely on VMware across any industry.
Many of the real-world examples come from large on-prem VMware environments with frequent DRS-driven vMotion (a pattern commonly seen in financial services), but the guidance applies to any enterprise deployment where VM mobility is unconstrained.
If you are seeing recurring “too big clock skew” errors, start with Rule 4 (disable automatic DRS-driven vMotion) before changing any database configuration.
🧠 Core design principle
YugabyteDB does not require perfectly synchronized clocks… but it does require predictable, monotonic clocks.
The goal is not to eliminate vMotion everywhere.
The goal is to constrain mobility where correctness depends on time.
📋 VMware Checklist for YugabyteDB (Operational Baseline)
This checklist focuses on constraining uncontrollable VM mobility (e.g., DRS, vMotion) and stabilizing clocks via NTP, since these are the most common root causes of “too big clock skew” errors in YugabyteDB clusters on VMware.
It does not replace guidance on resource sizing (CPU/memory), storage performance guarantees, networking best practices, or monitoring. Those will be covered in our full VMware deployment best-practices guide… available soon in the official YugabyteDB documentation.
Use this checklist as a design and production readiness gate.
⏱️ Time Synchronization
Disable continuous VMware Tools time sync
Ensure NTP is slew-based, not step-based
All ESXi hosts use the same NTP sources
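For Linux guests running chronyd, a slew-only posture can be sketched as follows. This is an illustrative fragment, not a complete config; the server names are placeholders, while `makestep` and `maxslewrate` are standard chrony.conf directives:

```
# /etc/chrony.conf (sketch — server names are placeholders)
server ntp1.example.com iburst
server ntp2.example.com iburst

# Allow stepping only during the first 3 updates after boot;
# after that, all corrections are slewed, never stepped.
makestep 1.0 3

# Cap the slew rate (ppm) so corrections stay gradual.
maxslewrate 500
```

If you use classic ntpd instead, the `-x` option forces slewing for offsets that would otherwise be stepped.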
YugabyteDB supports VMware PTP devices, where the YugabyteDB guest VM uses a PTP Hardware Clock (PHC) that stays synchronized with the host’s PTP clock. This provides tighter clock-drift bounds and helps ensure that all nodes in the cluster remain within a few milliseconds of each other, which can improve performance for certain classes of multi-shard transactions.
YugabyteDB requires clocks to be monotonic and bounded. “Step-based” time adjustments or inconsistent NTP sources can produce transient clock jumps that violate this assumption and trigger safety shutdowns.
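To make “monotonic” concrete, here is a toy sketch (a hypothetical helper, not part of YugabyteDB) that detects a step adjustment by comparing wall-clock deltas against monotonic-clock deltas. Pairs would be gathered on each tick with `(time.time(), time.monotonic())`:

```python
def detect_backward_jump(samples, tolerance_s=0.001):
    """samples: list of (wall_time, monotonic_time) pairs taken together.

    Returns True if wall time ever fell behind the monotonic clock
    between two samples, i.e. a step adjustment moved time backward.
    """
    for (w0, m0), (w1, m1) in zip(samples, samples[1:]):
        # Wall-clock delta minus monotonic delta should be ~0 under slewing;
        # a large negative value means the wall clock was stepped back.
        if (w1 - w0) - (m1 - m0) < -tolerance_s:
            return True
    return False
```

A slewed clock drifts the wall-clock delta only gradually, which stays inside the tolerance; a step shows up as a sudden gap between the two deltas.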
In addition to clock synchronization:
● 100% memory reservation for YugabyteDB VMs is recommended to avoid host contention.
● High CPU shares should be configured to minimize noisy-neighbor effects.
● Avoid aggressive over-commitment of vCPUs, especially across NUMA boundaries.
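With VMware PowerCLI, the reservation and shares settings can be applied roughly as follows. This is a sketch, not a definitive script: the VM name is a placeholder, and it assumes the standard `Get-VMResourceConfiguration` / `Set-VMResourceConfiguration` cmdlets:

```powershell
# Sketch: reserve 100% of configured memory and raise CPU shares for one VM.
# "yb-tserver-1" is a placeholder name.
$vm = Get-VM -Name "yb-tserver-1"
Get-VMResourceConfiguration -VM $vm |
  Set-VMResourceConfiguration -MemReservationGB $vm.MemoryGB -CpuSharesLevel High
```

Repeat (or loop) over every yb-master and yb-tserver VM in the cluster.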
🔄 vMotion & DRS
YugabyteDB VMs are in a dedicated DRS VM group
Automatic vMotion is disabled or tightly constrained
No storage-initiated mobility
Maintenance vMotion is manual and planned
🧩 Placement & Failure Domains
One YugabyteDB node per ESXi host (anti-affinity)
Hosts aligned to rack / power / fault domains
Cross-datacenter DR handled by xCluster, not storage replication
Anti-patterns to avoid:
● VMDK-level replication: storage replication at the VM-disk level cannot preserve Raft state across replicas and can cause replica divergence.
● Snapshot restore of active clusters: restoring from a snapshot while the cluster is live can lead to inconsistent state or split-brain scenarios.
● “Let DRS decide”: default DRS automation can trigger unconstrained vMotions that violate clock assumptions.
This checklist is designed to prevent clock skew by constraining VM mobility. It does not recommend disabling YugabyteDB clock-skew safety mechanisms or relying on configuration flags as a long-term solution.
🧩 Sample VMware DRS Rules (Battle-Tested)
The following DRS patterns are production-proven for running YugabyteDB on VMware and do not require disabling DRS globally or re-architecting the entire environment.
These rules are listed in order of impact.
🟢 Rule 1: Dedicated YugabyteDB VM Group
Why this rule exists
A dedicated VM group ensures YugabyteDB nodes are treated as a single, coherent workload by DRS.
This allows you to apply constraints (anti-affinity, automation overrides) only to database nodes, without impacting unrelated VMs.
How to validate
● In vCenter, confirm the VM group contains only the cluster’s yb-tserver and yb-master VMs
● Verify that no application, utility, or monitoring VMs are included
Common pitfalls
● Including non-database VMs “for convenience”
● Mixing YugabyteDB nodes from different clusters into the same VM group
● Forgetting to add new DB nodes after scale-out events
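In PowerCLI, creating the dedicated group looks roughly like this (a sketch; cluster and VM names are placeholders, assuming the DRS cluster-group cmdlets available in recent PowerCLI):

```powershell
# Sketch: create a DRS VM group holding only this cluster's DB nodes.
# "prod-cluster" and the VM name patterns are placeholders.
$cluster = Get-Cluster -Name "prod-cluster"
$ybVms   = Get-VM -Name "yb-master-*", "yb-tserver-*"
New-DrsClusterGroup -Name "yb-db-nodes" -Cluster $cluster -VM $ybVms
```

After scale-out, the same group must be updated to include the new nodes (see the pitfalls above).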
🟢 Rule 2: YugabyteDB Host Group
(Strongly Recommended Where Feasible)
Why this rule exists
Host groups define the fault domain boundary for the cluster.
By constraining YugabyteDB nodes to a known set of ESXi hosts, you reduce unpredictable scheduling behavior and prevent DRS from moving nodes onto unsuitable hosts.
This rule improves placement stability but does not replace VM-VM anti-affinity or constrained vMotion controls.
How to validate
● Confirm the host group contains at least N hosts for an RF=N cluster (e.g., 3 hosts for RF=3)
● Ensure hosts are homogeneous (CPU generation, memory capacity, networking)
Common pitfalls
● Host group too small to satisfy anti-affinity rules
● Mixing hosts with different CPU generations or performance profiles
● Forgetting to update the host group during hardware refreshes
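A host group plus a VM-Host rule can be sketched in PowerCLI as follows. Names are placeholders, and it assumes a DRS VM group (here called `yb-db-nodes`) has already been created per Rule 1:

```powershell
# Sketch: pin the YugabyteDB VM group to a known set of ESXi hosts.
$cluster = Get-Cluster -Name "prod-cluster"
$dbHosts = Get-VMHost -Name "esx01.example.com", "esx02.example.com", "esx03.example.com"
$hostGrp = New-DrsClusterGroup -Name "yb-db-hosts" -Cluster $cluster -VMHost $dbHosts
$vmGrp   = Get-DrsClusterGroup -Cluster $cluster -Name "yb-db-nodes"
New-DrsVMHostRule -Name "yb-on-db-hosts" -Cluster $cluster `
  -VMGroup $vmGrp -VMHostGroup $hostGrp -Type "MustRunOn"
```

Keep the host group large enough that anti-affinity (Rule 3) always has a legal placement.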
🟢 Rule 3: VM-VM Anti-Affinity (One Node per ESXi Host)
Why this rule exists
YugabyteDB’s replication factor (typically RF=3) assumes replicas live in independent failure domains.
Without VM-VM anti-affinity, a single ESXi host failure can take down multiple replicas of the same tablet… breaking quorum.
How to validate
● In vCenter, confirm anti-affinity rules are “Must run on separate hosts”, not “Should”
● Check current VM placement to ensure no two YugabyteDB nodes share the same host
Common pitfalls
● Using “soft” anti-affinity that DRS can ignore under pressure
● Over-constraining placement so DRS has no legal placement options
● Assuming rack or cluster diversity implies host-level separation (it doesn’t)
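The anti-affinity rule itself is a one-liner in PowerCLI (a sketch; names are placeholders, and `-KeepTogether $false` creates a “separate virtual machines” rule):

```powershell
# Sketch: hard anti-affinity so no two DB nodes share an ESXi host.
$cluster = Get-Cluster -Name "prod-cluster"
$ybVms   = Get-VM -Name "yb-master-*", "yb-tserver-*"
New-DrsRule -Cluster $cluster -Name "yb-anti-affinity" -KeepTogether $false -VM $ybVms
```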
🟢 Rule 4: Disable or Constrain Automated vMotion (DRS Automation)
(Most Important Rule)
Why this rule exists
vMotion is not time-neutral. Even short VM stuns can introduce clock skew that violates YugabyteDB’s timing assumptions and triggers node self-shutdowns or recovery behavior.
How to validate
● Confirm DRS is set to Manual or Partially Automated for YugabyteDB VMs
● Review recent vMotion history to ensure migrations occur only during planned maintenance
Common pitfalls
● Leaving DRS fully automated “temporarily”
● Assuming NTP alone is sufficient to absorb clock jumps
● Forgetting that storage-initiated or maintenance vMotions still count
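Both validation steps can be scripted in PowerCLI (a sketch; names and the 7-day window are placeholders). Note that per-VM automation overrides are also available in vSphere under the cluster’s VM Overrides settings:

```powershell
# Sketch: drop cluster-wide DRS automation below fully automated.
$cluster = Get-Cluster -Name "prod-cluster"
Set-Cluster -Cluster $cluster -DrsAutomationLevel PartiallyAutomated -Confirm:$false

# Audit vMotion history for the DB nodes over the last 7 days.
Get-VIEvent -Entity (Get-VM -Name "yb-*") -Start (Get-Date).AddDays(-7) |
  Where-Object { $_ -is [VMware.Vim.VmMigratedEvent] } |
  Select-Object CreatedTime, FullFormattedMessage
```

Any migration that does not line up with a planned maintenance window is a red flag.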
🧠 Summary of Rule Importance
| Rule | Priority | Purpose |
|---|---|---|
| Rule 1: Dedicated YugabyteDB VM Group | Required | Scopes DRS controls to database nodes |
| Rule 2: YugabyteDB Host Group (Strongly Recommended Where Feasible) | Recommended | Improves placement stability and predictability |
| Rule 3: VM-VM Anti-Affinity (One Node per ESXi Host) | Required | Preserves fault domain isolation for RF quorum |
| Rule 4: Disable or Constrain Automated vMotion (DRS Automation) | Critical | Prevents clock-skew events and timing violations |
🧪 YCQL-Only Workloads
YCQL is often used for:
● Event ingestion
● Time-series data
● High-throughput key-value workloads
(These patterns are common in financial services, telemetry pipelines, and real-time analytics.)
YCQL characteristics
● Leader-based writes
● High concurrency
● Sensitive to leader churn
What clock skew causes in YCQL
● Sudden write rejections
● Tablet leader step-downs
● Bursty failures during vMotion
● xCluster lag amplification
YCQL-specific recommendations
● Strictest vMotion constraints
● Conservative clock-skew tolerance
● Monitor:
○ Leader changes per tablet
○ Clock skew warnings (repeated warnings are an infrastructure signal, not a database tuning issue)
● Prefer static placement over elasticity
YCQL failures tend to be loud and immediate.
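To make the clock-skew monitoring item concrete, here is a toy sketch (a hypothetical helper, not part of YugabyteDB) that compares per-node wall-clock samples against YugabyteDB’s default 500 ms skew bound (the `--max_clock_skew_usec` flag, default 500000 µs):

```python
MAX_CLOCK_SKEW_MS = 500  # default --max_clock_skew_usec (500000 us), in ms

def worst_pairwise_skew_ms(node_samples_ms):
    """node_samples_ms: wall-clock readings (in ms) collected from every
    node at roughly the same instant, e.g. via SSH `date +%s%3N`."""
    return max(node_samples_ms) - min(node_samples_ms)

def skew_within_bound(node_samples_ms, bound_ms=MAX_CLOCK_SKEW_MS):
    """True if the worst observed pairwise skew is inside the bound."""
    return worst_pairwise_skew_ms(node_samples_ms) < bound_ms
```

In practice you would alert well below the bound (say, at 50%), since skew that is already near the limit leaves no headroom for a vMotion stun.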
🧪 YSQL-Only Workloads
YSQL is commonly used for:
● OLTP systems
● Transactional applications
● Strong consistency use cases
YSQL characteristics
● Distributed transactions
● Serializable isolation
● Heavy reliance on HLC ordering
What clock skew causes in YSQL
● Increased transaction retries
● Higher commit latency
● Serializable transaction aborts
● “Database feels slow” symptoms
YSQL-specific recommendations
● Zero tolerance for backward time jumps
● Avoid frequent vMotion entirely
● Monitor:
○ Transaction retry rate
○ Commit latency spikes
○ Repeated clock skew warnings should be treated as an infrastructure signal, not a database tuning issue
YSQL failures are often silent performance degradations, not hard errors.
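Because the symptom is retries rather than hard errors, the application side usually needs a retry wrapper. A minimal sketch (a hypothetical helper with a generic retry predicate; for YSQL you would typically treat SQLSTATE 40001 serialization failures as retryable):

```python
import time

def run_with_retries(txn_fn, is_retryable, max_attempts=5, base_delay_s=0.01):
    """Run a transaction function, retrying retryable errors (for YSQL,
    typically serialization failures) with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return txn_fn()
        except Exception as exc:
            # Give up on non-retryable errors or when attempts are exhausted.
            if not is_retryable(exc) or attempt == max_attempts - 1:
                raise
            time.sleep(base_delay_s * (2 ** attempt))
```

Tracking how often this wrapper actually retries gives you the “transaction retry rate” metric from the list above.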
🚫 Disaster Recovery: What Not to Do
Do NOT rely on:
● VMDK replication
● Storage snapshots
● VM-level restore workflows
These approaches:
● Violate Raft assumptions
● Capture partial consensus state
● Risk split-brain and corruption
Supported model
● Use xCluster replication
● Treat DR as a logical database concern, not a storage concern
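Setting up xCluster is done with the `yb-admin setup_universe_replication` command. A sketch only; addresses are placeholders, and the angle-bracketed values must come from your own universes:

```sh
# Sketch: establish xCluster replication into a target universe.
# Run against the target's masters; addresses are placeholders.
yb-admin \
  -master_addresses target-m1:7100,target-m2:7100,target-m3:7100 \
  setup_universe_replication <replication_group_id> \
  source-m1:7100,source-m2:7100,source-m3:7100 \
  <comma_separated_table_ids>
```

Consult the yb-admin reference in the YugabyteDB documentation for the exact argument forms and the companion commands that pause, alter, or delete replication.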
🧠 One sentence that works with VMware teams
“YugabyteDB nodes are not independent VMs… they are time-coordinated members of a distributed consensus group. Mobility must be constrained where correctness depends on time.”
This framing avoids blame and aligns expectations.
🏁 Final takeaway
This checklist focuses on the most common and most damaging failure modes when running YugabyteDB on VMware: uncontrolled VM mobility and unstable time synchronization.
Rules that constrain automated vMotion and enforce host-level fault isolation have the highest impact, because they protect YugabyteDB’s core correctness and availability assumptions. VM grouping and host affinity further improve stability, but they complement, not replace, strict mobility and placement controls.
Treat clocks, VM mobility, and fault domains as deterministic inputs, not best-effort infrastructure behavior. When those guarantees are enforced, YugabyteDB can run predictably and safely on shared vSphere environments.
Have Fun!
