Handling “Disk Full” Scenarios Gracefully in YugabyteDB

In distributed databases, running out of disk space on even a single node can cause cascading issues: replica instability, write failures, and, in the worst case, data corruption or crash loops. Fortunately, YugabyteDB includes a built-in mechanism that detects critically low disk space and proactively rejects writes.

This tip explores how YugabyteDB handles “disk full” scenarios and how you can configure it with fine-grained GFlags to protect your cluster under storage pressure.

The Problem: Disk Exhaustion in Distributed Systems

When a tablet server or master node runs out of disk space, several issues can occur:

▪️ Write-ahead log (WAL) entries can’t be persisted → writes fail silently or hang
▪️ Compaction can stall → read performance degrades
▪️ Leader re-elections may happen frequently due to IO stalls
▪️ Manual cleanup or restarts are often required, which introduces downtime

To avoid these issues, YugabyteDB includes proactive write-rejection logic that kicks in when free space falls below a configurable threshold, before the disk is completely exhausted.

The Solution: Reject Writes When Disk Is Low

YugabyteDB provides a safety mechanism: if free disk space on a node falls below a threshold, writes are rejected gracefully on that node. This preserves cluster stability and prevents data inconsistency or corruption.

This behavior is controlled by three key GFlags:

1. `--reject_writes_when_disk_full`

✅ Enables write rejection logic.
If this is true, YugabyteDB monitors available disk space and rejects further writes once free space drops below the configured threshold.

    --reject_writes_when_disk_full=true

2. `--reject_writes_min_disk_space_mb`

📉 Minimum free space (in MB) to allow writes.
If free disk space drops below this threshold, writes are rejected on the WAL and data directories.

    --reject_writes_min_disk_space_mb=2048   # Reject writes below 2 GB free

If set to 0, the threshold (in MB) defaults to:

    max_disk_throughput_mbps * min(10, reject_writes_min_disk_space_check_interval_sec)
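To make the fallback concrete, here is a minimal Python sketch of that formula. The function name is hypothetical, and the value 300 for `max_disk_throughput_mbps` is only an illustrative input; check the value configured in your own deployment.

```python
def effective_min_disk_space_mb(min_disk_space_mb: int,
                                max_disk_throughput_mbps: int,
                                check_interval_sec: int) -> int:
    """Return the write-rejection threshold in MB.

    Applies the fallback formula quoted above when the flag is left at 0.
    """
    if min_disk_space_mb != 0:
        # An explicit value is used as-is.
        return min_disk_space_mb
    return max_disk_throughput_mbps * min(10, check_interval_sec)


print(effective_min_disk_space_mb(0, 300, 60))     # 300 * min(10, 60) = 3000
print(effective_min_disk_space_mb(0, 300, 5))      # 300 * min(10, 5)  = 1500
print(effective_min_disk_space_mb(2048, 300, 60))  # explicit value: 2048
```

Under these assumed inputs, leaving the flag at 0 with a 60-second interval gives a threshold of roughly 3 GB.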

3. `--reject_writes_min_disk_space_check_interval_sec`

🔁 Interval (in seconds) for checking disk space.
Once disk usage crosses the aggressive-check threshold, checks run every 10 seconds instead of at the configured interval.
Setting this value below 10 keeps the check permanently in aggressive mode, which may impact performance.

    --reject_writes_min_disk_space_check_interval_sec=60
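The interval behavior described above can be pictured with a small toy model in Python. This is not YugabyteDB's implementation: the function name and the `aggressive_threshold_mb` parameter are hypothetical, and the model assumes that an interval below 10 is simply used as-is.

```python
AGGRESSIVE_INTERVAL_SEC = 10  # the 10-second aggressive interval described above


def next_check_interval_sec(free_mb: int,
                            aggressive_threshold_mb: int,
                            configured_interval_sec: int) -> int:
    """Toy model: pick the delay before the next disk-space check."""
    if free_mb < aggressive_threshold_mb:
        # Low on space: check at least every 10 seconds.
        return min(configured_interval_sec, AGGRESSIVE_INTERVAL_SEC)
    return configured_interval_sec


print(next_check_interval_sec(50_000, 10_000, 60))  # plenty of space: 60
print(next_check_interval_sec(5_000, 10_000, 60))   # low on space: 10
print(next_check_interval_sec(50_000, 10_000, 5))   # configured below 10: 5
```

In this model, a configured interval below 10 always checks at least as often as aggressive mode would, which is why it behaves as "always aggressive".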

🧪 Example: Safe Config in Production

    --reject_writes_when_disk_full=true
    --reject_writes_min_disk_space_mb=4096
    --reject_writes_min_disk_space_check_interval_sec=60

✅ This ensures that:

▪️ Writes are rejected when less than 4 GB of disk space is free
▪️ Disk space is checked every 60 seconds (and more aggressively if needed)
▪️ The system avoids dangerous edge conditions before they become outages
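If you want to see how close a node is to that limit from the outside, a rough Python sketch using only the standard library might look like the following. It mirrors the 4096 MB threshold from the example config; the helper names are hypothetical, and the path `"."` is a placeholder for your actual WAL or data directory. It is an independent check, not a view into YugabyteDB's internal state.

```python
import shutil

MB = 1024 * 1024


def is_below_threshold(free_bytes: int, min_disk_space_mb: int) -> bool:
    """True if the free space is under the configured minimum."""
    return free_bytes < min_disk_space_mb * MB


def check_directory(path: str, min_disk_space_mb: int = 4096) -> bool:
    """Check the filesystem that holds `path` against the threshold."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return is_below_threshold(usage.free, min_disk_space_mb)


if __name__ == "__main__":
    # "." is a placeholder; point this at a WAL or data directory.
    if check_directory("."):
        print("Free space is below 4096 MB; writes would likely be rejected")
    else:
        print("Free space is above the 4096 MB threshold")
```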

📌 Best Practices

▪️ Always monitor disk usage with external tools or metrics (e.g., Prometheus, YBA)
▪️ Set a realistic threshold for `--reject_writes_min_disk_space_mb` based on your WAL/data growth rate
▪️ Avoid setting `--reject_writes_min_disk_space_check_interval_sec` below 10 except for testing/debugging
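One way to reason about a threshold relative to growth rate is to estimate how long a node has before it reaches the rejection point. The sketch below is a back-of-the-envelope calculation in Python; the function name and the sample numbers (20 GB free, 5 MB/s growth) are made up for illustration and assume a constant growth rate.

```python
def seconds_until_rejection(free_mb: float,
                            min_disk_space_mb: float,
                            growth_mb_per_sec: float) -> float:
    """Estimate the time left before free space hits the rejection threshold."""
    if growth_mb_per_sec <= 0:
        # Disk usage is flat or shrinking: the threshold is never reached.
        return float("inf")
    headroom_mb = free_mb - min_disk_space_mb
    return max(0.0, headroom_mb / growth_mb_per_sec)


# 20480 MB free, 4096 MB threshold, 5 MB/s growth:
# (20480 - 4096) / 5 = 3276.8 seconds, a little under 55 minutes.
print(seconds_until_rejection(20480, 4096, 5))
```

If the estimate is shorter than the time your team needs to add capacity or clean up, that suggests alerting at a higher free-space level than the rejection threshold itself.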

“Disk full” may sound like a basic problem, but in distributed systems, it’s a silent killer. YugabyteDB’s reject-write mechanism turns an unpredictable failure into a graceful, recoverable event.

If you’re running large-scale transactional workloads, we strongly recommend reviewing these flags and ensuring your cluster is resilient to disk pressure!

Have Fun!