Handling “Disk Full” Scenarios Gracefully in YugabyteDB

In distributed databases, running out of disk space on even a single node can cause cascading issues: replica instability, write failures, and, in the worst case, data corruption or crash loops. Fortunately, YugabyteDB includes a built-in mechanism that detects critically low disk space and rejects writes before the disk fills up.

This tip explores how YugabyteDB handles “disk full” scenarios and how you can configure it with fine-grained GFlags to protect your cluster under storage pressure.

The Problem: Disk Exhaustion in Distributed Systems

When a tablet server or master node runs out of disk space, several issues can occur:

  • ▪️ Write-ahead log (WAL) entries can’t be persisted → writes fail silently or hang

  • ▪️ Compaction can stall → read performance degrades

  • ▪️ Leader re-elections may happen frequently due to IO stalls

  • ▪️ Manual cleanup or restarts are often required, which introduces downtime

To avoid these issues, YugabyteDB includes proactive write-rejection logic that kicks in before the disk actually fills up.

The Solution: Reject Writes When Disk Is Low

YugabyteDB provides a safety mechanism: if free disk space on a node falls below a threshold, that node rejects writes gracefully. This preserves cluster stability and prevents data inconsistency or corruption.

This behavior is controlled by three key GFlags:

1. --reject_writes_when_disk_full

✅ Enables write rejection logic.
If this is true, YugabyteDB monitors available disk space and rejects further writes if free space drops below the configured threshold.

    --reject_writes_when_disk_full=true

2. --reject_writes_min_disk_space_mb

📉 Minimum free space (in MB) required to allow writes.
If free space on the WAL or data directories drops below this threshold, writes are rejected.

    # Reject writes when less than 2 GB is free
    --reject_writes_min_disk_space_mb=2048

If set to 0, the threshold defaults to:

    max_disk_throughput_mbps * min(10, reject_writes_min_disk_space_check_interval_sec)
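
To make the default concrete, here is a minimal Python sketch of the formula above. The function name is hypothetical, and the sample values (300 MB/s throughput, 60-second interval) are only illustrative assumptions, not values taken from your cluster.

```python
def effective_min_disk_space_mb(
    reject_writes_min_disk_space_mb: int,
    max_disk_throughput_mbps: int,
    check_interval_sec: int,
) -> int:
    """Return the free-space threshold (in MB) below which writes are rejected.

    Hypothetical helper that mirrors the formula quoted in this tip.
    """
    # An explicit, non-zero setting is used as-is.
    if reject_writes_min_disk_space_mb != 0:
        return reject_writes_min_disk_space_mb
    # Otherwise fall back to throughput * min(10, check interval).
    return max_disk_throughput_mbps * min(10, check_interval_sec)


# Illustrative values only: 300 MB/s throughput, 60-second check interval.
print(effective_min_disk_space_mb(0, 300, 60))     # prints 3000
print(effective_min_disk_space_mb(2048, 300, 60))  # prints 2048
```

With these sample numbers, leaving the flag at 0 would give a 3000 MB threshold, because the interval is capped at 10 in the formula.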
				
			
3. --reject_writes_min_disk_space_check_interval_sec

🔁 Interval (in seconds) between disk space checks.
When free space gets low enough to cross the aggressive-check threshold, checks run every 10 seconds instead.
Setting this flag below 10 keeps the check permanently in aggressive mode, which may impact performance.

    --reject_writes_min_disk_space_check_interval_sec=60

🧪 Example: Safe Config in Production

    --reject_writes_when_disk_full=true
    --reject_writes_min_disk_space_mb=4096
    --reject_writes_min_disk_space_check_interval_sec=60

✅ This ensures that:

  • ▪️ Writes are rejected when less than 4 GB of free space is available
  • ▪️ Disk space is checked every 60 seconds (and more aggressively if needed)
  • ▪️ The system avoids dangerous edge conditions before they become outages
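
If you keep these flags in a file, a small script can sanity-check them before a rollout. The sketch below is only an illustration: the helper names are hypothetical, it assumes one --name=value flag per line, and the warnings simply restate the guidance from this tip.

```python
def parse_flags(text: str) -> dict[str, str]:
    """Parse lines of the form --name=value, skipping blanks and # comments."""
    flags = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.lstrip("-").partition("=")
        flags[name] = value
    return flags


def review_disk_flags(flags: dict[str, str]) -> list[str]:
    """Return warnings for the three disk-full flags discussed in this tip."""
    warnings = []
    if flags.get("reject_writes_when_disk_full", "").lower() != "true":
        warnings.append("reject_writes_when_disk_full is not explicitly set to true")
    interval = flags.get("reject_writes_min_disk_space_check_interval_sec")
    if interval is not None and int(interval) < 10:
        warnings.append("check interval below 10s keeps checks in aggressive mode")
    min_mb = flags.get("reject_writes_min_disk_space_mb")
    if min_mb is not None and int(min_mb) == 0:
        warnings.append("min disk space is 0, so the computed default applies")
    return warnings


EXAMPLE = """
--reject_writes_when_disk_full=true
--reject_writes_min_disk_space_mb=4096
--reject_writes_min_disk_space_check_interval_sec=60
"""

print(review_disk_flags(parse_flags(EXAMPLE)))  # prints []
```

The example configuration produces no warnings. Lowering the interval to 5, for instance, would trigger the aggressive-mode warning.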
📌 Best Practices
  • ▪️ Always monitor disk usage with external tools or metrics (e.g., Prometheus, YBA)

  • ▪️ Set a realistic threshold for reject_writes_min_disk_space_mb based on your WAL/data growth rate

  • ▪️ Avoid setting reject_writes_min_disk_space_check_interval_sec below 10 except for testing/debugging
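
As a complement to Prometheus or YBA, a tiny external check can compare free space against your configured minimum. This is a rough sketch using only the Python standard library. The directory list is a placeholder assumption (point it at your real WAL and data directories), and the script does not read anything from YugabyteDB itself.

```python
import shutil

MB = 1024 * 1024


def free_space_mb(path: str) -> float:
    """Free space, in MB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / MB


def is_below_threshold(free_mb: float, min_disk_space_mb: int) -> bool:
    """True when free space is under the configured minimum."""
    return free_mb < min_disk_space_mb


# Placeholder list; replace with your real WAL and data directories.
DIRECTORIES = ["."]
MIN_DISK_SPACE_MB = 4096  # mirrors the example value used earlier in this tip

for directory in DIRECTORIES:
    free_mb = free_space_mb(directory)
    status = "LOW" if is_below_threshold(free_mb, MIN_DISK_SPACE_MB) else "ok"
    print(f"{directory}: {free_mb:.0f} MB free ({status})")
```

Running something like this from cron, with an alert threshold set somewhat above the write-rejection threshold, gives you time to react before writes start being rejected.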

“Disk full” may sound like a basic problem, but in distributed systems, it’s a silent killer. YugabyteDB’s reject-write mechanism turns an unpredictable failure into a graceful, recoverable event.

If you’re running large-scale transactional workloads, we strongly recommend reviewing these flags and ensuring your cluster is resilient to disk pressure!

Have Fun!

One of our backyard American Goldfinches enjoying the final sunflower seeds of the season.