Prometheus is a widely used standard for time-series monitoring in cloud-native infrastructure, and it uses the time-series data it collects as the source for generating alerts.
Every node in a YugabyteDB universe exports detailed time-series metrics, available in both Prometheus exposition format and JSON for easy integration with Prometheus.
What metrics are available?
You can view YB-TServer metrics in Prometheus format directly in a browser or via the CLI with the following command, where <node-ip> stands for the address of the node:
curl <node-ip>:9000/prometheus-metrics
You can view YB-Master metrics in Prometheus format the same way, in a browser or via the CLI:
curl <node-ip>:7000/prometheus-metrics
We can store the metric names and their descriptions in a table for easy querying!
Example:
yugabyte=# CREATE TABLE available_prometheus_metrics (server TEXT, metric TEXT, description TEXT);
CREATE TABLE
Load the available YB-Master metric names and descriptions:
yugabyte=# SELECT inet_server_addr();
inet_server_addr
------------------
***.**.**.248
(1 row)
yugabyte=# \COPY available_prometheus_metrics FROM PROGRAM 'curl -s http://***.**.**.248:7000/prometheus-metrics | grep HELP | sed "s/# HELP //" | sed "s/ /|/1" | sed -e "s/^/MASTER|/" | uniq' DELIMITER '|';
COPY 2578
Load the YB-TServer metric names and descriptions:
yugabyte=# \COPY available_prometheus_metrics FROM PROGRAM 'curl -s http://***.**.**.248:9000/prometheus-metrics | grep HELP | sed "s/# HELP //" | sed "s/ /|/1" | sed -e "s/^/TSERVER|/" | uniq' DELIMITER '|';
COPY 1894
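To see what the grep/sed pipeline inside the two \COPY commands does, here is a minimal sketch that runs the same steps on a small hand-written sample instead of a live endpoint. The function name help_to_rows and the sample lines (including the label on each value line) are made up for illustration; real endpoint output is much longer and may differ in detail.

```shell
# Sketch only: help_to_rows is a hypothetical helper, not part of YugabyteDB.
# It reads Prometheus exposition text on stdin and prints one
# "SERVER|metric|description" row per "# HELP" line, using the same steps
# as the \COPY ... FROM PROGRAM pipeline:
#   grep HELP          keep only lines containing HELP
#   sed "s/# HELP //"  drop the "# HELP " prefix
#   sed "s/ /|/1"      turn the first space (after the metric name) into "|"
#   sed "s/^/LABEL|/"  prepend the server label as the first column
#   uniq               collapse adjacent duplicate rows
help_to_rows() {
  server="$1"
  grep HELP | sed "s/# HELP //" | sed "s/ /|/1" | sed -e "s/^/${server}|/" | uniq
}

# Hand-written sample in exposition format (values and labels are assumptions).
sample='# HELP hybrid_clock_skew Server clock skew.
# TYPE hybrid_clock_skew gauge
hybrid_clock_skew{metric_type="server"} 0
# HELP hybrid_clock_error Server clock maximum error.
# TYPE hybrid_clock_error gauge
hybrid_clock_error{metric_type="server"} 0'

printf '%s\n' "$sample" | help_to_rows MASTER
# prints:
# MASTER|hybrid_clock_skew|Server clock skew.
# MASTER|hybrid_clock_error|Server clock maximum error.
```

Note that uniq only removes duplicates that sit on adjacent lines, and that a description containing a "|" character would add an extra column and make the COPY fail, so the row counts you get may vary between versions.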
Now it’s super simple to look for a particular metric of interest to scrape:
yugabyte=# SELECT metric, description FROM available_prometheus_metrics WHERE server = 'MASTER' AND description ILIKE '%clock%' ORDER BY metric;
                           metric                            |                                   description
-------------------------------------------------------------+---------------------------------------------------------------------------------
 handler_latency_yb_server_GenericService_ServerClock        | Microseconds spent handling yb.server.GenericService.ServerClock() RPC requests
 handler_latency_yb_server_GenericService_ServerClock_count  | Microseconds spent handling yb.server.GenericService.ServerClock() RPC requests
 handler_latency_yb_server_GenericService_ServerClock_sum    | Microseconds spent handling yb.server.GenericService.ServerClock() RPC requests
 hybrid_clock_error                                          | Server clock maximum error.
 hybrid_clock_hybrid_time                                    | Hybrid clock hybrid_time.
 hybrid_clock_skew                                           | Server clock skew.
 service_request_bytes_yb_server_GenericService_ServerClock  | Bytes received by yb.server.GenericService.ServerClock() RPC requests
 service_response_bytes_yb_server_GenericService_ServerClock | Bytes sent in response to yb.server.GenericService.ServerClock() RPC requests
(8 rows)
yugabyte=# SELECT metric, description FROM available_prometheus_metrics WHERE server = 'TSERVER' AND description ILIKE '%conflict%' ORDER BY metric;
                   metric                   |                                       description
--------------------------------------------+-----------------------------------------------------------------------------------------
 conflict_resolution_latency_count          | Microseconds spent on conflict resolution across all transactions at the current tablet
 conflict_resolution_latency_sum            | Microseconds spent on conflict resolution across all transactions at the current tablet
 conflict_resolution_num_keys_scanned_count | Number of keys scanned during conflict resolution)
 conflict_resolution_num_keys_scanned_sum   | Number of keys scanned during conflict resolution)
 transaction_conflicts                      | Number of conflicts detected among uncommitted distributed transactions.
(5 rows)
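Once a metric name has been found this way, a natural next step is to look at its current samples in the endpoint output. The following is a rough sketch of that, assuming the usual exposition layout of "name{labels} value" per sample line. The helper name metric_samples, the sample values, the table_name label, and the metric transaction_conflicts_other are all invented for illustration.

```shell
# Sketch only: metric_samples is a hypothetical helper, not part of YugabyteDB.
# It reads Prometheus exposition text on stdin and prints the sample lines
# whose metric name is exactly the one given as the first argument.
metric_samples() {
  awk -v name="$1" '
    /^#/ { next }                  # skip "# HELP" and "# TYPE" comment lines
    {
      metric = $0
      sub(/[{ ].*/, "", metric)    # cut at the first "{" or space, leaving the name
      if (metric == name) print    # exact match, so longer names are not included
    }'
}

# Hand-written sample (values, labels, and the "_other" metric are made up).
sample='# HELP transaction_conflicts Number of conflicts detected among uncommitted distributed transactions.
# TYPE transaction_conflicts counter
transaction_conflicts{table_name="t1"} 3
transaction_conflicts{table_name="t2"} 0
transaction_conflicts_other{table_name="t1"} 9'

printf '%s\n' "$sample" | metric_samples transaction_conflicts
# prints:
# transaction_conflicts{table_name="t1"} 3
# transaction_conflicts{table_name="t2"} 0
```

Against a live node, the same helper could be fed from curl, for example: curl -s http://<node-ip>:9000/prometheus-metrics | metric_samples transaction_conflicts, where <node-ip> is a placeholder for the node address.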
Have Fun!
