DataHub + YugabyteDB: Client-Side Load Balancing with the Python Smart Driver (psycopg2-yugabytedb)

Yugabyte’s docs show how to connect DataHub to YugabyteDB via the Postgres interface (YSQL) and even call out running the DataHub quickstart against YB. But those pages don’t cover using Yugabyte’s Python smart driver to get client-side connection load balancing. Today’s YugabyteDB Tip fills that gap.

We’ll…

  • Bring up DataHub quickstart on a single AlmaLinux box.

  • Point DataHub’s Postgres source at a local three-node YB cluster (yugabyted) on 127.0.0.1/.2/.3 (YSQL on 5433).

  • Swap in the Yugabyte psycopg2 smart driver (drop-in for psycopg2) so new connections load-balance across nodes.

  • Prove it works with a tiny Python demo that shows which node each connection hits.

Our lab topology
  • YB cluster (3 local nodes) on one host:
    127.0.0.1:5433, 127.0.0.2:5433, 127.0.0.3:5433 (YSQL)

  • DataHub quickstart (Docker Compose: GMS, UI, Kafka, etc.).

  • Ingestion uses DataHub’s Postgres source (SQLAlchemy + psycopg2), but we replace psycopg2 with psycopg2-yugabytedb-binary to enable load balancing.

DataHub’s Postgres connector runs in Python (SQLAlchemy/psycopg2), not JDBC. That’s why the Python smart driver is the right tool here.
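Nothing about the pipeline code changes when the smart driver is swapped in; only the connection parameters do. As a minimal sketch (build_dsn is an illustrative helper, not a DataHub or driver API), the libpq-style DSN used later in this post can be assembled like so:

```python
# Sketch: assemble the libpq-style DSN the YB smart driver consumes.
# The multi-host list and the load_balance keyword are the only
# YB-specific parts; everything else is stock psycopg2 usage.
# Values match this post's lab.
def build_dsn(hosts, port=5433, dbname="yugabyte", user="yugabyte"):
    parts = {
        "host": ",".join(hosts),  # smart driver accepts a comma-separated list
        "port": str(port),
        "dbname": dbname,
        "user": user,
        "load_balance": "true",   # enable client-side load balancing
    }
    return " ".join(f"{k}={v}" for k, v in parts.items())

dsn = build_dsn(["127.0.0.1", "127.0.0.2", "127.0.0.3"])
print(dsn)
```

DSN parameters are plain strings, so there is no boolean-vs-string ambiguity here; the same string form reappears in the YAML recipe below.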

Step 0: A 3-Node Local YugabyteDB Cluster

[root@localhost yb]# ysqlsh -h 127.0.0.1 -c "SELECT host, cloud, region, zone FROM yb_servers() ORDER BY host;"
   host    | cloud  |    region    |     zone
-----------+--------+--------------+---------------
 127.0.0.1 | onprem | us-east-1    | us-east-1a
 127.0.0.2 | onprem | us-west-1    | us-west-1a
 127.0.0.3 | onprem | us-central-1 | us-central-1a
(3 rows)

Step 1: Bring up DataHub quickstart

# (on AlmaLinux)
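For completeness, here is one way to create such a cluster with yugabyted; the base directories are illustrative, and the cloud locations mirror the yb_servers() output above:

```shell
# Node 1 bootstraps the cluster (base_dir paths are illustrative)
./bin/yugabyted start --advertise_address=127.0.0.1 \
  --cloud_location=onprem.us-east-1.us-east-1a --base_dir=/home/yb/n1

# Nodes 2 and 3 join via node 1
./bin/yugabyted start --advertise_address=127.0.0.2 --join=127.0.0.1 \
  --cloud_location=onprem.us-west-1.us-west-1a --base_dir=/home/yb/n2

./bin/yugabyted start --advertise_address=127.0.0.3 --join=127.0.0.1 \
  --cloud_location=onprem.us-central-1.us-central-1a --base_dir=/home/yb/n3
```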
sudo dnf -y install git curl

# install Docker + Compose if needed, then:
git clone https://github.com/acryldata/datahub.git
cd datahub/docker/quickstart
docker compose -p datahub -f docker-compose.quickstart.yml up -d

  • UI: http://localhost:9002 (default creds: datahub / datahub)

  • GMS (REST): http://localhost:8080
    Quickstart details + caveats (dev-only defaults) are in the official docs.

Step 2: Prepare a DataHub recipe for YB (with smart driver)

Create /opt/recipes/yb_recipe.yml on the host:

source:
  type: postgres
  config:
    # Validator likes to see this even if we pass connect_args
    host_port: "localhost:5433"

    database: "yugabyte"
    username: "yugabyte"
    password: "YOUR_PASSWORD"

    # Hand-off to SQLAlchemy/psycopg2 (YB smart driver)
    options:
      connect_args:
        host: "127.0.0.1,127.0.0.2,127.0.0.3"
        port: "5433"          # single value (all nodes use 5433)
        load_balance: "true"  # string, not boolean

sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"

  • The Postgres source schema includes host_port and works with options.connect_args, which SQLAlchemy forwards to the driver.

  • The YB smart driver honors host lists and load_balance=true.

Note on networking: because our YB nodes are bound to loopback (127.x), the ingestion container must see the host network (next step). Containers can’t reach the host’s 127.x otherwise.

Step 3: Run ingestion with the smart driver installed

Use the official ingestion CLI image, install the driver inside it, and run with host networking so it can reach 127.0.0.1/.2/.3:

# --network host : reach the 127.x loopback nodes
# --user 0:0     : let pip write into the image
docker run --rm -it \
  --network host \
  --user 0:0 \
  -v /opt/recipes:/recipes \
  --entrypoint /bin/sh \
  acryldata/datahub-ingestion:head \
  -c "python -m pip uninstall -y psycopg2 psycopg2-binary && \
      python -m pip install --no-cache-dir psycopg2-yugabytedb-binary && \
      datahub ingest -c /recipes/yb_recipe.yml"

The datahub-ingestion image ships the CLI; using :head keeps it in step with the quickstart stack’s API.

If it runs, you’ll see tables discovered and posted to GMS.

Step 4: Prove client-side load balancing (tiny demo)

Save this script on the host:

cat >/opt/recipes/lb_demo.py <<'PY'
import os, collections
import psycopg2
from contextlib import closing

pw = os.environ.get("YB_PW", "")
n  = int(os.environ.get("N", "12"))

dsn = (
    "host=127.0.0.1,127.0.0.2,127.0.0.3 "
    "port=5433 dbname=yugabyte user=yugabyte "
    f"password={pw} load_balance=true connect_timeout=5"
)

print("Driver:", psycopg2.__version__, end=" ")
try:
    import psycopg2.loadbalanceproperties
    print("(YB smart driver)")
except Exception:
    print("(stock psycopg2)")

hits = collections.Counter()
for i in range(n):
    with closing(psycopg2.connect(dsn)) as conn:
        with conn.cursor() as cur:
            cur.execute("select inet_server_addr(), inet_server_port()")
            ip, port = cur.fetchone()
            print(f"{i+1:02d} -> {ip}:{port}")
            hits[(ip, port)] += 1

print("\nTally:")
for (ip, port), count in sorted(hits.items()):
    print(f"{ip}:{port} = {count}")
PY

Run it inside a clean Python container with a venv (pip warnings suppressed):

docker run --rm -it --network host \
  -e YB_PW='YOUR_PASSWORD' -e N=12 \
  -v /opt/recipes:/work -w /work \
  -e PIP_DISABLE_PIP_VERSION_CHECK=1 -e PIP_ROOT_USER_ACTION=ignore \
  python:3.11-slim sh -lc '
    set -e
    python -m venv .venv
    . .venv/bin/activate
    pip -qq install --no-cache-dir psycopg2-yugabytedb-binary
    python lb_demo.py
  '

You should see output like:

Driver: 2.9.3.5 (dt dec pq3 ext lo64) (YB smart driver)
01 -> 127.0.0.2:5433
02 -> 127.0.0.1:5433
03 -> 127.0.0.1:5433
04 -> 127.0.0.3:5433
05 -> 127.0.0.2:5433
06 -> 127.0.0.3:5433
07 -> 127.0.0.3:5433
08 -> 127.0.0.1:5433
09 -> 127.0.0.1:5433
10 -> 127.0.0.3:5433
11 -> 127.0.0.2:5433
12 -> 127.0.0.1:5433

Tally:
127.0.0.1:5433 = 5
127.0.0.2:5433 = 3
127.0.0.3:5433 = 4

That distribution across 127.0.0.1/.2/.3 confirms the smart driver is balancing new connections. Yugabyte’s smart-driver docs describe this behavior and options.

Production notes (what to do outside the quickstart lab)

1) Install the driver once in your ingestion environment (venv, container, or host):

pip uninstall -y psycopg2 psycopg2-binary
pip install psycopg2-yugabytedb-binary

(Use this instead of hot-installing inside the container each run.)

2) Use the same recipe shape you used above—options.connect_args is the most portable way to pass smart-driver params through DataHub → SQLAlchemy → psycopg2.

3) Networking:

  • If your YB nodes are on normal IPs (not 127.x), standard Docker bridging is fine.

  • If you keep a local multi-loopback lab, run ingestion with --network host as shown.
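For item 1, the same probe the Step 4 demo uses doubles as a quick sanity check that the YB build of psycopg2, rather than stock psycopg2, is on the path (a minimal sketch; using_yb_smart_driver is an illustrative helper):

```python
# Sketch: detect whether the YB smart driver (psycopg2-yugabytedb) is
# installed. The loadbalanceproperties module ships only with the YB fork,
# so a successful import distinguishes it from stock psycopg2. Returns
# False both for stock psycopg2 and when psycopg2 is absent entirely.
def using_yb_smart_driver() -> bool:
    try:
        import psycopg2.loadbalanceproperties  # noqa: F401
        return True
    except ImportError:
        return False

print("YB smart driver:", using_yb_smart_driver())
```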

Alternate source config (generic SQLAlchemy)

If you prefer a single DSN string, the generic SQLAlchemy source accepts connect_uri and avoids the Postgres validator entirely:

source:
  type: sqlalchemy
  config:
    platform: postgres
    connect_uri: "postgresql+psycopg2://yugabyte:YOUR_PASSWORD@/yugabyte?host=127.0.0.1,127.0.0.2,127.0.0.3&port=5433&load_balance=true"

sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"

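Everything after the ? in that connect_uri is forwarded by SQLAlchemy to the driver as keyword parameters. A stdlib-only sketch shows exactly what arrives (the URI mirrors the recipe above):

```python
# Sketch: inspect the query parameters SQLAlchemy forwards from the
# connect_uri to the driver. Stdlib only; the URI matches the recipe.
from urllib.parse import urlsplit, parse_qs

uri = ("postgresql+psycopg2://yugabyte:YOUR_PASSWORD@/yugabyte"
       "?host=127.0.0.1,127.0.0.2,127.0.0.3&port=5433&load_balance=true")

params = parse_qs(urlsplit(uri).query)
print(params["host"][0].split(","))  # the three node addresses
print(params["load_balance"][0])     # the string "true"
```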
Why the Python Smart Driver Matters (Quick Summary)
  • Drop‑in upgrade: Install psycopg2‑yugabytedb‑binary and keep using DataHub’s Postgres source… no code changes to pipelines.

  • Better performance on distributed DBs: New connections are spread across tservers, avoiding single‑node hotspots and smoothing parallel metadata scans on large catalogs.

  • Higher resilience: On node issues or rolling upgrades, the next connection is routed to a healthy host, helping ingestion complete instead of flaking.

  • Topology options when you need them: You can steer traffic with topology keys (not required in this post) for AZ/region locality.

  • Works in dev and prod: Local 127.x labs (with --network host) and real clusters on routable IPs both benefit. You can also pair it with YSQL Connection Manager or HAProxy; client‑side and server‑side balancing aren’t mutually exclusive.

  • Easy to prove: The tiny Python script shows, in seconds, that connections are actually balanced, great for demos and change reviews.

Bottom line: flipping to the Yugabyte Python smart driver turns a single‑endpoint DataHub crawl into a cluster‑aware, load‑balanced pipeline… exactly what you want for a distributed SQL database. 🔥 

Have Fun!

Statue or Maple? Spoiler: it’s Maple... our daughter’s Golden Retriever, permanently on standby for her next trip outside!