Introduction
Finding the right U.S. National Park to explore is often about vibes. Some people want jagged mountains and glaciers, others want red rock arches, and others want the best night skies for stargazing.
Wouldn’t it be cool if you could just ask a database:
● “Which park has amazing stargazing?”
● “Where can I hike among red rock canyons?”
● “I want a coastal park with rocky shorelines.”
In this YugabyteDB Tip, we’ll use pgvector + YugabyteDB to build a small semantic search demo over National Parks using free local embeddings from Hugging Face. (You can swap in paid models like OpenAI if you prefer; details at the end.)
Setup Checklist
Before you start, make sure you have:
● YugabyteDB with the
pgvectorextension enabled● Python 3 + pip
● Required Python libraries:
pip3 install sentence-transformers psycopg2-binary
- ● (Optional) OpenAI API key if you want to try paid embedding models
Step 1: Install the extension (In YB)
CREATE EXTENSION IF NOT EXISTS vector;
Step 2: Create tables & seed 20 parks
We store park text + vectors in national_parks and query prompts + vectors in search_queries.
DROP TABLE IF EXISTS search_queries;
DROP TABLE IF EXISTS national_parks;
CREATE TABLE national_parks (
park_id SERIAL PRIMARY KEY,
name TEXT NOT NULL,
location TEXT,
description TEXT,
embedding vector(384) -- 384 dims for Hugging Face MiniLM
);
CREATE TABLE search_queries (
name TEXT PRIMARY KEY,
prompt TEXT NOT NULL,
embedding vector(384) -- same dimension as parks
);
Here are a few to get started (add more anytime):
INSERT INTO national_parks (name, location, description) VALUES
('Acadia', 'Maine', 'Dramatic rocky coastline, Cadillac Mountain sunrise, granite peaks, ocean views.'),
('Arches', 'Utah', 'Over 2,000 natural sandstone arches, red rock desert, Delicate Arch hikes.'),
('Badlands', 'South Dakota', 'Eroded buttes and pinnacles, fossil beds, mixed-grass prairie.'),
('Big Bend', 'Texas', 'Chisos Mountains and Chihuahuan Desert, Rio Grande canyons, starry night skies.'),
('Bryce Canyon', 'Utah', 'Hoodoo amphitheaters, orange spires, high-elevation viewpoints, great stargazing.'),
('Carlsbad Caverns', 'New Mexico', 'Massive limestone caverns, stalactites, evening bat flights.'),
('Crater Lake', 'Oregon', 'Deep blue caldera lake, Wizard Island, scenic rim drives.'),
('Everglades', 'Florida', 'Subtropical wetlands, mangroves, alligators and manatees.'),
('Glacier', 'Montana', 'Jagged peaks, glaciers, alpine lakes, Going-to-the-Sun Road.'),
('Grand Canyon', 'Arizona', 'Vast layered canyon, rim-to-river hikes, Colorado River vistas.'),
('Grand Teton', 'Wyoming', 'Granite skyline, alpine lakes, wildlife, float trips on the Snake River.'),
('Great Basin', 'Nevada', 'Remote desert mountains, ancient bristlecone pines, internationally dark skies ideal for stargazing.'),
('Great Smoky Mountains', 'Tennessee–North Carolina', 'Lush forests, misty ridges, wildflowers, Appalachian culture.'),
('Joshua Tree', 'California', 'Mojave & Colorado deserts, Joshua trees, jumbo boulders, night sky.'),
('Mount Rainier', 'Washington', 'Glaciated stratovolcano, wildflower meadows, subalpine trails.'),
('Olympic', 'Washington', 'Rugged coastline, temperate rainforest, Olympic peaks, tidepools.'),
('Rocky Mountain', 'Colorado', 'High alpine passes, tundra, elk herds, Trail Ridge Road.'),
('Yellowstone', 'Wyoming–Montana–Idaho', 'Geysers and hot springs, wildlife, Grand Prismatic Spring.'),
('Yosemite', 'California', 'Granite cliffs, waterfalls, Yosemite Valley, Half Dome, El Capitan.'),
('Zion', 'Utah', 'Navajo sandstone canyons, The Narrows, Angels Landing.');
Add at least one query prompt you want to ask in SQL:
INSERT INTO search_queries (name, prompt)
VALUES ('stargazing',
'Which U.S. National Park is best for stargazing with the darkest skies, astronomy viewing, Milky Way visibility, low light pollution?')
ON CONFLICT (name) DO UPDATE SET prompt = EXCLUDED.prompt;
Step 3: Embed from the database (generic, no hard-coding)
This tiny script reads rows from your tables, creates normalized embeddings, and writes them back.
● It fills missing vectors by default.
● If you edit text later, set that row’s
embedding = NULLand re-run; or run with--reembed-all.
Save this as load_embeddings.py:
#!/usr/bin/env python3
import os, argparse
import psycopg2
from sentence_transformers import SentenceTransformer
def main():
ap = argparse.ArgumentParser()
ap.add_argument("--model", default=os.getenv("EMBED_MODEL", "sentence-transformers/all-MiniLM-L6-v2"))
ap.add_argument("--reembed-all", action="store_true")
ap.add_argument("--batch-size", type=int, default=int(os.getenv("BATCH_SIZE", "32")))
args = ap.parse_args()
conn = psycopg2.connect(
dbname=os.getenv("YB_DB", "yugabyte"),
user=os.getenv("YB_USER", "yugabyte"),
host=os.getenv("YB_HOST", "127.0.0.1"),
port=int(os.getenv("YB_PORT", "5433")),
password=os.getenv("YB_PASSWORD") or None,
)
print(f"Loading model: {args.model}")
model = SentenceTransformer(args.model)
def encode(texts):
return model.encode(texts, normalize_embeddings=True, batch_size=args.batch_size).tolist()
with conn, conn.cursor() as cur:
# Parks to embed
cur.execute(
"SELECT park_id, description FROM national_parks "
+ ("" if args.reembed_all else "WHERE embedding IS NULL ")
+ "ORDER BY park_id;"
)
parks = cur.fetchall()
if parks:
ids, texts = zip(*parks)
vecs = encode(list(texts))
for pid, vec in zip(ids, vecs):
cur.execute("UPDATE national_parks SET embedding = %s WHERE park_id = %s", (vec, pid))
print(f"✅ embedded/updated {len(parks)} park rows")
else:
print("ℹ️ no park rows to embed (use --reembed-all to force)")
# Queries to embed
cur.execute(
"SELECT name, prompt FROM search_queries "
+ ("" if args.reembed_all else "WHERE embedding IS NULL ")
+ "ORDER BY name;"
)
queries = cur.fetchall()
if queries:
names, prompts = zip(*queries)
qvecs = encode(list(prompts))
for name, vec in zip(names, qvecs):
cur.execute("UPDATE search_queries SET embedding = %s WHERE name = %s", (vec, name))
print(f"✅ embedded/updated {len(queries)} query rows")
else:
print("ℹ️ no query rows to embed (insert prompts in search_queries and re-run)")
if __name__ == "__main__":
main()
Run it once to fill vectors:
python3 embed_all.py
Example:
[root@localhost ~]# python3 embed_all.py
Loading model: sentence-transformers/all-MiniLM-L6-v2
✅ embedded/updated 20 park rows
✅ embedded/updated 1 query rows
If you later change any description or prompt, do:
UPDATE national_parks SET embedding = NULL WHERE park_id = ...;
UPDATE search_queries SET embedding = NULL WHERE name = ...;
… then run python3 embed_all.py again.
Or refresh everything at once:
python3 embed_all.py --reembed-all
Step 4: Ask questions in pure SQL 🎉
With normalized vectors, the cosine operator <=> works best. Lower = closer.
Question: Which park has amazing stargazing?
WITH q AS (SELECT embedding FROM search_queries WHERE name = 'stargazing')
SELECT p.name, p.location, p.description,
p.embedding <=> (SELECT embedding FROM q) AS cosine_distance
FROM national_parks p
ORDER BY cosine_distance
LIMIT 10;
Example:
yugabyte=# WITH q AS (SELECT embedding FROM search_queries WHERE name = 'stargazing')
yugabyte-# SELECT p.name, p.location, p.description,
yugabyte-# p.embedding <=> (SELECT embedding FROM q) AS cosine_distance
yugabyte-# FROM national_parks p
yugabyte-# ORDER BY cosine_distance
yugabyte-# LIMIT 10;
name | location | description | cosine_distance
---------------+------------+------------------------------------------------------------------------------------------------------+--------------------
Great Basin | Nevada | Remote desert mountains, ancient bristlecone pines, internationally dark skies ideal for stargazing. | 0.4083311734100803
Bryce Canyon | Utah | Hoodoo amphitheaters, orange spires, high-elevation viewpoints, great stargazing. | 0.4721805610602906
Big Bend | Texas | Chisos Mountains and Chihuahuan Desert, Rio Grande canyons, starry night skies. | 0.5010896175775026
Joshua Tree | California | Mojave & Colorado deserts, Joshua trees, jumbo boulders, night sky. | 0.5195259297035338
Acadia | Maine | Dramatic rocky coastline, Cadillac Mountain sunrise, granite peaks, ocean views. | 0.5489347010215939
Mount Rainier | Washington | Glaciated stratovolcano, wildflower meadows, subalpine trails. | 0.5646119208980821
Olympic | Washington | Rugged coastline, temperate rainforest, Olympic peaks, tidepools. | 0.5927502336278472
Grand Teton | Wyoming | Granite skyline, alpine lakes, wildlife, float trips on the Snake River. | 0.6042105611374429
Zion | Utah | Navajo sandstone canyons, The Narrows, Angels Landing. | 0.6050795199000691
Grand Canyon | Arizona | Vast layered canyon, rim-to-river hikes, Colorado River vistas. | 0.6103533324493949
(10 rows)
You should see Great Basin at/near the top. 🌌
Optional — Hybrid nudge
Add a tiny keyword bonus to break ties for very specific intents:
WITH q AS (SELECT embedding FROM search_queries WHERE name='stargazing'),
parks AS (
SELECT p.*,
(CASE WHEN description ILIKE '%dark%' THEN 1 ELSE 0 END) +
(CASE WHEN description ILIKE '%stargaz%' THEN 1 ELSE 0 END) +
(CASE WHEN description ILIKE '%milky way%' THEN 1 ELSE 0 END) +
(CASE WHEN description ILIKE '%astronom%' THEN 1 ELSE 0 END) +
(CASE WHEN description ILIKE '%light pollution%' THEN 1 ELSE 0 END) AS lex_hits
FROM national_parks p
)
SELECT name, location, description,
(embedding <=> (SELECT embedding FROM q)) - (0.11 * lex_hits) AS score,
lex_hits
FROM parks
ORDER BY score
LIMIT 10;
Example:
yugabyte=# WITH q AS (SELECT embedding FROM search_queries WHERE name='stargazing'),
yugabyte-# parks AS (
yugabyte(# SELECT p.*,
yugabyte(# (CASE WHEN description ILIKE '%dark%' THEN 1 ELSE 0 END) +
yugabyte(# (CASE WHEN description ILIKE '%stargaz%' THEN 1 ELSE 0 END) +
yugabyte(# (CASE WHEN description ILIKE '%milky way%' THEN 1 ELSE 0 END) +
yugabyte(# (CASE WHEN description ILIKE '%astronom%' THEN 1 ELSE 0 END) +
yugabyte(# (CASE WHEN description ILIKE '%light pollution%' THEN 1 ELSE 0 END) AS lex_hits
yugabyte(# FROM national_parks p
yugabyte(# )
yugabyte-# SELECT name, location, description,
yugabyte-# (embedding <=> (SELECT embedding FROM q)) - (0.11 * lex_hits) AS score,
yugabyte-# lex_hits
yugabyte-# FROM parks
yugabyte-# ORDER BY score
yugabyte-# LIMIT 10;
name | location | description | score | lex_hits
---------------+------------+------------------------------------------------------------------------------------------------------+--------------------+----------
Great Basin | Nevada | Remote desert mountains, ancient bristlecone pines, internationally dark skies ideal for stargazing. | 0.1883311734100803 | 2
Bryce Canyon | Utah | Hoodoo amphitheaters, orange spires, high-elevation viewpoints, great stargazing. | 0.3621805610602906 | 1
Big Bend | Texas | Chisos Mountains and Chihuahuan Desert, Rio Grande canyons, starry night skies. | 0.5010896175775026 | 0
Joshua Tree | California | Mojave & Colorado deserts, Joshua trees, jumbo boulders, night sky. | 0.5195259297035338 | 0
Acadia | Maine | Dramatic rocky coastline, Cadillac Mountain sunrise, granite peaks, ocean views. | 0.5489347010215939 | 0
Mount Rainier | Washington | Glaciated stratovolcano, wildflower meadows, subalpine trails. | 0.5646119208980821 | 0
Olympic | Washington | Rugged coastline, temperate rainforest, Olympic peaks, tidepools. | 0.5927502336278472 | 0
Grand Teton | Wyoming | Granite skyline, alpine lakes, wildlife, float trips on the Snake River. | 0.6042105611374429 | 0
Zion | Utah | Navajo sandstone canyons, The Narrows, Angels Landing. | 0.6050795199000691 | 0
Grand Canyon | Arizona | Vast layered canyon, rim-to-river hikes, Colorado River vistas. | 0.6103533324493949 | 0
(10 rows)
Paid models (optional upgrade)
This YugabyteDB tip uses Hugging Face MiniLM (384-dim, free & local).
If you want stronger semantic accuracy:
● Hugging Face all-mpnet-base-v2 (768-dim)
◦ Change both columns to
vector(768)and runpython3 embed_all.py --reembed-allwithEMBED_MODEL=sentence-transformers/all-mpnet-base-v2.
● OpenAI (paid)
text-embedding-3-small(1536-dim)◦ Change both columns to
vector(1536)and use a similar worker that calls OpenAI’s API to embed rows; your SQL stays the same (<=>).
Troubleshooting
● Rankings seem off? Re-run the worker with
--reembed-allto ensure everything is normalized and up-to-date.● Changed text but results didn’t move? Set that row’s
embedding = NULLand rerun the worker.● Dimension mismatch error? Make sure your
vector(N)size matches the model you’re using.
Summary
Why use pgvector here instead of plain SQL with LIKE or regex?
● Understands meaning, not just words.
ILIKE '%stargazing%'misses parks that say “dark skies,” “Milky Way,” or “low light pollution.” Embeddings capture these as the same idea, so results show up even when the exact term doesn’t.● Ranks by closeness. Keyword search is binary (match/no match). pgvector orders parks by how well they match the intent, so “Great Basin” naturally bubbles to the top for stargazing.
- ● Handles paraphrases, synonyms, typos, and phrasing.
“red rock arches,” “sandstone spans,” “arches in the desert,” or even a small typo still retrieve Arches. Regex would need a growing, fragile ruleset. - ● Works across varied, natural queries. Users ask in free text; you shouldn’t have to maintain a manual synonym dictionary. Embeddings are learned once and reused.
- ● Multilingual/fuzzy text friendly. Many sentence models map different languages and phrasing into a shared space, something
LIKEcan’t do. - ● Plays well with classic filters. You can filter with regular SQL (
WHERE state IN ('UT','NV')) and then rank semantically with<=>. Best of both worlds.
Quick contrast:
-- Keyword
SELECT name FROM national_parks WHERE description ILIKE '%stargazing%';
-- Semantic (pgvector)
WITH q AS (SELECT embedding FROM search_queries WHERE name='stargazing')
SELECT name
FROM national_parks
ORDER BY embedding <=> (SELECT embedding FROM q)
LIMIT 5;
When LIKE/regex is enough: tightly structured lookups (“state = ‘Utah’ AND has_arches = true”) or when you truly need exact words.
When pgvector shines: open-ended, vibe-driven questions (“best for stargazing,” “rugged rocky coastline”), where synonyms and meaning matter. and you want results ranked by relevance with one simple ORDER BY.
Have Fun!
