Most modern apps want to store semi-structured data as JSON/JSONB, but a lot of legacy or external systems still speak XML. YugabyteDB YSQL has excellent JSONB support but no built-in XML type or parser.
In this tip we’ll build a generic XML → JSONB user-defined function (UDF) that:
● Works in plain YugabyteDB YSQL / yugabyted.
● Does not hard-code any tag names or schemas.
● Handles nested elements and repeated tags (arrays).
● Produces a nested
jsonbtree from arbitrary “reasonable” XML.
We’ll also add an optional helper to auto-cast numeric-looking strings into numeric values.
Finally, we’ll run it on a realistic sample: a RiskAssessmentEnvelope XML payload from a (fictional) risk scoring gateway.
1. Generic XML → JSONB in YSQL
This function walks an XML string as plain text:
● Strips the XML declaration (
<?xml ...?>).● Scans for tags
<Tag ...>/</Tag>.● Ignores attributes (keeps only tag names and content).
● Builds a JSON tree:
○ Each tag name becomes a key.
○ If there are multiple siblings with the same tag name, they become a JSON array.
○ Nested XML becomes nested objects.
⚠️ This is a best-effort XML parser, not a full XML engine. It works well for “API-style” XML but ignores attributes and namespaces.
Function: xml_to_jsonb_generic(text) → jsonb
CREATE OR REPLACE FUNCTION xml_to_jsonb_generic(p_xml text)
RETURNS jsonb
LANGUAGE plpgsql
AS $$
DECLARE
v_xml text := p_xml;
v_children jsonb := '{}'::jsonb; -- key -> array of values
v_result jsonb := '{}'::jsonb;
pos int := 1;
xml_len int;
sub text;
start_rel int;
start_tag_pos int;
end_tag_rel int;
end_tag_pos int;
tag_full text;
tag_name_raw text;
tag_name text;
after_open_pos int;
sub_after_open text;
close_rel int;
close_pos int;
close_end_pos int;
inner_text text;
closing text;
child_value jsonb;
existing_vals jsonb;
k text;
arr jsonb;
BEGIN
-- Strip XML declaration like
v_xml := regexp_replace(v_xml, '^\s*<\?xml[^>]*>\s*', '', 'g');
xml_len := length(v_xml);
-- Walk the XML text and parse tags one by one
WHILE pos <= xml_len LOOP
sub := substring(v_xml from pos);
start_rel := strpos(sub, '<');
IF start_rel = 0 THEN
EXIT; -- no more tags
END IF;
start_tag_pos := pos + start_rel - 1;
-- Skip closing tags like
IF substring(v_xml from start_tag_pos+1 for 1) = '/' THEN
pos := start_tag_pos + 2;
CONTINUE;
END IF;
-- Skip declarations / comments / DOCTYPE: ...>,
IF substring(v_xml from start_tag_pos+1 for 1) IN ('?', '!') THEN
sub := substring(v_xml from start_tag_pos);
end_tag_rel := strpos(sub, '>');
IF end_tag_rel = 0 THEN
EXIT;
END IF;
pos := start_tag_pos + end_tag_rel;
CONTINUE;
END IF;
-- Find end of opening tag:
sub := substring(v_xml from start_tag_pos);
end_tag_rel := strpos(sub, '>');
IF end_tag_rel = 0 THEN
EXIT; -- malformed
END IF;
end_tag_pos := start_tag_pos + end_tag_rel - 1;
-- Extract tag name (first token, trim any trailing '/')
tag_full := substring(v_xml from start_tag_pos+1 for end_tag_pos - start_tag_pos - 1);
tag_name_raw := split_part(tag_full, ' ', 1);
tag_name := trim(trailing '/' from tag_name_raw);
-- Self-closing tag like → treat as empty value
IF right(tag_full, 1) = '/' THEN
inner_text := '';
close_end_pos := end_tag_pos + 1;
ELSE
-- Find matching closing tag
closing := '' || tag_name || '>';
after_open_pos := end_tag_pos + 1;
sub_after_open := substring(v_xml from after_open_pos);
close_rel := strpos(sub_after_open, closing);
IF close_rel = 0 THEN
-- Malformed; treat as empty
inner_text := '';
close_end_pos := end_tag_pos + 1;
ELSE
close_pos := after_open_pos + close_rel - 1;
inner_text := substring(v_xml from after_open_pos for close_pos - after_open_pos);
close_end_pos := close_pos + length(closing);
END IF;
END IF;
-- Determine value: nested structure or plain text?
IF position('<' IN inner_text) > 0 THEN
child_value := xml_to_jsonb_generic(inner_text);
ELSE
child_value := to_jsonb(btrim(inner_text));
END IF;
-- Append child_value to v_children[tag_name] array
existing_vals := v_children -> tag_name;
IF existing_vals IS NULL THEN
v_children := v_children || jsonb_build_object(tag_name, jsonb_build_array(child_value));
ELSE
v_children :=
jsonb_set(
v_children,
ARRAY[tag_name],
existing_vals || jsonb_build_array(child_value),
true
);
END IF;
-- Advance position
pos := close_end_pos;
END LOOP;
-- If no children parsed, return leaf text
IF v_children = '{}'::jsonb THEN
RETURN to_jsonb(btrim(v_xml));
END IF;
-- Convert arrays into either scalar or array in final result
FOR k, arr IN SELECT key, value FROM jsonb_each(v_children) LOOP
IF jsonb_array_length(arr) = 1 THEN
v_result := v_result || jsonb_build_object(k, arr->0);
ELSE
v_result := v_result || jsonb_build_object(k, arr);
END IF;
END LOOP;
RETURN v_result;
END;
$$;
What you get:
● The root tag becomes the top-level key (e.g.
"RiskAssessmentEnvelope").● Each child tag becomes a field.
● Repeated tags at the same level (
<UltimateBeneficialOwner>…</UltimateBeneficialOwner>multiple times) become JSON arrays.
2. Optional: Auto-casting numeric-looking strings in JSONB
By default, all values come out as strings (because XML is text). Sometimes that’s fine; sometimes you want { "OwnershipPercent": 37.5 } as a number, not "37.5".
This helper walks any JSONB and:
● Leaves numbers, booleans, null as-is.
● For string values:
○ If they match a simple numeric pattern (
123or123.45), converts tonumeric.○ Otherwise leaves them as strings.
● Recurses through objects and arrays.
Function: jsonb_autocast_numbers(jsonb) → jsonb
CREATE OR REPLACE FUNCTION jsonb_autocast_numbers(p_j jsonb)
RETURNS jsonb
LANGUAGE plpgsql
AS $$
DECLARE
v_type text;
v_key text;
v_val jsonb;
v_obj jsonb := '{}'::jsonb;
v_arr jsonb := '[]'::jsonb;
v_text text;
BEGIN
IF p_j IS NULL THEN
RETURN NULL;
END IF;
v_type := jsonb_typeof(p_j);
IF v_type = 'string' THEN
-- Get the underlying text
v_text := p_j #>> '{}';
-- Simple numeric pattern: integer or decimal
IF v_text ~ '^[0-9]+(\.[0-9]+)?$' THEN
RETURN to_jsonb(v_text::numeric);
ELSE
RETURN p_j;
END IF;
ELSIF v_type = 'object' THEN
FOR v_key, v_val IN
SELECT key, value FROM jsonb_each(p_j)
LOOP
v_obj := v_obj || jsonb_build_object(
v_key,
jsonb_autocast_numbers(v_val)
);
END LOOP;
RETURN v_obj;
ELSIF v_type = 'array' THEN
FOR v_val IN
SELECT value FROM jsonb_array_elements(p_j)
LOOP
v_arr := v_arr || jsonb_build_array(
jsonb_autocast_numbers(v_val)
);
END LOOP;
RETURN v_arr;
ELSE
-- number, boolean, null → unchanged
RETURN p_j;
END IF;
END;
$$;
Use it when you want numeric-looking things ("37.5", "3000", "2") converted, without hard-coding field names.
3. Example: RiskAssessmentEnvelope XML → JSONB
Here’s a sample XML from a fictitious Risk Gateway:
RISK-REQ-742981
2025-11-18T16:42:11Z
YB-Risk-Gateway
74a7a09b-e678-4b1c-8328-52e9f6ff3d92
UAT
ORG-003914
Global Compliance Advisors LLC
EMEA
FINANCIAL_INSTITUTIONS
j.smith.risk.analyst
SeniorRiskOfficer
EnterpriseRisk
en-GB
RA-2025-000172
KYC_ONBOARDING
API
HIGH
CUST-2048557
LEGAL_ENTITY
Nordic Atlas Trading S.A.
LU
EU
5221
Securities and Commodities Trading
Elena Kovác
37.5
CZ
Atlas Holdings Trust
62.5
LI
PEP_SCREENING
SANCTIONS_LISTS
ADVERSE_MEDIA
IDENTITY_VERIFICATION
BUSINESS_REGISTRY_LOOKUP
COUNTERPARTY_RISK_SCORE
TR-RISK-MODEL-V5
3000
false
SUMMARY_AND_FACTS
https://api.example.com/risk/callback/v1
POST
OAUTH2_CLIENT_CREDENTIALS
risk-client-019
true
HIGH
risk-alerts@example.com
3.1 Convert XML → JSONB
SELECT xml_to_jsonb_generic($$
RISK-REQ-742981
2025-11-18T16:42:11Z
YB-Risk-Gateway
74a7a09b-e678-4b1c-8328-52e9f6ff3d92
UAT
ORG-003914
Global Compliance Advisors LLC
EMEA
FINANCIAL_INSTITUTIONS
j.smith.risk.analyst
SeniorRiskOfficer
EnterpriseRisk
en-GB
RA-2025-000172
KYC_ONBOARDING
API
HIGH
CUST-2048557
LEGAL_ENTITY
Nordic Atlas Trading S.A.
LU
EU
5221
Securities and Commodities Trading
Elena Kovác
37.5
CZ
Atlas Holdings Trust
62.5
LI
PEP_SCREENING
SANCTIONS_LISTS
ADVERSE_MEDIA
IDENTITY_VERIFICATION
BUSINESS_REGISTRY_LOOKUP
COUNTERPARTY_RISK_SCORE
TR-RISK-MODEL-V5
3000
false
SUMMARY_AND_FACTS
https://api.example.com/risk/callback/v1
POST
OAUTH2_CLIENT_CREDENTIALS
risk-client-019
true
HIGH
risk-alerts@example.com
$$);
You’ll get a JSONB structure (formatted here for readability) roughly like:
{
"RiskAssessmentEnvelope": {
"Header": {
"MessageId": "RISK-REQ-742981",
"Timestamp": "2025-11-18T16:42:11Z",
"Environment": "UAT",
"SourceSystem": "YB-Risk-Gateway",
"CorrelationId": "74a7a09b-e678-4b1c-8328-52e9f6ff3d92"
},
"ClientContext": {
"User": {
"Role": "SeniorRiskOfficer",
"Locale": "en-GB",
"UserId": "j.smith.risk.analyst",
"Department": "EnterpriseRisk"
},
"Organization": {
"RegionCode": "EMEA",
"MarketSegment": "FINANCIAL_INSTITUTIONS",
"OrganizationId": "ORG-003914",
"OrganizationName": "Global Compliance Advisors LLC"
}
},
"RiskRequestBody": {
"RiskAssessmentRequest": {
"Channel": "API",
"Subject": {
"LegalName": "Nordic Atlas Trading S.A.",
"SubjectId": "CUST-2048557",
"SubjectType": "LEGAL_ENTITY",
"IndustryCode": "5221",
"IndustryDescription": "Securities and Commodities Trading",
"PrimaryJurisdiction": "EU",
"CountryOfIncorporation": "LU"
},
"Priority": "HIGH",
"RequestId": "RA-2025-000172",
"Configuration": {
"MaxLatencyMs": "3000",
"ScoringModelId": "TR-RISK-MODEL-V5",
"IncludeDebugInfo": "false",
"ReturnLevelOfDetail": "SUMMARY_AND_FACTS"
},
"AssessmentType": "KYC_ONBOARDING",
"RiskDimensions": {
"Dimension": [
{
"RequestedChecks": {
"CheckCode": [
"PEP_SCREENING",
"SANCTIONS_LISTS",
"ADVERSE_MEDIA"
]
}
},
{
"RequestedChecks": {
"CheckCode": [
"IDENTITY_VERIFICATION",
"BUSINESS_REGISTRY_LOOKUP"
]
}
},
{
"RequestedChecks": {
"CheckCode": "COUNTERPARTY_RISK_SCORE"
}
}
]
},
"OwnershipStructure": {
"UltimateBeneficialOwner": [
{
"Name": "Elena Kovác",
"OwnershipPercent": "37.5",
"CountryOfResidence": "CZ"
},
{
"Name": "Atlas Holdings Trust",
"OwnershipPercent": "62.5",
"CountryOfResidence": "LI"
}
]
}
},
"NotificationPreferences": {
"Callback": {
"Method": "POST",
"AuthType": "OAUTH2_CLIENT_CREDENTIALS",
"ClientId": "risk-client-019",
"CallbackUrl": "https://api.example.com/risk/callback/v1"
},
"EmailAlerts": {
"Enabled": "true",
"Recipient": "risk-alerts@example.com",
"SeverityThreshold": "HIGH"
}
}
}
}
}
Note:
● The root element is
"RiskAssessmentEnvelope".● UltimateBeneficialOwneris an array (two owners).● RiskDimensions.Dimensionis an array (AML, KYC, CREDIT).● Attributes like
Dimension name="AML"are not retained (we ignore attributes).
3.2 Optional: Auto-cast numeric strings
To turn numeric-like strings ("37.5", "5221", "3000") into numeric values:
SELECT jsonb_autocast_numbers(
xml_to_jsonb_generic($$
... full XML here ...
$$)
) AS envelope_json_typed;
Now things like:
●
"OwnershipPercent": "37.5"→37.5(numeric)●
"MaxLatencyMs": "3000"→3000●
"IndustryCode": "5221"→5221
… are actual numeric values in the JSON tree, without you hard-coding those field names.
4. Storing and querying the JSONB in YugabyteDB
A simple table to store these envelopes:
CREATE TABLE risk_envelopes (
id uuid PRIMARY KEY,
message_id text,
created_at timestamptz DEFAULT now(),
envelope_json jsonb
);
Insert from XML:
INSERT INTO risk_envelopes (id, message_id, envelope_json)
SELECT
gen_random_uuid(),
'RISK-REQ-742981',
jsonb_autocast_numbers(
xml_to_jsonb_generic($$
... full XML here ...
$$)
);
4.1 Example queries
All HIGH-priority requests:
SELECT id, message_id
FROM risk_envelopes
WHERE envelope_json
-> 'RiskAssessmentEnvelope'
-> 'RiskRequestBody'
-> 'RiskAssessmentRequest'
->> 'Priority' = 'HIGH';
Example:
yugabyte=# SELECT id, message_id
yugabyte-# FROM risk_envelopes
yugabyte-# WHERE envelope_json
yugabyte-# -> 'RiskAssessmentEnvelope'
yugabyte-# -> 'RiskRequestBody'
yugabyte-# -> 'RiskAssessmentRequest'
yugabyte-# ->> 'Priority' = 'HIGH';
id | message_id
--------------------------------------+-----------------
c1933f5d-9bde-47f2-a89c-4ba11d02b0f2 | RISK-REQ-742981
(1 row)
All UBOs with OwnershipPercent > 25:
SELECT
id,
ubo->>'Name' AS ubo_name,
(ubo->>'OwnershipPercent')::numeric AS pct
FROM risk_envelopes e
CROSS JOIN LATERAL
jsonb_array_elements(
e.envelope_json
-> 'RiskAssessmentEnvelope'
-> 'RiskRequestBody'
-> 'RiskAssessmentRequest'
-> 'OwnershipStructure'
-> 'UltimateBeneficialOwner'
) AS ubo
WHERE (ubo->>'OwnershipPercent')::numeric > 25;
Example:
yugabyte=# SELECT
yugabyte-# id,
yugabyte-# ubo->>'Name' AS ubo_name,
yugabyte-# (ubo->>'OwnershipPercent')::numeric AS pct
yugabyte-# FROM risk_envelopes e
yugabyte-# CROSS JOIN LATERAL
yugabyte-# jsonb_array_elements(
yugabyte(# e.envelope_json
yugabyte(# -> 'RiskAssessmentEnvelope'
yugabyte(# -> 'RiskRequestBody'
yugabyte(# -> 'RiskAssessmentRequest'
yugabyte(# -> 'OwnershipStructure'
yugabyte(# -> 'UltimateBeneficialOwner'
yugabyte(# ) AS ubo
yugabyte-# WHERE (ubo->>'OwnershipPercent')::numeric > 25;
id | ubo_name | pct
--------------------------------------+----------------------+------
c1933f5d-9bde-47f2-a89c-4ba11d02b0f2 | Elena Kovác | 37.5
c1933f5d-9bde-47f2-a89c-4ba11d02b0f2 | Atlas Holdings Trust | 62.5
(2 rows)
All requested risk checks (flattened):
SELECT DISTINCT
e.id,
rc AS check_code
FROM risk_envelopes e
CROSS JOIN LATERAL
jsonb_array_elements(
e.envelope_json
-> 'RiskAssessmentEnvelope'
-> 'RiskRequestBody'
-> 'RiskAssessmentRequest'
-> 'RiskDimensions'
-> 'Dimension'
) AS dim
CROSS JOIN LATERAL
jsonb_array_elements_text(
CASE jsonb_typeof(dim->'RequestedChecks'->'CheckCode')
WHEN 'array' THEN dim->'RequestedChecks'->'CheckCode'
ELSE jsonb_build_array(dim->'RequestedChecks'->'CheckCode')
END
) AS rc;
Example:
yugabyte=# SELECT DISTINCT
yugabyte-# e.id,
yugabyte-# rc AS check_code
yugabyte-# FROM risk_envelopes e
yugabyte-# CROSS JOIN LATERAL
yugabyte-# jsonb_array_elements(
yugabyte(# e.envelope_json
yugabyte(# -> 'RiskAssessmentEnvelope'
yugabyte(# -> 'RiskRequestBody'
yugabyte(# -> 'RiskAssessmentRequest'
yugabyte(# -> 'RiskDimensions'
yugabyte(# -> 'Dimension'
yugabyte(# ) AS dim
yugabyte-# CROSS JOIN LATERAL
yugabyte-# jsonb_array_elements_text(
yugabyte(# CASE jsonb_typeof(dim->'RequestedChecks'->'CheckCode')
yugabyte(# WHEN 'array' THEN dim->'RequestedChecks'->'CheckCode'
yugabyte(# ELSE jsonb_build_array(dim->'RequestedChecks'->'CheckCode')
yugabyte(# END
yugabyte(# ) AS rc;
id | check_code
--------------------------------------+--------------------------
c1933f5d-9bde-47f2-a89c-4ba11d02b0f2 | COUNTERPARTY_RISK_SCORE
c1933f5d-9bde-47f2-a89c-4ba11d02b0f2 | PEP_SCREENING
c1933f5d-9bde-47f2-a89c-4ba11d02b0f2 | ADVERSE_MEDIA
c1933f5d-9bde-47f2-a89c-4ba11d02b0f2 | IDENTITY_VERIFICATION
c1933f5d-9bde-47f2-a89c-4ba11d02b0f2 | SANCTIONS_LISTS
c1933f5d-9bde-47f2-a89c-4ba11d02b0f2 | BUSINESS_REGISTRY_LOOKUP
(6 rows)
5. Caveats and when to parse XML in the app tier
This pattern is practical and works well for:
● “API-shaped” XML with nested elements and repeated tags.
● Use cases where you mostly need a JSON representation for querying.
● Demos and prototypes where installing extensions or server-side languages isn’t desirable.
But it’s still a simplified XML view:
● Attributes (like
Dimension name="AML") are ignored.● Namespaces are not modeled; only the element names are used.
● It assumes reasonably well-formed XML; malformed or very exotic XML may not parse correctly.
● For very high-throughput or compliance-sensitive use cases, a proper XML library in the application tier (Java, Python, Go, etc.) is still the best option.
Summary
With just two YSQL UDFs:
●
xml_to_jsonb_generic(text) → jsonb●
jsonb_autocast_numbers(jsonb) → jsonb
… you can:
● Ingest XML into YugabyteDB Anywhere / yugabyted without extra extensions.
● Get a reasonably faithful JSONB representation of nested XML.
● Automatically convert numeric-looking values to actual numbers.
● Use YugabyteDB’s distributed SQL and JSONB features to query, index, and analyze what used to be “just XML blobs”.
Have Fun!
