It is not uncommon for instrumentation, webhooks, and events to end up in large piles of (messy, terrifying, unusable) JSON.
But Buz makes it easy to create well-structured piles of JSON. 🐝
And DuckDB makes it easy to query said piles of JSON. 🦆
The combination of 🦆's and 🐝's is pretty cool.
Schematizing Data
A lot can be said about using schemas to describe data.
Schemas can be used to auto-generate tracking SDKs, seed data dictionaries and discovery mechanisms, version payloads, and more.
But schemas also provide significant value far downstream of data generation, data collection, and "organizational contracts".
Payload validation and annotation
Buz uses metadata attributes defined within schemas to validate and annotate payloads.
For example, this schema:
{
"$schema": "https://registry.buz.dev/s/io.silverton/buz/internal/meta/v1.0.json",
"$id": "io.silverton/buz/example/birdsBees/v1.0.json",
"title": "io.silverton/buz/example/birdsBees/v1.0.json",
"description": "Schema for birds and bees",
"owner": {
"org": "silverton",
"team": "buz",
"individual": "jakthom"
},
"self": {
"vendor": "io.silverton",
"namespace": "buz.example.birdsBees",
"version": "1.0"
},
"type": "object",
"properties": {
"uuid": {
"description": "The event uuid",
"type": "string",
"format": "uuid"
},
"event_name": {
"description": "The name of the event",
"type": "string"
},
"organization": {
"type": "object",
"properties": {
"name": {
"description": "The organization name",
"type": "string"
},
"id": {
"description": "The organization id",
"type": "number"
}
}
}
},
"required": [
"uuid",
"event_name",
"organization"
],
"additionalProperties": false
}
and this payload:
{
"uuid": "b06834d6-aea1-4ef2-9c00-8cd5aa76c8e2",
"event_name": "viewed",
"organization": {
"name": "bee corp",
"id": 10
}
}
will result in this envelope:
[
{
"uuid": "28b0b531-285f-46f7-88e5-92b34d403fcd",
"timestamp": "2023-04-28T21:09:16.960953Z",
"buzTimestamp": "2023-04-28T21:09:16.960953Z",
"buzVersion": "x.x.dev",
"buzName": "buz-bootstrap",
"buzEnv": "development",
"protocol": "webhook",
"schema": "io.silverton/buz/example/birdsBees/v1.0.json",
"vendor": "io.silverton",
"namespace": "buz.example.birdsBees",
"version": "1.0",
"isValid": true,
"contexts": {
"io.silverton/buz/internal/contexts/httpHeaders/v1.0.json": {
"Accept": "*/*",
"Content-Length": "153",
"Content-Type": "application/json",
"User-Agent": "curl/7.86.0"
}
},
"payload": {
"event_name": "viewed",
"organization": {
"id": 10,
"name": "bee corp"
},
"uuid": "b06834d6-aea1-4ef2-9c00-8cd5aa76c8e2"
}
}
]
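The validation step can be sketched in a few lines of Python. This is a simplified, hand-rolled illustration of what a JSON Schema validator does (Buz uses a real validator; `validate`, `TYPE_MAP`, and the trimmed-down schema below are ours), covering only the `required`, `type`, and `additionalProperties` keywords:

```python
# Simplified sketch of schema-based payload validation. Covers only the
# "required", "type", and "additionalProperties" keywords for illustration.
TYPE_MAP = {"string": str, "number": (int, float), "object": dict}

def validate(payload: dict, schema: dict) -> bool:
    """Return True when the payload satisfies this tiny subset of JSON Schema."""
    props = schema.get("properties", {})
    # Every required property must be present.
    for key in schema.get("required", []):
        if key not in payload:
            return False
    for key, value in payload.items():
        if key not in props:
            # Reject unknown keys when additionalProperties is false.
            if schema.get("additionalProperties") is False:
                return False
            continue
        expected = TYPE_MAP.get(props[key].get("type"))
        if expected and not isinstance(value, expected):
            return False
        # Recurse into nested object schemas.
        if props[key].get("type") == "object" and not validate(value, props[key]):
            return False
    return True

# Trimmed-down version of the birdsBees schema above.
schema = {
    "type": "object",
    "properties": {
        "uuid": {"type": "string"},
        "event_name": {"type": "string"},
        "organization": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "id": {"type": "number"},
            },
        },
    },
    "required": ["uuid", "event_name", "organization"],
    "additionalProperties": False,
}

payload = {
    "uuid": "b06834d6-aea1-4ef2-9c00-8cd5aa76c8e2",
    "event_name": "viewed",
    "organization": {"name": "bee corp", "id": 10},
}

print(validate(payload, schema))            # True
print(validate({"event_name": 1}, schema))  # False: missing required fields
```

A passing payload gets `isValid: true` in its envelope, as shown above; a failing one is routed as invalid.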
Schema attributes are appended to the envelope and can be used to consistently and reliably answer questions like:
- Where was this payload sourced from?
- What does this payload represent?
- What version of the buz.example.birdsBees schema does this event conform to?
- Is the payload valid?
These attributes are also appended to sink-specific mechanisms like Kafka headers or Pub/Sub attributes, and they power automation far downstream of the point of collection.
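This mapping can be illustrated with a hypothetical helper (not Buz's actual sink code) that turns envelope fields into Kafka-style `(key, bytes)` header tuples:

```python
def envelope_headers(envelope: dict) -> list[tuple[str, bytes]]:
    """Map envelope attributes to Kafka-style (key, bytes) header tuples.

    Hypothetical helper for illustration; values are stringified and
    UTF-8 encoded, and absent keys are simply skipped.
    """
    keys = ("schema", "vendor", "namespace", "version", "isValid", "protocol")
    return [(k, str(envelope[k]).encode()) for k in keys if k in envelope]

headers = envelope_headers({
    "schema": "io.silverton/buz/example/birdsBees/v1.0.json",
    "vendor": "io.silverton",
    "namespace": "buz.example.birdsBees",
    "version": "1.0",
    "isValid": True,
    "protocol": "webhook",
})
print(headers[0])  # ('schema', b'io.silverton/buz/example/birdsBees/v1.0.json')
```

A consumer can then route or filter messages on these headers without deserializing the payload at all.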
Payload annotation-based partitioning
The example AWS deployment of Buz writes events directly to Kinesis Firehose, which then leverages dynamic partitioning to write incoming payloads to well-structured S3 paths.
The above buz.example.birdsBees payload will be written to an S3 path of:
isValid=true/vendor=io.silverton/namespace=buz.example.birdsBees/version=1.0/year=2023/month=4/day=28/$FILE.gz
Payload validity, vendors, namespaces, versions, etc. are all written to different paths, which also means every S3 path contains identically-structured payloads.
This is key.
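The path construction can be sketched as a small Python helper (hypothetical; Firehose does this natively via dynamic partitioning, so this is only to make the key scheme concrete):

```python
from datetime import datetime

def partition_prefix(envelope: dict) -> str:
    """Build the Firehose-style S3 key prefix from envelope attributes.

    Hypothetical helper mirroring the dynamic-partitioning keys above:
    isValid / vendor / namespace / version / year / month / day.
    """
    # Parse the envelope's ISO-8601 timestamp (Z suffix -> UTC offset).
    ts = datetime.fromisoformat(envelope["timestamp"].replace("Z", "+00:00"))
    return (
        f"isValid={str(envelope['isValid']).lower()}"
        f"/vendor={envelope['vendor']}"
        f"/namespace={envelope['namespace']}"
        f"/version={envelope['version']}"
        f"/year={ts.year}/month={ts.month}/day={ts.day}"
    )

envelope = {
    "timestamp": "2023-04-28T21:09:16.960953Z",
    "isValid": True,
    "vendor": "io.silverton",
    "namespace": "buz.example.birdsBees",
    "version": "1.0",
}
print(partition_prefix(envelope))
# isValid=true/vendor=io.silverton/namespace=buz.example.birdsBees/version=1.0/year=2023/month=4/day=28
```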
Querying JSON directly from S3
DuckDB makes it easy to query your S3 pile (bucket? lake? lakehouse? idk).
To start querying this well-structured pile, first load and configure DuckDB's httpfs extension:
D install httpfs;
D load httpfs;
D set s3_region = 'us-east-1'; -- or wherever
D set s3_access_key_id = '$YOUR_ID';
D set s3_secret_access_key = '$YOUR_KEY';
Since the pile of JSON in S3 is consistently-structured, DuckDB schema discovery is effectively free.
You get real tables with real types. And structs all the way down:
D create table local_pile as
select
*
from
read_json_auto('s3://bz-dev-events/isValid=true/vendor=io.silverton/namespace=buz.example.birdsBees/version=1.0/year=2023/month=4/day=28/*.json.gz', timestampformat='%Y-%m-%dT%H:%M:%S.%fZ');
D describe local_pile;
┌───────────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─────────┬─────────┬─────────┬───────┐
│  column_name  │                                                                            column_type                                                                             │  null   │   key   │ default │ extra │
│    varchar    │                                                                              varchar                                                                               │ varchar │ varchar │ varchar │ int32 │
├───────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼─────────┼─────────┼─────────┼───────┤
│ uuid          │ UUID                                                                                                                                                               │ YES     │         │         │       │
│ timestamp     │ TIMESTAMP                                                                                                                                                          │ YES     │         │         │       │
│ buzTimestamp  │ TIMESTAMP                                                                                                                                                          │ YES     │         │         │       │
│ buzVersion    │ VARCHAR                                                                                                                                                            │ YES     │         │         │       │
│ buzName       │ VARCHAR                                                                                                                                                            │ YES     │         │         │       │
│ buzEnv        │ VARCHAR                                                                                                                                                            │ YES     │         │         │       │
│ protocol      │ VARCHAR                                                                                                                                                            │ YES     │         │         │       │
│ schema        │ VARCHAR                                                                                                                                                            │ YES     │         │         │       │
│ vendor        │ VARCHAR                                                                                                                                                            │ YES     │         │         │       │
│ namespace     │ VARCHAR                                                                                                                                                            │ YES     │         │         │       │
│ version       │ VARCHAR                                                                                                                                                            │ YES     │         │         │       │
│ isValid       │ BOOLEAN                                                                                                                                                            │ YES     │         │         │       │
│ contexts      │ STRUCT("io.silverton/buz/internal/contexts/httpHeaders/v1.0.json" STRUCT("Accept" VARCHAR, "Content-Length" BIGINT, "Content-Type" VARCHAR, "User-Agent" VARCHAR)) │ YES     │         │         │       │
│ payload       │ STRUCT(event_name VARCHAR, organization STRUCT(id UBIGINT, "name" VARCHAR), uuid UUID)                                                                             │ YES     │         │         │       │
├───────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴─────────┴─────────┴─────────┴───────┤
│ 14 rows                                                                                                                                                                                                      6 columns │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Queries like this are possible out of the box:
D select
> date_trunc('day', timestamp) as day,
> payload.event_name,
> payload.organization.id as org_id,
> count(*) as events
> from
> local_pile
> group by
> 1,2,3;
┌────────────┬────────────┬────────┬────────┐
│    day     │ event_name │ org_id │ events │
│    date    │  varchar   │ uint64 │ int64  │
├────────────┼────────────┼────────┼────────┤
│ 2023-04-28 │ viewed     │     10 │ 210353 │
└────────────┴────────────┴────────┴────────┘
Well-structured piles of Parquet
A well-structured pile of parquet is sometimes preferable to a well-structured pile of JSON. Which is fine!
This too is easy with DuckDB:
D copy local_pile to 'local_pile.parquet' (format parquet);
To query the local parquet pile:
D select * from read_parquet('local_pile.parquet');
┌──────────────────────┬──────────────────────┬──────────────────────┬────────────┬───────────────┬─────────────┬───┬──────────────┬──────────────────────┬─────────┬─────────┬──────────────────────┬──────────────────────┐
│         uuid         │      timestamp       │     buzTimestamp     │ buzVersion │    buzName    │   buzEnv    │ … │    vendor    │      namespace       │ version │ isValid │       contexts       │       payload        │
│         uuid         │      timestamp       │      timestamp       │  varchar   │    varchar    │   varchar   │   │   varchar    │       varchar        │ varchar │ boolean │  struct("io.silver…  │  struct(event_name…  │
├──────────────────────┼──────────────────────┼──────────────────────┼────────────┼───────────────┼─────────────┼───┼──────────────┼──────────────────────┼─────────┼─────────┼──────────────────────┼──────────────────────┤
│ 813da4d2-c49b-4b99…  │ 2023-04-28 21:34:0…  │ 2023-04-28 21:34:0…  │ x.x.dev    │ buz-bootstrap │ development │ … │ io.silverton │ buz.example.birdsB…  │ 1.0     │ true    │ {'io.silverton/buz…  │ {'event_name': vie…  │
│ dcbf6249-42e0-4abe…  │ 2023-04-28 21:34:0…  │ 2023-04-28 21:34:0…  │ x.x.dev    │ buz-bootstrap │ development │ … │ io.silverton │ buz.example.birdsB…  │ 1.0     │ true    │ {'io.silverton/buz…  │ {'event_name': vie…  │
│ 3ae2d65c-f785-41b5…  │ 2023-04-28 21:34:0…  │ 2023-04-28 21:34:0…  │ x.x.dev    │ buz-bootstrap │ development │ … │ io.silverton │ buz.example.birdsB…  │ 1.0     │ true    │ {'io.silverton/buz…  │ {'event_name': vie…  │
│ 6c7b3e35-6635-4eec…  │ 2023-04-28 21:34:0…  │ 2023-04-28 21:34:0…  │ x.x.dev    │ buz-bootstrap │ development │ … │ io.silverton │ buz.example.birdsB…  │ 1.0     │ true    │ {'io.silverton/buz…  │ {'event_name': vie…  │
│ 8b3b4c8c-2ce5-4431…  │ 2023-04-28 21:34:0…  │ 2023-04-28 21:34:0…  │ x.x.dev    │ buz-bootstrap │ development │ … │ io.silverton │ buz.example.birdsB…  │ 1.0     │ true    │ {'io.silverton/buz…  │ {'event_name': vie…  │
│ 8683ec4e-38a4-49c1…  │ 2023-04-28 21:34:0…  │ 2023-04-28 21:34:0…  │ x.x.dev    │ buz-bootstrap │ development │ … │ io.silverton │ buz.example.birdsB…  │ 1.0     │ true    │ {'io.silverton/buz…  │ {'event_name': vie…  │
│ c098fdd6-7d0a-446a…  │ 2023-04-28 21:34:0…  │ 2023-04-28 21:34:0…  │ x.x.dev    │ buz-bootstrap │ development │ … │ io.silverton │ buz.example.birdsB…  │ 1.0     │ true    │ {'io.silverton/buz…  │ {'event_name': vie…  │
│ dc47ec95-fe66-472d…  │ 2023-04-28 21:34:0…  │ 2023-04-28 21:34:0…  │ x.x.dev    │ buz-bootstrap │ development │ … │ io.silverton │ buz.example.birdsB…  │ 1.0     │ true    │ {'io.silverton/buz…  │ {'event_name': vie…  │
│ a3494e5e-4709-4e62…  │ 2023-04-28 21:34:0…  │ 2023-04-28 21:34:0…  │ x.x.dev    │ buz-bootstrap │ development │ … │ io.silverton │ buz.example.birdsB…  │ 1.0     │ true    │ {'io.silverton/buz…  │ {'event_name': vie…  │
│ a6c4fc0e-b03f-47ac…  │ 2023-04-28 21:34:0…  │ 2023-04-28 21:34:0…  │ x.x.dev    │ buz-bootstrap │ development │ … │ io.silverton │ buz.example.birdsB…  │ 1.0     │ true    │ {'io.silverton/buz…  │ {'event_name': vie…  │
│ 55065c25-546a-4f26…  │ 2023-04-28 21:34:0…  │ 2023-04-28 21:34:0…  │ x.x.dev    │ buz-bootstrap │ development │ … │ io.silverton │ buz.example.birdsB…  │ 1.0     │ true    │ {'io.silverton/buz…  │ {'event_name': vie…  │
│ 808bfe01-6489-44ae…  │ 2023-04-28 21:34:0…  │ 2023-04-28 21:34:0…  │ x.x.dev    │ buz-bootstrap │ development │ … │ io.silverton │ buz.example.birdsB…  │ 1.0     │ true    │ {'io.silverton/buz…  │ {'event_name': vie…  │
│ ff6325aa-5235-4f94…  │ 2023-04-28 21:34:0…  │ 2023-04-28 21:34:0…  │ x.x.dev    │ buz-bootstrap │ development │ … │ io.silverton │ buz.example.birdsB…  │ 1.0     │ true    │ {'io.silverton/buz…  │ {'event_name': vie…  │
│ b488fe95-50d6-4244…  │ 2023-04-28 21:34:0…  │ 2023-04-28 21:34:0…  │ x.x.dev    │ buz-bootstrap │ development │ … │ io.silverton │ buz.example.birdsB…  │ 1.0     │ true    │ {'io.silverton/buz…  │ {'event_name': vie…  │
│ f7d63eb5-a1e5-41c5…  │ 2023-04-28 21:34:0…  │ 2023-04-28 21:34:0…  │ x.x.dev    │ buz-bootstrap │ development │ … │ io.silverton │ buz.example.birdsB…  │ 1.0     │ true    │ {'io.silverton/buz…  │ {'event_name': vie…  │
└──────────────────────┴──────────────────────┴──────────────────────┴────────────┴───────────────┴─────────────┴───┴──────────────┴──────────────────────┴─────────┴─────────┴──────────────────────┴──────────────────────┘
Or write it straight back to S3 as partitioned Parquet:
D copy local_pile to 's3://bz-dev-events/partitioned' (format parquet, partition_by (isValid, vendor, namespace, version));
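Because the partition keys are encoded in the paths, DuckDB can surface them as columns when reading the data back. A sketch, reusing the example bucket above (`hive_partitioning` is DuckDB's read flag for path-encoded partition columns):

```sql
D select vendor, namespace, version, count(*) as events
> from read_parquet('s3://bz-dev-events/partitioned/*/*/*/*/*.parquet', hive_partitioning=true)
> group by 1, 2, 3;
```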
In Conclusion
Tools like Buz and DuckDB make it easier than ever to create and use well-structured data sitting in S3. Your pile of JSON (or Parquet) doesn't need to be a mess!
And with functionality like registering file paths, it's only getting better.
