What it is
rag_snippets is a single BigQuery table that holds every piece of extracted intelligence. Reviews, YouTube transcripts, web articles, Reddit posts, BazaarVoice, AI Overviews — all converge here, atomically.
Each row is one snippet: one feature mention, one benefit claim, one use-case observation, one comparison, or one product sentiment. With source ID, sentiment score, harmonised canonical property, and timestamp.
This is the single primary table for analytical queries. Strategy agents read from it. The chatbot reads from pre-computed tables that read from it. Almost every interesting question about a market resolves to a query against rag_snippets joined with label_harmonisation.
Schema (simplified)
| Column | Type | Notes |
|---|---|---|
snippet_id | STRING | Unique |
source_type | STRING | review, youtube, web_article, reddit, bazaarvoice, ai_overview |
source_id | STRING | Joins back to review_id, video_id, article_id |
product_name | STRING | Canonical (post-resolver) |
snippet_type | STRING | feature, benefit, use_case, comparison, product_sentiment |
category_label | STRING | Raw extracted label, source-language |
property_value | STRING | e.g. "4K 60fps", "2 hours" |
sentiment_score | FLOAT | -1.0 to 1.0 |
sentiment_label | STRING | positive / neutral / negative |
snippet_text | STRING | Exact quote — never modified |
country | STRING | lowercase: uk, de, fr, it, us |
language | STRING | source language |
created_at | TIMESTAMP | Source publication date, not pipeline time |
Why this design works
Three properties matter:
01 — Atomic. One row = one observation. No nested arrays of properties per source. This makes aggregation trivial and joins predictable.
02 — Source-traceable.
Every snippet carries source_type + source_id. Click through to the review, the video, the article. Strategy agents include source citations because the data is structured to support them.
03 — Harmonised on read.
category_label stays in the source language. The label_harmonisation join provides the canonical property at query time. This preserves nuance while enabling cross-language analysis.
How it gets populated
The intelligence pipeline writes to rag_snippets:
- Reviews: Oxylabs scrape → Claude extraction → snippet rows
- YouTube: Oxylabs transcript → Claude extraction → snippet rows
- Web articles: SERP discovery → Oxylabs scrape → Claude extraction → snippet rows
- Reddit: Reddit API → Claude extraction → snippet rows
- AI Overviews: DataForSEO LLM Mentions → parsed → snippet rows
Every write goes through product_resolver.resolve_to_canonical() first, so product_name is always a canonical model name.
Scale
Production deployments:
- Canon Consumer (UK/DE/FR/IT): 737K snippets
- Canon B2B (US/DE/JP/KR): 90K+ snippets, growing
- Bose Germany: 90K snippets from a 10-day pilot
- Principality (UK FS): 90,107 snippets in 6 weeks
The largest tables are 4M+ snippets. Query latency on a single-product look-up is sub-second; full-category aggregation is single-digit seconds.
What it isn't
It's not a vector store. There's no embedding column on rag_snippets. Retrieval uses structured queries against the canonical property (fixed entity architecture), not cosine similarity to a question.
Vector search has its place (chatbot fallback for ambiguous queries) but the primary retrieval surface is structured. This is faster, more deterministic, and more explainable than vector RAG for the kinds of questions market research actually asks.