
Math for connections, LLMs for extraction

Theia's core engineering doctrine. Use LLMs where language understanding is genuinely required — extraction, sentiment, multilingual harmonisation. Use deterministic math — cosine, Leiden, HHI, TF-IDF — for every graph connection. This is what makes the intelligence layer reproducible at scale.

The doctrine, in one line

LLMs are essential for the parts of market intelligence that require language understanding. They are unsafe for the parts that require graph structure.

Most "AI-driven" market research tools use LLMs for everything. The result is intelligence that drifts every run, retrieval that hallucinates, and a graph that has to be rebuilt every quarter.

Theia's architecture splits the work. LLMs do extraction. Math does connections. The graph is stable. The intelligence layer compounds rather than decays.

Where LLMs are the right tool

Four jobs that genuinely require language understanding:

Job | Why LLM | Theia implementation
Feature / benefit / use-case extraction from messy text | Requires semantic understanding of "battery lasts ages" → BATTERY_LIFE = positive | Claude Haiku on review/transcript text
Sentiment scoring in source language | Requires native-language nuance | Claude Haiku, per-language calibration
Cross-language harmonisation (label A → canonical property) | Requires multilingual semantic mapping | Claude Haiku batched harmonisation
Strategy synthesis at L1-L4 | Requires reasoning about competitive dynamics | Claude Sonnet, structured output

These jobs share a property: the input is unstructured language, the output benefits from semantic understanding, and there's no reasonable mathematical alternative.

Where math is the right tool

Four jobs that look like LLM jobs at first glance — and shouldn't be:

Job | Why math, not LLM | Theia implementation
Product → product similarity | Reproducible, set-independent edges required | Cosine similarity on raw mention vectors
Keyword cluster discovery | Stable communities, no resolution-limit failure | Leiden community detection with Surprise optimisation
Keyword → cluster naming | Distinctive, non-generic, interpretable | HHI × traffic scoring
Selection (which URLs to scrape, which keywords to keep) | Bounded cost, interpretable cutoffs | Cumulative CTR threshold (e.g. 65%)

Each of these can be implemented with an LLM. Each is dramatically worse when you do.
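As an illustration of the selection row, a cumulative-threshold cutoff can be sketched in a few lines. This is a minimal sketch, not Theia's implementation: the function name, the use of estimated clicks as the weight, and the sample URLs are all assumptions made for the example.

```python
def select_by_cumulative_share(weights, threshold=0.65):
    """Keep the highest-weight items until their cumulative share reaches `threshold`.

    `weights` maps an identifier (hypothetical: a URL) to a non-negative
    weight (hypothetical: estimated clicks). Ties are broken by identifier
    so the result is the same on every run.
    """
    total = sum(weights.values())
    if total <= 0:
        return []
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    selected, running = [], 0.0
    for name, weight in ranked:
        selected.append(name)
        running += weight
        if running >= threshold * total:
            break
    return selected


# Illustrative numbers only.
clicks = {"url_a": 50, "url_b": 30, "url_c": 15, "url_d": 5}
selected = select_by_cumulative_share(clicks, threshold=0.65)
print(selected)
```

The cutoff is interpretable (the kept items cover at least 65% of the total weight) and the cost is bounded by the length of the ranked list.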

The five problems with all-LLM connections

01 — Drift. Re-running an "ask the LLM to cluster these products" call produces different clusters each time. The graph is unstable. Two analyses on the same data give two answers.

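02 — Cost scales linearly. Math operations on a 10K-product corpus cost roughly the same as on a 100-product corpus. The equivalent LLM operations cost about 100× more, since LLM cost grows with the number of items sent through the model. The all-LLM approach can't scale to production volumes without bankruptcy or aggressive sampling.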

03 — Set-dependence. Ask an LLM "which products are similar to Canon EOS R6 II?" and the answer depends on which other products are in context. Add Sony A7C II to the prompt, and the similarity score between the R6 II and the EOS R8 changes. Math edges don't have this property: an edge depends only on the two products it connects, so it stays stable regardless of which other products are in the analysis.
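Set-independence is easy to check for cosine similarity on raw count vectors, because the score is computed from the two vectors alone, with no corpus-wide reweighting. The sketch below assumes mention vectors are stored as dicts of property → count; the property names and counts are invented for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two sparse count vectors stored as dicts."""
    dot = sum(count * v.get(term, 0) for term, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical raw mention counts per product.
mentions = {
    "canon_eos_r6_ii": {"autofocus": 40, "low_light": 25, "battery_life": 10},
    "canon_eos_r8": {"autofocus": 30, "low_light": 10, "weight": 20},
}
before = cosine_similarity(mentions["canon_eos_r6_ii"], mentions["canon_eos_r8"])

# Adding a third product to the analysis does not touch the existing edge.
mentions["sony_a7c_ii"] = {"autofocus": 35, "weight": 30, "battery_life": 15}
after = cosine_similarity(mentions["canon_eos_r6_ii"], mentions["canon_eos_r8"])

print(before == after)
```

Note that this property would not hold if the vectors were first reweighted by a corpus-wide statistic, which is one reason to compute edges on raw counts.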

04 — Super-nodes form. LLMs tend to over-cluster generic concepts. Without distinctiveness scoring, "autofocus" becomes the cluster name for half the camera market — and stops being useful for retrieval. Math-based HHI scoring prevents this.
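One way to read "HHI × traffic" is sketched below. The Herfindahl-Hirschman Index is the sum of squared shares, so a keyword spread evenly over many clusters scores low and a keyword concentrated in one cluster scores close to 1. How Theia actually defines the shares is not stated in the text; here they are assumed to be a keyword's occurrence counts per cluster, and all names and numbers are illustrative.

```python
def hhi(counts):
    """Herfindahl-Hirschman Index: sum of squared shares, in (0, 1]."""
    counts = list(counts)
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum((c / total) ** 2 for c in counts)


def naming_score(cluster_counts, traffic):
    """Distinctiveness (HHI across clusters) weighted by traffic."""
    return hhi(cluster_counts.values()) * traffic


# Hypothetical keywords: occurrences per cluster, plus total traffic.
keywords = {
    "autofocus": {
        "counts": {"vlogging": 25, "wildlife": 25, "portrait": 25, "sports": 25},
        "traffic": 10_000,
    },
    "bird eye tracking": {
        "counts": {"wildlife": 90, "sports": 10},
        "traffic": 4_000,
    },
}
scores = {k: naming_score(v["counts"], v["traffic"]) for k, v in keywords.items()}
best = max(scores, key=scores.get)
print(best)
```

In this toy data the generic term has more traffic but an HHI of only 0.25, so the concentrated term wins the cluster name despite lower volume.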

05 — No audit trail. "The LLM said these products are similar" is not a defensible answer in front of a board, a regulator, or a sceptical analyst. "Cosine similarity 0.74 on raw mention vectors, computed on this run_id" is. The math approach is auditable; the LLM approach isn't.
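What an auditable edge might look like in practice: a record that carries the metric, the value, the run identifier, and a digest of the exact inputs. This is a hypothetical sketch; the field names, the metric label, and the run_id format are assumptions, and the 0.74 value is simply the figure quoted above.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class EdgeRecord:
    source: str
    target: str
    metric: str
    value: float
    run_id: str
    input_digest: str  # SHA-256 of the two input vectors


def digest_inputs(source_vec, target_vec):
    """Stable digest of the inputs; key order in the dicts does not matter."""
    payload = json.dumps([source_vec, target_vec], sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def make_edge_record(source, target, value, run_id, source_vec, target_vec):
    return EdgeRecord(
        source=source,
        target=target,
        metric="cosine_raw_mentions",  # hypothetical metric label
        value=round(value, 4),
        run_id=run_id,
        input_digest=digest_inputs(source_vec, target_vec),
    )


# Illustrative inputs only.
r6 = {"autofocus": 40, "low_light": 25}
r8 = {"low_light": 10, "autofocus": 30}
record = make_edge_record(
    "canon_eos_r6_ii", "canon_eos_r8", 0.74, "run_example_001", r6, r8
)
print(record)
```

Anyone holding the same input vectors can recompute the digest and the score, which is what makes the edge defensible.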

The 80/20 split

Across the Theia production pipeline:

  • ~80% of LLM cost is upstream extraction (review → snippet, transcript → features)
  • ~15% is downstream strategy synthesis at L1-L4
  • ~5% is harmonisation maintenance

Connection-building operations — clustering, similarity, distinctiveness — are 0% of LLM cost. They run as math on the pre-extracted intelligence layer. Re-clustering the Canon EU market takes ~10 minutes from parquet and costs roughly nothing.

A competitor running connection-building through LLMs would spend $5K-50K per re-cluster, depending on volume. At that price they re-cluster less often, and their graphs decay.

Where the discourse is wrong

The popular AI market research discourse frames the question as "should we use AI for market research?" — yes/no. The right question is: where in the pipeline is the LLM the right tool, and where is it the wrong tool?

The doctrine: LLM where the input is language and the output benefits from semantic understanding. Math everywhere else.

This isn't a Theia-specific opinion. It's the operational discipline that separates production-grade AI research infrastructure from demos.

Strategic implication

For any brand evaluating an AI research vendor, ask: "show me the part of your pipeline that doesn't use an LLM."

  • If the answer is "everything uses an LLM" — that vendor is shipping demos.
  • If the answer is "extraction and synthesis use LLMs; clustering, similarity, selection, and distinctiveness use math" — that vendor is shipping infrastructure.

The brands that compound intelligence value over 18-24 months are buying from the second group. The brands that have to rebuild every quarter are buying from the first.