What is Math for connections, LLMs for extraction?

Theia's core engineering doctrine. Use LLMs where language understanding is genuinely required — extraction, sentiment, multilingual harmonisation. Use deterministic math — cosine, Leiden, HHI, TF-IDF — for every graph connection. This is what makes the intelligence layer reproducible at scale.

Math for connections, LLMs for extraction

The doctrine, in one line

LLMs are essential for the parts of market intelligence that require language understanding. They are unsafe for the parts that require graph structure.

Most "AI-driven" market research tools use LLMs for everything. The result is intelligence that drifts every run, retrieval that hallucinates, and a graph that has to be rebuilt every quarter.

Theia's architecture splits the work. LLMs do extraction. Math does connections. The graph is stable. The intelligence layer compounds rather than decays.

Where LLMs are the right tool

Four jobs that genuinely require language understanding:

Job	Why LLM	Theia implementation
Feature / benefit / use-case extraction from messy text	Requires semantic understanding of "battery lasts ages" → BATTERY_LIFE = positive	Claude Haiku on review/transcript text
Sentiment scoring in source language	Requires native-language nuance	Claude Haiku, per-language calibration
Cross-language harmonisation (label A → canonical property)	Requires multilingual semantic mapping	Claude Haiku batched harmonisation
Strategy synthesis at L1-L4	Requires reasoning about competitive dynamics	Claude Sonnet, structured output

These jobs share a property: the input is unstructured language, the output benefits from semantic understanding, and there's no reasonable mathematical alternative.

Where math is the right tool

Four jobs that look like LLM jobs at first glance — and shouldn't be:

Job	Why math, not LLM	Theia implementation
Product → product similarity	Reproducible, set-independent edges required	Cosine similarity on raw mention vectors
Keyword cluster discovery	Stable communities, no resolution-limit failure	Leiden community detection with Surprise optimisation
Keyword → cluster naming	Distinctive, non-generic, interpretable	HHI × traffic scoring
Selection (which URLs to scrape, which keywords to keep)	Bounded cost, interpretable cutoffs	Cumulative CTR threshold (e.g. 65%)

Each of these can be implemented with an LLM. Each is dramatically worse when you do.
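
As a minimal sketch of the selection row in the table above: it assumes "cumulative CTR threshold" means ranking candidates by clicks and keeping the top ones until they cover the target share of total clicks. The function name, URLs, and click counts are illustrative assumptions, not Theia's implementation.

```python
from typing import List, Tuple


def select_by_cumulative_ctr(
    candidates: List[Tuple[str, float]], threshold: float = 0.65
) -> List[str]:
    """Keep the top candidates until they cover `threshold` of total clicks."""
    total = sum(clicks for _, clicks in candidates)
    if total <= 0:
        return []
    # Sort by clicks descending; break ties by name so reruns give the same order.
    ranked = sorted(candidates, key=lambda c: (-c[1], c[0]))
    kept: List[str] = []
    covered = 0.0
    for name, clicks in ranked:
        if covered / total >= threshold:
            break  # the cutoff is reached, everything below is dropped
        kept.append(name)
        covered += clicks
    return kept


# Hypothetical URLs with made-up click counts.
urls = [
    ("a.example/review", 300),
    ("b.example/guide", 250),
    ("c.example/forum", 200),
    ("d.example/blog", 150),
    ("e.example/misc", 100),
]
selected = select_by_cumulative_ctr(urls, threshold=0.65)
```

The cutoff is a single interpretable number, and the same input always yields the same selection, which is the property the table asks for.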

The five problems with all-LLM connections

01 — Drift. Re-running an "ask the LLM to cluster these products" call produces different clusters each time. The graph is unstable. Two analyses on the same data give two answers.

02 — Cost compounds linearly. Math operations on a 10K-product corpus cost roughly the same as on a 100-product corpus. LLM operations on the larger corpus cost roughly 100× more, because spend scales with every item sent to the model. The all-LLM approach can't scale to production volumes without bankruptcy or aggressive sampling.

03 — Set-dependence. Ask an LLM "which products are similar to Canon EOS R6 II?" and the answer depends on which other products are in context. Add Sony A7C II to the prompt, and the similarity score between the R6 II and the Canon EOS R8 changes. Math edges don't have this property: they're stable regardless of which other products are in the analysis.
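
The set-independence of a math edge can be sketched in a few lines. This assumes mention vectors are stored as feature-to-count mappings. The feature names and counts below are made up for illustration.

```python
import math
from typing import Dict


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse mention vectors (feature -> count)."""
    shared = sorted(u.keys() & v.keys())  # fixed order keeps float sums reproducible
    dot = sum(u[k] * v[k] for k in shared)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up mention counts, for illustration only.
mentions = {
    "Canon EOS R6 II": {"autofocus": 120, "battery_life": 40, "low_light": 90},
    "Canon EOS R8": {"autofocus": 100, "battery_life": 15, "low_light": 70, "weight": 60},
}
before = cosine(mentions["Canon EOS R6 II"], mentions["Canon EOS R8"])

# Adding a third product to the analysis does not touch the existing edge.
mentions["Sony A7C II"] = {"autofocus": 110, "weight": 95, "low_light": 60}
after = cosine(mentions["Canon EOS R6 II"], mentions["Canon EOS R8"])
```

The edge depends only on the two vectors it connects, so `before` and `after` are identical no matter how many other products join the analysis.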

04 — Super-nodes form. LLMs tend to over-cluster generic concepts. Without distinctiveness scoring, "autofocus" becomes the cluster name for half the camera market — and stops being useful for retrieval. Math-based HHI scoring prevents this.
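
One way to read "HHI × traffic" is sketched below. It assumes HHI is computed over a keyword's traffic shares across clusters, so a keyword concentrated in one cluster scores near 1 and a keyword spread evenly across many scores low. That reading, the function names, and the traffic figures are assumptions for illustration.

```python
from typing import Dict, Iterable, Optional


def hhi(values: Iterable[float]) -> float:
    """Herfindahl-Hirschman index: sum of squared shares, in (0, 1]."""
    values = list(values)
    total = sum(values)
    if total <= 0:
        return 0.0
    return sum((v / total) ** 2 for v in values)


def name_cluster(cluster: str, traffic: Dict[str, Dict[str, float]]) -> Optional[str]:
    """Pick the keyword with the highest HHI x in-cluster traffic score."""
    best_name, best_score = None, 0.0
    for keyword in sorted(traffic):  # sorted so ties resolve the same way every run
        local = traffic[keyword].get(cluster, 0.0)
        score = hhi(traffic[keyword].values()) * local
        if score > best_score:
            best_name, best_score = keyword, score
    return best_name


# traffic[keyword][cluster] = monthly traffic (made-up numbers).
traffic = {
    "autofocus": {"wildlife": 1000, "vlogging": 1000, "portrait": 1000, "travel": 1000},
    "bird tracking": {"wildlife": 900, "travel": 100},
    "flip screen": {"vlogging": 800},
}
wildlife_name = name_cluster("wildlife", traffic)
vlogging_name = name_cluster("vlogging", traffic)
```

Here "autofocus" has the most traffic in every cluster, but its HHI of 0.25 drags its score below the more concentrated keywords, so it never becomes a cluster name.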

05 — No audit trail. "The LLM said these products are similar" is not a defensible answer in front of a board, a regulator, or a sceptical analyst. "Cosine similarity 0.74 on raw mention vectors, computed on this run_id" is. The math approach is auditable; the LLM approach isn't.
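
A hedged sketch of what an auditable edge record could look like. The field names, the method label, and the `run-0001` identifier are hypothetical, and the vectors are made-up dense counts. The point is only that the score, the method, and a hash of the exact inputs can be stored together and recomputed later.

```python
import hashlib
import json
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def audit_edge(run_id: str, vectors: Dict[str, List[float]]) -> dict:
    """Record one similarity edge with enough context to reproduce it."""
    (name_a, vec_a), (name_b, vec_b) = sorted(vectors.items())
    # Canonical JSON so the same inputs always hash to the same value.
    canonical = json.dumps(vectors, sort_keys=True, separators=(",", ":"))
    return {
        "run_id": run_id,
        "method": "cosine/raw_mention_vectors",  # hypothetical label
        "source": name_a,
        "target": name_b,
        "score": round(cosine(vec_a, vec_b), 4),
        "input_sha256": hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
    }


# Made-up mention counts for two products.
vectors = {"Canon EOS R6 II": [120, 40, 90], "Canon EOS R8": [100, 15, 70]}
record = audit_edge("run-0001", vectors)
```

Anyone holding the same input vectors can recompute the hash and the score and confirm they match the stored record.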

The 80/20 split

Across the Theia production pipeline:

~80% of LLM cost is upstream extraction (review → snippet, transcript → features)
~15% is downstream strategy synthesis at L1-L4
~5% is harmonisation maintenance

Connection-building operations — clustering, similarity, distinctiveness — are 0% of LLM cost. They run as math on the pre-extracted intelligence layer. Re-clustering the Canon EU market takes ~10 minutes from parquet and costs roughly nothing.

A competitor running connection-building through LLMs would spend $5K-50K per re-cluster, depending on volume. They don't re-cluster as often. Their graphs decay.

Where the discourse is wrong

The popular AI market research discourse frames the question as "should we use AI for market research?" — yes/no. The right question is: where in the pipeline is the LLM the right tool, and where is it the wrong tool?

The doctrine: LLM where the input is language and the output benefits from semantic understanding. Math everywhere else.

This isn't a Theia-specific opinion. It's the operational discipline that separates production-grade AI research infrastructure from demos.

Strategic implication

For any brand evaluating an AI research vendor, ask: "show me the part of your pipeline that doesn't use an LLM."

If the answer is "everything uses an LLM" — that vendor is shipping demos.
If the answer is "extraction and synthesis use LLMs; clustering, similarity, selection, and distinctiveness use math" — that vendor is shipping infrastructure.

The brands that compound intelligence value over 18-24 months are buying from the second group. The brands that have to rebuild every quarter are buying from the first.
