The problem this solves
Once Leiden has clustered the market, every cluster needs a name — and "name" really means "which keyword best represents what this cluster is about."
Naive approaches fail:
- Picking the highest-traffic keyword in the cluster selects generic head terms ("camera") that appear in every cluster
- Picking the most frequent keyword selects brand names ("canon"), which aren't segments
- Manual naming doesn't scale and isn't reproducible
The keyword that should win is the one whose traffic is most concentrated in this cluster vs spread across many clusters.
What HHI measures
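The Herfindahl-Hirschman Index is a concentration measure used by competition regulators to assess market structure. Here the "market" is a single keyword and the shares are the fractions of that keyword's traffic falling in each cluster. For a set of shares s_1, s_2, ..., s_n summing to 1: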
HHI = Σ (s_i)² for i = 1 … n
- HHI = 1: complete concentration (one cluster owns 100% of the keyword's traffic)
- HHI = 1/n: even spread across all n clusters
- HHI closer to 1: the keyword is distinctive to a small number of clusters
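As a minimal sketch of the formula above, the function below computes HHI for one keyword from its raw traffic per cluster. The function name `hhi`, the dict-based input, the cluster labels, and the traffic numbers are illustrative assumptions, not Theia's actual code or data.

```python
from typing import Mapping


def hhi(traffic_by_cluster: Mapping[str, float]) -> float:
    """Herfindahl-Hirschman Index of one keyword's traffic across clusters.

    Raw traffic is normalised to shares that sum to 1, then the squared
    shares are summed.
    """
    total = sum(traffic_by_cluster.values())
    if total <= 0:
        # Assumption: a keyword with no traffic is treated as not concentrated.
        return 0.0
    return sum((t / total) ** 2 for t in traffic_by_cluster.values())


# Hypothetical traffic figures, for illustration only.
concentrated = hhi({"mirrorless": 900, "dslr": 50, "compact": 50})
even = hhi({"mirrorless": 100, "dslr": 100, "compact": 100, "ink": 100})

print(round(concentrated, 3))  # 0.9^2 + 0.05^2 + 0.05^2 = 0.815
print(round(even, 3))          # four equal shares: 1/n = 0.25
```

The second call shows the lower bound: with traffic split evenly over n = 4 clusters, the index equals 1/4.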
How Theia uses it
For every keyword × cluster pair, we compute:
distinctiveness = HHI(keyword across clusters) × traffic in this cluster
The first term rewards concentration. The second term rewards relevance (you don't want a distinctive but tiny keyword winning).
The keyword that wins is distinctive AND material.
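The scoring rule above can be sketched as follows. This is an illustration under assumptions, not the production pipeline: the function name `name_clusters`, the nested-dict input shape, and all traffic numbers are hypothetical, and ties are broken arbitrarily by iteration order.

```python
from typing import Dict, Tuple

# Hypothetical input shape: {keyword: {cluster: traffic}}
Traffic = Dict[str, Dict[str, float]]


def name_clusters(traffic: Traffic) -> Dict[str, Tuple[str, float]]:
    """Return, per cluster, the keyword with the highest distinctiveness score.

    distinctiveness = HHI(keyword across clusters) * traffic in this cluster
    """
    best: Dict[str, Tuple[str, float]] = {}
    for keyword, by_cluster in traffic.items():
        total = sum(by_cluster.values())
        if total <= 0:
            continue  # assumption: keywords without traffic are skipped
        concentration = sum((t / total) ** 2 for t in by_cluster.values())
        for cluster, t in by_cluster.items():
            score = concentration * t
            # Keep the highest-scoring keyword seen so far for this cluster.
            if cluster not in best or score > best[cluster][1]:
                best[cluster] = (keyword, score)
    return best


# Made-up numbers for illustration only.
traffic = {
    "camera": {"mirrorless": 500, "dslr": 500, "compact": 500, "ink": 500},
    "mirrorless camera": {"mirrorless": 900, "dslr": 100},
    "dslr camera": {"dslr": 800, "mirrorless": 200},
    "pg-540 ink": {"ink": 400},
}

names = name_clusters(traffic)
for cluster, (keyword, score) in sorted(names.items()):
    print(f"{cluster}: {keyword} ({score:.0f})")
```

In this toy data, "camera" scores 0.25 × 500 = 125 in every cluster, so it loses wherever a more concentrated keyword carries meaningful traffic, and it wins only the "compact" cluster, where nothing else competes. Because the score multiplies by raw traffic, a head term with a large enough volume could still come out on top; whether any normalisation or threshold is applied on top of this rule is not specified here.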
Examples from Canon EU
| Keyword | HHI | Best cluster | Outcome |
|---|---|---|---|
| camera | 0.04 | many | Rejected (too spread) |
| canon | 0.08 | many | Rejected (brand term) |
| mirrorless camera | 0.62 | mirrorless cameras | Cluster name |
| spiegellose kamera | 0.71 | mirrorless cameras (DE) | Merged after cross-language step |
| wildlife photography camera | 0.89 | pro mirrorless | Sub-segment defining |
| pg-540 ink | 0.94 | canon 540/541 ink | Cartridge family defining |
Why this matters strategically
Distinctive keywords reveal what consumers think a segment IS.
If "wildlife photography camera" is highly distinctive to the pro mirrorless cluster, that's not just a naming convenience — it tells you that wildlife is the defining use case the market associates with pro mirrorless. Marketing copy, retailer category trees, and content briefs should reflect that.
We persist distinctiveness scores at two granularities:
- distinctive_keywords_segment: which keywords define a market segment
- distinctive_keywords_product: which keywords define a specific product
The second one is the basis for SEO and Amazon listing strategy: target the keywords your product is distinctively associated with, not the head terms where everyone fights for crumbs.