Digital strategy · SEO · GEO · LLMs · Entity optimization
Everything Google knows about your brand, how it learned it, and what happens when ChatGPT takes over
There are two versions of your brand. One lives in the real world. The other lives in a machine’s memory, encoded as a cluster of mathematical coordinates and factual associations.
The first version you can control with money, people, and time. The second one you earn by being consistent, credible, and structurally legible to systems that don’t read the same way humans do.
For most of the last 25 years, the second version didn’t matter much. You ranked pages. Users clicked. Life went on. But in 2026, the machine’s version of you is the one that gets recommended by ChatGPT, cited by Perplexity, included in Google’s AI Overviews, and selected by shopping agents on behalf of people who never type a query at all. If the machine doesn’t know who you are clearly and consistently, you’re not in the answer. This post is the full story: how we got here, what changed and when, what “connected entities” actually means technically, and what you need to do about it right now.
The short version first
- Search evolved from keyword counting (1990s) to link graphs (1998–2012) to semantic entity understanding (2012–2019) to LLM synthesis (2023–now). Each era made the previous era's dominant tactic obsolete.
- Google's Knowledge Graph (2012) was the pivot: 'things not strings.' Brands became entities with attributes, not keyword-matched strings. Entity recognition is now a prerequisite for visibility.
- LLMs don't retrieve via links or keywords. They retrieve via vector similarity and knowledge graph traversal. Your brand needs a clear, consistent entity definition across multiple authoritative sources.
- Connected entities: your brand recognized unambiguously across Wikidata, Wikipedia, Crunchbase, Schema.org markup, LinkedIn, and third-party review platforms, with consistent attributes and cross-linked sameAs identifiers.
- Unlinked mentions now matter as much as backlinks. LLMs train on co-occurrence patterns, not hyperlink graphs. Brand mentions in credible editorial contexts build your 'vector real estate.'
- The llms.txt protocol (proposed 2024, now standardized) lets you communicate directly with AI systems in their native format, bypassing HTML noise.
- Traditional SEO KPIs (rankings, clicks) are becoming incomplete. The new metrics: Entity Share of Voice, Citation Provenance, Sentiment Score, and Answer Position in AI responses.
- The window for competitive advantage is open now. The training data and entity associations being built into knowledge graphs are being established in 2026.
The history: how ranking evolved, era by era
The first web wasn’t really searchable. Yahoo started as a manually curated directory in a Stanford campus trailer in 1994. Jerry Yang and David Filo bookmarked interesting sites and organized them by category. It scaled badly, as manual curation always does.
When crawler-based engines like AltaVista, Infoseek, and Excite appeared, they automated discovery but their ranking algorithms were primitive. They counted words. A page about dog food ranked for “dog food” if it contained the phrase many times. That was it.
The manipulation was immediate and embarrassing. If your competitor ranked for a keyword by using it 100 times, you used it 200 times. Sometimes in white text on a white background so users couldn’t see it. The web in this era was full of garbage dressed up as content, and the search results reflected it.
The optimization focus was entirely on-page: meta tags, title tags, keyword density in the body. There was no concept of off-page authority. The gap between what a page claimed to be about and what it was actually useful for was enormous.
Larry Page and Sergey Brin built something at Stanford in 1996 that would change everything. They called it BackRub, and it treated the web as an academic citation network. The core insight was simple: if a lot of credible sites link to your page, your page is probably credible too. Links as votes. Authority as a network property.
This became PageRank, and Google’s dominance over the next decade was built on it. In 2000, Google launched the Google Toolbar, which let anyone see a page’s PageRank score directly in their browser. This single decision sent an unambiguous message to the entire web industry: external links are the currency. Within months, link building became the heartbeat of SEO strategy worldwide.
And the manipulation followed immediately. Link farms appeared. Blog comment spam exploded. Paid link schemes proliferated. Private blog networks were built at scale to funnel artificial authority to client sites. The arms race between Google and the SEO industry became the defining tension of the 2000s.
The 2010s were when Google started trying to understand the web instead of just indexing it.
In 2012, Google launched the Knowledge Graph with a memorable line: “things, not strings.” Instead of treating “Apple” as a five-character string to match, the Knowledge Graph mapped it as an entity with attributes: a technology company, founded 1976, headquartered in Cupertino, makes iPhones, traded on Nasdaq. A different entity called “apple” is a fruit, grows on trees, comes in varieties like Fuji and Granny Smith.
In 2013, Hummingbird added Natural Language Processing to the core ranking system. Google could now parse the intent behind a conversational query, not just match its individual words. Then came RankBrain in 2015, which brought machine learning directly into the ranking process. When Google saw a query it had never seen before, it could make intelligent inferences by mapping the query into a vector space and finding semantically similar queries it had seen.
In 2019, BERT applied bidirectional transformer models to search. BERT reads sentences in both directions simultaneously, understanding how each word modifies the meaning of every other word in context. Prepositions that used to be ignored (“for a child” versus “by a child”) now mattered. One in ten English searches was immediately affected at launch.
Directory era (1994–1997)
Yahoo launches as a manually curated bookmark list. AltaVista, Infoseek, Lycos follow with crawler-based indexing.
Signal that won: keyword frequency and meta tags
PageRank launches (1998)
Google's PageRank treats hyperlinks as votes. Links become the dominant currency of search authority.
Signal that won: quantity and quality of inbound links
Google Toolbar (2000)
PageRank scores become publicly visible. Link building industrializes almost immediately. Link farms, comment spam, and paid link schemes emerge at scale.
The abuse: link farms, blog comment spam, private blog networks
Knowledge Graph + Penguin (2012)
Knowledge Graph launches ("things not strings"). Google Penguin penalizes unnatural link patterns. Entity thinking enters the mainstream.
Signal that won: entity recognition, topical authority, structured data
Hummingbird (2013)
Hummingbird adds NLP to core ranking. Google now interprets intent and conversational queries, not just keyword matches.
Signal that won: search intent, topic clusters, E-A-T
RankBrain (2015)
RankBrain introduces machine learning into ranking. Novel queries handled via vector similarity inference.
Signal that won: semantic relevance, behavioral engagement, freshness
BERT (2019)
BERT applies bidirectional transformers to search. Context, prepositions, and nuance finally matter. Affects 1 in 10 English queries at launch. E-E-A-T formalized.
Signal that won: author expertise, authoritative citations, entity trust
AI Overviews + LLM search (2023–now)
ChatGPT, Perplexity, Gemini, Google AI Overviews. No more ten blue links for informational queries. The model synthesizes and cites.
Signal that wins: entity clarity, cross-platform consistency, semantic density, unlinked mentions
What connected entities actually means
A connected entity is a brand, person, or organization that a machine can recognize unambiguously across multiple data sources, and link to a consistent, coherent set of facts.
The technical term for the process is entity resolution: the ability of an AI system to deduce that “IBM,” “International Business Machines,” and “Big Blue” are all the same real-world organization, regardless of how they appear in different texts. Strong entity resolution requires consistency across sources. If your brand name is spelled differently on Wikidata, Crunchbase, and your own website, the machine hesitates. Hesitation means reduced confidence. Reduced confidence means you get cited less.
The architecture behind this is a Semantic Knowledge Graph. Think of it as a relational database, but instead of isolated tables, everything is a node connected to other nodes by labeled edges. Your company is a node. “Founded in” is an edge. “2015” is another node. “CEO” is an edge. “Your name” is a node. Every fact about your brand is a triple: subject, predicate, object.
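The triple structure can be made concrete in a few lines. This is a minimal sketch with an invented company and invented facts, standing in for a real graph store:

```python
# A knowledge graph as a set of (subject, predicate, object) triples.
# "AcmeCRM" and all of its attributes are hypothetical examples.

triples = {
    ("AcmeCRM", "type", "Organization"),
    ("AcmeCRM", "foundedIn", "2015"),
    ("AcmeCRM", "hasCEO", "Jane Doe"),
    ("AcmeCRM", "makesProduct", "Acme Pipeline"),
}

def facts_about(subject: str) -> dict[str, str]:
    """Collect every predicate/object pair attached to one entity node."""
    return {p: o for (s, p, o) in triples if s == subject}

print(facts_about("AcmeCRM"))
```

Every edge is a claim the machine can check against other sources; a brand with many consistent, corroborated edges is a dense, trustworthy node.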
The critical Schema.org properties for entity connection are these three.
The @id property assigns a unique, stable URI identifier to your entity across all pages of your site. This tells every AI crawler that different mentions across different pages all resolve to exactly the same entity. Without it, your site looks like a collection of loosely related pages rather than a coherent brand node.
The sameAs property links your on-site entity declaration to external authoritative sources: your Wikidata entry, your Crunchbase profile, your verified LinkedIn company page. Think of it as machine-verifiable proof of identity. It lets an LLM merge your self-declared data with its pre-existing training knowledge of your brand. The more sameAs links you provide to trusted sources, the higher the confidence of the match.

The mentions and about properties declare explicitly which entities are discussed within a given piece of content. Instead of making an AI crawler guess what your article is about by reading every word, you tell it directly. This removes ambiguity and dramatically improves extraction accuracy during RAG queries.
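Here is a minimal JSON-LD sketch of those three properties, built as Python dictionaries so the structure is explicit. The company name, URLs, and Wikidata ID are invented placeholders, and real markup would carry more properties:

```python
import json

# Hypothetical Organization node: @id gives it a stable URI, sameAs ties it
# to external profiles. Every URL below is a placeholder.
organization = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "@id": "https://www.example.com/#organization",
    "name": "Example Corp",
    "foundingDate": "2015",
    "sameAs": [
        "https://www.wikidata.org/wiki/Q00000000",
        "https://www.crunchbase.com/organization/example-corp",
        "https://www.linkedin.com/company/example-corp",
    ],
}

# An Article that declares its subject via about/mentions, pointing back at
# the same @id instead of restating the facts on every page.
article = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "How Example Corp structures entity data",
    "about": {"@id": "https://www.example.com/#organization"},
    "mentions": [{"@id": "https://www.example.com/#organization"}],
}

print(json.dumps(organization, indent=2))
```

Because the article references the organization by @id rather than by name, a crawler resolving either block lands on exactly the same node.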
If an LLM cannot accurately map a brand to a recognized concept within its ontology, the brand effectively does not exist in generative search. It gets hallucinated around, ignored, or replaced by a competitor who showed up more clearly.
How LLMs actually retrieve information: RAG vs GraphRAG
Understanding this is not optional anymore. How an AI retrieves information determines whether your content gets cited or ignored.
LLMs have a fixed knowledge cutoff. They know everything they were trained on, and nothing that happened after. To answer questions about recent or specific information, they use a process called Retrieval-Augmented Generation (RAG). Before generating a response, the system retrieves relevant documents from an external database, injects them into the model’s context window, and generates the answer based on that retrieved information. This is how Perplexity works. This is how Google AI Overviews work. This is how enterprise AI tools work.
Traditional RAG uses vector similarity: the user’s query is converted to a mathematical vector, and the system retrieves the text chunks whose vectors are closest in high-dimensional space. It’s fast and effective for simple factual questions, but it struggles with complex questions that require connecting facts across multiple documents.
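Stripped to its core, vector retrieval is a nearest-neighbor search over embeddings. The sketch below uses toy 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions) and invented documents:

```python
import math

# Toy document store: each chunk has a precomputed embedding vector.
# Real systems use embedding models and approximate nearest-neighbor indexes.
chunks = {
    "Acme CRM supports pipeline automation for B2B teams.": [0.9, 0.1, 0.2],
    "Granny Smith apples are tart and green.":              [0.1, 0.9, 0.1],
    "Enterprise CRM pricing starts at $49 per seat.":       [0.8, 0.2, 0.3],
}

def cosine(a, b):
    """Cosine similarity: closer to 1.0 means more semantically similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, k=2):
    """Return the k chunks whose vectors are closest to the query vector."""
    ranked = sorted(chunks, key=lambda text: cosine(query_vec, chunks[text]),
                    reverse=True)
    return ranked[:k]

query = [0.85, 0.15, 0.25]  # stand-in embedding for "best B2B CRM"
print(retrieve(query))      # the two CRM chunks win; the apple chunk is skipped
```

Notice what the query never does: it never checks links or keywords. It only checks geometric proximity, which is exactly why a fuzzy entity footprint gets skipped.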
GraphRAG is the evolution. Instead of retrieving isolated text chunks, it retrieves a connected sub-graph of context from a knowledge graph. If someone asks “which suppliers will affect Company X’s Q4 delivery schedule?”, traditional RAG might fail if those terms don’t appear together in a single document. GraphRAG traverses the knowledge graph: Company X, via the “suppliedBy” edge, to Raw Material Y, via the “affects” edge, to Delivery Schedule Z. It follows the logical connections rather than just the semantic proximity.
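That traversal can be sketched as a breadth-first walk over labeled edges. The supplier graph below is invented to mirror the example in the text:

```python
from collections import deque

# Toy knowledge graph as an adjacency list of labeled edges.
# All entities and edge labels are hypothetical.
graph = {
    "Company X":           [("suppliedBy", "Raw Material Y")],
    "Raw Material Y":      [("affects", "Delivery Schedule Z")],
    "Delivery Schedule Z": [],
}

def traverse(start: str, max_hops: int = 3):
    """Collect every path reachable from `start` within max_hops edges."""
    seen, paths = {start}, []
    queue = deque([(start, [])])
    while queue:
        node, path = queue.popleft()
        if len(path) >= max_hops:
            continue
        for predicate, target in graph.get(node, []):
            new_path = path + [(node, predicate, target)]
            paths.append(new_path)
            if target not in seen:
                seen.add(target)
                queue.append((target, new_path))
    return paths

for path in traverse("Company X"):
    print(" -> ".join(f"{s} [{p}] {o}" for s, p, o in path))
```

No single document needs to contain "Company X" and "Delivery Schedule Z" together; the connection is recovered hop by hop from the graph.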
| | Traditional vector RAG | GraphRAG (2025–26 standard) |
|---|---|---|
| Retrieval unit | Closest text chunks by vector similarity | Connected sub-graph of entities and relations |
| Strength | Fast and effective for simple factual questions | Multi-hop questions that span documents |
| Limitation | Struggles to connect facts across multiple documents | Depends on a well-maintained knowledge graph |
What does this mean practically? When ChatGPT or Perplexity answers “what’s the best CRM for mid-sized B2B companies?”, it’s running something close to GraphRAG against a combination of its training data and real-time retrieved content. The brands with clear entity definitions, consistent attributes, and high co-occurrence with relevant query terms in trusted contexts get retrieved and cited. The brands that are structurally ambiguous get skipped.
Your structured data, your sameAs links, your consistent entity definition across the web: these are not just SEO housekeeping. They are directly feeding the data structures that AI retrieval systems query in real time.
Why unlinked mentions now matter more than links
This is the one that surprises people the most, and it makes complete sense once you understand how LLMs learn.
PageRank relied on hyperlinks because hyperlinks were explicit, machine-readable signals of endorsement. One page linking to another was a declaration that could be counted and weighted. The entire link economy of the 2000s and 2010s was built on that single architectural fact.
LLMs don’t read the web by following links in real time. They train on massive static corpuses of text: Common Crawl, digitized books, Wikipedia, Reddit, curated web datasets. During this training, the model doesn’t care whether your brand name has a hyperlink attached to it or not. What it cares about is co-occurrence and context. How often does your brand name appear near the concepts you want to be associated with? What kind of language surrounds those mentions? Is it positive, negative, or neutral?
A study of 75,000 brands found that branded web mentions correlate most strongly with brand mentions in AI Overviews. Not backlink profiles. Not Domain Rating. Mentions, in context, across the open web. An LLM builds confidence from consensus. If ten independent sources discuss your brand favorably in the context of a specific product category, the model binds your brand vector tightly to that category.
This creates the “dark funnel” problem. A user asks ChatGPT for the best enterprise CRM, reads a comprehensive synthesized answer, and visits the recommended brand directly by typing the URL. Traditional analytics sees a direct traffic visit with no attribution. The AI’s determinative role in the conversion is completely invisible. The influence is real. The measurement is broken.
The zero-click reality and what it actually means for your business
Rand Fishkin at SparkToro has been saying this for years: the real disruption is not AI. It’s the disappearance of clicks. When an AI answers a question completely within the interface, there’s no reason to click anywhere. The traffic that used to flow to publisher sites, brand blogs, and informational content pages is being absorbed by the AI layer itself.
LinkedIn’s B2B Organic Growth team published internal data in early 2026 showing up to 60% traffic declines for non-brand informational content, even when those pages maintained stable traditional search rankings. The pages were still ranking. Nobody was clicking on them. The AI Overviews were answering the question before anyone got there.
The economics of this are genuinely unsettling for businesses that built themselves on informational content traffic. The old model was: write good content, rank for informational queries, convert some percentage of that traffic to leads or sales. That funnel is collapsing at the top. The information is now delivered by the AI. The brand recognition happens inside the AI’s response. And if your brand isn’t in that response, you never got into the funnel at all.
What remains valuable in this world is brand mention frequency in AI responses and the quality of that mention’s context. Are you being cited as a credible example? Are you the first brand mentioned or the third? Are you associated with positive attributes or hedged with caveats? These questions are replacing rank tracking as the primary competitive metrics.
Vector real estate: the spatial metaphor that makes this click
LLMs represent meaning as mathematics. Every concept, entity, and phrase is translated into a vector: a dense array of hundreds or thousands of floating-point numbers. These vectors are plotted in a high-dimensional space where mathematical proximity reflects semantic similarity. The vector for “puppy” sits near “dog” and “canine,” far from “skyscraper.”
Your brand has a vector position in that space. Brand authority in 2026 is determined by where that position sits relative to the concepts you care about. If your brand’s vector is densely clustered near “reliable B2B software,” “enterprise CRM,” and “customer data platform,” then when a user asks about any of those topics, the model retrieves and cites your brand. You don’t even need to be mentioned by name in the query.
The phrase circulating in technical marketing circles is “vector real estate.” You earn it through semantic density: the frequency with which your brand name co-occurs with your target topic vocabulary across credible, diverse sources. The more consistent and widespread that co-occurrence, the tighter your vector clusters around those concepts, and the more reliably you get retrieved.
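Semantic density can be approximated crudely by counting how often a brand name appears near target vocabulary in a corpus. The sketch below is a toy; the snippets, the brand name, and the window size are all invented assumptions:

```python
import re

# Toy corpus of editorial snippets; "AcmeCRM" is a hypothetical brand.
corpus = [
    "AcmeCRM is a reliable enterprise CRM for mid-sized B2B teams.",
    "For customer data platforms, AcmeCRM and two rivals lead the field.",
    "The weather in Madrid was pleasant all week.",
]

BRAND = "acmecrm"
TARGET_TERMS = {"crm", "b2b", "enterprise", "customer"}

def cooccurrence_score(texts, window=8):
    """Count target terms appearing within `window` tokens of the brand name."""
    score = 0
    for text in texts:
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        for i, tok in enumerate(tokens):
            if tok == BRAND:
                nearby = tokens[max(0, i - window): i + window + 1]
                score += sum(1 for t in nearby if t in TARGET_TERMS)
    return score

print(cooccurrence_score(corpus))
```

Training pipelines do something far more sophisticated than windowed counting, but the directional logic is the same: more co-occurrence with target vocabulary in more independent sources pulls your brand vector closer to those concepts.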
The llms.txt protocol: talking directly to the machine
Proposed by Jeremy Howard of fast.ai in 2024 and now standardized across the industry, llms.txt is the brand’s direct communication channel to AI systems. Conceptually it’s like robots.txt, but instead of telling crawlers what to avoid, it tells AI models exactly what they need to know about your organization.
The file lives at the root of your domain (yourdomain.com/llms.txt), is written in plain Markdown, and is actively ingested by models including Claude, Perplexity, and various ChatGPT implementations. Real-world results from 2026 implementations show accuracy improvements of 30 to 70% in AI-generated brand summaries when the file is present and well-structured.
The reason it works is architectural. HTML is designed for human browsers: visual layout, navigation, styling. LLMs parse HTML by stripping the markup and reading the text, which introduces noise, inconsistency, and ambiguity. A plain Markdown file is the native format of LLMs. It bypasses all of that and delivers pure, structured entity data directly.
| Section | What to include | Why it matters |
|---|---|---|
| Company description | One sentence. No adjectives like 'innovative' or 'leading.' Just verifiable facts: what you do, for whom, and where. | LLMs process text literally. Vague marketing language creates semantic ambiguity. Ambiguity reduces confidence. Reduced confidence means fewer citations. |
| Products and services | Exact product names, core functionalities, pricing if applicable. Use the same names you use everywhere else. | Structured product data is what shopping agents use for comparison and selection. Inconsistent names break entity resolution. |
| Markets served | Specific geographies (city, country), specific industry verticals. Not 'global' or 'enterprises.' Be concrete. | Grounds your entity in spatial logic. Improves accuracy for localized RAG queries. 'Serving B2B SaaS companies in Spain' is more useful than 'serving businesses worldwide.' |
| Target search intents | The exact queries you want to be retrieved for. Verbatim, in natural language. 'GEO agency Madrid' not 'digital marketing.' | Binds your vector directly to the specific query concepts you care about. Prevents the model from miscategorizing you. |
| External references | Verified URLs: Wikidata entry, Crunchbase profile, LinkedIn company page, G2 listing, Clutch profile. | Provides the model with cross-referenceable citations that accelerate entity resolution and reinforce the trust layer. |
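Putting the five sections together, a complete file can be very short. Everything below — the company, products, prices, and URLs — is an invented placeholder, shown only to illustrate the shape:

```markdown
# Example Corp

> Example Corp makes CRM software for mid-sized B2B SaaS companies in Spain
> and Portugal. Founded 2015, headquartered in Madrid.

## Products
- Acme Pipeline: sales pipeline automation, from €49 per seat per month

## Markets served
- B2B SaaS companies in Spain and Portugal

## Target search intents
- "CRM for mid-sized B2B companies"
- "sales pipeline automation Spain"

## External references
- [Wikidata](https://www.wikidata.org/wiki/Q00000000)
- [Crunchbase](https://www.crunchbase.com/organization/example-corp)
- [LinkedIn](https://www.linkedin.com/company/example-corp)
```

Note the absence of adjectives: every line is a verifiable claim a model can cross-reference, not a slogan it has to discount.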
The content architecture that AI prefers
Traditional SEO produced fragmented, keyword-targeted pages designed to rank for isolated queries. The result was often dozens of thin pages each trying to own a single keyword, loosely connected or not connected at all.
LLMs evaluate topical authority by looking at the comprehensiveness and structural coherence of a domain’s total content, not individual page performance. This requires a completely different architecture.
The hub-and-spoke model is the framework that works. A pillar page covers a broad topic exhaustively. A set of 10 to 15 spoke pages each cover a specific sub-facet of that topic in depth. Every spoke links back to the pillar with descriptive anchor text. Every pillar links out to all spokes. Internal links are not about passing “link juice” anymore. They are about defining explicit semantic relationships between concepts for AI crawlers. Generic anchor text like “click here” or “read more” is a waste.
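The link map of a hub-and-spoke cluster is simple enough to sketch as data. The URLs and anchor texts below are hypothetical:

```python
# Hypothetical hub-and-spoke cluster: one pillar, descriptive anchors both ways.
site = {
    "pillar": "/entity-optimization",
    "spokes": {
        "/entity-optimization/schema-markup": "advanced schema markup for entities",
        "/entity-optimization/llms-txt": "writing an llms.txt file",
    },
}

def internal_links():
    """Yield (from_page, to_page, anchor_text) for every pillar/spoke pair."""
    for spoke, anchor in site["spokes"].items():
        yield (site["pillar"], spoke, anchor)               # pillar links out
        yield (spoke, site["pillar"], "entity optimization guide")  # spoke links back

for link in internal_links():
    print(link)
```

The anchor text is the payload here: each tuple is a labeled edge declaring a semantic relationship, which is why "click here" carries no signal.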
The other architectural shift is “Bottom Line Up Front” writing. AI crawlers prefer content where the definitive answer appears in the first 40 to 60 words after a heading. Those compact, self-contained answer blocks are what get extracted as citation blocks in RAG pipelines. FAQPage schema amplifies this dramatically. Adding FAQPage schema to your key pages is declaring to every AI crawler: here are the questions this page answers, and here are the precise answers.
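A FAQPage block pairs naturally with Bottom-Line-Up-Front writing, since each answer is already a compact extraction unit. A minimal sketch, with a placeholder question and answer, built as a Python dict and serialized to JSON-LD:

```python
import json

# Hypothetical FAQPage markup; the question and answer are placeholders.
faq_page = {
    "@context": "https://schema.org",
    "@type": "FAQPage",
    "mainEntity": [
        {
            "@type": "Question",
            "name": "What is entity resolution?",
            "acceptedAnswer": {
                "@type": "Answer",
                # Definitive answer up front, inside the 40-to-60-word window.
                "text": "Entity resolution is the process by which a machine "
                        "determines that different names refer to the same "
                        "real-world organization, person, or thing.",
            },
        },
    ],
}

print(json.dumps(faq_page, indent=2))
```

Each Question/Answer pair is a pre-chunked, self-contained unit, which is exactly the shape a RAG pipeline wants to lift into a citation block.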
The signal table: what used to work, what works now
| Tactic | Era | Status | The actual reason |
|---|---|---|---|
| Keyword stuffing | 1990s | dead | Penalized since early 2000s. Signals spam. Don't. |
| Bulk link building | 2000s | dead | Penguin killed unnatural patterns. Quality editorial links from topically relevant sources still matter significantly. |
| Exact-match anchor text | 2000s | dead | Manipulative signal. Natural anchor text diversity is now the expected pattern. |
| Private blog networks | 2000s–10s | dead | Google detects and penalizes. LLMs don't credit them either — no editorial credibility. |
| Long-form topical content | 2010s–now | works | Depth signals expertise. LLMs favor comprehensive sources for citations over shallow pages. |
| Schema markup (basic) | 2010s–now | works | Still important but basic implementation is the floor, not the ceiling. |
| Advanced entity schema (@id, sameAs) | 2020s–now | critical | The primary mechanism for entity resolution. Without this, machines cannot unambiguously identify you. |
| E-E-A-T signals | 2019–now | critical | Author credentials, bylines, editorial citations, real-world reputation. Who you are matters as much as what you publish. |
| Wikipedia / Wikidata entry | 2020s–now | works | LLMs are trained heavily on Wikipedia and Wikidata. Your entry is part of the training data that shapes what AI systems know about you. |
| Unlinked brand mentions | 2024–now | growing | LLMs don't need hyperlinks. Co-occurrence of your brand with target concepts in credible editorial contexts directly builds your vector real estate. |
| llms.txt file | 2024–now | emerging | Claude, Perplexity, and others actively ingest it. 30 to 70% improvement in AI brand summary accuracy in early implementations. Most brands haven't done it yet. |
| FAQPage schema | 2020s–now | works | Maps directly to RAG extraction logic. AI systems prefer Q&A-structured content for citation blocks. |
| Digital PR and podcast appearances | 2024–now | growing | Editorial mentions in credible contexts are the votes of the LLM era. They build semantic density in training data. |
| Single-page SEO with no entity layer | Now | dead | Individual pages without entity context mean increasingly little to answer engines. The unit of optimization has shifted from page to brand. |
New KPIs for the AI era
Traditional SEO metrics are not disappearing overnight, but they are becoming incomplete. Keyword ranking positions measure your visibility in a paradigm where users click through results. That paradigm is shrinking. The metrics that matter alongside rankings are these.
Entity Share of Voice
How often your brand is mentioned in AI responses for your target queries, relative to competitors. The clearest competitive benchmark of algorithmic preference available.
Citation Provenance
Not just whether you were mentioned, but whether your own content was the cited source. Being mentioned is good. Being cited as the source is better.
Sentiment Score
The contextual tone of AI statements about your brand. Positive, hedged, or negative. Heavily influenced by your unlinked mention profile across the training data.
Answer Position
Where in a synthesized response your brand appears. First mention captures disproportionate trust. Third or fourth mention after competitors signals secondary status.
Tracking these requires dedicated tools. The market has moved fast. Platforms like LLM Pulse, Peec AI, Profound, and Kime autonomously query multiple LLMs repeatedly to extract quantitative performance data. They track your brand across ChatGPT, Gemini, Perplexity, and Claude simultaneously, giving you the visibility data that web analytics platforms completely miss.
| Platform | Models tracked | Primary focus | Best for |
|---|---|---|---|
| LLM Pulse | 5 to 10+ | Share of Voice and Sentiment | Enterprise brands and large agencies |
| Peec AI | 10 | Real-time mentions and citations | E-commerce startups, practical tracking |
| Profound | 5+ | Strategy, compliance, citation reporting | Complex data environments, enterprise risk |
| Kime | Multiple | AI visibility across LLMs | Broad multi-model AI presence tracking |
| LLMrefs | Multiple | Brand mentions vs. competitor benchmarking | Competitive intelligence and GEO strategy |
How to plan and build with connected entities in mind
The brands that will be invisible in two years are the ones that think this is something they can deal with later. It isn’t. The training data that shapes LLM behavior is being collected and weighted now. The entity associations being built into knowledge graphs are being built now. The window for competitive advantage is open, and it will close as competitors catch up. Here is how to approach this systematically.
1. Define your entity sentence
Write one sentence about what you do with zero adjectives and zero marketing language. Factual, verifiable, specific. This sentence becomes the foundation for your Schema.org Organization markup, your Wikidata entry, and your llms.txt description.
2. Build your sameAs network
Get on Wikidata. Claim and complete your Crunchbase profile. Ensure your LinkedIn company page is complete and accurate. Add a verified Google Business Profile. Then link all of them from your website's sameAs schema property.
3. Audit consistency everywhere
Check your brand name spelling, founding year, address, product names, and industry category across every platform. Inconsistency signals uncertainty to entity resolution systems. Uncertainty means fewer citations.
4. Deploy advanced schema markup
Add Organization schema with @id, sameAs, foundingDate, areaServed, and knowsAbout to your homepage. Add FAQPage schema to key informational pages. Add Article schema with author, mentions, and about to all content.
5. Create an llms.txt file
Write it in plain Markdown. One factual sentence for company description. Exact product names. Specific markets. The exact queries you want to be retrieved for. Links to Wikidata, Crunchbase, LinkedIn, and G2. Put it at yourdomain.com/llms.txt.
6. Build hub-and-spoke content
Pick your three most important topic areas. For each, write one exhaustive pillar page and 10 to 15 specific spoke pages covering sub-facets in depth. Connect them with descriptive anchor text. This signals topical authority to LLMs.
7. Earn editorial mentions actively
Appear on podcasts in your industry. Contribute to expert roundups. Get reviewed on G2 and Clutch. Write guest posts for publications your audience reads. These unlinked mentions in credible contexts build your semantic density in training data.
8. Monitor and respond to your AI presence
Ask ChatGPT, Perplexity, Gemini, and Claude about your brand and your category monthly. Screenshot the responses. Compare against competitors. Treat this as a core brand monitoring discipline.
The checklist: print it, use it, come back to it quarterly
Entity foundation
- One-sentence factual entity definition written and used everywhere (step 1)
- Wikidata, Crunchbase, LinkedIn, and Google Business Profile claimed and linked via sameAs (step 2)
- Brand name spelling, founding year, address, product names, and industry category consistent across every platform (step 3)
LLM-specific technical
- Organization schema with @id, sameAs, foundingDate, areaServed, and knowsAbout on the homepage (step 4)
- FAQPage and Article schema on key content pages (step 4)
- llms.txt live at the domain root with description, products, markets, intents, and references (step 5)
Content architecture
- Pillar page plus 10 to 15 spoke pages for each core topic, connected with descriptive anchor text (step 6)
- Definitive answers in the first 40 to 60 words after each heading
Off-site mentions and PR
- Podcast appearances, expert roundups, G2 and Clutch reviews, guest posts in audience publications (step 7)
Monitoring and measurement
- Monthly brand and category queries across ChatGPT, Perplexity, Gemini, and Claude, screenshotted and benchmarked against competitors (step 8)
- Entity Share of Voice, Citation Provenance, Sentiment Score, and Answer Position tracked over time
The pattern across all of this is the same one that has repeated in every era of search. The systems get smarter. The shortcuts stop working. What survives is genuine, consistent, well-structured credibility.
The brands that invested in quality content and real links when Penguin arrived in 2012 outperformed for a decade. The brands that build real entity clarity, real external references, and real semantic density now will outperform in the era where AI makes the first recommendation before a user ever types a query. Understanding history means you don’t have to wait to see how bad things get before you adapt.