The peer-reviewed science behind Generative Engine Optimization.
Most GEO/AEO content on the internet is opinion, anecdote, or vendor speculation. This page is different. It's the canonical research foundation for best-aeo-skill, focused on the one peer-reviewed paper that established GEO as a research field — and how every scoring weight in our skill traces back to it.
Generative Engine Optimization (GEO) is a young field. The term was formally introduced in November 2023 with the arXiv preprint of "GEO: Generative Engine Optimization" by a Princeton-led team, and presented at KDD 2024 — the Association for Computing Machinery's premier data science conference. Before that paper, the entire literature on optimizing for AI-generated answers was practitioner blog posts and vendor whitepapers.
The Princeton paper changed that. It formalized GEO as a measurable discipline by:
- Building GEO-bench, a 10,000-query benchmark spanning 9 domains (legal, history, science, business, etc.)
- Defining Position-Adjusted Word Count (PAWC) and Subjective Impression as standardized citation-quality metrics
- Empirically testing 9 distinct optimization tactics against this benchmark
- Measuring per-tactic and per-domain effects with statistical rigor
Two years later (2026), the paper has accumulated hundreds of citations and remains the only widely-cited peer-reviewed work that quantifies which tactics actually move the needle on AI citation rates. Every other "GEO study" you'll see on the internet — from agencies, vendors, or commenters — either cites this paper or makes uncalibrated claims.
That's why best-aeo-skill operationalizes the Princeton paper specifically. When a user asks "will this work?", we can point to peer review, not anecdote.
The setup
The team built GEO-bench: 10,000 user queries across 9 domains. For each query, they generated a baseline response using a generative engine (a synthesized answer with cited sources). Then they applied each of 9 candidate optimization tactics to the source content and re-ran the query — measuring whether the modified source got more visibility in the new synthesized response.
"Visibility" was operationalized two ways:
- Position-Adjusted Word Count (PAWC) — how much of the synthesized answer is sourced from this page, weighted by where in the answer it appears (top-of-answer = higher weight)
- Subjective Impression — judges' rating of how prominently the source is featured in the response
Both metrics moved together for most tactics. The paper reports composite "visibility uplift" percentages, which we use throughout this site.
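To make the position weighting concrete, here is a minimal sketch of a PAWC-style calculation. It assumes a simple exponential position-decay weight; the paper's exact weighting function may differ.

```python
# Illustrative PAWC-style metric: words contributed by a source, weighted by
# how early in the synthesized answer they appear. The exponential decay is
# an assumption for illustration, not the paper's exact formula.
import math

def pawc(answer_sentences, source_id):
    """answer_sentences: list of (cited_source, sentence) in answer order."""
    total = len(answer_sentences)
    score = 0.0
    for position, (cited_source, sentence) in enumerate(answer_sentences):
        if cited_source == source_id:
            weight = math.exp(-position / total)  # earlier sentences weigh more
            score += weight * len(sentence.split())
    return score

answer = [
    ("page_a", "Generative engines synthesize answers from cited sources."),
    ("page_b", "Source emphasis raises citation likelihood."),
    ("page_a", "Formatting alone can change visibility."),
]
print(pawc(answer, "page_a"))
```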
The headline result
The single most surprising finding from the paper:
Source emphasis — the simple act of bolding citations or framing them prominently — increased citation likelihood by +115%, the strongest effect of any tactic tested (Aggarwal et al., 2024, Section 5.2).
Two implications:
- You can more than double your AI citation rate with formatting alone, no new content needed
- Most existing content on the internet under-emphasizes its sources, which is why so many site owners are dissatisfied with their AI search performance
The paper also identified two negative findings: tactics that reduce visibility. Keyword-stuffing was the most prominent — confirming that the same tactic that hurts in modern Google also hurts in generative engines, possibly more aggressively.
| # | Tactic | Description | Visibility impact |
|---|---|---|---|
| 1 | Source emphasis | Bold or otherwise emphasize cited sources, references, attribution. | +115% |
| 2 | Expert quotes | Add 2-4 attributed quotations per ~1000 words. Use quotation marks with speaker name. | +41% |
| 3 | Statistics | Add numeric claims with sources. Target ~1 stat per 200 words. | +40% |
| 4 | Inline citations | Reference primary sources at the point of claim, not only at the bottom. | +30% |
| 5 | Authority signaling | Credential markup, named contributors, institutional affiliation. | +25% |
| 6 | Improved fluency | Natural language; reduced formulaic phrasing; varied sentence length. | +15% |
| 7 | Easy-to-read | Flesch-Kincaid grade 8-10. More academic prose loses general-purpose engines; oversimplified prose loses authority. | +12% |
| 8 | Topic relevance | One primary topic per page. Avoid multi-topic mash-up content. | +10% |
| 9 | Keyword stuffing | Stuffing the page with target keywords. | -22% |
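The quote and statistic targets in rows 2-3 of the table can be checked mechanically before publishing. A rough sketch, using simple regex heuristics that are our own illustration rather than the skill's actual collectors:

```python
# Rough density check for the per-1,000-word quote and per-200-word statistic
# targets in the table above. The regexes are illustrative assumptions, not
# the skill's real extraction logic.
import re

def density_report(text):
    words = max(len(text.split()), 1)
    quotes = len(re.findall(r'"[^"]{20,}"', text))       # quoted passages of 20+ characters
    stats = len(re.findall(r'\b\d+(?:\.\d+)?%?', text))  # integers, decimals, percentages
    return {
        "quotes_per_1000_words": quotes / words * 1000,
        "stats_per_200_words": stats / words * 200,
    }
```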
A research paper is just text until someone implements it. We built best-aeo-skill as a one-to-one operationalization. Each Princeton tactic maps to a specific evidence collector in our scoring engine, and each collector maps to a numbered Rule in SKILL.md.
When you run bestaeo audit, each finding the skill returns is grounded in this map. If a finding says "Add expert quotes — projected +12 GEO score," you can trace it to quote_extractor → Rule 13 → Aggarwal et al., 2024, Section 5.2, Tactic 2. No invented metrics.
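For illustration, that traceability can be represented as a small lookup table. Only the quote_extractor → Rule 13 chain comes from the example above; the other names below are hypothetical placeholders, not the skill's actual registry.

```python
# Hypothetical sketch of the tactic -> collector -> rule -> source map.
# Only the expert-quotes chain is taken from the example in the text;
# the statistics entry uses placeholder names for illustration.
TACTIC_MAP = {
    "expert_quotes": {
        "collector": "quote_extractor",
        "rule": "Rule 13",
        "source": "Aggarwal et al., 2024, Section 5.2, Tactic 2",
    },
    "statistics": {
        "collector": "stat_density",  # placeholder collector name
        "rule": "Rule 14",            # placeholder rule number
        "source": "Aggarwal et al., 2024, Section 5.2, Tactic 3",
    },
}

def trace(tactic):
    """Return the provenance chain for a finding, or None if unmapped."""
    entry = TACTIC_MAP.get(tactic)
    if entry is None:
        return None
    return f"{entry['collector']} -> {entry['rule']} -> {entry['source']}"

print(trace("expert_quotes"))
```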
Beyond the Princeton paper, the field generates ongoing empirical data from industry sources. We track the most useful figures and update our scoring weights when reliable measurements appear:
Sources we cite
- SE Ranking — audited 300,000 domains for llms.txt presence (Q1 2026); reports 10.13% adoption.
- Superlines — quarterly tracking of Google AI Overview trigger rates; up from 13.14% in March 2025 to 25.11% in Q1 2026.
- Position.digital — analysis of AI referral traffic distribution across engines; ChatGPT dominates at 87%.
- HubSpot — case studies showing 6× AI-referred trial uplift within 7 weeks of consistent optimization.
- OpenAI usage reports — ChatGPT WAU 900M, monthly visits 5.72B (2026).
- SimilarWeb — zero-click search rate tracking; 43% in standard mode, 93% with AI Mode active.
None of these are peer-reviewed in the academic sense, but they are traceable empirical figures from organizations whose business depends on the data being accurate. We treat them as Tier-2 citations: useful, but explicitly marked as industry data, not peer-reviewed research.
The 4-vector composite
The Princeton tactics cluster into four orthogonal vectors. We weight them based on what's most actionable for the typical site:
- Technical Accessibility (20%) — robots.txt, AI bot allowance, JS rendering. If crawlers can't reach you, prose doesn't matter.
- Content Citability (35%) — statistic density, expert quotes, citations, freshness. The single biggest weight, because Princeton's strongest tactics live here.
- Structured Data (20%) — FAQPage, Article, Organization, HowTo, Speakable. Beyond Princeton, but empirically high-leverage for AI Overviews and Perplexity.
- Entity & Brand Signals (25%) — author credentials, Knowledge Graph linking, NAP consistency. Sustained citation requires entity presence, not just one-off content quality.
Weights adapt to your business profile (SaaS, e-commerce, publisher, local, agency, devtools, academic, default). A SaaS landing page isn't audited like a news article: the Structured Data vector matters more for SaaS, while Citability matters more for publishers.
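A minimal sketch of the composite, assuming a plain weighted average of 0-100 vector scores. The default weights come from the list above; the SaaS override is an illustrative assumption, not the skill's published profile.

```python
# Weighted 4-vector composite. Default weights come from the list above;
# the SaaS override below is an illustrative assumption.
DEFAULT_WEIGHTS = {
    "technical": 0.20,
    "citability": 0.35,
    "schema": 0.20,
    "entity": 0.25,
}

PROFILE_OVERRIDES = {
    # Hypothetical example: shift weight toward structured data for SaaS.
    "saas": {"technical": 0.20, "citability": 0.25, "schema": 0.30, "entity": 0.25},
}

def composite_score(vector_scores, profile="default"):
    """vector_scores: dict of 0-100 scores per vector."""
    weights = PROFILE_OVERRIDES.get(profile, DEFAULT_WEIGHTS)
    return sum(weights[v] * vector_scores[v] for v in weights)

print(composite_score({"technical": 90, "citability": 60, "schema": 70, "entity": 80}))
```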
Confidence labels
Every finding output by the skill carries one of three labels:
- Confirmed — directly observed by an evidence collector. Example: `parse_html.py` returned no `<title>` tag.
- Likely — inferred from ≥2 collectors that agree. Example: `schema_validate` found no FAQPage AND `quote_extractor` detected Q&A patterns.
- Hypothesis — LLM judgment or single weak signal. Always flagged for human review.
This is the anti-hallucination guarantee: no recommendation is ever presented without a label. If a tool tells you "fix this" without saying how confident it is — be skeptical.
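As a sketch, the label assignment reduces to a small decision rule over collected evidence. The thresholds mirror the definitions above; the data shape and function name are our own illustration.

```python
# Illustrative label assignment. The direct-observation and collector-count
# rules follow the definitions above; the evidence data shape is an assumption.
def confidence_label(evidence):
    """evidence: list of dicts like {"collector": str, "direct": bool}."""
    if any(e["direct"] for e in evidence):
        return "Confirmed"   # directly observed by an evidence collector
    if len({e["collector"] for e in evidence}) >= 2:
        return "Likely"      # >=2 independent collectors agree
    return "Hypothesis"      # single weak signal or LLM judgment: human review
```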
Score bands
- 86-100 Excellent · cited frequently · maintain freshness
- 68-85 Good · regular citation, gaps to fix · apply top-3 fixes
- 36-67 Foundation · indexed but rarely cited · run full audit, fix everything
- 0-35 Critical · effectively invisible · fix Technical and Schema first, then content
A score below 36 almost always indicates a technical or schema problem, not a content problem. The audit's recommended action ordering reflects this.
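For reference, the bands translate directly into a lookup; a minimal sketch whose action strings paraphrase the list above.

```python
# Score-band lookup mirroring the bands listed above.
def score_band(score):
    if score >= 86:
        return "Excellent", "cited frequently; maintain freshness"
    if score >= 68:
        return "Good", "regular citation with gaps; apply top-3 fixes"
    if score >= 36:
        return "Foundation", "indexed but rarely cited; run full audit"
    return "Critical", "effectively invisible; fix Technical and Schema first"

print(score_band(42))  # ('Foundation', 'indexed but rarely cited; run full audit')
```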