LLMs have a well-earned reputation for hallucination. I tested GPT-5.1 on 500 pages of investment bank research (14 separate 2026 outlook PDFs from Goldman, UBS, Barclays, and others). The result was this site comparing their analysis, where each claim links to the PDF page it came from. Spot-checking 100 citations turned up an error rate in the very low single digits for this project.
Below I explain how I built it, and why the citation accuracy matters.
Output: 2026 Macro Analysis
Comparing rate-path views meant jumping between UBS's asset-allocation section (p.42), Goldman's rates chapter (p.23), and Barclays's policy discussion (p.19). I kept losing page refs and couldn't remember who said what. The reports use different structures, so there's no common index.
I wanted side-by-side views by topic with traceable citations rather than a blended summary.
I defined a 6-pass pipeline: summarize each report, define comparison axes, extract evidence with page refs, build tables and syntheses, render HTML, link citations to PDFs.
Everything was done in Python inside Cursor, with Claude Code / Codex orchestrating the workflows.
Each report was summarized (by GPT-5.1) into a fixed template: Headline Summary, Macro & Policy Path, Rates & Fixed Income, Equities, FX & Commodities, Scenarios & Risks, Time Horizon & Confidence, Drivers & Evidence, Notable Quotes, Other Emergent Themes. This normalizes structure so later passes could compare.
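For concreteness, here is a minimal sketch of what a Pass 1 call could look like, assuming the standard OpenAI Python client. The section list is the template above; the prompt wording, function name, and model string are illustrative rather than the exact code used.

```python
# Sketch of a Pass 1 summarization call (illustrative, not the actual script).
from openai import OpenAI

SECTIONS = [
    "Headline Summary", "Macro & Policy Path", "Rates & Fixed Income",
    "Equities", "FX & Commodities", "Scenarios & Risks",
    "Time Horizon & Confidence", "Drivers & Evidence",
    "Notable Quotes", "Other Emergent Themes",
]

def summarize_report(report_text: str, client: OpenAI, model: str = "gpt-5.1") -> str:
    prompt = (
        "Summarize this 2026 outlook report into exactly these markdown sections:\n"
        + "\n".join(f"## {s}" for s in SECTIONS)
        + "\n\nCite pages as [p.X] for every factual claim.\n\n"
        + report_text
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```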
I had GPT-5.1 read all the Pass 1 summaries and propose candidate vectors—comparison axes that recur across reports. This went through several iterations with my input (v1→v2→v3→final), merging and pruning based on coverage. "Housing" got folded into Growth/Inflation after only 3 reports mentioned it substantively.
Final set: Policy Path, Growth/Inflation/Fiscal, Equities Valuation, FX & USD, Sovereign/Fiscal/Supply, Credit & Private Credit, China & Asia, Energy/Commodities/Power, AI CapEx & Constraints, Geopolitics/Trade/Reshoring.
I could have chosen a longer list, but felt that this balanced breadth with depth.
Pass 3 goes back to full report text and pulls detailed evidence for each report × vector. Required fields: coverage assessment, stance, core positions with page refs, key numbers with units, horizon, confidence cues, risks and triggers, notable quotes.
Page refs are mandatory in the form [p.X]. If uncertain, the extractor uses [p.?] to flag gaps, though this was surprisingly rare.
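A cheap way to catch missing refs (a sketch of a possible check, not necessarily the one used here) is to flag any extracted bullet that carries neither a [p.X] nor a [p.?] tag:

```python
import re

REF = re.compile(r"\[p\.(\d+|\?)\]")

def bullets_missing_refs(extraction_md: str) -> list[str]:
    """Return bullet lines in an extraction that carry no [p.X] / [p.?] tag."""
    bullets = [l for l in extraction_md.splitlines() if l.lstrip().startswith("- ")]
    return [l for l in bullets if not REF.search(l)]
```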
Scale: 140 extractions, ~32k input tokens each, ~34 seconds each, ~$16 total (GPT-5.1). Cheaper than I expected.
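Back-of-envelope, those numbers work out roughly as follows (all approximate, derived from the figures above):

```python
# Rough arithmetic from the stated scale (approximate).
extractions = 14 * 10                       # 14 reports x 10 vectors = 140
input_tokens = extractions * 32_000         # ~4.5M input tokens
cost_per_extraction = 16 / extractions      # ~$0.11 each
sequential_minutes = extractions * 34 / 60  # ~79 min if run one at a time
print(extractions, input_tokens, round(cost_per_extraction, 2), round(sequential_minutes))
```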
Step 1: The LLM samples several extractions and proposes 5-7 comparison headings per vector. Example from Policy Path: "DM policy rate path to 2026 — Expected cuts/hikes, timing, and end-2026 levels."
Step 2: For each report × heading, extract what that report says. Two-stage: free-form first, then normalize into text + coverage score (0-100). Scores under 50 display as "n.a.". I chose a cut-off because some of the reports were very thin on certain topics, and it did not make sense to clutter the tables with marginal mentions.
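A minimal sketch of how that cutoff could be applied, assuming each cell is normalized into text plus a 0-100 coverage score as described; the dataclass and threshold constant below are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Cell:
    text: str
    coverage: int  # 0-100: how substantively the report covers this heading

NA_CUTOFF = 50  # below this, render "n.a." instead of cluttering the table

def render_cell(cell: Cell) -> str:
    return cell.text if cell.coverage >= NA_CUTOFF else "n.a."
```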
Step 3: All extractions for a vector (from Pass 1) go into one prompt to write the synthesis. Output has standard sections: Overall Summary, Where Analysts Agree, Where Analysts Disagree, Core Narrative by Sub-Areas, Key Conditional Risks, Bottom Line.
Each vector becomes an HTML page with two tabs: Synthesis and Comparison. Comparison tables use accordions—each heading expands to show all firms' views. I had Cursor create a script to transform the .md files into a simple HTML template.
Page refs like [p.42] get converted to clickable links that open the PDF at that page. Table page refs are linked automatically; synthesis page refs were linked with Cursor Composer.
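The table-ref conversion is essentially a regex pass. A sketch (the actual script may differ) that relies on the #page= URL fragment most browser PDF viewers honor:

```python
import re

def link_page_refs(html: str, pdf_url: str) -> str:
    """Turn [p.42] into a link that opens the PDF at page 42."""
    return re.sub(
        r"\[p\.(\d+)\]",
        lambda m: f'<a href="{pdf_url}#page={m.group(1)}">[p.{m.group(1)}]</a>',
        html,
    )
```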
The citations flow is simple: PDFs are converted to text with explicit page markers like --- Page 42 ---, and the extraction prompt requires a [p.X] reference on every factual claim. After extraction I spot-checked about 100 citations across 10 reports and found the occasional [p.?] or misaligned page.
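The page markers come from the PDF-to-text step. A sketch assuming pypdf (the post doesn't say which extractor was used; any per-page text extractor works):

```python
from pypdf import PdfReader  # assumption: pypdf stands in for whatever extractor was used

def pdf_to_marked_text(path: str) -> str:
    """Extract a PDF to plain text with explicit '--- Page N ---' markers."""
    reader = PdfReader(path)
    parts = []
    for i, page in enumerate(reader.pages, start=1):
        parts.append(f"--- Page {i} ---\n{page.extract_text() or ''}")
    return "\n\n".join(parts)
```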
Link: 2026 Macro Analysis
Homepage lists vectors and links to PDFs. Each vector page has synthesis (agreements, disagreements, risks) and comparison (accordion table by dimension). Claims link to their source PDF page. Click [p.42], land on page 42.
Charts and visuals not extracted. The system only sees prose. If a report makes a key point via a line chart or heatmap, that information is missing. Fortunately, most of the reports highlighted key stats/info from charts in the accompanying text, which the system picked up.
Manual citation linking in syntheses. Cursor Composer-1 handled it with some minor cajoling, but a script checking that every [p.X] appears in source text would have been faster.
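That check is straightforward to script. A sketch (names are illustrative) that flags synthesis page refs not present in any of the underlying extractions:

```python
import re

REF = re.compile(r"\[p\.\d+\]")

def unsupported_refs(synthesis_md: str, source_texts: list[str]) -> set[str]:
    """Page refs cited in a synthesis that never appear in any source extraction."""
    cited = set(REF.findall(synthesis_md))
    supported: set[str] = set()
    for text in source_texts:
        supported |= set(REF.findall(text))
    return cited - supported
```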
Coverage scoring could have hidden thin mentions. Under-50 scores show "n.a.", which could mask minor but real notes. I spot-checked a sample of the n.a. results and found no issues.
No structured numbers. Numeric claims are text, not floats. Can't auto-generate charts or calculate consensus ranges. Storing "Fed terminal rate: 3.25%" as typed data would enable range charts and outlier detection.
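A sketch of what typed numeric claims could look like (field names are mine, not from the project):

```python
from dataclasses import dataclass

@dataclass
class NumericClaim:
    firm: str
    metric: str   # e.g. "Fed terminal rate"
    value: float  # e.g. 3.25
    unit: str     # e.g. "%"
    page: int     # source page, for the citation link
```

With claims stored this way, consensus ranges and outlier flags per metric fall out of a simple group-by.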
No cross-vector views. Each vector is siloed. Which firms are consistently outliers? Who's bullish on growth but bearish on equities? A firm-profiles page or consistency matrix would surface this.
I expected there to be more hallucinations, but was impressed that GPT-5.1 consistently cited its sources accurately. Mandatory format, page markers in source text, and spot-check verification produced very low single-digit hallucination rates across 140 extractions and 10 synthesis documents.
Trust in AI-generated outputs has remained low for good reason, but I expect it to begin increasing in 2026 and beyond. This has implications for research synthesis, literature reviews, due diligence, competitive analysis... Any task requiring traceable claims can now use LLMs for the heavy lifting, as long as the system is designed to enforce citations and enable verification.