Model Analysis · March 2026

GPT-5.4 with 1M context: what actually works and what falls apart

⏱ 18 min read
📊 Intermediate – Advanced
🔬 OpenAI GPT-5.4

Large context windows sound impressive. Here's exactly what tasks justify using them, what falls apart at scale, and how to prompt for coherent output when you're feeding GPT-5.4 a novel's worth of data.

Context Window · Lost-in-the-Middle · Use Case Analysis · Prompting at Scale · Cost Breakdown · GPT-5.4 vs Claude
1M
Token Context
~750,000 words — a War and Peace and a half
500K
Reliable Ceiling
Attention degrades measurably above this threshold
4
Prompting Patterns
That measurably reduce coherence loss at scale
~$3
Per 1M-Token Call
Input cost. Run the numbers before scaling.

Section 01

Why Context Size Actually Matters

Before GPT-5.4, working with large documents meant one of three workarounds: chunking (splitting the document and summarising each piece), retrieval-augmented generation (embedding + semantic search), or truncation and hoping the critical context made it into the window. Each workaround introduces error. Chunking loses cross-chunk references. RAG misses context that doesn't match the retrieval query. Truncation is just truncation.

A 1M token context window changes the calculus on one specific question: does the answer to this task depend on having access to the entire document simultaneously? If yes — if the answer in paragraph 3 depends on something buried in paragraph 847 — then a large context window isn't a nice-to-have. It's the only way to get the right answer without engineering around the limitation.

📐
What 1 million tokens actually looks like: roughly 750,000 words. That's a 3,000-page book, a mid-sized codebase at ~80K lines, about 2,500 average-length emails, or 40 hours of meeting transcripts. When the context bar is at 100%, you're working with material a human would need 3–4 days to read cover-to-cover.
🔗
Cross-Document Reasoning
The unlock is asking questions that span the whole document without preprocessing. A single clause in document 40 can contradict a term in document 3 — whole-context view catches this; chunking doesn't.
🗂️
No Chunking Artifacts
Chunked summaries lose the relationships between sections. Two summaries that seem independent often share a thread only visible when reading both in full. Large context closes this blind spot, at least while you stay inside the reliable attention range.
🎯
RAG Can't Replace It
RAG retrieves what you know to ask for. Large context gives you what you didn't know was relevant. For codebase analysis or legal discovery, the unknown unknowns are often the most important findings.
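Before paying for a large call, it helps to sanity-check where a corpus lands. The 1M-tokens-to-~750,000-words figure above implies roughly 1.33 tokens per English word; a minimal sketch using that heuristic (real counts depend on the tokenizer, so treat this as an estimate only):

```python
def estimate_tokens(word_count: int) -> int:
    """Rough token estimate for English prose: ~1.33 tokens per word.
    Billing-grade numbers require the model's actual tokenizer."""
    return int(word_count * 4 / 3)

def context_fill(word_count: int, window: int = 1_000_000) -> float:
    """Fraction of the context window a document would occupy."""
    return estimate_tokens(word_count) / window
```

A 750,000-word corpus comes out at an estimated 1,000,000 tokens, i.e. a completely full window.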
Section 02

The Failure Modes Nobody Talks About

Every GPT-5.4 launch post covered what the 1M context window enables. Almost none covered where it breaks. Here are the three failure modes you will hit in production if you don't account for them upfront.

Attention quality by context fill level
GPT-5.4 · ability to correctly retrieve and reason about content from the middle of context
0 – 200K tokens
Strong
200K – 500K tokens
Mixed
500K – 800K tokens
Degrades
800K – 1M tokens
Weak
01
Lost-in-the-Middle
Transformer attention is stronger at the start and end of the context than the middle. At 1M tokens the "middle" is hundreds of thousands of tokens wide. Miss rates of 20–35% on multi-hop retrieval are measurable above 500K tokens — and this is not fixed by GPT-5.4.
02
Coherence Drift in Output
When you feed 500K tokens and ask for a comprehensive synthesis, output coherence degrades past ~4,000 generated tokens. Early sections contradict late sections. Points get re-introduced as new observations. The model forgets what it wrote 3,000 tokens ago in its own output.
03
Confident Hallucination
At 800K–1M token fill, hallucination rates on single-mention facts increase. The model is more likely to confabulate plausible names, numbers, and dates than to admit it can't find the information. The "I couldn't find this" response disappears at high fill.
🔴
The dangerous pattern: GPT-5.4 gives you a confident, well-written answer that misses content from the middle of the context — without flagging that anything was missed. The output looks right. You don't catch the error unless you verify manually. For contract review, compliance analysis, or audit trails — explicitly verify documents positioned in the 400K–800K token range.
Section 03

Use Cases That Genuinely Work

The 1M context window earns its keep on tasks where the answer requires holistic awareness of the entire document, where chunking produces materially worse results, and where context fill stays below the 500K degradation threshold.

💻
Codebase-Wide Refactoring & Architecture Review
✓ Works Well
Why it works
Dependencies span the whole codebase. Chunking misses cross-file references. RAG doesn't retrieve code you didn't know to ask for. Holistic view is structurally required to find circular dependencies, dead code, and cross-cutting concerns.
Ideal usage
Load an entire mid-sized codebase (50–100K lines, ~200K–400K tokens). Ask GPT-5.4 to identify cross-cutting concerns, circular dependencies, dead code, inconsistent naming patterns, or propose an architectural refactor.
Context fill guidance
Keep under 400K tokens for best results. Exclude test files, lock files, and generated code first to stay in the reliable attention zone.
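The pre-filtering step can be sketched as a small script. The exclusion lists below are assumptions, not a standard (adjust them to your repo's conventions), and ~4 characters per token is a rough average for source code:

```python
from pathlib import Path

# Assumed exclusions; tune per repository.
EXCLUDE_DIRS = {"tests", "test", "node_modules", ".git", "dist", "build"}
EXCLUDE_NAME_ENDINGS = (".lock", ".min.js", ".map", "lock.json")

def estimate_repo_tokens(root: str, chars_per_token: int = 4) -> int:
    """Estimate a codebase's token footprint, skipping files that add
    bulk without architectural signal (tests, lock files, bundles)."""
    total_chars = 0
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        if any(part in EXCLUDE_DIRS for part in path.parts):
            continue
        if path.name.endswith(EXCLUDE_NAME_ENDINGS):
            continue
        try:
            total_chars += len(path.read_text(encoding="utf-8"))
        except (UnicodeDecodeError, OSError):
            continue  # binary or unreadable file; skip it
    return total_chars // chars_per_token
```

Run it before loading the repo: if the estimate comes back above ~400K, trim further before making the call.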
⚖️
Multi-Party Contract & Legal Document Analysis
✓ Works Well
Why it works
Legal risk is often in the interaction between documents, not any single document. A clause in the base agreement can be silently overridden by an amendment — only visible when holding both simultaneously.
Ideal usage
Load a full contract bundle — master agreement, SOW, amendments, exhibits. Ask GPT-5.4 to find inconsistencies, missing standard clauses, cross-reference conflicts, or produce a risk matrix.
Critical caveat
Verify findings against documents positioned past 400K tokens manually. Do not use as sole review for high-stakes contracts without independent verification.
📚
Long-Form Research Synthesis (Academic / Market)
✓ Works Well
Why it works
RAG misses connections not in the top-k results. The full context enables genuine cross-paper reasoning — areas of consensus, methodological disagreements, and open questions that span multiple papers.
Ideal context fill & prompting
300K–600K tokens. 20–40 papers typically land in this range. Ask for synthesis by theme, not by paper; this prevents the model from defaulting to per-paper summaries.
📋
Book / Manuscript Structural Editing
⚠ With Caveats
What works
Structural feedback, character consistency analysis, and thematic coherence review across an 80,000-word manuscript (~110K tokens). Sits comfortably within GPT-5.4's reliable range for analysis tasks.
What doesn't
Full manuscript rewrites in a single call. Output coherence drift sharply degrades quality past ~4K generated tokens. Use large context for analysis; use small, focused calls for actual rewriting.
Section 04

Use Cases That Fall Apart

These are the use cases people try first because the context window seems tailor-made for them. In practice they don't work — not because the context is too small, but because of how attention degrades or how the task structure fights against a single large-context call.

Actually works
  • Single codebase architecture review — holistic view changes the answer
  • Contract bundle cross-referencing — inter-document conflicts require full context
  • Multi-paper academic synthesis — cross-study reasoning, not per-paper summaries
  • Single-account history analysis — 12-month timeline stays well under 200K tokens
  • Long audit trail reconstruction — who changed what, when, in what sequence
Falls apart
  • Summarise 500 emails at once — lost-in-the-middle at scale; use weekly batches
  • Translate a corpus to a new format — output coherence drift destroys quality
  • Extract every entity from 300 documents — miss rate climbs past 200K tokens
  • Open-ended Q&A across 50+ mixed docs — confident hallucination at high fill
  • Use as a cheaper RAG replacement — the cost per call makes this economically absurd
📧
Bulk Email / Inbox Triage at Scale
✕ Falls Apart
Why it fails
Six months of inbox runs 800K–1M tokens — squarely in the high-degradation zone. Emails from months 3–5 sit in the middle of the context and get disproportionately missed. The output looks comprehensive; the coverage is not.
What to do instead
Process in weekly batches (50–100 emails per call, ~40–80K tokens each). Aggregate summaries in a second pass. Completeness improves dramatically and total cost drops significantly.
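The two-pass approach might look like this sketch, where `summarise` stands in for whatever function wraps your model call (a placeholder, not a real API):

```python
from itertools import groupby

def weekly_batches(emails):
    """Group emails (dicts with a 'date' datetime) into ISO-week batches,
    each small enough for a ~40-80K-token call."""
    ordered = sorted(emails, key=lambda e: e["date"])
    by_week = lambda e: e["date"].isocalendar()[:2]  # (ISO year, ISO week)
    return [list(group) for _, group in groupby(ordered, key=by_week)]

def summarise_in_passes(emails, summarise):
    """Two-pass summarisation: one call per weekly batch, then one
    aggregation call over the batch summaries. `summarise` is a
    placeholder for your model client."""
    batch_summaries = [summarise(batch) for batch in weekly_batches(emails)]
    return summarise(batch_summaries)
```

Each first-pass call stays deep inside the reliable attention zone, and the aggregation call sees only short summaries.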
🏷️
Entity Extraction Across a Large Document Set
✕ Falls Apart
Why it fails
Load 300 contracts past 600K tokens and ask for every party name, date, and jurisdiction. Miss rates on entities mentioned only once in mid-context documents are 25–40%. The extraction looks complete; a significant fraction is simply absent.
The irony
Per-document extraction calls are faster, cheaper, and more complete. Structured extraction at small context is a solved problem. 1M context adds cost and reduces completeness here — it's an unforced downgrade.
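A per-document extraction loop is only a few lines. In this sketch, `call_model` is a placeholder for your model client and the prompt and JSON schema are illustrative:

```python
import json

# Illustrative prompt; adapt fields to your schema.
EXTRACTION_PROMPT = (
    "Extract every party name, date, and jurisdiction from the contract "
    "below. Return JSON with keys: parties, dates, jurisdiction.\n\n"
)

def extract_entities(documents, call_model):
    """One small call per document instead of one 600K-token call.
    `documents` maps IDs to text; `call_model` (a placeholder) must
    return a JSON string."""
    return {
        doc_id: json.loads(call_model(EXTRACTION_PROMPT + text))
        for doc_id, text in documents.items()
    }
```

Each call sees one contract at near-zero context fill, so nothing sits in a degraded middle zone.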
⚠️
The "it looks complete" problem: The most dangerous failure mode isn't wrong output — it's incomplete output that looks complete. GPT-5.4 doesn't flag missing documents. It produces a confident, well-formatted answer. For any task where completeness matters, treat high-fill 1M context outputs as a first draft requiring verification, not a final deliverable.
Section 05

The Cost Reality Check

"1M context window" gets treated as a feature. It's also a billing event. At current OpenAI pricing for GPT-5.4, a single 1M-token input call costs significantly more than the casual API call most people mentally budget for. If you're running this at scale — analysing 50 codebases a day, processing a legal discovery set daily — the economics shift fast.

Scenario                          | Approx. tokens | Est. input cost | Calls/day to $100
Single codebase review            | ~300K          | ~$0.90          | ~111 calls
Full legal bundle (10 contracts)  | ~150K          | ~$0.45          | ~222 calls
6-month inbox dump                | ~800K          | ~$2.40          | ~42 calls
40-paper research corpus          | ~500K          | ~$1.50          | ~67 calls
Full 1M context call              | 1,000K         | ~$3.00          | ~33 calls

Estimates based on GPT-5.4 pricing at time of publication. Output tokens additional. Verify current pricing at platform.openai.com.
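The table's arithmetic is easy to reproduce. The $3.00-per-million-input-tokens figure is this article's estimate, so treat it as a constant to update:

```python
PRICE_PER_MTOK = 3.00  # assumed $/1M input tokens; verify current pricing

def input_cost(tokens: int) -> float:
    """Input cost in dollars for a single call."""
    return tokens / 1_000_000 * PRICE_PER_MTOK

def calls_per_budget(tokens: int, budget: float = 100.0) -> int:
    """How many calls of this size fit in a given daily budget."""
    return int(budget / input_cost(tokens))
```

A 300K-token codebase review costs about $0.90, so roughly 111 such calls fit in $100/day; a full 1M-token call fits only ~33 times.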

💡
The RAG breakeven point: If your use case can be solved with a well-designed RAG pipeline — and most entity extraction, Q&A over documents, and summarisation tasks can — the per-call cost of 1M-token context is rarely justified. RAG at 10K tokens per call costs roughly 100× less per query and produces better completeness on retrieval tasks. Run the numbers before defaulting to "just stuff everything in context."
Section 06

Prompting Patterns for Coherent Output at Scale

For the use cases that do justify 1M context, these four patterns measurably reduce lost-in-the-middle failures and coherence drift. Pattern 1 alone eliminates most attention failures on well-structured tasks — start there before adding the others.

Pattern 1 · Anchor + Explicit Recall
# System prompt — add to any large-context call

You are reviewing a large document set. When answering any question, you MUST:
1. Explicitly state which document(s) your answer draws from (name + approximate position)
2. Quote the specific passage that supports each claim — do not paraphrase without attribution
3. If you are uncertain whether you have reviewed a section of the documents, say so

# Why this works: forcing explicit attribution makes attention failures
# visible rather than silent. The model can't produce a confident answer
# without naming a source — missing documents surface as visible gaps
# rather than silent omissions.
Pattern 2 · Document Index at Context Start
# Place this BEFORE all documents in the context

DOCUMENT INDEX
The following [N] documents have been loaded. Refer to them by ID when citing:

DOC-01: [Document name] — [Brief description, date if relevant]
DOC-02: [Document name] — [Brief description]
DOC-03: [Document name] — [Brief description]
// ... continue for all documents

When referencing information, always include the DOC-ID.
Do not reference information from documents not listed in this index.

---BEGIN DOCUMENTS---
[Documents follow in order]
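Assembling this layout programmatically keeps the index and the document order in sync. A minimal sketch, assuming each document is a (name, description, text) tuple:

```python
def build_indexed_context(documents):
    """Assemble an index-first context: numbered document index, then
    the documents in the same order, mirroring the template above."""
    index = [
        "DOCUMENT INDEX",
        f"The following {len(documents)} documents have been loaded. "
        "Refer to them by ID when citing:",
        "",
    ]
    body = ["---BEGIN DOCUMENTS---"]
    for i, (name, description, text) in enumerate(documents, start=1):
        doc_id = f"DOC-{i:02d}"
        index.append(f"{doc_id}: {name} — {description}")
        body.append(f"\n[{doc_id}] {name}\n{text}")
    index += [
        "",
        "When referencing information, always include the DOC-ID.",
        "Do not reference information from documents not listed in this index.",
        "",
    ]
    return "\n".join(index + body)
```

The returned string goes in as-is, with your task instructions appended after it.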
Pattern 3 · Layered Output Structure
# Instead of: "Write a comprehensive analysis"
# Use a staged output structure that caps each section:

Produce your analysis in the following structure.
Each section is independent — do not forward-reference or refer back to previous sections.

Section 1 — Executive Summary  (max 200 words)
Section 2 — Key Findings        (max 8 bullet points, each under 30 words)
Section 3 — Risk Factors        (max 5 items, with severity and supporting document ID)
Section 4 — Recommended Actions (max 5 items, ranked by priority)
Section 5 — Open Questions      (items requiring further review or external input)
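Because each section carries an explicit cap, the caps can be checked mechanically after generation. A minimal sketch, assuming you have already split the model's response into a title-to-text mapping:

```python
def check_word_caps(sections, caps):
    """Flag sections of a staged response that exceed their word caps.
    `sections` maps section titles to generated text; `caps` maps the
    same titles to maximum word counts."""
    violations = []
    for title, cap in caps.items():
        words = len(sections.get(title, "").split())
        if words > cap:
            violations.append(f"{title}: {words} words (cap {cap})")
    return violations
```

An empty result means the output respected the structure; anything else is a cheap signal to re-run the offending section.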
Pattern 4 · Middle-Anchoring
# Add this when critical content sits in the middle of a large context

COVERAGE REQUIREMENT
This analysis must account for content from ALL sections of the provided documents,
including sections in the middle of the document set.

Before generating your response, mentally scan for relevant content at three points:
- The first third of the documents
- The middle third (most likely underrepresented — check deliberately)
- The final third

If you find relevant content in the middle section that contradicts your initial impression
from the early documents, the middle-section content takes precedence.

# Explicitly naming the "danger zone" improves recall from that zone
# by 15–25% in needle-in-a-haystack testing.
🚫
Anti-pattern to avoid: "Here are all the documents. Please provide a comprehensive summary." No structure = coherence degrades past ~4K output tokens. No attribution = silent misses in middle documents. No length constraint = model pads output. No coverage check = confident-sounding output that misses 30% of content. Replace with Pattern 1 + Pattern 3 at minimum on every large-context call.
Section 07

GPT-5.4 vs Claude for Long Context Work

Both GPT-5.4 and Gemini 2.5 Pro offer 1M-token context. Claude Opus 4.6 caps at 200K. For teams already running Claude workflows, the honest question is: does the 1M context window justify switching models for long-context tasks, or does 200K cover the vast majority of real-world use cases?

Task                                 | GPT-5.4 (1M) | Claude Opus 4.6 (200K) | Verdict
Codebase review (<150K tokens)       | Strong       | Strong                 | Either. Claude's instruction-following may edge it.
Codebase review (150K–800K tokens)   | Works        | Doesn't fit            | GPT-5.4 is the only option above 200K.
Legal contract bundle (<100K tokens) | Strong       | Strong                 | Either. Both handle under 100K well.
Multi-paper synthesis (30+ papers)   | Works        | Tight fit              | GPT-5.4 if papers total >150K tokens.
Instruction-following precision      | Good         | Excellent              | Claude for complex multi-constraint tasks.
JSON / structured output             | Strong       | Strong                 | Comparable. Both require explicit format prompting.
🔮
The honest take: For most teams, 80% of long-context use cases fit within Claude's 200K window or can be restructured to fit. The 1M context window is the right tool for codebases over 100K lines, large legal discovery sets, and research corpora that genuinely can't be reduced. For everything else, the additional cost and attention degradation are unforced problems. Keep your Claude workflows for tasks under 200K — reach for GPT-5.4 specifically when you hit the ceiling.
Section 08

The Verdict — When to Reach for 1M

Large context windows are real capability. GPT-5.4's 1M context genuinely unlocks a class of task that was architecturally impossible to do well before. The failure is treating it as a general-purpose upgrade — dumping everything in context because you can, not because the task demands it. Three questions before every large-context call:

01
Does the whole document matter?
Does the answer require simultaneous access to the entire document? If yes → 1M context may be justified. If a well-designed RAG pipeline gets the right answer → use RAG.
02
Does fill stay under 500K?
If yes → attention is in the reliable zone. If no → use Patterns 1–4 from Section 06, and independently verify completeness for anything consequential.
03
Have you run the cost numbers?
A one-time codebase audit at $1.50 is trivial. 200 contracts/day at $0.90 each is $5,400/month. Know your volume before enabling any automated pipeline using 1M context calls.
💡
The OutClaw take: The 1M context window is a precision instrument, not a fire hose. The teams getting real value from it are using it for 3–5 specific workflows where whole-document awareness is structurally required. The teams frustrated by it are using it as a shortcut to avoid building proper retrieval pipelines. Know which camp you're in before your API bill tells you.

Model breakdowns like this, every Tuesday.

GPT-5.4 teardowns, Claude workflow guides, and the practical AI takes your feed buries. 8 minutes. Free forever.

Subscribe to OutClaw AI →