Chapter 01 of 21

The Token Ceiling

A large US bank spent eighteen months building an LLM-powered contract review system. It worked beautifully on clauses. It failed on contracts — because a contract is not a clause, and the token is not a concept.

Part 1 — Foundations

A large US bank spent eighteen months building an LLM-powered contract review system. It worked beautifully on individual clauses. Legal teams loved the clause-level explanations, the risk flags, the plain-English summaries. The system passed every benchmark the team designed.

Then the business asked for something different: compare this new vendor agreement against the master service agreement it modifies. Both documents are real. The relationship between them is what matters. The LLM — even with a 128,000-token context window — produced summaries of both documents, but it could not reliably identify when the new agreement introduced obligations that contradicted the master. It did not lose track of individual sentences. It lost the semantic thread between them.

That is the token ceiling. It is not a hardware limit. It is a representational limit. And understanding it precisely is the prerequisite for knowing when Large Concept Models matter.

What You Will Learn

Define the token ceiling and distinguish it from context window limitations
Explain why tokens are the wrong unit of representation for concept-level tasks
Identify the three failure modes that appear when LLMs hit the token ceiling
Map enterprise task types onto the token-level vs. concept-level divide
Apply the Task Unit Test to determine whether a given problem warrants LCM consideration

1.1 What a Token Actually Is

A token is not a word. It is not a sentence. It is not a meaning. A token is typically a subword unit: a fragment that a tokenizer has learned to map to a numerical ID based on statistical frequency in its training data. A tokenizer might split "unambiguous" into three tokens such as "un", "ambigu", and "ous", although the exact split varies from one tokenizer to another. "SONAR" might be two tokens. A 500-word English paragraph typically produces 650 to 750 tokens, depending on the tokenizer and the text.
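To make the idea of a subword split concrete, here is a minimal sketch of a greedy longest-match tokenizer. It is an illustration only: the vocabulary `TOY_VOCAB` and the function `tokenize` are hypothetical, and production tokenizers typically learn their vocabularies with algorithms such as byte-pair encoding rather than using this exact procedure.

```python
# Toy vocabulary: hypothetical, chosen only so the example splits as described above.
TOY_VOCAB = {"un": 0, "ambigu": 1, "ous": 2, "token": 3, "s": 4}


def tokenize(word: str, vocab: dict[str, int]) -> list[str]:
    """Split a lowercase word into vocabulary pieces by greedy longest match."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining candidate first and shrink until one matches.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            raise ValueError(f"no vocabulary piece matches at position {start}")
    return pieces


pieces = tokenize("unambiguous", TOY_VOCAB)
ids = [TOY_VOCAB[piece] for piece in pieces]
print(pieces)  # ['un', 'ambigu', 'ous']
print(ids)     # [0, 1, 2]
```

The model never sees the string "unambiguous". It sees the ID sequence, and nothing in those IDs says that the three pieces form one word with one meaning.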

This matters because everything an LLM does, it does at the token level. The attention mechanism in a transformer computes relationships between tokens. The context window is measured in tokens. Inference cost is billed by token. The model predicts the probability distribution over the next token, samples from that distribution, and appends the result. It does this until it generates an end-of-sequence token or hits a stop condition.
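The generation loop described above can be sketched in a few lines. This is a toy under loud assumptions: `NEXT_TOKEN_PROBS` is a hypothetical lookup table that conditions only on the previous token, whereas a real LLM computes the distribution from the entire context with a neural network. The loop structure (predict, sample, append, stop) is the part that carries over.

```python
import random

# Hypothetical next-token table: each token maps to (candidate, probability) pairs.
NEXT_TOKEN_PROBS = {
    "<bos>": [("the", 1.0)],
    "the": [("contract", 0.6), ("clause", 0.4)],
    "contract": [("ends", 1.0)],
    "clause": [("ends", 1.0)],
    "ends": [("<eos>", 1.0)],
}


def generate(max_tokens: int = 10, seed: int = 0) -> list[str]:
    """Sample one token at a time until end-of-sequence or the token budget."""
    rng = random.Random(seed)
    sequence = ["<bos>"]
    for _ in range(max_tokens):
        # Look up the distribution over the next token.
        candidates, weights = zip(*NEXT_TOKEN_PROBS[sequence[-1]])
        # Sample from that distribution and append the result.
        next_token = rng.choices(candidates, weights=weights, k=1)[0]
        sequence.append(next_token)
        if next_token == "<eos>":  # stop condition
            break
    return sequence


print(generate())
```

Note what is absent: no variable in the loop holds a sentence, a clause, or an idea. The only state is the growing list of tokens.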

At no point in this process does the model form an explicit representation of a sentence, a paragraph, a document, or an idea. Coherence at higher levels of abstraction emerges from the statistical regularities in the token prediction task. When those regularities are sufficient — which they are for an enormous range of practical tasks — the emergent coherence is impressive. When they are not sufficient, the model fails in ways that are hard to predict and hard to diagnose, because there is no "sentence understanding" layer to inspect. There are only token probabilities.

1.2 Context Windows and Their Limits

The standard response to the token ceiling is: just increase the context window. If the model can see 128,000 tokens at once, that is roughly 85,000 to 100,000 words of English, enough for most enterprise documents. The counterargument has three parts.

Attention degrades at long range. Empirically, transformer attention is not uniform across the context window. Models trained on long contexts tend to use content near the beginning and end of the window more reliably than content in the middle. This is sometimes called the "lost in the middle" problem. Even when a 300-page document fits into a 128,000-token context window, it does not get uniform attention across all 300 pages. The middle chapters are structurally disadvantaged.
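One way to check for this effect on a specific model is to place the same fact at different depths in otherwise identical filler text and ask the same question each time. The sketch below only builds the prompts; the call to a model is deliberately left out, and the names `build_probe_prompt`, `filler`, and `needle` are hypothetical.

```python
def build_probe_prompt(filler: list[str], needle: str, depth: float) -> str:
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end) of the filler."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = round(depth * len(filler))
    return "\n".join(filler[:position] + [needle] + filler[position:])


# Placeholder content: a real probe would use long, realistic filler text.
filler = [f"Filler paragraph {i}." for i in range(10)]
needle = "The termination notice period is 45 days."

# Same fact, three positions: start, middle, and end of the context.
prompts = {
    depth: build_probe_prompt(filler, needle, depth)
    for depth in (0.0, 0.5, 1.0)
}
for depth, prompt in prompts.items():
    print(depth, prompt.split("\n").index(needle))
```

If answers to "What is the termination notice period?" are reliable at depths 0.0 and 1.0 but not at 0.5, the model is showing the pattern described above.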

Cross-document reasoning is not the same as long-document reading. The bank's contract review problem is not "read a long document." It is "compare two documents and identify semantic contradictions." Even if both documents fit in the context window simultaneously, the model has no mechanism for computing the semantic distance between a clause in Document A and a clause in Document B except through token-level attention. For tasks where the relevant relationship is conceptual — this clause means the same thing as that clause, despite using different words — token-level attention is an indirect and unreliable proxy.

Cost scales linearly with context. Under per-token pricing, a 128,000-token context window costs roughly eight times as much to process as a 16,000-token window. For tasks that require comparing dozens of documents (regulatory filings, multi-year project archives, international policy sets), the economics of filling large context windows become prohibitive before the technical limitations become the binding constraint.
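The arithmetic behind the eightfold figure is simple enough to write down. The sketch below assumes flat per-token pricing and borrows the illustrative price of $0.01 per 1,000 tokens from this chapter's exercises; the function `run_cost` and the workload of 500 runs per month are hypothetical, and real pricing varies by provider and model.

```python
def run_cost(total_tokens: int, price_per_1k_tokens: float) -> float:
    """Billed input cost for one run under flat per-token pricing."""
    return total_tokens / 1000 * price_per_1k_tokens


PRICE_PER_1K = 0.01  # illustrative price, not a quote from any provider

small = run_cost(16_000, PRICE_PER_1K)
large = run_cost(128_000, PRICE_PER_1K)
print(f"16k window:  ${small:.2f}")          # $0.16
print(f"128k window: ${large:.2f}")          # $1.28
print(f"ratio:       {large / small:.1f}x")  # 8.0x

# Hypothetical workload: 500 analysis runs per month, each filling the 128k window.
monthly = 500 * large
print(f"500 runs per month: ${monthly:.2f}")  # $640.00
```

The per-run figure looks small. The point of the calculation is that it is paid again on every run, for every document set, whether or not the documents have changed.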

1.3 The Three Failure Modes

When an LLM hits the token ceiling, it fails in three characteristic ways. Each failure mode has a different surface presentation, which is why they are often misdiagnosed.

Failure Mode 1: False coherence. The model produces output that is locally fluent but globally inconsistent. A summary of a 200-page document represents each section accurately but contradicts itself across sections. A planning document lists steps that are individually sensible, yet steps 3 and 7 address the same problem with incompatible approaches. The model is not fabricating. It is accurately representing local token patterns without maintaining a global semantic thread. False coherence is the hardest failure mode to detect because the output reads well.

Failure Mode 2: Semantic amnesia. The model loses the semantic context established early in a long document when generating content about material introduced later. A contract review system that correctly identifies an obligation on page 2 may not flag the contradiction on page 47 because the token-level context for page 2 has effectively faded by the time the model attends to page 47. Semantic amnesia presents as inconsistent recall: the model "remembers" some early content and "forgets" other early content in ways that appear random but are actually driven by token-level attention patterns.

Failure Mode 3: Surface-form binding. The model treats semantically equivalent content as unrelated when it is expressed in different words, and treats semantically unrelated content as related when it shares surface vocabulary. A regulatory analysis system may correctly identify "the applicant must provide documentation within thirty days" as equivalent to "a thirty-day submission window is required" when both phrasings sit close together in the same document, yet miss the same equivalence when the two phrasings appear in different documents, or when the documents use different vocabulary entirely. Surface-form binding is a root cause of cross-lingual failures in token-level systems.
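The gap between surface form and meaning can be made visible with a deliberately crude measure. The sketch below scores word overlap (Jaccard similarity) between the two equivalent phrasings above and a third, hypothetical sentence that shares most of its vocabulary but states a different obligation. Word overlap is not how an LLM compares text; it is used here only as a stand-in for "surface form".

```python
import re


def words(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Share of distinct words that the two texts have in common."""
    wa, wb = words(a), words(b)
    return len(wa & wb) / len(wa | wb)


original = "the applicant must provide documentation within thirty days"
paraphrase = "a thirty-day submission window is required"
# Hypothetical sentence: similar vocabulary, different obligation.
lookalike = "the applicant must retain documentation for thirty days"

print(f"same meaning, different words: {jaccard(original, paraphrase):.2f}")  # 0.07
print(f"different meaning, same words: {jaccard(original, lookalike):.2f}")   # 0.60
```

A system bound to surface form ranks the lookalike far closer to the original than the paraphrase, which is exactly backwards for a compliance task.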

1.4 The Task Unit Test

Figure 1.1 — The Task Unit Test. Three questions map any enterprise task to an LLM or LCM recommendation. A task that answers "concept," "yes," and "yes" is a strong LCM candidate.

Not every enterprise task hits the token ceiling. Many tasks — perhaps most tasks — are better served by LLMs, and the overhead of LCM adoption is not justified. The Task Unit Test is a three-question diagnostic that identifies whether a given task warrants LCM consideration.

Question 1: What is the natural unit of your task? Is it a token (code completion, autocorrect), a sentence (classification, sentiment analysis), a passage (retrieval for RAG, question answering), or a concept (cross-document comparison, multi-document synthesis, hierarchical planning)? Tasks whose natural unit is a concept are candidates for LCM consideration. Tasks whose natural unit is a token, sentence, or passage are generally better served by LLMs.

Question 2: Does meaning change when the surface form changes? If two sentences in your task domain mean the same thing but use different words, does your task need to recognize that equivalence? Cross-lingual tasks almost always answer yes. Many regulatory, legal, and policy tasks answer yes. Conversational tasks usually answer no. A chatbot must still treat "I want to cancel my subscription" and "please terminate my membership" as the same request, but the LLM's token-level patterns already handle that case, so no separate equivalence mechanism is needed.

Question 3: Does global consistency matter more than local fluency? Suppose you read the output sentence by sentence and each sentence is excellent, but the document contradicts itself across sections. Has the task failed? For most conversational and short-form tasks, the answer is no: local fluency is sufficient. For planning, synthesis, and long-form generation tasks, the answer is yes: global consistency is the quality criterion that matters, and it is the criterion that token-level systems fail on first.

A task that answers "concept," "yes," and "yes" to these three questions is a strong candidate for LCM consideration. A task that answers "token, sentence, or passage," "no," and "no" should stay with an LLM.
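Read as a decision rule, the test is small enough to encode directly. The following is a minimal sketch, not a tool from this book: the names `TaskProfile` and `task_unit_test` are hypothetical, and the "mixed" outcome for tasks that answer the three questions inconsistently is an assumption, since the text only defines the two clear-cut cases.

```python
from dataclasses import dataclass

VALID_UNITS = {"token", "sentence", "passage", "concept"}


@dataclass(frozen=True)
class TaskProfile:
    natural_unit: str                  # Question 1
    needs_semantic_equivalence: bool   # Question 2
    needs_global_consistency: bool     # Question 3


def task_unit_test(task: TaskProfile) -> str:
    """Map the three answers to a recommendation."""
    if task.natural_unit not in VALID_UNITS:
        raise ValueError(f"unknown natural unit: {task.natural_unit!r}")
    lcm_signals = sum([
        task.natural_unit == "concept",
        task.needs_semantic_equivalence,
        task.needs_global_consistency,
    ])
    if lcm_signals == 3:
        return "strong LCM candidate"
    if lcm_signals == 0:
        return "stay with LLM"
    # Assumption: the text does not say how to treat mixed answers.
    return "mixed: review case by case"


print(task_unit_test(TaskProfile("concept", True, True)))     # strong LCM candidate
print(task_unit_test(TaskProfile("sentence", False, False)))  # stay with LLM
```

Treating mixed answers as a prompt for human review, rather than forcing them into one bucket, keeps the rule honest about what the three questions can and cannot decide.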

1.5 Enterprise Task Mapping

The following table maps common enterprise task types to their natural unit, semantic equivalence requirement, and consistency requirement, producing a recommendation.

Task	Natural Unit	Semantic Equivalence?	Global Consistency?	Recommendation
Customer service chatbot	Token / sentence	No	No	LLM
Code generation and review	Token	No	No	LLM
Short document Q&A (RAG)	Passage	No	No	LLM
Multi-document policy comparison	Concept	Yes	Yes	LCM
Cross-lingual regulatory mapping	Concept	Yes	Yes	LCM
Strategic plan generation (multi-year)	Concept	Yes	Yes	LCM
Email summarization	Sentence	No	No	LLM
Contract clause extraction	Passage	No	No	LLM
Cross-jurisdiction compliance analysis	Concept	Yes	Yes	LCM
Multilingual executive brief	Concept	Yes	Yes	LCM
Technical documentation Q&A	Passage	No	No	LLM
Hierarchical project decomposition	Concept	Yes	Yes	LCM

The pattern is clear. LCM candidacy clusters around three enterprise domains: multi-document synthesis and comparison, cross-lingual reasoning, and hierarchical planning and decomposition. These are the domains where the token ceiling is a structural constraint, not an implementation detail. They are also, not coincidentally, the domains where enterprise value is highest and where current LLM-based solutions produce the most consistent user complaints.
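The table can also be treated as data and checked against the rule it is supposed to follow. The sketch below transcribes the twelve rows and confirms that each recommendation matches the three answers. The function `recommend` is a hypothetical encoding of the rule; it defaults to "LLM" for anything short of three LCM answers, which is an assumption that the table never exercises because it contains no mixed rows.

```python
# Rows transcribed from the table: (task, natural unit, equivalence?, consistency?, recommendation)
TASKS = [
    ("Customer service chatbot", "Token / sentence", False, False, "LLM"),
    ("Code generation and review", "Token", False, False, "LLM"),
    ("Short document Q&A (RAG)", "Passage", False, False, "LLM"),
    ("Multi-document policy comparison", "Concept", True, True, "LCM"),
    ("Cross-lingual regulatory mapping", "Concept", True, True, "LCM"),
    ("Strategic plan generation (multi-year)", "Concept", True, True, "LCM"),
    ("Email summarization", "Sentence", False, False, "LLM"),
    ("Contract clause extraction", "Passage", False, False, "LLM"),
    ("Cross-jurisdiction compliance analysis", "Concept", True, True, "LCM"),
    ("Multilingual executive brief", "Concept", True, True, "LCM"),
    ("Technical documentation Q&A", "Passage", False, False, "LLM"),
    ("Hierarchical project decomposition", "Concept", True, True, "LCM"),
]


def recommend(unit: str, equivalence: bool, consistency: bool) -> str:
    """Recommend LCM only when all three answers point that way."""
    return "LCM" if unit == "Concept" and equivalence and consistency else "LLM"


mismatches = [
    name for name, unit, eq, cons, rec in TASKS
    if recommend(unit, eq, cons) != rec
]
lcm_tasks = [name for name, *_, rec in TASKS if rec == "LCM"]

print("rows that disagree with the rule:", mismatches)               # []
print(f"{len(lcm_tasks)} of {len(TASKS)} tasks are LCM candidates")  # 6 of 12 tasks are LCM candidates
```

Keeping a task inventory in this form makes it easy to rerun the check when a new use case is added or when an answer to one of the three questions changes.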

1.6 What the Token Ceiling Is Not

Clarity requires ruling out common misidentifications.

The token ceiling is not a context window size problem. Increasing the context window does not eliminate the ceiling; it raises the altitude at which the same failure modes appear. A model with a 1,000,000-token context window still attends unevenly, still conflates surface form with meaning, and still cannot reason across documents using concepts rather than tokens.

The token ceiling is not a model scale problem. Larger models with more parameters produce better results on many benchmarks, but they do not eliminate the representational constraint. GPT-4 and its successors hit the same ceiling on the same task types as their predecessors — at higher quality thresholds, but with the same structural failure modes.

The token ceiling is not a prompt engineering problem. Better prompts, chain-of-thought reasoning, and structured output instructions improve LLM performance on concept-level tasks. They do not eliminate the fundamental mismatch between the token as the unit of representation and the concept as the unit of the task. Prompt engineering is a multiplier on the capabilities of the underlying architecture. It cannot change the architecture.

Summary

The token ceiling is a representational constraint, not a scale constraint. Tokens are subword units that encode statistical co-occurrence, not meaning. Large Concept Models operate at a different level of abstraction — the semantic concept, encoded as a dense sentence-level embedding — and this difference enables a category of reasoning that the token paradigm cannot reliably reach.

The natural unit matters. Tasks whose natural unit is a concept — cross-document comparison, hierarchical planning, cross-lingual synthesis — will hit the token ceiling before they hit any other constraint.
Three failure modes. False coherence, semantic amnesia, and surface-form binding are the characteristic ways LLMs fail on concept-level tasks. Recognizing them is the first step toward knowing when to reach for a different tool.
The Task Unit Test. Three questions — natural unit, semantic equivalence requirement, global consistency requirement — provide a systematic way to identify LCM candidates without requiring deep architectural knowledge.
LCMs are not replacements. They are an extension of the AI stack upward, into the domain of concept-level reasoning. The rest of this book builds the knowledge needed to use that extension correctly.

Exercises

Type	Exercise	Description
Conceptual	Ceiling diagnosis	A procurement team complains that their LLM-powered RFP analysis tool produces accurate summaries of individual vendor proposals but cannot reliably rank vendors against each other. Apply the Task Unit Test to this complaint. What failure mode is most likely occurring, and why does the Task Unit Test predict LCM consideration?
Design	Task inventory	Survey five AI use cases in your organization. Apply the Task Unit Test to each. How many are LCM candidates? What is the highest-value LCM candidate on your list, and what evidence do you have that it currently hits the token ceiling?
Analysis	Context window economics	A team proposes solving their cross-document comparison problem by loading all documents into a 1,000,000-token context window. Estimate the token count for their use case (twelve regulatory documents, average 50 pages each). Calculate the inference cost per analysis run at $0.01 per 1,000 tokens. Compare this to the cost of encoding the same documents into SONAR concept embeddings (covered in Chapter 3). What is the break-even frequency of analysis runs?
