Chapter 06 of 21

When LLMs Win

Intellectual honesty requires acknowledging what LLMs do better. For token-sensitive tasks — code, chat, RAG, creative generation — LCMs add overhead without adding capability. This chapter tells you when to stop reading and go back to your LLM stack.

Part 2 — The Comparison Layer

This book is only worth its price if it is candid about the cases where LLMs are the correct tool. A practitioner who replaces a working LLM system with an LCM for the wrong task type will spend more, build more, and get worse results. That is not a hypothetical; it is a predictable outcome of applying concept-level reasoning to token-level tasks.

This chapter is an honest account of where LLMs win. It is not a chapter about LLM limitations. Those come in Chapter 7. This chapter makes the affirmative case for LLMs on the task types where they are genuinely superior, and it ends with a clear instruction: if the tasks described here are your primary tasks, stop reading this book and go back to your LLM stack.

What You Will Learn

Identify the five task categories where LLMs outperform LCMs structurally
Explain why each category is an LLM advantage, using the architectural dimensions from Chapter 5
Recognize when a hybrid system's LCM component adds cost without adding capability
Apply the LLM-first default to common enterprise AI use cases

6.1 The LLM-First Default

The right default for enterprise AI is LLMs, not LCMs. LLMs have a head start of several years in production deployment. The tooling ecosystem (SDKs, orchestration frameworks, evaluation libraries, observability platforms) is mature. Hosted APIs from multiple providers offer competitive pricing, SLAs, and support. Teams know how to prompt them, evaluate them, and monitor them.

LCMs are the right tool for a specific set of tasks where LLMs have structural limitations. Outside that set, LCMs add complexity without adding capability. The discipline is knowing where that set begins and ends.

The Task Unit Test from Chapter 5 provides the formal filter. This chapter provides the intuition behind the cases where the test correctly returns "LLM."

6.2 Code Generation and Software Engineering Tasks

Code is the clearest LLM domain, and it is worth understanding precisely why.

Code has two properties that make token-level reasoning a natural fit. First, tokens in code have rigidly defined syntax: a missing bracket is a parse error, not a stylistic choice. The token-by-token generation pressure that LLMs impose (each token must be plausible given the tokens before it) strongly favors syntactically valid output, although it does not guarantee it. Second, meaning in code is highly local: a function's behavior is determined by its tokens, and the relationship between a function and the rest of the codebase is expressed through explicit interfaces (imports, type signatures, function calls) rather than semantic proximity.

An LCM operating on code would encode functions as sentence-level concept embeddings. But a function body is not a proposition — it is a procedure. SONAR's concept space is not organized around procedural semantics. The concept embedding of a Python function that sorts a list and the concept embedding of a Python function that searches a list would likely be close neighbors (both are list operations, both use similar vocabulary), but they produce entirely different outputs. Concept-space proximity is not a useful proxy for behavioral equivalence in code.
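The gap between proximity and behavior can be sketched in a few lines. The example below is a toy under stated assumptions: it uses bag-of-tokens cosine similarity as a crude stand-in for embedding proximity (it is not SONAR and makes no claim about SONAR's actual geometry), and the two functions, `sort_items` and `find_items`, are hypothetical. It only shows that two functions can share most of their vocabulary while computing different things.

```python
import math
import re
from collections import Counter

# Two hypothetical list functions with heavily overlapping vocabulary.
SORT_SRC = """
def sort_items(items, key):
    result = list(items)
    result.sort(key=key)
    return result
"""

SEARCH_SRC = """
def find_items(items, key):
    result = [item for item in items if key(item)]
    return result
"""


def bag_of_tokens(source: str) -> Counter:
    """Count identifier-like tokens. A toy stand-in for an embedding."""
    return Counter(re.findall(r"[A-Za-z_]+", source))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def load(source: str, name: str):
    """Execute the source text and return the named function."""
    namespace = {}
    exec(source, namespace)
    return namespace[name]


similarity = cosine(bag_of_tokens(SORT_SRC), bag_of_tokens(SEARCH_SRC))

data = [3, 1, 2]
sorted_out = load(SORT_SRC, "sort_items")(data, key=lambda x: x)
found_out = load(SEARCH_SRC, "find_items")(data, key=lambda x: x > 1)

print(f"vocabulary similarity: {similarity:.2f}")
print("sort_items ->", sorted_out)
print("find_items ->", found_out)
```

The similarity score is high because the two sources reuse the same names, yet the return values differ. A code model has to respect that difference token by token; a similarity score alone does not capture it.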

Where LLMs win on code: code completion, code generation from natural language, code review and explanation, test generation, refactoring, and debugging. All of these tasks require token-level fidelity to syntax and local semantic coherence. None of them require reasoning across documents using semantic equivalence.

The edge: architectural analysis across a large codebase — "which modules are semantically similar and should be refactored together?" — is a concept-level task. This is the one code-adjacent use case where SONAR encoding could add value, and it is closer to a semantic search task than a code generation task.

6.3 Conversational AI and Customer-Facing Applications

Conversational AI is a token-level task for several reinforcing reasons.

Turn-level fluency is the quality criterion. In a customer service conversation, the user measures success turn by turn. A response that is globally coherent across all previous turns of the conversation but locally awkward in its current turn is a worse user experience than a response that is locally fluent but references only recent context. The token-plausibility pressure that LLM generation imposes is the right pressure for conversational tasks.

Context is short. Customer service conversations are typically brief, often 3 to 7 turns of a sentence or two each. The total context that matters is rarely more than a few thousand tokens, which is well within LLM context windows and well below the scale where concept-level reasoning adds value.

Response time matters. LCMs add encoding and decoding overhead to the generation pipeline. For conversational applications where response latency directly affects user experience, the additional roundtrip through SONAR encoding and concept-model reasoning is a cost with no benefit for short-context tasks.

Intent is token-level. Understanding "I want to cancel my subscription" does not require concept-space proximity analysis. The LLM's token-level attention handles intent recognition, entity extraction, and response generation correctly for conversational tasks.

Where LLMs win on conversational AI: customer service chatbots, internal helpdesks, HR FAQs, conversational data exploration, and any user-facing interface where response quality is measured at the turn level.

The edge: a conversational interface that synthesizes answers from a very large knowledge base with cross-lingual content is an LCM candidate for the retrieval and synthesis backend, with an LLM handling the conversational wrapper. This is the concept elevator hybrid pattern from Chapter 12.

6.4 Retrieval-Augmented Generation over Short Documents

RAG — retrieving relevant passages from a knowledge base and using them to ground LLM responses — is a well-established pattern that LLMs handle correctly for most enterprise knowledge management use cases.

The key condition is "short documents." When the knowledge base consists of documents where the relevant answer lives in a single passage (a policy document, a FAQ, a product specification), token-level chunk retrieval and LLM response generation is the right architecture. The retrieval problem is solved by dense passage embeddings (models like BGE, E5, or OpenAI's text-embedding-3-small) that operate at the passage level. The generation problem is solved by an LLM that synthesizes retrieved passages into a coherent response.
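The retrieve-then-generate shape described above can be sketched as follows. This is a minimal illustration, not a production design: `embed` is a bag-of-words stand-in for a real dense passage embedding model, the three passages are invented for the example, and the LLM call is omitted entirely, so the sketch stops at the assembled prompt. All function names here are hypothetical.

```python
import math
import re
from collections import Counter

# Invented single-passage "policies", used only for illustration.
PASSAGES = [
    "Employees accrue 1.5 vacation days per month of service.",
    "Expense reports must be submitted within 30 days of purchase.",
    "Remote work requires written approval from a direct manager.",
]


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts. A real system would call
    a dense passage embedding model here instead."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, passages: list[str], k: int = 1) -> list[str]:
    """Return the k passages most similar to the question."""
    query = embed(question)
    ranked = sorted(passages, key=lambda p: cosine(query, embed(p)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, retrieved: list[str]) -> str:
    """Assemble the grounded prompt that would be sent to an LLM."""
    context = "\n".join(f"- {p}" for p in retrieved)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


question = "When must expense reports be submitted?"
top = retrieve(question, PASSAGES, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

The unit of relevance is one passage and the unit of generation is one response, which is why nothing in this pipeline needs to compare documents against each other.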

This pattern works correctly for tasks where the unit of relevance is a passage and the unit of generation is a single response. It starts to fail when relevance requires cross-document comparison (which Chapter 9 addresses with LCM-based concept retrieval) or when the response requires synthesizing across many documents (which Chapter 12's hybrid architecture addresses).

Where LLMs win on RAG: internal knowledge bases, product documentation Q&A, policy compliance checking against single documents, and any retrieval task where the answer lives in one document or one passage within a document.

The edge: multi-hop reasoning — "where do our procurement policy and our vendor code of conduct create conflicting obligations?" — requires cross-document concept retrieval. That is an LCM task.

6.5 Creative and Stylistic Generation

Creative writing, marketing copy, content generation, and stylistic variation tasks are LLM-native for a structural reason: the token-plausibility pressure that LLMs operate under is exactly what produces stylistic richness.

A skilled creative writer does not select words for their semantic accuracy to a concept embedding — they select words for their rhythm, connotation, cultural association, and relationship to the surrounding prose. These properties are encoded in token co-occurrence statistics: a word that sounds right in a given context sounds right because similar words appear in similar contexts in the training corpus. The token-level statistical richness of LLM generation is the mechanism behind effective creative output.

An LCM's concept decoder is a language model, so it produces grammatically fluent output. But its output is constrained by the concept embedding it is decoding — the concept model determined the meaning; the decoder renders it. The decoder has less freedom to make stylistic choices that deviate from the concept embedding in service of local effect. For creative tasks, that constraint is a limitation, not a feature.

Where LLMs win on creative tasks: marketing copy, email drafting, content generation, speech writing, creative writing assistance, style transfer, and any task where the quality criterion is local impact rather than global semantic accuracy.

The edge: generating a globally consistent long-form narrative (a book chapter, a strategic narrative document) where sections must be semantically coherent with each other is an LCM task for the global structure and an LLM task for the local prose. The hybrid approach from Chapter 12 addresses this.

6.6 Short-Form Classification and Extraction

Short-form classification and extraction tasks (sentiment analysis, intent classification, named entity recognition, relation extraction, document categorization) operate at the sentence or passage level and require token-level precision.

Named entity recognition requires identifying token spans: "the agreement dated March 15, 2026" requires locating "March 15, 2026" as a date entity at the token level. Sentiment analysis on a sentence requires token-level attention to negation, intensifiers, and hedging language ("not particularly impressed" vs. "impressed"). Relation extraction requires understanding which token spans are subjects and objects of which predicates.
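To make "token span" concrete, the sketch below extracts the date from the example sentence and returns its character offsets. It uses a regular expression rather than an LLM, purely to show the shape of the output a span-extraction task expects; the pattern and the `find_date_spans` helper are assumptions made for this illustration and cover only one date format.

```python
import re

# Matches dates written like "March 15, 2026". Illustrative only:
# real NER handles far more formats than this single pattern.
DATE_PATTERN = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{1,2}, \d{4}\b"
)


def find_date_spans(text: str) -> list[tuple[int, int, str]]:
    """Return (start, end, matched_text) for every date in the text."""
    return [(m.start(), m.end(), m.group()) for m in DATE_PATTERN.finditer(text)]


text = "the agreement dated March 15, 2026"
spans = find_date_spans(text)

for start, end, value in spans:
    # The offsets index back into the original string.
    print(start, end, repr(text[start:end]))
```

The result is a position in the original text, not a summary of its meaning. Any representation that keeps only one fixed-size vector per sentence has no slot in which to return those offsets.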

SONAR's sentence-level encoding compresses all of this into a single concept embedding. The intra-sentence structure that short-form classification and extraction tasks depend on is not recoverable from the concept embedding. LLMs, with their token-level attention, handle these tasks correctly and efficiently.

Where LLMs win on classification and extraction: named entity recognition, sentiment analysis, intent classification, relation extraction, document categorization, and any extraction task that requires locating specific token spans or reasoning about intra-sentence syntactic structure.

The edge: cross-document thematic classification — grouping a corpus of 10,000 documents by semantic theme across languages — benefits from SONAR encoding and concept-space clustering as the classification mechanism, with LLMs providing the category labels.

6.7 The LLM-First Checklist

Before considering an LCM for a new use case, confirm that the use case is not better served by an LLM. The following checklist operationalizes the LLM-first default.

The use case is LLM-appropriate if any of the following are true:

The task involves code generation, code review, or code transformation
The task is conversational (turn-level quality, short context)
The task requires token-level extraction (entity extraction, span detection, syntactic parsing)
The answer lives in a single document or passage (RAG over short docs)
The quality criterion is local fluency, stylistic richness, or creative variation
The task does not require cross-document reasoning or cross-lingual reasoning
The team has no LCM tooling or evaluation expertise

If none of these apply, proceed to the Task Unit Test. If the Task Unit Test returns "concept-level," proceed to Chapter 7 and evaluate whether your specific use case is among the LCM wins.
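The checklist can also be written down as a small decision helper, which is convenient for a stack audit. The sketch below is one possible encoding, not a prescribed tool: the `UseCase` fields and the `llm_first_reasons` and `next_step` functions are hypothetical names that mirror the seven checklist items one to one.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    """One field per checklist item. Defaults describe a use case
    for which no LLM-first condition applies."""
    involves_code: bool = False
    conversational: bool = False
    needs_token_level_extraction: bool = False
    answer_in_single_passage: bool = False
    quality_is_local_or_stylistic: bool = False
    needs_cross_document_or_cross_lingual: bool = True
    team_has_lcm_tooling_and_expertise: bool = True


def llm_first_reasons(uc: UseCase) -> list[str]:
    """Return every checklist item that holds for this use case."""
    checks = [
        (uc.involves_code, "code generation, review, or transformation"),
        (uc.conversational, "conversational, turn-level quality"),
        (uc.needs_token_level_extraction, "token-level extraction"),
        (uc.answer_in_single_passage, "answer lives in one document or passage"),
        (uc.quality_is_local_or_stylistic, "local fluency or stylistic quality"),
        (not uc.needs_cross_document_or_cross_lingual,
         "no cross-document or cross-lingual reasoning"),
        (not uc.team_has_lcm_tooling_and_expertise,
         "no LCM tooling or evaluation expertise"),
    ]
    return [reason for applies, reason in checks if applies]


def next_step(uc: UseCase) -> str:
    """Any single matching item is enough to stay with an LLM."""
    return "use an LLM" if llm_first_reasons(uc) else "run the Task Unit Test"


helpdesk = UseCase(conversational=True, answer_in_single_passage=True)
policy_conflicts = UseCase()  # cross-document, team has LCM expertise

print(next_step(helpdesk), llm_first_reasons(helpdesk))
print(next_step(policy_conflicts))
```

Returning the list of matching reasons, rather than a bare yes or no, keeps the audit trail visible when the same helper is applied across many systems.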

Summary

LLMs win on code generation, conversational AI, RAG over short documents, creative generation, and short-form classification and extraction. These wins are structural: the token-level architecture is the right tool for the task's natural unit, and the LCM's concept-level encoding adds overhead without adding capability.

The LLM-first default is correct. LLM tooling is mature, teams know how to use it, and most enterprise AI tasks are token-level tasks. Start there.
Token-level tasks are well-defined. Code, chat, single-document Q&A, creative copy, and token-span extraction are all tasks where the token is the right representational unit.
The LLM-first checklist prevents premature LCM adoption. Run through it before considering LCMs for any new use case.
The edges are real. Several task types benefit from LLM components in a hybrid architecture with an LCM backend. Chapter 12 covers those patterns.

Exercises

Type	Exercise	Description
Analysis	Stack audit	Inventory the AI systems currently running in your organization. Apply the LLM-first checklist to each. How many are correctly using LLMs? Are there any using LCMs (or complex architectures) for tasks that would be better served by simpler LLM patterns?
Design	RAG architecture review	A team has built a RAG system that answers employee questions about HR policies. The knowledge base contains 200 policy documents averaging 15 pages each. Users complain that questions involving obligations from multiple policies are answered inconsistently. Apply the Task Unit Test and the LLM-first checklist. Is this a pure LLM case, a pure LCM case, or a hybrid? Design the appropriate architecture.
Conceptual	The creative edge	The chapter identifies long-form narrative generation as an edge case where a hybrid LLM-LCM approach might outperform a pure LLM approach. Describe what "global semantic coherence" means for a 20-chapter technical book. What would the LCM component provide, and what would the LLM component provide? Where would you draw the boundary between the two?
