Chapter 41 of 75
Cost and Performance Engineering for AI
AI inference costs are not fixed — they are engineering decisions. Teams that treat cost and performance as afterthoughts discover at scale that they have built capabilities they cannot afford to run. Cost engineering applied early produces AI systems that are both capable and economically sustainable.
Part IV — Enterprise AI Architecture
AI inference costs are real, variable, and controllable. In traditional software, adding users mostly raises infrastructure costs in coarse steps as capacity is provisioned; AI inference costs, by contrast, scale directly with usage, token by token. At low usage volumes, this is manageable. At enterprise scale, the economics become a first-order engineering concern. Organizations that build cost engineering into their AI architectures from the start maintain sustainable economics as they scale; those that do not face difficult choices between capability and affordability.
41.1 Understanding AI Cost Drivers
Token consumption. For LLM-based systems, cost is primarily driven by token counts — input tokens (the prompt, context, and user input) and output tokens (the generated response). Input and output tokens are priced differently by most providers; output tokens are typically more expensive because they are generated sequentially.
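As a rough illustration of split input/output pricing, the following sketch computes the cost of one request. The `Pricing` type and the prices used are placeholders chosen for the example, not any provider's actual rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    # Placeholder prices in USD per 1M tokens; substitute your provider's rates.
    input_per_million: float
    output_per_million: float


def request_cost(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Return the USD cost of one call, pricing input and output separately."""
    total = (input_tokens * pricing.input_per_million
             + output_tokens * pricing.output_per_million)
    return total / 1_000_000


example = Pricing(input_per_million=1.0, output_per_million=4.0)
cost = request_cost(input_tokens=3_000, output_tokens=500, pricing=example)
print(f"${cost:.4f} per request")
```

With these placeholder prices, the 500 output tokens account for 40% of the cost despite being a small share of the total tokens.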
System prompt overhead. System prompts are included in the input token count on every inference call. A 2,000-token system prompt on 10,000 calls per day consumes 20 million tokens per day in system prompt overhead alone. Long system prompts are often the largest cost lever — trimming them has immediate cost impact.
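The arithmetic above can be checked with a few lines. This is a minimal sketch; the price of 1.00 USD per million input tokens is an assumed placeholder, and the trimmed prompt length of 1,200 tokens is a hypothetical target.

```python
def daily_overhead_tokens(system_prompt_tokens: int, calls_per_day: int) -> int:
    """Tokens spent per day on the system prompt alone."""
    return system_prompt_tokens * calls_per_day


def daily_overhead_cost(system_prompt_tokens: int, calls_per_day: int,
                        usd_per_million_input: float) -> float:
    """USD spent per day on the system prompt alone."""
    tokens = daily_overhead_tokens(system_prompt_tokens, calls_per_day)
    return tokens / 1_000_000 * usd_per_million_input


PRICE = 1.0  # assumed placeholder: USD per 1M input tokens

tokens_per_day = daily_overhead_tokens(2_000, 10_000)
cost_before = daily_overhead_cost(2_000, 10_000, PRICE)
cost_after = daily_overhead_cost(1_200, 10_000, PRICE)  # hypothetical trimmed prompt

print(tokens_per_day, cost_before, cost_after)
```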
Context window usage. RAG systems that retrieve large amounts of context per query consume input tokens proportional to the context. Retrieving 5 chunks of 500 tokens each adds 2,500 input tokens per query. Context window optimization — retrieving only the most relevant context, compressing context before including it — directly reduces cost.
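One simple way to retrieve only the most relevant context is to cap the context at a token budget and drop low-scoring chunks. The sketch below assumes each retrieved chunk already carries a relevance score and a token count; the `Chunk` type, the scores, and the thresholds are illustrative, not part of any particular retrieval library.

```python
from typing import List, NamedTuple


class Chunk(NamedTuple):
    text: str
    score: float   # relevance score from the retriever (higher is better)
    tokens: int    # token count of the chunk


def select_context(chunks: List[Chunk], max_tokens: int,
                   min_score: float = 0.0) -> List[Chunk]:
    """Keep the highest-scoring chunks that fit within the token budget."""
    selected: List[Chunk] = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        if chunk.score < min_score:
            break  # remaining chunks score even lower
        if used + chunk.tokens > max_tokens:
            continue  # skip chunks that would exceed the budget
        selected.append(chunk)
        used += chunk.tokens
    return selected


retrieved = [Chunk(f"chunk {i}", score, 500)
             for i, score in enumerate([0.91, 0.84, 0.62, 0.40, 0.35])]
kept = select_context(retrieved, max_tokens=1_500, min_score=0.5)
context_tokens = sum(c.tokens for c in kept)
print(len(kept), context_tokens)
```

Here the five 500-token chunks from the example shrink to three, cutting context from 2,500 to 1,500 input tokens per query.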
Model selection. Model pricing varies by orders of magnitude. GPT-4 class models cost significantly more per token than GPT-3.5 class models, which in turn typically cost more than smaller fine-tuned models. Running open-source models on your own infrastructure can be cheaper still per token, provided the hardware is kept well utilized. The appropriate model for a task is not always the most capable model; it is the cheapest model that achieves the required quality.
Output length. Output tokens are expensive and controllable. Instructions that constrain output length ("respond in three bullet points", "summary in 50 words or fewer") directly reduce output token consumption. Open-ended generation instructions produce longer outputs and higher costs.
41.2 Cost Optimization Techniques
Prompt compression. Reduce system prompt length without degrading quality. Techniques: remove redundant instructions, remove examples that are not needed (few-shot examples consume significant tokens), use concise phrasing instead of explanatory prose, move static background information to retrieval rather than the system prompt.
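Context compression. For RAG systems, summarize retrieved chunks before including them in the prompt. A retrieved passage of 500 tokens summarized to 150 tokens reduces the main model's input token consumption by 70% for that chunk, at the cost of one additional summarization call. That call still reads all 500 tokens, so the technique pays off only when the summarizer is much cheaper per token than the main model.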
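Whether compression saves money depends on the price gap between the main model and the summarizer. The sketch below works through the 500-to-150-token example; all prices are assumed placeholders in USD per million tokens.

```python
def compression_net_saving(chunk_tokens: int, summary_tokens: int,
                           main_input_price: float,
                           summarizer_input_price: float,
                           summarizer_output_price: float) -> float:
    """Net USD saved per chunk; negative means compression costs more."""
    # Input tokens the main model no longer has to read.
    saved = (chunk_tokens - summary_tokens) * main_input_price
    # The summarizer reads the full chunk and writes the summary.
    spent = (chunk_tokens * summarizer_input_price
             + summary_tokens * summarizer_output_price)
    return (saved - spent) / 1_000_000


# Cheap summarizer in front of an expensive main model (placeholder prices).
net_cheap = compression_net_saving(500, 150, 10.0, 0.5, 1.5)
# Summarizing with the main model itself (placeholder prices).
net_same = compression_net_saving(500, 150, 10.0, 10.0, 30.0)

print(net_cheap, net_same)
```

With these numbers the cheap summarizer yields a net saving per chunk, while summarizing with the main model itself loses money.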
Model routing. Route queries to cheaper models when the task does not require the full capability of the frontier model. Simple queries go to a fast, cheap model; complex queries that require reasoning or domain knowledge go to the more capable (and more expensive) model. Model routing requires a classification step to determine query complexity, but the cost savings often justify the classification overhead.
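A minimal routing sketch follows. The keyword heuristic is a deliberately crude stand-in for the classification step (production routers often use a small classifier model instead), and the model names are hypothetical.

```python
CHEAP_MODEL = "small-model"        # hypothetical model identifier
CAPABLE_MODEL = "frontier-model"   # hypothetical model identifier

# Assumed signals that a query needs multi-step reasoning.
REASONING_HINTS = ("why", "compare", "explain", "analyze", "trade-off")


def classify_complexity(query: str) -> str:
    """Label a query 'simple' or 'complex' with a cheap heuristic."""
    text = query.lower()
    long_query = len(text.split()) > 40
    needs_reasoning = any(hint in text for hint in REASONING_HINTS)
    return "complex" if long_query or needs_reasoning else "simple"


def route(query: str) -> str:
    """Return the model that should handle the query."""
    if classify_complexity(query) == "complex":
        return CAPABLE_MODEL
    return CHEAP_MODEL


simple_choice = route("What are your opening hours?")
complex_choice = route("Compare the two contracts and explain the liability differences.")
print(simple_choice, complex_choice)
```

Because misrouting a hard query to the cheap model degrades quality, a router like this is usually evaluated on the same dataset used to evaluate the models themselves.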
Caching. Cache the outputs of LLM calls when the same input is likely to be submitted multiple times. Semantic caching — caching by query meaning rather than exact string match — increases cache hit rates for conversational systems where the same intent is expressed in different words. Caching is most effective for high-volume systems with repetitive query patterns.
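The sketch below shows the shape of a semantic cache: store an embedding per query and return a cached response when a new query is similar enough. The `toy_embed` function is a bag-of-words stand-in that only captures word overlap; a real system would plug in an embedding model, and the 0.8 threshold is an assumption that would need tuning.

```python
import math
from collections import Counter
from typing import Callable, List, Optional, Tuple


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: word counts. Replace with a real embedding model."""
    return Counter(text.lower().replace("?", "").split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed: Callable[[str], Counter] = toy_embed,
                 threshold: float = 0.8) -> None:
        self.embed = embed
        self.threshold = threshold
        self.entries: List[Tuple[Counter, str]] = []

    def put(self, query: str, response: str) -> None:
        self.entries.append((self.embed(query), response))

    def get(self, query: str) -> Optional[str]:
        """Return the most similar cached response if it clears the threshold."""
        q = self.embed(query)
        best_score, best_response = 0.0, None
        for embedding, response in self.entries:
            score = cosine(q, embedding)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None


cache = SemanticCache()
cache.put("how do I reset my password", "Use the 'Forgot password' link.")
hit = cache.get("How can I reset my password?")
miss = cache.get("what is the refund policy")
print(hit, miss)
```

The linear scan is adequate for a sketch; at volume, the lookup would typically use a vector index.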
Batch processing. For non-real-time workloads (document processing, data enrichment, scheduled analysis), group requests into batch jobs where the provider supports batch pricing (typically 50% of real-time pricing). Batch processing delays results, because jobs complete asynchronously, but roughly halves per-unit cost at that discount.
Fine-tuning. For high-volume, well-defined tasks, fine-tuning a smaller model on examples of the task can match or exceed the quality of prompting a larger model at significantly lower cost. Fine-tuning requires an upfront investment in training data and compute, but the per-inference cost advantage compounds at scale.
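The point at which fine-tuning pays for itself can be estimated as a break-even volume. The figures below are assumed placeholders: an upfront cost of 5,000 USD, and per-1,000-request costs of 10 USD for the large model and 2 USD for the fine-tuned small model.

```python
import math
from typing import Optional


def break_even_requests(upfront_cost: float,
                        large_cost_per_1k: float,
                        tuned_cost_per_1k: float) -> Optional[int]:
    """Requests needed before fine-tuning recovers its upfront cost.

    Returns None when the fine-tuned model is not cheaper per request.
    """
    saving_per_1k = large_cost_per_1k - tuned_cost_per_1k
    if saving_per_1k <= 0:
        return None
    return math.ceil(upfront_cost / saving_per_1k * 1_000)


requests_needed = break_even_requests(5_000.0, 10.0, 2.0)
no_payback = break_even_requests(5_000.0, 2.0, 2.0)
print(requests_needed, no_payback)
```

At 10,000 calls per day, the break-even volume in this example would be reached in roughly two months; at 100 calls per day it would take about 17 years, which is why the technique suits high-volume tasks.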
41.3 Latency Engineering
AI inference latency has three components: network latency (time to reach the API), queue latency (time waiting for a model instance), and generation latency (time to generate tokens). Generation latency dominates for longer outputs and is roughly proportional to output length, because tokens are produced one at a time.
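A back-of-the-envelope model of these three components is sketched below. The network and queue times and the throughput of 50 tokens per second are assumed values for illustration only.

```python
def estimated_latency_s(network_s: float, queue_s: float,
                        output_tokens: int, tokens_per_second: float) -> float:
    """Sum of network, queue, and generation latency in seconds."""
    generation_s = output_tokens / tokens_per_second
    return network_s + queue_s + generation_s


# Assumed values: 50 ms network, 200 ms queue, 50 tokens/second.
short_answer = estimated_latency_s(0.05, 0.2, output_tokens=50, tokens_per_second=50.0)
long_answer = estimated_latency_s(0.05, 0.2, output_tokens=500, tokens_per_second=50.0)
print(short_answer, long_answer)
```

Under these assumptions, generation accounts for 1 of 1.25 seconds for the short answer and 10 of 10.25 seconds for the long one.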
Streaming. For user-facing features, streaming (returning tokens as they are generated rather than waiting for the complete response) reduces perceived latency significantly. The first token appears quickly; the user sees the response building in real time. Total generation time is the same, but the user experience is dramatically better.
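The difference between time to first token and total time can be seen with a simulated stream. `fake_token_stream` is a stand-in for a provider's streaming response, with an assumed fixed delay per token.

```python
import time
from typing import Iterable, Iterator, Optional, Tuple


def fake_token_stream(tokens: Iterable[str], delay_s: float = 0.01) -> Iterator[str]:
    """Stand-in for a streaming API: yields one token per delay interval."""
    for token in tokens:
        time.sleep(delay_s)
        yield token


def consume_stream(stream: Iterator[str]) -> Tuple[str, Optional[float], float]:
    """Collect tokens, recording time to first token and total time."""
    start = time.perf_counter()
    first_token_s: Optional[float] = None
    parts = []
    for token in stream:
        if first_token_s is None:
            first_token_s = time.perf_counter() - start
        parts.append(token)  # a UI would render the token here
    total_s = time.perf_counter() - start
    return "".join(parts), first_token_s, total_s


text, first_token_s, total_s = consume_stream(
    fake_token_stream(["Cost ", "is ", "a ", "design ", "choice."]))
print(text, first_token_s, total_s)
```

The user starts reading after `first_token_s` rather than after `total_s`, which is the whole benefit: the total is unchanged.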
Speculative decoding. Some model serving implementations use a small "draft" model to speculatively generate several tokens ahead, which a larger "verifier" model then approves or rejects in a single forward pass. When the draft model is correct (which it is for most common token sequences), several tokens are accepted for the price of one large-model step, so generation is significantly faster than running the large model alone.
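The toy sketch below illustrates the greedy form of the idea with deterministic stand-in "models" (plain functions from a prefix to the next token). It counts verifier passes rather than measuring time, and it omits details of real implementations such as probabilistic acceptance and the bonus token emitted when every draft token is accepted.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[str]], str]  # greedy next-token function

SENTENCE = "the quick brown fox jumps over the lazy dog".split()


def target(prefix: List[str]) -> str:
    """Stand-in for the large model: always continues SENTENCE."""
    return SENTENCE[len(prefix)] if len(prefix) < len(SENTENCE) else "<eos>"


def draft(prefix: List[str]) -> str:
    """Stand-in for the small model: right except at position 4."""
    return "runs" if len(prefix) == 4 else target(prefix)


def speculative_decode(draft_fn: NextToken, target_fn: NextToken,
                       k: int, max_new_tokens: int) -> Tuple[List[str], int]:
    out: List[str] = []
    target_passes = 0
    while len(out) < max_new_tokens:
        # The draft model proposes k tokens autoregressively.
        proposal: List[str] = []
        for _ in range(k):
            proposal.append(draft_fn(out + proposal))
        # A real server checks all k positions in one batched forward pass,
        # so each verification round is counted as a single pass here.
        target_passes += 1
        accepted: List[str] = []
        for token in proposal:
            expected = target_fn(out + accepted)
            if token == expected:
                accepted.append(token)
            else:
                accepted.append(expected)  # take the verifier's token, stop
                break
        out.extend(accepted)
    return out[:max_new_tokens], target_passes


tokens, passes = speculative_decode(draft, target, k=3, max_new_tokens=len(SENTENCE))
print(" ".join(tokens), passes)
```

The output matches what the large model alone would produce, but in this example it takes four verifier passes instead of nine sequential large-model steps.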
Caching for latency. Exact-match caching (serving cached responses for repeated identical queries) eliminates inference latency entirely for cache hits. This is particularly valuable for common queries in high-traffic applications.
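A minimal exact-match cache can be keyed on a hash of the model name and prompt. The in-memory dictionary below is an assumption for the sketch; a shared store with expiry would be the more likely choice in production. `fake_model` is a stand-in for the inference call.

```python
import hashlib
from typing import Callable, Dict, List


class ExactMatchCache:
    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    @staticmethod
    def _key(model: str, prompt: str) -> str:
        # Include the model name so different models never share entries.
        return hashlib.sha256(f"{model}\n{prompt}".encode("utf-8")).hexdigest()

    def get_or_compute(self, model: str, prompt: str,
                       compute: Callable[[str], str]) -> str:
        """Return the cached response, calling compute only on a miss."""
        key = self._key(model, prompt)
        if key not in self._store:
            self._store[key] = compute(prompt)
        return self._store[key]


calls: List[str] = []  # records each time the stand-in model is invoked


def fake_model(prompt: str) -> str:
    calls.append(prompt)
    return f"answer to: {prompt}"


cache = ExactMatchCache()
first = cache.get_or_compute("small-model", "What is the SLA?", fake_model)
second = cache.get_or_compute("small-model", "What is the SLA?", fake_model)
print(first == second, len(calls))
```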
Asynchronous processing. For tasks where results are not needed immediately, move AI inference off the critical path. The user initiates a request, receives an acknowledgment, and is notified when the result is ready. Asynchronous processing removes latency as a user-facing concern for batch-oriented AI features.
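The submit, acknowledge, and notify pattern can be sketched with `asyncio`. The in-process `jobs` dictionary and the `run_inference` coroutine are stand-ins; a real deployment would more likely use a durable job queue and a notification channel.

```python
import asyncio
import uuid
from typing import Dict, Tuple

jobs: Dict[str, "asyncio.Task[str]"] = {}  # stand-in for a durable job store


async def run_inference(prompt: str) -> str:
    """Stand-in for a slow model call."""
    await asyncio.sleep(0.05)
    return f"summary of: {prompt}"


async def submit(prompt: str) -> str:
    """Start the work in the background and return a job id immediately."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = asyncio.create_task(run_inference(prompt))
    return job_id


async def result(job_id: str) -> str:
    """Wait for a job to finish; a real system would notify the user instead."""
    return await jobs[job_id]


async def main() -> Tuple[bool, str]:
    job_id = await submit("quarterly report")
    acknowledged_before_done = not jobs[job_id].done()
    output = await result(job_id)
    return acknowledged_before_done, output


acknowledged_before_done, output = asyncio.run(main())
print(acknowledged_before_done, output)
```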
Regional deployment. For globally distributed users, deploying AI serving infrastructure in multiple regions reduces network latency by serving each user from the nearest region. Most cloud AI providers offer regional deployments.
41.4 The Cost-Quality Tradeoff Framework
Model selection is an optimization problem: find the cheapest model that achieves the required quality threshold on the evaluation dataset.
Step 1: Define the quality threshold. What evaluation score does the feature need to achieve? This must be defined from business requirements, not from what the most capable model achieves.
Step 2: Evaluate a range of models. Run the evaluation dataset through multiple model tiers — the frontier model, mid-tier models, smaller models, fine-tuned smaller models — and measure quality scores for each.
Step 3: Calculate cost at production volume. For each model that achieves the quality threshold, calculate the monthly cost at the anticipated production query volume and average token consumption.
Step 4: Select the cheapest qualifying model. The production model is the cheapest model that meets the quality threshold — not the most capable model, not the model with the best marketing.
Step 5: Reassess quarterly. Model capabilities improve and costs decline. A model that did not meet the quality threshold six months ago may meet it today at lower cost than the model currently in production.
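Steps 1 through 4 can be sketched as a small selection routine. The candidate names, quality scores, and prices below are invented for illustration; in practice the scores come from running the evaluation dataset and the prices from the provider's rate card.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    quality: float                  # score on the evaluation dataset
    usd_per_million_input: float
    usd_per_million_output: float


def monthly_cost(c: Candidate, queries_per_month: int,
                 avg_input_tokens: int, avg_output_tokens: int) -> float:
    """Step 3: cost at anticipated production volume."""
    per_query = (avg_input_tokens * c.usd_per_million_input
                 + avg_output_tokens * c.usd_per_million_output) / 1_000_000
    return per_query * queries_per_month


def select_model(candidates: List[Candidate], threshold: float,
                 queries_per_month: int, avg_input_tokens: int,
                 avg_output_tokens: int) -> Optional[Candidate]:
    """Step 4: cheapest candidate that meets the quality threshold."""
    qualifying = [c for c in candidates if c.quality >= threshold]
    if not qualifying:
        return None
    return min(qualifying, key=lambda c: monthly_cost(
        c, queries_per_month, avg_input_tokens, avg_output_tokens))


# Hypothetical evaluation results and placeholder prices.
candidates = [
    Candidate("frontier", 0.93, 10.0, 30.0),
    Candidate("mid-tier", 0.88, 1.0, 4.0),
    Candidate("small", 0.79, 0.2, 0.8),
    Candidate("tuned-small", 0.87, 0.4, 1.6),
]

chosen = select_model(candidates, threshold=0.85, queries_per_month=1_000_000,
                      avg_input_tokens=2_000, avg_output_tokens=300)
chosen_cost = monthly_cost(chosen, 1_000_000, 2_000, 300)
print(chosen.name, chosen_cost)
```

Rerunning the same routine each quarter with fresh evaluation scores and prices covers Step 5.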
The cost-quality tradeoff framework applied systematically across an enterprise AI portfolio can reduce AI infrastructure costs by 40–70% without degrading user-facing quality — because most AI features are over-engineered with models more capable than the task requires.