What Is an AI Gateway?
If your organization has more than one team calling LLM APIs, you need a gateway. An AI gateway sits between your application and LLM providers, centralizing cross-cutting concerns like authentication, rate limiting, routing, guardrails, and observability.
┌──────────┐ ┌──────────────────────────────────┐ ┌──────────────┐
│ │ │ AI Gateway │ │ OpenAI │
│ App / │ │ │ ├──────────────┤
│ Agent │────>│ Auth -> Rate Limit -> Guardrails │────>│ Anthropic │
│ Service │<────│ -> Router -> Log -> Transform │<────│ Google │
│ │ │ │ ├──────────────┤
└──────────┘ └──────────────────────────────────┘ │ Local/OSS │
└──────────────┘
Core Components
Every AI gateway is built from the same building blocks. Understanding each component helps you decide whether to buy a managed gateway or build a custom one -- and what to prioritize first.
| Component | Purpose | Implementation |
|---|---|---|
| Authentication | Validate API keys, JWT tokens | API key store, OAuth2 |
| Rate limiting | Prevent abuse, manage quotas | Token bucket, sliding window |
| Input guardrails | Block harmful/invalid requests | PII scan, injection detection |
| Model router | Select provider/model per request | Rules, cost optimizer, fallback chain |
| Request transform | Normalize to provider format | OpenAI -> Anthropic format mapping |
| Response transform | Normalize provider responses | Unified response schema |
| Output guardrails | Filter harmful responses | Toxicity, PII, format validation |
| Logging/telemetry | Audit trail, analytics | Structured logs, traces |
| Caching | Reduce cost for repeated queries | Semantic cache, exact match cache |
| Cost tracking | Per-team/project attribution | Token counting, pricing tables |
Gateway Products and Frameworks
The gateway market ranges from open-source proxies you run yourself to fully managed commercial platforms. Your choice depends on whether you need simple routing or enterprise features like audit logging and cost attribution.
| Product | Type | Key Features |
|---|---|---|
| LiteLLM | Open source proxy | 100+ providers, OpenAI-compatible API |
| Portkey | Commercial | Caching, fallbacks, load balancing |
| Helicone | Commercial | Logging, caching, rate limiting |
| Kong AI Gateway | Commercial | Enterprise, plugin-based |
| Cloudflare AI Gateway | Commercial | Edge caching, analytics |
| MLflow AI Gateway | Open source | Centralized credentials, rate limits |
| Semantic Router | Open source | Intent-based routing |
| Custom (FastAPI) | DIY | Full control |
Authentication and Authorization
The gateway is the single point where you map internal team keys to provider credentials, enforce permissions, and set spending limits. Getting this right prevents unauthorized model access and surprise bills.
API Key Management
# Gateway authenticates app, not end-user
# Map internal keys to provider keys
KEY_MAP = {
"app-key-team-alpha": {
"team": "alpha",
"budget_monthly_usd": 500,
"allowed_models": ["gpt-4o", "claude-sonnet-4"],
"rate_limit_rpm": 60
}
}
def authenticate(request):
api_key = request.headers.get("Authorization", "").replace("Bearer ", "")
config = KEY_MAP.get(api_key)
if not config:
raise HTTPException(401, "Invalid API key")
return config
Per-Team Permissions
| Permission | Description | Example |
|---|---|---|
allowed_models | Which models the team can use | ["gpt-4o-mini", "claude-haiku"] |
max_tokens_per_request | Cap output size | 4096 |
rate_limit_rpm | Requests per minute | 100 |
rate_limit_tpm | Tokens per minute | 100000 |
budget_monthly_usd | Spending cap | 500 |
guardrail_level | Strictness of content filtering | "standard" or "strict" |
Rate Limiting
Without rate limiting, a single runaway script can burn through your entire monthly budget in hours. Rate limits protect against both abuse and accidental cost explosions.
Token Bucket Algorithm
import time
class TokenBucket:
def __init__(self, capacity: int, refill_rate: float):
self.capacity = capacity
self.tokens = capacity
self.refill_rate = refill_rate # tokens per second
self.last_refill = time.time()
def consume(self, tokens: int = 1) -> bool:
self._refill()
if self.tokens >= tokens:
self.tokens -= tokens
return True
return False
def _refill(self):
now = time.time()
elapsed = now - self.last_refill
self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
self.last_refill = now
Rate Limit Tiers
| Tier | RPM | TPM | Max Concurrent | Use Case |
|---|---|---|---|---|
| Free | 10 | 10K | 2 | Development |
| Standard | 60 | 100K | 10 | Internal tools |
| Premium | 300 | 500K | 50 | Production apps |
| Enterprise | 1000+ | 2M+ | 200+ | High-traffic services |
Model Routing
Intelligent routing is where the gateway pays for itself -- sending simple queries to cheap models and complex ones to capable models can cut costs by 50-80% without sacrificing quality.
Routing Strategies
| Strategy | Description | Best For |
|---|---|---|
| Cost-optimized | Route to cheapest model that can handle the task | Budget-conscious |
| Quality-first | Route to best model, fall back on failure | High-stakes tasks |
| Latency-first | Route to fastest responding model | Real-time apps |
| Load-balanced | Round-robin across providers | Even distribution |
| Capability-based | Route by task type (vision, code, etc.) | Multi-modal apps |
| Cascading | Try small model first, escalate if needed | Cost + quality balance |
Router Implementation
ROUTING_TABLE = {
"simple_qa": {
"primary": "gpt-4o-mini",
"fallback": ["claude-haiku-3.5", "gemini-flash"]
},
"complex_reasoning": {
"primary": "claude-sonnet-4",
"fallback": ["gpt-4o", "gemini-pro"]
},
"code_generation": {
"primary": "claude-sonnet-4",
"fallback": ["gpt-4o"]
},
"vision": {
"primary": "gpt-4o",
"fallback": ["gemini-pro", "claude-sonnet-4"]
}
}
def classify_and_route(request):
task_type = classify_task(request.messages) # lightweight classifier
config = ROUTING_TABLE[task_type]
return config["primary"]
Cascading Pattern
async def cascade_call(request, models=["gpt-4o-mini", "gpt-4o", "claude-sonnet-4"]):
"""Try cheaper model first; escalate if response quality is low."""
for model in models:
response = await call_model(model, request)
# Quality gate: check if response is good enough
quality = assess_quality(request, response)
if quality.score > 0.8:
return response
# If last model, return whatever we have
if model == models[-1]:
return response
return response
Failover Patterns
LLM providers have outages -- and when your entire product depends on a single API, that outage becomes your outage. Multi-provider failover with circuit breakers keeps your application running through provider disruptions.
| Pattern | Description | Recovery Time |
|---|---|---|
| Active-passive | Primary + standby | Seconds |
| Active-active | Multiple active providers | Instant (next request) |
| Circuit breaker | Disable failed provider temporarily | Configurable (30s-5m) |
| Retry with backoff | Retry on transient errors | Milliseconds-seconds |
| Hedge requests | Send to N providers, take first response | Instant |
Circuit Breaker
class CircuitBreaker:
def __init__(self, failure_threshold=5, recovery_timeout=60):
self.failure_count = 0
self.failure_threshold = failure_threshold
self.recovery_timeout = recovery_timeout
self.last_failure = None
self.state = "closed" # closed=normal, open=failing, half_open=testing
def call(self, fn, *args, **kwargs):
if self.state == "open":
if time.time() - self.last_failure > self.recovery_timeout:
self.state = "half_open"
else:
raise CircuitOpenError()
try:
result = fn(*args, **kwargs)
if self.state == "half_open":
self.state = "closed"
self.failure_count = 0
return result
except Exception as e:
self.failure_count += 1
self.last_failure = time.time()
if self.failure_count >= self.failure_threshold:
self.state = "open"
raise
Caching
Caching is the easiest way to reduce LLM costs and latency simultaneously. For applications with repetitive queries, a semantic cache can eliminate 15-40% of LLM calls entirely.
| Cache Type | Hit Rate | Latency Savings | Cost Savings |
|---|---|---|---|
| Exact match | Low (5-15%) | 100% (no LLM call) | 100% |
| Semantic cache | Medium (15-40%) | 100% (no LLM call) | 100% |
| KV cache (streaming) | N/A | 30-50% | 30-50% |
Semantic Cache
def get_or_generate(query: str, threshold: float = 0.95):
query_embedding = embed(query)
# Search cache
cached = cache_db.search(query_embedding, top_k=1)
if cached and cached[0].score > threshold:
return cached[0].response # Cache hit
# Cache miss: call LLM
response = llm.generate(query)
cache_db.insert(query_embedding, response, ttl=3600)
return response
Cost Attribution
When multiple teams share LLM infrastructure, you need per-team cost tracking for budgeting, chargeback, and identifying optimization opportunities. Without attribution, nobody owns the bill.
Token-Based Cost Tracking
PRICING = {
"gpt-4o": {"input": 2.50, "output": 10.00}, # per 1M tokens
"gpt-4o-mini": {"input": 0.15, "output": 0.60},
"claude-sonnet-4": {"input": 3.00, "output": 15.00},
"claude-haiku-3.5": {"input": 0.80, "output": 4.00},
}
def calculate_cost(model, input_tokens, output_tokens):
prices = PRICING[model]
cost = (input_tokens * prices["input"] + output_tokens * prices["output"]) / 1_000_000
return round(cost, 6)
Cost Dashboard Metrics
| Metric | Granularity | Purpose |
|---|---|---|
| Total spend | Daily/weekly/monthly | Budget tracking |
| Cost per team | Per team/project | Chargeback |
| Cost per request | Per request | Optimization |
| Model mix | % by model | Routing efficiency |
| Cache hit savings | Daily | ROI on caching |
| Cost per successful task | Per task type | Efficiency |
Logging and Observability
You cannot debug, optimize, or audit what you do not log. Structured logging at the gateway layer gives you a single pane of glass across all LLM providers and teams.
What to Log
| Field | Purpose |
|---|---|
request_id | Correlation |
timestamp | Timeline |
team_id | Attribution |
model | Routing analysis |
input_tokens | Cost, monitoring |
output_tokens | Cost, monitoring |
latency_ms | Performance |
status_code | Error tracking |
guardrail_triggered | Safety monitoring |
cache_hit | Efficiency |
cost_usd | Finance |
Do NOT log: Full prompt/response text in production (PII risk). Use a separate, access-controlled audit store if needed.
Common Pitfalls
Gateway mistakes are amplified across every request in your organization. A missing timeout or absent rate limit affects every team and every application simultaneously.
| Pitfall | Problem | Fix |
|---|---|---|
| No failover | Single provider outage = total outage | Multi-provider with circuit breaker |
| Logging full prompts | PII exposure, storage cost | Log metadata only, separate audit store |
| No rate limiting | Cost explosion, abuse | Per-team token bucket |
| Static routing | Suboptimal cost/quality | Dynamic routing based on task |
| No cost alerts | Surprise bills | Budget caps with alerts at 50%, 80%, 100% |
| No caching | Paying for repeated queries | Semantic cache for common queries |
| Sync-only gateway | Bottleneck under load | Async processing, streaming passthrough |
| Missing timeout | Hung requests waste resources | 30-120s timeouts on all provider calls |