GenAI Architecture 01

Simple Chat API

The most fundamental GenAI pattern: a single stateless LLM call with a system prompt. Every generative AI application starts here. Understand request/response flow, prompt design, temperature tuning, token limits, streaming, and error handling.

1. Architecture Overview

The Simple Chat API is the most basic GenAI architecture pattern. It consists of a single, stateless request-response cycle: a user sends a message, the system prepends a system prompt, calls an LLM, and returns the generated text. No memory, no retrieval, no tool use — just a direct conversation turn.

When to Use

  • Single-turn Q&A applications (FAQ bots, helpdesk)
  • Text transformation tasks (summarization, translation, reformatting)
  • Code generation or completion from a single prompt
  • Prototyping and validating prompt designs before adding complexity

Complexity Level

Low. This is the starting point for every GenAI project. If this pattern solves your problem, stop here. Many production applications are just well-crafted system prompts with good error handling.

Tip: Start with the simplest architecture that works. You can always layer on memory, RAG, or tool use later; premature complexity is one of the most common mistakes in GenAI engineering.

2. Architecture Diagram

Diagram 1: Simple Chat API, a stateless request-response cycle with system prompt injection

3. Components Deep Dive

Component | Description
💬 System Prompt | Defines the LLM's role, personality, constraints, and output format. This is your primary lever for controlling behavior. Keep it clear, specific, and tested.
🌡 Temperature | Controls randomness in token selection. 0.0 = near-deterministic (factual tasks), 0.7 = a balance of variety and coherence, 1.0+ = highly varied output. Always tune for your use case.
📏 Max Tokens | Upper bound on output length. Set this to avoid runaway generation costs. Input tokens + max_tokens must fit within the model's context window.
⚡ Streaming | Delivers tokens incrementally via SSE (Server-Sent Events). Reduces perceived latency because the first token is shown long before the full response is complete. Essential for chat UIs.
🛡 API Gateway | Handles authentication, rate limiting, request validation, and API key management. Sits between the client and the LLM provider to add security and control.
⚠ Error Handling | Handle rate limits (429), timeouts, malformed responses, and provider outages. Implement retries with exponential backoff and circuit breaker patterns.
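
To make the temperature row concrete, here is a minimal sketch of how temperature reshapes a token probability distribution. It assumes the common formulation in which logits are divided by the temperature before a softmax. The function name and the logits are made up for illustration, and providers may implement sampling differently.

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        # Treat 0 as greedy decoding: all probability mass on the top-scoring token.
        top = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == top else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.0, 0.3, 0.7, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring token, while higher temperatures spread it across the alternatives, which is why low values suit factual tasks.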

4. Implementation

Basic Chat Completion

import anthropic

client = anthropic.Anthropic()

def chat(user_message: str, system: str = "You are a helpful assistant.") -> str:
    """Single-turn chat completion."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system=system,
        messages=[{"role": "user", "content": user_message}],
        temperature=0.3,
    )
    return response.content[0].text

Streaming Response

def chat_stream(user_message: str, system: str = "You are a helpful assistant."):
    """Stream tokens as they are generated."""
    with client.messages.stream(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system=system,
        messages=[{"role": "user", "content": user_message}],
    ) as stream:
        for text in stream.text_stream:
            print(text, end="", flush=True)
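
The function above prints to the console. A chat UI usually needs the chunks forwarded to the browser instead. The following is a rough sketch of framing text chunks as Server-Sent Events. The helper name `to_sse`, the `{"text": ...}` payload shape, and the `done` event are assumptions made for illustration, not part of any SDK.

```python
import json
from typing import Iterable, Iterator

def to_sse(chunks: Iterable[str]) -> Iterator[str]:
    """Wrap text chunks as Server-Sent Events frames."""
    for chunk in chunks:
        # JSON-encode the chunk so newlines inside it cannot break the frame.
        payload = json.dumps({"text": chunk})
        yield f"data: {payload}\n\n"
    # Hypothetical end-of-stream marker; the client has to agree on it.
    yield "event: done\ndata: {}\n\n"

# A fixed list stands in for the model's text stream so the sketch runs offline.
frames = list(to_sse(["Hel", "lo\n", "world"]))
for frame in frames:
    print(repr(frame))
```

In a real service, the iterable would be `stream.text_stream` from the streaming call, and the frames would be written to an HTTP response with the `text/event-stream` content type.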

Error Handling with Retry

import time
from anthropic import RateLimitError, APITimeoutError

def chat_with_retry(user_message: str, max_retries: int = 3) -> str:
    """Retry transient failures with exponential backoff (1s, 2s, 4s, ...)."""
    for attempt in range(max_retries):
        try:
            return chat(user_message)
        except (RateLimitError, APITimeoutError) as exc:
            if attempt == max_retries - 1:
                break  # final attempt failed; do not sleep again
            wait = 2 ** attempt  # exponential backoff for both error types
            print(f"{type(exc).__name__}. Retrying in {wait}s...")
            time.sleep(wait)
    raise RuntimeError("Max retries exceeded")

5. Data Flow

Here is the step-by-step flow of a single request through the Simple Chat API architecture:

Step | Action | Details
1 | Client sends request | User message + optional parameters (temperature, max_tokens) via HTTP POST
2 | API gateway validates | Check auth token, rate limits, input length, and content policy
3 | System prompt prepended | Server-side system prompt is added to the request, either as a system message or as a separate system parameter depending on the provider (never exposed to client)
4 | LLM API called | Request forwarded to the model provider (Anthropic, OpenAI, etc.)
5 | Tokens generated | Model generates output tokens autoregressively
6 | Response returned | Complete text (or streamed chunks) sent back to client
7 | Logging & metrics | Log latency, token counts, and errors for observability
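
The flow above can be sketched as a single server-side handler. This is a simplified illustration, not a reference implementation: `handle_request`, `fake_llm`, `MAX_INPUT_CHARS`, and the status codes are assumptions chosen for the example. Authentication and rate limiting are left out, and a stub stands in for the provider call so the code runs offline.

```python
import time
import uuid
from typing import Callable

SYSTEM_PROMPT = "You are a helpful assistant."  # stays on the server
MAX_INPUT_CHARS = 4000  # hypothetical input limit

def handle_request(user_message: str, call_llm: Callable[[str, str], str]) -> dict:
    """Validate the input, add the system prompt, call the model, record metrics."""
    request_id = str(uuid.uuid4())
    # Step 2: basic validation (auth and rate limiting omitted in this sketch).
    if not user_message.strip():
        return {"request_id": request_id, "status": 400, "error": "empty message"}
    if len(user_message) > MAX_INPUT_CHARS:
        return {"request_id": request_id, "status": 413, "error": "message too long"}
    # Steps 3-5: attach the server-side system prompt and call the model.
    start = time.perf_counter()
    text = call_llm(SYSTEM_PROMPT, user_message)
    latency_ms = (time.perf_counter() - start) * 1000
    # Steps 6-7: return the text along with metrics a real service would also log.
    return {"request_id": request_id, "status": 200, "text": text,
            "latency_ms": round(latency_ms, 2)}

def fake_llm(system: str, user_message: str) -> str:
    """Stand-in for the provider call."""
    return f"echo: {user_message}"

result = handle_request("What is streaming?", fake_llm)
print(result["status"], result["text"])
```

Passing the model call in as a function keeps the handler testable without network access. In production, `call_llm` would wrap a function such as `chat`.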

6. Trade-offs & Considerations

Advantage | Limitation
Simplest possible architecture | No conversation memory (stateless)
Low latency (single API call) | Cannot access external data or tools
Easy to debug and test | Limited to model's training knowledge
Minimal infrastructure needed | System prompt engineering can be finicky
Low cost per request | No built-in content grounding

When to upgrade: Move to Architecture 02 (Conversational Chatbot) when users need multi-turn context. Move to Architecture 03 (RAG) when the model needs access to your proprietary data.

7. Production Checklist

  • API key rotation and secrets management (e.g., AWS Secrets Manager, GCP Secret Manager)
  • Rate limiting per user/API key to prevent abuse
  • Input validation: max length, content filtering, injection detection
  • Output validation: format checks, PII scanning, toxicity filtering
  • Structured logging: request ID, latency, token usage, model version
  • Retry logic with exponential backoff and circuit breaker
  • Cost monitoring and alerting on token spend
  • Prompt versioning and A/B testing framework
  • Health check endpoint for load balancer
  • Graceful degradation when LLM provider is down
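
The circuit breaker and graceful degradation items can be illustrated with a small sketch. It assumes a simple consecutive-failure threshold and a fixed cool-down. The class, its parameters, and the `guarded_call` helper are hypothetical names for this example, and production libraries offer more complete implementations.

```python
import time
from typing import Callable, Optional

class CircuitBreaker:
    """Stop calling a failing dependency until a cool-down has passed."""

    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock  # injectable so tests can control time
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        """Return True if a call may be attempted."""
        if self.opened_at is None:
            return True
        # After the cool-down, let a trial call through (half-open state).
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()

def guarded_call(breaker: CircuitBreaker, fn: Callable[[], str], fallback: str) -> str:
    """Run fn behind the breaker; return the fallback when open or on failure."""
    if not breaker.allow():
        return fallback
    try:
        result = fn()
    except Exception:  # a real service would catch only provider errors
        breaker.record_failure()
        return fallback
    breaker.record_success()
    return result
```

A caller might wrap the model call as `guarded_call(breaker, lambda: chat(msg), "The assistant is temporarily unavailable.")`, so users get a clear fallback message instead of a hung request while the provider is down.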