Quick Reference 03

LLM APIs

Quick reference for calling OpenAI, Anthropic, and Google LLM APIs with parameters, streaming, and function calling.

API Comparison at a Glance

Every LLM provider has different SDK conventions, parameter names, and authentication patterns. This table saves you from digging through three different sets of docs when you need to switch or compare providers.

| Feature | OpenAI | Anthropic | Google (Gemini) |
|---|---|---|---|
| SDK | openai | anthropic | google-genai |
| Auth | OPENAI_API_KEY | ANTHROPIC_API_KEY | GOOGLE_API_KEY |
| Models | gpt-4o, gpt-4.1, o3 | claude-sonnet-4, claude-opus-4 | gemini-2.5-pro, gemini-2.5-flash |
| Max output | 16K (gpt-4o) | 32K (claude-opus-4) | 65K (gemini-2.5-pro) |
| Streaming | Yes | Yes | Yes |
| Function calling | Yes (tools) | Yes (tools) | Yes (tools) |
| Vision | Yes | Yes | Yes |
| System prompt | system role | system parameter | system_instruction |
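
To make the system-prompt row concrete, here is a minimal sketch of how the same system and user text map onto each provider's request shape. The helper name `build_request` and the provider keys are hypothetical; the field names are the ones used in the table above and in the examples later in this reference.

```python
def build_request(provider: str, system: str, user: str) -> dict:
    """Return keyword arguments shaped for the given provider (hypothetical helper)."""
    if provider == "openai":
        # OpenAI: the system prompt is a message with role "system".
        return {"messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ]}
    if provider == "anthropic":
        # Anthropic: the system prompt is a top-level `system` parameter.
        return {"system": system,
                "messages": [{"role": "user", "content": user}]}
    if provider == "gemini":
        # Gemini: the system prompt goes into config as `system_instruction`.
        return {"contents": user,
                "config": {"system_instruction": system}}
    raise ValueError(f"unknown provider: {provider}")


openai_kwargs = build_request("openai", "Be brief.", "Hi")
anthropic_kwargs = build_request("anthropic", "Be brief.", "Hi")
gemini_kwargs = build_request("gemini", "Be brief.", "Hi")
```

The returned dictionaries would still need a model name (and, for Anthropic, `max_tokens`) before being passed to the respective client call.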

OpenAI API

OpenAI's API is the de facto standard that most other providers emulate. If you learn one API well, learn this one -- many proxy layers and gateways use its format as a universal interface.

Basic Completion

from openai import OpenAI
client = OpenAI()  # reads OPENAI_API_KEY

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain RAG in one paragraph."}
    ],
    temperature=0.7,
    max_tokens=500,
    top_p=0.95,
    frequency_penalty=0.0,
    presence_penalty=0.0,
    stop=["\n\n"]  # optional stop sequences
)
print(response.choices[0].message.content)

OpenAI Streaming

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True
)
for chunk in stream:
    if not chunk.choices:  # some chunks (e.g. a trailing usage chunk) carry no choices
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
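
When the full text is needed after streaming (for logging or parsing), the deltas can be accumulated as they arrive. A minimal sketch, using stand-in chunk objects built with `SimpleNamespace` instead of a live API stream; `collect_stream` and `fake_chunk` are hypothetical names, and the chunk shape mirrors the `chunk.choices[0].delta.content` access in the loop above.

```python
from types import SimpleNamespace


def collect_stream(stream) -> str:
    """Join the text deltas of a chat-completions-style stream into one string."""
    parts = []
    for chunk in stream:
        if not chunk.choices:      # skip chunks that carry no choices
            continue
        delta = chunk.choices[0].delta.content
        if delta:                  # content can be None, e.g. on the final chunk
            parts.append(delta)
    return "".join(parts)


def fake_chunk(text):
    """Stand-in for an SDK chunk object (assumption: same attribute layout)."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [fake_chunk("Hel"), fake_chunk("lo"), fake_chunk(None),
               SimpleNamespace(choices=[])]
full_text = collect_stream(fake_stream)
print(full_text)  # Hello
```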

OpenAI Function Calling

import json  # needed to parse tool-call arguments below

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["location"]
        }
    }
}]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Weather in Paris?"}],
    tools=tools,
    tool_choice="auto"  # or "required" or {"type":"function","function":{"name":"get_weather"}}
)

# Check if tool call was made
tool_calls = response.choices[0].message.tool_calls
if tool_calls:
    name = tool_calls[0].function.name
    args = json.loads(tool_calls[0].function.arguments)
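
The snippet above stops once the tool call is parsed. The usual next step is to run the matching local function and send its result back in a message with role "tool" that echoes the call's id. A minimal sketch with a stubbed weather function; `get_weather`, `TOOL_REGISTRY`, and `run_tool_call` are hypothetical, and the stand-in object only mimics the `id`, `function.name`, and `function.arguments` attributes of a real tool call.

```python
import json
from types import SimpleNamespace


def get_weather(location: str, unit: str = "celsius") -> dict:
    """Stub implementation; a real one would query a weather service."""
    return {"location": location, "temperature": 18, "unit": unit}


TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_call(tool_call) -> dict:
    """Execute one tool call and build the follow-up "tool" message."""
    fn = TOOL_REGISTRY[tool_call.function.name]
    args = json.loads(tool_call.function.arguments)  # arguments arrive as a JSON string
    result = fn(**args)
    return {
        "role": "tool",
        "tool_call_id": tool_call.id,  # must match the id of the model's call
        "content": json.dumps(result),
    }


# Stand-in for response.choices[0].message.tool_calls[0]
call = SimpleNamespace(
    id="call_123",
    function=SimpleNamespace(name="get_weather",
                             arguments='{"location": "Paris"}'),
)
tool_message = run_tool_call(call)
```

In a real exchange, the assistant message containing the tool calls is appended to `messages` first, then `tool_message`, and the request is sent again so the model can produce its final answer.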

Anthropic API

Anthropic's API differs from OpenAI's in several key ways: the system prompt is a top-level parameter, max_tokens is required, content blocks are structured arrays, and tool use returns typed blocks with already-parsed input instead of JSON strings.

Basic Completion

from anthropic import Anthropic
client = Anthropic()  # reads ANTHROPIC_API_KEY

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system="You are a helpful assistant.",
    messages=[
        {"role": "user", "content": "Explain RAG in one paragraph."}
    ],
    temperature=0.7,
    top_p=0.95,
    stop_sequences=["\n\nHuman:"]
)
print(message.content[0].text)

Anthropic Streaming

with client.messages.stream(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello"}]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

Anthropic Tool Use

tools = [{
    "name": "get_weather",
    "description": "Get current weather for a location",
    "input_schema": {
        "type": "object",
        "properties": {
            "location": {"type": "string", "description": "City name"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
        },
        "required": ["location"]
    }
}]

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "Weather in Paris?"}]
)

# Check for tool use
for block in message.content:
    if block.type == "tool_use":
        tool_name = block.name
        tool_input = block.input  # already a dict
        tool_use_id = block.id
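
To finish the loop on the Anthropic side, the tool's output goes back as a `tool_result` content block inside a user message, keyed by the `tool_use_id` captured above. A minimal sketch; `build_tool_result` is a hypothetical helper, and the block is a stand-in object rather than a live SDK response.

```python
import json
from types import SimpleNamespace


def build_tool_result(block, result) -> dict:
    """Wrap a tool's output in the user message that answers a tool_use block."""
    return {
        "role": "user",
        "content": [{
            "type": "tool_result",
            "tool_use_id": block.id,        # ties the result to the model's request
            "content": json.dumps(result),  # result serialized as text
        }],
    }


# Stand-in for a tool_use block taken from message.content
block = SimpleNamespace(type="tool_use", id="toolu_abc",
                        name="get_weather", input={"location": "Paris"})
reply = build_tool_result(block, {"temperature": 18, "unit": "celsius"})
```

The follow-up request would append the assistant turn (`message.content`) and then `reply` to `messages`, keeping the same `tools` list.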

Google Gemini API

Gemini uses a distinct SDK structure from OpenAI and Anthropic, with config-based parameter passing and its own tool declaration format. It offers competitive pricing and tight integration with Google Cloud services.

Basic Completion

from google import genai

client = genai.Client()  # reads GOOGLE_API_KEY

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Explain RAG in one paragraph.",
    config={
        "system_instruction": "You are a helpful assistant.",
        "temperature": 0.7,
        "top_p": 0.95,
        "max_output_tokens": 1024,
    }
)
print(response.text)

Gemini Streaming

response = client.models.generate_content_stream(
    model="gemini-2.5-flash",
    contents="Hello"
)
for chunk in response:
    if chunk.text:  # chunk.text can be None for chunks without text parts
        print(chunk.text, end="", flush=True)

Gemini Function Calling

from google.genai import types

weather_tool = types.Tool(
    function_declarations=[
        types.FunctionDeclaration(
            name="get_weather",
            description="Get current weather",
            parameters=types.Schema(
                type="OBJECT",
                properties={
                    "location": types.Schema(type="STRING"),
                    "unit": types.Schema(
                        type="STRING", enum=["celsius", "fahrenheit"]
                    ),
                },
                required=["location"],
            ),
        )
    ]
)

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Weather in Paris?",
    config={"tools": [weather_tool]}
)
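
The request above only declares the tool; the model's reply still has to be inspected for function calls. A minimal sketch of dispatching them, assuming the calls are available as objects with `name` and `args` attributes (the shape the google-genai SDK exposes through `response.function_calls`, to the best of my knowledge). `dispatch` and the stubbed `get_weather` are hypothetical, and the calls here are stand-ins rather than a live response.

```python
from types import SimpleNamespace


def get_weather(location: str, unit: str = "celsius") -> dict:
    """Stub implementation for illustration."""
    return {"location": location, "temperature": 18, "unit": unit}


def dispatch(function_calls, handlers) -> list:
    """Run each requested function and pair its name with the result."""
    results = []
    for call in function_calls or []:  # None or empty means no tool was requested
        handler = handlers[call.name]
        results.append((call.name, handler(**call.args)))  # args is already a dict
    return results


# Stand-in for response.function_calls
calls = [SimpleNamespace(name="get_weather", args={"location": "Paris"})]
results = dispatch(calls, {"get_weather": get_weather})
```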

Parameters Reference

The same concept often has different parameter names across providers. This cross-reference table prevents subtle bugs when porting code between OpenAI, Anthropic, and Gemini.

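| Parameter | OpenAI | Anthropic | Gemini | Range | Default |
|---|---|---|---|---|---|
| Temperature | temperature | temperature | temperature | 0.0-2.0 (Anthropic: 0.0-1.0) | 1.0 |
| Top P | top_p | top_p | top_p | 0.0-1.0 | 1.0 |
| Max output | max_tokens (max_completion_tokens for o-series models) | max_tokens (required) | max_output_tokens | 1 to model max | Varies |
| Stop sequences | stop | stop_sequences | stop_sequences | List[str] | None |
| Top K | N/A | top_k | top_k | 1-N | N/A |
| Frequency penalty | frequency_penalty | N/A | frequency_penalty | -2.0 to 2.0 | 0 |
| Presence penalty | presence_penalty | N/A | presence_penalty | -2.0 to 2.0 | 0 |
| Seed | seed | N/A | seed | int | None |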
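
One way to avoid porting bugs is to keep settings under provider-neutral names and translate them at the call site. A minimal sketch based on the parameter names in the table above; `PARAM_NAMES`, `translate_params`, and the neutral keys (`max_output`, `stop`) are hypothetical, and `None` marks a parameter the provider does not offer.

```python
# Provider-specific names for each neutral parameter (None = not supported).
PARAM_NAMES = {
    "temperature": {"openai": "temperature", "anthropic": "temperature", "gemini": "temperature"},
    "top_p": {"openai": "top_p", "anthropic": "top_p", "gemini": "top_p"},
    "max_output": {"openai": "max_tokens", "anthropic": "max_tokens", "gemini": "max_output_tokens"},
    "stop": {"openai": "stop", "anthropic": "stop_sequences", "gemini": "stop_sequences"},
    "top_k": {"openai": None, "anthropic": "top_k", "gemini": "top_k"},
    "frequency_penalty": {"openai": "frequency_penalty", "anthropic": None, "gemini": "frequency_penalty"},
    "presence_penalty": {"openai": "presence_penalty", "anthropic": None, "gemini": "presence_penalty"},
    "seed": {"openai": "seed", "anthropic": None, "gemini": "seed"},
}


def translate_params(params: dict, provider: str) -> tuple:
    """Rename neutral parameters for a provider; also return the ones it lacks."""
    translated, dropped = {}, []
    for key, value in params.items():
        name = PARAM_NAMES[key][provider]
        if name is None:
            dropped.append(key)  # surface unsupported settings instead of silently losing them
        else:
            translated[name] = value
    return translated, dropped


generic = {"temperature": 0.2, "max_output": 256, "stop": ["END"], "seed": 7}
anthropic_params, anthropic_dropped = translate_params(generic, "anthropic")
gemini_params, gemini_dropped = translate_params(generic, "gemini")
```

Returning the dropped keys lets the caller decide whether a missing `seed` or penalty is acceptable or should raise.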

Structured Output (JSON Mode)

Getting reliable JSON from an LLM is critical for any production pipeline that parses model output programmatically. Each provider handles structured output differently -- some have native JSON mode, others require workarounds.

OpenAI

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "List 3 cities as JSON"}],
    response_format={"type": "json_object"}
)

Anthropic

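# Option 1: define a tool whose input_schema is the JSON shape you want and force it
# with tool_choice={"type": "tool", "name": "<tool name>"}; block.input is then the parsed result
# Option 2: instruct in the prompt: "Respond in valid JSON only"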

Gemini

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="List 3 cities as JSON",
    config={"response_mime_type": "application/json"}
)
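
Whichever mechanism is used, it is worth parsing defensively: prompt-only approaches in particular can wrap the JSON in a Markdown code fence or return text that does not parse. A minimal sketch; `parse_json_output` is a hypothetical helper, not part of any SDK.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker


def parse_json_output(text: str):
    """Parse model output as JSON, tolerating a surrounding code fence."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE) and cleaned.endswith(FENCE):
        inner = cleaned[len(FENCE):-len(FENCE)]
        if inner.lower().startswith("json"):  # drop an optional language tag
            inner = inner[len("json"):]
        cleaned = inner.strip()
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError as exc:
        # Raise a clear error so the caller can retry or fall back.
        raise ValueError(f"model did not return valid JSON: {exc}") from exc


plain = parse_json_output('{"cities": ["Paris", "Rome", "Oslo"]}')
fenced = parse_json_output(FENCE + 'json\n{"cities": ["Paris"]}\n' + FENCE)
```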

Error Handling

LLM APIs fail in predictable ways: rate limits, timeouts, and context overflows account for most production errors. Building retry logic in from the start prevents cascading failures in your application.

| Error | HTTP Code | Cause | Action |
|---|---|---|---|
| Rate limit | 429 | Too many requests | Exponential backoff |
| Auth error | 401 | Bad API key | Check key |
| Context overflow | 400 | Input too long | Truncate or chunk |
| Server error | 500/503 | Provider issue | Retry with backoff |
| Timeout | N/A | Slow response | Increase timeout, retry |
| Content filter | 400 | Safety trigger | Rephrase input |

Retry Pattern

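import time
from openai import APIConnectionError, APIStatusError, RateLimitError

def call_with_retry(fn, max_retries=5):
    for attempt in range(max_retries):
        try:
            return fn()
        except (RateLimitError, APIConnectionError):
            # 429s and network failures/timeouts are transient: fall through to the backoff
            pass
        except APIStatusError as e:
            # Only APIStatusError carries status_code; retry 5xx, re-raise other 4xx
            if e.status_code < 500:
                raise
        time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s, ...
    raise RuntimeError("Max retries exceeded")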
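
One refinement worth considering is jitter: if many clients back off on the same `2 ** attempt` schedule, they retry in lockstep and hit the rate limit together again. A minimal sketch of capped exponential backoff with full jitter; `backoff_delay` and its `base`, `cap`, and `rng` parameters are hypothetical names.

```python
import random


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Return a random delay in [0, min(cap, base * 2**attempt)] seconds."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0, ceiling)


rng = random.Random(0)  # seeded only to make the demo repeatable
delays = [backoff_delay(attempt, rng=rng) for attempt in range(8)]
```

In `call_with_retry`, `time.sleep(backoff_delay(attempt))` would take the place of `time.sleep(2 ** attempt)`.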

Cost Estimation

Token pricing varies by 100x across models, and the gap between input and output costs can be 4-5x. Knowing these numbers before you architect your system prevents budget surprises at scale.

| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| GPT-4o | $2.50 | $10.00 |
| GPT-4.1 | $2.00 | $8.00 |
| GPT-4.1-mini | $0.40 | $1.60 |
| Claude Opus 4 | $15.00 | $75.00 |
| Claude Sonnet 4 | $3.00 | $15.00 |
| Claude Haiku 3.5 | $0.80 | $4.00 |
| Gemini 2.5 Pro | $1.25-2.50 | $10.00-15.00 |
| Gemini 2.5 Flash | $0.15 | $0.60 |

Token estimation: in English, 1 token is roughly 4 characters or 0.75 words (about 1.33 tokens per word).
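
These numbers combine into a quick per-request estimate. A minimal sketch; the `PRICES` dictionary copies a few rows from the table above, and its keys along with `estimate_tokens` and `estimate_cost` are hypothetical names. The character-based count is only the rule of thumb stated above, not a real tokenizer.

```python
# USD per 1M tokens as (input, output), copied from the pricing table.
PRICES = {
    "gpt-4o": (2.50, 10.00),
    "gpt-4.1-mini": (0.40, 1.60),
    "claude-sonnet-4": (3.00, 15.00),
    "gemini-2.5-flash": (0.15, 0.60),
}


def estimate_tokens(text: str) -> int:
    """Rough token count using the ~4 characters per token rule of thumb."""
    return max(1, round(len(text) / 4))


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated request cost in USD."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


cost = estimate_cost("gpt-4o", input_tokens=2_000, output_tokens=500)
print(f"${cost:.4f}")  # $0.0100
```

Multiplying the per-request figure by expected daily volume gives a first-order budget; output tokens dominate whenever responses are long.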

Common Pitfalls

These mistakes show up in nearly every first production deployment. Most are trivial to fix if you catch them early, but expensive to debug after launch.

| Pitfall | Problem | Fix |
|---|---|---|
| No retry logic | Failures on transient errors | Implement exponential backoff |
| Ignoring rate limits | 429 errors cascade | Use rate limiter, queue requests |
| Hardcoded model names | Breaks on deprecation | Use config/env vars for model names |
| No timeout | Hung requests | Set timeout parameter (30-120s) |
| Logging full responses | Cost, privacy, storage issues | Log metadata only, redact PII |
| Not counting tokens | Surprise bills | Pre-count with tiktoken (OpenAI) or the provider's token-counting endpoint (e.g. Anthropic's messages.count_tokens) |
| Sync calls in async app | Blocked event loop | Use async client variants |
| Not handling empty responses | NoneType errors | Check response content before accessing |
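
Two of these fixes fit in a few lines: reading the model name from configuration and guarding against empty responses. A minimal sketch for an OpenAI-style response shape; the `LLM_MODEL` environment variable and the `extract_text` helper are hypothetical, and the response objects are stand-ins rather than real SDK objects.

```python
import os
from types import SimpleNamespace

# Model name comes from the environment, with a fallback default.
MODEL = os.environ.get("LLM_MODEL", "gpt-4o")


def extract_text(response, default: str = "") -> str:
    """Return the first choice's text, or `default` when there is none."""
    choices = getattr(response, "choices", None) or []
    if not choices:
        return default
    content = choices[0].message.content
    return content if content else default  # content is None for tool-call-only replies


# Stand-ins for SDK response objects
ok = SimpleNamespace(choices=[SimpleNamespace(message=SimpleNamespace(content="Hi"))])
tool_only = SimpleNamespace(choices=[SimpleNamespace(message=SimpleNamespace(content=None))])
empty = SimpleNamespace(choices=[])
```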