Chapter 17 of 20
Capstone 2: Code Review Agent
Every engineering team has felt it: a pull request sits in the queue for two days because the only person who knows that subsystem is on vacation. When it finally gets reviewed, the reviewer catches a style violation and a missing null check but misses the SQL injection hiding behind a string interp
Part 5 — Capstones
Capstone 2: Code Review Agent
Every engineering team has felt it: a pull request sits in the queue for two days because the only person who knows that subsystem is on vacation. When it finally gets reviewed, the reviewer catches a style violation and a missing null check but misses the SQL injection hiding behind a string interpolation on line 247. Code reviews are simultaneously the most important quality gate in software development and the most inconsistent. This capstone builds an automated PR reviewer that combines static analysis, security scanning, style enforcement, and LLM-generated natural-language feedback into a single agent pipeline: the kind of system that ships in your portfolio and demonstrates every pattern from Parts 1 through 4.
What You Will Learn
- Design an orchestrator agent that delegates code analysis to specialist worker agents
- Integrate Git operations and webhook-driven triggers into an agentic pipeline
- Build AST-parsing and pattern-matching tools for syntax, style, and security analysis
- Implement confidence scoring so the agent distinguishes hard errors from soft suggestions
- Generate natural-language review comments that cite specific lines and explain reasoning
- Deploy the complete system behind a webhook endpoint with observability and cost controls
C2.1 The Problem: Reviews Are Bottlenecks
Software teams treat code review as a prerequisite for merging, and rightly so. Reviews catch bugs, enforce architectural standards, and transfer knowledge across the team. But the process has three structural weaknesses that no amount of process documentation fixes. First, reviews are inconsistent: the same diff reviewed by two engineers produces different findings because each reviewer carries different mental models. Second, reviews are slow: the median time-to-first-response for a PR in a large organization is four hours, and complex changes often wait more than a day. Third, reviews are shallow on security: spotting injection vulnerabilities or insecure deserialization requires specialized knowledge that generalist engineers do not exercise daily.
An automated code review agent does not replace human reviewers. It handles the mechanical work, style enforcement, known vulnerability patterns, documentation gaps, so that humans can focus on architecture, intent, and edge cases requiring domain judgment.
Scope Check
This capstone builds a review assistant, not a replacement for human judgment. The agent posts comments and flags issues; a human approves or requests changes. This human-in-the-loop pattern (Chapter 11) is essential for trust. Teams that ship fully autonomous merge bots discover quickly that LLMs hallucinate false positives, and nothing erodes developer trust faster than a bot blocking a correct PR.
C2.2 System Overview
The code review agent receives a webhook when a pull request is opened or updated. It extracts the diff, routes it through parallel analysis workers, merges findings into a unified report with confidence scores, generates human-readable comments with line-level citations, and posts them back to the pull request. The entire pipeline runs in under ninety seconds for a typical 300-line diff.
- Trigger & Extraction. A webhook endpoint receives the PR event, clones the repository at the target commit, and computes a structured diff with file paths, hunks, and line numbers.
- Parallel Analysis. Three specialist workers run concurrently: a security scanner, a style checker, and an LLM-powered logic analyzer.
- Finding Merge. The orchestrator collects results, deduplicates overlapping findings, and assigns confidence scores (0.0–1.0) based on corroboration across workers.
- Comment Generation. An LLM generates natural-language review comments for each finding, citing the exact file, line range, and code snippet.
- Posting. The agent posts inline comments on the PR via the platform API, batching low-confidence findings into a single summary.
Figure C2-1. Code review pipeline: from PR webhook through parallel analysis to posted review comments.
C2.3 Architecture: Orchestrator and Specialist Workers
The system follows the supervisor-worker pattern from Chapter 10. An orchestrator agent receives the structured diff and delegates to three workers, each with its own tools, system prompt, and output schema.
| Worker | Tools | Output |
|---|---|---|
| Security Scanner | Regex matcher, CVE lookup, entropy-based secret detector | Findings with CWE identifiers, severity, and affected lines |
| Style Checker | AST parser, lint rule engine, naming validator | Style violations with rule references and auto-fix suggestions |
| Logic Analyzer | LLM chain-of-thought (no external tools) | Bugs, edge cases, complexity warnings with reasoning traces |
The security scanner and style checker are deterministic tools that the agent wraps, while the logic analyzer is a pure LLM reasoning task. Two of three workers produce reproducible results independent of model temperature. The third contributes creative reasoning that static tools cannot provide.
Why Not One Big Prompt?
A single prompt that says “review this code for security, style, and logic” produces scattered, low-confidence output. By splitting into specialist workers, each has a narrower scope and structured output schema. The orchestrator can compare and merge results, catching cases where the security scanner flags a line that the logic analyzer considers safe.
C2.4 Git Integration and Diff Extraction
The pipeline begins when a webhook delivers a PR event. The handler validates the signature, extracts base and head SHAs, and computes a structured diff. We parse raw diffs into dataclasses so every downstream worker receives clean, typed data.
@dataclass
class DiffHunk:
start_line: int
end_line: int
content: str
added_lines: list[int] = field(default_factory=list)
@dataclass
class FileDiff:
path: str
language: str
hunks: list[DiffHunk] = field(default_factory=list)
is_new: bool = False
def extract_structured_diff(repo_path: str, base: str, head: str) -> list[FileDiff]:
raw = subprocess.run(
["git", "diff", "--unified=5", f"{base}...{head}"],
cwd=repo_path, capture_output=True, text=True, check=True,
).stdout
files, current_file, current_hunk = [], None, None
for line in raw.splitlines():
if line.startswith("diff --git"):
path = line.split(" b/")[-1]
current_file = FileDiff(path=path, language=_detect_language(path))
files.append(current_file)
elif line.startswith("@@") and current_file:
start = int(line.split(" ")[2].split(",")[0].lstrip("+"))
current_hunk = DiffHunk(start_line=start, end_line=start, content="")
current_file.hunks.append(current_hunk)
elif current_hunk is not None:
current_hunk.content += line + "\n"
if line.startswith("+") and not line.startswith("+++"):
current_hunk.added_lines.append(current_hunk.end_line)
current_hunk.end_line += 1
return files
C2.5 Building the Security Scanner
The security scanner applies pattern-matching rules to added lines. It uses regex patterns, entropy calculations, and a curated rule database — no LLM calls. This makes it fast, reproducible, and auditable.
@dataclass
class SecurityFinding:
rule_id: str
severity: str # "critical", "high", "medium", "low"
cwe: str
file: str
line: int
snippet: str
message: str
confidence: float # 0.0 to 1.0
SECURITY_RULES = [
{"id": "SEC-001", "cwe": "CWE-89", "severity": "critical",
"pattern": re.compile(r"""f['\"].*(?:SELECT|INSERT|UPDATE|DELETE)\s.*\{""", re.I),
"message": "SQL injection via f-string. Use parameterized queries.",
"confidence": 0.92},
{"id": "SEC-002", "cwe": "CWE-798", "severity": "high",
"pattern": re.compile(r"""(?:password|secret|api_key)\s*=\s*['\"][^'\"]{8,}['\"]"""),
"message": "Hardcoded credential. Move to environment variables.",
"confidence": 0.85},
{"id": "SEC-003", "cwe": "CWE-79", "severity": "high",
"pattern": re.compile(r"""\.innerHTML\s*=\s*[^;]*(?:user|input|query)""", re.I),
"message": "XSS via innerHTML with user data. Use textContent or sanitize.",
"confidence": 0.80},
{"id": "SEC-004", "cwe": "CWE-502", "severity": "critical",
"pattern": re.compile(r"""pickle\.loads?\s*\("""),
"message": "Unsafe deserialization. Use json or a safe serializer.",
"confidence": 0.95},
]
def scan_security(files: list[FileDiff]) -> list[SecurityFinding]:
findings = []
for f in files:
for hunk in f.hunks:
for i, line in enumerate(hunk.content.splitlines()):
if not line.startswith("+"):
continue
code = line[1:]
for rule in SECURITY_RULES:
if rule["pattern"].search(code):
findings.append(SecurityFinding(
rule_id=rule["id"], severity=rule["severity"],
cwe=rule["cwe"], file=f.path,
line=hunk.start_line + i, snippet=code.strip(),
message=rule["message"], confidence=rule["confidence"],
))
# Entropy-based secret detection
for token in re.findall(r"""['\"]([^'\"]{20,})['\"]""", code):
if _entropy(token) > 4.5:
findings.append(SecurityFinding(
rule_id="SEC-ENT", severity="high", cwe="CWE-798",
file=f.path, line=hunk.start_line + i,
snippet=code.strip(), confidence=0.70,
message=f"High-entropy string ({_entropy(token):.1f}) "
f"may be a hardcoded secret.",
))
return findings
False Positive Management
Every static analysis tool generates false positives. This system uses three strategies: (1) confidence scores let the orchestrator suppress low-confidence findings, (2) a .reviewignore file lets developers mark intentional patterns like test fixtures, and (3) the LLM-generated comment explains why the pattern was flagged, giving developers enough context to dismiss false positives quickly.
C2.6 AST Parsing and Style Checking
The style checker uses Abstract Syntax Tree parsing for structural awareness beyond regex. It understands scope, type annotations, and function signatures. Analysis is scoped to lines modified in the diff — developers rightly object when a bot comments on code they did not touch.
class PythonStyleAnalyzer(ast.NodeVisitor):
def __init__(self, path: str, changed_lines: set[int]):
self.path = path
self.changed = changed_lines
self.findings: list[StyleFinding] = []
def visit_FunctionDef(self, node):
if node.lineno not in self.changed:
return self.generic_visit(node)
length = (node.end_lineno or node.lineno) - node.lineno
if length > 40:
self._add("STY-001", node.lineno,
f"Function '{node.name}' is {length} lines. Extract helpers.",
"Break into smaller single-responsibility functions.", 0.80)
if node.returns is None and not node.name.startswith("_"):
self._add("STY-002", node.lineno,
f"Public function '{node.name}' lacks return type annotation.",
"Add -> ReturnType after parameters.", 0.85)
if len(node.args.args) > 5:
self._add("STY-003", node.lineno,
f"Function '{node.name}' has {len(node.args.args)} params.",
"Group related parameters into a dataclass.", 0.75)
self.generic_visit(node)
def visit_ExceptHandler(self, node):
if node.lineno in self.changed and node.type is None:
self._add("STY-004", node.lineno,
"Bare except catches KeyboardInterrupt and SystemExit.",
"Catch a specific exception type.", 0.95)
self.generic_visit(node)
def _add(self, rule_id, line, message, suggestion, confidence):
self.findings.append(StyleFinding(
rule_id=rule_id, file=self.path, line=line,
message=message, suggestion=suggestion, confidence=confidence))
C2.7 The Logic Analyzer: LLM-Powered Reasoning
The logic analyzer is the only worker that uses the LLM for its core analysis. It applies chain-of-thought reasoning (Chapter 5) to identify bugs, edge cases, and design problems that no static rule can catch.
class LogicFinding(BaseModel):
file: str
start_line: int
end_line: int
category: str # "bug", "edge_case", "complexity", "performance"
severity: str # "error", "warning", "suggestion"
description: str
reasoning: str # Chain-of-thought explanation
suggestion: str
confidence: float
LOGIC_SYSTEM_PROMPT = """You are an expert code reviewer. Analyze the diff
for bugs, edge cases, and design issues. Rules:
1. Focus ONLY on added/changed lines (starting with +).
2. Explain reasoning step by step for each finding.
3. Confidence: 0.9+ = certain bug, 0.7-0.9 = likely issue, 0.5-0.7 = suggestion.
4. Do NOT flag style or security issues (other workers handle those).
5. Cite specific line numbers and code snippets."""
def analyze_logic(files, client, model="gpt-4o"):
context = "\n\n".join(
f"### {f.path} (lines {h.start_line}-{h.end_line})\n```{f.language}\n{h.content}```"
for f in files if not f.is_new for h in f.hunks
)[:30_000]
response = client.beta.chat.completions.parse(
model=model, temperature=0.2,
messages=[
{"role": "system", "content": LOGIC_SYSTEM_PROMPT},
{"role": "user", "content": f"Review this diff:\n\n{context}"},
],
response_format=LogicAnalysisResult,
)
return response.choices[0].message.parsed
The temperature is 0.2 for analytical consistency. The system prompt explicitly excludes style and security concerns, preventing overlap with the deterministic workers. Pydantic structured output (Chapter 6) ensures the orchestrator can process findings programmatically.
C2.8 The Orchestrator: Merging and Confidence Scoring
The orchestrator runs all three workers concurrently, normalizes their outputs into a unified schema, and deduplicates overlapping findings. When multiple independent workers flag the same line, confidence is boosted. A SQL injection found by both regex and LLM reasoning is more credible than either alone.
async def run_review_pipeline(files, client, config):
security_task = asyncio.to_thread(scan_security, files)
style_task = asyncio.to_thread(analyze_style, files)
logic_task = asyncio.to_thread(analyze_logic, files, client, config["model"])
security, style, logic = await asyncio.gather(
security_task, style_task, logic_task)
# Normalize into UnifiedFinding, then deduplicate
unified = _normalize(security, style, logic.findings)
merged = _deduplicate(unified)
threshold = config.get("confidence_threshold", 0.6)
return [f for f in merged if f.confidence >= threshold]
def _deduplicate(findings):
by_loc = {}
for f in findings:
by_loc.setdefault((f.file, f.line), []).append(f)
merged = []
for group in by_loc.values():
primary = max(group, key=lambda f: f.confidence)
all_sources = list({s for f in group for s in f.sources})
# Each corroborating source adds 0.1, capped at 1.0
primary.confidence = min(1.0, primary.confidence + 0.1 * (len(all_sources) - 1))
primary.sources = all_sources
merged.append(primary)
return merged
C2.9 Generating Natural-Language Comments
Raw findings need transformation into readable, actionable PR comments. The comment generator uses a structured prompt to produce comments that start with a severity indicator, state the issue in one sentence, show the problematic snippet, explain the risk, and suggest a fix — all within 150 words.
COMMENT_PROMPT = """Write a code review comment for this finding.
Format: severity indicator, one-sentence issue, code snippet, why it matters,
specific fix suggestion. Keep under 150 words.
Finding: {file}:{line} [{severity}] {message}
Reasoning: {reasoning}
Snippet: {snippet}"""
async def generate_comments(findings, file_contents, client, model="gpt-4o"):
comments = []
for f in findings:
snippet = _extract_snippet(file_contents.get(f.file, ""), f.line)
resp = await asyncio.to_thread(lambda: client.chat.completions.create(
model=model, temperature=0.3, max_tokens=300,
messages=[{"role": "user", "content": COMMENT_PROMPT.format(
file=f.file, line=f.line, severity=f.severity,
message=f.message, reasoning=f.reasoning, snippet=snippet)}],
))
comments.append(ReviewComment(
file=f.file, line=f.line,
body=resp.choices[0].message.content, severity=f.severity))
return comments
C2.10 Posting Comments and the Webhook Endpoint
The final stage posts review comments atomically as a single GitHub review. The event is always "COMMENT". The agent surfaces information but never approves or blocks a PR.
async def post_review(comments, repo, pr_number, commit_sha, token):
review_body = {
"commit_id": commit_sha,
"event": "COMMENT", # Never auto-approve or request changes
"body": f"**Code Review Agent** found **{len(comments)}** item(s).",
"comments": [
{"path": c.file, "line": c.line, "body": c.body}
for c in comments
],
}
async with httpx.AsyncClient() as http:
resp = await http.post(
f"https://api.github.com/repos/{repo}/pulls/{pr_number}/reviews",
json=review_body,
headers={"Authorization": f"Bearer {token}",
"Accept": "application/vnd.github.v3+json"},
)
resp.raise_for_status()
@app.post("/webhook")
async def handle_webhook(request: Request):
# Validate signature, extract PR data, run pipeline, post review
payload = await request.json()
if payload.get("action") not in ("opened", "synchronize"):
return {"status": "skipped"}
pr = payload["pull_request"]
files = extract_structured_diff(repo_path, pr["base"]["sha"], pr["head"]["sha"])
findings = await run_review_pipeline(files, client, config)
comments = await generate_comments(findings, get_contents(files), client)
await post_review(comments, payload["repository"]["full_name"],
pr["number"], pr["head"]["sha"], GITHUB_TOKEN)
return {"findings": len(findings), "comments": len(comments)}
Never Auto-Approve
The event field is always "COMMENT", never "APPROVE" or "REQUEST_CHANGES". An automated agent should surface information, not make merge decisions. Teams that trust the agent after weeks of calibration can upgrade to "REQUEST_CHANGES" for critical-severity findings only.
C2.11 Observability and Cost Control
A production review agent needs monitoring across three dimensions: correctness (are findings useful?), latency (does the review post before the developer context-switches?), and cost (how many tokens per review?). Track three key metrics: the dismissal rate (how often developers dismiss findings), the catch rate (how often the agent flags real issues), and the cost per review.
async def run_pipeline_instrumented(files, client, config):
start = time.monotonic()
total_lines = sum(len(h.added_lines) for f in files for h in f.hunks)
logger.info("review.started", files=len(files), lines=total_lines)
findings = await run_review_pipeline(files, client, config)
elapsed = time.monotonic() - start
# Cost estimation: ~15 tokens/line, GPT-4o pricing
est_tokens = total_lines * 15
logger.info("review.completed", findings=len(findings),
seconds=round(elapsed, 2),
est_cost_usd=round(est_tokens * 0.000005, 4))
return findings
Feedback Loop
Store every finding and its resolution (accepted, dismissed, modified) in a database. Periodically review dismissed findings: if SEC-002 is dismissed 40% of the time in test files, add a suppression rule for **/test_*.py. This feedback loop turns a noisy tool into a trusted teammate.
Portfolio Project: Build Your Code Review Agent
Build and deploy a complete automated PR review agent. Your agent must receive webhook events, run at least two analysis workers in parallel, merge findings with confidence scores, and post inline review comments. Include a .reviewignore file for false positive suppression and structured logging for observability.
Choose Your Domain Variant
DevOps Pipeline Reviewer Dockerfile, CI/CD YAML, IaC. Flag insecure base images, exposed ports, missing health checks, overly permissive IAM.
FinTech Compliance Checker PCI-DSS violations: logged card numbers, unencrypted PII, missing audit trails, transaction validation gaps.
Healthcare HIPAA Scanner PHI exposure in logs, unencrypted data at rest, missing access controls, HIPAA-relevant data flows.
Open Source Maintainer Bot License compatibility, API breaking changes, documentation coverage, test completeness. Contributor-friendly tone.
Mobile App Reviewer Permission escalation, insecure local storage, certificate pinning, background data leaks, platform anti-patterns.
Data Pipeline Auditor Schema drift, missing validation, null propagation, partition skew, cost-explosive query patterns in ETL code.
Summary
This capstone assembled a complete automated PR review agent combining deterministic static analysis with LLM-powered reasoning. The supervisor-worker pattern runs three specialist agents in parallel — security scanning, style checking, and logic analysis — then merges findings with confidence scoring and posts natural-language review comments.
Key Takeaways
- Separate deterministic tools from LLM reasoning. Static rules for security and style are reproducible and auditable; reserve the LLM for judgment calls requiring contextual reasoning about logic, edge cases, and design.
- Confidence scoring turns noise into signal. Every finding carries a 0.0–1.0 score. Corroborating sources boost confidence. A configurable threshold lets teams tune thoroughness versus noise.
- Scope analysis to the diff. Developers lose trust in tools that comment on lines they did not change. Restrict analysis to added and modified lines only.
- Post comments, never approvals. Use
COMMENTnotAPPROVEorREQUEST_CHANGES. Surface information for human decision-makers; prevent false positives from blocking merges. - Build the feedback loop from day one. Track dismissal rates, catch rates, and cost per review. Store every finding and resolution to tune rules and thresholds over time.
Exercises
| Type | Exercise | Description |
|---|---|---|
| Conceptual | Confidence Calibration | The security scanner assigns pickle.loads a confidence of 0.95 and high-entropy strings 0.70. A team finds 30% of entropy findings are false positives (UUID constants) while pickle findings are 100% correct. How would you adjust scores, and what data would you collect to automate calibration? |
| Coding | Multi-Language Support | The style checker handles only Python via ast. Extend it to JavaScript/TypeScript using tree-sitter. Implement three rules: arrow function consistency, unused imports, and missing error handling in async/await chains. Match the existing StyleFinding schema. |
| Design | Rate Limiting and Cost Budgets | A monorepo with 50 developers generates 200 PRs/day. Each review uses ~15k input and ~3k output tokens. Design a system to stay within $500/month: consider per-PR token caps, priority queues for security paths, caching for unchanged files, and graceful degradation when budget is exhausted. |