Chapter 16 of 18

Capstone 3: Intelligent Test Suite Generator

Part 5 — Capstones

Test suites decay. New features get added without corresponding tests, defect patterns repeat in the same modules, and nobody has time to reprioritize the regression suite. In this capstone, you will build a system that generates, prioritizes, and maintains test suites by learning from your application's change history and defect data, turning reactive QA into proactive quality engineering.

Building time: ~2 hours
Chapters used: 9, 10, 12, 15

What You Will Build

A change analyzer that reads git diffs and identifies which features and modules were modified
A defect pattern engine that mines historical bug data to find recurring failure modes
An intelligent test generator that creates new test cases targeting high-risk changes
A test suite optimizer that ranks and trims the regression suite based on risk and coverage
A test data generator that produces realistic, privacy-safe test data for each test case

Figure C3.1 — Three input channels (code changes, defect history, existing tests) feed into a risk analysis engine that drives targeted test generation.

Architecture Overview

The system has three input channels: code changes, defect history, and existing test coverage. These feed a central intelligence layer, which scores each module's risk and then uses an LLM to turn the highest-risk signals into prioritized test cases. The Pydantic models below describe the shape of the data each stage works with.

from pydantic import BaseModel, Field
from enum import Enum
from typing import Optional

class RiskLevel(str, Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"

class ChangeType(str, Enum):
    ADDED = "added"
    MODIFIED = "modified"
    DELETED = "deleted"
    RENAMED = "renamed"

class CodeChange(BaseModel):
    file_path: str
    change_type: ChangeType
    lines_added: int = 0
    lines_deleted: int = 0
    diff_content: str = ""
    module: str = Field(description="Inferred module or feature area")

class DefectRecord(BaseModel):
    id: str
    title: str
    module: str
    severity: str
    root_cause: Optional[str] = None
    date_found: str
    component: Optional[str] = None

class RiskAssessment(BaseModel):
    module: str
    risk_level: RiskLevel
    risk_score: float = Field(ge=0, le=100)
    change_factor: float = Field(description="Risk from recent changes")
    defect_factor: float = Field(description="Risk from historical defects")
    coverage_factor: float = Field(description="Risk from low test coverage")
    rationale: str

class GeneratedTest(BaseModel):
    id: str
    title: str
    module: str
    risk_level: RiskLevel
    test_type: str
    preconditions: list[str]
    steps: list[str]
    expected_result: str
    test_data: Optional[dict] = None
    triggered_by: str = Field(description="What triggered this test: change, defect, or gap")

Step 1: Setup and Data Ingestion

The change analyzer runs git log --name-status to find recently modified files and maps each file path to a logical module name (e.g., auth/login.py maps to "Authentication") using a configurable lookup table. For each changed file, it extracts the diff content and counts lines added and deleted. Files with large diffs signal higher change risk.
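The parsing step described above can be sketched as follows. This is a minimal sketch, not the chapter's actual `modules/change_analyzer.py`: the `MODULE_MAP` prefixes and the helper names `infer_module` and `parse_name_status` are assumptions made for illustration, and line counts and diff content (which could come from `git log --numstat` and `git diff`) are left out for brevity.

```python
import subprocess

# Hypothetical lookup table: path prefix -> logical module name.
MODULE_MAP = {
    "auth/": "Authentication",
    "cart/": "Shopping Cart",
    "payments/": "Payment Processing",
}

# First letter of git's status column -> change type used in this chapter.
STATUS_TO_TYPE = {"A": "added", "M": "modified", "D": "deleted", "R": "renamed"}


def infer_module(file_path: str) -> str:
    """Map a file path to a module name via the first matching prefix."""
    for prefix, module in MODULE_MAP.items():
        if file_path.startswith(prefix):
            return module
    return "Unknown"


def parse_name_status(output: str) -> list[dict]:
    """Parse `git log --name-status` output into one record per file."""
    changes: dict[str, dict] = {}
    for line in output.splitlines():
        parts = line.split("\t")
        if len(parts) < 2 or parts[0][:1] not in STATUS_TO_TYPE:
            continue  # blank separator lines or statuses this sketch ignores
        path = parts[-1]  # for renames (R100<TAB>old<TAB>new) the new path is last
        # git log lists newest commits first, so keep the first record per path
        changes.setdefault(path, {
            "file_path": path,
            "change_type": STATUS_TO_TYPE[parts[0][:1]],
            "module": infer_module(path),
        })
    return list(changes.values())


def get_recent_changes(repo_path: str = ".", since: str = "1 week ago") -> list[dict]:
    """Run git in repo_path and return parsed changes since the given date."""
    result = subprocess.run(
        ["git", "log", f"--since={since}", "--name-status", "--pretty=format:"],
        cwd=repo_path, capture_output=True, text=True, check=True,
    )
    return parse_name_status(result.stdout)
```

Keeping the parser separate from the `subprocess` call means the mapping logic can be unit-tested with a captured string instead of a live repository.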

The defect analyzer loads historical bug records (from CSV or JSON), counts defects per module, and calculates defect density. It produces a hotspot report showing which modules have the most bugs, with severity breakdowns. Modules with critical defects get flagged as higher risk. A sample dataset with bugs across Authentication, Shopping Cart, Payment Processing, Reporting, and Order Management demonstrates the analysis.
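A hotspot report of this kind might be built as in the sketch below. The output keys `hotspots`, `module`, `defect_count`, and `severity_breakdown` match what the risk engine in Step 2 reads; everything else is an assumption. In particular, the chapter does not say what denominator "defect density" uses, so this sketch substitutes a module's share of all defects (`defect_share`), and the three records are made up rather than taken from the chapter's sample dataset.

```python
from collections import Counter, defaultdict


def analyze_defect_patterns(defects: list[dict]) -> dict:
    """Count defects per module and break them down by severity."""
    by_module: dict[str, Counter] = defaultdict(Counter)
    for d in defects:
        severity = str(d.get("severity", "unknown")).lower()
        by_module[d.get("module", "Unknown")][severity] += 1

    total = len(defects)
    hotspots = []
    for module, severities in by_module.items():
        count = sum(severities.values())
        hotspots.append({
            "module": module,
            "defect_count": count,
            "severity_breakdown": dict(severities),
            # Assumption: "density" here is the module's share of all defects
            "defect_share": round(count / total, 2),
        })
    # Most defect-prone modules first; ties broken alphabetically
    hotspots.sort(key=lambda h: (-h["defect_count"], h["module"]))
    return {"total_defects": total, "hotspots": hotspots}


# Hypothetical records, not the chapter's SAMPLE_DEFECTS
sample = [
    {"id": "BUG-1", "module": "Authentication", "severity": "Critical"},
    {"id": "BUG-2", "module": "Authentication", "severity": "high"},
    {"id": "BUG-3", "module": "Shopping Cart", "severity": "medium"},
]
patterns = analyze_defect_patterns(sample)
print(patterns["hotspots"][0]["module"], patterns["hotspots"][0]["defect_count"])
```

Lowercasing the severity matters because the risk engine looks up the key `"critical"` exactly; a CSV export that spells it "Critical" would otherwise never trigger the critical-defect bonus.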

Step 2: Core Processing Pipeline — Risk-Based Test Generation

The risk analysis engine combines change data and defect patterns to calculate a risk score for each module. Modules with both recent changes and a history of defects get the highest scores:

"""modules/risk_engine.py — Calculate risk scores and prioritize test generation."""
from typing import Optional

# Scoring is purely rule-based, so this module needs no LLM client.


def calculate_risk_scores(
    changes: list[dict],
    defect_patterns: dict,
    existing_test_count: Optional[dict] = None,
) -> list[dict]:
    """Calculate risk score for each module based on changes, defects, and coverage.

    existing_test_count maps module name -> number of existing tests;
    modules missing from it are treated as having no tests.
    """
    if existing_test_count is None:
        existing_test_count = {}

    # Gather all modules mentioned in changes or defects
    modules = set()
    for c in changes:
        modules.add(c.get("module", "Unknown"))
    for hotspot in defect_patterns.get("hotspots", []):
        modules.add(hotspot["module"])

    assessments = []
    for module in sorted(modules):
        # Change factor: more lines changed = higher risk.
        # Use the same "Unknown" default as above so untagged changes still count.
        module_changes = [c for c in changes if c.get("module", "Unknown") == module]
        total_lines = sum(
            c.get("lines_added", 0) + c.get("lines_deleted", 0)
            for c in module_changes
        )
        change_factor = min(100, total_lines * 2)  # Cap at 100

        # Defect factor: more historical defects = higher risk
        hotspot = next(
            (h for h in defect_patterns.get("hotspots", [])
             if h["module"] == module),
            None,
        )
        defect_count = hotspot["defect_count"] if hotspot else 0
        has_critical = False
        if hotspot:
            has_critical = hotspot.get("severity_breakdown", {}).get("critical", 0) > 0
        defect_factor = min(100, defect_count * 20 + (30 if has_critical else 0))

        # Coverage factor: fewer existing tests = higher risk
        test_count = existing_test_count.get(module, 0)
        coverage_factor = max(0, 100 - test_count * 10)  # 0 tests = 100 risk

        # Weighted composite score
        risk_score = (
            change_factor * 0.40 +
            defect_factor * 0.35 +
            coverage_factor * 0.25
        )

        risk_level = (
            "critical" if risk_score >= 75 else
            "high" if risk_score >= 50 else
            "medium" if risk_score >= 25 else
            "low"
        )

        assessments.append({
            "module": module,
            "risk_level": risk_level,
            "risk_score": round(risk_score, 1),
            "change_factor": round(change_factor, 1),
            "defect_factor": round(defect_factor, 1),
            "coverage_factor": round(coverage_factor, 1),
            "rationale": (
                f"{len(module_changes)} files changed ({total_lines} lines), "
                f"{defect_count} historical defects"
                f"{' (includes critical)' if has_critical else ''}, "
                f"{test_count} existing tests"
            ),
        })

    # Sort by risk score descending
    assessments.sort(key=lambda a: a["risk_score"], reverse=True)
    return assessments
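To make the weighting concrete, the sketch below restates the formula from `calculate_risk_scores` as a standalone helper and walks through two cases. The helper name `composite_score` and the input numbers are illustrative only.

```python
def composite_score(total_lines: int, defect_count: int,
                    has_critical: bool, test_count: int) -> tuple[float, str]:
    """Apply the same factors, weights, and thresholds as the risk engine."""
    change_factor = min(100, total_lines * 2)
    defect_factor = min(100, defect_count * 20 + (30 if has_critical else 0))
    coverage_factor = max(0, 100 - test_count * 10)
    score = change_factor * 0.40 + defect_factor * 0.35 + coverage_factor * 0.25
    level = ("critical" if score >= 75 else
             "high" if score >= 50 else
             "medium" if score >= 25 else
             "low")
    return round(score, 1), level


# 30 changed lines, 2 defects (one critical), 3 existing tests:
# 60*0.40 + 70*0.35 + 70*0.25 = 24 + 24.5 + 17.5 = 66.0
busy = composite_score(30, 2, True, 3)

# Untouched module with no defects and no recorded tests:
# 0 + 0 + 100*0.25 = 25.0, which already meets the "medium" threshold
idle = composite_score(0, 0, False, 0)

print(busy, idle)
```

The second case shows that when no test counts are supplied, the coverage term alone contributes 25 points, so no module is ever rated low. Passing real counts through `existing_test_count` is what lets well-tested, stable modules drop out of generation.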

With risk scores calculated, the test generator focuses on the highest-risk modules:

"""modules/test_generator.py — Generate test cases based on risk assessments."""
import json
import os
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

GENERATION_PROMPT = """You are a senior QA engineer designing test cases for
the {module} module. This module has been flagged as {risk_level} risk.

Risk context:
{rationale}

Recent code changes in this module:
{change_summary}

Historical defect patterns in this module:
{defect_summary}

Generate {test_count} focused test cases that specifically target:
1. Areas where code changed recently (regression risk)
2. Failure patterns seen in historical defects (repeat risk)
3. Edge cases and boundary conditions in the changed code

For each test case, return a JSON object:
{{
  "title": "concise test title",
  "test_type": "regression|smoke|boundary|negative|integration",
  "preconditions": ["list of preconditions"],
  "steps": ["ordered test steps"],
  "expected_result": "what should happen",
  "triggered_by": "change|defect_pattern|coverage_gap"
}}

Return a JSON array. No markdown fences."""

_counter = 0

def generate_tests_for_module(
    assessment: dict,
    changes: list[dict],
    defect_patterns: dict,
) -> list[dict]:
    """Generate test cases for a specific module based on its risk profile."""
    global _counter

    module = assessment["module"]

    # Determine how many tests to generate based on risk
    risk_to_count = {"critical": 5, "high": 4, "medium": 2, "low": 1}
    test_count = risk_to_count.get(assessment["risk_level"], 2)

    # Summarize changes for this module (same "Unknown" default as the risk engine)
    module_changes = [c for c in changes if c.get("module", "Unknown") == module]
    if module_changes:
        change_summary = "\n".join(
            f"- {c['file_path']} ({c['change_type']}, "
            f"+{c.get('lines_added', 0)}/-{c.get('lines_deleted', 0)} lines)"
            for c in module_changes[:5]
        )
    else:
        change_summary = "No recent changes detected."

    # Summarize defect patterns
    hotspot = next(
        (h for h in defect_patterns.get("hotspots", [])
         if h["module"] == module),
        None,
    )
    if hotspot:
        defect_summary = (
            f"Total defects: {hotspot['defect_count']}\n"
            f"Severity breakdown: {json.dumps(hotspot.get('severity_breakdown', {}))}"
        )
    else:
        defect_summary = "No historical defects recorded."

    prompt = GENERATION_PROMPT.format(
        module=module,
        risk_level=assessment["risk_level"],
        rationale=assessment["rationale"],
        change_summary=change_summary,
        defect_summary=defect_summary,
        test_count=test_count,
    )

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a QA engineer. Return only a JSON array."},
            {"role": "user", "content": prompt},
        ],
        temperature=0.3,
        max_tokens=2000,
    )

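    # The model may return no content, wrap the array in prose, or emit invalid
    # JSON, so parse defensively and fall back to an empty list.
    raw = (response.choices[0].message.content or "").strip()
    try:
        test_cases = json.loads(raw)
    except json.JSONDecodeError:
        # Retry on the outermost [...] span in case extra text surrounds the array
        start, end = raw.find("["), raw.rfind("]")
        try:
            test_cases = json.loads(raw[start:end + 1]) if 0 <= start < end else []
        except json.JSONDecodeError:
            test_cases = []
    if not isinstance(test_cases, list):
        test_cases = []
    # Drop anything that is not a JSON object so the ID assignment below is safe
    test_cases = [tc for tc in test_cases if isinstance(tc, dict)]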

    # Assign IDs and module
    for tc in test_cases:
        _counter += 1
        tc["id"] = f"TC-{_counter:03d}"
        tc["module"] = module
        tc["risk_level"] = assessment["risk_level"]

    return test_cases


def generate_all_tests(
    assessments: list[dict],
    changes: list[dict],
    defect_patterns: dict,
) -> list[dict]:
    """Generate tests for all modules, prioritized by risk."""
    all_tests = []
    for assessment in assessments:
        if assessment["risk_level"] == "low":
            print(f"  Skipping {assessment['module']} (low risk)")
            continue
        print(f"  Generating tests for {assessment['module']} "
              f"({assessment['risk_level']} risk)...")
        tests = generate_tests_for_module(assessment, changes, defect_patterns)
        all_tests.extend(tests)
        print(f"    -> {len(tests)} tests generated")
    return all_tests
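Parsing the response as JSON does not guarantee that each item follows the structure the prompt asks for. One way to guard against that is a small structural check before the tests enter the suite, sketched below. The helper `filter_valid_tests` is hypothetical; the required keys and allowed `test_type` values are copied from `GENERATION_PROMPT`.

```python
# Keys and test types requested in GENERATION_PROMPT
REQUIRED_KEYS = {"title", "test_type", "preconditions", "steps",
                 "expected_result", "triggered_by"}
ALLOWED_TYPES = {"regression", "smoke", "boundary", "negative", "integration"}


def filter_valid_tests(test_cases: list) -> tuple[list[dict], list[str]]:
    """Split raw LLM output into usable test cases and rejection reasons."""
    valid, rejected = [], []
    for i, tc in enumerate(test_cases):
        if not isinstance(tc, dict):
            rejected.append(f"item {i}: not a JSON object")
            continue
        missing = REQUIRED_KEYS - set(tc)
        if missing:
            rejected.append(f"item {i}: missing {sorted(missing)}")
        elif tc["test_type"] not in ALLOWED_TYPES:
            rejected.append(f"item {i}: unknown test_type {tc['test_type']!r}")
        elif not isinstance(tc["steps"], list) or not tc["steps"]:
            rejected.append(f"item {i}: steps must be a non-empty list")
        else:
            valid.append(tc)
    return valid, rejected
```

Returning the rejection reasons, rather than silently dropping items, makes it easier to spot when a prompt change starts producing malformed output. The `GeneratedTest` Pydantic model from the architecture section could serve the same purpose with stricter typing.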

Step 3: Output Generation — Test Data and Suite Report

Each test case needs realistic test data. The test data generator uses a two-step approach: the LLM determines what data fields each test case needs (e.g., user_email: email, order_total: amount), then the Faker library generates realistic values for each field type. For negative tests, it produces intentionally invalid data, including empty fields, SQL injection strings, XSS payloads, and oversized inputs. For boundary tests, it generates min/max values, Unicode text, floating-point precision edge cases, and zero-length strings.
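The three kinds of data described above could be produced along these lines. This is a sketch under assumptions: the function names, the specific payload strings, the `name` field type, and the 255-character default limit are all illustrative, and the real `modules/test_data.py` may be organized differently. Faker is imported inside `valid_value` so that the other two helpers run without the third-party package installed.

```python
def invalid_values(max_length: int = 255) -> dict[str, str]:
    """Intentionally invalid inputs for negative tests."""
    return {
        "empty": "",
        "sql_injection": "' OR '1'='1' --",
        "xss": "<script>alert(1)</script>",
        "oversized": "A" * (max_length + 1),  # one character past the limit
    }


def boundary_values(minimum: int, maximum: int) -> dict[str, object]:
    """Edge-case inputs for boundary tests."""
    return {
        "min": minimum,
        "max": maximum,
        "unicode": "Zoë 測試 🙂",
        "float_precision": 0.1 + 0.2,  # not exactly 0.3 in binary floating point
        "zero_length": "",
    }


def valid_value(field_type: str):
    """Realistic value for a field type, generated with Faker."""
    from faker import Faker  # third-party (pip install faker), imported lazily

    fake = Faker()
    generators = {
        "email": fake.email,
        "name": fake.name,
        # Amount between 1.00 and 500.00, built from an integer number of cents
        "amount": lambda: fake.random_int(min=100, max=50_000) / 100,
    }
    return generators[field_type]()


negative_data = invalid_values(max_length=10)
edge_data = boundary_values(1, 99)
```

Because Faker produces synthetic values rather than sampling production records, the generated data stays privacy-safe by construction. Calling `Faker.seed(...)` before generating makes runs reproducible, which helps when a generated value triggers a failure that needs to be replayed.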

The suite report generator produces a Markdown report with three sections. The first is a risk assessment summary table showing each module's risk level, score, change factor, defect factor, and number of generated tests. The second groups test cases by module with full details (preconditions, steps, expected results, and test data in JSON format). The third is a recommended execution order that lists all tests ranked by risk level, so the tests most likely to catch defects run first.
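Two of those report sections, the summary table and the execution order, can be sketched with the standard library alone. The function names `render_summary_table` and `execution_order` are hypothetical; the dictionary keys are the ones produced by the risk engine and test generator in Step 2.

```python
from collections import Counter

# Lower number = run earlier
RISK_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3}


def render_summary_table(assessments: list[dict], test_cases: list[dict]) -> str:
    """Render the risk assessment summary as a Markdown table."""
    tests_per_module = Counter(t.get("module") for t in test_cases)
    lines = [
        "| Module | Risk | Score | Change | Defect | Tests |",
        "|---|---|---|---|---|---|",
    ]
    for a in assessments:
        lines.append(
            f"| {a['module']} | {a['risk_level']} | {a['risk_score']} "
            f"| {a['change_factor']} | {a['defect_factor']} "
            f"| {tests_per_module[a['module']]} |"
        )
    return "\n".join(lines)


def execution_order(test_cases: list[dict]) -> list[dict]:
    """Rank tests by risk level, highest risk first."""
    # sorted() is stable, so tests keep their generation order within a level
    return sorted(
        test_cases,
        key=lambda t: RISK_ORDER.get(t.get("risk_level"), len(RISK_ORDER)),
    )
```

Tests with a missing or unrecognized risk level sort last rather than raising, which keeps the report usable even if a test case slipped through without that field.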

Step 4: Validation and Quality

The suite validator performs two checks. Coverage completeness verifies that critical-risk modules have at least 4 tests, high-risk modules at least 3, and medium-risk modules at least 1; it also flags high-risk modules that have no negative tests. Duplicate detection sends all test titles to the LLM to identify semantically similar tests that may be redundant.
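The coverage completeness check is rule-based and might look like the sketch below. The thresholds come from the description above, and the `message` key is what the orchestrator prints; the `type` and `module` keys on each issue are assumptions. The LLM-based duplicate check is not shown.

```python
# Minimum number of tests per risk level, as described in the text
MIN_TESTS = {"critical": 4, "high": 3, "medium": 1}


def check_coverage_completeness(
    assessments: list[dict], test_cases: list[dict]
) -> list[dict]:
    """Flag modules whose generated tests fall short of the minimums."""
    issues = []
    for a in assessments:
        module, level = a["module"], a["risk_level"]
        module_tests = [t for t in test_cases if t.get("module") == module]

        required = MIN_TESTS.get(level, 0)  # low risk has no minimum
        if len(module_tests) < required:
            issues.append({
                "type": "coverage",
                "module": module,
                "message": (f"{module}: {len(module_tests)} tests, "
                            f"{level} risk needs at least {required}"),
            })

        has_negative = any(t.get("test_type") == "negative" for t in module_tests)
        if level == "high" and not has_negative:
            issues.append({
                "type": "missing_negative",
                "module": module,
                "message": f"{module}: high-risk module has no negative tests",
            })
    return issues
```

The minimums sit one below the generator's targets (5, 4, and 2 tests), so a module only fails this check when the LLM returned noticeably fewer tests than requested or some were discarded during parsing.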

The main orchestrator:

"""main.py — Orchestrate the intelligent test suite generator."""
from modules.change_analyzer import get_recent_changes
from modules.defect_analyzer import load_defects, analyze_defect_patterns, SAMPLE_DEFECTS
from modules.risk_engine import calculate_risk_scores
from modules.test_generator import generate_all_tests
from modules.test_data import enrich_tests_with_data
from modules.suite_validator import check_coverage_completeness, check_duplicate_tests
from modules.suite_report import generate_suite_report


def run_pipeline(
    repo_path: str = ".",
    defect_file: str = None,
    since: str = "1 week ago",
):
    """Run the intelligent test suite generation pipeline."""
    print("=" * 60)
    print("Intelligent Test Suite Generator")
    print("=" * 60)

    # Stage 1: Gather inputs
    print("\n[1/5] Analyzing recent changes...")
    changes = get_recent_changes(repo_path, since=since)

    print("\n[2/5] Analyzing defect history...")
    if defect_file:
        defects = load_defects(defect_file)
    else:
        print("  Using sample defect data...")
        defects = SAMPLE_DEFECTS
    defect_patterns = analyze_defect_patterns(defects)

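    # Stage 2: Risk analysis
    # Note: no existing_test_count is passed here, so coverage_factor is 100 for
    # every module and each score starts at 25 (the "medium" threshold).
    print("\n[3/5] Calculating risk scores...")
    assessments = calculate_risk_scores(changes, defect_patterns)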
    for a in assessments:
        print(f"  {a['module']:25s} {a['risk_level']:10s} (score: {a['risk_score']:.0f})")

    # Stage 3: Generate tests
    print("\n[4/5] Generating targeted test cases...")
    test_cases = generate_all_tests(assessments, changes, defect_patterns)

    # Enrich with test data
    print("\n  Adding test data...")
    test_cases = enrich_tests_with_data(test_cases)

    # Stage 4: Validate
    print("\n[5/5] Validating test suite...")
    coverage_issues = check_coverage_completeness(assessments, test_cases)
    duplicate_issues = check_duplicate_tests(test_cases)

    all_issues = coverage_issues + duplicate_issues
    if all_issues:
        print(f"  Found {len(all_issues)} issues:")
        for issue in all_issues:
            print(f"    - {issue['message']}")
    else:
        print("  No issues found.")

    # Generate report
    print("\nGenerating report...")
    report_path = generate_suite_report(
        assessments, test_cases, defect_patterns
    )

    print("\n" + "=" * 60)
    print(f"Generated {len(test_cases)} test cases across "
          f"{len(set(t.get('module') for t in test_cases))} modules")
    print("=" * 60)

    return report_path


if __name__ == "__main__":
    import sys
    repo = sys.argv[1] if len(sys.argv) > 1 else "."
    run_pipeline(repo_path=repo)

Extensions and Portfolio Tips

Add CI/CD integration. Run the generator as a GitHub Action that triggers on every pull request. It analyzes the PR's diff, generates targeted test cases, and posts them as a PR comment. This demonstrates DevOps awareness and makes the tool immediately practical.
Build a test decay detector. Compare the existing test suite against recent code changes to identify tests that no longer exercise the code they were written for. Flag stale tests for review or deletion.
Implement a learning loop. Track which generated tests actually find defects. Feed this data back into the risk engine to improve future prioritization. Over time, the system learns which types of changes are most likely to cause failures.
Add visual coverage maps. Generate a heatmap showing test coverage across modules, color-coded by risk level. Use matplotlib or Plotly to create an interactive visualization.
Support multiple test frameworks. Generate test code directly in pytest, JUnit, or Cypress format instead of plain-text test cases. This makes the output immediately executable.

Demo this tool on a real open-source project. Clone a popular GitHub repo, run the generator against its recent commit history, and show the risk assessment and generated tests. Using real data rather than samples demonstrates that your tool works in the wild.
