Chapter 08 of 18
Test Case Generation
Writing test cases manually is slow, inconsistent, and prone to blind spots. LLMs can generate categorized, boundary-aware, adversarial test suites from plain-language requirements in seconds — here is how to build that pipeline.
Part 3: Quality Assurance with LLMs
Test Case Generation
Writing test cases is the most time-consuming activity in the QA lifecycle. Most test cases follow predictable patterns that an LLM can generate in seconds. The QA analyst's role shifts from writing to reviewing, refining, and augmenting — a far better use of their expertise.
Reading time: ~25 min Project: Test Case Generator
9.1 The Testing Bottleneck
Every QA professional knows the pain: sprint planning reveals twelve new user stories, each needing test cases by end of week. Manual test case writing is slow, inconsistent, and prone to blind spots. Senior QAs write better tests than juniors. Even the best testers miss edge cases under time pressure.
Consider the numbers. A typical requirement like "Users can reset their password via email verification" needs at minimum:
| Test Category | Typical Count | Manual Time (min) |
|---|---|---|
| Happy path / positive tests | 3–5 | 15–25 |
| Negative / invalid input tests | 5–10 | 25–50 |
| Boundary value tests | 4–8 | 20–40 |
| Security-related tests | 3–6 | 15–30 |
| Integration / cross-system tests | 2–4 | 10–20 |
| Total | 17–33 | 85–165 |
An LLM can generate a first draft of all these categories in under a minute.
The bottleneck is not just speed — it is consistency. When five QAs write tests for five features, you get five different styles, five different coverage depths, and five different interpretations of "thorough." An LLM-driven pipeline standardizes the output format, ensures every category is considered, and provides a baseline the team can customize. LLM-generated test cases are a starting point. They may miss domain-specific constraints, security nuances, or regulatory requirements that only a human tester would know. Always review, validate, and supplement before adding anything to your test suite.
Figure 9-1. The test case generation pipeline transforms plain-language requirements into a prioritized, categorized test suite through LLM analysis.
Figure 9-2. A coverage matrix maps requirements against test types, making gaps immediately visible. Numbers indicate test case count; dashed cells highlight missing coverage.
9.2 Generating Tests from Requirements
The foundation of LLM-based test generation is a well-structured prompt that takes a requirement as input and produces categorized test cases as output. You provide the LLM with a clear taxonomy of test types and ask it to populate each category systematically.
The prompt instructs the LLM to act as a senior QA engineer, generating test cases across five categories — positive, negative, boundary, security, integration — with structured fields: test_id, category, title, preconditions, steps, expected_result, and priority.
For a requirement like "Users can reset their password by clicking 'Forgot Password' on the login page. The link expires after 30 minutes. The new password must be at least 8 characters with one uppercase letter, one number, and one special character. Users cannot reuse their last 5 passwords.", the LLM produces:
[TC-001] (positive) Successful password reset with valid email
Priority: high
Expected: User receives reset email within 2 minutes
[TC-002] (positive) Password reset with valid new password meeting all criteria
Priority: high
Expected: Password is updated, user can log in with new password
[TC-003] (negative) Reset link used after 30-minute expiration
Priority: high
Expected: System displays "Link expired" and prompts new reset request
[TC-004] (boundary) New password with exactly 8 characters meeting all criteria
Priority: medium
Expected: Password accepted and updated successfully
[TC-005] (security) Attempt to reuse the 5th most recent password
Priority: high
Expected: System rejects password with "Cannot reuse recent passwords" message
Set temperature=0.3 for test case generation. You want consistency: the same requirement should produce similar test cases each run. Use higher temperatures (0.7–0.9) only when you want exploratory, creative test ideas. The system prompt defines the exact schema you expect — without a clear output structure, the LLM may return test cases in an unpredictable format, making downstream processing brittle. Using response_format={"type": "json_object"} ensures valid JSON every time.
9.3 Boundary Value Analysis with LLMs
Boundary value analysis (BVA) is one of the most effective testing techniques and one of the most tedious to apply manually. For every input field with a defined range, you need to test at minimum: the lower boundary, just below it, just above it, the upper boundary, and nominal values in between.
LLMs excel at BVA because they parse natural-language constraints and systematically derive boundary values. You describe the requirement — "The new password must be between 8 and 64 characters long. The user's age must be between 18 and 120. The reset code is a 6-digit number." — and the LLM generates the full boundary table with min, max, and edge values for every bounded field it identifies:
| Field | Boundary | Value | Expected |
|---|---|---|---|
| password_length | min - 1 | 7 chars | FAIL |
| password_length | min | 8 chars | PASS |
| password_length | min + 1 | 9 chars | PASS |
| password_length | nominal | 20 chars | PASS |
| password_length | max - 1 | 63 chars | PASS |
| password_length | max | 64 chars | PASS |
| password_length | max + 1 | 65 chars | FAIL |
| user_age | min - 1 | 17 | FAIL |
| user_age | min | 18 | PASS |
| reset_code | min | 100000 | PASS |
| reset_code | max | 999999 | PASS |
| reset_code | max + 1 | 1000000 | FAIL |
Traditional BVA templates require you to manually identify each bounded field and fill in values. An LLM reads the requirement in natural language, identifies all bounded fields automatically, and generates the full boundary table. When requirements change — say the password max moves from 64 to 128 characters — you re-run the prompt and the entire table updates.
9.4 Equivalence Partitioning Automation
Equivalence partitioning divides input data into groups where all values in a group should produce the same behavior. Instead of testing every possible input, you test one representative from each partition. This reduces the number of tests while maintaining coverage.
An LLM can identify equivalence classes from requirement text and generate representative test values for each. Given a shipping calculator requirement with weight ranges, destination types, and insurance options, the prompt asks the LLM to identify valid and invalid partitions for every input field and provide a representative test value for each class:
| Field | Type | Class | Value | Expected |
|---|---|---|---|---|
| weight | valid | Light parcel (0.1–4.99 kg) | 2.5 | $5 base rate |
| weight | valid | Medium parcel (5.0–20.0 kg) | 12.0 | $15 base rate |
| weight | valid | Heavy parcel (20.01–50.0 kg) | 35.0 | $30 base rate |
| weight | invalid | Below minimum (less than 0.1 kg) | 0.05 | Error: weight too low |
| weight | invalid | Above maximum (over 50.0 kg) | 55.0 | Error: weight exceeds limit |
| destination | valid | Domestic | domestic | 1x rate multiplier |
| destination | valid | International standard | intl-std | 3x rate multiplier |
| destination | valid | International express | intl-exp | 5x rate multiplier |
| insurance | valid | With insurance | yes | +10% to total |
| insurance | valid | Without insurance | no | No surcharge |
The value of combining equivalence partitioning with LLMs becomes clear when you consider pairwise combinations. With three fields (weight: 5 classes, destination: 3, insurance: 2), full combinatorial testing requires 30 test cases. Pairwise testing covers all two-way interactions with far fewer. You can ask the LLM to generate a minimal pairwise covering array where every pair of classes from different fields appears in at least one test.
One caution: LLMs sometimes miss pairs in their generated covering arrays. Validate that the returned test set actually achieves full pairwise coverage by checking each pair programmatically. Use the generated set as a starting point and add any missing pairs.
9.5 Negative Test Case Generation
Negative testing verifies that the system handles invalid, unexpected, and malicious input gracefully. This is where LLMs have the most impact. Human testers tend toward happy-path bias — instinctively thinking about how users are supposed to use the system. Prompted correctly, an LLM generates an exhaustive catalogue of things that can go wrong.
The prompt instructs the LLM to act as a destructive tester across seven attack categories: invalid input, missing data, overflow, injection, race conditions, state violations, and authorization bypass. Use a slightly higher temperature (0.5) to encourage creative attack vector discovery. For a money transfer requirement, a well-prompted LLM generates test cases many testers would miss:
| Category | Attack | Payload | Expected Safe Behavior |
|---|---|---|---|
| Injection | SQL injection in account field | ' OR 1=1; DROP TABLE accounts;-- | Input rejected, error logged |
| Overflow | Transfer amount of MAX_FLOAT | 1.7976931348623157e+308 | Rejected: amount exceeds limit |
| Race condition | Two simultaneous transfers draining same account | Concurrent $500 transfers from $600 balance | Second transfer rejected or queued |
| State violation | Transfer from a frozen account | Source account with status=frozen | Transfer blocked with clear error message |
| Authorization | Transfer from another user's account | Source account owned by different user | 403 Forbidden, attempt logged |
| Invalid input | Negative transfer amount | -500.00 | Rejected: amount must be positive |
| Missing data | Empty destination account | "" | Validation error: destination required |
Layer negative tests by severity. Injection and authorization tests are critical — they represent real attack vectors. Missing data tests matter for UX. Overflow tests catch edge cases. Prioritize so the critical security tests run first in every regression cycle.
9.6 Test Case Prioritization
Generating 50 test cases is useful. Knowing which 15 to run when you have an hour before release is essential. LLMs prioritize test cases by analyzing risk factors, historical defect data, and business impact.
The approach uses three scoring dimensions: Risk (1–5, how likely is this to fail?), Impact (1–5, how severe if it fails in production?), and Coverage (1–5, how much unique functionality does it test?). These combine into a composite score: Risk × 0.4 + Impact × 0.4 + Coverage × 0.2. Given a time budget, the LLM ranks all tests and marks the top N as selected for execution:
>>> RUN 1. [TC-005] Score: 4.6 | Reuse of recent password (security)
Risk=5 Impact=5 Coverage=4
Password reuse bypass could lead to account compromise
>>> RUN 2. [TC-003] Score: 4.4 | Expired reset link used
Risk=5 Impact=4 Coverage=5
Only test covering expiration logic — critical timing boundary
>>> RUN 3. [TC-012] Score: 4.2 | SQL injection in email field
Risk=4 Impact=5 Coverage=4
Injection attacks are high-impact and commonly exploited
skip 9. [TC-008] Score: 2.4 | Valid reset with Gmail address
Risk=2 Impact=2 Coverage=2
Covered by other positive tests; low incremental value
If your defect tracker has data on which modules have the most bugs, include that context in the prompt — for example: "The authentication module has had 12 defects in the last 3 sprints, mostly around session handling." This lets the LLM weight risk scores based on real project history, not just general heuristics.
9.7 Coverage Analysis
Generating test cases is only half the work. You also need to verify that your test suite actually covers the requirement. LLMs perform a gap analysis by comparing the requirement text against the generated test cases and identifying untested scenarios.
The coverage prompt asks the LLM to compare the requirement text against existing test cases and classify each area as Covered, Partially Covered, Gap (no tests at all), or Implicit (requirements not stated but implied, such as performance, accessibility, or rate limiting). For each gap, it suggests specific test cases to fill it:
| Status | Area | Action Needed |
|---|---|---|
| Covered | Happy path password reset | None |
| Covered | Password complexity rules | None |
| Partial | Link expiration | Add test for link used at exactly 30 min |
| Gap | Multiple simultaneous reset requests | Test: user requests reset twice. Which link is valid? |
| Gap | Email delivery failure | Test: what happens when email service is down? |
| Implicit | Rate limiting on reset requests | Test: 100 reset requests in 1 minute from same IP |
| Implicit | Accessibility of reset form | Test: screen reader compatibility, keyboard navigation |
Run the coverage analysis, generate tests for the gaps, add them to your suite, then run the analysis again. Two or three iterations typically push coverage from 60–70% on first generation to 90%+ after gap-filling. Automate this loop in your CI pipeline for continuous coverage monitoring.
Project: Test Case Generator
Build a complete test case generation pipeline that takes a requirements document and produces a structured test plan with prioritized, categorized test cases and a coverage report.
Project Requirements
- Accept a text file containing multiple requirements (one per paragraph)
- Generate test cases for each requirement across all five categories
- Perform boundary value analysis for all bounded fields
- Generate negative and security test cases
- Prioritize the complete test suite
- Run coverage analysis and fill gaps
- Output a structured test plan in JSON and a human-readable summary
Pipeline Steps
The project follows these pipeline stages, each handled by a separate LLM call:
- Parse requirements from a text file (one requirement per paragraph)
- Generate core test cases across all five categories for each requirement
- Add boundary and negative tests via specialized prompts
- Deduplicate using the LLM to identify semantically identical test cases
- Analyze coverage and fill gaps with additional targeted tests
- Prioritize the complete suite across all requirements
- Export a structured JSON test plan and human-readable summary
Extension Ideas
- Add Gherkin (Given/When/Then) output format for BDD teams
- Integrate with Jira or Azure DevOps to push test cases directly into your test management tool
- Add a Streamlit UI for non-technical stakeholders to input requirements and review generated tests
- Export to CSV/Excel for teams that use spreadsheet-based test management
Exercises
- Generate and compare. Take a real requirement from your current project and generate test cases using the pipeline in this chapter. Compare them against your existing test cases. What did the LLM find that you missed? What did you have that the LLM missed?
- Tune the temperature. Run the negative test generator three times at temperatures 0.2, 0.5, and 0.9. Compare the creativity and relevance of the generated attacks at each setting.
- Build a coverage dashboard. Extend the coverage analysis to produce a visual HTML report showing covered, partially covered, and uncovered requirement areas with color coding.
- Cross-requirement coverage. Modify the pipeline to detect when test cases from one requirement also cover aspects of another requirement, reducing total test count.
- Gherkin output. Add an output formatter that converts generated test cases into Gherkin (Given/When/Then) syntax suitable for Cucumber or Behave.