Chapter 08 of 18

Test Case Generation

Writing test cases is the most time-consuming activity in the QA lifecycle — yet most test cases follow predictable patterns that an LLM can generate in seconds. In this chapter, you will build a system that reads a requirement and produces comprehensive, categorized test cases automatically.

12 min read

Part 3: Quality Assurance with LLMs

Test Case Generation

Writing test cases is the most time-consuming activity in the QA lifecycle. Most test cases follow predictable patterns that an LLM can generate in seconds. In this chapter, you will build a system that reads a requirement and produces comprehensive, categorized test cases automatically.

Reading time: ~25 min Project: Test Case Generator

What You Will Learn

How to prompt LLMs to generate functional, negative, and boundary test cases from plain-language requirements
Techniques for boundary value analysis and equivalence partitioning using LLMs
Methods for generating negative test cases that probe failure modes
How to prioritize generated test cases by risk and coverage
Strategies for measuring and improving test coverage with LLM assistance
Building a reusable Test Case Generator pipeline in Python

9.1 The Testing Bottleneck

Every QA professional knows the pain: a sprint planning session reveals twelve new user stories, each needing test cases by the end of the week. Manual test case writing is slow, inconsistent, and prone to blind spots. Senior QAs write better tests than juniors. Even the best testers miss edge cases when working under time pressure.

Consider the numbers. A typical requirement like "Users can reset their password via email verification" needs at minimum:

Test Category	Typical Count	Manual Time (min)
Happy path / positive tests	3–5	15–25
Negative / invalid input tests	5–10	25–50
Boundary value tests	4–8	20–40
Security-related tests	3–6	15–30
Integration / cross-system tests	2–4	10–20
Total	17–33	85–165

An LLM can generate a first draft of all these categories in under a minute. The QA analyst's role shifts from writing to reviewing, refining, and augmenting. This is a far better use of their expertise.

Human-in-the-loop is non-negotiable. LLM-generated test cases are a starting point. They may miss domain-specific constraints, security nuances, or regulatory requirements that only a human tester would know. Always review, validate, and supplement generated test cases before adding them to your test suite.

The bottleneck is not just speed. It is consistency. When five QAs write tests for five features, you get five different styles, five different levels of coverage depth, and five different interpretations of "thorough." An LLM-driven pipeline standardizes the output format, ensures every category is considered, and provides a baseline that the team can then customize.

Diagram 1

Figure 9-1. The test case generation pipeline transforms plain-language requirements into a prioritized, categorized test suite through LLM analysis.

Diagram 2

Figure 9-2. A coverage matrix maps requirements against test types, making gaps immediately visible. Numbers indicate test case count; dashed cells highlight missing coverage.

9.2 Generating Tests from Requirements

The foundation of LLM-based test generation is a well-structured prompt that takes a requirement as input and produces categorized test cases as output. Provide the LLM with a clear taxonomy of test types and ask it to populate each category systematically.

The approach is straightforward: describe the requirement in plain language, tell the LLM what test categories to cover, and specify the output format you want. The prompt instructs the LLM to act as a senior QA engineer, generating test cases across five categories (positive, negative, boundary, security, integration) with structured fields: test_id, category, title, preconditions, steps, expected_result, and priority.

For a requirement like "Users can reset their password by clicking 'Forgot Password' on the login page. The link expires after 30 minutes. The new password must be at least 8 characters with one uppercase letter, one number, and one special character. Users cannot reuse their last 5 passwords.", the LLM produces output such as:

[TC-001] (positive) Successful password reset with valid email
  Priority: high
  Expected: User receives reset email within 2 minutes

[TC-002] (positive) Password reset with valid new password meeting all criteria
  Priority: high
  Expected: Password is updated, user can log in with new password

[TC-003] (negative) Reset link used after 30-minute expiration
  Priority: high
  Expected: System displays "Link expired" and prompts new reset request

[TC-004] (boundary) New password with exactly 8 characters meeting all criteria
  Priority: medium
  Expected: Password accepted and updated successfully

[TC-005] (security) Attempt to reuse the 5th most recent password
  Priority: high
  Expected: System rejects password with "Cannot reuse recent passwords" message

Prompt engineering tip. Setting temperature=0.3 makes the output more deterministic across runs. For test case generation, you want consistency: the same requirement should produce similar test cases each time. Use higher temperatures (0.7-0.9) only when you want creative, exploratory test ideas.

The system prompt defines the exact schema you expect. This is critical: without a clear output structure, the LLM may return test cases in an unpredictable format, making downstream processing brittle. Using response_format={"type": "json_object"} ensures you always get valid JSON back.

9.3 Boundary Value Analysis with LLMs

Boundary value analysis (BVA) is one of the most effective testing techniques and one of the most tedious to apply manually. For every input field with a defined range, you need to test at minimum the lower boundary, just below it, just above it, the upper boundary, and nominal values in between.

LLMs excel at BVA because they can parse natural-language constraints and systematically derive boundary values. You describe the requirement, for example: "The new password must be between 8 and 64 characters long. The user's age must be between 18 and 120. The reset code is a 6-digit number." The LLM generates the full boundary table with min, max, and edge values for every bounded field it identifies.

The output produces a comprehensive boundary table:

Field	Boundary	Value	Expected
password_length	min - 1	7 chars	FAIL
password_length	min	8 chars	PASS
password_length	min + 1	9 chars	PASS
password_length	nominal	20 chars	PASS
password_length	max - 1	63 chars	PASS
password_length	max	64 chars	PASS
password_length	max + 1	65 chars	FAIL
user_age	min - 1	17	FAIL
user_age	min	18	PASS
reset_code	min	100000	PASS
reset_code	max	999999	PASS
reset_code	max + 1	1000000	FAIL

Why LLMs beat templates for BVA. Traditional BVA templates require you to manually identify each bounded field and fill in values. An LLM reads the requirement in natural language, identifies all bounded fields automatically, and generates the full boundary table. When requirements change, say the password max moves from 64 to 128 characters, you re-run the prompt and the entire table updates.

9.4 Equivalence Partitioning Automation

Equivalence partitioning divides input data into groups (partitions) where all values in a partition should produce the same behavior. Instead of testing every possible input, you test one representative from each partition. This reduces the number of tests while maintaining coverage.

An LLM can identify equivalence classes from requirement text and generate representative test values for each. Given a shipping calculator requirement with weight ranges, destination types, and insurance options, the prompt asks the LLM to identify valid and invalid partitions for every input field and provide a representative test value for each class.

The LLM identifies partitions such as:

Field	Type	Class	Value	Expected
weight	valid	Light parcel (0.1–4.99 kg)	2.5	$5 base rate
weight	valid	Medium parcel (5.0–20.0 kg)	12.0	$15 base rate
weight	valid	Heavy parcel (20.01–50.0 kg)	35.0	$30 base rate
weight	invalid	Below minimum (less than 0.1 kg)	0.05	Error: weight too low
weight	invalid	Above maximum (over 50.0 kg)	55.0	Error: weight exceeds limit
destination	valid	Domestic	domestic	1x rate multiplier
destination	valid	International standard	intl-std	3x rate multiplier
destination	valid	International express	intl-exp	5x rate multiplier
insurance	valid	With insurance	yes	+10% to total
insurance	valid	Without insurance	no	No surcharge

The power of combining equivalence partitioning with LLMs becomes clear when you consider pairwise combinations. With three fields (weight: 5 classes, destination: 3, insurance: 2), full combinatorial testing requires 30 test cases. Pairwise testing covers all two-way interactions with far fewer. You can ask the LLM to generate a minimal pairwise covering array where every pair of classes from different fields appears in at least one test.

Verify pairwise coverage. LLMs sometimes miss pairs in their generated covering arrays. Always validate that the returned test set actually achieves full pairwise coverage by checking each pair of classes programmatically. Use the generated set as a starting point and add any missing pairs.

9.5 Negative Test Case Generation

Negative testing verifies that the system handles invalid, unexpected, and malicious input gracefully. This is where LLMs excel. Human testers tend to have a "happy path bias," instinctively thinking about how users are supposed to use the system. LLMs, prompted correctly, generate an exhaustive catalogue of things that can go wrong.

The prompt instructs the LLM to act as a destructive tester across seven attack categories: invalid input, missing data, overflow, injection, race conditions, state violations, and authorization bypass. You use a slightly higher temperature (0.5) to encourage creative attack vector discovery. For a money transfer requirement, a well-prompted LLM generates test cases that many testers would miss:

Category	Attack	Payload	Expected Safe Behavior
Injection	SQL injection in account field	`' OR 1=1; DROP TABLE accounts;--`	Input rejected, error logged
Overflow	Transfer amount of MAX_FLOAT	`1.7976931348623157e+308`	Rejected: amount exceeds limit
Race condition	Two simultaneous transfers draining same account	Concurrent $500 transfers from $600 balance	Second transfer rejected or queued
State violation	Transfer from a frozen account	Source account with status=frozen	Transfer blocked with clear error message
Authorization	Transfer from another user's account	Source account owned by different user	403 Forbidden, attempt logged
Invalid input	Negative transfer amount	`-500.00`	Rejected: amount must be positive
Missing data	Empty destination account	`""`	Validation error: destination required

Layer negative tests by severity. Not all negative tests are equal. Injection and authorization tests are critical: they represent real attack vectors. Missing data tests are important for UX. Overflow tests catch edge cases. Prioritize your negative test suite so the critical security tests run first in every regression cycle.

9.6 Test Case Prioritization

Generating 50 test cases is useful. Knowing which 15 to run when you only have an hour before release is essential. LLMs prioritize test cases by analyzing risk factors, historical defect data, and business impact.

The approach uses three scoring dimensions: Risk (1-5, how likely is this to fail?), Impact (1-5, how severe if it fails in production?), and Coverage (1-5, how much unique functionality does it test?). These combine into a composite score: Risk * 0.4 + Impact * 0.4 + Coverage * 0.2. Given a time budget, the LLM ranks all tests and marks the top N as "selected" for execution.

The prioritization output gives QA leads a clear execution order:

>>> RUN  1. [TC-005] Score: 4.6 | Reuse of recent password (security)
         Risk=5 Impact=5 Coverage=4
         Password reuse bypass could lead to account compromise

>>> RUN  2. [TC-003] Score: 4.4 | Expired reset link used
         Risk=5 Impact=4 Coverage=5
         Only test covering expiration logic — critical timing boundary

>>> RUN  3. [TC-012] Score: 4.2 | SQL injection in email field
         Risk=4 Impact=5 Coverage=4
         Injection attacks are high-impact and commonly exploited

    skip  9. [TC-008] Score: 2.4 | Valid reset with Gmail address
         Risk=2 Impact=2 Coverage=2
         Covered by other positive tests; low incremental value

Feed historical data for smarter prioritization. If your defect tracker has data on which modules and features have the most bugs, include that context in the prompt. For example: "The authentication module has had 12 defects in the last 3 sprints, mostly around session handling." This lets the LLM weight risk scores based on real project history, not just general heuristics.

9.7 Coverage Analysis

Generating test cases is only half the work. You also need to verify that your test suite actually covers the requirement. LLMs perform a gap analysis by comparing the requirement text against the generated test cases and identifying untested scenarios.

The coverage prompt asks the LLM to compare the requirement text against existing test cases and classify each area as Covered, Partially Covered, Gap (no tests at all), or Implicit (requirements not stated but implied, such as performance, accessibility, or rate limiting). For each gap, it suggests specific test cases to fill it.

A typical coverage analysis reveals gaps like these:

Status	Area	Action Needed
Covered	Happy path password reset	None
Covered	Password complexity rules	None
Partial	Link expiration	Add test for link used at exactly 30 min
Gap	Multiple simultaneous reset requests	Test: user requests reset twice. Which link is valid?
Gap	Email delivery failure	Test: what happens when email service is down?
Implicit	Rate limiting on reset requests	Test: 100 reset requests in 1 minute from same IP
Implicit	Accessibility of reset form	Test: screen reader compatibility, keyboard navigation

Iterative coverage improvement. Run the coverage analysis, generate tests for the gaps, add them to your suite, and run the analysis again. Two or three iterations typically push coverage from 60–70% (first generation) to 90%+ (after gap-filling). Automate this loop in your CI pipeline for continuous coverage monitoring.

Project: Test Case Generator

Build a complete test case generation pipeline that takes a requirements document (multiple requirements) and produces a structured test plan with prioritized, categorized test cases and a coverage report.

Project Requirements

Accept a text file containing multiple requirements (one per paragraph)
Generate test cases for each requirement across all five categories
Perform boundary value analysis for all bounded fields
Generate negative and security test cases
Prioritize the complete test suite
Run coverage analysis and fill gaps
Output a structured test plan in JSON and a human-readable summary

Pipeline Steps

The project follows these pipeline stages, each handled by a separate LLM call:

Parse requirements from a text file (one requirement per paragraph)
Generate core test cases across all five categories for each requirement
Add boundary and negative tests via specialized prompts
Deduplicate using the LLM to identify semantically identical test cases
Analyze coverage and fill gaps with additional targeted tests
Prioritize the complete suite across all requirements
Export a structured JSON test plan and human-readable summary

Extension Ideas

Add Gherkin (Given/When/Then) output format for BDD teams
Integrate with Jira or Azure DevOps to push test cases directly into your test management tool
Add a Streamlit UI for non-technical stakeholders to input requirements and review generated tests
Export to CSV/Excel for teams that use spreadsheet-based test management

Summary

LLMs accelerate test case generation by producing categorized test cases from plain-language requirements in seconds rather than hours.
Boundary value analysis becomes automated. The LLM identifies bounded fields and generates the full BVA table with min, max, and edge values.
Equivalence partitioning is enhanced by LLMs that can identify valid and invalid classes and generate pairwise covering arrays.
Negative testing benefits most from LLMs because they generate adversarial scenarios (injection, race conditions, authorization bypasses) that human testers often overlook.
Prioritization uses risk, impact, and coverage scores to ensure the most critical tests run first when time is limited.
Coverage analysis closes the loop by identifying gaps, implicit requirements, and areas needing additional test cases.
Human review remains essential. LLM output is a high-quality first draft, not a finished product.

Exercises

Generate and compare. Take a real requirement from your current project and generate test cases using the pipeline in this chapter. Compare them against your existing test cases. What did the LLM find that you missed? What did you have that the LLM missed?
Tune the temperature. Run the negative test generator three times at temperatures 0.2, 0.5, and 0.9. Compare the creativity and relevance of the generated attacks at each setting.
Build a coverage dashboard. Extend the coverage analysis to produce a visual HTML report showing covered, partially covered, and uncovered requirement areas with color coding.
Cross-requirement coverage. Modify the pipeline to detect when test cases from one requirement also cover aspects of another requirement, reducing total test count.
Gherkin output. Add an output formatter that converts generated test cases into Gherkin (Given/When/Then) syntax suitable for Cucumber or Behave.

← Back to AI for Analysts and QA Teams