Chapter 03 of 9
Grounded Delivery: The Five Phases
Why Agile breaks for AI — and the five-phase delivery methodology built for systems whose outputs cannot be predicted with certainty.
Overview
Grounded Delivery is the second pillar of the LegacyForward.ai framework — a five-phase delivery methodology designed specifically for non-deterministic AI systems. It does not replace engineering discipline; it replaces the Agile assumptions that break when applied to systems whose outputs cannot be predicted with certainty.
The five phases — Frame, Explore, Shape, Harden, Operate — address the full lifecycle of an AI initiative, from defining what "good" looks like before a line of code is written to monitoring quality in production as a permanent operational discipline.
Why Agile Breaks for AI
Agile works for deterministic systems, where software follows rules and the same function called with the same input produces the same output every time. Agile's foundational constructs — user stories with binary acceptance criteria, sprint velocity estimates, regression test suites — all depend on that property.
AI systems are non-deterministic. The same prompt, applied to the same document, at different times, can produce different outputs. Not because of a bug. Because of how large language models and probabilistic systems work. An answer that is correct ninety-four times can be wrong on the ninety-fifth with no error code, no warning, and no traceable root cause.
Here is what breaks when you apply Agile to AI:
| Agile Assumption | AI Reality |
|---|---|
| User stories have binary acceptance criteria (pass/fail) | AI output quality is a distribution, not a point — you need confidence intervals, not checkboxes |
| Sprint velocity is measurable and predictable | AI exploration does not have predictable output — a two-week spike might produce a breakthrough or a dead end |
| Regression tests prevent quality degradation | Model drift, prompt drift, and data shift degrade AI quality between releases without any code change |
| "Done" means releasable | AI systems require ongoing monitoring, retraining, and quality management — they are never done |
| The build phase follows the design phase | AI development is iterative discovery — you cannot design what you have not yet explored |
| Definition of Done is fixed at sprint start | AI quality thresholds evolve as the team learns what "good" means for this specific use case |
Think of it like this: Agile is a production line. You design the car, break the design into parts, build the parts, assemble the car, test it, and ship it. Grounded Delivery is more like running a clinical trial. You form a hypothesis, design structured experiments, analyze the results, and define what "works" based on evidence rather than requirements. You establish ongoing monitoring because the treatment that works today may not work the same way next year.
Grounded Delivery vs. Traditional Agile
| Dimension | Traditional Agile | Grounded Delivery |
|---|---|---|
| Quality model | Binary pass/fail tests | Quality distributions with confidence intervals |
| Planning unit | Sprint velocity | Experiment results and learning milestones |
| Acceptance criteria | Deterministic (output equals expected) | Probabilistic (output meets threshold at defined confidence level) |
| Exploration | Time-boxed spikes within sprints | Full dedicated phase (Explore) with explicit gate criteria |
| Test assets | Regression test suite | Evaluation dataset — a first-class asset rivaling production code |
| Done criteria | Feature complete, tests pass | Quality thresholds met, monitoring in place, drift detection active |
| Post-deployment | Maintenance mode | Operate — permanent monitoring, retraining, ongoing evaluation |
| Failure handling | Bug report, fix, regression test | Drift detection, quality gate breach, probabilistic review |
| Governance | Velocity and burn-down | Phase gate decisions: GO / PIVOT / KILL |
Phase 1: Frame
Objective: Establish the value target, define what "good" means probabilistically, and create the conditions for honest exploration before development begins.
Frame does not define what to build. It defines what value to pursue and how to know whether the initiative is achieving it. It takes the Value Hypothesis from Signal Capture and translates it into operational terms that can drive delivery decisions.
Activities:
- Translate the Value Hypothesis into measurable success criteria expressed as probability distributions, not binary targets. "Summarize regulatory documents with 85% analyst-agreement score at the 90th percentile" is a Frame output. "AI summarizes documents" is not.
- Define the evaluation dataset requirements: what data is needed to know whether the AI is working? The evaluation dataset is designed here, not in Harden.
- Map the legacy integration landscape for this initiative. Which legacy systems participate? What data is needed, at what latency, in what format?
- Identify failure modes: what happens when the AI is wrong? What is the business impact? What is the fallback? Non-deterministic systems must have fallbacks designed from the start, not added as an afterthought.
- Define the go/no-go criteria for the Explore phase. What would cause this initiative to be killed or pivoted before significant development investment?
Gate: GO / NO-GO. Frame exits with documented success criteria, evaluation dataset design, integration map, and failure mode analysis. If these cannot be produced — if the team cannot articulate what "good" looks like before they start building — the initiative should not proceed to Explore.
Phase 2: Explore
Objective: Validate whether current AI capabilities can actually deliver the claimed value, against real data, with real edge cases — before committing to production development.
Explore is not a spike. It is not a two-week experiment wedged into a sprint. It is a full delivery phase with its own objectives, activities, and gate criteria. The purpose is to discover what is possible, what is not possible, and what is possible under which conditions — before anyone writes production code.
Activities:
- Run structured experiments against production-representative data — not curated subsets or synthetic data, but real data that includes the edge cases and outliers that curated data hides.
- Build and iterate the evaluation dataset. The evaluation dataset is the most important asset produced in Explore — it defines what "good" means for this specific use case and becomes the quality benchmark for everything that follows.
- Explore multiple technical approaches. Do not commit to the first approach that produces a promising demo. Explore competing architectures, model configurations, and prompt strategies. Document what each approach can and cannot do.
- Validate legacy data access. Can the initiative actually access the data it needs from the legacy systems that hold it? In what format? At what latency? Does the actual data quality match the assumed data quality?
- Produce a capability map: a documented assessment of what the AI can reliably do, what it cannot do, and where its performance degrades. This is not a demo reel. It is an honest assessment.
Gate: GO / PIVOT / KILL. Explore ends with a Go/Pivot/Kill decision based on evidence. GO means the capability map demonstrates that the value hypothesis is achievable with current technology. PIVOT means the hypothesis needs to be revised. KILL means the hypothesis is not achievable with current technology, the data does not support it, or the integration constraints make it infeasible. Killing an initiative at Explore is success — it means the organization did not spend production development resources on something that would fail.
Phase 3: Shape
Objective: Design the production architecture, separating deterministic and non-deterministic components, defining fallback paths, and creating the technical foundation for a system that can be operated and monitored in production.
Shape is where Grounded Delivery's most important architectural principle is implemented: deterministic and non-deterministic components must be separated and governed independently.
Activities:
- Design the system architecture with explicit separation between deterministic components (business logic, data validation, integration code, routing) and non-deterministic components (model inference, prompt execution, AI-generated content).
- Design fallback paths for every non-deterministic component. What happens when the AI component is unavailable? When quality drops below acceptable thresholds? When the model produces output that fails validation? Fallbacks must be designed before Harden, not after.
- Define monitoring requirements. What metrics will be tracked? What constitutes a quality gate breach in production? What triggers human review? What triggers automatic rollback?
- Select Legacy Coexistence patterns for each integration point.
- Design the evaluation pipeline. How will the system be evaluated continuously in production? What is the evaluation cadence? Who reviews the results?
Gate: GO / REDESIGN. Shape exits with a production architecture that has explicit fallback paths, monitoring design, and legacy integration patterns. An architecture without fallbacks does not exit Shape.
Phase 4: Harden
Objective: Build the production system against the architecture defined in Shape, continuously evaluating quality against the thresholds established in Frame, and making explicit decisions when quality gates are not met.
Activities:
- Implement deterministic components with conventional engineering rigor: unit tests, integration tests, code review, regression testing.
- Implement non-deterministic components against the evaluation dataset. Every iteration is evaluated against the quality thresholds defined in Frame. Output quality is tracked as a distribution over time, not a point-in-time pass/fail.
- Implement legacy integrations and test against production-representative legacy behavior. Mocked legacy systems in test environments do not replicate the timing, data quality, format variations, and error conditions of production legacy systems.
- Implement monitoring infrastructure before going to production, not after. Drift detection, quality gate thresholds, alerting, and the operational dashboard are built in Harden.
- Conduct red-team evaluation: deliberately attempt to produce failure modes identified in Explore. Document the failure rate. Determine whether it is within the acceptable threshold defined in Frame.
Gate: GO / CONTINUE / KILL. Harden exits when quality thresholds from Frame are consistently met in evaluation, legacy integrations are validated against production-representative data, and monitoring is operational.
Phase 5: Operate
Objective: Run the AI system in production as a permanent operational discipline — monitoring quality, detecting drift, managing ongoing performance, and feeding evidence back into Signal Capture.
Operate is not maintenance mode. An AI system that is not actively monitored and managed will degrade. Model drift, prompt drift, data distribution shift, and changing business context all erode AI performance between releases — often without any code change, without any error log, and without any user complaint until the damage is significant.
Activities:
- Run continuous evaluation against the evaluation dataset. Quality is not a deployment-time property — it is an ongoing property that must be measured.
- Monitor for drift: model drift, prompt drift, and data drift.
- Operate governance checkpoints: regular reviews of quality metrics by the team responsible for the system. These are not optional. An AI system without a defined governance cadence is not in production — it is abandoned.
- Feed operational evidence back into Signal Capture. Is the initiative capturing the value that was hypothesized?
- Manage the human-AI trust boundary. As confidence in the system builds, the trust boundary can be adjusted — from 100% human verification to sampling-based verification to exception-only review. Trust graduation criteria must be explicit.
There is no exit gate for Operate. Operate is forever. If the organization is not willing to maintain an AI system in production as a permanent operational discipline, the system should not be deployed.