I built the SDLC I always wanted, and let an agent run it
Fifteen years building software and fixing how engineering organisations work, five of them coaching agile transformations, taught me which gates, handoffs, and traceability links the teams that ship actually keep alive. Black Box App Factory is that distilled discipline, encoded as 22 phases an AI agent runs end-to-end, with humans only at the gates that matter.
The pattern in the teams that shipped
For years, my consulting work has been operational transformation: walking into teams that build software, finding out what was holding their delivery together when it flowed, and where the same teams later lost their edge. Different industries, different stacks. The shape of the winning practice repeated.
Requirements that traced back to a single source the developer actually read at sprint planning. Acceptance criteria written before the test, then linked. Architecture decisions captured in a place that survived the meeting and stayed binding three months later. Gates that someone owned, with consequences. By the time QA touched the build, the trace from a passing test back to a user story was a one-click hop.
None of this is exotic. The artifacts a healthy SDLC produces (traceability matrices, decision logs, gate evaluations, role handoffs with full context) are not hard to produce. The hard part is keeping them alive.
I have spent a lot of professional time studying what the people who owned those gates did, and why the matrix stayed alive in their teams.
Why an agent can hold the line
The other face of the same coin is why even good teams eventually drift. People do not let the gate lapse because they stop caring. They let it lapse because they have eight other things demanding the same attention, and skipping the gate this once has no immediate consequence.
An agent has none of that.
An agent does not get tired of writing the traceability row. It does not negotiate down the rigor of a security review because sprint capacity was tight. It does not forget which user story a test was meant to verify. If you tell it the gate exists and to evaluate it, it evaluates it. Every time.
So I sat down and wrote the SDLC I would have given my best teams, if attention had been infinite.
What Black Box App Factory is
Black Box App Factory is an SDLC framework for AI-driven software development. The methodology is markdown-only and AI-agnostic, so any harness that can read files and dispatch agents can run it.
The substance:
- 22 phases, Phase 0 through Phase 22 (ID 15 retired; ID 8 is conditional on whether CI/CD is in use). Coverage runs from project setup to requirements, architecture, CI/CD setup, development, testing, deployment, and operations.
- 7 roles the orchestrator dispatches as subagents: Product Manager, Workflow Architect, Architect, Developer, QA, Security Expert, User Tester.
- 10 configurable modules, named slices of process rigor: architecture depth, implementation planning, code review, security audit, functional testing, persona testing, accessibility, stability verification, deployment, operations. The Workflow Architect tunes each across 8 dimensions of project profile (scale, integration complexity, AI/LLM components, deployment target, user diversity, compliance, security sensitivity, accessibility). Each module has a tier, typically Skip / Lite / Standard / Full, that scales the effort on its associated phases.
- A file-based state machine. A single specs/.workflow-state.json is the authoritative state. Phase outputs go to other files under specs/. The orchestrator reads files, dispatches subagents, and updates the state file. It does not keep working memory in conversation context.
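The phase count in the list above can be checked with a few lines of arithmetic. This is a minimal sketch, not framework code: `phase_ids` and the two constants are hypothetical names, and it assumes the conditional phase is simply dropped from the cycle when CI/CD is not in use.

```python
RETIRED_IDS = {15}        # from the list above: ID 15 is retired
CI_CONDITIONAL_IDS = {8}  # from the list above: ID 8 depends on CI/CD being in use


def phase_ids(ci_active: bool) -> list[int]:
    """Return the phase IDs a cycle would run, in order (hypothetical helper)."""
    # IDs run from 0 through 22 inclusive, minus the retired one.
    ids = [i for i in range(0, 23) if i not in RETIRED_IDS]
    if not ci_active:
        # Assumption: a conditional phase is skipped outright when CI/CD is off.
        ids = [i for i in ids if i not in CI_CONDITIONAL_IDS]
    return ids
```

With CI/CD in use this yields 22 IDs; without it, one fewer.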
Phase map
This is the shape of a cycle. Read left to right, top to bottom.
Under the hood
Two artefacts hold the framework together: the dispatch the orchestrator hands a subagent (a list of file paths, never spec contents), and the state file the orchestrator reads at the start of every dispatch.
# Phase 14, Step 1: persona testing
orchestrator dispatch:
  role: roles/qa.md
  task: persona testing
  inputs:
    - specs/personas/luca-low-tech.md
    - specs/features/*.md
    - specs/user-stories/*.md
    - app at http://localhost:5173
  output: specs/persona-report-luca.md
  state: specs/.workflow-state.json (read-only)
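One way to picture a dispatch in code is as a small immutable record that only ever carries paths. The sketch below is illustrative, not the framework's implementation: the `Dispatch` class and its `render` method are hypothetical, and the field names simply mirror the keys in the dispatch shown above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dispatch:
    """A hypothetical dispatch record: paths and a task, never spec contents."""
    role: str
    task: str
    inputs: tuple[str, ...]
    output: str
    state: str = "specs/.workflow-state.json"

    def render(self) -> str:
        """Format the dispatch as plain text, in the same key order as the sample."""
        lines = [f"role: {self.role}", f"task: {self.task}", "inputs:"]
        lines += [f"  - {path}" for path in self.inputs]
        lines += [f"output: {self.output}", f"state: {self.state} (read-only)"]
        return "\n".join(lines)


dispatch = Dispatch(
    role="roles/qa.md",
    task="persona testing",
    inputs=("specs/personas/luca-low-tech.md", "specs/features/*.md"),
    output="specs/persona-report-luca.md",
)
```

Because the record holds only strings that point at files, the subagent has to open the files itself, which is the point of the design.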
The state file the subagent reads. One file per project, the only place phase status, module tiers, and gate outcomes live:
{
  "schema_version": 3,
  "project": {
    "name": "claubar",
    "repo": "gitlab.com/nirmak-group/claubar",
    "ci_active": false
  },
  "modules": {
    "M1": "Lite", "M2": "Lite", "M3": "Standard", "M4": "Standard",
    "M5": "Lite", "M6": "Lite", "M7": "Skip", "M8": "Lite",
    "M9": "Lite", "M10": "Skip"
  },
  "current_phase": 14,
  "phases": {
    "13": { "status": "complete", "started_at_sha": "c904bb2" },
    "14": { "status": "in_progress", "started_at_sha": "e1f0aa8" }
  },
  "gates": {
    "phase_3_user_validates_workflow": { "result": "pass", "attempts": 1 },
    "phase_5_user_validates_specs": { "result": "pass", "attempts": 1 },
    "phase_13_functional_testing": { "result": "pass", "attempts": 2 }
  }
}
That is the whole interface. Every subagent reads paths, writes paths, returns. The state file is the only resume contract; a fresh session reconstructs the cycle from it.
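To make the resume contract concrete, here is a minimal sketch of how a fresh session might work out where to pick up, using only the keys visible in the sample state file. The function names `load_state` and `resume_point` are hypothetical, and the `"unknown"` fallback is a placeholder of mine rather than a status the framework defines.

```python
import json
from pathlib import Path

STATE_PATH = Path("specs/.workflow-state.json")


def load_state(path: Path = STATE_PATH) -> dict:
    """Read the state file from disk; nothing is cached between calls."""
    return json.loads(path.read_text(encoding="utf-8"))


def resume_point(state: dict) -> dict:
    """Summarise where a fresh session would pick the cycle up."""
    current = state["current_phase"]
    # Phase keys are strings in the JSON sample ("13", "14").
    phase = state.get("phases", {}).get(str(current), {})
    # Gates that needed more than one attempt are worth surfacing on resume.
    retried = sorted(
        name
        for name, gate in state.get("gates", {}).items()
        if gate.get("attempts", 1) > 1
    )
    return {
        "phase": current,
        "status": phase.get("status", "unknown"),  # placeholder, not a framework status
        "started_at_sha": phase.get("started_at_sha"),
        "retried_gates": retried,
    }
```

Calling `resume_point(load_state())` in a project directory would return the current phase, its status, the commit it started at, and any gates that took more than one attempt.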
The 7 roles
The orchestrator dispatches each role as a subagent with a single contract and a characteristic output.
- Product Manager. Owns the what and the why. Enriches the source request, writes the spec (DoD in Gherkin, personas, features, user stories, NFRs, traceability matrix), and makes go/no-go decisions against quality thresholds.
- Workflow Architect. Owns the dial. Scores the project across the 8 profile dimensions, sets the tier for each of the 10 modules, and enforces cross-module consistency (for example, Code Review tier cannot exceed Architecture tier + 1).
- Architect. Owns the how at system level. C4 model at the M1 tier, ADRs, the five mandatory sections (environment, IaC, CI/CD, rollback, observability), three-tier code review, plus a gate that validates the architecture's AI surface against the AI/LLM tier set in Phase 2.
- Developer. Owns the how at code level. Strict TDD, the 70/20/10 testing pyramid, reversible migrations, the AI eval substrate (golden set, harness, drift baseline, prompt-injection tests) when applicable, deployments, and post-deployment observability.
- QA. Owns testing rigor. Enriches scenarios at source, runs automated tests, dispatches persona and naive User Tester subagents, runs WCAG audits, files bugs, finalises the traceability matrix with defect metrics and gate statuses.
- Security Expert. Owns assurance. OWASP Top 10, dependency scanning, SBOM, auth and API review, GDPR where it applies, AI-specific security (prompt injection, adversarial input, output sanitisation), and security-relevant acceptance criteria on Feature and User Story files.
- User Tester. A naive evaluator with zero knowledge of the spec. Receives only the URL and one natural-language goal, interacts via the MCP browser, documents what surprises a first-time user. Multiple instances can be dispatched against different goals.
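The Workflow Architect's consistency rule (Code Review tier cannot exceed Architecture tier + 1) can be sketched as a one-line comparison. This is a sketch under stated assumptions: the tiers are ranked Skip < Lite < Standard < Full in single steps, M1 is the architecture module, and M3 is the code review module (inferred from the order of the module list, not confirmed by the framework). `TIER_RANK` and `code_review_within_bound` are hypothetical names.

```python
# Assumed ordering of the tiers, one step apart.
TIER_RANK = {"Skip": 0, "Lite": 1, "Standard": 2, "Full": 3}


def code_review_within_bound(
    modules: dict[str, str],
    architecture: str = "M1",  # assumption: M1 is architecture depth
    code_review: str = "M3",   # assumption: M3 is code review
) -> bool:
    """Check that the Code Review tier is at most one step above Architecture."""
    return TIER_RANK[modules[code_review]] <= TIER_RANK[modules[architecture]] + 1


# Module tiers for the two relevant modules, taken from the sample state file.
claubar_modules = {"M1": "Lite", "M3": "Standard"}
```

Under these assumptions the sample configuration passes: Standard is exactly one step above Lite. Setting M1 to Skip while keeping M3 at Standard would fail the check.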
The 22 phases
One short paragraph per phase, in the order they run.
specs/, verifies access, and records the source request verbatim.
Claubar, run through the framework
Claubar is the Waybar fork I shipped through Black Box App Factory: a permanent Claude session in a drop-down pane below my Linux status bar. Here is what each phase produced for it. Most modules ran at Lite. Claubar is the smallest project that still exercises every phase; on a regulated or accessibility-heavy app, the same phases run at Full.
Coming next
Two properties of Black Box App Factory deserve their own write-ups. Hints below; full articles to follow.
The contextless approach. The orchestrator never holds the project in working memory. The truth lives in files, subagents are dispatched with paths rather than spec contents, and a crashed session is a non-event. This is what makes the framework scale past any single conversation, and what makes it AI-agnostic: any harness that reads files and dispatches subagents can run it.
A different shape of test phase. The testing block (phases 12 to 18) is the part I have iterated on the hardest. Persona testing where each persona subagent files a per-persona report and QA writes a cross-persona analysis with a worst-dimension table. A User Tester role that stays naive to the spec. A stability gate that decides whether a build holds up under sustained use. Mundane on their own. Combined, and run every cycle by an agent that does not get tired, they catch what end-of-sprint QA misses.
Not a manifesto
Black Box App Factory is not a thought experiment. It is the SDLC I run. Claubar shipped through it; more is in flight. MIT-licensed, alpha, public: gitlab.com/nirmak-group/blackboxappfactory.
This is the bridge in code: the operational discipline I used to sell as a consultant, now running itself.