I built the SDLC I always wanted, and let an agent run it

Fifteen years building software and fixing how engineering organisations work, five of them coaching agile transformations, taught me which gates, handoffs, and traceability links the teams that ship actually keep alive. Black Box App Factory is that distilled discipline, encoded as 22 phases an AI agent runs end-to-end, with humans only at the gates that matter.

The pattern in the teams that shipped

For years, my consulting work has been operational transformation: walking into teams that build software, finding out what was holding their delivery together when it flowed, and where the same teams later lost their edge. Different industries, different stacks. The shape of the winning practice repeated.

Requirements that traced back to a single source the developer actually read at sprint planning. Acceptance criteria written before the test, then linked. Architecture decisions captured in a place that survived the meeting and stayed binding three months later. Gates that someone owned, with consequences. By the time QA touched the build, the trace from a passing test back to a user story was a one-click hop.

None of this is exotic. The artifacts a healthy SDLC produces (traceability matrices, decision logs, gate evaluations, role handoffs with full context) are not hard to produce. The hard part is keeping them alive.

I had spent a lot of professional time studying what the people who owned those gates did, and why the matrix stayed alive in their teams.

Why an agent can hold the line

The other side of the same coin is why even good teams eventually drift. People do not let the gate lapse because they stop caring. They let it lapse because they have eight other things demanding the same attention, and skipping the gate this once has no immediate consequence.

An agent has none of that.

An agent does not get tired of writing the traceability row. It does not negotiate down the rigor of a security review because sprint capacity was tight. It does not forget which user story a test was meant to verify. If you tell it the gate exists and to evaluate it, it evaluates it. Every time.

So I sat down and wrote the SDLC I would have given my best teams, if attention had been infinite.

The discipline I had watched the best teams maintain was discipline an orchestrator-shaped AI could simply run.

What Black Box App Factory is

Black Box App Factory is an SDLC framework for AI-driven software development. The methodology is markdown-only and AI-agnostic, so any harness that can read files and dispatch agents can run it.

The substance:

Phase map

This is the shape of a cycle. Read left to right, top to bottom.

Setup
0 Project Setup
Requirements
1 Requirements Gathering
2 Workflow Architecture
3 User Validates Workflow
4 Specification
5 User Validates Specs
Architecture
6 Architecture (M1)
7 Implementation Planning (M2)
CI/CD Setup
8 CI/CD Pipeline Setup (if ci_active)
Development
9 Development
10 Code Review (M3)
11 Security Audit (M4)
Testing
12 QA Preparation
13 Functional Testing (M5)
14 Persona & Accessibility (M6+M7)
16 User Testing
17 QA Go/No-Go
18 Stability Verification (M8)
Acceptance
19 PM Final Review
20 User Acceptance
Deployment & Ops
21 Progressive Deployment (M9)
22 Operations & SLO Watch (M10)

Legend: Core, Configurable (M1-M10), Conditional, User gate

Under the hood

Two artefacts hold the framework together: the dispatch the orchestrator hands a subagent (a list of file paths, never spec contents), and the state file the orchestrator reads at the start of every dispatch.

# Phase 14, Step 1: persona testing
orchestrator dispatch:
  role:    roles/qa.md
  task:    persona testing
  inputs:
    - specs/personas/luca-low-tech.md
    - specs/features/*.md
    - specs/user-stories/*.md
    - app at http://localhost:5173
  output:  specs/persona-report-luca.md
  state:   specs/.workflow-state.json (read-only)
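
To make the "paths, never contents" idea concrete, here is a minimal Python sketch of a dispatch as a data structure. The framework itself is markdown-only, so everything here is illustrative: the Dispatch class, the is_reference heuristic, and the 200-character limit are assumptions of this sketch, not part of Black Box App Factory. Only the field names and example values come from the dispatch above.

```python
from dataclasses import dataclass

STATE_PATH = "specs/.workflow-state.json"


def is_reference(value: str) -> bool:
    """Heuristic (an assumption of this sketch): a reference is one short line,
    not a pasted block of spec content."""
    return bool(value) and "\n" not in value and len(value) <= 200


@dataclass(frozen=True)
class Dispatch:
    """Hypothetical model of one orchestrator-to-subagent handoff."""
    role: str                 # path to the role definition
    task: str                 # short task label
    inputs: tuple[str, ...]   # paths, globs, or URLs the subagent resolves itself
    output: str               # path the subagent is expected to write
    state: str = STATE_PATH   # read-only for the subagent

    def __post_init__(self) -> None:
        # Refuse anything that looks like inlined content rather than a reference.
        for value in (self.role, self.output, self.state, *self.inputs):
            if not is_reference(value):
                raise ValueError(
                    f"dispatch must carry references, not contents: {value[:40]!r}"
                )

    def render(self) -> str:
        """Render the dispatch in the same layout as the example above."""
        lines = [
            "orchestrator dispatch:",
            f"  role:    {self.role}",
            f"  task:    {self.task}",
            "  inputs:",
            *(f"    - {item}" for item in self.inputs),
            f"  output:  {self.output}",
            f"  state:   {self.state} (read-only)",
        ]
        return "\n".join(lines)


persona_dispatch = Dispatch(
    role="roles/qa.md",
    task="persona testing",
    inputs=(
        "specs/personas/luca-low-tech.md",
        "specs/features/*.md",
        "specs/user-stories/*.md",
        "app at http://localhost:5173",
    ),
    output="specs/persona-report-luca.md",
)
print(persona_dispatch.render())
```

The guard in __post_init__ is the point of the sketch: a handoff that tries to carry a multi-line blob of spec text fails before it is ever sent, which keeps the subagent reading from files rather than from the orchestrator's context.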

The state file the subagent reads. One file per project, the only place phase status, module tiers, and gate outcomes live:

{
  "schema_version": 3,
  "project": {
    "name": "claubar",
    "repo": "gitlab.com/nirmak-group/claubar",
    "ci_active": false
  },
  "modules": {
    "M1": "Lite", "M2": "Lite", "M3": "Standard", "M4": "Standard",
    "M5": "Lite", "M6": "Lite", "M7": "Skip",     "M8": "Lite",
    "M9": "Lite", "M10": "Skip"
  },
  "current_phase": 14,
  "phases": {
    "13": { "status": "complete",    "started_at_sha": "c904bb2" },
    "14": { "status": "in_progress", "started_at_sha": "e1f0aa8" }
  },
  "gates": {
    "phase_3_user_validates_workflow": { "result": "pass", "attempts": 1 },
    "phase_5_user_validates_specs":    { "result": "pass", "attempts": 1 },
    "phase_13_functional_testing":     { "result": "pass", "attempts": 2 }
  }
}

That is the whole interface. Every subagent reads paths, writes paths, returns. The state file is the only resume contract; a fresh session reconstructs the cycle from it.
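
As a rough illustration of what "a fresh session reconstructs the cycle from it" could look like, the Python sketch below derives a resume point from a trimmed copy of the state file above. The resume_point function, the "not_started" fallback status, and the idea of listing retried gates are assumptions of this sketch. In a real run the JSON would be read from specs/.workflow-state.json rather than inlined.

```python
import json

# Trimmed copy of the state file shown above, inlined so the sketch is self-contained.
STATE_JSON = """
{
  "schema_version": 3,
  "current_phase": 14,
  "phases": {
    "13": { "status": "complete",    "started_at_sha": "c904bb2" },
    "14": { "status": "in_progress", "started_at_sha": "e1f0aa8" }
  },
  "gates": {
    "phase_13_functional_testing": { "result": "pass", "attempts": 2 }
  }
}
"""


def resume_point(state: dict) -> dict:
    """Work out where a fresh session should pick up, using only the state file."""
    phase = state["current_phase"]
    entry = state["phases"].get(str(phase), {})
    return {
        "phase": phase,
        # "not_started" is a hypothetical status for phases with no entry yet.
        "status": entry.get("status", "not_started"),
        "started_at_sha": entry.get("started_at_sha"),
        # Gates that needed more than one attempt, as context for the new session.
        "retried_gates": sorted(
            name
            for name, gate in state.get("gates", {}).items()
            if gate["attempts"] > 1
        ),
    }


state = json.loads(STATE_JSON)
print(resume_point(state))
```

Nothing in the function depends on conversation history, which is the property the state file is meant to guarantee.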

The 7 roles

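The orchestrator dispatches each role as a subagent with a single contract and a characteristic output. The roles named in the phase descriptions below are the PM, Workflow Architect, Architect, Developer, Security Expert, QA, and User Tester.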

The 22 phases

One short paragraph per phase, in the order they run.

Setup
0
Project Setup. The user hands over the raw project brief and, when relevant, the repository and CI choice. The orchestrator scaffolds specs/, verifies access, and records the source request verbatim.
Requirements
1
Requirements Gathering. The PM enriches the source request with vision, target users, key workflows, success metrics, assumptions, and constraints.
2
Workflow Architecture. The Workflow Architect scores the project across 8 dimensions and assigns a Skip / Lite / Standard / Full tier to each of the 10 modules, with cross-module consistency rules. The dimensions: scale, integration complexity, AI/LLM components (none / consumer / producer), deployment target, user diversity (single user / team / public-facing), compliance, security sensitivity, accessibility.
3
User Validates Workflow. User gate. The user signs off on the tier configuration before any spec is written, or sends back targeted feedback.
4
Specification. The PM writes the definition of done in Gherkin, at least three personas (including a low-tech and an accessibility persona), features, user stories, non-functional requirements, and the traceability matrix. A cross-check subagent validates internal consistency.
5
User Validates Specs. User gate.
Architecture
6
Architecture (M1). The Architect designs the system with the C4 model at the depth set by M1, captures ADRs, documents the five mandatory sections (environment, IaC, CI/CD, rollback, observability), and clears a gate that checks the architecture's AI surface matches the AI/LLM tier set in Phase 2.
7
Implementation Planning (M2). The Developer plans the build as ordered, independent work groups. When M2 is Full, the Architect validates the plan and reviews any AI prompts.
CI/CD Setup
8
CI/CD Pipeline Setup. Conditional. When the project uses CI/CD, the pipeline is authored from the architecture's CI/CD strategy, the Architect reviews it against that strategy, and a smoke test verifies it runs green.
Development
9
Development. The Developer implements user stories via strict TDD (Red-Green-Refactor) against a 70/20/10 testing pyramid, writes reversible migrations, and builds the AI eval substrate (golden set, eval harness, drift baseline, prompt-injection tests) when AI is in scope.
10
Code Review (M3). The Architect runs a three-tier review: Tier 1 correctness and security, Tier 2 architecture, Tier 3 quality. Lite or Standard security scanning rides inline here.
11
Security Audit (M4). Conditional. When M4 is Full, the Security Expert runs a standalone OWASP Top 10 audit, dependency scan, SBOM, auth and API review, plus GDPR and AI-security checks where they apply. Lower tiers fold the audit inline into Phase 10.
Testing
12
QA Preparation. QA resolves any open spec questions, collects API credentials, enriches Feature and User Story test scenarios additively at source, and starts the dev server.
13
Functional Testing (M5). QA executes automated scenarios, files bugs, the Developer fixes Critical and High before proceeding, QA re-verifies. M5 = Full adds an MCP smoke test.
14
Persona & Accessibility (M6+M7). QA dispatches persona subagents (M6 = Full: every persona; Standard: two key personas) and runs a WCAG 2.1 AA audit at the M7 tier (Full: manual; Lite: automated only). Phase 15, a former standalone accessibility phase, was folded in here.
16
User Testing. QA dispatches one or more naive User Tester subagents, each isolated from specs, given the URL and one natural-language goal. They explore, document friction, file bugs. QA finalises the traceability matrix here, with defect density, DRE, and quality gate statuses.
17
QA Go/No-Go. The PM weighs test reports, persona feedback, user testing, bugs, and the traceability matrix against quality gates, runs a deferred-bug resolution loop, and makes a go or no-go call.
18
Stability Verification (M8). Performance, load, stress, soak, and spike testing at the tier M8 sets. Full runs the full battery; Standard runs performance and load only.
Acceptance
19
PM Final Review. The PM drives the running app via MCP, compares it against the original vision and the spec, writes the final report, and asks the user whether they want a guided demo before deployment.
20
User Acceptance. User gate. The user reviews the final report or runs the guided MCP demo, then accepts or sends back targeted feedback.
Deployment & Ops
21
Progressive Deployment (M9). Canary deployment with metric checks against NFR thresholds when M9 is Full; simpler deploys at Standard or Lite. Observability is confirmed flowing before full rollout.
22
Operations & SLO Watch (M10). Production health, SLO monitoring, and AI-drift monitoring at the M10 tier. If M10 is Skip, the cycle ends here.
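
The conditional phases in the list above can be sketched as a small Python function that turns the CI flag and the module tiers into the set of phases that run standalone. This is a sketch under stated assumptions: planned_phases is a hypothetical helper, and treating an M10 of Skip as "no standalone Phase 22" is one reading of the Phase 22 description rather than a documented rule. The example values are the ones from the state file shown earlier.

```python
# Phase 15 was folded into Phase 14, so the numbering runs 0-22 without 15.
ALL_PHASES = [n for n in range(23) if n != 15]


def planned_phases(ci_active: bool, modules: dict[str, str]) -> list[int]:
    """Return the phase numbers that run as standalone phases in one cycle."""
    skipped = set()
    if not ci_active:
        skipped.add(8)    # CI/CD Pipeline Setup only runs when the project uses CI/CD
    if modules.get("M4") != "Full":
        skipped.add(11)   # lower M4 tiers fold the security audit into Phase 10
    if modules.get("M10") == "Skip":
        skipped.add(22)   # assumption: a skipped M10 leaves nothing to run in Phase 22
    return [n for n in ALL_PHASES if n not in skipped]


claubar_modules = {
    "M1": "Lite", "M2": "Lite", "M3": "Standard", "M4": "Standard",
    "M5": "Lite", "M6": "Lite", "M7": "Skip", "M8": "Lite",
    "M9": "Lite", "M10": "Skip",
}
print(planned_phases(ci_active=False, modules=claubar_modules))
```

With every module at Full and CI active, the same function returns all 22 phases.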

Claubar, run through the framework

Claubar is the Waybar fork I shipped through Black Box App Factory: a permanent Claude session in a drop-down pane below my Linux status bar. Here is what each phase produced for it. Most modules ran at Lite. Claubar is the smallest project that still exercises every phase; on a regulated or accessibility-heavy app, the same phases run at Full.

Setup
0
Project Setup. Source request: "I want a permanent Claude session in a drop-down pane below my status bar."
Requirements
1
Requirements Gathering. Stories around always-on access, no Waybar config disruption, fire-and-forget toggling.
2
Workflow Architecture. Single-user desktop alpha profile. Most modules at Lite. Security at Standard (the pane hosts an agent with shell access). Code Review raised to Standard to match: the consistency rule M3 ≤ M1+1 allows it, and the VTE / shell integration is the highest-risk surface.
3
User Validates Workflow. Approved.
4
Specification. Waybar fork, one extra top-level config key, VTE-based lower pane, existing Waybar modules and CSS untouched. Primary persona (the alpha user) plus an accessibility persona retained in the backlog for the desktop track.
5
User Validates Specs. Approved.
Architecture
6
Architecture (M1). VTE for the terminal pane, Waybar's GTK base preserved, drop-down isolated as a single widget under the bar. ADRs captured the VTE-vs-embedded-terminal trade-off and the single-config-key contract.
7
Implementation Planning (M2). Ordered work groups: fork Waybar, add the config key, wire the VTE pane, implement the toggle, verify module compatibility.
CI/CD Setup
8
CI/CD Pipeline Setup. Skipped. No CI for the alpha.
Development
9
Development. Fork compiles, pane renders, prompt accepts input, session persists across Waybar restarts.
10
Code Review (M3). Standard. Tier 1 focused on the VTE integration boundary and the config-key surface (the two places an injection bug could escape).
11
Security Audit (M4). Standard, inline. Scope: credential handling, permission boundary, and what runs unsupervised.
Testing
12
QA Preparation. Scenarios for hide and show, focus capture, persistence across Waybar restarts, behaviour alongside existing module configurations.
13
Functional Testing (M5). Lite, happy path: pane opens, prompt works, session survives a Waybar restart.
14
Persona & Accessibility (M6+M7). Lite. Primary persona (the alpha user). Accessibility deferred at this tier; flagged in the backlog for the desktop track.
16
User Testing. Me, daily-driver use across several days. Friction notes captured against the goal: "ask Claude something without leaving the keyboard."
17
QA Go/No-Go. Go, with alpha rough edges noted.
18
Stability Verification (M8). Lite. Soak under normal use across a week, journal clean, no crashes.
Acceptance
19
PM Final Review. Alpha scope accepted against the original vision.
20
User Acceptance. Accepted.
Deployment & Ops
21
Progressive Deployment (M9). Lite. Open source on GitLab, MIT, README and the blog post.
22
Operations & SLO Watch (M10). Skip. Personal alpha, no SLOs to watch.
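
The consistency rule quoted in Phase 2 of this run, M3 ≤ M1+1, can be checked mechanically once the tiers are given an order. The Python sketch below assumes the ordering Skip < Lite < Standard < Full and reads "+1" as one tier step. The Tier enum and both helper functions are hypothetical, and this is the only cross-module rule the article states, so the sketch does not guess at the others.

```python
from enum import IntEnum


class Tier(IntEnum):
    """Assumed ordinal encoding of the four tiers."""
    SKIP = 0
    LITE = 1
    STANDARD = 2
    FULL = 3


def parse_tier(name: str) -> Tier:
    """Map a tier name from the state file ("Lite", "Standard", ...) to a Tier."""
    return Tier[name.upper()]


def code_review_within_bound(modules: dict[str, str]) -> bool:
    """Check M3 <= M1 + 1: Code Review may sit at most one tier above Architecture."""
    return parse_tier(modules["M3"]) <= parse_tier(modules["M1"]) + 1


claubar = {"M1": "Lite", "M3": "Standard"}
print(code_review_within_bound(claubar))  # True: Standard is one step above Lite
```

Under this reading, Standard is the highest tier Code Review could take while Architecture stays at Lite.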

Markdown framework · Claude Code · File-based state machine · Multi-agent orchestration · 22 phases / 7 roles / 10 modules · MIT

Coming next

Two properties of Black Box App Factory deserve their own write-ups. Hints below; full articles to follow.

The contextless approach. The orchestrator never holds the project in working memory. The truth lives in files, subagents are dispatched with paths rather than spec contents, and a crashed session is a non-event. This is what makes the framework scale past any single conversation, and what makes it AI-agnostic: any harness that reads files and dispatches subagents can run it.

A different shape of test phase. The testing block (phases 12 to 18) is the part I have iterated on the hardest. Persona testing where each persona subagent files a per-persona report and QA writes a cross-persona analysis with a worst-dimension table. A User Tester role that stays naïve to the spec. A stability gate that decides whether a build holds up under sustained use. Mundane on their own. Combined, run every cycle by an agent that does not get tired, they catch what end-of-sprint QA misses.

Not a manifesto

Black Box App Factory is not a thought experiment. It is the SDLC I run. Claubar shipped through it, and more is in flight. MIT-licensed, alpha, public: gitlab.com/nirmak-group/blackboxappfactory.

This is the bridge in code: the operational discipline I used to sell as a consultant, now running itself.

Jean-Philippe Arné
Operational Transformation Consultant & AI Agent Builder
I fix operational bottlenecks during the day, for my clients. I build AI agents when I'm back home. I'm looking for the role where they stop being two separate jobs.