
I built the SDLC I always wanted, and let an agent run it

Fifteen years building software and fixing how engineering organisations work, five of them coaching agile transformations, taught me which gates, handoffs, and traceability links the teams that ship actually keep alive. Black Box App Factory is that distilled discipline, encoded as 22 phases an AI agent runs end-to-end, with humans only at the gates that matter.

The pattern in the teams that shipped

For years, my consulting work has been operational transformation: walking into teams that build software, finding out what was holding their delivery together when it flowed, and where the same teams later lost their edge. Different industries, different stacks. The shape of the winning practice repeated.

Requirements that traced back to a single source the developer actually read at sprint planning. Acceptance criteria written before the test, then linked. Architecture decisions captured in a place that survived the meeting and stayed binding three months later. Gates that someone owned, with consequences. By the time QA touched the build, the trace from a passing test back to a user story was a one-click hop.

None of this is exotic. The artifacts a healthy SDLC produces (traceability matrices, decision logs, gate evaluations, role handoffs with full context) are not hard to produce. The hard part is keeping them alive.

I had spent a lot of professional time studying what the owners of those gates did, and why the matrix stayed alive in their teams.

Why an agent can hold the line

The other side of the same coin is why even good teams eventually drift. People do not let a gate lapse because they stop caring. They let it lapse because they have eight other things demanding the same attention, and skipping the gate this once has no immediate consequence.

An agent has none of that.

An agent does not get tired of writing the traceability row. It does not negotiate down the rigor of a security review because sprint capacity was tight. It does not forget which user story a test was meant to verify. If you tell it the gate exists and to evaluate it, it evaluates it. Every time.

So I sat down and wrote the SDLC I would have given my best teams, if attention had been infinite.

The discipline I had watched the best teams maintain was discipline an orchestrator-shaped AI could simply run.

What Black Box App Factory is

Black Box App Factory is an SDLC framework for AI-driven software development. The methodology is markdown-only and AI-agnostic, so any harness that can read files and dispatch agents can run it.

The substance:

22 phases, numbered Phase 0 through Phase 22 (ID 15 is retired; ID 8 is conditional on whether CI/CD is in use). Coverage runs from project setup through requirements, architecture, CI/CD setup, development, testing, and deployment to operations.
7 roles the orchestrator dispatches as subagents: Product Manager, Workflow Architect, Architect, Developer, QA, Security Expert, User Tester.
10 configurable modules, named slices of process rigor: architecture depth, implementation planning, code review, security audit, functional testing, persona testing, accessibility, stability verification, deployment, operations. The Workflow Architect tunes each across 8 dimensions of project profile (scale, integration complexity, AI/LLM components, deployment target, user diversity, compliance, security sensitivity, accessibility). Each module has a tier, typically Skip / Lite / Standard / Full, that scales the effort on its associated phases.
A file-based state machine. A single specs/.workflow-state.json is the authoritative state. Phase outputs go to other files under specs/. The orchestrator reads files, dispatches subagents, updates the state file. It does not keep working memory in conversation context.
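
The phase numbering can be made concrete with a small sketch. It assumes only what the list above states: IDs run 0 through 22, ID 15 is retired, and ID 8 runs only when CI/CD is in use. The function name and its parameter are illustrative, not part of the framework, and the tier-dependent behaviour of other phases is not modelled.

```python
RETIRED_IDS = {15}        # Phase 15 was retired
CI_CONDITIONAL_IDS = {8}  # Phase 8 runs only when CI/CD is in use


def phase_ids(ci_active: bool) -> list[int]:
    """Return the phase IDs a cycle would visit, in order (illustrative helper)."""
    ids = [i for i in range(0, 23) if i not in RETIRED_IDS]
    if not ci_active:
        ids = [i for i in ids if i not in CI_CONDITIONAL_IDS]
    return ids


print(len(phase_ids(ci_active=True)))   # 22
print(len(phase_ids(ci_active=False)))  # 21
```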

Phase map

This is the shape of a cycle. Read top to bottom.

Setup
0 Project Setup

Requirements
1 Requirements Gathering
2 Workflow Architecture
3 User Validates Workflow (user gate)
4 Specification
5 User Validates Specs (user gate)

Architecture
6 Architecture (M1)
7 Implementation Planning (M2)

CI/CD Setup
8 CI/CD Pipeline Setup (if ci_active)

Development
9 Development
10 Code Review (M3)
11 Security Audit (M4)

Testing
12 QA Preparation
13 Functional Testing (M5)
14 Persona & Accessibility (M6+M7)
16 User Testing
17 QA Go/No-Go
18 Stability Verification (M8)

Acceptance
19 PM Final Review
20 User Acceptance (user gate)

Deployment & Ops
21 Progressive Deployment (M9)
22 Operations & SLO Watch (M10)

Legend: core, configurable (M1-M10), conditional, user gate.

Under the hood

Two artefacts hold the framework together: the dispatch the orchestrator hands a subagent (a list of file paths, never spec contents), and the state file the orchestrator reads at the start of every dispatch.

# Phase 14, Step 1: persona testing
orchestrator dispatch:
  role:    roles/qa.md
  task:    persona testing
  inputs:
    - specs/personas/luca-low-tech.md
    - specs/features/*.md
    - specs/user-stories/*.md
    - app at http://localhost:5173
  output:  specs/persona-report-luca.md
  state:   specs/.workflow-state.json (read-only)
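
One way to picture the paths-only rule is as a small data structure that refuses anything that looks like inlined content. This is a sketch under assumptions: the class name, the field names, and the single-line heuristic are mine for illustration, not the framework's own format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dispatch:
    """Hypothetical in-memory form of a dispatch: references only, no spec text."""
    role: str                 # path to the role definition
    task: str                 # short task label
    inputs: tuple[str, ...]   # file paths, globs, or a URL
    output: str               # path the subagent writes to
    state: str = "specs/.workflow-state.json"  # read-only for the subagent

    def __post_init__(self) -> None:
        for ref in (self.role, self.output, self.state, *self.inputs):
            # Crude assumption: a reference is one short line; multi-line
            # text suggests spec contents were pasted into the dispatch.
            if "\n" in ref or len(ref) > 200:
                raise ValueError(f"expected a path or URL, got inline content: {ref[:40]!r}")


persona_test = Dispatch(
    role="roles/qa.md",
    task="persona testing",
    inputs=(
        "specs/personas/luca-low-tech.md",
        "specs/features/*.md",
        "specs/user-stories/*.md",
        "http://localhost:5173",
    ),
    output="specs/persona-report-luca.md",
)
```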

The state file the subagent reads. One file per project, the only place phase status, module tiers, and gate outcomes live:

{
  "schema_version": 3,
  "project": {
    "name": "claubar",
    "repo": "gitlab.com/nirmak-group/claubar",
    "ci_active": false
  },
  "modules": {
    "M1": "Lite", "M2": "Lite", "M3": "Standard", "M4": "Standard",
    "M5": "Lite", "M6": "Lite", "M7": "Skip",     "M8": "Lite",
    "M9": "Lite", "M10": "Skip"
  },
  "current_phase": 14,
  "phases": {
    "13": { "status": "complete",    "started_at_sha": "c904bb2" },
    "14": { "status": "in_progress", "started_at_sha": "e1f0aa8" }
  },
  "gates": {
    "phase_3_user_validates_workflow": { "result": "pass", "attempts": 1 },
    "phase_5_user_validates_specs":    { "result": "pass", "attempts": 1 },
    "phase_13_functional_testing":     { "result": "pass", "attempts": 2 }
  }
}

That is the whole interface. Every subagent reads paths, writes paths, returns. The state file is the only resume contract; a fresh session reconstructs the cycle from it.
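
To make the resume contract concrete, here is a minimal sketch of how a fresh session might find its place from the state file alone. The function name is hypothetical, the "not_started" fallback is my assumption, and the logic relies only on the fields shown in the example above.

```python
import json
from pathlib import Path


def resume_point(state_path: str = "specs/.workflow-state.json") -> tuple[int, str]:
    """Return (phase_id, status) for the phase a fresh session should pick up."""
    state = json.loads(Path(state_path).read_text(encoding="utf-8"))
    phase = state["current_phase"]
    # Phase keys are strings in the JSON example, so convert before lookup.
    # "not_started" is an assumed default for phases with no entry yet.
    status = state["phases"].get(str(phase), {}).get("status", "not_started")
    return phase, status
```

Run against the example state, this returns phase 14 with status in_progress, so a new session would pick up persona testing rather than replay earlier phases.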

The 7 roles

The orchestrator dispatches each role as a subagent with a single contract and a characteristic output.

Product Manager. Owns the what and the why. Enriches the source request, writes the spec (DoD in Gherkin, personas, features, user stories, NFRs, traceability matrix), and makes go/no-go decisions against quality thresholds.
Workflow Architect. Owns the dial. Scores the project across the 8 profile dimensions, sets the tier for each of the 10 modules, and enforces cross-module consistency (for example, Code Review tier cannot exceed Architecture tier + 1).
Architect. Owns the how at system level. C4 model at the M1 tier, ADRs, the five mandatory sections (environment, IaC, CI/CD, rollback, observability), three-tier code review, plus a gate that validates the architecture's AI surface against the AI/LLM tier set in Phase 2.
Developer. Owns the how at code level. Strict TDD, the 70/20/10 testing pyramid, reversible migrations, the AI eval substrate (golden set, harness, drift baseline, prompt-injection tests) when applicable, deployments, and post-deployment observability.
QA. Owns testing rigor. Enriches scenarios at source, runs automated tests, dispatches persona and naive User Tester subagents, runs WCAG audits, files bugs, finalises the traceability matrix with defect metrics and gate statuses.
Security Expert. Owns assurance. OWASP Top 10, dependency scanning, SBOM, auth and API review, GDPR where it applies, AI-specific security (prompt injection, adversarial input, output sanitisation), and security-relevant acceptance criteria on Feature and User Story files.
User Tester. A naive evaluator with zero knowledge of the spec. Receives only the URL and one natural-language goal, interacts via the MCP browser, documents what surprises a first-time user. Multiple instances can be dispatched against different goals.
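
The consistency rule the Workflow Architect enforces can be sketched as an ordering over tiers. The example assumes the four tiers are ordered Skip < Lite < Standard < Full and encodes only the one rule quoted above (Code Review, M3, cannot exceed Architecture, M1, plus one). The function and constant names are illustrative.

```python
TIER_ORDER = ["Skip", "Lite", "Standard", "Full"]  # assumed ordering, lowest to highest


def code_review_within_limit(modules: dict[str, str]) -> bool:
    """Check the quoted rule: Code Review (M3) may not exceed Architecture (M1) + 1."""
    m1 = TIER_ORDER.index(modules["M1"])
    m3 = TIER_ORDER.index(modules["M3"])
    return m3 <= m1 + 1


print(code_review_within_limit({"M1": "Lite", "M3": "Standard"}))  # True
print(code_review_within_limit({"M1": "Skip", "M3": "Standard"}))  # False
```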

The 22 phases

One short paragraph per phase, in the order they run.

Setup

Project Setup. The user hands over the raw project brief and, when relevant, the repository and CI choice. The orchestrator scaffolds specs/, verifies access, and records the source request verbatim.

Requirements

Requirements Gathering. The PM enriches the source request with vision, target users, key workflows, success metrics, assumptions, and constraints.

Workflow Architecture. The Workflow Architect scores the project across 8 dimensions and assigns a Skip / Lite / Standard / Full tier to each of the 10 modules, with cross-module consistency rules. The dimensions: scale, integration complexity, AI/LLM components (none / consumer / producer), deployment target, user diversity (single user / team / public-facing), compliance, security sensitivity, accessibility.

User Validates Workflow. User gate. The user signs off on the tier configuration before any spec is written, or sends back targeted feedback.

Specification. The PM writes the definition of done in Gherkin, at least three personas (including a low-tech and an accessibility persona), features, user stories, non-functional requirements, and the traceability matrix. A cross-check subagent validates internal consistency.

User Validates Specs. User gate.

Architecture

Architecture (M1). The Architect designs the system with the C4 model at the depth set by M1, captures ADRs, documents the five mandatory sections (environment, IaC, CI/CD, rollback, observability), and clears a gate that checks the architecture's AI surface matches the AI/LLM tier set in Phase 2.

Implementation Planning (M2). The Developer plans the build as ordered, independent work groups. When M2 is Full, the Architect validates the plan and reviews any AI prompts.

CI/CD Setup

CI/CD Pipeline Setup. Conditional. When the project uses CI/CD, the pipeline is authored from the architecture's CI/CD strategy, the Architect reviews it against that strategy, and a smoke test verifies it runs green.

Development

Development. The Developer implements user stories via strict TDD (Red-Green-Refactor) against a 70/20/10 testing pyramid, writes reversible migrations, and builds the AI eval substrate (golden set, eval harness, drift baseline, prompt-injection tests) when AI is in scope.

Code Review (M3). The Architect runs a three-tier review: Tier 1 correctness and security, Tier 2 architecture, Tier 3 quality. Lite or Standard security scanning rides inline here.

Security Audit (M4). Conditional. When M4 is Full, the Security Expert runs a standalone OWASP Top 10 audit, dependency scan, SBOM, auth and API review, plus GDPR and AI-security checks where they apply. Lower tiers fold the audit inline into Phase 10.

Testing

QA Preparation. QA resolves any open spec questions, collects API credentials, enriches Feature and User Story test scenarios additively at source, and starts the dev server.

Functional Testing (M5). QA executes automated scenarios, files bugs, the Developer fixes Critical and High before proceeding, QA re-verifies. M5 = Full adds an MCP smoke test.
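
The blocking rule in this phase can be sketched as a simple predicate. The Bug record and its fields are hypothetical; only the rule itself (Critical and High bugs must be fixed and re-verified before the cycle moves on) comes from the description above.

```python
from dataclasses import dataclass

BLOCKING_SEVERITIES = {"Critical", "High"}


@dataclass
class Bug:
    """Hypothetical bug record; the framework's real bug format is not shown here."""
    title: str
    severity: str                 # e.g. "Critical", "High", or a lower level
    verified_fixed: bool = False  # set once QA has re-verified the fix


def may_proceed(bugs: list[Bug]) -> bool:
    """True when no Critical or High bug is still awaiting a verified fix."""
    return not any(
        b.severity in BLOCKING_SEVERITIES and not b.verified_fixed for b in bugs
    )
```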

Persona & Accessibility (M6+M7). QA dispatches persona subagents (M6 = Full: every persona; Standard: two key personas) and runs a WCAG 2.1 AA audit at the M7 tier (Full: manual; Lite: automated only). Phase 15, a former standalone accessibility phase, was folded in here.

User Testing. QA dispatches one or more naive User Tester subagents, each isolated from specs, given the URL and one natural-language goal. They explore, document friction, file bugs. QA finalises the traceability matrix here, with defect density, DRE, and quality gate statuses.
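
The two defect metrics named here have common textbook definitions, sketched below. The source does not say which exact formulas the framework uses, so treat these as assumptions: defect density as defects per thousand lines of code, and DRE (defect removal efficiency) as the share of all known defects caught before release.

```python
def defect_density(defects: int, lines_of_code: int) -> float:
    """Defects per thousand lines of code (KLOC); assumed definition."""
    if lines_of_code <= 0:
        raise ValueError("lines_of_code must be positive")
    return defects / (lines_of_code / 1000)


def defect_removal_efficiency(found_before_release: int, found_after_release: int) -> float:
    """Fraction of known defects caught before release; assumed definition."""
    total = found_before_release + found_after_release
    if total == 0:
        return 1.0  # no known defects, so nothing escaped
    return found_before_release / total


print(defect_density(6, 3000))          # 2.0
print(defect_removal_efficiency(9, 1))  # 0.9
```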

QA Go/No-Go. The PM weighs test reports, persona feedback, user testing, bugs, and the traceability matrix against quality gates, runs a deferred-bug resolution loop, and makes a go or no-go call.

Stability Verification (M8). Performance, load, stress, soak, and spike testing at the tier M8 sets. Full runs the full battery; Standard runs performance and load only.

Acceptance

PM Final Review. The PM drives the running app via MCP, compares it against the original vision and the spec, writes the final report, and asks the user whether they want a guided demo before deployment.

User Acceptance. User gate. The user reviews the final report or runs the guided MCP demo, then accepts or sends back targeted feedback.

Deployment & Ops

Progressive Deployment (M9). Canary deployment with metric checks against NFR thresholds when M9 is Full; simpler deploys at Standard or Lite. Observability is confirmed flowing before full rollout.
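
A canary metric check of this kind can be sketched as a comparison of observed values against NFR ceilings. Everything named below (metric keys, threshold values, the function) is hypothetical; the source only says that canary metrics are checked against NFR thresholds at the Full tier.

```python
def canary_breaches(observed: dict[str, float], nfr_max: dict[str, float]) -> list[str]:
    """Return the metrics whose observed value exceeds its NFR ceiling.

    A metric missing from `observed` counts as a breach, since the check
    cannot confirm it is within bounds.
    """
    breaches = []
    for metric, ceiling in nfr_max.items():
        value = observed.get(metric)
        if value is None or value > ceiling:
            breaches.append(metric)
    return breaches


# Hypothetical thresholds and canary readings.
nfr_max = {"p95_latency_ms": 300.0, "error_rate": 0.01}
observed = {"p95_latency_ms": 240.0, "error_rate": 0.02}
print(canary_breaches(observed, nfr_max))  # ['error_rate']
```

Treating a missing metric as a breach is a design choice in this sketch: it mirrors the requirement that observability be confirmed flowing before full rollout.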

Operations & SLO Watch (M10). Production health, SLO monitoring, and AI-drift monitoring at the M10 tier. If M10 is Skip, the cycle ends here.

Claubar, run through the framework

Claubar is the Waybar fork I shipped through Black Box App Factory: a permanent Claude session in a drop-down pane below my Linux status bar. Here is what each phase produced for it. Most modules ran at Lite. Claubar is the smallest project that still exercises every phase; on a regulated or accessibility-heavy app, the same phases run at Full.

Setup

Project Setup. Source request: "I want a permanent Claude session in a drop-down pane below my status bar."

Requirements

Requirements Gathering. Stories around always-on access, no Waybar config disruption, fire-and-forget toggling.

Workflow Architecture. Single-user desktop alpha profile. Most modules at Lite. Security at Standard (the pane hosts an agent with shell access). Code Review raised to Standard to match, which the consistency rule allows (M3 ≤ M1 + 1, with M1 at Lite), because the VTE / shell integration is the highest-risk surface.

User Validates Workflow. Approved.

Specification. Waybar fork, one extra top-level config key, VTE-based lower pane, existing Waybar modules and CSS untouched. Primary persona (the alpha user) plus an accessibility persona retained in the backlog for the desktop track.

User Validates Specs. Approved.

Architecture

Architecture (M1). VTE for the terminal pane, Waybar's GTK base preserved, drop-down isolated as a single widget under the bar. ADRs captured the VTE-vs-embedded-terminal trade-off and the single-config-key contract.

Implementation Planning (M2). Ordered work groups: fork Waybar, add the config key, wire the VTE pane, implement the toggle, verify module compatibility.

CI/CD Setup

CI/CD Pipeline Setup. Skipped. No CI for the alpha.

Development

Development. Fork compiles, pane renders, prompt accepts input, session persists across Waybar restarts.

Code Review (M3). Standard. Tier 1 focused on the VTE integration boundary and the config-key surface (the two places an injection bug could escape).

Security Audit (M4). Standard, inline. Scope: credential handling, permission boundary, and what runs unsupervised.

Testing

QA Preparation. Scenarios for hide and show, focus capture, persistence across Waybar restarts, behaviour alongside existing module configurations.

Functional Testing (M5). Lite, happy path: pane opens, prompt works, session survives a Waybar restart.

Persona & Accessibility (M6+M7). Lite. Primary persona (the alpha user). Accessibility deferred at this tier; flagged in the backlog for the desktop track.

User Testing. Me, daily-driver use across several days. Friction notes captured against the goal: "ask Claude something without leaving the keyboard."

QA Go/No-Go. Go, with alpha rough edges noted.

Stability Verification (M8). Lite. Soak under normal use across a week, journal clean, no crashes.

Acceptance

PM Final Review. Alpha scope accepted against the original vision.

User Acceptance. Accepted.

Deployment & Ops

Progressive Deployment (M9). Lite. Open source on GitLab, MIT, README and the blog post.

Operations & SLO Watch (M10). Skip. Personal alpha, no SLOs to watch.

Markdown framework · Claude Code · File-based state machine · Multi-agent orchestration · 22 phases / 7 roles / 10 modules · MIT

Coming next

Two properties of Black Box App Factory deserve their own write-ups. Hints below; full articles to follow.

The contextless approach. The orchestrator never holds the project in working memory. The truth lives in files, subagents are dispatched with paths rather than spec contents, and a crashed session is a non-event. This is what makes the framework scale past any single conversation, and what makes it AI-agnostic: any harness that reads files and dispatches subagents can run it.

A different shape of test phase. The testing block (phases 12 to 18) is the part I have iterated on the hardest. Persona testing where each persona subagent files a per-persona report and QA writes a cross-persona analysis with a worst-dimension table. A User Tester role that stays naive to the spec. A stability gate that decides whether a build holds up under sustained use. Mundane on their own. Combined, and run every cycle by an agent that does not get tired, they catch what end-of-sprint QA misses.

Not a manifesto

Black Box App Factory is not a thought experiment. It is the SDLC I run. Claubar shipped through it, and more is in flight. MIT-licensed, alpha, public: gitlab.com/nirmak-group/blackboxappfactory.

This is the bridge in code: the operational discipline I used to sell as a consultant, now running itself.

Jean-Philippe Arné

AI Transformation Lead

I lead AI transformation: challenging existing workflows, then building the automation where it belongs.
