A workflow is a state machine of specialized agents, gated by humans and punctuated by deterministic commands. Each agent runs in its own Docker container, or shares one while the policy engine hot-swaps personas at every transition. Every transition is checkpointed, so a run survives crashes and resumes cleanly.
A definition is plain YAML. You pick the model: Anthropic, OpenAI, Google, or an open-weight alternative like GLM 5.1. Four kinds of state compose it:
agent: runs an AI agent with a role-specific prompt and persona-scoped policy
human_gate: pauses for your review, with approve, revise, or abort
deterministic: runs shell commands (tests, lint, typecheck) without an LLM
terminal: end state (success or aborted)
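The four kinds might compose like this. The schema below is a sketch, not IronCurtain's actual format: every key name, state name, and path is a hypothetical stand-in for how a vuln-discovery pipeline could be wired.

```yaml
# Hypothetical schema -- illustrative only, not the real IronCurtain YAML.
workflow: vuln-discovery
model: anthropic              # or openai, google, an open-weight model, ...
states:
  orchestrate:
    kind: agent
    prompt: prompts/orchestrator.md      # role-specific prompt (invented path)
    next: build_harness
  build_harness:
    kind: agent
    prompt: prompts/harness_designer.md
    next: validate
  validate:
    kind: deterministic                  # no LLM: build and run the harness
    run: ["make harness", "./harness"]
    on_fail: harness_review              # retry cap exhausted -> human gate
    next: report_review
  harness_review:
    kind: human_gate                     # approve / revise / abort
    on_approve: validate
    on_revise: build_harness             # your feedback becomes the next prompt
    on_abort: aborted
  report_review:
    kind: human_gate
    on_approve: done
    on_revise: orchestrate               # send the orchestrator back out
    on_abort: aborted
  done:    { kind: terminal, result: success }
  aborted: { kind: terminal, result: aborted }
```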
> Hero case: vulnerability discovery
Point the workflow at a file, a directory, or a subsystem. Give it a task description as simple as “find a vulnerability in h264_slice.c”. Twelve rounds later you get a triaged report with reproduction steps, coverage evidence, and a severity call. Or “no exploitable vulnerability found,” which is a valid outcome.
Static analysis is cheap, and models hallucinate when they stop there. Building a harness that actually proves or rules out a hypothesis is the expensive part, and that is what this workflow automates.
With minimal human guidance, the workflow has reproduced two Claude Mythos Preview findings: the 27-year-old OpenBSD TCP SACK bug (a commit I landed in November 1998), and the 16-year-old heap overwrite in h264_slice.c. Against current FFmpeg it also surfaced a new zero-day, reported to the maintainers.
> Human gates
The run pauses for you at two moments. Each gate surfaces the relevant artifacts and accepts three actions: APPROVE, FORCE_REVISION, ABORT. Revision feedback is forwarded verbatim into the next agent’s prompt, so the text you write becomes the next round’s directive.
harness_review
Fires when the harness design-review or validate loop hits its retry cap.
You see: analysis, latest design, reviewer notes, validation report. Push through, redirect to a different tier, or abort.
report_review
Fires when the orchestrator declares the investigation complete.
You see: the final report, raw discoveries, triage, full journal. Approve to finish. Revision sends the orchestrator back to re-investigate. Human feedback overrides agent verdicts.
> Harness tiers
Each hypothesis deserves a different amount of infrastructure. The orchestrator picks a tier and the designer matches it.
T1 Isolated function. One function extracted verbatim, stubs for I/O. Millions of trials per second. Good for single-function arithmetic bugs.
T2 Multi-component. Real source files linked together, real data structures, real inter-function calls. Good for cross-function interactions and sentinel collisions.
T3 Full instrumented build. Actual project with sanitizers and coverage, crafted inputs through real entry points. Good for protocol framing and global state.
> Run it
$ ironcurtain daemon --web-ui --no-signal
Web UI: http://127.0.0.1:7400?token=…
IronCurtain daemon started.
Open the URL, pick vuln-discovery, point it at a workspace, and start.