
prompt-language: Engineering Retrospective

A DSL for writing prompt workflows — why structured prompt languages beat ad-hoc strings, and where the line between useful and overengineered sits.


prompt-language: A Structured Language for Prompt Workflows

Ad-hoc prompt strings work until they don’t. As long as you’re writing one-off queries or simple instruction prompts, raw strings are fine. Once you need flow control — retry on condition, branch on output, enforce a gate before proceeding — you’re either reimplementing those constructs in your application code on every project or you’re reaching for something that treats prompts as first-class structured objects. prompt-language was built to be that something: a domain-specific language for defining prompt workflows with explicit flow control, checkpoints, and gate enforcement.
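The flow-control constructs named above can be sketched in plain Python to show what the DSL makes first-class. This is a minimal illustration, not the actual prompt-language syntax or executor; the names `retry` and `branch` are hypothetical:

```python
# Hypothetical sketch: retry-on-condition and branch-on-output,
# the constructs you'd otherwise reimplement in application code.

def retry(step, should_retry, max_attempts=3):
    """Re-run a step while its output still fails a condition."""
    def wrapped(state):
        out = None
        for _ in range(max_attempts):
            out = step(state)
            if not should_retry(out):
                break
        return out
    return wrapped

def branch(selector, routes):
    """Pick the next step based on the previous step's output."""
    def wrapped(state):
        return routes[selector(state)](state)
    return wrapped
```

Writing these by hand per project is exactly the duplication a workflow definition eliminates.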

What Changed

The core design question was where to put the logic. One option is to keep prompts as dumb strings and handle flow in the surrounding application code. The problem with that approach is that the flow logic ends up tightly coupled to whatever framework or language you’re using, making it hard to port, reuse, or hand off to another system. The other option — encoding the flow in the prompt definition itself — means the prompt file is self-describing. You can read a workflow definition and understand what it does without tracing through application logic.

prompt-language landed on the second approach. A workflow definition specifies the sequence of prompt steps, the conditions under which the workflow advances or retries, and the gates that must be passed before certain steps execute. Gates were the most important design decision: a gate is an explicit assertion that some condition is true before proceeding, and a failed gate produces a structured error rather than silently continuing with bad state. This made workflows fail loudly at the right point rather than producing subtly wrong outputs several steps later.
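A gate's behavior can be sketched as follows. `GateError` and `run_gate` are illustrative names, not the actual prompt-language API; the point is the shape: an explicit predicate on workflow state, and a structured error on failure rather than silent continuation:

```python
# Hypothetical sketch of gate enforcement.

class GateError(Exception):
    """Structured failure: which gate failed, and why."""
    def __init__(self, gate, reason):
        super().__init__(f"gate '{gate}' failed: {reason}")
        self.gate = gate
        self.reason = reason

def run_gate(name, predicate, state):
    """Assert a condition on workflow state before the next step runs."""
    ok, reason = predicate(state)
    if not ok:
        raise GateError(name, reason)
    return state

# Example gate: the previous step must have produced a non-empty summary.
def has_summary(state):
    return (bool(state.get("summary", "").strip()), "summary is empty")
```

The structured error carries the gate name and reason, so the workflow fails loudly at the gate rather than several steps later.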

The language also needed to handle the autonomous agent use case, where a workflow might run without human review at each step. For that, checkpoints were added — named points in the workflow where state is persisted and from which execution can be resumed if a step fails or the process is interrupted. This turned out to be essential for any workflow longer than three or four steps, because LLM API calls are unreliable enough that assuming you’ll get through a ten-step workflow without interruption is optimistic.

Why It Mattered

The practical benefit was reusability. Once a workflow was defined in prompt-language, it could be run by any executor that understood the format, tested in isolation, and modified without touching application code. Debugging also improved: because each step’s input and output were explicit, tracing a failure meant looking at the checkpoint logs rather than setting breakpoints in application code.

The deeper benefit was forcing clarity about what a workflow was actually supposed to do. Writing a workflow in a structured language requires you to specify the success conditions, the gate criteria, and the failure modes explicitly. Ad-hoc prompt strings let you skip that work; the DSL doesn’t. That forcing function caught a lot of underspecified logic that would have produced confusing results in production.

What Held Up / What Didn’t

The gate and checkpoint model held up well and became the parts of the language most worth keeping. The syntax itself went through more iteration than expected — the first version was too verbose, requiring boilerplate that made short workflows feel heavy. Trimming the surface area without losing expressiveness took several rounds. The honest answer to the overengineering question: a full DSL is probably overkill for workflows under five steps, but once you're chaining LLM calls with conditional logic and autonomous execution requirements, the structure pays for itself quickly.

Newsletter

Short notes on building AI agents in production.

One email when something worth sharing ships. No fluff, no daily cadence, no recycled growth-thread noise.

Primary use: consulting updates, governed AI workflow lessons, and major project writeups.
