hci-review-skill: Engineering Retrospective
A Claude Code skill that runs structured HCI reviews — encoding methodology as an AI skill to produce consistent 11-doc audit packs.
Running a proper HCI review manually is time-consuming enough that it rarely happens unless someone specifically schedules it. You need to evaluate information architecture, interaction patterns, feedback loops, error handling, accessibility considerations, and a handful of other dimensions — each of which requires thinking carefully about the specific interface rather than applying a checklist mechanically. The hci-review-skill project encodes that review methodology as a Claude Code skill: a structured, runnable capability that can be invoked against any interface and produces a consistent set of review artifacts without requiring the reviewer to hold the full methodology in their head during the session.
What Changed
The core design decision was how to represent the HCI methodology inside the skill. The naive approach is a long system prompt describing all the review dimensions, which works for simple cases but tends to produce reviews that feel generic because the model is trying to cover everything at once rather than developing each dimension with the specificity it deserves. The approach taken here was to structure the review as a sequence of focused phases, each with its own prompt and evaluation criteria, and to enforce that each phase produces a named output artifact before the next phase begins.
The prototype-hci-pack pattern emerged from this: a standard set of eleven output documents that together constitute a complete review. Each document covers a distinct dimension — navigation and information architecture, interaction model consistency, feedback and system status, error prevention and recovery, accessibility, cognitive load, and so on. The skill works through each document in sequence, producing them as actual files rather than inline responses. This means the review output is portable — it can be reviewed asynchronously, stored with the project, referenced in design discussions, and compared across review cycles.
Gate enforcement between phases was important for maintaining review quality. Each phase gate checks that the previous document was actually produced and contains substantive content rather than placeholder text before allowing the skill to proceed. Without these gates, early versions would skip ahead when the model decided a dimension didn’t apply, producing incomplete packs that were harder to trust because you couldn’t tell if a missing document meant “not applicable” or “skipped.”
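A minimal sketch of such a gate follows. The placeholder markers and word threshold are assumptions for illustration; the source does not specify the skill's actual checks.

```python
# Hypothetical placeholder markers -- the skill's real list is not specified.
PLACEHOLDER_MARKERS = ("todo", "tbd", "placeholder", "not applicable")

def gate_passes(doc_text: str, min_words: int = 40) -> bool:
    """Allow the next phase only if the document exists and is substantive."""
    body = doc_text.strip().lower()
    if not body:
        return False  # missing or empty document: hard fail, never skip silently
    if any(marker in body for marker in PLACEHOLDER_MARKERS):
        return False  # placeholder text does not count as a completed review
    return len(body.split()) >= min_words
```

A gate like this turns "the model decided to skip a dimension" into an explicit failure instead of a silently missing document.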
Why It Mattered
The primary benefit was consistency. Ad-hoc HCI reviews vary widely in depth and coverage depending on who runs them and what they happen to focus on in the moment. The skill produces the same eleven documents regardless of who invokes it, which makes the output comparable across projects and over time. A team reviewing the same interface six months apart gets the same structure, which makes it straightforward to identify what changed and whether it changed for better or worse.
The secondary benefit was lowering the activation energy for running reviews at all. When a review requires scheduling a session, gathering the right people, and working through a methodology document manually, it happens infrequently. When it’s a single command that produces output in a few minutes, it happens every time there’s a meaningful design change. That frequency difference matters more than any individual review’s quality.
What Held Up / What Didn’t
The eleven-document structure held up well and is the part of the project most worth keeping as a template. The gate enforcement logic needed several rounds of tuning: the initial gates were too strict, rejecting documents that were substantively complete but failed an exact format check, while gates that were too loose let thin documents through. The calibration that worked was asserting that specific required sections exist, rather than checking content length or format. The skill also works better on interfaces with existing design documentation than on purely visual interfaces, where the model has to infer structure from screenshots alone; that limitation is worth documenting clearly before handing the skill to someone else.
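That calibration can be sketched as a section-presence check. The heading names here are hypothetical; in practice each of the eleven documents would carry its own required-section list.

```python
import re

def missing_sections(markdown: str, required: list[str]) -> list[str]:
    """Return the required section headings absent from a pack document."""
    headings = {m.group(1).strip()
                for m in re.finditer(r"^#{1,6}\s+(.+?)\s*$", markdown, re.MULTILINE)}
    return [name for name in required if name not in headings]

# Hypothetical required sections for one document in the pack.
REQUIRED = ["Findings", "Severity", "Recommendations"]
```

Checking for named sections sidesteps both failure modes: a complete document in an unexpected format still passes, and a long but section-less document still fails.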