Consistent UI at Scale

  • 2026
  • Experiment
  • Research + Engineering
  • Claude Code, SvelteKit

I’m a design systems lead. Our team got dramatically faster with AI agents, but without guardrails the speed produced a mess of custom styles, one-off components, and inconsistent UX patterns. UI updates became disproportionately hard, and even though we shipped more, quality and consistency dropped, and the user experience suffered with them.

I saw this happening on our team and in my personal projects, and as a designer I didn’t like it at all. I decided to find a way to guide agents to follow design system standards.

I ran an experiment to see what kind of documentation and prompting keeps agents on track. I tried different ways to write prompts and document everything. I also tested various models, mostly Claude, but I think the results apply to others as well.

I ended up with a solid framework that reduced style violations by 95% compared to having no documentation. With a few extra tweaks, the agents even started suggesting their own improvements to the design system.

[Figure, without documentation: bare branch, hand-rolled UI without design-system documentation]
[Figure, with documentation: loaded branch, composed from documented primitives]

The framework

Here’s the setup if you just want to replicate it; the reasoning comes right after. I’d recommend adjusting it to your own needs. AI will probably do a good enough job with the listed prompts, but it’s still worth checking the output.

The only prerequisite is a component library. I’ll assume you have one for the purpose of the article.

Start with primitives. You probably already have a file with core styles, maybe even semantically named. This is the foundation of every consistent UI, and it’s a good idea to establish some rules here. Try this prompt, adjusting it for your project’s specifics:

Then document the components. I found JSDoc the most useful, because the docs are visible in code editors on hover. Very convenient when you’re pairing with a model.
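To make this concrete, here is a minimal sketch of the kind of JSDoc I mean. The component, prop names, variants, and class names are hypothetical and not taken from the finn repo, and it is plain JavaScript so it can run on its own; how the comment attaches to a .svelte file depends on your setup.

```javascript
/**
 * Button props (hypothetical example; names are illustrative only).
 *
 * Use for: any clickable action. Do not restyle a native button or link instead.
 * Variants: "primary" for the single main action on a screen,
 * "secondary" for everything else, "danger" for destructive actions.
 *
 * @typedef {Object} ButtonProps
 * @property {'primary' | 'secondary' | 'danger'} [variant] Visual emphasis. Defaults to 'primary'.
 * @property {'sm' | 'md'} [size] Control height. Defaults to 'md'.
 * @property {boolean} [loading] Adds the loading state class while true.
 */

/**
 * Map props to the class list the component applies.
 *
 * @param {ButtonProps} [props]
 * @returns {string} Space-separated class names.
 */
function buttonClasses({ variant = 'primary', size = 'md', loading = false } = {}) {
  const classes = ['btn', `btn--${variant}`, `btn--${size}`];
  if (loading) classes.push('is-loading');
  return classes.join(' ');
}

console.log(buttonClasses({ variant: 'danger', loading: true }));
```

The useful part for an agent is less the types than the "use for" and "variants" notes: they say when to reach for the component, which is exactly what gets lost without documentation.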

Now add some automation. Create ESLint rules and, if you’re on Claude Code, a Stop hook to run those checks automatically after every change.
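As an illustration, here is a rough sketch of what such a rule can look like. It is not the rule from my repo: the name, the message, and the regex are assumptions, and it only covers string literals in script code. Linting Svelte markup and style blocks needs a Svelte-aware parser, which this sketch does not attempt.

```javascript
// Hypothetical rule in the spirit of "no-css-literals": flag raw hex colors
// in string literals so colors go through design tokens instead.
// The regex is deliberately simple and can false-positive on strings
// that merely look like hex colors (for example an anchor such as "#add").
const HEX_COLOR = /#(?:[0-9a-fA-F]{3,4}|[0-9a-fA-F]{6}|[0-9a-fA-F]{8})\b/;

const noHexColorLiterals = {
  meta: {
    type: 'problem',
    docs: { description: 'Disallow raw hex colors; use design tokens instead.' },
    messages: {
      rawColor: 'Raw color "{{value}}" found. Use a design token instead.',
    },
    schema: [], // the rule takes no options
  },
  create(context) {
    return {
      // ESLint calls this visitor for every literal in the parsed script code.
      Literal(node) {
        if (typeof node.value !== 'string') return;
        const match = node.value.match(HEX_COLOR);
        if (match) {
          context.report({ node, messageId: 'rawColor', data: { value: match[0] } });
        }
      },
    };
  },
};

// In a real rule file you would export this object,
// for example with `export default noHexColorLiterals`.
```

To use it, you would register the object under a local plugin in your ESLint config and turn it on as an error, so the agent gets a hard failure rather than a suggestion.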

Then create a design system skill. The skill will trigger every time an agent works on front-end tasks. It enforces the rules of the design system: both the rules baked into the components and tokens, and any patterns you want to encode, such as empty states, when to invoke which type of modal, accessibility rules, or content design guides.

Finally, CLAUDE.md. This is the file Claude reads at the start of every session, so keep it lean and don’t overwhelm the context window with things that aren’t always relevant. Use the prompt below, or factor the design-system rules into a separate referenced file if your CLAUDE.md is already large and UI isn’t the only thing in your repo.

At the end your project will be structured like this:

your-project/
  path-to/components/
    Button.svelte         ← JSDoc on every component
    Field.svelte
    ...
  CLAUDE.md               ← rules, component inventory, imports
  .claude/
    skills/build-with-design-system/
      SKILL.md            ← decision tree
      patterns/           ← forms, lists, success, …
    hooks/design-system-check.mjs  ← stop hook → re-runs ESLint
  eslint-rules/
    project-design-system.js       ← no-bare-input, no-css-literals

That’s the whole setup, and you can build from here. I found that a well-structured Figma file is the best way to prompt the model, but a messier Figma file, Claude Design files, sketches, or even plain-text PRDs (I’d still recommend illustrating them with some sketches) all work, just less precisely or at a higher cost.

You can also add another review step at the PR level, e.g. a GitHub Action that posts review comments. I found it useful, especially for proposing new components and improvements to the design system, but it adds cost, so weigh it against the size of the project.

The experiment

I tested the setup a few different ways. The mock app is called finn, a personal-finance app built with SvelteKit and plain CSS. No React, no Tailwind, and no out-of-the-box component library. I picked this stack on purpose: every product is unique, and I didn’t want to test on the stack that AI tools default to.

[Figure: the finn app, a personal-finance mock used as the experiment fixture]

I started by trying to figure out the best way to document things. Three formats:

  1. No documentation (control)
  2. A traditional CLAUDE.md plus a folder of markdown files generated by Anthropic’s /design-system skill
  3. JSDoc on every component plus a dedicated build-with-design-system skill

Markdown files and the skill-plus-JSDoc approach produced similar results, but I find markdown files harder to work with: you have to search for them manually and remember to maintain them. JSDoc is inline and shows up in code editors on hover, which makes the experience much better. I went with a custom skill because it only triggers when you’re working on the UI, so it doesn’t clutter the context when the agent is working on things unrelated to the front end. You can configure around it, but since the two approaches scored roughly the same, I stuck with skill + JSDoc for everything that followed.

The format comparison (4 runs each, opus-4.7):

Metric            Markdown         Skill + JSDoc
Bypass markers    25.5 ± 12.4      14.5 ± 10.8
Primitives used   34 ± 3.5         36.25 ± 0.8
Style escapes     7.25 ± 4.4       6.0 ± 4.5
Cost / run        $2.29 ± $0.27    $2.40 ± $0.22

Bypass markers show how many times the agent ignored existing components and created one-offs.
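A quick sanity check on the bypass-marker gap, as a rough sketch. It assumes the ± values are sample standard deviations over the 4 runs, which is my reading of the numbers rather than something the data format guarantees, and it uses Welch’s t statistic computed from summary values only.

```javascript
// Welch's t statistic from summary statistics (mean, standard deviation, n).
function welchT(a, b) {
  const standardError = Math.sqrt(a.sd ** 2 / a.n + b.sd ** 2 / b.n);
  return (a.mean - b.mean) / standardError;
}

// Bypass markers per run, from the format comparison.
const markdown = { mean: 25.5, sd: 12.4, n: 4 };
const skillJsdoc = { mean: 14.5, sd: 10.8, n: 4 };

const t = welchT(markdown, skillJsdoc);
console.log(t.toFixed(2)); // 1.34
```

With only four runs per format, |t| would need to be roughly 2.4 or more to count as a clear difference at the usual 5% level, so a value around 1.3 fits the reading that the two formats are hard to tell apart on this metric.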

Next, I wanted to figure out what belongs in CLAUDE.md versus the skill versus the component layer. After a few rounds it became clear that a mixed approach worked best.

On top of an ordinary codebase I layered four things:

  • Inline-documented primitives and components
  • An ESLint rule that enforces use of those primitives
  • An umbrella skill (SKILL.md) describing the inventory and composition rules
  • A CLAUDE.md with a general index and a few short guidelines

Finding the best way to prompt

I wanted to see which input format the model handled best. I compared:

  • A text prompt, detailed, but still just text
  • A pencil sketch plus a short text brief
  • A well-organized Figma file

Then for the sake of comparison I also added a less well-organized Figma file and a Claude Design handoff generated from it.

Build the Send screen

/send is currently a placeholder. Build it out into a working "send money" flow.

What the screen does

A user opens /send to send money to someone. They:

  1. Pick a recipient — choose from existing contacts, or enter an email address.
  2. Enter the amount — type a number, pick the currency.
  3. Add a note — a short message with the transfer.
  4. See a confirmation — recipient, amount, fees, and a clear "Send" action.
  5. Submit — show success, then offer "Send another" or "Back to convert".

What to handle

  • Recipient search by name or email
  • Amount: decimals only, reject letters
  • Validate before submit (amount > 0, recipient selected)
  • Network failure: error state with retry

Existing data

  • src/lib/stores/contacts
  • src/lib/data/currencies
  • src/lib/utils/format
  • src/lib/types.ts

Done looks like

  • Clean flow that fits the app
  • bun run check passes
  • bun run build passes
[Above: the text PRD]
[Figure: pencil sketch of the Send screen used as a visual prompt]
[Figure: Figma file]

I ran three rounds with Claude Opus 4.7 against both the undocumented and well-documented branches — 18 runs total per condition, 30 runs across modalities × branches × reps.

Documented branch only (the part that matters once you’ve done the setup):

Modality        $/run   Lines   Escapes   Imports σ   Verdict
Text PRD        $1.64   313     1.0       0.0         Lost on every dimension
Sketch          $1.03   103     0         0.6         Cheapest path that still works
Figma           $1.20   71      0         0.0         Practical sweet spot
Claude Design   $1.37   51      0         0.0         Tightest output; downstream of Figma

Text prompts performed worst. My guess is they leave too much for the model to decide on. Claude Design handoffs technically produced the tightest output, but you have to spend tokens to get there, so it’s not really a win on effort. Sketches did surprisingly well, but consistency was weaker than I’d want. Well-organized Figma was the practical winner: the fewest escapes, the fewest lines of code, perfect agreement on which primitives to use, and a price within rounding of sketch.

How it holds up over time

The next question was: what happens as the codebase grows? Does it create new components when it should? Does it snowball errors and one-offs?

Given how well Figma performed, I assumed it would still win at scale, so I didn’t test it here. Instead I prompted with text PRDs and sketches and asked the agents to add five more screens on top of the existing app. Three reps per screen per branch, 30 runs in total.

The undocumented branch snowballed pretty hard. It accumulated new one-off styles, and within a few commits the UI had drifted noticeably from the original setup. The well-documented branch held together much better.

Cumulative state after 6 chained commits (round 1):

Branch       Screens   Cumulative bypass   Cumulative escapes   Δ escapes / new screen
Bare         9         4                   33                   11.00
Documented   10        2                   1                    0.25

A 44× gap on new escapes per added screen. The bare branch picked up 33 new style escapes across the chain; the documented branch picked up one.
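The arithmetic behind that gap, spelled out as a small sketch using only the numbers from the cumulative table (the variable names are mine):

```javascript
// Values from the cumulative-state table.
const bare = { cumulativeEscapes: 33, escapesPerNewScreen: 11.0 };
const documented = { cumulativeEscapes: 1, escapesPerNewScreen: 0.25 };

// How many times more escapes the bare branch adds per new screen.
const gap = bare.escapesPerNewScreen / documented.escapesPerNewScreen;

// The same comparison expressed as a relative reduction.
const reduction = 1 - documented.escapesPerNewScreen / bare.escapesPerNewScreen;

console.log(`${gap}x gap, ${(reduction * 100).toFixed(1)}% fewer escapes per new screen`);
// 44x gap, 97.7% fewer escapes per new screen
```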

One caveat: the skill is invoked non-deterministically. On the documented branch it fired 5 of 6 commits. The one miss (a success-screen prompt with a text brief) was also the only commit where the documented branch picked up an escape. So if it matters, either mention the skill explicitly in the prompt or enforce its use more aggressively.

One interesting point is that because models are so good at coding, it is hard to tell if something is wrong just by looking at the results. Both the bare version on the left and the documented version on the right look okay at first glance. However, the bare version actually created new styles for the input field, used the wrong text style for the bottom text, and made a completely new component for the slider.

[Figure, bare: notification threshold screen with a one-off slider component and a re-styled input]
[Figure, documented: same screen composed from the existing primitives]

Self-propagating design system

The other thing worth noting: even the well-documented branch started proposing new components, because new screens occasionally need shapes the existing library doesn’t cover. That raised the next question. What’s the right way to propose new components, and can the system become self-propagating?

I tried three approaches before finding one that worked.

Approach 1: a Stop hook. It runs after every change the agent completes, reviews what it just did, and proposes design-system additions. It worked, but because it ran after every completed task it produced too many proposals and spent more tokens than I’d like.

Approach 2: bake a “propose new components” step into the build-with-design-system skill. This barely worked. The skill triggered, but the proposal step was mostly skipped. New components got proposed exactly once across the round; most of the time the agent saw “you can either build this directly or propose a new primitive,” picked “build directly,” and moved on. It could probably be made to work with more effort, but as with the Stop hook, I didn’t want it running after every single task and spending too many tokens, so I counted this approach as a failure too.

Approach 3: a separate review pass at the PR level. It runs on demand, after the main body of work is done, so I only request it when a considerable amount of front-end work has landed. The review agent flags pattern gaps and files proposals into a _proposals/ folder. From there I can review, accept, extract the new component into the library, and continue with an updated design system. This approach worked well enough.
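If you want to automate the “only when there was considerable front-end work” part, one option is a small gate in front of the review step. The sketch below is an assumption on my side, not what the experiment used: the file extensions and the threshold of three files are arbitrary, and the list of changed files would come from something like git diff --name-only in CI.

```javascript
// Files that count as front-end work for this hypothetical gate.
const FRONTEND_FILE = /\.(svelte|css)$/;

/**
 * Decide whether a PR touches enough front-end files to justify
 * a design-system review pass.
 *
 * @param {string[]} changedFiles Paths changed in the PR.
 * @param {number} [minFrontendFiles] Arbitrary threshold; tune per project.
 */
function shouldRunDesignReview(changedFiles, minFrontendFiles = 3) {
  const frontendFiles = changedFiles.filter((file) => FRONTEND_FILE.test(file));
  return { run: frontendFiles.length >= minFrontendFiles, frontendFiles };
}

const decision = shouldRunDesignReview([
  'src/routes/send/+page.svelte',
  'src/lib/components/Field.svelte',
  'src/lib/styles/tokens.css',
  'README.md',
]);
console.log(decision.run); // true
```

A gate like this keeps the cost of the review pass proportional to how much UI actually changed, which was the whole reason the per-task approaches failed for me.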

One interesting finding: on the documented branch, the review agent surfaced meaningful gaps (missing patterns or components). On the bare branch, it kept proposing basic primitives that I already had: buttons, inputs, fields. The pattern library doesn’t give the review agent more to find, because it looks at diffs rather than the whole codebase; what changes is how useful the proposals are.

Review-bot results, 5 reviews per branch:

Branch           Reviews   Proposals   Total cost   Avg / review
Documented       5         16          $6.13        $1.23
Bare (control)   5         16          $6.54        $1.31

What this looks like in practice

The workflow that worked for me had three steps:

  • Step 1: build. The agent ships the feature. It reads the components, patterns, and all the documents, and builds consistently. It can propose, but isn’t expected to.
  • Step 2: review. A fresh review agent (GitHub Action) with no build context sees only the diff and the worktree. Its only job is to surface gaps and file proposals.
  • Step 3: triage. I review the proposal queue, approve, edit, or reject each proposal, and add the approved components and patterns to the design system.

Models

I was curious how Haiku, Sonnet, and Opus compared on the same task. I ran the same screen-building exercise across all three.

As expected, Opus performed best, Sonnet sat in the middle, and Haiku trailed. I suppose you can use Sonnet or Haiku, but expect to run more rounds of review and lint-driven correction.

Just a few more thoughts

Documentation works best when the agent reads it at the right moment. Putting everything into markdown files or a single skill is better than nothing, but mixing tools works better. Combining deterministic checks (lint rules, hooks) with non-deterministic ones (skills, review agents) that only run when needed makes the whole system much more efficient.

Visual prompts are better than text prompts. Figma was the sweet spot at every scale, although I haven’t tested other design tools (might be an interesting next experiment). Sketches worked surprisingly well, and pure text PRDs were the worst on every metric I tracked, but even text plus a documented codebase landed near zero escapes.

Different cognitive modes need different agent passes. Building and reviewing are not the same task, and the same agent with the same skills and the same helper won’t switch between them mid-flow. Once I split up building, reviewing, and proposing future changes, everything worked much better.

If you have anything to add, any feedback, or any ideas, feel free to reach out on LinkedIn or X. The repository I experimented with is available on GitHub: finn.

v. 4.26