Consistent UI at Scale

2026
Experiment
Research + Engineering
Claude Code, SvelteKit
GitHub

I’m a design systems lead. Our team got dramatically faster with AI agents, but without guardrails the speed produced a mess of custom styles, one-off components, and inconsistent UX patterns. UI updates became disproportionately hard. We shipped more, but quality and consistency dropped, and the user experience got worse with them.

I decided to find a way to guide agents to follow design system standards.

I ran an experiment to see what kind of documentation and prompting keeps agents on track. I ended up with a framework that cut style violations by 95% compared with running the same prompts on the same codebase without any documentation. With a few extra tweaks, the agents even started suggesting their own improvements to the design system.

Figure: bare branch, hand-rolled UI without design-system documentation.

Figure: loaded branch, composed from documented primitives (with documentation).

The framework

Here’s the setup if you just want to replicate it; the reasoning comes right after. I’d recommend adjusting it to your own needs. AI will probably do a good enough job with the listed prompts, but it’s still worth checking the output.

The only prerequisite is a component library.

Start with primitives. You probably already have a file with core styles, maybe even semantically named. Try this prompt, adjusting it for your project’s specifics:

Primitives prompt

Then document the components. I found JSDoc the most useful, because the docs are visible in code editors on hover. Very convenient when you’re pairing with a model.

Component documentation prompt
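
To make the JSDoc step concrete, here is a minimal sketch of the kind of inline documentation an editor surfaces on hover. The Button props, the variant names, the class names, and the `buttonClasses` helper are all hypothetical and not taken from finn; only the JSDoc tags themselves (`@typedef`, `@property`, `@param`, `@returns`) are standard.

```javascript
/**
 * Props accepted by a hypothetical Button primitive (illustrative only).
 *
 * @typedef {Object} ButtonProps
 * @property {'primary' | 'secondary' | 'ghost'} [variant='primary']
 *   Visual weight. Use 'primary' at most once per view.
 * @property {'sm' | 'md'} [size='md'] Control height; maps to spacing tokens.
 * @property {boolean} [disabled=false] Disables interaction and dims the label.
 */

/**
 * Build the class list for the Button primitive.
 * Callers should not add their own color or spacing classes on top.
 *
 * @param {ButtonProps} [props]
 * @returns {string} Space-separated class names.
 */
function buttonClasses(props = {}) {
  const { variant = 'primary', size = 'md', disabled = false } = props;
  const classes = ['btn', `btn--${variant}`, `btn--${size}`];
  if (disabled) classes.push('is-disabled');
  return classes.join(' ');
}

console.log(buttonClasses({ variant: 'ghost', size: 'sm' }));
```

The usage rules ("at most once per view", "do not add your own classes") sit right next to the signature, so the agent reads them whenever it opens or imports the component.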

Now add some automation. Create ESLint rules and, if you’re on Claude Code, a Stop hook to run those checks automatically after every change.

ESLint + Stop hook prompt
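
As a rough illustration of what the lint layer checks, the sketch below scans component source for bare `<input>` elements and hard-coded CSS literals. It is a simplified, dependency-free stand-in and not an actual ESLint rule: a real rule would walk the AST from a Svelte-aware parser instead of using regexes. The rule names mirror the ones in the project tree; the patterns and the `findViolations` helper are assumptions.

```javascript
// Crude regex checks standing in for AST-based lint rules.
// Lowercase <input only, so a capitalized component such as <Input> is not flagged.
const BARE_INPUT = /<input\b/g;
// Hex colors (#fff, #ff0000) and raw pixel values (12px, 1.5px).
const CSS_LITERAL = /#[0-9a-fA-F]{3,8}\b|\b\d+(?:\.\d+)?px\b/g;

/**
 * Return one entry per violation found in a component's source text.
 * @param {string} source
 * @returns {{ rule: string, line: number, text: string }[]}
 */
function findViolations(source) {
  const violations = [];
  source.split('\n').forEach((text, index) => {
    const line = index + 1; // report 1-based line numbers
    for (const match of text.matchAll(BARE_INPUT)) {
      violations.push({ rule: 'no-bare-input', line, text: match[0] });
    }
    for (const match of text.matchAll(CSS_LITERAL)) {
      violations.push({ rule: 'no-css-literals', line, text: match[0] });
    }
  });
  return violations;
}

const sample = [
  '<Field label="Email" />',
  '<input type="text" />',
  '<div style="color: #ff0000; padding: 12px">',
].join('\n');

console.log(findViolations(sample));
```

Values that go through tokens, such as `var(--color-text)`, pass untouched, which is the behavior the rules are meant to encourage. A regex version will produce false positives (for example on anchor links that look like hex colors), which is one reason to prefer a parser-based rule in practice.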

Then create a design system skill. The skill is meant to trigger whenever an agent works on front-end tasks. It enforces the rules of the design system: both the rules baked into the components and tokens, and any patterns you want to encode, such as empty states, when to invoke which type of modal, accessibility rules, or content design guides.

Design system skill prompt

Finally, CLAUDE.md. This is the file Claude reads at the start of every session, so keep it lean and don’t overwhelm the context window with things that aren’t always relevant. Use the prompt below, or factor the design-system rules into a separate referenced file if your CLAUDE.md is already large and UI isn’t the only thing in your repo.

CLAUDE.md prompt

At the end, your project will be structured like this:

your-project/
  path-to/components/
    Button.svelte         ← JSDoc on every component
    Field.svelte
    ...
  CLAUDE.md               ← rules, component inventory, imports
  .claude/
    skills/build-with-design-system/
      SKILL.md            ← decision tree
      patterns/           ← forms, lists, success, …
    hooks/design-system-check.mjs  ← stop hook → re-runs ESLint
  eslint-rules/
    project-design-system.js       ← no-bare-input, no-css-literals
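
The core of a hook like design-system-check.mjs can be sketched as a small decision function. This is a sketch under assumptions worth verifying against the current Claude Code hooks documentation: that a Stop hook receives a JSON payload on stdin containing a `stop_hook_active` flag, and that exiting with code 2 blocks the stop and feeds stderr back to the agent. The `decideStop` name and the shape of the `lint` argument are hypothetical.

```javascript
/**
 * Decide how the Stop hook should exit, given the hook payload and a lint result.
 * @param {{ stop_hook_active?: boolean }} hookInput Parsed JSON from stdin (assumed shape).
 * @param {{ exitCode: number, output: string }} lint Result of running ESLint.
 * @returns {{ exitCode: number, stderr: string }}
 */
function decideStop(hookInput, lint) {
  // If this stop attempt was itself caused by the hook, let the agent finish
  // so a stubborn violation cannot trap it in an endless fix loop.
  if (hookInput.stop_hook_active) {
    return { exitCode: 0, stderr: '' };
  }
  // ESLint exits with 0 when it finds no errors.
  if (lint.exitCode === 0) {
    return { exitCode: 0, stderr: '' };
  }
  // Assumed convention: exit code 2 blocks the stop and shows stderr to the agent.
  return {
    exitCode: 2,
    stderr: `Design-system lint failed. Fix these before finishing:\n${lint.output}`,
  };
}

// In the real script you would read stdin, run ESLint in a child process,
// write `stderr` to process.stderr, and call process.exit(exitCode).
console.log(decideStop({ stop_hook_active: false }, { exitCode: 1, output: 'no-bare-input: src/routes/send/+page.svelte' }));
```

Keeping the decision separate from the I/O makes the hook easy to test without launching an agent session.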

That’s the whole setup, and you can build from here. I found that a well-structured Figma file is the best way to prompt the model, but a messier Figma file, Claude Design files, sketches, or even plain text PRDs (I’d still recommend illustrating them with some sketches) all work too, just less precisely or at a higher cost.

You can also add another review step at the PR level, e.g. a GitHub Action that posts review comments. I found it useful, especially for proposing new components and improvements to the design system, but it adds cost, so weigh it against the size of the project.

The experiment

I tested the setup a few different ways. The mock app is called finn, a personal-finance app built with SvelteKit and plain CSS. No React, no Tailwind, and no out-of-the-box component library. I picked this stack on purpose: every product is unique, and I didn’t want to test on the stack that AI tools default to.

I started by trying to figure out the best way to document things. Three formats:

No documentation (control)
A traditional CLAUDE.md plus a folder of markdown files generated by Anthropic’s /design-system skill
JSDoc on every component plus a dedicated build-with-design-system skill

Markdown files and the skill-plus-JSDoc approach produced similar results, but I find markdown files harder to work with: you have to search for them manually and remember to maintain them. JSDoc is inline and shows up in code editors on hover, which makes the experience much better. I went with a custom skill because it only triggers when you’re working on the UI, so it doesn’t clutter the context when the agent is working on things unrelated to the front end.

The format comparison (4 runs each, opus-4.7):

Metric	Markdown	Skill + JSDoc
Bypass markers	25.5 ± 12.4	14.5 ± 10.8
Primitives used	34 ± 3.5	36.25 ± 0.8
Style escapes	7.25 ± 4.4	6.0 ± 4.5
Cost / run	$2.29 ± $0.27	$2.40 ± $0.22

Bypass markers show how many times the agent ignored existing components and created one-offs.

Next, I wanted to figure out what belongs in CLAUDE.md versus the skill versus the component layer. After a few rounds it became clear that a mixed approach worked best.

On top of an ordinary codebase I layered four things:

Inline-documented primitives and components
An ESLint rule that enforces use of those primitives
An umbrella skill (SKILL.md) describing the inventory and composition rules
A CLAUDE.md with a general index and a few short guidelines

Finding the best way to prompt

I wanted to see which input format the model handled best. I compared:

A text prompt, detailed, but still just text
A pencil sketch plus a short text brief
A well-organized Figma file

Then for the sake of comparison I also added a less well-organized Figma file and a Claude Design handoff generated from it.

Build the Send screen

/send is currently a placeholder. Build it out into a working "send money" flow.

What the screen does

A user opens /send to send money to someone. They:

Pick a recipient — choose from existing contacts, or enter an email address.
Enter the amount — type a number, pick the currency.
Add a note — a short message with the transfer.
See a confirmation — recipient, amount, fees, and a clear "Send" action.
Submit — show success, then offer "Send another" or "Back to convert".

What to handle

Recipient search by name or email
Amount: decimals only, reject letters
Validate before submit (amount > 0, recipient selected)
Network failure: error state with retry

Existing data

src/lib/stores/contacts
src/lib/data/currencies
src/lib/utils/format
src/lib/types.ts

Done looks like

Clean flow that fits the app
bun run check passes
bun run build passes

text PRD

Figure: pencil sketch of the Send screen, used as a visual prompt.

Figma file

I ran three rounds with Claude Opus 4.7 against both the undocumented and well-documented branches — 18 runs total per condition, 30 runs across modalities × branches × reps.

Documented branch only (the part that matters once you’ve done the setup):

Modality	$/run	Lines	Escapes	Imports σ	Verdict
Text PRD	$1.64	313	1.0	0.0	Lost on every dimension
Sketch	$1.03	103	0	0.6	Cheapest path that still works
Figma	$1.20	71	0	0.0	Practical sweet spot
Claude Design	$1.37	51	0	0.0	Tightest output; downstream of Figma

Text prompts performed worst. My guess is they leave too much for the model to decide. Claude Design handoffs produced the tightest output (51 lines), but you have to spend tokens generating the handoff first, so it’s not really a win on effort. Sketches were the cheapest and did surprisingly well, but consistency was weaker than I’d want (the only non-zero Imports σ). Well-organized Figma was the practical winner: zero escapes, fewer lines than text or sketch, perfect agreement on which primitives to use, and a cost only $0.17 per run above sketch.
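
For readers who want to check the trade-offs themselves, the snippet below re-expresses the documented-branch table relative to a chosen baseline. The numbers are copied from the table; the `compareTo` helper and the field names are only an illustration.

```javascript
// Documented-branch results, copied from the table above.
const documentedRuns = [
  { modality: 'Text PRD', costPerRun: 1.64, lines: 313, escapes: 1.0, importsSigma: 0.0 },
  { modality: 'Sketch', costPerRun: 1.03, lines: 103, escapes: 0, importsSigma: 0.6 },
  { modality: 'Figma', costPerRun: 1.20, lines: 71, escapes: 0, importsSigma: 0.0 },
  { modality: 'Claude Design', costPerRun: 1.37, lines: 51, escapes: 0, importsSigma: 0.0 },
];

/**
 * Express every row's cost and line count relative to one baseline modality.
 * @param {string} baselineName
 * @param {typeof documentedRuns} rows
 */
function compareTo(baselineName, rows) {
  const base = rows.find((row) => row.modality === baselineName);
  return rows.map((row) => ({
    modality: row.modality,
    // Extra dollars per run versus the baseline, rounded to cents.
    extraCost: Number((row.costPerRun - base.costPerRun).toFixed(2)),
    // Lines of code as a fraction of the baseline's lines.
    lineRatio: Number((row.lines / base.lines).toFixed(2)),
  }));
}

console.log(compareTo('Sketch', documentedRuns));
```

Against the sketch baseline, Figma costs $0.17 more per run and produces about 0.69× the lines, while the text PRD costs $0.61 more and produces roughly 3× the lines.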

How it holds up over time

The next question was: what happens as the codebase grows? Does it create new components when it should? Does it snowball errors and one-offs?

I assumed Figma would still win at scale given how well it performed, so I didn’t test it again. Instead I prompted with text PRDs and sketches and asked the agents to add five more screens on top of the existing app. Three reps per screen per branch, 30 runs in total.

The undocumented branch snowballed pretty hard. It kept accumulating new one-off styles, and within a few commits the UI had drifted noticeably from the original setup. The well-documented branch held together much better.

Cumulative state after 6 chained commits (round 1):

Branch	Screens	Cumulative bypass	Cumulative escapes	Δ escapes / new screen
Bare	9	4	33	11.00
Documented	10	2	1	0.25

A 44× gap on new escapes per added screen. The bare branch picked up 33 new style escapes across the chain; the documented branch picked up one.
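
The 44× figure follows directly from the table. In the sketch below, the screen and escape counts are copied from the table, while the starting count of 6 screens is an assumption inferred from the reported per-screen rates (33 / 11.00 = 3 new screens on the bare branch, 1 / 0.25 = 4 on the documented one); it is not stated explicitly in the write-up.

```javascript
// ASSUMPTION: both branches started the chain with 6 screens (inferred, see above).
const BASELINE_SCREENS = 6;

// Cumulative state after the chain, copied from the table.
const branches = [
  { name: 'Bare', screens: 9, escapes: 33 },
  { name: 'Documented', screens: 10, escapes: 1 },
];

/** New style escapes per screen added during the chain. */
function escapesPerNewScreen({ screens, escapes }) {
  return escapes / (screens - BASELINE_SCREENS);
}

const [bare, documented] = branches.map(escapesPerNewScreen);
const gap = bare / documented;

console.log(bare, documented, gap); // 11 0.25 44
```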

One caveat: the skill is invoked non-deterministically. On the documented branch it fired on 5 of 6 commits. The one miss (a success-screen prompt with a text brief) was also the only commit where the documented branch picked up an escape. So if it matters, either mention the skill explicitly in the prompt or enforce its use more aggressively.

Because models are so good at coding, it’s hard to tell that something is wrong just by looking at the result. Both the bare version on the left and the documented version on the right look okay at first glance. However, the bare version actually created new styles for the input field, used the wrong text style for the bottom text, and made a completely new component for the slider.

Figure: bare branch, notification threshold screen with a one-off slider component and a re-styled input.

Figure: documented branch, the same screen composed from the existing primitives.

Self-propagating design system

The other thing worth noting: even the well-documented branch started proposing new components, because new screens occasionally need shapes the existing library doesn’t cover. That raised the next question. What’s the right way to propose new components, and can the system become self-propagating?

I tried three approaches before finding one that worked.

Approach 1: a Stop hook. It runs after every change the agent completes, reviews what it just did, and proposes design-system additions. It worked, but because it ran after every completed task it produced too many proposals and spent more tokens than I’d like.

Approach 2: bake a “propose new components” step into the build-with-design-system skill. This barely worked. The skill triggered, but the proposal step was mostly skipped. New components got proposed exactly once across the round; most of the time the agent saw “you can either build this directly or propose a new primitive,” picked “build directly,” and moved on. It could probably be made to work with more tuning, but as with the Stop hook, I didn’t want the step to run after every single task and burn tokens, so I counted this approach as a fail too.

Approach 3: a separate review pass at the PR level. It runs after the main body of work is done, and on demand, so it’s only requested once a considerable amount of front-end work has landed. The review agent flags pattern gaps and files proposals into a _proposals/ folder. From there I can review, accept, extract the new component into the library, and continue with an updated design system. This approach worked well enough.
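
To give a feel for what "filing a proposal" might produce, here is a hypothetical sketch of a helper that turns a flagged gap into a file for the _proposals/ folder. Only the folder name comes from the setup described here; the file naming scheme, the markdown layout, the example route path, and the `buildProposal` and `slugify` helpers are assumptions for illustration.

```javascript
/** Turn a component name into a file-safe slug, e.g. "Range Slider" -> "range-slider". */
function slugify(name) {
  return name
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-')
    .replace(/^-+|-+$/g, '');
}

/**
 * Build the path and markdown body for one proposal (hypothetical format).
 * @param {{ component: string, reason: string, seenIn: string[], date: string }} gap
 * @returns {{ path: string, body: string }}
 */
function buildProposal({ component, reason, seenIn, date }) {
  const path = `_proposals/${date}-${slugify(component)}.md`;
  const body = [
    `# Proposal: ${component}`,
    '',
    'Status: pending',
    '',
    '## Why',
    reason,
    '',
    '## Seen in',
    ...seenIn.map((file) => `- ${file}`),
    '',
  ].join('\n');
  return { path, body };
}

const proposal = buildProposal({
  component: 'Range Slider',
  reason: 'A one-off slider was hand-rolled because no primitive covers numeric ranges.',
  seenIn: ['src/routes/notifications/+page.svelte'], // hypothetical path
  date: '2026-01-15', // hypothetical date
});

console.log(proposal.path);
console.log(proposal.body);
```

One file per proposal keeps the queue easy to review: accepting a proposal means extracting the component and deleting or archiving the file.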

Review-bot results, 5 reviews per branch:

Branch	Reviews	Proposals	Total cost	Avg / review
Documented	5	16	$6.13	$1.23
Bare (control)	5	16	$6.54	$1.31

What this looks like in practice

The workflow that worked for me had three steps:

Step 1: build. The agent ships the feature. It reads the components, patterns, and all the documents, and builds consistently. It can propose, but isn’t expected to.
Step 2: review. A fresh review agent (GitHub Action) with no build context sees only the diff and the worktree. Its only job is to surface gaps and file proposals.
Step 3: triage. I review the proposal queue, approve, edit, or reject each proposal, and add the approved components and patterns to the design system.

Models

I was curious how Haiku, Sonnet, and Opus compared on the same task. I ran the same screen-building exercise across all three.

As expected, Opus performed best, Sonnet sat in the middle, and Haiku trailed. I suppose you can use Sonnet or Haiku, but expect to run more rounds of review and lint-driven correction.

Just a few more thoughts

Documentation works best when the agent reads it at the right moment. Putting everything into markdown files or a single skill is better than nothing, but mixing tools works better. Combining deterministic checks (lint rules, hooks) with non-deterministic guidance (skills, CLAUDE.md) that only runs when needed makes the whole system much more efficient.

Visual prompts are better than text prompts. Figma was the sweet spot in the single-screen comparison, and I assumed rather than tested that this holds at scale. I haven’t tested other design tools either (might be an interesting next experiment). Sketches worked surprisingly well, and pure text PRDs were the worst on cost, lines, and escapes, but even text plus a documented codebase landed near zero escapes.

Different cognitive modes need different agent passes. Building and reviewing are not the same task, and the same agent with the same skills and the same helper won’t switch between them mid-flow. Once I split up building, reviewing, and proposing future changes, everything worked much better.

Every design system is unique. A lot of what you’d call common sense never makes it into the docs; it stays in your head or in the team’s shared context. The model doesn’t have any of that on the first run, and it won’t pick it up on its own.

What worked for me was treating the skill as something I keep growing. Whenever I caught the agent doing something I’d want it to do differently next time, I told it to add the rule to the patterns, or whichever doc fit best, right there in chat by typing something like /btw Memorize and add a rule to the "build-with-design-system" skill: never create new text sizes unless specifically asked. Over time the skill gets more precise, and the corrections taper off.

If you have anything to add, any feedback, or any ideas, feel free to reach out on LinkedIn or X. The repository I experimented with is available on GitHub: finn.