Design systems teams have always been short on honest data about their own work. The tools we've built to answer whether the system is being used – import counts from a bundler, npm downloads, Figma instance counts, the quarterly satisfaction survey, the linter that nags PR authors about raw hex values – all share the same limitation. They can tell you what got used. They can't tell you what someone wanted and couldn't find.

That gap is where the interesting failures live. The component that sounds reasonable in documentation but doesn't map to how anyone thinks about the problem. The token whose purpose is ambiguous enough that three people on the team bypass it in the same week without mentioning it to anyone. The prop that everyone expects to exist and doesn't. None of this shows up in import counts because nobody imports something they can't locate. It shows up as silence, and silence has always been the hardest thing for a design systems team to measure.

Agents break that silence, because they're noisier than humans. Every time an agent encounters something it can't resolve cleanly, it leaves a trace. The trace is being written to disk on every laptop on your team that's used Claude Code or Codex this week. Most of it is sitting in folders nobody opens, and Claude Code wipes its sessions after thirty days by default. The richest record of how your system fails under real use is being deleted on a schedule, on hardware you already own, by people who don't know it's there.

What's actually in an agent trace

When someone on your team runs Claude Code against a task that touches your design system, the session produces a conversation log. It contains the prompts, the tool calls, the file reads, the diffs the agent proposed, and the diffs the human accepted or rejected. It sits on disk in ~/.claude/projects/ as JSONL, one file per session, retained for thirty days by default unless someone changes cleanupPeriodDays in their settings. Codex writes equivalent rollout files under ~/.codex/sessions/YYYY/MM/DD/. Cursor's local history is patchier and less reliable to grep against. When an agent pulls context from an MCP server, those calls are recorded in the same session log. OpenTelemetry already publishes semantic conventions for GenAI agent spans, covering model invocations, tool calls, and agent operations, so the infrastructure for treating this as proper observability data exists if you want it.
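If you want an inventory before the cleanup timer fires, a few lines of Python will list what's sitting on a machine. This is a minimal sketch assuming the default Claude Code layout described above; point it at a different root for Codex rollouts, and adjust if your install deviates.

```python
# List JSONL session files, newest first, so you can see what the
# thirty-day cleanup is about to delete. The default root is the
# Claude Code layout described in the article; swap it for
# ~/.codex/sessions/ to inventory Codex rollouts instead.
from datetime import datetime
from pathlib import Path


def recent_sessions(root: Path = Path.home() / ".claude" / "projects"):
    """Return (modified, path) pairs for every JSONL session file, newest first."""
    stamped = [
        (datetime.fromtimestamp(f.stat().st_mtime), f)
        for f in root.rglob("*.jsonl")
    ]
    return sorted(stamped, reverse=True)


if __name__ == "__main__":
    for modified, path in recent_sessions()[:10]:
        print(modified.strftime("%Y-%m-%d %H:%M"), path)
```

Nothing clever here, deliberately: the point is that the files are ordinary JSONL on disk, reachable with the standard library.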

None of this is speculative. It's sitting on the laptops of everyone on your team who's used an AI coding tool this week. What's missing is the idea that it might mean something to the people who maintain the system the agent was working against.

The four signals I'd start with

Treat agent traces as system telemetry and four patterns turn up. Each one is a problem your design system has that you previously had no reliable way to detect.

Hallucinated props

The agent writes <Button variant="tertiary-ghost"> and your Button doesn't have one. It didn't make that up from nothing. It made it up because the combination felt plausible given the rest of your API, and because your documentation didn't give it enough to rule the invention out. Every hallucinated prop is a hypothesis the agent formed about what your component should do, and a sign that your docs left room for the hypothesis in the first place. I wrote about this in "AI doesn't need your design system to be perfect, it needs it to be honest", but I was thinking about it as a readiness problem. It's also a detection problem. You can now see, in retrospect, exactly where the dishonesty lives.

Pipe a session file through jq and a hallucinated prop looks like this:

{
  "type": "user",
  "timestamp": "2026-04-01T14:21:52.103Z",
  "message": {
    "role": "user",
    "content": "make the signup button less prominent on the hero section"
  }
}
{
  "type": "assistant",
  "timestamp": "2026-04-01T14:22:17.891Z",
  "message": {
    "role": "assistant",
    "content": [
      {
        "type": "tool_use",
        "name": "Edit",
        "input": {
          "file_path": "src/components/Hero.tsx",
          "old_string": "<Button>Sign up</Button>",
          "new_string": "<Button variant=\"tertiary-ghost\">Sign up</Button>"
        }
      }
    ]
  }
}
{
  "type": "user",
  "timestamp": "2026-04-01T14:22:43.214Z",
  "message": {
    "role": "user",
    "content": "that's not a real variant. just use ghost and add a class for the smaller size."
  }
}

Three records, and the last two are twenty-six seconds apart: the assistant turn is the hallucinated prop, the final user turn is the corrective one. If you wanted to find every hallucinated component prop your team produced last month, you'd grep for tool_use blocks whose input contains an attribute that doesn't match anything in your component library, then check whether the next user record pushed back. That query is doable in a few lines of jq.
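Here's the same scan sketched in Python rather than jq, because the record shapes are easier to see spelled out. KNOWN_VARIANTS is a hypothetical stand-in for whatever your Button actually exports, and the JSONL shapes follow the transcript excerpt above; a real pipeline would load the variant list from your component library and handle more record types.

```python
# Scan one session transcript for Edit tool calls that introduce a
# Button variant the library doesn't export. KNOWN_VARIANTS is a
# hypothetical stand-in for your real component API.
import json
import re
from pathlib import Path

KNOWN_VARIANTS = {"primary", "secondary", "ghost"}  # hypothetical Button API


def hallucinated_variants(session_file: Path):
    """Return (timestamp, variant) for every Edit that introduces an unknown variant."""
    hits = []
    for line in session_file.read_text().splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        if record.get("type") != "assistant":
            continue
        content = record.get("message", {}).get("content", [])
        if not isinstance(content, list):
            continue
        for block in content:
            if block.get("type") != "tool_use" or block.get("name") != "Edit":
                continue
            new_string = block.get("input", {}).get("new_string", "")
            for variant in re.findall(r'variant="([^"]+)"', new_string):
                if variant not in KNOWN_VARIANTS:
                    hits.append((record.get("timestamp"), variant))
    return hits
```

Run against the session above, this returns the tertiary-ghost invention with its timestamp, before the human ever had to notice it.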

Bypassed tokens

The agent writes color: #3B82F6 in a component where it should have written var(--color-interactive-primary). Sometimes that's a gap in how tokens are exposed to the tool. More often it's semantic ambiguity. The agent couldn't work out whether the blue in question was your brand colour, your primary action colour, or your hover state, so it picked the raw value and moved on. In the entropy article I called this token ambiguity and treated it as drift, which it is, except the trace lets you point at the specific moment of ambiguity and the specific decision the agent made. Drift you can only see in aggregate. This you can see in the act.
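The detection is even simpler than for hallucinated props, because a raw hex value is unambiguous on its face. A sketch, assuming the same record shapes as the transcript excerpt above, and assuming that any hex literal an agent writes into a file should have been a token (your codebase may have legitimate exceptions worth filtering):

```python
# Flag every raw hex colour an agent wrote into a file, on the
# assumption it should have been a token. Checks both Edit and Write
# tool calls; record shapes follow the transcript excerpt above.
import json
import re
from pathlib import Path

HEX_LITERAL = re.compile(r"#[0-9A-Fa-f]{3,8}\b")


def bypassed_tokens(session_file: Path):
    """Return (timestamp, file_path, hex value) for every raw hex the agent introduced."""
    hits = []
    for line in session_file.read_text().splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        if record.get("type") != "assistant":
            continue
        content = record.get("message", {}).get("content", [])
        if not isinstance(content, list):
            continue
        for block in content:
            if block.get("type") != "tool_use" or block.get("name") not in ("Edit", "Write"):
                continue
            payload = block.get("input", {})
            text = payload.get("new_string", "") or payload.get("content", "")
            for raw in HEX_LITERAL.findall(text):
                hits.append((record.get("timestamp"), payload.get("file_path"), raw))
    return hits
```

The difference from a linter doing the same check at PR time is the timestamp: you see the moment of ambiguity in context, next to the prompt that produced it.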

Novel components

This is the one I think matters most, because it's the one traditional telemetry cannot see at all. The agent generates a PromoBanner from scratch when your system has a perfectly usable CalloutBanner. The component the agent built doesn't exist in your library, so it can never show up in a usage report. The only record of the event is the trace. Every novel component is a discoverability failure dressed up as a capability gap. The agent looked, didn't find what it needed under the name it expected, and built its own. That maps to the problem I described in "The component adoption gap" when I wrote about parallel navigation structures, except now you can see the failures as they happen rather than inferring them from survey data months later.
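Detecting it is a set difference: collect every capitalised JSX tag the agent wrote and subtract your library's export list. A sketch, with LIBRARY_COMPONENTS as a hypothetical stand-in for your real exports; the crude regex will miss edge cases, which is fine for a first pass over transcripts.

```python
# Spot components the agent invented: every capitalised JSX tag in a
# tool call's input that isn't in the library's export list.
# LIBRARY_COMPONENTS is a hypothetical stand-in for your real exports.
import json
import re
from pathlib import Path

LIBRARY_COMPONENTS = {"Button", "Card", "CalloutBanner", "Hero"}  # hypothetical

JSX_TAG = re.compile(r"<([A-Z][A-Za-z0-9]*)")


def novel_components(session_file: Path):
    """Return the set of component names the agent used that the library doesn't export."""
    seen = set()
    for line in session_file.read_text().splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        if record.get("type") != "assistant":
            continue
        content = record.get("message", {}).get("content", [])
        if not isinstance(content, list):
            continue
        for block in content:
            if block.get("type") != "tool_use":
                continue
            # Serialise the whole tool input so edits, writes, and any
            # other call shapes are all covered by one regex pass.
            seen.update(JSX_TAG.findall(json.dumps(block.get("input", {}))))
    return seen - LIBRARY_COMPONENTS
```

If PromoBanner turns up in the result, you've caught a discoverability failure that no usage report would ever have surfaced.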

Corrective turns

Someone on the team reviews the agent's output, pushes back, and the agent revises. "That should use our Card component, not a div with a border." "The spacing is wrong, use the token." "We don't call that a modal, we call it a sheet." Every one of these corrections is a labelled training example for what your documentation should have said in the first place. The human's correction is the ground truth. The agent's original output is the failure mode. And the pair is sitting in the transcript, already tagged by the fact that a correction happened. If you want a list of the things your system is most ambiguous about, pull every corrective turn from last month's Claude Code sessions across your team and read the human side of each one. It's the most depressing and useful hour you'll spend on your system this year.
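Pulling those pairs is mostly a matter of walking the transcript in order. A sketch, using a deliberately crude definition of "correction", any free-text user turn that immediately follows an assistant tool call, which over-collects but is plenty for building the reading list; record shapes follow the earlier excerpt.

```python
# Pair every assistant tool_use record with the free-text user turn
# that immediately follows it. Crude as a definition of "correction",
# but enough to build a reading list of the system's ambiguities.
import json
from pathlib import Path


def corrective_turns(session_file: Path):
    """Return (agent tool input, human reply) pairs from a session transcript."""
    records = [
        json.loads(line)
        for line in session_file.read_text().splitlines()
        if line.strip()
    ]
    pairs = []
    for prev, curr in zip(records, records[1:]):
        if prev.get("type") != "assistant" or curr.get("type") != "user":
            continue
        content = prev.get("message", {}).get("content", [])
        if not isinstance(content, list):
            continue
        tool_uses = [b for b in content if isinstance(b, dict) and b.get("type") == "tool_use"]
        reply = curr.get("message", {}).get("content")
        # A string reply is a human typing; tool results come back as
        # structured blocks and are skipped.
        if tool_uses and isinstance(reply, str):
            pairs.append((tool_uses[-1].get("input"), reply))
    return pairs
```

Run over the session shown earlier, this yields one pair: the tertiary-ghost edit on the left, "that's not a real variant" on the right. Reading the right-hand column across a month of sessions is the hour described above.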

Those four signals give you something the design systems field has never had. Adoption metrics tell you about use. Agent traces tell you about intent, about the gap between what people were trying to do and what your system let them do. That's the gap you've been trying to close for years with surveys, retros, and "how did you find documentation" interviews that nobody remembered accurately. It was always there. You couldn't see it.

What you can do with this on Monday

Open ~/.claude/projects/ and pick one of your own sessions from the last week where you were working on something that touched the design system. Better still, ask a few teammates to do the same on their machines. Read the transcripts for hesitation, ignoring whether the code is correct. Watch for the places where the agent paused, guessed, invented, or got corrected, and write down both what it tried to use and what the correction was. Do this for ten sessions across the team and you'll have a better list of system gaps than any survey has ever given you.

After that the work scales as far as you want to take it. You can grep for "not found" results, write Code Connect mappings against the gaps you find, or wire up a proper OpenTelemetry pipeline if you've got the appetite for it. The reframing matters more than the tooling does. The artefacts already exist. The work is in recognising them as signal.

What this changes about how you think about the system

If you take this seriously, you end up somewhere uncomfortable. The agent is the most honest reviewer your design system will ever have. It has no politeness filter, no learned workarounds, no investment in the relationship with the team that built the component. It doesn't know that everyone on the product team has been using a raw div because the Card has been weird since Q2. It tries the Card, fails, and leaves a trace. Every workaround your team has internalised without saying so, every prop name that nobody has the heart to tell you was a bad idea, the agent records all of it, in writing, with timestamps.

Most teams aren't ready to read that. Survey data is comfortable because it's mediated, because the people answering it have already done the politeness work for you. Trace data isn't mediated. It's the unfiltered version of what your system feels like to use under deadline pressure, and a lot of what it tells you is going to sting. The temptation is to treat the inconvenient findings as noise, the way teams have always treated qualitative feedback they didn't like. The problem with that is the agent will keep producing the same failures, the same hallucinated props, the same bypassed tokens, until something in the documentation or the API changes. The trace doesn't go away because you didn't read it.

The governance question I raised in "Your design system might be AI-ready, your organisation probably isn't" was about whether contribution models could absorb AI-generated patterns. This is the version of that question one layer down. Before you can decide whether an agent-surfaced pattern is worth promoting to a real component, someone has to be reading the traces in the first place. Someone has to treat the corrective turns from last month as evidence rather than as noise from a tool that occasionally gets things wrong. That's a posture decision, not a tooling decision, and most teams haven't made it.

Agent traces also contain the prompts people typed, and prompts sometimes contain sensitive product context. Treating the traces as telemetry means treating them with the same care as any other observability data. Retention policies, access controls, the boring stuff. Tractable, but not free.

Teams have been generating this data for a year. Most of it is on a thirty-day timer.

Your tokens became infrastructure when you weren't looking. Your component library became an API when you weren't looking. Your agent sessions have been telemetry the whole time, and that one might be the most useful of the three.

The transcripts are right there.


Thanks for reading! If you enjoyed this article, subscribing is the best way to keep up with new posts. And if it was useful, passing it on to someone who'd find it relevant is always appreciated.

You can find me on LinkedIn, X, and Bluesky.
