Why AX Alone Isn't Enough for Reliable Agents

The accessibility tree (AX) is one of the most underappreciated tools in the browser. It exposes standardized roles, names, and states for every accessible element, and many agent frameworks rightly rely on it.

But when it comes to reliable LLM agents acting on modern websites, AX alone isn't enough.

Not because AX is flawed—but because it was never designed for what agents need to do.

This post explains where AX shines, where it falls short, and why dependable agents also need geometry, stability signals, and verification.

What AX is great at

AX answers the question:

"What elements are accessible?"

It provides:

roles (button, link, textbox)
names and labels
states (checked, disabled, expanded)

For a single, static document, this is extremely powerful.

AX mental model

Accessibility Tree
┌─────────────────────┐
│ Button              │
│ name: "Continue"    │
│ role: button        │
│ enabled: true       │
└─────────────────────┘
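In code, a lookup against a tree like this is just a walk plus a role-and-name match. The sketch below is a minimal illustration, not any browser's or framework's real API: `AXNode`, `walk`, and `find_all` are hypothetical names, and the tree is hand-built rather than captured from a page.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class AXNode:
    """Hypothetical, simplified AX node: role, name, and one state flag."""
    role: str
    name: str = ""
    enabled: bool = True
    children: List["AXNode"] = field(default_factory=list)


def walk(node: AXNode) -> Iterator[AXNode]:
    """Yield the node and its descendants in document order."""
    yield node
    for child in node.children:
        yield from walk(child)


def find_all(root: AXNode, role: str, name: str) -> List[AXNode]:
    """Return every node whose role and name both match exactly."""
    return [n for n in walk(root) if n.role == role and n.name == name]


# Hand-built stand-in for a small, static checkout page.
tree = AXNode("document", "Checkout", children=[
    AXNode("textbox", "Email"),
    AXNode("button", "Continue"),
])

matches = find_all(tree, "button", "Continue")
print(len(matches), matches[0].enabled)  # prints: 1 True
```

On a static page with one "Continue" button the match is unique, which is exactly the situation where semantics alone are enough.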

If an agent already knows which element to act on and when the page is ready, AX works well.

But that's the catch.

Where AX breaks down for agents

Modern web apps are:

JS-heavy
dynamically hydrated
full of embedded content and iframes
constantly changing layout and visibility

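AX is intentionally lossy along exactly these dimensions: it describes what an element is, not where it is rendered or whether the page has settled.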

1. No global ordinality

AX doesn't encode:

"first result"
"main CTA"
"top item above the fold"

The Ordinality Problem

On a page with repeated elements, agents still need to answer: Which one matters?

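AX doesn't model that. It preserves document order, but document order is not the same as rendered position or visual prominence, so a role-and-name query can return several equally plausible candidates.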

2. Fragmentation across iframes

Each iframe is its own document, and automation tooling typically exposes a separate AX tree for each one.

Page
├─ AX Tree (main document)
├─ AX Tree (auth iframe)
└─ AX Tree (checkout iframe)

What's missing:

global ordering across frames
spatial relationships
visibility/occlusion awareness

What Agents Need

Agents need a single interaction surface, not multiple disconnected trees.
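One way to picture a single interaction surface is to translate every frame's elements into page coordinates and sort them together. The sketch below assumes each frame's origin within the page is already known and ignores scrolling, CSS transforms, and nested iframes. `Element`, `merge_frames`, and the sample coordinates are all hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Element:
    frame: str
    role: str
    name: str
    x: float  # left edge, relative to the owning frame
    y: float  # top edge, relative to the owning frame


def merge_frames(
    per_frame: Dict[str, List[Element]],
    frame_origins: Dict[str, Tuple[float, float]],
) -> List[Element]:
    """Shift each element by its frame's page offset, then sort top-to-bottom, left-to-right."""
    merged: List[Element] = []
    for frame_id, elements in per_frame.items():
        ox, oy = frame_origins[frame_id]
        merged.extend(replace(el, x=el.x + ox, y=el.y + oy) for el in elements)
    return sorted(merged, key=lambda el: (el.y, el.x))


per_frame = {
    "main": [
        Element("main", "button", "Sign in", 20, 40),
        Element("main", "link", "Help", 20, 600),
    ],
    "auth": [Element("auth", "textbox", "Email", 10, 10)],
    "checkout": [Element("checkout", "button", "Pay", 10, 20)],
}
# Assumed top-left corner of each frame in page coordinates.
frame_origins = {"main": (0, 0), "auth": (300, 100), "checkout": (0, 400)}

surface = merge_frames(per_frame, frame_origins)
print([el.name for el in surface])  # prints: ['Sign in', 'Email', 'Pay', 'Help']
```

After merging, "Pay" from the checkout iframe sorts above the main document's "Help" link, an ordering that no single per-frame tree can express.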

3. AX doesn't model stability

AX can report elements that:

exist but aren't usable yet
are technically accessible but visually blocked
are mid-transition during hydration

AX Answers

"Is this accessible?"

Agents Need

"Is this usable right now?"

Those are different questions.
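The gap between the two questions can be made concrete with a small check that returns the reasons an element is not yet usable. This is a sketch under stated assumptions: the `Box` and `Candidate` types, the `z` stacking value, and the 500 ms quiet-time threshold are illustrative choices, and a real implementation would read these values from the rendered page.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Box:
    x: float
    y: float
    w: float
    h: float

    def center(self) -> Tuple[float, float]:
        return (self.x + self.w / 2, self.y + self.h / 2)

    def contains_point(self, px: float, py: float) -> bool:
        return self.x <= px <= self.x + self.w and self.y <= py <= self.y + self.h


@dataclass(frozen=True)
class Candidate:
    name: str
    box: Box
    enabled: bool = True
    z: int = 0  # hypothetical stacking order; higher values paint on top


def usability_problems(
    target: Candidate,
    others: List[Candidate],
    viewport: Box,
    ms_since_last_dom_change: float,
    quiet_ms: float = 500,  # assumed threshold, not a standard value
) -> List[str]:
    """Return the reasons the target cannot be acted on yet; an empty list means usable."""
    reasons: List[str] = []
    if not target.enabled:
        reasons.append("disabled")
    cx, cy = target.box.center()
    if not viewport.contains_point(cx, cy):
        reasons.append("outside viewport")
    for other in others:
        if other.z > target.z and other.box.contains_point(cx, cy):
            reasons.append(f"occluded by {other.name}")
    if ms_since_last_dom_change < quiet_ms:
        reasons.append("DOM still changing")
    return reasons


viewport = Box(0, 0, 1280, 720)
button = Candidate("Continue", Box(100, 500, 120, 40))
banner = Candidate("cookie banner", Box(0, 480, 1280, 240), z=10)

print(usability_problems(button, [banner], viewport, ms_since_last_dom_change=120))
# prints: ['occluded by cookie banner', 'DOM still changing']
print(usability_problems(button, [], viewport, ms_since_last_dom_change=800))
# prints: []
```

The button is enabled and accessible in both calls. Only the second call, with the banner dismissed and the DOM quiet, says it is safe to click.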

Why geometry matters

Geometry adds spatial truth to semantics.

It answers:

where elements are
what's above/below
what's inside what
what's visible in the viewport

Geometry mental model

Viewport
┌──────────────────────────────┐
│ [Search Result #1]           │  ← dominant group, index 0
│   [Title]   [Open]           │
│                              │
│ [Search Result #2]           │  ← index 1
│   [Title]   [Open]           │
└──────────────────────────────┘

With geometry, agents can reason about:

Ordinality — "first result", "second button"
Grouping — "button inside same card"
Hierarchy — "main content vs sidebar"

AX alone doesn't encode this.
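With bounding boxes attached, "the Open button in the first result" becomes a sort plus a containment test. The sketch below is a simplified illustration: `Box`, `Node`, `reading_order`, and `inside` are hypothetical names, the coordinates are made up to match the diagram above, and sorting by top-left corner is a naive stand-in for real layout analysis.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    x: float
    y: float
    w: float
    h: float

    def contains(self, other: "Box") -> bool:
        """True when `other` lies entirely within this box."""
        return (
            self.x <= other.x
            and self.y <= other.y
            and other.x + other.w <= self.x + self.w
            and other.y + other.h <= self.y + self.h
        )


@dataclass(frozen=True)
class Node:
    role: str
    name: str
    box: Box


def reading_order(nodes: List[Node]) -> List[Node]:
    """Order nodes top-to-bottom, then left-to-right, by their top-left corner."""
    return sorted(nodes, key=lambda n: (n.box.y, n.box.x))


def inside(container: Node, nodes: List[Node]) -> List[Node]:
    """Return the nodes rendered entirely within the container."""
    return [n for n in nodes if n is not container and container.box.contains(n.box)]


# Deliberately listed out of visual order: list position is not rendered position.
nodes = [
    Node("article", "Search Result #2", Box(0, 120, 300, 100)),
    Node("button", "Open", Box(200, 150, 60, 24)),
    Node("article", "Search Result #1", Box(0, 0, 300, 100)),
    Node("button", "Open", Box(200, 30, 60, 24)),
]

cards = reading_order([n for n in nodes if n.role == "article"])
first_card = cards[0]
open_button = [n for n in inside(first_card, nodes) if n.role == "button"][0]
print(first_card.name, open_button.box.y)  # prints: Search Result #1 30
```

Both buttons share the same role and name, so only their position inside a card tells them apart.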

Why vision alone isn't the answer either

Vision-first agents take screenshots and ask:

"What do you see?"

This works—but at a cost.

Vision mental model

Screenshot
┌──────────────────────────────┐
│ pixels, text, colors         │
│ everything looks important   │
└──────────────────────────────┘

Vision Trade-offs

Expensive — burns tokens every step
Brittle — struggles with ordinality
Hard to debug — no clear action targets
Risky — can hallucinate success

Vision is powerful—but unreliable as a default perception layer.

Comparing the three approaches

AX (Semantics)

"What exists?"

Geometry (Structure)

"What matters where?"

Vision (Fallback)

"What does it look like?"

Each answers a different question.

Reliable agents need all three, but in the right order.

The Sentience approach: structure-first, vision-last

Sentience treats the browser as a verifiable interaction surface.

It combines:

AX-style semantics: roles, names, states
Rendered geometry: layout, ordinality, grouping
Stability signals: DOM quiet time, confidence
Assertions: verify outcomes instead of guessing

Sentience mental model

Rendered DOM
   ↓
Semantic + Geometry Snapshot
   ↓
Assertions
   ↓
PASS / FAIL (with reasons)

Vision is used only when structure is exhausted, and only to verify, not to guess.
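The snapshot-to-verdict flow can be sketched in a few lines. This is not Sentience's actual API: `Snapshot`, `url_contains`, `element_present`, and `verify` are hypothetical names chosen to illustrate the idea that every assertion either passes or returns a human-readable reason.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical post-action snapshot: the URL plus (role, name) pairs."""
    url: str
    elements: Tuple[Tuple[str, str], ...]


# An assertion returns None on success, or a failure reason.
Assertion = Callable[[Snapshot], Optional[str]]


def url_contains(fragment: str) -> Assertion:
    def check(snap: Snapshot) -> Optional[str]:
        if fragment in snap.url:
            return None
        return f"url {snap.url!r} does not contain {fragment!r}"
    return check


def element_present(role: str, name: str) -> Assertion:
    def check(snap: Snapshot) -> Optional[str]:
        if (role, name) in snap.elements:
            return None
        return f"no {role} named {name!r}"
    return check


def verify(snap: Snapshot, assertions: List[Assertion]) -> Tuple[str, List[str]]:
    """Run every assertion and return PASS or FAIL together with all failure reasons."""
    reasons = [r for r in (a(snap) for a in assertions) if r is not None]
    return ("FAIL" if reasons else "PASS", reasons)


after_click = Snapshot("https://example.com/cart", (("heading", "Your cart"),))
status, reasons = verify(
    after_click,
    [url_contains("/checkout"), element_present("heading", "Your cart")],
)
print(status, reasons)
# prints: FAIL ["url 'https://example.com/cart' does not contain '/checkout'"]
```

Running all assertions, rather than stopping at the first failure, keeps the full list of reasons available for debugging.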

Why this matters for LLM agents

When structure is explicit:

smaller models work
token usage drops
retries become bounded
failures become explainable
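Bounded retries and explainable failures follow naturally once each action is paired with a check. The loop below is a minimal sketch with hypothetical names (`run_step`, `StepResult`) and a fake page whose first click is ignored. The limit of three attempts is an arbitrary illustrative choice.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class StepResult:
    ok: bool
    attempts: int
    failures: List[str] = field(default_factory=list)


def run_step(
    act: Callable[[], None],
    check: Callable[[], Optional[str]],
    max_attempts: int = 3,
) -> StepResult:
    """Act, then verify; retry at most max_attempts times and record each failure reason."""
    failures: List[str] = []
    for attempt in range(1, max_attempts + 1):
        act()
        reason = check()
        if reason is None:
            return StepResult(True, attempt, failures)
        failures.append(f"attempt {attempt}: {reason}")
    return StepResult(False, max_attempts, failures)


# Fake page: the first click lands before hydration and is silently dropped.
state = {"clicks": 0, "url": "/cart"}


def click_checkout() -> None:
    state["clicks"] += 1
    if state["clicks"] >= 2:
        state["url"] = "/checkout"


def on_checkout() -> Optional[str]:
    return None if state["url"] == "/checkout" else f"still on {state['url']}"


result = run_step(click_checkout, on_checkout)
print(result.ok, result.attempts, result.failures)
# prints: True 2 ['attempt 1: still on /cart']
```

The planner never has to guess whether the click worked: it gets either a verified success or a bounded list of reasons it did not.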

LLMs are great at planning. They're bad at inferring unstable UI structure from scratch.

The system should handle that.

The takeaway

AX is necessary—but not sufficient—for reliable agents.

On modern, JS-heavy, iframe-filled websites, agents need:

What Reliable Agents Require

Semantics to know what exists
Geometry to know what matters
Stability to know when to act
Assertions to know what succeeded

That's how you move from:

"The agent probably clicked the right thing"

to:

"The system verified the outcome."

Move Beyond AX-Only Agents

Sentience combines accessibility semantics with geometry, stability signals, and verification—giving your agents the perception layer they need to act reliably.
