Engineering
January 15, 2026 · 8 min read

Why Structure Beats Vision (and Raw AX) for LLM Web Agents


Modern LLM agents struggle on the web for a simple reason: the web is not a static document, it's a live, unstable system.

Most agent stacks today pick one of three approaches:

  1. Raw DOM / HTML parsing
  2. Accessibility Tree (AX) traversal
  3. Vision-first screenshot reasoning

Each works—up to a point. But none of them alone gives an agent what it actually needs to reason and act reliably on modern, JS-heavy websites.

At Sentience, we took a different approach: treat the browser as a structured, verifiable interaction surface, not a document or an image.

This post explains why.


The problem: agents need global, stable context

When humans use a webpage, we implicitly understand things like:

  • what's the main content
  • which result is first
  • which button matters
  • whether the page is ready yet

LLMs don't get this for free.

On modern sites:

  • content loads asynchronously
  • layout shifts over time
  • important UI lives inside iframes
  • multiple elements look semantically similar

The Core Challenge

Agents don't just need to know what exists — they need to know what matters, where it is, and whether it's usable right now.


Why raw DOM and HTML parsing fall short

Static HTML or DOM parsing assumes:

  • the page structure is stable
  • content is fully present
  • ordering reflects importance

None of this holds for single-page applications (SPAs).

By the time an agent acts:

  • the DOM may still be mutating
  • critical UI may not exist yet
  • element order may change
  • CSS/layout determines meaning more than markup

HTML Parsing Answers

"What tags exist?"

Agents Need

"What can a user reliably interact with now?"


Why the Accessibility Tree helps—but isn't enough

The Accessibility Tree (AX) is one of the best standardized representations we have. It provides:

  • roles
  • names
  • states (checked, disabled, expanded)

This is extremely valuable, and Sentience uses many of these semantics.

But AX is intentionally lossy:

  • it abstracts away layout and geometry
  • it doesn't encode ordinality ("first", "top result")
  • it models documents, not global interaction surfaces
  • each iframe has its own tree

AX Limitations on JS-Heavy Pages

On modern SPAs and embedded pages, AX creates real gaps:

  • No clean global ordering across frames
  • No notion of dominant repeated groups (feeds, cards)
  • No visibility or occlusion awareness
  • Elements may be "accessible" but not usable yet

AX tells you what exists. Agents need to know what matters.
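One way to recover the global ordinality AX lacks: project every node from its frame-local box into absolute page coordinates, then sort by geometry. The types and fields below are a hypothetical sketch, not an actual AX or Sentience schema.

```python
from dataclasses import dataclass

@dataclass
class Node:
    frame_offset: tuple[float, float]            # frame's position in the page
    box: tuple[float, float, float, float]       # x, y, w, h inside the frame
    role: str
    name: str

def global_order(nodes: list[Node]) -> list[Node]:
    """Sort nodes by absolute page position: top-to-bottom, then
    left-to-right. This yields the ordinality ("first result",
    "top card") that per-frame AX trees cannot express."""
    def key(n: Node):
        fx, fy = n.frame_offset
        x, y, *_ = n.box
        return (fy + y, fx + x)
    return sorted(nodes, key=key)
```

With this, "the first button" is well-defined even when that button lives inside an iframe whose AX tree is separate from the main document's.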


Why vision-first agents are expensive and brittle

Vision models see everything—but at a cost.

Vision-first agents:

  • burn large token budgets every step
  • struggle with fine-grained ordinality
  • are hard to debug
  • silently hallucinate success

They're great for demos. They're risky for production workflows.

Vision Answers

"What does this look like?"

Agents Also Need

"Did my action actually succeed?"


The Sentience approach: structure + stability + verification

Sentience treats the browser as a semantic, verifiable system.

Instead of raw HTML, AX alone, or screenshots, we snapshot:

  • Rendered DOM after hydration — not static HTML
  • Layout geometry — where things actually are
  • Ordinal signals — document order, viewport order
  • Grouping — feeds, cards, lists
  • State — enabled, checked, expanded, value
  • Stability metrics — DOM quiet time, confidence

From this, agents get:

  • a compact, structured view of what matters
  • predictable token usage
  • deterministic action targets

The Key Difference

Every action is followed by assertions:

Not "I think I clicked the right thing" but "The system verified the outcome."
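The act-then-assert pattern can be sketched in a few lines: every action is paired with an explicit postcondition check and a bounded retry budget. `action` and `assertion` are stand-ins for real driver calls.

```python
def act_with_assertion(action, assertion, retries=2):
    """Run `action`, then verify its outcome with `assertion`.

    `action` performs a browser step (click, type); `assertion`
    re-inspects the page and returns True only if the expected
    state change actually happened. Raises if the outcome cannot
    be verified within the retry budget, so failures surface
    instead of silently propagating downstream.
    """
    for attempt in range(retries + 1):
        action()
        if assertion():
            return attempt  # number of retries that were needed
    raise AssertionError("action outcome could not be verified")
```

The retry count doubles as telemetry: a step that routinely needs retries is a signal that the page's stability gating needs tuning.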


Why this matters for LLM reasoning

When you reduce perception noise and encode structure explicitly:

  • smaller models become viable (3B–14B)
  • reasoning load shifts from the LLM to the system
  • failures become explainable
  • retries become bounded and intentional

LLMs are great at planning. They're bad at inferring unstable UI structure from scratch.

Sentience lets LLMs do what they're good at—and removes what they're bad at.


Vision as a last resort, not a default

Sentience doesn't reject vision.

When structural signals are exhausted:

  • snapshot confidence drops
  • retries are bounded
  • the system can fall back to vision to verify, not to guess

Assertions stay the same. Only the perception layer changes.

That keeps behavior auditable and safe.
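This fallback policy can be sketched as a single decision: verify structurally when confidence is high, and drop to a more expensive vision check only when it is not. The same pass/fail contract applies either way. Names are hypothetical.

```python
def verify(assertion, snapshot_confidence, vision_check=None, threshold=0.8):
    """Verify an action outcome, preferring structural signals.

    `assertion` checks the structured snapshot; `vision_check` is an
    optional screenshot-based verifier used only when snapshot
    confidence falls below `threshold`. Both return a plain boolean,
    which keeps the behavior auditable regardless of perception layer.
    """
    if snapshot_confidence >= threshold:
        return assertion()
    if vision_check is not None:
        return vision_check()  # vision used to verify, not to guess
    return False  # low confidence and no fallback: fail safe
```

Failing safe on low confidence is deliberate: an unverifiable success is treated as a failure, never the other way around.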


The takeaway

Accessibility trees, DOM parsing, and vision are all useful.

But reliable agents need more than any single representation.

What Agents Actually Need

  • Structure to reason
  • Stability to act
  • Verification to trust outcomes

That's what Sentience is built for.

Vision agents show what's possible. Sentience makes it dependable.
