Why Structure Beats Vision (and Raw AX) for LLM Web Agents
Modern LLM agents struggle on the web because the web is not a static document—it's a live, unstable system. This post explains why structure + stability + verification beats raw DOM, accessibility trees, or vision-first approaches.
Most agent stacks today pick one of three approaches:
- Raw DOM / HTML parsing
- Accessibility Tree (AX) traversal
- Vision-first screenshot reasoning
Each works up to a point, but none of them alone gives an agent what it actually needs to reason and act reliably on modern, JS-heavy websites.
At Sentience, we took a different approach: treat the browser as a structured, verifiable interaction surface, not a document or an image.
This post explains why.
The problem: agents need global, stable context
When humans use a webpage, we implicitly understand things like:
- what's the main content
- which result is first
- which button matters
- whether the page is ready yet
LLMs don't get this for free.
On modern sites:
- content loads asynchronously
- layout shifts over time
- important UI lives inside iframes
- multiple elements look semantically similar
The Core Challenge
Agents don't just need to know what exists — they need to know what matters, where it is, and whether it's usable right now.
Why raw DOM and HTML parsing fall short
Static HTML or DOM parsing assumes:
- the page structure is stable
- content is fully present
- ordering reflects importance
None of this holds for modern single-page applications (SPAs).
By the time an agent acts:
- the DOM may still be mutating
- critical UI may not exist yet
- element order may change
- CSS/layout determines meaning more than markup
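One consequence: an agent shouldn't snapshot until the DOM has gone quiet. A minimal Python sketch of that idea (a hypothetical helper, not any library's API), given a list of mutation timestamps in milliseconds:

```python
def first_quiet_time(mutation_times, quiet_ms=500):
    """Given sorted mutation timestamps (ms), return the earliest moment at
    which the DOM has gone quiet_ms without a new mutation."""
    for prev, cur in zip(mutation_times, mutation_times[1:]):
        if cur - prev >= quiet_ms:
            # The first sufficiently large gap: quiet starts after `prev`.
            return prev + quiet_ms
    # No gap inside the stream; quiet begins after the last mutation.
    return mutation_times[-1] + quiet_ms if mutation_times else 0.0
```

In a real browser, the timestamps would come from something like a MutationObserver; the point is that "ready" is a measurable property, not an assumption.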
HTML parsing answers: "What tags exist?"
Agents need: "What can a user reliably interact with now?"
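The distinction can be made concrete. A heuristic Python sketch over hypothetical element records (not a real DOM API): an element is actionable only if it is rendered, enabled, has real geometry, and intersects the viewport.

```python
def is_usable_now(el, viewport):
    """Heuristic usability check on a hypothetical element record."""
    if not el.get("visible", False) or el.get("disabled", False):
        return False
    x, y, w, h = el["bbox"]
    if w <= 0 or h <= 0:
        # Zero-size elements exist in markup but can't be clicked.
        return False
    vw, vh = viewport
    # Any overlap with the viewport rectangle counts as on-screen.
    return x < vw and y < vh and x + w > 0 and y + h > 0
```

A `<button>` tag in the HTML passes the "what tags exist?" test and still fails every one of these checks.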
Why the Accessibility Tree helps—but isn't enough
The Accessibility Tree (AX) is one of the best standardized representations we have. It provides:
- roles
- names
- states (checked, disabled, expanded)
This is extremely valuable, and Sentience uses many of these semantics.
But AX is intentionally lossy:
- it abstracts away layout and geometry
- it doesn't encode ordinality ("first", "top result")
- it models documents, not global interaction surfaces
- each iframe has its own tree
AX Limitations on JS-Heavy Pages
On modern SPAs and embedded pages, AX creates real gaps:
- No clean global ordering across frames
- No notion of dominant repeated groups (feeds, cards)
- No visibility or occlusion awareness
- Elements may be "accessible" but not usable yet
AX tells you what exists. Agents need to know what matters.
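The cross-frame ordering gap is fixable once you have geometry. A Python sketch (hypothetical data shapes, not an AX API) that merges per-frame nodes into one page-level, top-to-bottom order:

```python
def global_order(frames):
    """Merge per-frame AX-like nodes into one top-to-bottom, left-to-right
    ordering by translating each node into page coordinates."""
    merged = []
    for frame in frames:
        ox, oy = frame["offset"]          # frame's position within the page
        for node in frame["nodes"]:
            x, y = node["pos"]            # node's position within its frame
            merged.append({**node, "page_pos": (x + ox, y + oy)})
    return sorted(merged, key=lambda n: (n["page_pos"][1], n["page_pos"][0]))
```

Per-frame trees alone can't answer "what's the first thing on the page?"; a shared coordinate space can.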
Why vision-first agents are expensive and brittle
Vision models see everything—but at a cost.
Vision-first agents:
- burn large token budgets every step
- struggle with fine-grained ordinality
- are hard to debug
- silently hallucinate success
They're great for demos. They're risky for production workflows.
Vision answers: "What does this look like?"
Agents also need: "Did my action actually succeed?"
The Sentience approach: structure + stability + verification
Sentience treats the browser as a semantic, verifiable system.
Instead of raw HTML, AX alone, or screenshots, we snapshot:
- Rendered DOM after hydration — not static HTML
- Layout geometry — where things actually are
- Ordinal signals — document order, viewport order
- Grouping — feeds, cards, lists
- State — enabled, checked, expanded, value
- Stability metrics — DOM quiet time, confidence
From this, agents get:
- a compact, structured view of what matters
- predictable token usage
- deterministic action targets
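The fields above can be sketched as a data shape. The field names here are illustrative, not Sentience's actual schema:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ElementSnapshot:
    ref: str                   # stable action target, e.g. "btn-3"
    role: str                  # AX-style role ("button", "link", ...)
    name: str                  # accessible name
    bbox: tuple                # (x, y, w, h) page geometry after hydration
    doc_order: int             # position in document order
    viewport_order: int        # position in top-to-bottom visual order
    group: Optional[str] = None   # repeated-group id ("feed", "card-list")
    state: dict = field(default_factory=dict)  # enabled/checked/expanded/value

@dataclass
class PageSnapshot:
    url: str
    elements: list
    dom_quiet_ms: float        # how long the DOM has been stable
    confidence: float          # 0..1 structural confidence
```

Because the shape is fixed and compact, token usage per step is predictable, and `ref` gives the planner a deterministic target instead of a guessed selector.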
The Key Difference
Every action is followed by assertions: not "I think I clicked the right thing," but "the system verified the outcome."
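The pattern can be sketched as act-then-assert. `FakeDriver` here is a stand-in for illustration, not a real browser API:

```python
class ActionFailed(Exception):
    """Raised when a verified action's postcondition does not hold."""

def click_and_verify(driver, ref, postcondition):
    """Act, then assert: the click only counts as a success if the
    postcondition holds on a fresh snapshot taken afterwards."""
    driver.click(ref)
    snapshot = driver.snapshot()
    if not postcondition(snapshot):
        raise ActionFailed(f"postcondition failed after clicking {ref!r}")
    return snapshot

# Tiny stand-in driver for illustration only.
class FakeDriver:
    def __init__(self):
        self.clicked = []
    def click(self, ref):
        self.clicked.append(ref)
    def snapshot(self):
        # Pretend clicking "add-to-cart" navigated us to /cart.
        return {"url": "/cart" if "add-to-cart" in self.clicked else "/"}
```

A failed postcondition surfaces as an explicit, catchable error, which is what makes bounded retries possible instead of silent hallucinated success.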
Why this matters for LLM reasoning
When you reduce perception noise and encode structure explicitly:
- smaller models become viable (3B–14B)
- reasoning load shifts from the LLM to the system
- failures become explainable
- retries become bounded and intentional
LLMs are great at planning. They're bad at inferring unstable UI structure from scratch.
Sentience lets LLMs do what they're good at—and removes what they're bad at.
Vision as a last resort, not a default
Sentience doesn't reject vision.
When structural signals are exhausted:
- snapshot confidence drops
- retries are bounded
- the system can fall back to vision to verify, not to guess
Assertions stay the same. Only the perception layer changes.
That keeps behavior auditable and safe.
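The fallback policy fits in a few lines. The threshold and field names below are illustrative, not Sentience's actual values:

```python
CONFIDENCE_FLOOR = 0.6  # illustrative threshold, not a real default

def perceive(structural_snapshot, vision_check):
    """Use the structural snapshot when it is trustworthy; otherwise fall
    back to a vision check. The caller's assertions run unchanged on the
    result either way, so only the perception layer differs."""
    if structural_snapshot["confidence"] >= CONFIDENCE_FLOOR:
        return structural_snapshot["elements"], "structure"
    return vision_check(), "vision"
```

Logging which branch fired on every step is what keeps the fallback auditable.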
The takeaway
Accessibility trees, DOM parsing, and vision are all useful.
But reliable agents need more than any single representation.
What Agents Actually Need
- Structure to reason
- Stability to act
- Verification to trust outcomes
That's what Sentience is built for.
Vision agents show what's possible. Sentience makes it dependable.