Engineering
January 15, 2026 · 8 min read

Why Structure Beats Vision (and Raw AX) for LLM Web Agents


Modern LLM agents struggle on the web for a simple reason: the web is not a static document, it's a live, unstable system.

Most agent stacks today pick one of three approaches:

  1. Raw DOM / HTML parsing
  2. Accessibility Tree (AX) traversal
  3. Vision-first screenshot reasoning

Each works—up to a point. But none of them alone gives an agent what it actually needs to reason and act reliably on modern, JS-heavy websites.

At Sentience, we took a different approach: treat the browser as a structured, verifiable interaction surface, not a document or an image.

This post explains why.


The problem: agents need global, stable context

When humans use a webpage, we implicitly understand things like:

  • what's the main content
  • which result is first
  • which button matters
  • whether the page is ready yet

LLMs don't get this for free.

On modern sites:

  • content loads asynchronously
  • layout shifts over time
  • important UI lives inside iframes
  • multiple elements look semantically similar

The Core Challenge

Agents don't just need to know what exists — they need to know what matters, where it is, and whether it's usable right now.


Why raw DOM and HTML parsing fall short

Static HTML or DOM parsing assumes:

  • the page structure is stable
  • content is fully present
  • ordering reflects importance

None of this holds for single-page applications (SPAs).

By the time an agent acts:

  • the DOM may still be mutating
  • critical UI may not exist yet
  • element order may change
  • CSS/layout determines meaning more than markup

HTML Parsing Answers

"What tags exist?"

Agents Need

"What can a user reliably interact with now?"


Why the Accessibility Tree helps—but isn't enough

The Accessibility Tree (AX) is one of the best standardized representations we have. It provides:

  • roles
  • names
  • states (checked, disabled, expanded)

This is extremely valuable, and Sentience uses many of these semantics.

But AX is intentionally lossy:

  • it abstracts away layout and geometry
  • it doesn't encode ordinality ("first", "top result")
  • it models documents, not global interaction surfaces
  • each iframe has its own tree

AX Limitations on JS-Heavy Pages

On modern SPAs and embedded pages, AX creates real gaps:

  • No clean global ordering across frames
  • No notion of dominant repeated groups (feeds, cards)
  • No visibility or occlusion awareness
  • Elements may be "accessible" but not usable yet

AX tells you what exists. Agents need to know what matters.
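One way to recover the global ordinality AX lacks: project every node from its frame-local box into absolute page coordinates, then sort by geometry. The types and fields below are a hypothetical sketch, not an actual AX or Sentience schema.

```python
from dataclasses import dataclass

@dataclass
class Node:
    frame_offset: tuple[float, float]            # frame's position in the page
    box: tuple[float, float, float, float]       # x, y, w, h inside the frame
    role: str
    name: str

def global_order(nodes: list[Node]) -> list[Node]:
    """Sort nodes by absolute page position: top-to-bottom, then
    left-to-right. This yields the ordinality ("first result",
    "top card") that per-frame AX trees cannot express."""
    def key(n: Node):
        fx, fy = n.frame_offset
        x, y, *_ = n.box
        return (fy + y, fx + x)
    return sorted(nodes, key=key)
```

With this, "the first button" is well-defined even when that button lives inside an iframe whose AX tree is separate from the main document's.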


Why vision-first agents are expensive and brittle

Vision models see everything—but at a cost.

Vision-first agents:

  • burn large token budgets every step
  • struggle with fine-grained ordinality
  • are hard to debug
  • silently hallucinate success

They're great for demos. They're risky for production workflows.

Vision Answers

"What does this look like?"

Agents Also Need

"Did my action actually succeed?"


The Sentience approach: structure + stability + verification

Sentience treats the browser as a semantic, verifiable system.

Instead of raw HTML, AX alone, or screenshots, we snapshot:

  • Rendered DOM after hydration — not static HTML
  • Layout geometry — where things actually are
  • Ordinal signals — document order, viewport order
  • Grouping — feeds, cards, lists
  • State — enabled, checked, expanded, value
  • Stability metrics — DOM quiet time, confidence

From this, agents get:

  • a compact, structured view of what matters
  • predictable token usage
  • deterministic action targets

The Key Difference

Every action is followed by assertions:

Not "I think I clicked the right thing" but "The system verified the outcome."
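The act-then-assert pattern can be sketched in a few lines: every action is paired with an explicit postcondition check and a bounded retry budget. `action` and `assertion` are stand-ins for real driver calls.

```python
def act_with_assertion(action, assertion, retries=2):
    """Run `action`, then verify its outcome with `assertion`.

    `action` performs a browser step (click, type); `assertion`
    re-inspects the page and returns True only if the expected
    state change actually happened. Raises if the outcome cannot
    be verified within the retry budget, so failures surface
    instead of silently propagating downstream.
    """
    for attempt in range(retries + 1):
        action()
        if assertion():
            return attempt  # number of retries that were needed
    raise AssertionError("action outcome could not be verified")
```

The retry count doubles as telemetry: a step that routinely needs retries is a signal that the page's stability gating needs tuning.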


Why this matters for LLM reasoning

When you reduce perception noise and encode structure explicitly:

  • smaller models become viable (3B–14B)
  • reasoning load shifts from the LLM to the system
  • failures become explainable
  • retries become bounded and intentional

LLMs are great at planning. They're bad at inferring unstable UI structure from scratch.

Sentience lets LLMs do what they're good at—and removes what they're bad at.


Vision as a last resort, not a default

Sentience doesn't reject vision.

When structural signals are exhausted:

  • snapshot confidence drops
  • retries are bounded
  • the system can fall back to vision to verify, not to guess

Assertions stay the same. Only the perception layer changes.

That keeps behavior auditable and safe.
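This fallback policy can be sketched as a single decision: verify structurally when confidence is high, and drop to a more expensive vision check only when it is not. The same pass/fail contract applies either way. Names are hypothetical.

```python
def verify(assertion, snapshot_confidence, vision_check=None, threshold=0.8):
    """Verify an action outcome, preferring structural signals.

    `assertion` checks the structured snapshot; `vision_check` is an
    optional screenshot-based verifier used only when snapshot
    confidence falls below `threshold`. Both return a plain boolean,
    which keeps the behavior auditable regardless of perception layer.
    """
    if snapshot_confidence >= threshold:
        return assertion()
    if vision_check is not None:
        return vision_check()  # vision used to verify, not to guess
    return False  # low confidence and no fallback: fail safe
```

Failing safe on low confidence is deliberate: an unverifiable success is treated as a failure, never the other way around.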


The takeaway

Accessibility trees, DOM parsing, and vision are all useful.

But reliable agents need more than any single representation.

What Agents Actually Need

  • Structure to reason
  • Stability to act
  • Verification to trust outcomes

That's what Sentience is built for.

Vision agents show what's possible. Sentience makes it dependable.
