Engineering
January 16, 2026 · 7 min read

Running Real Browser Agents on a 3B Local Model (No Vision Required)

How structure-first snapshots and assertions cut browser agent token usage by ~50% and make small local models viable.

The problem with browser agents today

Most browser agents today reason from screenshots.

Each step typically involves:

  • capturing pixels
  • sending images to a vision model
  • asking the model to infer structure, order, and state

This works, but it is expensive and fragile.

In practice, this leads to:

  • High token usage per step
  • Dependence on large cloud models
  • Retries instead of verification
  • Flaky behavior on modern SPAs

The common assumption is that reliable browser agents require large vision models.

That assumption turns out to be wrong.


A different approach: structure first, vision last

Instead of reasoning from pixels, we treat the rendered DOM as a semantic data structure.

Each snapshot extracts:

  • Semantic roles — button, link, textbox
  • Geometry — bounding boxes, viewport order
  • Grouping — lists, cards, dominant groups
  • Ordinality — first, top, main
  • State — enabled, checked, expanded

The agent reasons over structure, not images.

Vision is optional and only used as a fallback when structure is insufficient.
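
To make the shape of a snapshot concrete, here is a rough sketch in TypeScript. The field names are ours for illustration, not the SDK's actual schema:

  // Hypothetical shape of one snapshot element (illustrative only,
  // not the Sentience SDK's real schema).
  interface SnapshotElement {
    role: string;                     // semantic role: "button", "link", "textbox"
    name: string;                     // accessible name / visible label
    box: { x: number; y: number; w: number; h: number }; // geometry
    groupId: string | null;           // grouping: list, card, dominant group
    rankInGroup: number;              // ordinality: 1 = first in its group
    state: {                          // explicit UI state
      enabled: boolean;
      checked?: boolean;
      expanded?: boolean;
    };
  }

  // A snapshot is an ordered array of these elements.
  type Snapshot = SnapshotElement[];

A page distilled this way is small: Demo 1 below fits ~50 such elements into roughly 1,600 tokens per step.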

Jest-Style Assertions

On top of structure-first snapshots, we add Jest-style assertions so agents verify outcomes instead of guessing and retrying.

The result is lower token usage, deterministic behavior, and viability for small local models.
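
In spirit, an agent step reads like a test, not a guess. Here is a minimal self-contained sketch with a from-scratch helper; none of these names are the SDK's real API:

  // Tiny Jest-style assertion helper, written from scratch for this post.
  function expectThat(cond: boolean, label: string): void {
    if (!cond) throw new Error(`assertion failed: ${label}`);
  }

  // Verify outcomes against the snapshot instead of retrying blindly.
  declare const snapshot: Snapshot; // Snapshot as sketched above

  const submit = snapshot.find(
    (e) => e.role === "button" && e.name === "Submit"
  );
  expectThat(submit !== undefined, "element exists");
  expectThat(submit!.state.enabled, "element is enabled");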

Below are three concrete demos.


Demo 1: Deterministic list reasoning (Hacker News)

Task: Open the top "Show HN" post.

This task often breaks vision-based agents, because visual rank does not equal semantic rank: what appears first on screen is not always the first item in the underlying list.

Configuration

  • Model: Qwen 2.5 3B (local)
  • Vision: disabled
  • Snapshot size: ~50 semantic elements
  • Tokens per step: ~1,600

Assertions

  • element exists
  • element is first in dominant group
  • navigation succeeded

Result: PASS. Deterministic, zero retries.

The agent did not inspect pixels. It relied on explicit ordinality and grouping metadata.
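
Expressed against the hypothetical snapshot shape from earlier, the demo's checks come down to a few explicit comparisons. A sketch, not the SDK's real API:

  // Demo 1 sketch, reusing SnapshotElement/Snapshot/expectThat from above.
  declare const snapshot: Snapshot;
  declare const agent: { click(el: SnapshotElement): Promise<void> }; // assumed API
  declare const page: { url: string };

  const startUrl = page.url;
  const showHn = snapshot.find(
    (e) => e.role === "link" && e.name.startsWith("Show HN")
  );

  expectThat(showHn !== undefined, "element exists");
  expectThat(showHn!.rankInGroup === 1, "element is first in dominant group");

  await agent.click(showHn!);
  expectThat(page.url !== startUrl, "navigation succeeded");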

Reproduce this demo with Qwen 2.5 3B and the Sentience SDK: View code on GitHub →


Demo 2: SPA login and state verification (Local Llama Land)

Modern SPAs are hostile to automation:

  • delayed hydration
  • disabled buttons that later become enabled
  • async profile loading
  • validation-gated flows

Vision agents usually solve this with sleeps and retries.

We do not.

Configuration

  • Site: Local Llama Land (Next.js SPA)
  • Model: Qwen 2.5 3B (local)
  • Vision: disabled

Assertions

  • button is disabled
  • button eventually becomes enabled
  • profile text appears after navigation

Result: PASS. No sleeps. No magic waits.

The agent waited on UI state, not time.
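
What "waited on UI state" means in practice: poll fresh snapshots until the assertion holds or a deadline passes. A minimal sketch, assuming a takeSnapshot function that the real SDK would expose in some form:

  // Re-snapshot until a state predicate holds, instead of sleeping.
  declare function takeSnapshot(): Promise<Snapshot>;

  async function eventually(
    pred: (snap: Snapshot) => boolean,
    label: string,
    timeoutMs = 10_000,
    intervalMs = 250
  ): Promise<void> {
    const deadline = Date.now() + timeoutMs;
    while (Date.now() < deadline) {
      if (pred(await takeSnapshot())) return; // state reached: done
      await new Promise((r) => setTimeout(r, intervalMs));
    }
    throw new Error(`eventually failed: ${label}`);
  }

  // "button eventually becomes enabled"
  await eventually(
    (snap) => snap.some((e) => e.role === "button" && e.state.enabled),
    "login button becomes enabled"
  );

The key property is that the deadline is a failure bound, not a pacing mechanism: the step completes the moment the state is observed.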

Reproduce this demo with Qwen 2.5 3B and the Sentience SDK: View code on GitHub →


Demo 3: Amazon shopping flow (stress test)

Amazon is a difficult real-world site:

  • JS-heavy
  • noisy DOM
  • frequent layout changes

We ran a shopping flow: search, open result, add to cart.

Configuration

  • Model: Qwen 2.5 3B (local)
  • Vision: disabled (fallback available but not used)
  • Total tokens: ~5,500

Assertions

  • navigation succeeded
  • button enabled
  • cart confirmation visible

Result: PASS.

Under the same conditions, vision-only agents failed repeatedly.
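
The flow itself is just act/assert pairs, where each assertion gates the next step. A hedged sketch; perform and verify are invented names for whatever the SDK actually exposes:

  // Demo 3 sketch: each action must pass its assertion before the next runs.
  declare const agent: {
    perform(instruction: string): Promise<void>;
    verify(assertion: string): Promise<void>; // throws if the check fails
  };

  const steps = [
    { act: "search for the target product", check: "navigation succeeded" },
    { act: "open the first matching result", check: "button enabled" },
    { act: 'click "Add to Cart"', check: "cart confirmation visible" },
  ];

  for (const step of steps) {
    await agent.perform(step.act);  // the model picks the concrete element
    await agent.verify(step.check); // a failed assertion stops the flow here
  }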

Reproduce this demo with Qwen 2.5 3B and the Sentience SDK: View code on GitHub →


Token usage comparison

Vision-Based Agents

  • 3,000+ tokens per step
  • screenshots on every action
  • inference from pixels

Structure-First Agents

  • ~1,500 tokens per step
  • no screenshots in prompt
  • explicit semantic data

This represents roughly a 50% reduction in token usage while improving reliability.

Key Insight

This is not about parsing HTML better. It is about removing unnecessary reasoning work from the model.


Why small local models work

Small models struggle with:

  • Inferring structure from pixels
  • Ambiguous ordering
  • Unclear UI state

They do not struggle with:

  • Explicit structure
  • Clear ordinality
  • Boolean state checks

By moving perception and verification outside the model, the model's role becomes simpler and more reliable.
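
To see why, consider roughly what the model receives per element in a structure-first prompt (the format here is invented for illustration): every judgment it needs is already an explicit field, so "reasoning" reduces to reading and comparing values.

  // Hypothetical one-line-per-element serialization handed to the model.
  // Role, order, and state are explicit fields, not inferences from pixels.
  const promptLines: string = [
    '[12] link "Show HN: My project" group=stories rank=1',
    '[27] button "Add to Cart" state=enabled',
    '[31] textbox "Search" state=empty',
  ].join("\n");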

Local inference becomes practical.


Vision still matters, but not by default

Vision models are useful, but expensive.

The approach here is:

  1. Try structure-first snapshots
  2. Retry using assertions when confidence is low
  3. Escalate to vision only when necessary

This keeps costs low and behavior predictable.
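
A sketch of that escalation ladder; the threshold and function names are assumptions, not SDK defaults:

  // Structure first, assertion-guided retry second, vision last.
  declare function structureStep(): Promise<{ ok: boolean; confidence: number }>;
  declare function retryWithAssertions(): Promise<{ ok: boolean }>;
  declare function visionStep(): Promise<{ ok: boolean }>;

  async function runStep(): Promise<boolean> {
    const first = await structureStep();
    if (first.ok && first.confidence >= 0.8) return true; // 1. structure-first
    if ((await retryWithAssertions()).ok) return true;    // 2. verify, then retry
    return (await visionStep()).ok;                       // 3. escalate to vision
  }

In the demos above, the ladder never reached step 3.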


Try it yourself

All demos are fully reproducible.

Run These Demos Locally

Clone the playground, run with Qwen 2.5 3B, and see structure-first agents in action.


Demo SPA: localllamaland.com

You can run these locally using:

  • Qwen 2.5 3B
  • Playwright or CDP
  • no vision models required

Takeaway

Browser agents do not need bigger models.

They need better structure and verification.

The Bottom Line

When agents operate on semantic geometry with explicit assertions, small local models become sufficient.