Engineering
January 16, 2026 · 7 min read

Running Real Browser Agents on a 3B Local Model (No Vision Required)

How structure-first snapshots and assertions cut browser agent token usage by ~50% and make small local models viable.

The problem with browser agents today

Most browser agents today reason from screenshots.

Each step typically involves:

  • capturing pixels
  • sending images to a vision model
  • asking the model to infer structure, order, and state

This works, but it is expensive and fragile.

In practice, this leads to:

  • High token usage per step
  • Dependence on large cloud models
  • Retries instead of verification
  • Flaky behavior on modern SPAs

The common assumption is that reliable browser agents require large vision models.

That assumption turns out to be wrong.


A different approach: structure first, vision last

Instead of reasoning from pixels, we treat the rendered DOM as a semantic data structure.

Each snapshot extracts:

  • Semantic roles — button, link, textbox
  • Geometry — bounding boxes, viewport order
  • Grouping — lists, cards, dominant groups
  • Ordinality — first, top, main
  • State — enabled, checked, expanded

The agent reasons over structure, not images.

Vision is optional and only used as a fallback when structure is insufficient.
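
To make the shape of a snapshot concrete, here is a rough sketch in TypeScript. The field names are ours for illustration, not the SDK's actual schema:

  // Hypothetical shape of one snapshot element (illustrative only,
  // not the Sentience SDK's real schema).
  interface SnapshotElement {
    role: string;                     // semantic role: "button", "link", "textbox"
    name: string;                     // accessible name / visible label
    box: { x: number; y: number; w: number; h: number }; // geometry
    groupId: string | null;           // grouping: list, card, dominant group
    rankInGroup: number;              // ordinality: 1 = first in its group
    state: {                          // explicit UI state
      enabled: boolean;
      checked?: boolean;
      expanded?: boolean;
    };
  }

  // A snapshot is an ordered array of these elements.
  type Snapshot = SnapshotElement[];

A page distilled this way is small: Demo 1 below fits ~50 such elements into roughly 1,600 tokens per step.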

Jest-Style Assertions

On top of structure-first snapshots, we add Jest-style assertions so agents verify outcomes instead of guessing and retrying.

The result is lower token usage, deterministic behavior, and viability for small local models.
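
In spirit, an agent step reads like a test, not a guess. Here is a minimal self-contained sketch with a from-scratch helper; none of these names are the SDK's real API:

  // Tiny Jest-style assertion helper, written from scratch for this post.
  function expectThat(cond: boolean, label: string): void {
    if (!cond) throw new Error(`assertion failed: ${label}`);
  }

  // Verify outcomes against the snapshot instead of retrying blindly.
  declare const snapshot: Snapshot; // Snapshot as sketched above

  const submit = snapshot.find(
    (e) => e.role === "button" && e.name === "Submit"
  );
  expectThat(submit !== undefined, "element exists");
  expectThat(submit!.state.enabled, "element is enabled");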

Below are three concrete demos.


Demo 1: Deterministic list reasoning (Hacker News)

Task: Open the top "Show HN" post.

This task often breaks vision-based agents, because visual rank does not equal semantic rank: what appears first on screen is not always the first item in the underlying list.

Configuration

  • Model: Qwen 2.5 3B (local)
  • Vision: disabled
  • Snapshot size: ~50 semantic elements
  • Tokens per step: ~1,600

Assertions

  • element exists
  • element is first in dominant group
  • navigation succeeded

Result: PASS. Deterministic, zero retries.

The agent did not inspect pixels. It relied on explicit ordinality and grouping metadata.
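
Expressed against the hypothetical snapshot shape from earlier, the demo's checks come down to a few explicit comparisons. A sketch, not the SDK's real API:

  // Demo 1 sketch, reusing SnapshotElement/Snapshot/expectThat from above.
  declare const snapshot: Snapshot;
  declare const agent: { click(el: SnapshotElement): Promise<void> }; // assumed API
  declare const page: { url: string };

  const startUrl = page.url;
  const showHn = snapshot.find(
    (e) => e.role === "link" && e.name.startsWith("Show HN")
  );

  expectThat(showHn !== undefined, "element exists");
  expectThat(showHn!.rankInGroup === 1, "element is first in dominant group");

  await agent.click(showHn!);
  expectThat(page.url !== startUrl, "navigation succeeded");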

Reproduce this demo with Qwen 2.5 3B and the Sentience SDK: View code on GitHub →


Demo 2: SPA login and state verification (Local Llama Land)

Modern SPAs are hostile to automation:

  • delayed hydration
  • disabled buttons that later become enabled
  • async profile loading
  • validation-gated flows

Vision agents usually solve this with sleeps and retries.

We do not.

Configuration

  • Site: Local Llama Land (Next.js SPA)
  • Model: Qwen 2.5 3B (local)
  • Vision: disabled

Assertions

  • button is disabled
  • button eventually becomes enabled
  • profile text appears after navigation

Result: PASS. No sleeps. No magic waits.

The agent waited on UI state, not time.
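
What "waited on UI state" means in practice: poll fresh snapshots until the assertion holds or a deadline passes. A minimal sketch, assuming a takeSnapshot function that the real SDK would expose in some form:

  // Re-snapshot until a state predicate holds, instead of sleeping.
  declare function takeSnapshot(): Promise<Snapshot>;

  async function eventually(
    pred: (snap: Snapshot) => boolean,
    label: string,
    timeoutMs = 10_000,
    intervalMs = 250
  ): Promise<void> {
    const deadline = Date.now() + timeoutMs;
    while (Date.now() < deadline) {
      if (pred(await takeSnapshot())) return; // state reached: done
      await new Promise((r) => setTimeout(r, intervalMs));
    }
    throw new Error(`eventually failed: ${label}`);
  }

  // "button eventually becomes enabled"
  await eventually(
    (snap) => snap.some((e) => e.role === "button" && e.state.enabled),
    "login button becomes enabled"
  );

The key property is that the deadline is a failure bound, not a pacing mechanism: the step completes the moment the state is observed.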

Reproduce this demo with Qwen 2.5 3B and the Sentience SDK: View code on GitHub →


Demo 3: Amazon shopping flow (stress test)

Amazon is a difficult real-world site:

  • JS-heavy
  • noisy DOM
  • frequent layout changes

We ran a shopping flow: search, open result, add to cart.

Configuration

  • Model: Qwen 2.5 3B (local)
  • Vision: disabled (fallback available but not used)
  • Total tokens: ~5,500

Assertions

  • navigation succeeded
  • button enabled
  • cart confirmation visible

Result: PASS.

Under the same conditions, vision-only agents failed repeatedly.
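
The flow itself is just act/assert pairs, where each assertion gates the next step. A hedged sketch; perform and verify are invented names for whatever the SDK actually exposes:

  // Demo 3 sketch: each action must pass its assertion before the next runs.
  declare const agent: {
    perform(instruction: string): Promise<void>;
    verify(assertion: string): Promise<void>; // throws if the check fails
  };

  const steps = [
    { act: "search for the target product", check: "navigation succeeded" },
    { act: "open the first matching result", check: "button enabled" },
    { act: 'click "Add to Cart"', check: "cart confirmation visible" },
  ];

  for (const step of steps) {
    await agent.perform(step.act);  // the model picks the concrete element
    await agent.verify(step.check); // a failed assertion stops the flow here
  }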

Reproduce this demo with Qwen 2.5 3B and the Sentience SDK: View code on GitHub →


Token usage comparison

Vision-Based Agents

  • 3,000+ tokens per step
  • screenshots on every action
  • inference from pixels

Structure-First Agents

  • ~1,500 tokens per step
  • no screenshots in prompt
  • explicit semantic data

This represents roughly a 50% reduction in token usage while improving reliability.

Key Insight

This is not about parsing HTML better. It is about removing unnecessary reasoning work from the model.


Why small local models work

Small models struggle with:

  • Inferring structure from pixels
  • Ambiguous ordering
  • Unclear UI state

They do not struggle with:

  • Explicit structure
  • Clear ordinality
  • Boolean state checks

By moving perception and verification outside the model, the model's role becomes simpler and more reliable.
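
To see why, consider roughly what the model receives per element in a structure-first prompt (the format here is invented for illustration): every judgment it needs is already an explicit field, so "reasoning" reduces to reading and comparing values.

  // Hypothetical one-line-per-element serialization handed to the model.
  // Role, order, and state are explicit fields, not inferences from pixels.
  const promptLines: string = [
    '[12] link "Show HN: My project" group=stories rank=1',
    '[27] button "Add to Cart" state=enabled',
    '[31] textbox "Search" state=empty',
  ].join("\n");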

Local inference becomes practical.


Vision still matters, but not by default

Vision models are useful, but expensive.

The approach here is:

  1. Try structure-first snapshots
  2. Retry using assertions when confidence is low
  3. Escalate to vision only when necessary

This keeps costs low and behavior predictable.
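
A sketch of that escalation ladder; the threshold and function names are assumptions, not SDK defaults:

  // Structure first, assertion-guided retry second, vision last.
  declare function structureStep(): Promise<{ ok: boolean; confidence: number }>;
  declare function retryWithAssertions(): Promise<{ ok: boolean }>;
  declare function visionStep(): Promise<{ ok: boolean }>;

  async function runStep(): Promise<boolean> {
    const first = await structureStep();
    if (first.ok && first.confidence >= 0.8) return true; // 1. structure-first
    if ((await retryWithAssertions()).ok) return true;    // 2. verify, then retry
    return (await visionStep()).ok;                       // 3. escalate to vision
  }

In the demos above, the ladder never reached step 3.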


Try it yourself

All demos are fully reproducible.

Run These Demos Locally

Clone the playground, run with Qwen 2.5 3B, and see structure-first agents in action.


Demo SPA: localllamaland.com

You can run these locally using:

  • Qwen 2.5 3B
  • Playwright or CDP
  • no vision models required

Takeaway

Browser agents do not need bigger models.

They need better structure and verification.

The Bottom Line

When agents operate on semantic geometry with explicit assertions, small local models become sufficient.