Running Real Browser Agents on a 3B Local Model (No Vision Required)
How structure-first snapshots and assertions cut browser agent token usage by ~50% and make small local models viable.
The problem with browser agents today
Most browser agents today reason from screenshots.
Each step typically involves:
- capturing pixels
- sending images to a vision model
- asking the model to infer structure, order, and state
This works, but it is expensive and fragile.
In practice, this leads to:
- High token usage per step
- Dependence on large cloud models
- Retries instead of verification
- Flaky behavior on modern SPAs
The common assumption is that reliable browser agents require large vision models.
That assumption turns out to be wrong.
A different approach: structure first, vision last
Instead of reasoning from pixels, we treat the rendered DOM as a semantic data structure.
Each snapshot extracts:
- Semantic roles — button, link, textbox
- Geometry — bounding boxes, viewport order
- Grouping — lists, cards, dominant groups
- Ordinality — first, top, main
- State — enabled, checked, expanded
The agent reasons over structure, not images.
Vision is optional and only used as a fallback when structure is insufficient.
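To make this concrete, here is a minimal sketch of what one snapshot element might look like. The field names are illustrative, not the actual Sentience schema; the point is that roles, geometry, grouping, ordinality, and state arrive as explicit data the model can read directly.

```ts
// Illustrative shape of one snapshot element.
// Field names are hypothetical, not the real SDK schema.
interface SnapshotElement {
  id: string;        // stable handle the agent can act on
  role: string;      // semantic role, e.g. "button", "link", "textbox"
  name: string;      // accessible name / visible label
  box: { x: number; y: number; width: number; height: number }; // geometry
  group?: string;    // grouping, e.g. "story-list" or "card:3"
  ordinal?: number;  // position within its group (1 = first)
  state: { enabled: boolean; checked?: boolean; expanded?: boolean };
}

// What the model receives per step is a compact list of such records, not pixels.
const example: SnapshotElement = {
  id: "e17",
  role: "link",
  name: "Show HN: ...",
  box: { x: 52, y: 212, width: 480, height: 18 },
  group: "dominant-list",
  ordinal: 1,
  state: { enabled: true },
};
```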
Jest-Style Assertions
On top of structure-first snapshots, we add Jest-style assertions so agents verify outcomes instead of guessing and retrying.
The result is lower token usage, deterministic behavior, and viability for small local models.
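As a rough illustration of the style, here is the same idea expressed with Playwright Test's built-in Jest-style `expect` (the Sentience assertion API itself may differ; the URL and labels are placeholders):

```ts
// A minimal sketch of outcome verification instead of guess-and-retry.
import { test, expect } from "@playwright/test";

test("verify instead of retrying", async ({ page }) => {
  await page.goto("https://example.com/login");

  const submit = page.getByRole("button", { name: "Sign in" });

  // Assert a concrete UI state rather than sleeping and hoping.
  await expect(submit).toBeEnabled();

  await submit.click();

  // Verify the outcome of the action before moving to the next step.
  await expect(page).toHaveURL(/dashboard/);
});
```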
Below are three concrete demos.
Demo 1: Deterministic list reasoning (Hacker News)
Task: Open the top "Show HN" post.
This task often breaks vision-based agents because visual rank does not equal semantic rank.
Configuration
- Model: Qwen 2.5 3B (local)
- Vision: disabled
- Snapshot size: ~50 semantic elements
- Tokens per step: ~1,600
Assertions
- element exists
- element is first in dominant group
- navigation succeeded
Result: PASS. Deterministic, zero retries.
The agent did not inspect pixels. It relied on explicit ordinality and grouping metadata.
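A minimal sketch of those three checks over a structure-first snapshot, assuming an illustrative element shape and group name rather than the real SDK schema:

```ts
// Sketch of the Demo 1 checks. The element shape, group name, and snapshot
// contents are illustrative, not what the SDK actually emits.
type Elem = { id: string; role: string; name: string; group?: string; ordinal?: number };

// The snapshot as the agent would see it (truncated to two entries).
const snapshot: Elem[] = [
  { id: "e3", role: "link", name: "Show HN: A tiny local agent", group: "story-list", ordinal: 1 },
  { id: "e9", role: "link", name: "Show HN: Another project", group: "story-list", ordinal: 2 },
];

// The agent's chosen target: a link whose name starts with "Show HN".
const target = snapshot.find((e) => e.role === "link" && e.name.startsWith("Show HN"));

// Assertion 1: the element exists.
if (!target) throw new Error("no Show HN link in the snapshot");

// Assertion 2: it is first in the dominant group (explicit ordinality, no pixel reasoning).
if (target.group !== "story-list" || target.ordinal !== 1)
  throw new Error("chosen link is not first in the dominant group");

// Assertion 3, navigation succeeded, runs after the click: the next snapshot's
// URL and title are checked rather than assuming the click worked.
```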
Reproduce this demo with Qwen 2.5 3B and the Sentience SDK: View code on GitHub →
Demo 2: SPA login and state verification (Local Llama Land)
Modern SPAs are hostile to automation:
- delayed hydration
- disabled buttons that later become enabled
- async profile loading
- validation-gated flows
Vision agents usually solve this with sleeps and retries.
We do not.
Configuration
- Site: Local Llama Land (Next.js SPA)
- Model: Qwen 2.5 3B (local)
- Vision: disabled
Assertions
- button is disabled
- button eventually becomes enabled
- profile text appears after navigation
Result: PASS. No sleeps. No magic waits.
The agent waited on UI state, not time.
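A sketch of the same state-gated flow written with Playwright Test's `expect`, which polls until the asserted state holds. The URL, labels, and credentials are placeholders for the demo SPA; the real run drives these steps through the agent.

```ts
// Sketch of the Demo 2 checks: wait on UI state, not on time.
import { test, expect } from "@playwright/test";

test("SPA login gated by state, not sleeps", async ({ page }) => {
  await page.goto("https://localllamaland.com/login");

  const login = page.getByRole("button", { name: "Log in" });

  // Before hydration and validation, the button is disabled.
  await expect(login).toBeDisabled();

  await page.getByRole("textbox", { name: "Email" }).fill("demo@example.com");
  await page.getByRole("textbox", { name: "Password" }).fill("hunter2");

  // The same button eventually becomes enabled; expect() polls until it does.
  await expect(login).toBeEnabled();
  await login.click();

  // Profile text appears only after async navigation and data loading.
  await expect(page.getByText("Welcome back")).toBeVisible();
});
```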
Reproduce this demo with Qwen 2.5 3B and the Sentience SDK: View code on GitHub →
Demo 3: Amazon shopping flow (stress test)
Amazon is a difficult real-world site:
- JS-heavy
- noisy DOM
- frequent layout changes
We ran a shopping flow: search, open result, add to cart.
Configuration
- Model: Qwen 2.5 3B (local)
- Vision: disabled (fallback available but not used)
- Total tokens: ~5,500
Assertions
- navigation succeeded
- button enabled
- cart confirmation visible
Result: PASS.
Under the same conditions, vision-only agents failed repeatedly.
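Sketched as a step-by-step flow with an assertion gate after every action. The selectors, names, and URL patterns are illustrative; in the real run the agent chooses each step from the snapshot rather than following a hard-coded script.

```ts
// Sketch of the Amazon flow with an assertion after each step.
import { test, expect } from "@playwright/test";

test("search, open result, add to cart", async ({ page }) => {
  // Step 1: search.
  await page.goto("https://www.amazon.com/");
  await page.getByRole("textbox", { name: /search/i }).fill("usb c cable");
  await page.keyboard.press("Enter");
  await expect(page).toHaveURL(/s\?k=/); // navigation succeeded

  // Step 2: open the first matching result.
  await page.getByRole("link", { name: /usb c cable/i }).first().click();

  // Step 3: add to cart only once the button is actually enabled.
  const addToCart = page.getByRole("button", { name: /add to cart/i });
  await expect(addToCart).toBeEnabled(); // button enabled
  await addToCart.click();

  // Step 4: verify the outcome instead of assuming it.
  await expect(page.getByText(/added to cart/i)).toBeVisible(); // cart confirmation
});
```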
Reproduce this demo with Qwen 2.5 3B and the Sentience SDK: View code on GitHub →
Token usage comparison
Vision-Based Agents
- ~3,000+ tokens per step
- screenshots on every action
- inference from pixels
Structure-First Agents
- ~1,500 tokens per step
- no screenshots in prompt
- explicit semantic data
This represents roughly a 50% reduction in token usage while improving reliability.
Key Insight
This is not about parsing HTML better. It is about removing unnecessary reasoning work from the model.
Why small local models work
Small models struggle with:
- Inferring structure from pixels
- Ambiguous ordering
- Unclear UI state
They do not struggle with:
- Explicit structure
- Clear ordinality
- Boolean state checks
By moving perception and verification outside the model, the model's role becomes simpler and more reliable.
Local inference becomes practical.
Vision still matters, but not by default
Vision models are useful, but expensive.
The approach here is:
- Try structure-first snapshots
- Retry using assertions when confidence is low
- Escalate to vision only when necessary
This keeps costs low and behavior predictable.
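A rough sketch of that escalation ladder. The function names, result shape, and confidence threshold are hypothetical, not the SDK API; the structure is what matters: structural action first, assertion-based verification and one structural retry, vision only as a last resort.

```ts
// Hypothetical escalation ladder for a single agent step.
type StepResult = { ok: boolean; confidence: number };

async function runStep(
  actFromStructure: () => Promise<StepResult>, // cheap path: structure-first snapshot
  verify: () => Promise<boolean>,              // assertion on the resulting UI state
  actWithVision: () => Promise<StepResult>,    // expensive fallback
  minConfidence = 0.6,
): Promise<boolean> {
  // 1. Try the cheap path: act from the structure-first snapshot.
  const result = await actFromStructure();

  // High confidence: just verify the outcome and move on.
  if (result.ok && result.confidence >= minConfidence) {
    return verify();
  }

  // 2. Low confidence: check whether the goal state already holds,
  //    otherwise retry the structural path once.
  if (await verify()) return true;
  const retry = await actFromStructure();
  if (retry.ok && (await verify())) return true;

  // 3. Only now escalate to vision, and still verify the outcome.
  const visual = await actWithVision();
  return visual.ok && (await verify());
}
```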
Try it yourself
All demos are fully reproducible.
Run These Demos Locally
Clone the playground, run with Qwen 2.5 3B, and see structure-first agents in action.
Demo SPA: localllamaland.com
You can run these locally with:
- Qwen 2.5 3B
- Playwright or CDP
- no vision model required
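One possible local setup, assuming Qwen 2.5 3B is served through Ollama's OpenAI-compatible endpoint (for example `ollama pull qwen2.5:3b`); any OpenAI-compatible local server would work the same way, and the prompt below is only a placeholder for what the agent actually sends.

```ts
// Hypothetical local-model wiring via an OpenAI-compatible endpoint.
import OpenAI from "openai";

const llm = new OpenAI({
  baseURL: "http://localhost:11434/v1", // Ollama's OpenAI-compatible API
  apiKey: "ollama",                     // placeholder; the local server ignores it
});

async function main() {
  const completion = await llm.chat.completions.create({
    model: "qwen2.5:3b",
    messages: [
      { role: "system", content: "Pick one action per step from the semantic snapshot you are given." },
      { role: "user", content: "Snapshot: [...]. Task: open the top Show HN post." },
    ],
  });
  console.log(completion.choices[0].message.content);
}

main();
```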
Takeaway
Browser agents do not need bigger models.
They need better structure and verification.
The Bottom Line
When agents operate on semantic geometry with explicit assertions, small local models become sufficient.