How to Run Web Agents with a Local LLM (3B)
Run web agents on 3B–14B local models by replacing screenshots + raw DOM with structure-first snapshots — cutting token usage by ~50% per step.
Running web agents locally used to feel unrealistic.
Most agents today:
- rely on vision models
- send screenshots every step
- dump large DOMs into the prompt
- assume cloud-scale inference
That works — but it’s expensive, slow, and hard to trust.
At Sentience, we took a different approach:
remove pixels, reduce DOM noise, and verify outcomes structurally.
The result:
- agents that run on 3B–14B local models
- with ~50% lower token usage per step
- while still completing real browser tasks correctly
This post shows how.
The core problem: pixels and raw DOMs are expensive
A typical vision-first browser agent does something like:
- Take a screenshot
- Serialize a large DOM
- Ask the model “what do you see?”
- Guess what to click next
That means:
- thousands of tokens per step
- vision tokens every iteration
- lots of irrelevant UI noise
- retries that silently burn cost
This is why small local models struggle:
they’re overwhelmed by perception, not reasoning.
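To make the contrast concrete, here is a minimal sketch of that loop in Python. The helper functions are placeholders for illustration only, not any real agent API.

```python
# Minimal sketch of the vision-first loop described above.
# The helpers are placeholders for illustration, not a real agent API.

def take_screenshot(page) -> bytes:
    return b"..."                              # vision payload, sent every step

def serialize_dom(page, limit: int = 40_000) -> str:
    return "<html>...</html>"[:limit]          # raw DOM dump, mostly noise

def ask_model(screenshot: bytes, dom: str, goal: str) -> str:
    return "click #submit"                     # the model guesses the next action

def naive_step(page, goal: str) -> str:
    screenshot = take_screenshot(page)         # vision tokens on every iteration
    dom = serialize_dom(page)                  # thousands of raw-DOM text tokens
    return ask_model(screenshot, dom, goal)    # one expensive call per click
```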
The alternative: structure-first snapshots
Instead of sending pixels or raw HTML, Sentience snapshots the rendered DOM after hydration and formats it for LLM reasoning.
Example (Hacker News “Show HN”):
| id | role | name | importance | ordinal | doc_y | dominant_group |
|---|---|---|---|---|---|---|
| 49 | link | Show HN: 15 Years of StarCraft II… | 173 | 0 | 15 | 1 |
| 454 | link | Show HN: InfiniteGPU… | 192 | 3 | 230 | 1 |
| 550 | link | Show HN: ElixirBrowser… | 189 | 4 | 282 | 1 |
What the agent sees:
- only interactive elements
- ordered by importance and position
- grouped into dominant repeated structures (feeds)
- no screenshots
- no full DOM dump
This is semantic geometry, not scraping.
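To make the shape of a snapshot concrete, here is a rough Python model of a single row. The field names mirror the table above, but the class itself is an illustration, not the SDK's actual schema.

```python
from dataclasses import dataclass

@dataclass
class SnapshotElement:
    """One row of a structure-first snapshot (illustrative, not the SDK's real schema)."""
    id: int              # stable element id the agent can act on
    role: str            # ARIA-style role, e.g. "link" or "button"
    name: str            # accessible name / visible text
    importance: int      # ranking score used to order elements
    ordinal: int         # position within its group (0 = first)
    doc_y: int           # vertical position in the rendered document
    dominant_group: int  # id of the repeated structure (feed) it belongs to

# The first "Show HN" row from the table above, expressed as data:
top_post = SnapshotElement(
    id=49, role="link", name="Show HN: 15 Years of StarCraft II…",
    importance=173, ordinal=0, doc_y=15, dominant_group=1,
)
```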
Token usage: before vs after
Before: Vision + full DOM
- DOM limit: 40,000 chars
- DOM state: ~10,195 chars (~2,548 tokens)
- Screenshots sent to model
- ~3,166 tokens per step
Total tokens (task): 37,051
Total cost: $0.0096
After: Sentience SDK (no vision, ranked DOM)
- DOM limit: 5,000 chars
- DOM state: 5,000 chars (~1,250 tokens)
- Vision disabled (0 screenshots)
- ~1,604 tokens per step
Total tokens (task): 14,143
Total cost: $0.0043
- ~50% reduction in tokens per step
- ~55% reduction in total cost
- Same task completed successfully
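The chars-to-tokens figures above follow the common rule of thumb of roughly 4 characters per token. A quick back-of-the-envelope check of the numbers:

```python
def approx_tokens(chars: int) -> int:
    """Rough estimate using the common ~4 characters per token heuristic."""
    return chars // 4

print(approx_tokens(10_195))  # ~2548 tokens for the raw DOM state
print(approx_tokens(5_000))   # ~1250 tokens for the ranked snapshot

per_step_before, per_step_after = 3_166, 1_604
print(1 - per_step_after / per_step_before)   # ~0.49, i.e. ~50% fewer tokens per step

cost_before, cost_after = 0.0096, 0.0043
print(1 - cost_after / cost_before)           # ~0.55, i.e. ~55% lower total cost
```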
What the agent actually did
With Sentience integrated into browser-use, the agent:
- received 50 ranked semantic elements (1,557 chars)
- reasoned over structure, not pixels
- identified the top Show HN post
- completed the task in fewer steps
Log excerpt:
```
🧠 Sentience: Injected 50 semantic elements (1557 chars)
📊 DOM state truncated to 5000 chars (~1250 tokens)
✅ Vision DISABLED
▶️ done: The number 1 post on Show HN is:
“Show HN: 15 Years of StarCraft II Balance Changes…”
```
No screenshots.
No guessing.
No retries.
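As an illustration of what "injected semantic elements" could look like on the prompt side, here is a sketch that renders ranked rows into a compact text block under a character budget, reusing the SnapshotElement sketch from earlier. The real integration lives in the browser-use fork linked at the end of this post.

```python
def render_snapshot(elements: list[SnapshotElement], char_budget: int = 5_000) -> str:
    """Render ranked elements as compact prompt lines, stopping at a character budget.
    Illustrative only; see the linked browser-use fork for the actual integration."""
    lines: list[str] = []
    used = 0
    for el in sorted(elements, key=lambda e: e.importance, reverse=True):
        line = f"[{el.id}] {el.role} '{el.name}' (group {el.dominant_group}, y={el.doc_y})"
        if used + len(line) + 1 > char_budget:
            break                              # stay inside the 5,000-char DOM limit
        lines.append(line)
        used += len(line) + 1
    return "\n".join(lines)

prompt_block = render_snapshot([top_post])     # e.g. "[49] link 'Show HN: …' (group 1, y=15)"
```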
Why this enables local models
Small models aren’t bad at reasoning.
They’re bad at filtering noise.
By removing:
- irrelevant DOM nodes
- layout boilerplate
- pixel-level perception
…we reduce the reasoning load to something a 3B model can handle.
We’ve validated multi-step browser tasks using Qwen 2.5 3B locally with this approach:
- fewer tokens
- predictable behavior
- deterministic completion
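For reference, here is a minimal sketch of how such a run could be wired up, assuming Qwen 2.5 3B is served locally via Ollama and that your browser-use version accepts a LangChain chat model and a use_vision flag; exact parameter names and the Sentience hook depend on the fork linked below.

```python
import asyncio

from browser_use import Agent              # assumes the browser-use fork linked below
from langchain_ollama import ChatOllama    # assumes Ollama is serving qwen2.5:3b locally

async def main() -> None:
    llm = ChatOllama(model="qwen2.5:3b")   # 3B local model, no cloud inference
    agent = Agent(
        task="Find the number 1 post on Show HN",
        llm=llm,
        use_vision=False,                  # structure-first: no screenshots sent
    )
    await agent.run()

if __name__ == "__main__":
    asyncio.run(main())
```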
Vision is optional — and used last
Vision isn’t banned.
But in Sentience:
- vision is disabled by default
- structure is tried first
- vision can be used only after snapshot exhaustion
- assertions stay the same
This keeps costs low without sacrificing correctness.
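A policy like that can be stated in a few lines. The sketch below uses plain dicts whose keys mirror the snapshot columns, and the function names are illustrative rather than the SDK's real API.

```python
def matches_goal(element: dict, goal: str) -> bool:
    """Toy matcher: does the element's accessible name mention the goal keyword?"""
    return goal.lower() in element["name"].lower()

def choose_action(snapshot_elements: list[dict], goal: str, vision_fallback=None) -> dict:
    """Structure first, vision last. Keys mirror the snapshot columns (id, role,
    name, importance, ...). Illustrative sketch, not the SDK's real API."""
    candidates = [el for el in snapshot_elements if matches_goal(el, goal)]
    if candidates:
        best = max(candidates, key=lambda el: el["importance"])
        return {"action": "click", "element_id": best["id"]}
    if vision_fallback is not None:            # only after the snapshot is exhausted
        return vision_fallback(goal)           # e.g. a screenshot-based locator
    return {"action": "fail", "reason": "no structural candidate"}
```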
The real takeaway
Token efficiency isn’t just about cost.
It’s about making agents practical:
- local execution
- privacy-friendly
- predictable CI runs
- smaller models that actually work
The biggest gains didn’t come from better prompts.
They came from changing what the model sees.
Try it yourself
- browser-use + Sentience integration: github.com/SentienceAPI/browser-use/pull/1
- multi-step tasks on Qwen 2.5 3B: github.com/SentienceAPI/browser-use/pull/6
If you care about running agents locally, this is the direction.
One-line summary
To run agents locally, stop sending pixels.
Send structure, verify outcomes, and let small models reason.
Want to run agents locally?
Start with the SDK quickstart and see how structure-first snapshots change what the model sees.
Read the SDK Quickstart