Engineering
January 15, 2026 · 6 min read

How to Run Web Agents with a Local LLM (3B)

Run web agents on 3B–14B local models by replacing screenshots + raw DOM with structure-first snapshots — cutting token usage by ~50% per step.

Running web agents locally used to feel unrealistic.

Most agents today:

  • rely on vision models
  • send screenshots every step
  • dump large DOMs into the prompt
  • assume cloud-scale inference

That works — but it’s expensive, slow, and hard to trust.

At Sentience, we took a different approach:
remove pixels, reduce DOM noise, and verify outcomes structurally.

The result:

  • agents that run on 3B–14B local models
  • with ~50% lower token usage per step
  • while still completing real browser tasks correctly

This post shows how.


The core problem: pixels and raw DOMs are expensive

A typical vision-first browser agent does something like:

  1. Take a screenshot
  2. Serialize a large DOM
  3. Ask the model “what do you see?”
  4. Guess what to click next

That means:

  • thousands of tokens per step
  • vision tokens every iteration
  • lots of irrelevant UI noise
  • retries that silently burn cost

This is why small local models struggle:
they’re overwhelmed by perception, not reasoning.


The alternative: structure-first snapshots

Instead of sending pixels or raw HTML, Sentience snapshots the rendered DOM after hydration and formats it for LLM reasoning.

Example (Hacker News “Show HN”):

id    role   name                                 importance   doc_y   ordinal   dominant_group
49    link   Show HN: 15 Years of StarCraft II…   17           30      15        1
454   link   Show HN: InfiniteGPU…                19           232     30        1
550   link   Show HN: ElixirBrowser…              18           942     82        1

What the agent sees:

  • only interactive elements
  • ordered by importance and position
  • grouped into dominant repeated structures (feeds)
  • no screenshots
  • no full DOM dump

This is semantic geometry, not scraping.
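
For illustration, here is a minimal sketch of one snapshot element as a Python data structure. The field names come from the table above; the exact types and field set in the Sentience SDK are assumptions, not its real API.

from dataclasses import dataclass

@dataclass
class SnapshotElement:
    """One interactive element from a structure-first snapshot.

    Field names mirror the example table; the real SDK types may differ.
    """
    id: int               # stable element id within the snapshot
    role: str             # semantic role, e.g. "link" or "button"
    name: str             # accessible name / visible text
    importance: float     # ranking score used to order elements
    doc_y: int            # vertical position in the document, in pixels
    ordinal: int          # position within its repeated group
    dominant_group: int   # id of the dominant repeated structure (the feed)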


Token usage: before vs after

Before: Vision + full DOM

  • DOM limit: 40,000 chars
  • DOM state: ~10,195 chars (~2,548 tokens)
  • Screenshots sent to model
  • ~3,166 tokens per step

Total tokens (task): 37,051
Total cost: $0.0096


After: Sentience SDK (no vision, ranked DOM)

  • DOM limit: 5,000 chars
  • DOM state: 5,000 chars (~1,250 tokens)
  • Vision disabled (0 screenshots)
  • ~1,604 tokens per step

Total tokens (task): 14,143
Total cost: $0.0043

  • ~50% reduction in tokens per step
  • ~55% reduction in total cost
  • Same task completed successfully
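
The token estimates above follow from the common ~4 characters per token heuristic (5,000 chars ≈ 1,250 tokens; 10,195 chars ≈ 2,548 tokens), and the reductions are simple ratios of the measured numbers. A quick sketch of the arithmetic:

# Rough arithmetic behind the before/after comparison,
# assuming the common ~4 characters per token heuristic.

def chars_to_tokens(chars: int) -> int:
    return chars // 4

print(chars_to_tokens(10_195), chars_to_tokens(5_000))    # 2548, 1250 DOM-state tokens

before_step, after_step = 3_166, 1_604                    # tokens per step (measured)
before_cost, after_cost = 0.0096, 0.0043                  # total task cost in USD

print(f"per-step token reduction: {1 - after_step / before_step:.0%}")   # 49%
print(f"total cost reduction:     {1 - after_cost / before_cost:.0%}")   # 55%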

What the agent actually did

With Sentience integrated into browser-use, the agent:

  • received 50 ranked semantic elements (1,557 chars)
  • reasoned over structure, not pixels
  • identified the top Show HN post
  • completed the task in fewer steps

Log excerpt:

🧠 Sentience: Injected 50 semantic elements (1557 chars)
📊 DOM state truncated to 5000 chars (~1250 tokens)
✅ Vision DISABLED
▶️ done: The number 1 post on Show HN is:
   “Show HN: 15 Years of StarCraft II Balance Changes…”

No screenshots.
No guessing.
No retries.
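
The "injected 50 semantic elements" line above is the key step: ranked elements are rendered as a few compact lines of text and capped at the 5,000-char budget before they ever reach the model. Here is a minimal sketch of that idea, with plain dicts standing in for the SDK's snapshot output; it is not the SDK's actual code.

def format_observation(elements: list[dict], char_budget: int = 5_000) -> str:
    """Render already-ranked snapshot elements as a compact observation.

    `elements` is assumed to be a list of dicts with the fields from the
    snapshot table (id, role, name, ...); only what fits in `char_budget`
    (roughly char_budget / 4 tokens) ever reaches the model.
    """
    lines = [f'[{e["id"]}] {e["role"]}: {e["name"]}' for e in elements]
    return "\n".join(lines)[:char_budget]

# The three snapshot rows from earlier become three short lines of text
elements = [
    {"id": 49, "role": "link", "name": "Show HN: 15 Years of StarCraft II…"},
    {"id": 454, "role": "link", "name": "Show HN: InfiniteGPU…"},
    {"id": 550, "role": "link", "name": "Show HN: ElixirBrowser…"},
]
print(format_observation(elements))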


Why this enables local models

Small models aren’t bad at reasoning.
They’re bad at filtering noise.

By removing:

  • irrelevant DOM nodes
  • layout boilerplate
  • pixel-level perception

…we reduce the reasoning load to something a 3B model can handle.

We’ve validated multi-step browser tasks using Qwen 2.5 3B locally with this approach:

  • fewer tokens
  • predictable behavior
  • deterministic completion
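
To make "locally" concrete, here is what a single reasoning step can look like against a small model served on your own machine. This sketch assumes an Ollama server on localhost with a qwen2.5:3b model pulled; it is illustrative, not the harness we use.

import json
import urllib.request

# One reasoning step against a local 3B model via Ollama's /api/chat endpoint.
observation = (
    "[49] link: Show HN: 15 Years of StarCraft II…\n"
    "[454] link: Show HN: InfiniteGPU…\n"
    "[550] link: Show HN: ElixirBrowser…"
)

payload = {
    "model": "qwen2.5:3b",
    "stream": False,
    "messages": [
        {"role": "system", "content": "You control a browser. Reply only with the id of the element to click."},
        {"role": "user", "content": f"Task: open the #1 Show HN post.\nElements:\n{observation}"},
    ],
}

req = urllib.request.Request(
    "http://localhost:11434/api/chat",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["message"]["content"])   # e.g. "49"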

Vision is optional — and used last

Vision isn’t banned.

But in Sentience:

  • vision is disabled by default
  • structure is tried first
  • vision can be used only after snapshot exhaustion
  • assertions stay the same

This keeps costs low without sacrificing correctness.
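
One way to express that ordering in an agent loop, with the snapshot, action, and assertion steps passed in as plain callables because the real hooks aren't shown here:

from typing import Callable, Sequence

def run_step(
    take_snapshot: Callable[[], Sequence[dict]],            # ranked semantic elements
    act_from_structure: Callable[[Sequence[dict]], bool],   # text-only reasoning path
    act_from_screenshot: Callable[[], bool],                # vision path, off by default
    assert_outcome: Callable[[], bool],                     # structural check
) -> bool:
    """Hedged sketch of the escalation policy, not SDK code."""
    elements = take_snapshot()                       # structure first, no pixels
    acted = bool(elements) and act_from_structure(elements)
    if not acted:                                    # snapshot exhausted: escalate
        acted = act_from_screenshot()                # vision only as a last resort
    return assert_outcome()                          # the assertion itself never changes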


The real takeaway

Token efficiency isn’t just about cost.

It’s about making agents practical:

  • local execution
  • privacy-friendly
  • predictable CI runs
  • smaller models that actually work

The biggest gains didn’t come from better prompts.
They came from changing what the model sees.


Try it yourself

If you care about running agents locally, this is the direction.


One-line summary

To run agents locally, stop sending pixels.
Send structure, verify outcomes, and let small models reason.

Want to run agents locally?

Start with the SDK quickstart and see how structure-first snapshots change what the model sees.

Read the SDK Quickstart