Generate massive, pixel-perfect ground truth datasets for VLM fine-tuning and Imitation Learning without manual labeling.
Training reliable web agents requires high-quality "Observation-Action" pairs. Traditional methods fail to scale:
Manual Labeling
Having humans annotate bounding boxes is slow, expensive ($2-5 per image), and prone to inconsistency.
Raw HTML Scraping
DOM trees are noisy and lack visual context. An AI can't "see" whether an element is actually visible or hidden behind a modal.
Use Visual Mode to programmatically generate millions of perfect training samples:
High-fidelity PNG screenshots (1024x768) that perfectly represent what a human sees, rendering all CSS, WebGL, and fonts correctly.
Simultaneous extraction of the DOM tree with calculated bounding boxes (bbox). This maps every pixel to a semantic element automatically.
Automatic labeling of interactive states (clickable, input, disabled). The API acts as the "labeler," guaranteeing 100% consistency (an example element record is sketched below).
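For a concrete picture, a single labeled element in such an observation might look like the record below. The exact response schema is not shown on this page, so the field names here are illustrative assumptions based on the capabilities listed above (bounding boxes plus clickable/input/disabled states), not the documented Sentience API format.

# Hypothetical shape of one labeled element (field names are assumptions,
# not the documented response schema).
example_element = {
    "tag": "button",
    "text": "Add to Cart",
    "bbox": {"x": 412, "y": 530, "width": 160, "height": 44},  # pixel coordinates in the screenshot
    "clickable": True,
    "input": False,
    "disabled": False,
}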
Architectures for synthetic data generation
Create a dataset mapping Screenshots → JSON Representations to teach models to "read" screens:
# 1. Define target sites
sites = ["amazon.com", "google.com", "reddit.com"]

for site in sites:
    # 2. Capture "Visual Mode" (Screenshot + Data)
    observation = sentience.observe(
        url=site,
        mode="visual",  # Returns screenshot + bbox
        options={"render_quality": "precision"}
    )

    # 3. Handle polymorphic screenshot format
    screenshot = observation.get("screenshot")
    if isinstance(screenshot, dict):
        if screenshot.get("type") == "url":
            image = screenshot["url"]  # Use presigned URL (recommended)
        elif screenshot.get("type") == "base64":
            image = f"data:image/png;base64,{screenshot['data']}"  # Base64 format
    else:
        image = screenshot  # Legacy: base64 string (backward compatibility)

    # 4. Save as training pair
    save_dataset_pair(
        image=image,  # X (Input)
        label=observation["interactable_elements"]  # Y (Target)
    )

✅ Outcome: A perfectly labeled dataset linking visual pixels to semantic DOM nodes.
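The loop above hands each pair to a save_dataset_pair helper that is not defined on this page. Below is a minimal sketch of such a helper, assuming the image arrives as a presigned URL, a data URI, or a legacy raw base64 string (the three cases handled above); the function body and file layout are illustrative assumptions, not part of the Sentience API.

import base64
import json
import uuid
from pathlib import Path
from urllib.request import urlretrieve

DATASET_DIR = Path("dataset")  # illustrative layout: dataset/<id>.png + dataset/<id>.json

def save_dataset_pair(image, label):
    """Persist one (screenshot, element-annotation) training pair."""
    DATASET_DIR.mkdir(parents=True, exist_ok=True)
    sample_id = uuid.uuid4().hex
    png_path = DATASET_DIR / f"{sample_id}.png"

    if image.startswith(("http://", "https://")):
        # Presigned URL: download the rendered screenshot.
        urlretrieve(image, str(png_path))
    else:
        # Data URI or legacy raw base64 string: decode and write the PNG bytes.
        payload = image.split(",", 1)[1] if image.startswith("data:") else image
        png_path.write_bytes(base64.b64decode(payload))

    # Store the bounding-box labels next to the image (the Y target).
    (DATASET_DIR / f"{sample_id}.json").write_text(json.dumps(label, indent=2))

Writing the label JSON next to each PNG keeps every (X, Y) pair self-describing and easy to shard for training.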
Use the API as the "Eyes" of your RL agent's environment loop:
class WebAgentEnv(gym.Env):
    def step(self, action):
        # 1. Execute action (click/type)
        self.driver.execute(action)

        # 2. Get new observation via API
        obs = sentience.observe(
            url=self.driver.current_url,
            mode="map",
            options={"render_quality": "performance"}  # Fast for RL loops
        )

        # 3. Calculate reward based on visual state
        reward = calculate_reward(obs)

        return obs, reward, False, {}

✅ Outcome: Agent receives structured, parsed observations instead of raw HTML.
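The step() method above relies on a calculate_reward function that is not defined on this page. Here is a minimal sketch of one possible reward, assuming the observation carries the interactable_elements list described earlier; the goal-text criterion and the penalty value are illustrative assumptions, not part of the API.

def calculate_reward(obs, goal_text="Order placed"):
    # Toy reward: +1.0 once an element containing the goal text appears on screen,
    # otherwise a small step penalty to encourage shorter trajectories.
    elements = obs.get("interactable_elements", [])
    goal_reached = any(goal_text.lower() in (el.get("text") or "").lower() for el in elements)
    return 1.0 if goal_reached else -0.01

In practice the reward is task-specific (form submitted, item added to cart, etc.); the point is that it is computed from the structured observation rather than from raw HTML.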
Manual Annotation
Throughput: ~20 images/hour
Cost: $2.00 - $5.00 per image
Accuracy: Variable (Human Error)
Sentience API Generation
Throughput: ~5,000 images/hour
Cost: $0.01 per image
Accuracy: Pixel-Perfect (>99.5%)
200x Cost Reduction
Generate a 100k-sample dataset for $1,000 instead of $200,000 (100,000 images at $0.01 vs. $2.00 per image).