
Training AI Agents & Dataset Generation

Generate massive, pixel-perfect ground truth datasets for VLM fine-tuning and Imitation Learning without manual labeling.

The Data Bottleneck

Training reliable web agents requires high-quality "Observation-Action" pairs. Traditional methods fail to scale:

Manual Labeling

Human annotation of bounding boxes is slow, expensive ($2-5 per image), and prone to inconsistency.

Raw HTML Scraping

DOM trees are noisy and lack visual context. A model parsing raw HTML can't "see" whether an element is actually visible or hidden behind a modal.

The Solution: Programmatic Ground Truth

Use Visual Mode to programmatically generate millions of perfect training samples:

1. Visual Capture

High-fidelity PNG screenshots (1024x768) that perfectly represent what a human sees, rendering all CSS, WebGL, and fonts correctly.

2. Precision Mapping

Simultaneous extraction of the DOM tree with calculated bounding boxes (bbox). This maps every pixel to a semantic element automatically.

3. State Annotation

Automatic labeling of interactive states (clickable, input, disabled). The API acts as the "labeler," guaranteeing 100% consistency.
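Putting these three steps together, a single observation bundles the screenshot with its element-level labels. The sketch below shows the rough shape of one record as a Python literal; the screenshot envelope, interactable_elements, bbox, and the interaction states come from the steps above, but the remaining field names and the bbox coordinate order are illustrative assumptions rather than the authoritative schema.

# Illustrative shape of one Visual Mode observation; field names beyond
# "screenshot", "interactable_elements", and "bbox" are assumptions.
observation = {
    "screenshot": {
        "type": "url",                      # or "base64"
        "url": "https://.../capture.png",   # presigned URL to the 1024x768 PNG
    },
    "interactable_elements": [
        {
            "tag": "button",                # assumed field: semantic DOM node
            "text": "Add to cart",          # assumed field: visible label
            "bbox": [412, 318, 180, 44],    # assumed order: x, y, width, height
            "state": "clickable",           # clickable / input / disabled
        },
        # ...one entry per interactive element on the page
    ],
}

Each record like this is the "Observation" half of an Observation-Action pair; the corresponding action labels come from your own logging or demonstration data.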

Workflow Examples

Architectures for synthetic data generation

VLM Fine-Tuning Dataset

Create a dataset mapping Screenshots → JSON Representations to teach models to "read" screens:

# 1. Define target sites
sites = ["amazon.com", "google.com", "reddit.com"]

for site in sites:
    # 2. Capture in Visual Mode (screenshot + structured data)
    observation = sentience.observe(
        url=site,
        mode="visual",  # Returns screenshot + bbox
        options={"render_quality": "precision"}
    )

    # 3. Handle polymorphic screenshot format
    screenshot = observation.get("screenshot")
    if isinstance(screenshot, dict):
        if screenshot.get("type") == "url":
            image = screenshot["url"]  # Use presigned URL (recommended)
        elif screenshot.get("type") == "base64":
            image = f"data:image/png;base64,{screenshot['data']}"  # Base64 format
    else:
        image = screenshot  # Legacy: base64 string (backward compatibility)

    # 4. Save as training pair
    save_dataset_pair(
        image=image,  # X (Input)
        label=observation["interactable_elements"]  # Y (Target)
    )

✅ Outcome: A perfectly labeled dataset linking visual pixels to semantic DOM nodes.
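save_dataset_pair above is your own storage hook, not part of the API. A minimal sketch, assuming one JSONL record per page with the screenshot reference as input (X) and the labeled elements as target (Y):

import json

def save_dataset_pair(image, label, path="vlm_dataset.jsonl"):
    """Append one (screenshot, labeled elements) training pair as a JSONL line."""
    record = {
        "image": image,   # presigned URL or base64 data URI of the screenshot
        "label": label,   # interactable_elements: bbox + state per element
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

Storing the presigned URL keeps records small; storing the base64 data URI makes the dataset self-contained at the cost of much larger files.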

RL Environment Observation

Use the API as the "eyes" of your RL agent's environment loop:

class WebAgentEnv(gym.Env):
    def step(self, action):
        # 1. Execute action (click/type)
        self.driver.execute(action)

        # 2. Get new observation via API
        obs = sentience.observe(
            url=self.driver.current_url,
            mode="map",
            options={"render_quality": "performance"}  # Fast for RL loops
        )

        # 3. Calculate reward based on visual state
        reward = calculate_reward(obs)

        return obs, reward, False, {}

✅ Outcome: Agent receives structured, parsed observations instead of raw HTML.
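calculate_reward is task-specific and not provided by the API. A minimal sketch, assuming the observation exposes interactable_elements with a text field per element, and using a hypothetical goal_text parameter to detect a success page:

def calculate_reward(obs, goal_text="Order placed"):
    """Toy reward: +1 when an element matching the goal text appears."""
    for element in obs.get("interactable_elements", []):
        if goal_text.lower() in str(element.get("text", "")).lower():
            return 1.0   # goal element found on the page
    return -0.01         # small step penalty to keep episodes short

In practice you would also flip the done flag when the reward fires and convert the observation into whatever encoding your policy expects.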

Dataset Creation ROI

                  Manual Annotation          Sentience API Generation
Throughput        ~20 images/hour            ~5,000 images/hour
Cost              $2.00 - $5.00 per image    $0.01 per image
Accuracy          Variable (human error)     Pixel-perfect (>99.5%)

200x Cost Reduction

Generate a 100k-sample dataset for $1,000 instead of $200,000