Google's Gemini 2.5 Computer Use: AI That Navigates the Web Like a Human

On October 7, 2025, Google DeepMind unveiled Gemini 2.5 Computer Use, a specialized AI model that can autonomously navigate websites, click buttons, fill in forms, and complete multi-step web tasks much as a human would. With reported scores of 88.9% on WebVoyager and 69.7% on AndroidWorld, it outperforms Anthropic’s Claude on both benchmarks and OpenAI’s computer-using agent on WebVoyager, establishing Google as a frontrunner in the race to build AI that can actually do things on your behalf. Now available in public preview via the Gemini API, the model powers Google’s experimental Project Mariner and marks a significant step toward AI agents that work autonomously across digital interfaces.

What is Gemini 2.5 Computer Use?

Beyond Chatbots: AI That Takes Action

Unlike traditional language models that simply answer questions, Gemini 2.5 Computer Use is an agentic AI model designed to:

Navigate web browsers autonomously
Understand visual interfaces (buttons, forms, menus)
Execute multi-step tasks from natural language instructions
Interact with websites like a human user

Core Capability: The model uses a combination of visual understanding and reasoning to analyze user requests and carry out tasks in the browser, completing all required actions—clicking, typing, scrolling, manipulating dropdown menus, filling out and submitting forms—just as a human can do.

Built on Gemini 2.5 Pro Foundation

Gemini 2.5 Computer Use is built on Gemini 2.5 Pro’s visual understanding and reasoning capabilities, but specialized for UI interaction:

Vision-language model: Interprets screenshots as visual input
Action prediction: Decides which UI elements to interact with
Task planning: Breaks down complex requests into sequential actions
Error recovery: Handles unexpected UI states or failures

Benchmark Dominance: Best-in-Class Performance

WebVoyager Leaderboard: 88.9% Success Rate

Official WebVoyager Results (October 2025):

Gemini 2.5 Computer Use: 88.9% (highest score)
Claude Sonnet 4.5: ~71.4% (estimated from Browserbase)
Claude Sonnet 4: 69.4%
OpenAI Computer Using Agent: 61.0%

WebVoyager Benchmark: Tests AI agents on real-world web navigation tasks across diverse websites, requiring:

Multi-step task completion
Form filling and submission
Navigation across multiple pages
Understanding of web UI conventions

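Gemini’s Advantage: The 17.5 percentage point gap between Gemini’s 88.9% and Claude Sonnet 4.5’s estimated 71.4% works out to roughly 175 more completed tasks per 1,000 attempted. Because the Claude figure is an estimate taken from Browserbase’s harness, the two numbers are not strictly like-for-like; the Browserbase results below give a same-harness comparison.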

Browserbase Harness: 79.9% Accuracy

When measured by Browserbase (an independent evaluation platform):

Gemini 2.5 Computer Use: 79.9%
Claude Sonnet 4.5: 71.4%
Claude Sonnet 4: 69.4%
OpenAI Agent: 61.0%

Gemini leads under both sets of measurements, although its margin over Claude Sonnet 4.5 narrows to 8.5 percentage points (79.9% vs. 71.4%) on the Browserbase harness.
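The size of the lead depends on which harness produced the numbers. As a quick check, the snippet below recomputes the percentage-point gaps using only the WebVoyager figures quoted in this article. The dictionary and function names are illustrative, not part of any benchmark tooling.

```python
# WebVoyager success rates (percent) as quoted in this article.
HEADLINE_GEMINI = 88.9  # the headline Gemini score
BROWSERBASE = {
    "Gemini 2.5 Computer Use": 79.9,
    "Claude Sonnet 4.5": 71.4,
    "Claude Sonnet 4": 69.4,
    "OpenAI Agent": 61.0,
}
LEADER = "Gemini 2.5 Computer Use"


def lead_over(leader_score, scores, leader_name):
    """Percentage-point lead of leader_score over every other entry."""
    return {
        name: round(leader_score - score, 1)
        for name, score in scores.items()
        if name != leader_name
    }


# Mixed comparison: headline Gemini score against the other models' figures.
mixed = lead_over(HEADLINE_GEMINI, BROWSERBASE, LEADER)
# Like-for-like comparison: every score taken from the Browserbase harness.
same_harness = lead_over(BROWSERBASE[LEADER], BROWSERBASE, LEADER)

print(mixed["Claude Sonnet 4.5"])         # 17.5
print(same_harness["Claude Sonnet 4.5"])  # 8.5
```

Either way Gemini is ahead; the point is only that the headline gap roughly halves when every model is scored by the same harness.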

AndroidWorld: 69.7% on Mobile Tasks

AndroidWorld Benchmark Results:

Gemini 2.5 Computer Use: 69.7%
Claude Sonnet 4: 62.1%
Claude Sonnet 4.5: 56.0%

Significance: Despite being “primarily optimized for web browsers,” Gemini 2.5 Computer Use still outperforms Claude on mobile UI control—demonstrating strong cross-platform generalization.

AndroidWorld Benchmark: Tests AI agents on Android app interactions:

Tapping UI elements
Scrolling through lists
Entering text in forms
Navigating between screens
Completing real-world mobile tasks

Online Mind2Web: 65.7% Real-World Web Tasks

Results:

Gemini 2.5 Computer Use: 65.7%
Claude Sonnet 4: 61.0%
OpenAI Agent: 44.3%

Mind2Web Benchmark: Focuses on complex, real-world web tasks from popular websites, requiring multi-step reasoning and precise UI interaction.

Latency Performance: Speed Meets Accuracy

Google emphasizes that Gemini 2.5 Computer Use delivers “high accuracy while maintaining low latency”:

Accuracy: Above 70% on key benchmarks
Latency: Around 225 seconds (just under four minutes) for complex tasks

Competitive Advantage: Balancing speed and accuracy is critical for practical deployment—users won’t wait 10 minutes for an agent to complete a task.

How It Works: Technical Architecture

Vision-Language Understanding

Gemini 2.5 Computer Use processes screenshots as visual input:

User provides natural language instruction (e.g., “Book a flight to Tokyo for next Friday”)
Client code captures a screenshot of the current browser state and sends it to the model
Vision model identifies UI elements (buttons, text fields, links)
Reasoning engine plans next action
Model outputs coordinates and action type (click, type, scroll)
Client code executes the action in the browser
Process repeats until task complete
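The loop described above can be sketched in a few lines. This is a minimal illustration, not Google's implementation: the Action class, the run_agent function, and the three callables passed into it are all hypothetical names, and the "model" in the demo is a scripted stand-in rather than a real API call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    kind: str        # "click", "type", "scroll", or "done" (hypothetical labels)
    x: int = 0
    y: int = 0
    text: str = ""


def run_agent(
    instruction: str,
    take_screenshot: Callable[[], bytes],
    propose_action: Callable[[str, bytes, List[Action]], Action],
    execute: Callable[[Action], None],
    max_steps: int = 20,
) -> List[Action]:
    """Screenshot, ask the model for an action, execute it; repeat until "done"."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = take_screenshot()
        action = propose_action(instruction, screenshot, history)
        if action.kind == "done":
            return history
        execute(action)
        history.append(action)
    raise RuntimeError(f"no completion within {max_steps} steps")


# Demo: a scripted stand-in for the model and a browser that only records actions.
script = iter([
    Action("click", x=120, y=80),
    Action("type", text="Tokyo"),
    Action("done"),
])
executed: List[Action] = []
trace = run_agent(
    "Book a flight to Tokyo for next Friday",
    take_screenshot=lambda: b"",  # placeholder image bytes
    propose_action=lambda instr, shot, hist: next(script),
    execute=executed.append,
)
print([a.kind for a in trace])  # ['click', 'type']
```

The step budget matters in practice: without max_steps, a model that never reports completion would loop indefinitely.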

Action Space: What the Model Can Do

Mouse Actions:

Click on specific coordinates
Right-click for context menus
Drag and drop elements

Keyboard Actions:

Type text into fields
Press special keys (Enter, Tab, Escape)
Use keyboard shortcuts

Navigation Actions:

Scroll up/down, left/right
Navigate forward/back in browser history
Open new tabs or windows

Form Interactions:

Select dropdown menu options
Check/uncheck boxes
Fill multi-field forms
Submit forms
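On the client side, an action space like this usually ends up as a dispatcher that maps each proposed action onto a browser call. The sketch below is illustrative only: the action dictionaries, the RecordingBrowser class, and the method names are hypothetical, and it assumes the model reports coordinates on a fixed 0-999 grid that the client scales to the real viewport.

```python
GRID = 1000  # assumption: model coordinates fall in 0..999 on both axes


def to_pixels(value: int, size: int) -> int:
    """Scale a grid coordinate to a pixel offset along one viewport dimension."""
    return int(value / GRID * size)


class RecordingBrowser:
    """Stand-in for a real browser driver; it only logs what it is asked to do."""

    def __init__(self, width: int, height: int):
        self.width, self.height = width, height
        self.log = []

    def click(self, x, y):
        self.log.append(("click", x, y))

    def type_text(self, text):
        self.log.append(("type", text))

    def press(self, key):
        self.log.append(("press", key))

    def scroll(self, dx, dy):
        self.log.append(("scroll", dx, dy))


def dispatch(browser: RecordingBrowser, action: dict) -> None:
    """Route one proposed action to the matching browser method."""
    kind = action["type"]
    if kind == "click":
        browser.click(to_pixels(action["x"], browser.width),
                      to_pixels(action["y"], browser.height))
    elif kind == "type":
        browser.type_text(action["text"])
    elif kind == "press":
        browser.press(action["key"])
    elif kind == "scroll":
        browser.scroll(action.get("dx", 0), action.get("dy", 0))
    else:
        raise ValueError(f"unsupported action type: {kind!r}")


browser = RecordingBrowser(width=1440, height=900)
for step in [
    {"type": "click", "x": 500, "y": 250},
    {"type": "type", "text": "Paris"},
    {"type": "press", "key": "Enter"},
    {"type": "scroll", "dy": 600},
]:
    dispatch(browser, step)
print(browser.log)
```

Rejecting unknown action types outright, rather than ignoring them, keeps the agent from silently skipping a step it was asked to perform.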

Multi-Step Task Planning

For complex requests, the model:

Decomposes the task into subtasks
Prioritizes actions based on current UI state
Adapts when encountering unexpected UI changes
Verifies completion of each subtask before proceeding

Example: “Book a hotel in Paris for next weekend”

Navigate to booking website
Enter “Paris” in destination field
Select check-in/check-out dates (next weekend)
Click “Search”
Filter results by price/rating
Click on selected hotel
Fill in guest details
Enter payment information
Confirm booking

Each step requires visual understanding, reasoning, and precise UI interaction.
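One way to picture "verify each subtask before proceeding" is a plan runner that checks the page state after every step. This is a sketch under assumptions: the Subtask class, run_plan, and the retry-on-failure policy are hypothetical, and a dictionary stands in for a real booking page.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Subtask:
    name: str
    run: Callable[[Dict], None]     # performs the UI actions for this step
    verify: Callable[[Dict], bool]  # checks the page state afterwards


def run_plan(subtasks: List[Subtask], page: Dict, max_attempts: int = 2) -> List[str]:
    """Run subtasks in order; retry a step until verify() passes or attempts run out."""
    completed = []
    for task in subtasks:
        for _ in range(max_attempts):
            task.run(page)
            if task.verify(page):
                completed.append(task.name)
                break
        else:
            # No attempt passed verification, so stop instead of pressing on.
            raise RuntimeError(f"could not verify subtask: {task.name}")
    return completed


# Toy page state standing in for a booking site.
page = {"destination": "", "results": []}
plan = [
    Subtask("enter destination",
            run=lambda p: p.update(destination="Paris"),
            verify=lambda p: p["destination"] == "Paris"),
    Subtask("search",
            run=lambda p: p.update(results=["Hotel A", "Hotel B"]),
            verify=lambda p: len(p["results"]) > 0),
]
print(run_plan(plan, page))  # ['enter destination', 'search']
```

Stopping at the first unverifiable step is the conservative choice for a flow that ends in a payment: continuing from an unknown page state risks booking the wrong thing.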

Platform Optimization and Limitations

Primarily Web Browser Focused

Google states Gemini 2.5 Computer Use is “primarily optimized for web browsers”:

✅ Chrome, Firefox, Edge, Safari support
✅ Complex web applications (forms, dashboards, e-commerce)
✅ Multi-page workflows

Strong Android Performance (Not Yet Optimized)

Despite not being specifically optimized for Android, the model achieved 69.7% on AndroidWorld—suggesting:

Strong visual understanding transfers to mobile UIs
Future Android-optimized versions will likely perform even better
Cross-platform agent capabilities are feasible

Desktop OS-Level Control: Not Yet Available

The model is “not yet optimized for desktop OS-level control”, meaning:

❌ Cannot directly control desktop applications (Word, Excel, Photoshop)
❌ Cannot manage files in operating system
❌ Cannot execute system-level commands

Comparison with Competitors:

Anthropic Claude Computer Use: Can control desktop OS, create/edit local files
OpenAI Agent: Can access desktop applications and file system
Gemini 2.5 Computer Use: Browser-only (currently)

This represents a significant limitation for use cases requiring OS-level control.

Real-World Applications

1. Automated Form Filling

Use Case: Fill out repetitive forms across multiple websites

Example:

Insurance quote requests across 10 providers
Job applications to 50 companies
Survey responses for research

Gemini Advantage:

Understands form context (what information to enter where)
Handles diverse form layouts and validation rules
Completes in minutes what would take hours manually

2. Research and Data Gathering

Use Case: Compile information from multiple sources

Example: “Find the 10 cheapest flights from LAX to Tokyo departing next Friday, and create a comparison table”

Gemini’s Workflow:

Navigate to flight search websites (Google Flights, Kayak, Skyscanner)
Enter search criteria on each site
Extract pricing and flight details
Compile results into structured data
Present comparison to user
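The "compile results" step is ordinary data wrangling once the agent has extracted the offers. The following sketch assumes each extracted offer is a dictionary with airline, flight, price_usd, and source keys; those field names and the sample rows are made up for illustration and are not real fares.

```python
from typing import Dict, List


def cheapest(offers: List[Dict], limit: int = 10) -> List[Dict]:
    """Keep the lowest price per (airline, flight), then sort ascending by price."""
    best: Dict[tuple, Dict] = {}
    for offer in offers:
        key = (offer["airline"], offer["flight"])
        if key not in best or offer["price_usd"] < best[key]["price_usd"]:
            best[key] = offer
    return sorted(best.values(), key=lambda o: o["price_usd"])[:limit]


def as_table(rows: List[Dict]) -> str:
    """Render offers as a fixed-width plain-text comparison table."""
    lines = [f"{'Airline':<10}{'Flight':<8}{'Price':>8}  Source"]
    for r in rows:
        lines.append(
            f"{r['airline']:<10}{r['flight']:<8}{r['price_usd']:>8.2f}  {r['source']}"
        )
    return "\n".join(lines)


# Placeholder offers; the same flight appears on two sites at different prices.
offers = [
    {"airline": "AirOne", "flight": "A100", "price_usd": 812.0, "source": "site-a"},
    {"airline": "AirTwo", "flight": "B200", "price_usd": 745.5, "source": "site-b"},
    {"airline": "AirOne", "flight": "A100", "price_usd": 799.0, "source": "site-c"},
]
ranked = cheapest(offers)
print(as_table(ranked))
```

Deduplicating by flight before ranking avoids filling the top 10 with the same itinerary listed on several sites.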

3. E-commerce Price Monitoring

Use Case: Track product prices across retailers

Example: “Monitor the price of [specific product] on Amazon, Best Buy, and Walmart daily, and alert me when it drops below $500”

Gemini’s Actions:

Navigate to each retailer
Search for product
Extract current price
Compare with previous prices
Send notification if threshold met
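The comparison and notification steps come down to a small piece of threshold logic that runs after the agent has extracted a price. A minimal sketch follows, assuming you only want an alert when the price newly crosses below the threshold rather than on every day it stays there; the function names and the sample prices are hypothetical.

```python
from typing import List, Optional


def should_alert(previous: Optional[float], current: float, threshold: float) -> bool:
    """True only when the price newly falls below the threshold."""
    was_below = previous is not None and previous < threshold
    return current < threshold and not was_below


def check_prices(history: List[float], threshold: float) -> List[int]:
    """Return the indexes of observations that would trigger an alert."""
    alerts = []
    previous = None
    for i, price in enumerate(history):
        if should_alert(previous, price, threshold):
            alerts.append(i)
        previous = price
    return alerts


# Made-up daily prices for one product.
daily = [529.0, 512.5, 499.99, 489.0, 505.0, 495.0]
print(check_prices(daily, threshold=500.0))  # [2, 5]
```

Day 3 stays silent because the price was already under $500 the day before; day 5 alerts again because the price had climbed back above the threshold in between.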

4. Account Management

Use Case: Update information across multiple services

Example: “Update my email address to [new email] on all my subscription accounts”

Gemini’s Workflow:

Navigate to each service’s account settings
Locate email update field
Enter new email
Verify change (if required)
Confirm update

5. Customer Support Automation

Use Case: Handle repetitive customer queries

Example: “Process refund request for order #12345”

Gemini’s Actions:

Log into order management system
Locate order #12345
Initiate refund process
Fill in refund reason and amount
Confirm and log action

Powering Google’s Experimental Projects

Project Mariner: AI Browser Agent

Gemini 2.5 Computer Use powers Project Mariner, Google’s experimental browser agent that:

Understands user goals expressed in natural language
Plans multi-step browsing workflows
Executes tasks autonomously
Provides real-time status updates

Current Status: Project Mariner is in limited testing, with Gemini 2.5 Computer Use serving as its underlying model.

AI Mode in Google Search

The model also powers agentic capabilities in AI Mode within Google Search:

Executes search-related tasks beyond simple queries
Navigates to relevant pages automatically
Extracts and synthesizes information
Completes user goals (e.g., “Find and book a dinner reservation”)

Firebase Testing Agent

Google uses Gemini 2.5 Computer Use for automated app testing:

Navigates through mobile apps
Tests UI flows and interactions
Identifies bugs and edge cases
Generates test reports

This demonstrates enterprise-grade reliability for automated testing workflows.

Availability and Pricing

Public Preview: Available Now

Access:

Google AI Studio: Web-based interface for testing
Vertex AI: Enterprise-grade deployment on Google Cloud
Gemini API: Programmatic access for developers

Status: Public preview as of October 7, 2025—open to all developers (no waitlist)

Pricing: No Free Tier

Unlike Gemini 2.5 Pro (which offers free access with token caps), Gemini 2.5 Computer Use is paid-only:

Input Token Pricing:

< 200K tokens: $1.25 per million tokens
≥ 200K tokens: $2.50 per million tokens

Output Token Pricing: Not explicitly disclosed (likely similar structure)
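To get a feel for what the input rates above mean per task, here is a rough estimator. It is a sketch under assumptions: it treats the 200K threshold as applying to each request's prompt size, it covers input tokens only (since output pricing is not given here), and the 5,000-tokens-per-step figure is an invented example rather than a measured value.

```python
TIER_LIMIT = 200_000  # tokens per request, from the list above
RATE_SMALL = 1.25     # USD per million input tokens below the limit
RATE_LARGE = 2.50     # USD per million input tokens at or above the limit


def input_cost_usd(prompt_tokens: int) -> float:
    """Input cost of a single request under the two-tier rates."""
    rate = RATE_SMALL if prompt_tokens < TIER_LIMIT else RATE_LARGE
    return prompt_tokens / 1_000_000 * rate


def task_input_cost_usd(tokens_per_step) -> float:
    """Total input cost of a multi-step task, one request per step."""
    return sum(input_cost_usd(t) for t in tokens_per_step)


# Hypothetical task: 30 agent steps, each sending about 5,000 input tokens.
steps = [5_000] * 30
print(round(task_input_cost_usd(steps), 4))  # 0.1875
```

Because every step resends a screenshot and context, cost scales with the number of steps, so long-running tasks are the ones to watch.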

Why No Free Tier? Computer use models consume significant computational resources:

Visual processing (screenshot analysis)
Multi-step reasoning
Real-time interaction loops
Higher latency tolerance requirements

Cost Comparison:

Gemini 2.5 Pro: Free tier available, $1.25-$2.50/M tokens paid
Claude Computer Use: Pricing via API, similar range
OpenAI Agent: Pricing TBD

Competitive Landscape

vs. Anthropic Claude Computer Use

Claude Computer Use (October 2024 release):

Can control entire desktop OS
Creates/edits local files (PowerPoint, Excel, text docs)
Strong reasoning capabilities
Lower benchmark scores than Gemini 2.5

Gemini 2.5 Computer Use:

Browser-only (no OS-level control)
Cannot create/edit local files
Higher benchmark performance (WebVoyager, AndroidWorld)
Lower latency

Winner Depends on Use Case:

Need OS control or file creation? → Claude
Pure web automation with best performance? → Gemini

vs. OpenAI Computer Using Agent

OpenAI Agent:

Announced but limited public details
Significantly lower benchmark scores (44.3% Mind2Web vs. Gemini’s 65.7%)
OS-level control capabilities (similar to Claude)
Integration with ChatGPT ecosystem

Gemini Advantage:

Much higher accuracy on web tasks
Public preview available (OpenAI agent still invite-only)
Google Cloud enterprise integration

vs. Microsoft Copilot Vision

Microsoft Copilot Vision:

Can “see” web pages and assist with tasks
More advisory (suggests actions) than autonomous
Integrated with Edge browser
Privacy-focused (doesn’t retain screenshots)

Gemini Difference:

Fully autonomous execution (not just suggestions)
Cross-browser support (not Edge-only)
Higher performance on complex tasks

Limitations and Concerns

1. No Desktop OS Control

As noted above, this is a major gap compared to Claude and OpenAI:

Cannot create PowerPoint presentations
Cannot edit Excel spreadsheets
Cannot manage local files
Cannot control native apps (Photoshop, VS Code, etc.)

Google’s Response: The model is “not yet optimized” for desktop control—implying future versions may add this capability.

2. Browser-Only Environment

All tasks must be web-accessible:

❌ Cannot install software
❌ Cannot execute terminal commands
❌ Cannot interact with local databases (unless web-accessible)

Workaround: Use web apps (Google Docs, Office 365, etc.) instead of desktop apps.

3. Security and Privacy Risks

Concerns:

Model sees everything on screen (sensitive information)
Potential for malicious instructions (“send all emails to attacker@evil.com”)
Screenshot data sent to Google servers

Mitigations:

Developers must implement authentication and authorization
Users should review actions before execution (where feasible)
Google likely has content policy and abuse detection
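The "review actions before execution" mitigation can be enforced in client code by putting a gate between the model's proposal and the browser. This is a minimal sketch, not a complete security design: the host allowlist, the action type labels, and the guarded_execute function are all hypothetical.

```python
from typing import Callable, Dict
from urllib.parse import urlparse

ALLOWED_HOSTS = {"example.com", "shop.example.com"}             # hypothetical allowlist
NEEDS_CONFIRMATION = {"submit_form", "send_email", "purchase"}  # hypothetical labels


def guarded_execute(
    action: Dict,
    execute: Callable[[Dict], None],
    confirm: Callable[[Dict], bool],
) -> str:
    """Run an action only if its host is allowed and, when risky, a human approves."""
    host = urlparse(action.get("url", "")).hostname or ""
    if host not in ALLOWED_HOSTS:
        return "blocked: host not allowed"
    if action["type"] in NEEDS_CONFIRMATION and not confirm(action):
        return "blocked: user declined"
    execute(action)
    return "executed"


ran = []
outcomes = [
    guarded_execute({"type": "click", "url": "https://example.com/cart"},
                    ran.append, confirm=lambda a: False),
    guarded_execute({"type": "purchase", "url": "https://shop.example.com/pay"},
                    ran.append, confirm=lambda a: False),
    guarded_execute({"type": "send_email", "url": "https://evil.example.net/"},
                    ran.append, confirm=lambda a: True),
]
print(outcomes)
```

Checking the host first means a prompt-injected instruction that points at an unexpected site is refused before the user is even asked to confirm it.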

4. Reliability Challenges

Edge Cases:

Websites with CAPTCHAs (model cannot solve)
Non-standard UI patterns
Dynamically loaded content (AJAX, infinite scroll)
Multi-factor authentication requirements
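Dynamically loaded content is often handled on the client side by waiting for a condition before taking the next screenshot. The helper below is a generic polling sketch; wait_until and the results_loaded predicate are hypothetical names, and the predicate here simulates a page that becomes ready on the third check.

```python
import time
from typing import Callable


def wait_until(
    predicate: Callable[[], bool],
    timeout: float = 10.0,
    interval: float = 0.25,
) -> bool:
    """Poll predicate() until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Simulated page: search results "finish loading" on the third check.
calls = {"n": 0}


def results_loaded() -> bool:
    calls["n"] += 1
    return calls["n"] >= 3


loaded = wait_until(results_loaded, timeout=1.0, interval=0.01)
print(loaded, calls["n"])  # True 3
```

Returning False on timeout, instead of raising, lets the caller decide whether to retry, scroll further, or hand the task back to a human.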

Performance: While 88.9% on WebVoyager is impressive, the remaining 11.1% failure rate means roughly 1 in 9 tasks fails, so the model is not yet suitable for mission-critical automation without human oversight.
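The per-task failure rate compounds quickly when an automation chains several tasks together. The arithmetic below is a sketch that assumes each task succeeds independently with the WebVoyager rate of 88.9%, which real workloads may not satisfy; the function names are illustrative.

```python
def all_succeed_probability(per_task_success: float, tasks: int) -> float:
    """Chance that every one of `tasks` independent tasks succeeds."""
    return per_task_success ** tasks


def expected_failures(per_task_success: float, tasks: int) -> float:
    """Average number of failed tasks out of `tasks` attempts."""
    return (1 - per_task_success) * tasks


SUCCESS = 0.889  # WebVoyager success rate quoted above

for n in (1, 5, 10):
    print(n, round(all_succeed_probability(SUCCESS, n), 3))

# About one failure is expected in every nine tasks.
print(round(expected_failures(SUCCESS, 9), 3))
```

Under these assumptions a run of five chained tasks completes cleanly only a little more than half the time, which is why human review or automatic verification between tasks matters.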

5. Ethical and Legal Considerations

Questions:

Is automated scraping via AI agents legal under website Terms of Service?
Who’s liable if the agent makes a mistake (wrong purchase, incorrect form submission)?
Can businesses detect and block AI agents?

Gray Area: Many websites prohibit automated access—AI agents operate in legal ambiguity.

Developer Integration

Getting Started with Gemini API

Basic Workflow:

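# Requires the google-genai SDK: pip install google-genai
from google import genai
from google.genai import types

# Initialize the client
client = genai.Client(api_key='YOUR_API_KEY')

# Enable the Computer Use tool for a browser environment
config = types.GenerateContentConfig(
    tools=[types.Tool(computer_use=types.ComputerUse(
        environment=types.Environment.ENVIRONMENT_BROWSER))]
)

# Define task
task = "Find the cheapest flight from SFO to NYC departing tomorrow"

# First turn only (simplified): the reply proposes a UI action as a function call.
# The model ID below is the preview name at launch; check the current docs.
response = client.models.generate_content(
    model='gemini-2.5-computer-use-preview-10-2025', contents=task, config=config)
print(response.function_calls)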

Real Implementation: Requires screenshot capture, action execution framework, and error handling (Google provides SDKs and examples).

Use Cases for Developers

Web Scraping Alternative: Instead of writing custom scraping code:

Describe what data to extract
Let Gemini navigate and extract
Handle changing UI layouts automatically

Automated Testing:

Define user flows in natural language
Gemini executes test scenarios
Catches UI regressions automatically

Workflow Automation:

Connect disparate web services
Automate repetitive admin tasks
Build custom agents for specific domains

Conclusion

Google’s Gemini 2.5 Computer Use represents a major step forward in AI agent capabilities. With 88.9% on WebVoyager and 69.7% on AndroidWorld, ahead of the Claude and OpenAI results cited above, Google has shown that AI agents can navigate complex web interfaces and execute multi-step tasks autonomously with a high success rate, though not yet without human oversight.

The model’s browser-first optimization is both a strength and a limitation: it excels at web automation but lacks the OS-level control of competitors. For use cases centered on web workflows such as research, form filling, e-commerce monitoring, and testing, Gemini 2.5 Computer Use is the most accurate of the options compared here.

With public preview access through Gemini API, Google has opened the door for developers to build the next generation of AI-powered automation tools. As the model evolves to support desktop control and file manipulation, it could become the foundational layer for autonomous AI agents across personal and enterprise applications.

The race to build AI that can “actually do things” is accelerating—and with Gemini 2.5 Computer Use, Google just took a commanding lead in web-based agent intelligence.
