Google's Gemini 2.5 Computer Use: AI That Navigates Web Like a Human

October 7, 2025

On October 7, 2025, Google DeepMind unveiled Gemini 2.5 Computer Use, a specialized AI model that can autonomously navigate websites, click buttons, fill forms, and complete multi-step web tasks much as a human would. With an 88.9% score on WebVoyager and 69.7% on AndroidWorld, it outperforms the computer-use agents from Anthropic (Claude) and OpenAI on the benchmarks reported here, establishing Google as a frontrunner in the race to build AI that can actually do things on your behalf. Now available in public preview via the Gemini API, the model powers Google’s experimental Project Mariner and marks a significant step toward AI agents that work autonomously across digital interfaces.

What is Gemini 2.5 Computer Use?

Beyond Chatbots: AI That Takes Action

Unlike traditional language models that simply answer questions, Gemini 2.5 Computer Use is an agentic AI model designed to:

  • Navigate web browsers autonomously
  • Understand visual interfaces (buttons, forms, menus)
  • Execute multi-step tasks from natural language instructions
  • Interact with websites like a human user

Core Capability: The model uses a combination of visual understanding and reasoning to analyze user requests and carry out tasks in the browser, completing all required actions—clicking, typing, scrolling, manipulating dropdown menus, filling out and submitting forms—just as a human can do.

Built on Gemini 2.5 Pro Foundation

Gemini 2.5 Computer Use is built on Gemini 2.5 Pro’s visual understanding and reasoning capabilities, but specialized for UI interaction:

  • Vision-language model: Interprets screenshots as visual input
  • Action prediction: Decides which UI elements to interact with
  • Task planning: Breaks down complex requests into sequential actions
  • Error recovery: Handles unexpected UI states or failures

Benchmark Dominance: Best-in-Class Performance

WebVoyager Leaderboard: 88.9% Success Rate

Official WebVoyager Results (October 2025):

  • Gemini 2.5 Computer Use: 88.9% (highest score)
  • Claude Sonnet 4.5: ~71.4% (estimated from Browserbase)
  • Claude Sonnet 4: 69.4%
  • OpenAI Computer Using Agent: 61.0%

WebVoyager Benchmark: Tests AI agents on real-world web navigation tasks across diverse websites, requiring:

  • Multi-step task completion
  • Form filling and submission
  • Navigation across multiple pages
  • Understanding of web UI conventions

Gemini’s Advantage: The 17.5 percentage point lead over Claude Sonnet 4.5 (88.9% vs. roughly 71.4%) is a substantial gap, although the Claude figure is an estimate taken from the Browserbase harness rather than from the same evaluation run.
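The gap can be rechecked from the figures listed above. The short sketch below does only that arithmetic; the dictionary restates the scores from this section, and the helper name `point_gap` is illustrative rather than part of any benchmark tooling.

```python
# WebVoyager scores as listed in this article (percent).
webvoyager = {
    "Gemini 2.5 Computer Use": 88.9,
    "Claude Sonnet 4.5": 71.4,  # estimated from Browserbase, per the list above
    "Claude Sonnet 4": 69.4,
    "OpenAI Computer Using Agent": 61.0,
}

LEADER = "Gemini 2.5 Computer Use"


def point_gap(scores, leader, other):
    """Return the lead of `leader` over `other` in percentage points."""
    return round(scores[leader] - scores[other], 1)


for name in webvoyager:
    if name != LEADER:
        print(f"{name}: {point_gap(webvoyager, LEADER, name)} points behind")
```

Percentage points are plain differences between two percentages; they are not relative improvements, which would divide the difference by the competitor's score.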

Browserbase Harness: 79.9% Accuracy

When measured by Browserbase (an independent evaluation platform):

  • Gemini 2.5 Computer Use: 79.9%
  • Claude Sonnet 4.5: 71.4%
  • Claude Sonnet 4: 69.4%
  • OpenAI Agent: 61.0%

Gemini’s lead narrows under this harness (8.5 points over Claude Sonnet 4.5), but it holds across both evaluation methodologies.

AndroidWorld: 69.7% on Mobile Tasks

AndroidWorld Benchmark Results:

  • Gemini 2.5 Computer Use: 69.7%
  • Claude Sonnet 4: 62.1%
  • Claude Sonnet 4.5: 56.0%

Significance: Despite being “primarily optimized for web browsers,” Gemini 2.5 Computer Use still outperforms Claude on mobile UI control—demonstrating strong cross-platform generalization.

AndroidWorld Benchmark: Tests AI agents on Android app interactions:

  • Tapping UI elements
  • Scrolling through lists
  • Entering text in forms
  • Navigating between screens
  • Completing real-world mobile tasks

Online Mind2Web: 65.7% Real-World Web Tasks

Results:

  • Gemini 2.5 Computer Use: 65.7%
  • Claude Sonnet 4: 61.0%
  • OpenAI Agent: 44.3%

Mind2Web Benchmark: Focuses on complex, real-world web tasks from popular websites, requiring multi-step reasoning and precise UI interaction.

Latency Performance: Speed Meets Accuracy

Google emphasizes that Gemini 2.5 Computer Use delivers “high accuracy while maintaining low latency”:

  • Accuracy: Above 70% on key benchmarks
  • Latency: Around 225 seconds for complex tasks

Competitive Advantage: Balancing speed and accuracy is critical for practical deployment—users won’t wait 10 minutes for an agent to complete a task.

How It Works: Technical Architecture

Vision-Language Understanding

Gemini 2.5 Computer Use processes screenshots as visual input:

  1. User provides natural language instruction (e.g., “Book a flight to Tokyo for next Friday”)
  2. Model takes screenshot of current browser state
  3. Vision model identifies UI elements (buttons, text fields, links)
  4. Reasoning engine plans next action
  5. Model outputs coordinates and action type (click, type, scroll)
  6. Browser executes action
  7. Process repeats until task complete
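The seven steps above amount to an observe, decide, act loop. The sketch below shows that control flow only: `FakeBrowser` and `fake_model` are hypothetical stand-ins (a scripted "model" and a browser that just records actions), not the Gemini API or a real browser driver.

```python
from dataclasses import dataclass


@dataclass
class Action:
    kind: str       # "click", "type", "scroll", or "done"
    x: int = 0
    y: int = 0
    text: str = ""


class FakeBrowser:
    """Stand-in for a real browser driver; records what it is asked to do."""

    def __init__(self):
        self.log = []

    def screenshot(self) -> bytes:
        return f"screen-after-{len(self.log)}-actions".encode()

    def execute(self, action: Action) -> None:
        self.log.append(action)


def fake_model(instruction: str, screenshot: bytes, history: list) -> Action:
    """Stand-in for the model call: a fixed script instead of real inference."""
    script = [Action("click", 120, 80), Action("type", text="Tokyo"), Action("done")]
    return script[min(len(history), len(script) - 1)]


def run_agent(instruction, browser, model, max_steps=10):
    history = []
    for _ in range(max_steps):
        shot = browser.screenshot()                 # step 2: capture current state
        action = model(instruction, shot, history)  # steps 3-5: perceive, plan, emit action
        if action.kind == "done":
            return history
        browser.execute(action)                     # step 6: perform the action
        history.append(action)                      # step 7: loop with the new state
    raise RuntimeError("step budget exhausted before the task finished")


browser = FakeBrowser()
steps = run_agent("Book a flight to Tokyo for next Friday", browser, fake_model)
print([a.kind for a in steps])
```

The `max_steps` cap is a design choice of this sketch: without a step budget, a model that never signals completion would loop indefinitely.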

Action Space: What the Model Can Do

Mouse Actions:

  • Click on specific coordinates
  • Right-click for context menus
  • Drag and drop elements

Keyboard Actions:

  • Type text into fields
  • Press special keys (Enter, Tab, Escape)
  • Use keyboard shortcuts

Navigation Actions:

  • Scroll up/down, left/right
  • Navigate forward/back in browser history
  • Open new tabs or windows

Form Interactions:

  • Select dropdown menu options
  • Check/uncheck boxes
  • Fill multi-field forms
  • Submit forms
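One way to make an action space like the one above concrete is a small typed record that is validated before anything touches the browser. This is a sketch under assumptions: the names `ActionType`, `UIAction`, and `validate` are invented for illustration and do not mirror the schema the Gemini API actually returns.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ActionType(Enum):
    CLICK = "click"
    RIGHT_CLICK = "right_click"
    DRAG = "drag"
    TYPE = "type"
    KEY = "key"
    SCROLL = "scroll"
    NAVIGATE = "navigate"


@dataclass(frozen=True)
class UIAction:
    type: ActionType
    x: Optional[int] = None
    y: Optional[int] = None
    text: Optional[str] = None


# Actions that only make sense with a screen position.
POINTER_ACTIONS = {ActionType.CLICK, ActionType.RIGHT_CLICK, ActionType.DRAG}


def validate(action: UIAction, width: int, height: int) -> None:
    """Reject actions that could not be executed on a viewport of this size."""
    if action.type in POINTER_ACTIONS:
        if action.x is None or action.y is None:
            raise ValueError(f"{action.type.value} needs coordinates")
        if not (0 <= action.x < width and 0 <= action.y < height):
            raise ValueError("coordinates fall outside the viewport")
    if action.type is ActionType.TYPE and not action.text:
        raise ValueError("type action needs text")


validate(UIAction(ActionType.CLICK, x=640, y=360), width=1280, height=720)
print("click at (640, 360) is valid for a 1280x720 viewport")
```

Validating before execution keeps a malformed model output from turning into a stray click.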

Multi-Step Task Planning

For complex requests, the model:

  1. Decomposes the task into subtasks
  2. Prioritizes actions based on current UI state
  3. Adapts when encountering unexpected UI changes
  4. Verifies completion of each subtask before proceeding

Example: “Book a hotel in Paris for next weekend”

  • Navigate to booking website
  • Enter “Paris” in destination field
  • Select check-in/check-out dates (next weekend)
  • Click “Search”
  • Filter results by price/rating
  • Click on selected hotel
  • Fill in guest details
  • Enter payment information
  • Confirm booking

Each step requires visual understanding, reasoning, and precise UI interaction.
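The "verify each subtask before proceeding" idea can be sketched as a small plan runner. Everything here is hypothetical scaffolding: `perform` and `verify` simulate a page where the search click fails once, standing in for whatever the model and browser would really do.

```python
def run_plan(subtasks, perform, verify, max_attempts=2):
    """Run subtasks in order; verify each one before moving to the next."""
    completed = []
    for subtask in subtasks:
        for _ in range(max_attempts):
            perform(subtask)
            if verify(subtask):
                completed.append(subtask)
                break
        else:
            # for/else: reached only when no attempt passed verification
            raise RuntimeError(f"could not complete: {subtask}")
    return completed


plan = [
    "Navigate to booking website",
    'Enter "Paris" in destination field',
    'Click "Search"',
]

# Simulated page state: the search click only registers on the second try.
state = {"done": set(), "search_clicks": 0}


def perform(subtask):
    if subtask == 'Click "Search"':
        state["search_clicks"] += 1
        if state["search_clicks"] < 2:
            return  # simulate an unexpected UI state swallowing the click
    state["done"].add(subtask)


def verify(subtask):
    return subtask in state["done"]


result = run_plan(plan, perform, verify)
print(result)
```

Stopping at the first subtask that cannot be verified avoids, for example, entering payment details on a page that never loaded the selected hotel.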

Platform Optimization and Limitations

Primarily Web Browser Focused

Google states Gemini 2.5 Computer Use is “primarily optimized for web browsers”:

  • ✅ Chrome, Firefox, Edge, Safari support
  • ✅ Complex web applications (forms, dashboards, e-commerce)
  • ✅ Multi-page workflows

Strong Android Performance (Not Yet Optimized)

Despite not being specifically optimized for Android, the model achieved 69.7% on AndroidWorld—suggesting:

  • Strong visual understanding transfers to mobile UIs
  • Future Android-optimized versions will likely perform even better
  • Cross-platform agent capabilities are feasible

Desktop OS-Level Control: Not Yet Available

The model is “not yet optimized for desktop OS-level control”, meaning:

  • ❌ Cannot directly control desktop applications (Word, Excel, Photoshop)
  • ❌ Cannot manage files in operating system
  • ❌ Cannot execute system-level commands

Comparison with Competitors:

  • Anthropic Claude Computer Use: Can control desktop OS, create/edit local files
  • OpenAI Agent: Can access desktop applications and file system
  • Gemini 2.5 Computer Use: Browser-only (currently)

This represents a significant limitation for use cases requiring OS-level control.

Real-World Applications

1. Automated Form Filling

Use Case: Fill out repetitive forms across multiple websites

Example:

  • Insurance quote requests across 10 providers
  • Job applications to 50 companies
  • Survey responses for research

Gemini Advantage:

  • Understands form context (what information to enter where)
  • Handles diverse form layouts and validation rules
  • Completes in minutes what would take hours manually

2. Research and Data Gathering

Use Case: Compile information from multiple sources

Example: “Find the 10 cheapest flights from LAX to Tokyo departing next Friday, and create a comparison table”

Gemini’s Workflow:

  1. Navigate to flight search websites (Google Flights, Kayak, Skyscanner)
  2. Enter search criteria on each site
  3. Extract pricing and flight details
  4. Compile results into structured data
  5. Present comparison to user
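Steps 4 and 5 of this workflow, compiling and presenting results, are ordinary data handling once the agent has extracted rows. The sketch below assumes the rows arrive as dictionaries; the site names, flight labels, and prices are placeholders invented for illustration, not real fares.

```python
# Placeholder rows standing in for data an agent might extract; not real fares.
results = [
    {"source": "site-a", "flight": "F1", "price_usd": 940.0},
    {"source": "site-b", "flight": "F2", "price_usd": 815.5},
    {"source": "site-c", "flight": "F3", "price_usd": 1020.0},
    {"source": "site-a", "flight": "F2", "price_usd": 830.0},
]


def cheapest_per_flight(rows):
    """Keep the lowest price seen for each flight across all sources."""
    best = {}
    for row in rows:
        current = best.get(row["flight"])
        if current is None or row["price_usd"] < current["price_usd"]:
            best[row["flight"]] = row
    return sorted(best.values(), key=lambda r: r["price_usd"])


def to_table(rows, limit=10):
    """Format up to `limit` rows as a fixed-width text table."""
    lines = [f"{'Flight':<8} {'Source':<8} {'Price (USD)':>11}"]
    for row in rows[:limit]:
        lines.append(f"{row['flight']:<8} {row['source']:<8} {row['price_usd']:>11.2f}")
    return "\n".join(lines)


ranked = cheapest_per_flight(results)
table = to_table(ranked)
print(table)
```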

3. E-commerce Price Monitoring

Use Case: Track product prices across retailers

Example: “Monitor the price of [specific product] on Amazon, Best Buy, and Walmart daily, and alert me when it drops below $500”

Gemini’s Actions:

  • Navigate to each retailer
  • Search for product
  • Extract current price
  • Compare with previous prices
  • Send notification if threshold met
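The last two actions, comparing with previous prices and notifying when the threshold is met, reduce to a small piece of logic that runs after each extraction. A minimal sketch, assuming prices have already been parsed to floats; the function name `crossed_below` and the sample readings are illustrative.

```python
from typing import Optional


def crossed_below(previous: Optional[float], current: float, threshold: float) -> bool:
    """True only when the price newly drops below the threshold.

    Alerting on the crossing, rather than on every reading under the
    threshold, avoids sending the same notification every day.
    """
    if current >= threshold:
        return False
    return previous is None or previous >= threshold


# Illustrative daily readings for one product (made-up numbers).
readings = [549.0, 520.0, 495.0, 489.0, 510.0, 499.0]

alerts = []
previous = None
for price in readings:
    if crossed_below(previous, price, threshold=500.0):
        alerts.append(price)
    previous = price

print(alerts)
```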

4. Account Management

Use Case: Update information across multiple services

Example: “Update my email address to [new email] on all my subscription accounts”

Gemini’s Workflow:

  1. Navigate to each service’s account settings
  2. Locate email update field
  3. Enter new email
  4. Verify change (if required)
  5. Confirm update

5. Customer Support Automation

Use Case: Handle repetitive customer queries

Example: “Process refund request for order #12345”

Gemini’s Actions:

  • Log into order management system
  • Locate order #12345
  • Initiate refund process
  • Fill in refund reason and amount
  • Confirm and log action

Powering Google’s Experimental Projects

Project Mariner: AI Browser Agent

Gemini 2.5 Computer Use powers Project Mariner, Google’s experimental browser agent that:

  • Understands user goals expressed in natural language
  • Plans multi-step browsing workflows
  • Executes tasks autonomously
  • Provides real-time status updates

Current Status: Project Mariner is in limited testing, with Gemini 2.5 Computer Use serving as its underlying model.

The model also powers agentic capabilities in AI Mode within Google Search:

  • Executes search-related tasks beyond simple queries
  • Navigates to relevant pages automatically
  • Extracts and synthesizes information
  • Completes user goals (e.g., “Find and book a dinner reservation”)

Firebase Testing Agent

Google uses Gemini 2.5 Computer Use for automated app testing:

  • Navigates through mobile apps
  • Tests UI flows and interactions
  • Identifies bugs and edge cases
  • Generates test reports

This demonstrates enterprise-grade reliability for automated testing workflows.

Availability and Pricing

Public Preview: Available Now

Access:

  • Google AI Studio: Web-based interface for testing
  • Vertex AI: Enterprise-grade deployment on Google Cloud
  • Gemini API: Programmatic access for developers

Status: Public preview as of October 7, 2025—open to all developers (no waitlist)

Pricing: No Free Tier

Unlike Gemini 2.5 Pro (which offers free access with token caps), Gemini 2.5 Computer Use is paid-only:

Input Token Pricing:

  • < 200K tokens: $1.25 per million tokens
  • ≥ 200K tokens: $2.50 per million tokens

Output Token Pricing: Not explicitly disclosed (likely similar structure)
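The input-side cost of an agent session can be estimated from the two rates above. This sketch makes one assumption that the article does not spell out: the rate is chosen per request from that request's prompt size and applied to all of its input tokens. Output tokens are left out because their price is not disclosed here, and the session figures are hypothetical.

```python
TIER_BOUNDARY_TOKENS = 200_000
RATE_BELOW_USD_PER_M = 1.25   # prompts under 200K tokens
RATE_ABOVE_USD_PER_M = 2.50   # prompts of 200K tokens or more


def input_cost_usd(prompt_tokens: int) -> float:
    """Input cost of one request, assuming the tier is picked per request."""
    if prompt_tokens < TIER_BOUNDARY_TOKENS:
        rate = RATE_BELOW_USD_PER_M
    else:
        rate = RATE_ABOVE_USD_PER_M
    return prompt_tokens / 1_000_000 * rate


def session_input_cost_usd(prompt_sizes) -> float:
    """Sum input cost over every model call in an agent session."""
    return sum(input_cost_usd(n) for n in prompt_sizes)


# Hypothetical session: 25 loop iterations, each sending about 8,000 input tokens.
session = [8_000] * 25
print(f"estimated input cost: ${session_input_cost_usd(session):.2f}")
```

Because each loop iteration resends a screenshot plus history, the number of steps in a task matters as much to cost as the per-token rate.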

Why No Free Tier? Computer use models consume significant computational resources:

  • Visual processing (screenshot analysis)
  • Multi-step reasoning
  • Real-time interaction loops
  • Higher latency tolerance requirements

Cost Comparison:

  • Gemini 2.5 Pro: Free tier available, $1.25-$2.50/M tokens paid
  • Claude Computer Use: Pricing via API, similar range
  • OpenAI Agent: Pricing TBD

Competitive Landscape

vs. Anthropic Claude Computer Use

Claude Computer Use (October 2024 release):

  • Can control entire desktop OS
  • Creates/edits local files (PowerPoint, Excel, text docs)
  • Strong reasoning capabilities
  • Lower benchmark scores than Gemini 2.5

Gemini 2.5 Computer Use:

  • Browser-only (no OS-level control)
  • Cannot create/edit local files
  • Higher benchmark performance (WebVoyager, AndroidWorld)
  • Lower latency

Winner Depends on Use Case:

  • Need OS control or file creation? → Claude
  • Pure web automation with best performance? → Gemini

vs. OpenAI Computer Using Agent

OpenAI Agent:

  • Announced but limited public details
  • Significantly lower benchmark scores (44.3% Mind2Web vs. Gemini’s 65.7%)
  • OS-level control capabilities (similar to Claude)
  • Integration with ChatGPT ecosystem

Gemini Advantage:

  • Much higher accuracy on web tasks
  • Public preview available (OpenAI agent still invite-only)
  • Google Cloud enterprise integration

vs. Microsoft Copilot Vision

Microsoft Copilot Vision:

  • Can “see” web pages and assist with tasks
  • More advisory (suggests actions) than autonomous
  • Integrated with Edge browser
  • Privacy-focused (doesn’t retain screenshots)

Gemini Difference:

  • Fully autonomous execution (not just suggestions)
  • Cross-browser support (not Edge-only)
  • Higher performance on complex tasks

Limitations and Concerns

1. No Desktop OS Control

As noted above, this is a major gap compared to Claude and OpenAI:

  • Cannot create PowerPoint presentations
  • Cannot edit Excel spreadsheets
  • Cannot manage local files
  • Cannot control native apps (Photoshop, VS Code, etc.)

Google’s Response: The model is “not yet optimized” for desktop control—implying future versions may add this capability.

2. Browser-Only Environment

All tasks must be web-accessible:

  • ❌ Cannot install software
  • ❌ Cannot execute terminal commands
  • ❌ Cannot interact with local databases (unless web-accessible)

Workaround: Use web apps (Google Docs, Office 365, etc.) instead of desktop apps.

3. Security and Privacy Risks

Concerns:

  • Model sees everything on screen (sensitive information)
  • Potential for malicious instructions (“send all emails to attacker@evil.com”)
  • Screenshot data sent to Google servers

Mitigations:

  • Developers must implement authentication and authorization
  • Users should review actions before execution (where feasible)
  • Google likely has content policy and abuse detection
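The "review actions before execution" mitigation can be sketched as a gate between the model's proposed action and the code that performs it. This is an illustrative pattern, not a Google-provided safety feature: the keyword list is a deliberately crude, hypothetical heuristic, and `confirm` stands in for whatever approval UI an application would show.

```python
# Crude, illustrative heuristic; a real system would need a stronger policy.
SENSITIVE_KEYWORDS = ("pay", "purchase", "delete", "send", "submit")


def needs_confirmation(description: str) -> bool:
    """Flag actions whose description mentions a sensitive operation."""
    lowered = description.lower()
    return any(word in lowered for word in SENSITIVE_KEYWORDS)


def guarded_execute(description, execute, confirm):
    """Run `execute` unless the action is sensitive and the user declines."""
    if needs_confirmation(description) and not confirm(description):
        return "blocked"
    execute(description)
    return "executed"


performed = []
deny_all = lambda description: False  # simulates a user who rejects every prompt

print(guarded_execute("Scroll results list", performed.append, deny_all))
print(guarded_execute("Send all emails to external address", performed.append, deny_all))
```

Keeping the gate outside the model matters for the malicious-instruction case above: text on a web page can influence what the model proposes, but it cannot approve the action on the user's behalf.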

4. Reliability Challenges

Edge Cases:

  • Websites with CAPTCHAs (model cannot solve)
  • Non-standard UI patterns
  • Dynamically loaded content (AJAX, infinite scroll)
  • Multi-factor authentication requirements

Performance: While 88.9% on WebVoyager is impressive, an 11.1% failure rate means roughly 1 in 9 tasks fails, so the model is not yet suitable for mission-critical automation without human oversight.
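The failure rate compounds when tasks are chained or batched. The sketch below works through that arithmetic under a simplifying assumption that is not from the benchmark itself: each task succeeds independently with the same 88.9% probability.

```python
def all_succeed_probability(per_task_success: float, n_tasks: int) -> float:
    """Chance that every one of n independent tasks succeeds."""
    return per_task_success ** n_tasks


def expected_failures(per_task_success: float, n_tasks: int) -> float:
    """Average number of failed tasks out of n."""
    return (1 - per_task_success) * n_tasks


SUCCESS = 0.889  # WebVoyager score quoted above

for n in (1, 5, 10):
    p = all_succeed_probability(SUCCESS, n)
    print(f"{n:>2} tasks: P(all succeed) = {p:.3f}, "
          f"expected failures = {expected_failures(SUCCESS, n):.2f}")
```

Under these assumptions, a batch of ten tasks completes with no failures only about 31% of the time, which is why unattended batches still need review.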

Questions:

  • Is automated scraping via AI agents legal under website Terms of Service?
  • Who’s liable if the agent makes a mistake (wrong purchase, incorrect form submission)?
  • Can businesses detect and block AI agents?

Gray Area: Many websites prohibit automated access—AI agents operate in legal ambiguity.

Developer Integration

Getting Started with Gemini API

Basic Workflow:

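import google.generativeai as genai

# Initialize the SDK with your API key (placeholder; load it from a secret store in practice)
genai.configure(api_key='YOUR_API_KEY')
# Model name as given in this article; check the current model list for the exact preview ID
model = genai.GenerativeModel('gemini-2.5-computer-use')
# Define the task in natural language
task = "Find the cheapest flight from SFO to NYC departing tomorrow"
# Execute (simplified example: a real agent loop also sends screenshots
# and executes the actions the model returns)
response = model.generate_content(task)
print(response.text)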

Real Implementation: Requires screenshot capture, action execution framework, and error handling (Google provides SDKs and examples).

Use Cases for Developers

Web Scraping Alternative: Instead of writing custom scraping code:

  • Describe what data to extract
  • Let Gemini navigate and extract
  • Handle changing UI layouts automatically

Automated Testing:

  • Define user flows in natural language
  • Gemini executes test scenarios
  • Catches UI regressions automatically

Workflow Automation:

  • Connect disparate web services
  • Automate repetitive admin tasks
  • Build custom agents for specific domains

Conclusion

Google’s Gemini 2.5 Computer Use represents a major leap forward in AI agent capabilities. By achieving 88.9% on WebVoyager and 69.7% on AndroidWorld, both higher than the competing scores listed above, Google has shown that AI agents can navigate complex web interfaces and execute multi-step tasks autonomously with a high, though not yet perfect, success rate.

The model’s browser-first optimization is both a strength and limitation: it excels at web automation but lacks the OS-level control of competitors. For use cases centered on web workflows—research, form filling, e-commerce monitoring, testing—Gemini 2.5 Computer Use is the most accurate option available.

With public preview access through Gemini API, Google has opened the door for developers to build the next generation of AI-powered automation tools. As the model evolves to support desktop control and file manipulation, it could become the foundational layer for autonomous AI agents across personal and enterprise applications.

The race to build AI that can “actually do things” is accelerating—and with Gemini 2.5 Computer Use, Google just took a commanding lead in web-based agent intelligence.
