REAL Evals - Realistic Evaluations for Agents Leaderboard

REAL by AGI Inc

Sandbox

A controlled environment where AI agents interact with realistic website replicas to test complex tasks. Built for research and debugging only. These non-commercial clones are unaffiliated with any real brands and are used strictly for educational purposes.

Staynb - Airbnb clone
Omnizon - Amazon clone
DashDish - DoorDash clone
GoCalendar - Google Calendar clone
GoMail - Gmail clone
OpenDining - OpenTable clone
NetworkIn - LinkedIn clone
Udriver - Uber clone
Fly Unified - United clone
TopWork - Upwork clone
Zilloft - Zillow clone

How it works

Real websites

Modern web stack - React + Next.js
Rich functionality for core flows
Realistic mock data
Fully deterministic, meaning:
  Locked data
  Fixed date ranges
  Perfect replayability
Already logged in and ready
Agent-friendly security posture
Cross-tab session persistence

Real goals

Practical goals written by humans
Websites are fully configurable:
  Toggle accessibility features
  Set unexpected behavior flags
  Configurable mock latency
Action and retrieval-based goals
Failure and "no action" cases included
Easy, medium, hard categories
Rubrics for LLM judging of retrieval tasks
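
The goal taxonomy above (action and retrieval goals, three difficulty levels) lends itself to a per-category breakdown of results. The following is a minimal Python sketch under stated assumptions: the `TaskResult` record, its field names, and the sample values are all hypothetical and are not part of the REAL SDK, nor a description of how the leaderboard score is actually computed.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Hypothetical record for one evaluated goal (not an SDK type)."""
    task_id: str
    difficulty: str  # "easy", "medium", or "hard"
    goal_type: str   # "action" or "retrieval"
    success: bool


def success_rate_by_difficulty(results):
    """Return the fraction of successful goals for each difficulty level."""
    totals = defaultdict(int)
    passed = defaultdict(int)
    for result in results:
        totals[result.difficulty] += 1
        passed[result.difficulty] += int(result.success)
    return {level: passed[level] / totals[level] for level in totals}


# Placeholder values for illustration only; these are not real results.
sample = [
    TaskResult("t1", "easy", "action", True),
    TaskResult("t2", "hard", "retrieval", True),
    TaskResult("t3", "hard", "action", False),
]
rates = success_rate_by_difficulty(sample)
```

Grouping by `goal_type` instead of `difficulty` would follow the same pattern.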

Flexible evaluation

Bring your own system; "black box" systems are supported
Framework agnostic
Playwright SDK available
Multiple ways to accomplish goals
Easy to work with websites:
  /config to configure
  /finish to get state changes
  /submit to submit goal outcomes
Local evaluation support
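
The three control paths listed above (/config, /finish, /submit) suggest a simple episode lifecycle: configure the site, let the agent act, read the state changes, then submit the outcome. The sketch below only builds the URLs. It assumes a hypothetical base URL and a hypothetical `latency_ms` query parameter, neither of which comes from the REAL documentation; consult the SDK repository for the actual request format.

```python
from typing import Optional
from urllib.parse import urlencode

# Hypothetical base URL; substitute the address of the sandbox site under test.
BASE_URL = "https://sandbox.example.com"

# The three control paths named in the list above.
CONTROL_ENDPOINTS = {"config", "finish", "submit"}


def endpoint_url(base_url: str, endpoint: str, params: Optional[dict] = None) -> str:
    """Build the URL for one control endpoint, with optional query parameters."""
    if endpoint not in CONTROL_ENDPOINTS:
        raise ValueError(f"unknown endpoint: {endpoint}")
    url = f"{base_url.rstrip('/')}/{endpoint}"
    if params:
        url = f"{url}?{urlencode(params)}"
    return url


# "latency_ms" is an assumed parameter name, used only for illustration.
config_url = endpoint_url(BASE_URL, "config", {"latency_ms": 500})
finish_url = endpoint_url(BASE_URL, "finish")
submit_url = endpoint_url(BASE_URL, "submit")
```

Because the sites expose these as plain paths, a "black box" agent can reach them with any HTTP client or browser automation tool.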

Submit to the Leaderboard

We welcome submissions to the REAL leaderboard! You can evaluate your own agent's performance and contribute to advancing the state of AI web agents.

To submit your results, use our official SDK available at:

https://github.com/agi-inc/agisdk

Follow the documentation in the repository to learn how to run evaluations and submit your results to appear on this leaderboard.

Read the REAL benchmark paper, "REAL: Realistic Evaluations for Agents Leaderboard".

REAL is now publicly released! Check out our blog post and paper to learn more about this benchmark and how to use it.