REAL Evals - Realistic Evaluations for Agents Leaderboard

About

About the REAL Benchmark

There’s no shortage of hype around AI agents that can navigate the web. But what we care about most is how these systems perform on actual websites, completing real-world tasks that people genuinely want done.

This is the motivation behind the REAL Benchmark.

Today, both startups and major players are racing to build models that can navigate browsers and operating systems. Yet most existing evaluations are designed for research settings—and often fail to reflect real-world utility.

We think there’s a better way.

Evaluating web-capable AI models requires a deterministic (i.e., consistent and repeatable) yet realistic environment that mimics actual browser experiences. Models should be assessed on their ability to take meaningful actions—like emulating a checkout process—without causing real-world consequences.

Currently, no such standardized environment exists. Real websites are non-deterministic, with dynamic content, variable UIs, authentication walls, and unpredictable behaviors. They’re also risky to test on directly.

REAL Evals is a curated collection of simulated websites that faithfully replicate the core interaction patterns of real-world sites. These sites are pre-configured with fixed content, allowing for consistent and repeatable evaluations.

Each evaluation includes:

A simulated website mirroring a real-world experience
A set of questions, tasks, or prompts
Evaluation logic based on either the final answer (for information tasks) or a programmatic review of the website’s state after actions are taken (for action tasks)
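The two kinds of evaluation logic listed above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual SDK: the `Task` class, the `evaluate` function, the normalization rule, and the shape of the state dictionary are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass
class Task:
    """Hypothetical task record; field names are illustrative only."""
    prompt: str
    # Set for information tasks: the answer the agent should report.
    expected_answer: Optional[str] = None
    # Set for action tasks: a predicate over the site's final state.
    state_check: Optional[Callable[[Dict[str, Any]], bool]] = None


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences pass."""
    return " ".join(text.lower().split())


def evaluate(task: Task,
             final_answer: Optional[str],
             final_state: Optional[Dict[str, Any]]) -> bool:
    """Score one run by final answer or by final website state."""
    if task.expected_answer is not None:
        return (final_answer is not None
                and normalize(final_answer) == normalize(task.expected_answer))
    if task.state_check is not None:
        return final_state is not None and task.state_check(final_state)
    raise ValueError("task defines neither an expected answer nor a state check")


# Hypothetical tasks on an imagined storefront.
info_task = Task(prompt="What is the price of the blue mug?",
                 expected_answer="$12.00")
action_task = Task(prompt="Add two blue mugs to the cart",
                   state_check=lambda s: s.get("cart", {}).get("blue-mug") == 2)
```

Because the simulated sites have fixed content, a check like `state_check` should give the same verdict on every run for the same agent behavior.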

We’ve also included edge cases like loading errors, latency, and pop-ups to better mirror real user experiences.

Importantly, the REAL Benchmark is built to accommodate “black box” systems—agents that use proprietary interaction methods rather than predefined actions or screens. As long as the system can:

Navigate to a starting page before handing control to the AI agent, and
Navigate to a final page to capture the model’s answer or action result,

…it can be evaluated using REAL.
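As a rough sketch of what this black-box contract might look like from the harness side: the harness only opens the starting page, hands over control, and then opens the final page. Everything here is hypothetical, including `FakeBrowser`, `run_episode`, and the example URLs; a real integration would drive an actual browser session.

```python
from typing import Callable, List


class FakeBrowser:
    """Stand-in for a real browser session; it only records navigations."""

    def __init__(self) -> None:
        self.history: List[str] = []

    def goto(self, url: str) -> None:
        self.history.append(url)


# The agent is opaque: it receives the browser and a prompt, and returns an answer.
Agent = Callable[[FakeBrowser, str], str]


def run_episode(browser: FakeBrowser, agent: Agent,
                start_url: str, finish_url: str, prompt: str) -> str:
    """Run one task without knowing how the agent interacts with the page."""
    browser.goto(start_url)          # 1. navigate to the starting page
    answer = agent(browser, prompt)  # 2. hand control to the black-box agent
    browser.goto(finish_url)         # 3. navigate to the final page to capture the result
    return answer


def demo_agent(browser: FakeBrowser, prompt: str) -> str:
    """Toy agent that visits one page and reports a fixed answer."""
    browser.goto("https://shop.example/cart")
    return "done"


session = FakeBrowser()
result = run_episode(session, demo_agent,
                     "https://shop.example/", "https://shop.example/finish",
                     "Add two blue mugs to the cart")
```

The harness never inspects how `demo_agent` acts in between, which is the sense in which proprietary interaction methods can be accommodated.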

Participate

We are actively seeking feedback and collaborators to establish initial baselines for state‑of‑the‑art browser‑agent models and to share them with the public and the academic community.

Our team would love the opportunity to work with you, sharing access to the websites, challenge sets, and SDKs. Please email us at participate@realevals.xyz if you are interested.

We are happy to sign NDAs and work with you to ensure that your participation is meaningful and valuable. All participants will have the opportunity to opt in to having their results showcased on the leaderboard. There is no cost to participate in the REAL Benchmark.

REAL Benchmark Team

Benchmark Creators

Div Garg

CEO of AGI, Inc.
Adjunct Lecturer, Stanford

Shaun VanWeelden

Managing Director at Mercor
Ex-OpenAI, Ex-Scale AI

Get in touch

Excited about agents that can really use browsers? So are we.

participate@realevals.xyz - We are looking for brave leaders to participate in the most real-world browser agent challenge to date.

press@realevals.xyz - For press inquiries, please reach out to our team.

hello@realevals.xyz - For general inquiries, we'd love to hear what's on your mind.

Legal Information

Last updated: December 9, 2024

Disclaimer

REAL Evals (“the Platform”) is an evaluation and testing environment designed to mimic the functionality and workflows of real-world websites for educational, research, and development purposes. The following disclaimers apply:

1. No Affiliation: The Platform and its operators are not affiliated or associated with, authorized or endorsed by, or in any way officially connected with the real-world companies, brands, or entities represented by the mimicked websites. All company names, logos, and trademarks used on the Platform belong to their respective owners.

2. Simulation Environment: The Platform provides simulated environments that are not identical to the real-world counterparts they mimic. Results and evaluations conducted on the Platform are for testing and benchmarking purposes only and should not be construed as equivalent to performing actions on actual websites or applications.

3. Limitation of Liability: The Platform and its operators shall not be held liable for any direct, indirect, incidental, or consequential damages resulting from the use of the Platform, including but not limited to errors in the simulations, misrepresentations of the real-world counterparts, or any unintended outcomes from evaluations conducted on the Platform.

4. No Guarantee of Accuracy: While we strive to provide realistic simulations, we do not guarantee the accuracy, completeness, or currency of the content or workflows presented in the mimicked websites.

5. Use at Your Own Risk: The Platform is provided “as is,” and users assume full responsibility for their use of the Platform. It is the user’s responsibility to ensure compliance with all applicable laws and regulations.

All content, designs, configurations, workflows, and other materials provided on REAL Evals (“the Platform”) are the intellectual property of Success Engineering, LLC or its licensors and are protected under applicable copyright and intellectual property laws.

1. Educational and Research Purposes: The mimicked websites and configurations provided on the Platform are developed for educational, research, and testing purposes only. These simulations are intended to evaluate browser-based AI models in a controlled environment and are not meant to replicate or compete with the original websites.

2. No Ownership of Real-World Counterparts: The trademarks, logos, and brand names referenced within the simulated environments belong to their respective owners. The inclusion of such references is solely for simulation purposes and does not imply any ownership, endorsement, or affiliation with the actual entities.

3. Prohibited Uses: Users are prohibited from:

Copying, reproducing, or distributing the simulations outside of the Platform’s intended use.
Using the simulations for commercial purposes unrelated to the Platform’s evaluation framework.
Modifying or reverse-engineering the simulations to create derivative works.

4. Fair Use Acknowledgment: The mimicked websites and configurations are designed under the principles of fair use to serve as transformative tools for research and development. Any similarities to real-world counterparts are intended only to replicate core interaction flows in a controlled environment and do not represent the full functionality or appearance of the actual websites.

5. Infringement Reporting: If you believe your intellectual property has been used inappropriately on the Platform, please contact us at hello@realevals.xyz with a detailed report of the alleged infringement. We will investigate and take appropriate action promptly.