About

About the REAL Benchmark

There is no shortage of hype around AI models that can navigate the web. What I was most curious about was how these systems stack up on the websites we actually use, on the goals we would actually want accomplished.

This was the motivation behind the REAL Benchmark.

Startups and the largest players in the space are gearing up to have models that can navigate browsers and operating systems. Most of the existing evaluations today are built for research and don’t correlate with a model’s usefulness in the real world.

There’s a way to fix that.

Evaluating AI models that can use the web requires a playground that is deterministic (it behaves the same way on every run) yet realistic, modeling real-world browser experiences. Models need to be measured on their ability to take action and change the state of the world without actually doing so (think of emulating a checkout experience).

There is no such playground today. Real websites are not deterministic, in either their content or their UI. There are also the issues of authentication, the need for proxies or clever hacks to access a website, and the real-world consequences of any actions taken.

REAL Evals is a collection of websites mimicking the core interaction flows of real-world websites.

The websites will look and respond like their “real” counterparts, with a preset content configuration. We have written corresponding questions, challenges, and prompts. Each evaluation consists of the website configuration plus one of two checks: for information retrieval, the AI system's final answer to the question; for actions, a programmatic review of the end state of the website after the action was taken. We have also created website configurations that support loading errors, high latency, and annoying pop-ups.
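As a rough illustration of the two scoring modes described above, the sketch below assumes a hypothetical task record and scoring function. The names (Task, score, state_check) and the exact-match comparison for retrieval answers are assumptions made for this example, not the REAL SDK's actual interface or grading rule.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Task:
    """Hypothetical task record; field names are illustrative only."""
    prompt: str
    kind: str  # "retrieval" or "action"
    expected_answer: Optional[str] = None
    # For action tasks: a predicate over the site's end state.
    state_check: Optional[Callable[[dict], bool]] = None


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def score(task: Task, final_answer: str, end_state: dict) -> bool:
    """Return True if the run passes under the task's scoring mode."""
    if task.kind == "retrieval":
        # Assumption: retrieval is judged by normalized exact match.
        return normalize(final_answer) == normalize(task.expected_answer or "")
    if task.kind == "action":
        # Programmatic review of the website's end state.
        return task.state_check is not None and task.state_check(end_state)
    raise ValueError(f"unknown task kind: {task.kind!r}")


# Example tasks (contents are made up for illustration).
retrieval = Task(
    prompt="What is the price of the first item?",
    kind="retrieval",
    expected_answer="$19.99",
)
checkout = Task(
    prompt="Buy one blue mug",
    kind="action",
    state_check=lambda s: s.get("orders") == [{"item": "blue mug", "qty": 1}],
)
```

Because the site content is preset, a check like checkout.state_check can compare against a fixed expected end state, which is what makes the review programmatic rather than manual.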

This evaluation recognizes that companies are developing their own proprietary ways of interacting with a browser, which don't fit neatly into the fixed set of actions and screens that previous benchmarks assume.

We have designed this benchmark to work with "black box" systems that work with the browser in their own unique ways. As long as the system can navigate to a start page before giving control of the session to the AI, and can navigate to a finish page with the model's final answer, it can be evaluated on the REAL Benchmark.
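The start-page and finish-page contract described above can be sketched as a small harness. Everything here is hypothetical: the Session interface, the FakeSession stand-in, the run_episode function, and especially the choice to pass the final answer as an "answer" query parameter are assumptions made for illustration, since the actual hand-off mechanism is not described on this page.

```python
from typing import Callable, Protocol
from urllib.parse import parse_qs, urlencode, urlparse


class Session(Protocol):
    """Minimal browser-session interface assumed for this sketch."""
    def goto(self, url: str) -> None: ...


class FakeSession:
    """In-memory stand-in for a browser session, only to make the sketch runnable."""
    def __init__(self) -> None:
        self.history: list[str] = []

    def goto(self, url: str) -> None:
        self.history.append(url)


# The agent is a black box: it receives the session and the prompt and
# returns its final answer. How it drives the browser is its own business.
Agent = Callable[[Session, str], str]


def run_episode(session: Session, agent: Agent,
                start_url: str, finish_url: str, prompt: str) -> None:
    session.goto(start_url)             # 1. navigate to the start page
    answer = agent(session, prompt)     # 2. hand control of the session to the AI
    # 3. navigate to the finish page with the final answer
    #    (query-parameter encoding is an assumption of this sketch)
    session.goto(f"{finish_url}?{urlencode({'answer': answer})}")


# Usage with a trivial agent and placeholder URLs.
session = FakeSession()
run_episode(
    session,
    agent=lambda s, p: "42 items",
    start_url="https://example.test/start",
    finish_url="https://example.test/finish",
    prompt="How many items are in the cart?",
)
```

The point of the design is that the harness touches the session only at the two boundaries, so nothing about the agent's internal action space needs to be exposed.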

Participate

We are actively seeking feedback and collaboration to provide the initial baselines for state-of-the-art (SOTA) browser agent models to the public and academic community.

Our team would love the opportunity to work with you, sharing access to the websites, challenge set, and SDK. Please email us at participate@realevals.xyz if you are interested.

We are happy to sign NDAs and work with you to ensure that your participation is meaningful and valuable. All participants will have the opportunity to opt in to be showcased in the leaderboard results. There is no cost to participate in the REAL Benchmark.

REAL Benchmark Team

Benchmark Creator

Shaun VanWeelden

Managing Director at Mercor
Ex-OpenAI, Ex-Scale AI

Website Creators

Michael Lara

Front-end Developer
Based in Colombia 🇨🇴

Federico Lopez

Front-end Developer
Based in Argentina 🇦🇷

Tomas Abraham

Front-end Developer
Based in Argentina 🇦🇷

In collaboration with

Div Garg

CEO of MultiOn
Adjunct Lecturer, Stanford

Dr. Julia Kiseleva

Research Lead at MultiOn
Ex-Microsoft Research

Get in touch

Excited about agents that can really use browsers? So are we.

participate@realevals.xyz - We are looking for brave leaders to participate in the most real-world browser agent challenge to date.

press@realevals.xyz - For press inquiries, please reach out to our team.

hello@realevals.xyz - For general inquiries, we'd love to hear what's on your mind.

Legal Information

Last updated: December 9, 2024

Disclaimer

REAL Evals (“the Platform”) is an evaluation and testing environment designed to mimic the functionality and workflows of real-world websites for educational, research, and development purposes. The following disclaimers apply:

1. No Affiliation: The Platform and its operators are not affiliated, associated, authorized, endorsed by, or in any way officially connected with the real-world companies, brands, or entities represented by the mimicked websites. All company names, logos, and trademarks used on the Platform belong to their respective owners.

2. Simulation Environment: The Platform provides simulated environments that are not identical to the real-world counterparts they mimic. Results and evaluations conducted on the Platform are for testing and benchmarking purposes only and should not be construed as equivalent to performing actions on actual websites or applications.

3. Limitation of Liability: The Platform and its operators shall not be held liable for any direct, indirect, incidental, or consequential damages resulting from the use of the Platform, including but not limited to errors in the simulations, misrepresentations of the real-world counterparts, or any unintended outcomes from evaluations conducted on the Platform.

4. No Guarantee of Accuracy: While we strive to provide realistic simulations, we do not guarantee the accuracy, completeness, or currency of the content or workflows presented in the mimicked websites.

5. Use at Your Own Risk: The Platform is provided “as is,” and users assume full responsibility for their use of the Platform. It is the user’s responsibility to ensure compliance with all applicable laws and regulations.

Copyright Notice

All content, designs, configurations, workflows, and other materials provided on REAL Evals (“the Platform”) are the intellectual property of Success Engineering, LLC or its licensors and are protected under applicable copyright and intellectual property laws.

1. Educational and Research Purposes: The mimicked websites and configurations provided on the Platform are developed for educational, research, and testing purposes only. These simulations are intended to evaluate browser-based AI models in a controlled environment and are not meant to replicate or compete with the original websites.

2. No Ownership of Real-World Counterparts: The trademarks, logos, and brand names referenced within the simulated environments belong to their respective owners. The inclusion of such references is solely for simulation purposes and does not imply any ownership, endorsement, or affiliation with the actual entities.

3. Prohibited Uses: Users are prohibited from:

  • Copying, reproducing, or distributing the simulations outside of the Platform’s intended use.

  • Using the simulations for commercial purposes unrelated to the Platform’s evaluation framework.

  • Modifying or reverse-engineering the simulations to create derivative works.

4. Fair Use Acknowledgment: The mimicked websites and configurations are designed under the principles of fair use to serve as transformative tools for research and development. Any similarities to real-world counterparts are intended only to replicate core interaction flows in a controlled environment and do not represent the full functionality or appearance of the actual websites.

5. Infringement Reporting: If you believe your intellectual property has been used inappropriately on the Platform, please contact us at hello@realevals.xyz with a detailed report of the alleged infringement. We will investigate and take appropriate action promptly.