Realistic Evaluations for Agents Leaderboard
(REAL)

Modern, complex websites mirroring what we actually use the web for

High-fidelity, fully deterministic websites for agents to explore

  • Staynb: similar to Airbnb
  • Omnizon: similar to Amazon
  • DashDish: similar to DoorDash
  • GoCalendar: similar to Google Calendar
  • GoMail: similar to Gmail
  • OpenDining: similar to OpenTable
  • NetworkIn: similar to LinkedIn
  • Udriver: similar to Uber
  • Fly Unified: similar to United
  • TopWork: similar to Upwork
  • Zilloft: similar to Zillow

How it works

Real websites
  • Modern web stack - React + Next.js
  • Rich functionality for core flows
  • Realistic mock data
  • Fully deterministic:
    • Locked data
    • Fixed date ranges
    • Perfect replayability
  • Already logged in and ready
  • Agent-friendly security posture
  • Cross-tab session persistence
Real goals
  • Practical goals written by humans
  • Websites are fully configurable
    • Toggle accessibility features
    • Set unexpected behavior flags
    • Configure mock latency
  • Action and retrieval-based goals
  • Failure and "no action" cases included
  • Easy, medium, hard categories
  • Rubrics for LLM judging of retrieval tasks
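As a rough illustration of how goals tagged by type and difficulty might be scored, the sketch below groups pass/fail outcomes and computes a pass rate per group. The `GoalResult` record, its field names, and the `pass_rate_by` helper are all hypothetical, and the sample outcomes in the usage note are made up for illustration; none of this is REAL's actual schema or data.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class GoalResult:
    """One goal outcome. Hypothetical record, not REAL's actual schema."""
    website: str
    kind: Literal["action", "retrieval"]
    difficulty: Literal["easy", "medium", "hard"]
    passed: bool


def pass_rate_by(results: list[GoalResult], field: str) -> dict[str, float]:
    """Group results by one attribute and return the pass rate per group."""
    # Each bucket holds [number passed, number attempted].
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])
    for result in results:
        bucket = totals[getattr(result, field)]
        bucket[0] += int(result.passed)
        bucket[1] += 1
    return {key: passed / count for key, (passed, count) in totals.items()}
```

Calling `pass_rate_by(results, "difficulty")` or `pass_rate_by(results, "website")` on a list of such records gives the kind of per-category breakdown a leaderboard could display.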
Flexible evaluation
  • Bring your own system; "black box" systems are supported
  • Framework agnostic
  • Playwright SDK available
  • Multiple ways to accomplish goals
  • Easy to work with websites
    • /config to configure
    • /finish to get state changes
    • /submit to submit goal outcomes
  • Local evaluation support
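The three paths listed above suggest a simple episode shape: configure the site, let the agent act, read back the state changes, then submit the outcome. The sketch below only builds the URLs for those steps. The base URL and the query parameter names (`task_id`, `latency_ms`) are assumptions made for illustration, and the text does not say whether each path is a page to navigate to or an API call, so treat this as a sketch rather than the actual interface.

```python
from urllib.parse import urlencode, urljoin

# Hypothetical base URL for one benchmark site; the real host is not given in the text.
BASE = "https://staynb.example.invalid/"


def endpoint(base: str, path: str, **params: object) -> str:
    """Join a site base URL with one of the three paths and optional query params."""
    url = urljoin(base, path.lstrip("/"))
    if params:
        # Sort keys so the same settings always produce the same URL.
        url += "?" + urlencode(sorted((k, str(v)) for k, v in params.items()))
    return url


def episode_urls(base: str, **config: object) -> list[str]:
    """Return the URLs one evaluation episode would touch, in order."""
    return [
        endpoint(base, "/config", **config),  # 1. configure the site (params are assumed)
        endpoint(base, "/finish"),            # 2. read back state changes
        endpoint(base, "/submit"),            # 3. submit the goal outcome
    ]
```

For example, `episode_urls(BASE, task_id="demo", latency_ms=250)` yields the `/config` URL with both assumed parameters, followed by the `/finish` and `/submit` URLs. A black-box agent would run between the first and second steps.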

Model performance, broken out by website and task

Sample leaderboard image

Example model results page

REAL is currently in closed beta. We’re excited to invite you to participate and provide feedback to improve the experience.