REAL comprises high-fidelity, deterministic replicas of 11 widely used websites across e-commerce, travel, communication, and professional networking domains, with 112 practical tasks that mirror complex everyday user interactions. Frontier language models achieve at most a 41% success rate, highlighting critical gaps in autonomous web navigation and task completion.
@inproceedings{liu2025real,
  title={REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites},
  author={Garg, Divyansh and VanWeelden, Shaun and Caples, Diego and Draguns, Andis and Ravi, Nikil and Putta, Pranav and Garg, Naman and Abraham, Tomas and Lara, Michael and Lopez, Federico and Liu, James and Gundawar, Atharva and Hebbar, Prannay and Joo, Youngchul and Gu, Jindong and London, Charles and Schroeder de Witt, Christian and Motwani, Sumeet},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks},
  year={2025},
}
In Progress
Diversity-Driven Multi-Agent Reinforcement Learning for Language Models
James Liu and Yejin Choi
2025
Ongoing research at Stanford AI Lab with Professor Yejin Choi
Training shared-parameter language models using a multi-agent RL pipeline with VERL on Slurm-managed NVIDIA H100 clusters. Dual reward functions for quality and diversity improve the Shannon Evenness Index by over 15%. Research conducted at the Stanford Artificial Intelligence Lab.
@article{liu2025marl,
  title={Diversity-Driven Multi-Agent Reinforcement Learning for Language Models},
  author={Liu, James and Choi, Yejin},
  year={2025},
  note={Ongoing research at Stanford AI Lab with Professor Yejin Choi},
}