Princeton computer science professor Arvind Narayanan recently published a paper with Sayash Kapoor, his Ph.D. advisee and frequent collaborator, and fifteen others (Peter Kirgis, Andrew Schwartz, Stephan Rabanser, J.J. Allaire, Rishi Bommasani, Magda Dubois, Gillian Hadfield, Andy Hall, Sara Hooker, Seth Lazar, Steve Newman, Dimitris Papailiopoulos, Shoshannah Tekofsky, Helen Toner, and Cozmin Ududec), titled Open-world evaluations for measuring frontier AI capabilities. Along with defining open-world evaluations, surveying the lessons learned so far, and laying out best practices for conducting them, the paper introduces CRUX, a new project for systematically conducting open-world evaluations.
As AI models increasingly saturate major benchmarks, the 17 co-authors developed CRUX as a collaboration among researchers from academia, government, civil society, and industry to measure frontier AI capabilities through open-world evaluations.
“The practice of AI benchmarking is about half a century old. Its limitations have been well known for decades. Benchmarks are useful, but can’t be the entire story of AI evaluation,” said Narayanan.
In direct contrast to traditional benchmarking, open-world evaluations use a small sample and often require a human element or intervention. While some may hasten to dismiss these evaluations as unscientific, the CRUX team states, “We think such evaluations are important for collecting evidence about AI capabilities. They can provide early warnings about emerging capabilities to inform efforts at building societal resilience, help evaluators identify blind spots in existing benchmarks, and give companies a clearer picture of what tasks AI systems could soon carry out, informing strategic decisions about AI.”
What CRUX Proposes:
CRUX has already completed several experiments, including the evaluation of an AI agent creating an iOS app from scratch. “What is interesting about open-world evaluation is that many different people and teams started doing it roughly simultaneously, showing that there was pent-up demand for new approaches to AI evaluation,” shared Narayanan. “With the CRUX project, we hope to learn from these efforts and carry out regular open-world evaluations that will allow us to map the rapidly shifting AI frontier.”
Learn more about the limitations of benchmarking, open-world evaluations, and CRUX by reading the full paper on Substack.
Arvind Narayanan is a professor of computer science at Princeton University and the director of the Center for Information Technology Policy. His research focuses on studying the societal impact of digital technologies, especially AI.
Narayanan is a co-author of the book AI Snake Oil, the essay AI as Normal Technology, and the newsletter of the same name, which is read by over 75,000 researchers, policymakers, journalists, and AI enthusiasts. He previously co-authored two widely used computer science textbooks: Bitcoin and Cryptocurrency Technologies and Fairness in Machine Learning. Narayanan led the Princeton Web Transparency and Accountability Project to uncover how companies collect and use our personal information. His work was among the first to show how machine learning reflects cultural stereotypes. Narayanan was included on TIME’s inaugural list of the 100 most influential people in AI. He is a recipient of the Presidential Early Career Award for Scientists and Engineers (PECASE).
Narayanan received technical degrees from the Indian Institute of Technology Madras and his Ph.D. in computer science from the University of Texas at Austin. He began his academic career as a postdoctoral researcher at Stanford University before moving to Princeton in 2012.