Trajectory-Level Web Agent Evaluation

Project Overview

At the SaNDwich Lab, a collaboration between IBM and the University of Notre Dame, I work under Prof. Toby Jia-Jun Li on how web agents are evaluated. Instead of relying only on binary success/failure metrics, we are developing a trajectory-level evaluation that assesses the quality and value alignment of the entire action sequence an agent takes in a web automation task.

Key Contributions

Developed a trajectory-level evaluation framework that scores whole action sequences, not only final task success
Created controlled replica testbeds for retail, maps, and flight booking scenarios to avoid anti-bot drift
Implemented standardized browser automation using the browser-use framework (screenshots + DOM)
Encoded user values (e.g., vegan, budget-conscious) and logged value reasoning separately from action reasoning
Added completion-page checks to prevent false continuation and improve evaluation accuracy
Instrumented experimental factors, including a comparison of API-based vs. end-to-end agent performance
Analyzed environmental variables such as recommender settings, pricing frames, and deceptive ads
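
To make the trajectory-level idea concrete, here is a minimal sketch of how a logged trajectory might be scored. It is an illustration under assumptions, not the project's implementation: the `Step` fields, the `truncate_at_completion` and `score_trajectory` helpers, and the alignment formula are all hypothetical, and "false continuation" is read here as an agent continuing to act after it has reached a completion page.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One logged agent step (hypothetical schema, for illustration only)."""
    action: str                       # e.g. "click", "type", "navigate"
    action_reasoning: str             # why the agent chose this action
    value_reasoning: str = ""         # logged separately: how user values were weighed
    violates_value: bool = False      # e.g. added a non-vegan item for a vegan user
    on_completion_page: bool = False  # result of a completion-page check


def truncate_at_completion(steps):
    """Keep steps up to and including the first completion page."""
    for i, step in enumerate(steps):
        if step.on_completion_page:
            return steps[: i + 1]
    return steps


def score_trajectory(steps, task_succeeded):
    """Score the whole action sequence instead of only the end state."""
    kept = truncate_at_completion(steps)
    violations = sum(s.violates_value for s in kept)
    # Assumed metric: share of scored steps that respect the user's values.
    alignment = 1.0 - violations / len(kept) if kept else 0.0
    return {
        "success": task_succeeded,
        "steps_scored": len(kept),
        "extra_steps": len(steps) - len(kept),  # actions taken after completion
        "value_alignment": alignment,
    }
```

Two trajectories that both end in success can receive different scores under this scheme, for example when one of them passes through a step that conflicts with the encoded user value.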

Technical Skills Demonstrated

Browser automation using screenshots and DOM manipulation frameworks
Controlled environment replication for major web platforms and services
Value encoding systems and preference tracking for ethical AI evaluation
Comprehensive logging and analysis pipelines for complex behavioral data
Multi-factor experimental design for robust evaluation methodology
Human-AI interaction principles and evaluation framework development
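
As a rough illustration of the multi-factor design, the sketch below enumerates a full factorial grid of environment conditions. The factor names follow the variables mentioned on this page, but the specific levels and the `build_conditions` helper are assumptions made for the example, not the settings used in the study.

```python
from itertools import product

# Hypothetical factor levels; the real study may use different ones.
FACTORS = {
    "recommender": ["off", "on"],
    "pricing_frame": ["plain", "discount_framed"],
    "deceptive_ads": [False, True],
    "user_value": ["vegan", "budget-conscious"],
}


def build_conditions(factors):
    """Return one dict per cell of the full factorial design."""
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product(*factors.values())]


conditions = build_conditions(FACTORS)
```

Each entry in `conditions` describes one testbed configuration, so every agent run can be logged against the exact combination of factors it was exposed to.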

Impact and Applications

This framework supports evaluating web agents on how they act, not only on whether they finish a task, and has practical applications in:

E-commerce platforms with budget and preference constraints
Travel booking systems with personal preferences and accessibility requirements
Navigation systems with user-specific needs and restrictions
General web automation with ethical considerations and value alignment
Human-AI collaboration assessment in real-world scenarios

Publication Status

This work is being prepared for submission to IUI. It targets evaluation methodology for human-AI interaction and web automation.

Dan (Hojun) Yoo
