Trajectory-Level Web Agent Evaluation
Project Overview
At the SaNDwich Lab, a collaboration between IBM and the University of Notre Dame, I’m working under Prof. Toby Jia-Jun Li to improve how web agents are evaluated. Rather than relying on binary success/failure metrics, we’re developing trajectory-level evaluation that assesses the quality and value alignment of the entire action sequence in a web automation task.
Key Contributions
- Developed a trajectory-level evaluation framework that scores entire action sequences, not only final task success
- Created controlled replica testbeds for retail, maps, and flight booking scenarios to avoid anti-bot measures and drift in live sites
- Implemented standardized browser automation using the browser-use framework (screenshots + DOM)
- Encoded user values (e.g., vegan, budget-conscious) and logged value reasoning separately from action reasoning
- Added completion-page checks to keep agents from continuing after a task is already complete, improving evaluation accuracy
- Instrumented experimental factors, including an API vs. end-to-end performance comparison
- Analyzed environmental variables such as recommender settings, pricing frames, and deceptive ads
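To illustrate the ideas in the list above, here is a minimal sketch of scoring a whole trajectory rather than only its final state. The `Step` and `Trajectory` types, their field names, and the metrics returned are hypothetical choices for illustration; they are not the project's actual schema or scoring rule.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    action: str                       # e.g. "click", "type", "navigate"
    action_reasoning: str             # why the agent chose this action
    value_reasoning: str = ""         # logged separately from action reasoning
    violates_value: bool = False      # hypothetical flag set by an annotator or judge
    on_completion_page: bool = False  # True once the task's completion page is shown


@dataclass
class Trajectory:
    user_values: list[str]
    steps: list[Step] = field(default_factory=list)


def score_trajectory(traj: Trajectory) -> dict:
    """Return per-trajectory metrics instead of a single pass/fail bit."""
    n = len(traj.steps)
    # Index of the first step that lands on a completion page, if any.
    done_at = next(
        (i for i, s in enumerate(traj.steps) if s.on_completion_page), None
    )
    completed = done_at is not None
    # Steps taken after the completion page count as false continuation.
    extra_steps = n - 1 - done_at if completed else 0
    violations = sum(s.violates_value for s in traj.steps)
    # Share of steps that did not violate a user value (assumed metric).
    value_alignment = 1.0 - violations / n if n else 0.0
    return {
        "completed": completed,
        "extra_steps": extra_steps,
        "value_violations": violations,
        "value_alignment": value_alignment,
    }


traj = Trajectory(
    user_values=["vegan", "budget-conscious"],
    steps=[
        Step("navigate", "open the store"),
        Step("click", "add item to cart", "item contains dairy", violates_value=True),
        Step("click", "check out", on_completion_page=True),
        Step("click", "keep browsing after checkout"),
    ],
)
print(score_trajectory(traj))
```

In this example the agent does reach the completion page, so a success/failure metric would report a pass, while the trajectory-level view also surfaces one value violation and one step taken after completion.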
Technical Skills Demonstrated
- Browser automation using screenshots and DOM manipulation frameworks
- Controlled environment replication for major web platforms and services
- Value encoding systems and preference tracking for ethical AI evaluation
- Comprehensive logging and analysis pipelines for complex behavioral data
- Multi-factor experimental design for robust evaluation methodology
- Human-AI interaction principles and evaluation framework development
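As a rough sketch of what a multi-factor design can look like, the snippet below enumerates a full factorial grid of conditions. The factor names echo the variables mentioned above, but the specific levels and the `build_conditions` helper are assumptions for illustration, not the study's actual configuration.

```python
from itertools import product

# Hypothetical factor levels; the levels actually used in the study are not listed here.
FACTORS = {
    "recommender": ["off", "on"],
    "pricing_frame": ["total_price", "per_unit_price"],
    "deceptive_ads": [False, True],
    "interface": ["api", "end_to_end"],
}


def build_conditions(factors: dict) -> list[dict]:
    """Enumerate every combination of factor levels (a full factorial design)."""
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product(*factors.values())]


conditions = build_conditions(FACTORS)
print(len(conditions))  # 2 * 2 * 2 * 2 = 16 conditions
print(conditions[0])
```

Each resulting dictionary describes one experimental condition, so every agent run can be logged alongside the exact settings it was evaluated under.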
Impact and Applications
This framework supports evaluating web agents on how they complete a task, not only whether they complete it, and has practical applications in:
- E-commerce platforms with budget and preference constraints
- Travel booking systems with personal preferences and accessibility requirements
- Navigation systems with user-specific needs and restrictions
- General web automation with ethical considerations and value alignment
- Human-AI collaboration assessment in real-world scenarios
Publication Status
This work is being prepared for submission to IUI and aims to contribute to evaluation methodology for human-AI interaction and web automation.
