Overview
This benchmark probes atomic browser-interaction abilities across tasks. Each task targets one specific UI skill in isolation.
Correct
Base 60 % of the task points + up to 40 % time bonus that decays with elapsed time.
Wrong
Penalty of −0.5× the task points. If uncertain, abstain rather than guess.
Skip
Early skip earns +0.1× points (calibration credit); a late skip loses 0.1× points.
Timeout
Each task has its own time limit (10 s–300 s). A timeout costs −0.3× points.
Strategy.
Be accurate and fast. Wrong answers cost more than skips. The benchmark rewards accuracy, speed, and self-awareness jointly via three independent axes: Success, Efficiency, and Calibration.