Agent evaluation and benchmarks
τ-bench, GAIA, SWE-bench, WebArena — measuring whether agents actually do useful work, not just sound smart.
Depth levels
L0 — Intro (~2h)
Runs a single task through an agent and inspects the result.
L1 — Basics (~10h)
Uses LangSmith / Ragas / DeepEval to score traces; builds a golden set.
L2 — Working (~15h)
Runs SWE-bench / GAIA / τ-bench locally; compares model + scaffold variants.
L3 — Advanced (~25h)
Defines a rubric-based LLM-as-judge with calibration; builds cost/latency tradeoff curves (see the sketch after this list).
L4 — Research (~50h)
Contributes new agent benchmarks or judging methodologies.
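A minimal sketch of the golden-set + rubric-judge idea from L1 and L3: run each golden-set task through the agent, then score the answer with a rubric-based LLM-as-judge. The `run_agent` stub, the rubric wording, the judge model name, and the JSONL golden-set format are assumptions for illustration; the judge call uses the OpenAI Python SDK, but any chat-completion client would do.

```python
# Golden-set evaluation with a rubric-based LLM-as-judge (illustrative sketch).
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = """Score the agent's answer from 1 to 5 against the reference:
5 = fully correct and complete, 3 = partially correct, 1 = wrong or off-task.
Reply with a JSON object: {"score": <int>, "reason": "<one sentence>"}."""

def run_agent(task: str) -> str:
    # Placeholder: call your agent scaffold here (custom loop, LangGraph, etc.).
    raise NotImplementedError

def judge(task: str, reference: str, answer: str) -> dict:
    """Rubric-based judgment of one (task, reference, answer) triple."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # judge model: an assumption, pick your own
        temperature=0,        # deterministic judging
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user",
             "content": f"Task: {task}\nReference: {reference}\nAnswer: {answer}"},
        ],
    )
    return json.loads(resp.choices[0].message.content)

def evaluate(golden_path: str) -> float:
    """golden_path: JSONL file with {"task": ..., "reference": ...} per line."""
    scores = []
    with open(golden_path) as f:
        for line in f:
            ex = json.loads(line)
            answer = run_agent(ex["task"])
            verdict = judge(ex["task"], ex["reference"], answer)
            scores.append(verdict["score"])
            print(f"{verdict['score']}/5  {verdict['reason']}")
    return sum(scores) / len(scores)
```

Calibration at L3 then amounts to spot-checking the judge's scores against a few human-labeled examples and tightening the rubric until they agree.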
Resources
L1 — Basics