Agent evaluation and benchmarks
τ-bench, GAIA, SWE-bench, WebArena — measuring whether agents actually do useful work, not just sound smart.
Depth levels
L0 — Intro (~2h)
Runs a single task through an agent and inspects the result.
L1 — Basics (~10h)
Uses LangSmith / Ragas / DeepEval to score traces; builds a golden set.
L2 — Working (~15h)
Runs SWE-bench / GAIA / τ-bench locally; compares model + scaffold variants.
L3 — Advanced (~25h)
Defines a rubric-based LLM-as-judge with calibration; builds cost/latency tradeoff curves (see the sketch after this list).
L4 — Research (~50h)
Contributes new agent benchmarks or judging methodologies.
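A minimal sketch of the golden-set + rubric-judge idea from L1 and L3: run each golden-set task through the agent, then score the answer with a rubric-based LLM-as-judge. The `run_agent` stub, the rubric wording, the judge model name, and the JSONL golden-set format are assumptions for illustration; the judge call uses the OpenAI Python SDK, but any chat-completion client would do.

```python
# Golden-set evaluation with a rubric-based LLM-as-judge (illustrative sketch).
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = """Score the agent's answer from 1 to 5 against the reference:
5 = fully correct and complete, 3 = partially correct, 1 = wrong or off-task.
Reply with a JSON object: {"score": <int>, "reason": "<one sentence>"}."""

def run_agent(task: str) -> str:
    # Placeholder: call your agent scaffold here (custom loop, LangGraph, etc.).
    raise NotImplementedError

def judge(task: str, reference: str, answer: str) -> dict:
    """Rubric-based judgment of one (task, reference, answer) triple."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # judge model: an assumption, pick your own
        temperature=0,        # deterministic judging
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user",
             "content": f"Task: {task}\nReference: {reference}\nAnswer: {answer}"},
        ],
    )
    return json.loads(resp.choices[0].message.content)

def evaluate(golden_path: str) -> float:
    """golden_path: JSONL file with {"task": ..., "reference": ...} per line."""
    scores = []
    with open(golden_path) as f:
        for line in f:
            ex = json.loads(line)
            answer = run_agent(ex["task"])
            verdict = judge(ex["task"], ex["reference"], answer)
            scores.append(verdict["score"])
            print(f"{verdict['score']}/5  {verdict['reason']}")
    return sum(scores) / len(scores)
```

Calibration at L3 then amounts to spot-checking the judge's scores against a few human-labeled examples and tightening the rubric until they agree.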
Resources
L1 — Basics