How this is built.
The structural argument behind the benchmark. 48 skills, one anatomy, a deterministic eval loop, and a manifest that refuses to let undocumented files exist.
[Architecture diagram, generated with the architecture-diagram Claude skill.]
The repo is the product.
A learner forks the repo, opens the README, and walks the rubric for each skill. The "weak" column is where the discomfort lives — recognize yourself, mark it. The "competent" column needs evidence from your own shipped work, not from a tutorial you followed. The unchecked boxes become a learning roadmap, prioritized by which ones scared you the most.
Self-assessment is the default surface. The eval harnesses sit one level deeper for engineers who want machine-checkable proof.
Every skill follows the same anatomy.
One uniform structure across all 48 directories — so once you've read one skill, you can navigate any of them in seconds:
- README.md: Self-assessment rubric (7 criteria), competent example, annotated weak example, red flags, hands-on exercise, opinion on what good vs. bad looks like.
- corpus/: Realistic fixture documents (ADRs, runbooks, incident post-mortems, specs) that the learner's solution must retrieve from and cite correctly.
- eval/queries.json: Fixed query set with expected sources and facts. Same input every time.
- eval/eval.py: Deterministic scorer. Evaluates against the rubric criteria (retrieval, fact coverage, citations) and returns per-criterion pass/fail.
- solution.py: Learner's implementation. Reads the corpus, answers the queries, emits answers + sources. Local only: `.gitignore`d so nobody can claim a score they didn't earn.
Run the eval, get a per-criterion scorecard. Fail a criterion, fix the implementation, re-run. The loop is the whole point — it's how a Substack post about RAG becomes a skill you can actually demonstrate.
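To make the loop concrete, here is a minimal sketch of what a deterministic per-criterion scorer could look like. It is not the repo's actual eval/eval.py. The JSON field names (`expected_sources`, `expected_facts`), the corpus file names, and the exact pass rule for each criterion are assumptions made for illustration.

```python
import json

# Hypothetical shape of one eval/queries.json entry; field names are assumed.
QUERIES_JSON = """
[
  {
    "id": "q1",
    "query": "Why did session storage move to Redis?",
    "expected_sources": ["corpus/adr-012.md"],
    "expected_facts": ["redis", "session"]
  }
]
"""

CRITERIA = ["retrieval", "fact_coverage", "citations"]


def score_answer(query: dict, answer: dict) -> dict:
    """Score one answer. Pure set and string checks: no randomness, no model calls."""
    expected = set(query["expected_sources"])
    cited = set(answer.get("sources", []))
    text = answer.get("answer", "").lower()
    return {
        # retrieval: every expected source was returned
        "retrieval": expected <= cited,
        # fact coverage: every expected fact string appears in the answer text
        "fact_coverage": all(f.lower() in text for f in query["expected_facts"]),
        # citations (assumed rule): at least one source, none outside the expected set
        "citations": bool(cited) and cited <= expected,
    }


def scorecard(queries: list, answers: dict) -> dict:
    """A criterion passes only if it passes for every query."""
    results = [score_answer(q, answers.get(q["id"], {})) for q in queries]
    return {c: all(r[c] for r in results) for c in CRITERIA}


queries = json.loads(QUERIES_JSON)

# Hypothetical output of a learner's solution.py: right facts, one stray citation.
answers = {
    "q1": {
        "answer": "Session data moved to Redis so it survives restarts.",
        "sources": ["corpus/adr-012.md", "corpus/runbook-cache.md"],
    }
}

card = scorecard(queries, answers)
for criterion, passed in card.items():
    print(f"{criterion}: {'PASS' if passed else 'FAIL'}")
```

In this sketch the answer passes retrieval and fact coverage but fails citations because of the extra source, which is the kind of single-criterion failure the fix-and-re-run loop is meant to surface.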
If it's not in the manifest, it doesn't exist.
MANIFEST.md is a plain-language index of every file in the repo: what it is, why it's there, what it's for. Adding content without updating the manifest breaks the build. scripts/check-manifest.sh walks every tracked path and every intermediate directory, and verifies that each one is named in the manifest by its full path. No basename tricks, no quiet drift, no orphaned files.
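The check itself is a shell script. The sketch below restates the same idea in Python for illustration only and is not a port of scripts/check-manifest.sh. It assumes the manifest names each path in backticks, that directories are written with a trailing slash, and that the list of tracked files comes from something like `git ls-files`. The example paths are hypothetical.

```python
import re
from pathlib import PurePosixPath


def required_paths(tracked_files):
    """Every tracked file plus each intermediate directory above it."""
    required = set()
    for name in tracked_files:
        path = PurePosixPath(name)
        required.add(path.as_posix())
        for parent in path.parents:
            if parent.as_posix() != ".":  # skip the repo root itself
                # Assumed convention: directories carry a trailing slash.
                required.add(parent.as_posix() + "/")
    return required


def documented_paths(manifest_text):
    """Assumed convention: the manifest names each path in backticks."""
    return set(re.findall(r"`([^`]+)`", manifest_text))


def missing_from_manifest(tracked_files, manifest_text):
    """Paths that must be documented but are not named by full path."""
    return sorted(required_paths(tracked_files) - documented_paths(manifest_text))


# Hypothetical tracked files, as `git ls-files` might list them.
tracked = [
    "MANIFEST.md",
    "skills/rag/README.md",
    "skills/rag/eval/eval.py",
]

# Hypothetical manifest: the last entry uses a basename, so it does not count.
manifest = """
- `MANIFEST.md` this index
- `skills/` all skill directories
- `skills/rag/` the retrieval skill
- `skills/rag/README.md` self-assessment rubric
- `eval.py` the scorer
"""

missing = missing_from_manifest(tracked, manifest)
for path in missing:
    print(f"not in manifest: {path}")
```

Matching on exact full paths is what rules out the basename trick: the `eval.py` entry does not cover `skills/rag/eval/eval.py`, and the undocumented `skills/rag/eval/` directory is reported as well. A CI wrapper would exit non-zero whenever the list is non-empty.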
Why this matters: AI-assisted codebases accumulate undocumented artifacts faster than hand-written ones. The manifest discipline is the difference between a benchmark you can audit and a directory you can't.
Ready to look under the hood?