SkillsBench

Benchmark Results

Does structured knowledge delivered via MCP tools help agents complete tasks better than raw markdown files? We ran a controlled experiment to find out.

What is SkillsBench?

SkillsBench is an open-source benchmark that measures how well AI coding agents perform on real-world tasks when given skill documentation. Each task requires the agent to generate a working Python script, which is then scored automatically with no human judgment involved.

SkillsBench is part of the BenchFlow evaluation suite. It uses deterministic Docker evaluation: the agent's output is executed inside a container and scored by pytest. Either the tests pass or they don't — no ambiguity.

The Question

AI coding agents like Claude Code rely on skill documentation to solve specialized tasks — generating DOCX files, processing PDFs, analyzing financial data. Today, these skills are delivered as plain markdown files (SKILL.md). The agent reads the raw text and must extract instructions, heuristics, and anti-patterns on its own.

OntoSkills takes a different approach: skill knowledge is compiled into structured OWL 2 ontologies and delivered via a single MCP tool (ontoskill). The agent queries for skill knowledge — receiving typed knowledge nodes with severity ratings, anti-patterns with rationale, and curated code examples.

Which approach produces better results?

Methodology

Evaluation

We use SkillsBench (part of BenchFlow), which measures an agent's ability to generate working code for real-world tasks. Each trial follows four steps:

  1. The agent receives the task description and the skill docs.
  2. It generates a Python solution script.
  3. The script runs inside the task's Docker container (via BenchFlow Trial + Harbor).
  4. The Harbor Verifier runs pytest: deterministic, with no human judgment.
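The pass/fail decision in the last step can be sketched as a check over a CTRF (Common Test Report Format) JSON report. This is a minimal sketch, not the Harbor Verifier's actual logic: the function name `ctrf_passed` is hypothetical, the rule that a task passes only when every test passes is an assumption, and the example report is hand-written for illustration rather than real benchmark output.

```python
import json


def ctrf_passed(report_text: str) -> bool:
    """Return True only if at least one test ran and every test passed.

    Assumption: a task counts as solved only when all tests pass;
    the real verifier may treat skipped tests differently.
    """
    summary = json.loads(report_text)["results"]["summary"]
    return summary["tests"] > 0 and summary["passed"] == summary["tests"]


# Hand-written illustrative report (hypothetical test names, not real results).
example_report = json.dumps({
    "results": {
        "tool": {"name": "pytest"},
        "summary": {
            "tests": 2, "passed": 2, "failed": 0,
            "pending": 0, "skipped": 0, "other": 0,
            "start": 0, "stop": 1,
        },
        "tests": [
            {"name": "test_output_exists", "status": "passed", "duration": 10},
            {"name": "test_output_valid", "status": "passed", "duration": 12},
        ],
    }
})

print(ctrf_passed(example_report))
```

Reducing each report to a single boolean keeps the score binary, which matches the "tests pass or they don't" framing used here.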

Setup

Agent: Claude Code CLI (--print --bare)
Model: glm-5.1 (via API proxy)
Infrastructure: BenchFlow Trial + Harbor Verifier
Scoring: Docker + pytest CTRF
Retries: 5 per task (BenchFlow RetryConfig)

Traditional

SKILL.md files in .claude/skills/. The agent uses Claude Code's native file reading, the same as in production.

vs

OntoSkills MCP

Skills compiled to OWL 2, served via OntoMCP. The agent discovers and loads skill knowledge through a single ontoskill tool call — receiving structured, prioritized context.

Same agent, same model, same BenchFlow container management, same Harbor Verifier. The only difference is how skill knowledge is delivered.
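The single-variable design can be made explicit by modelling both arms as configurations and checking that exactly one field differs. This is only an illustrative sketch: `RunConfig`, its field names, and `differing_fields` are hypothetical and are not part of BenchFlow or the benchmark code; the values are taken from the setup described above.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical description of one experimental arm."""
    agent: str
    model: str
    infrastructure: str
    retries: int
    skill_delivery: str  # the only variable that changes between arms


traditional = RunConfig(
    agent="Claude Code CLI (--print --bare)",
    model="glm-5.1",
    infrastructure="BenchFlow Trial + Harbor Verifier",
    retries=5,
    skill_delivery="SKILL.md files in .claude/skills/",
)

# Derive the second arm from the first so every other field stays identical.
ontoskills = replace(traditional, skill_delivery="ontoskill MCP tool")


def differing_fields(a: RunConfig, b: RunConfig) -> list[str]:
    """Return the names of the fields whose values differ between two arms."""
    da, db = asdict(a), asdict(b)
    return sorted(name for name in da if da[name] != db[name])


print(differing_fields(traditional, ontoskills))  # ['skill_delivery']
```

Deriving one arm from the other with `replace` makes it hard to change a second variable by accident.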

Results

Why Structured Knowledge Wins

Traditional SKILL.md files mix instructions, examples, caveats, and anti-patterns in unstructured text. The agent must parse everything at once with no indication of what's critical.

OntoSkills delivers knowledge as typed nodes with severity ratings:

  • CRITICAL rules highlighted first
  • Anti-patterns with explicit rationale explaining why
  • Curated, prioritized view instead of a wall of text
  • 88% token reduction through compact structured format

The result: structured knowledge helps agents solve more tasks correctly with fewer tokens.
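To illustrate what "typed nodes with severity ratings" could look like in practice, here is a minimal sketch that orders nodes so CRITICAL items come first. The class and field names are hypothetical and do not reflect the actual OntoSkills schema; only the CRITICAL level is named above, so the WARNING and INFO levels are assumptions, and the node texts are placeholders.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Lower value means higher priority. Only CRITICAL comes from the text above;
    # WARNING and INFO are assumed levels for illustration.
    CRITICAL = 0
    WARNING = 1
    INFO = 2


@dataclass(frozen=True)
class KnowledgeNode:
    """Hypothetical typed knowledge node."""
    kind: str            # e.g. "rule", "anti_pattern", "example"
    severity: Severity
    text: str
    rationale: str = ""  # why an anti-pattern should be avoided


def prioritize(nodes: list[KnowledgeNode]) -> list[KnowledgeNode]:
    """Return nodes ordered by severity; ties keep their original order."""
    return sorted(nodes, key=lambda node: node.severity)


nodes = [
    KnowledgeNode("example", Severity.INFO, "placeholder code example"),
    KnowledgeNode("anti_pattern", Severity.WARNING, "placeholder anti-pattern",
                  rationale="placeholder reason"),
    KnowledgeNode("rule", Severity.CRITICAL, "placeholder critical rule"),
]

for node in prioritize(nodes):
    print(node.severity.name, node.kind)
```

Because Python's `sorted` is stable, nodes that share a severity keep the order in which the skill author listed them.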

Limitations

  • Sample size: Results are drawn from a pool of 70+ eligible tasks.
  • Single model: All results use glm-5.1 via API proxy. Other models may differ.
  • Single benchmark: SkillsBench tests code generation only. Other benchmarks are planned.

All benchmark code is open source. Run it yourself:

python benchmark/run.py --benchmark skillsbench --mode all5 --max-tasks 25 --model glm-5.1 --attempts 5 --workers 2