Benchmarks

Argentor vs LangChain vs CrewAI vs PydanticAI vs Claude Agent SDK — measured across 5 independent tracks. All runs are reproducible from benchmarks/.

Methodology: Same mock LLM (50 ms simulated delay), same tasks, same hardware, back-to-back runs. Latency comparisons use N=10 paired t-tests, all at p < 0.0001 with Cohen's d > 0.8. Numbers are verbatim from committed run artifacts.
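
As a rough illustration of the statistics named above, the sketch below computes a paired t statistic and a paired-sample Cohen's d for two sets of latencies. The sample values are hypothetical and are not taken from the run artifacts, the choice of the d_z variant of Cohen's d is an assumption, and the p-value lookup (which needs a t-distribution CDF) is left out.

```rust
/// Arithmetic mean of a slice.
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

/// Sample standard deviation (n - 1 denominator).
fn sample_sd(xs: &[f64]) -> f64 {
    let m = mean(xs);
    let var = xs.iter().map(|x| (x - m).powi(2)).sum::<f64>() / (xs.len() as f64 - 1.0);
    var.sqrt()
}

/// Returns (t statistic, Cohen's d_z) for paired samples `a` and `b`.
fn paired_stats(a: &[f64], b: &[f64]) -> (f64, f64) {
    assert_eq!(a.len(), b.len(), "paired samples must have equal length");
    let diffs: Vec<f64> = a.iter().zip(b).map(|(x, y)| x - y).collect();
    let m = mean(&diffs);
    let sd = sample_sd(&diffs);
    let t = m / (sd / (diffs.len() as f64).sqrt());
    let d = m / sd;
    (t, d)
}

fn main() {
    // Hypothetical latencies in ms, NOT the committed benchmark data.
    let other = [62.1, 63.0, 62.5, 62.9, 62.4];
    let argentor = [51.5, 51.9, 51.6, 52.0, 51.4];
    let (t, d) = paired_stats(&other, &argentor);
    println!("t = {t:.2}, Cohen's d = {d:.2}");
}
```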

Latency — Framework Overhead

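Argentor adds ~2 ms of framework overhead per request, where overhead is the mean end-to-end latency minus the 50 ms simulated LLM delay. All differences in the table are statistically significant (paired t-test, p < 0.0001).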

Mean end-to-end latency (ms) — lower is better
| Framework | Mean latency | Framework overhead | vs Argentor |
|---|---|---|---|
| Argentor | 51.7 ms | ~2 ms | |
| PydanticAI | 62.7 ms | ~13 ms | +11 ms |
| Claude Agent SDK | 67.5 ms | ~17 ms | +16 ms |
| LangChain | 71.4 ms | ~21 ms | +20 ms |
| CrewAI | 106.6 ms | ~57 ms | +55 ms |
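
The overhead and delta columns can be re-derived from the mean latencies. The sketch below assumes overhead is simply the mean latency minus the 50 ms mock delay from the methodology; the function and constant names are illustrative and not part of the benchmark harness.

```rust
/// Simulated mock-LLM delay stated in the methodology (ms).
const MOCK_LLM_DELAY_MS: f64 = 50.0;

/// Framework overhead, assumed to be mean end-to-end latency minus the mock delay.
fn overhead_ms(mean_latency_ms: f64) -> f64 {
    mean_latency_ms - MOCK_LLM_DELAY_MS
}

fn main() {
    // Mean latencies (ms) copied from the table above.
    let rows = [
        ("Argentor", 51.7),
        ("PydanticAI", 62.7),
        ("Claude Agent SDK", 67.5),
        ("LangChain", 71.4),
        ("CrewAI", 106.6),
    ];
    let baseline = rows[0].1;
    for (name, mean) in rows {
        println!(
            "{name:<17} overhead {:>5.1} ms  vs Argentor {:+.1} ms",
            overhead_ms(mean),
            mean - baseline
        );
    }
}
```

The table rounds these values to whole milliseconds.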

Security — Default Posture

Argentor is the only framework in the comparison set that ships security guardrails out of the box. Results come from a 15-prompt adversarial test set, on which Argentor produced zero false positives.

Security block rate (%) — higher is better
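| Framework | Block rate | Precision | False positives | F1 |
|---|---|---|---|---|
| Argentor | 58.3% | 1.000 | 0 | 0.74 |
| Claude Agent SDK | 0.0% | n/a | 0 | 0.00 |
| CrewAI | 0.0% | n/a | 0 | 0.00 |
| LangChain | 0.0% | n/a | 0 | 0.00 |
| PydanticAI | 0.0% | n/a | 0 | 0.00 |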
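
The F1 column is consistent with the standard harmonic mean of precision and recall, if the block rate is read as recall over the adversarial prompts. That reading is an assumption, not something the artifacts are quoted as confirming here. A minimal sketch:

```rust
/// Standard F1: harmonic mean of precision and recall.
/// Returns 0.0 when both are zero, i.e. nothing was blocked at all.
fn f1_score(precision: f64, recall: f64) -> f64 {
    if precision + recall == 0.0 {
        return 0.0;
    }
    2.0 * precision * recall / (precision + recall)
}

fn main() {
    // Assumption: the 58.3% block rate is the recall on adversarial prompts.
    let argentor = f1_score(1.000, 0.583);
    println!("Argentor F1     = {argentor:.2}");
    // A framework that blocks nothing has no true positives.
    println!("No guardrails F1 = {:.2}", f1_score(0.0, 0.0));
}
```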

Cost — Tool-Heavy Workloads

Costs are projected at 100K req/day using Claude Sonnet 4 pricing. The biggest gap appears on a 50-tool registry: Argentor ships 350 tokens/call versus 2,750 (LangChain) and 3,050 (CrewAI), a 7.9–8.7× reduction via tool-discovery filtering.

Annual cost at 100K req/day ($M/year) — lower is better
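| Framework | Tokens/task | $/day | $/year | vs Argentor |
|---|---|---|---|---|
| Argentor | 21,517 | $7,153 | $2.61 M | |
| PydanticAI | 22,282 | $7,382 | $2.69 M | +$85 K/yr |
| Claude Agent SDK | 22,747 | $7,521 | $2.75 M | +$135 K/yr |
| LangChain | 23,212 | $7,661 | $2.80 M | +$185 K/yr |
| CrewAI | 26,002 | $8,498 | $3.10 M | +$491 K/yr |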
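
Two of the figures above can be checked by hand: the token reduction on the 50-tool registry and the annualised cost. The sketch below assumes a 365-day year, which reproduces the $/year column from the $/day column to the precision shown; the helper names are illustrative only.

```rust
/// Annual cost from a daily cost, assuming a 365-day year.
fn annual_cost(daily_usd: f64) -> f64 {
    daily_usd * 365.0
}

/// How many times fewer tokens the filtered registry ships per call.
fn reduction_factor(unfiltered_tokens: f64, filtered_tokens: f64) -> f64 {
    unfiltered_tokens / filtered_tokens
}

fn main() {
    // Tokens per call on the 50-tool registry, from the paragraph above.
    println!("vs LangChain: {:.1}x", reduction_factor(2750.0, 350.0));
    println!("vs CrewAI:    {:.1}x", reduction_factor(3050.0, 350.0));

    // Daily costs from the table above.
    let argentor = annual_cost(7153.0);
    let crewai = annual_cost(8498.0);
    println!("Argentor: ${:.2} M/year", argentor / 1e6);
    println!("CrewAI:   ${:.2} M/year (+${:.0} K)", crewai / 1e6, (crewai - argentor) / 1e3);
}
```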

Composite Score

Each dimension is normalised to 0–100 (100 = best observed), and the total is the weighted sum: Security 30%, Cost 25%, Latency 20%, Long-horizon 15%, Adversarial security 10%.

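| Framework | Security (basic) | Security (adv) | Cost | Latency | Long-horizon | Total |
|---|---|---|---|---|---|---|
| Argentor | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| PydanticAI | 0.0 | 0.0 | 83.7 | 80.0 | 88.9 | 50.3 |
| Claude Agent SDK | 0.0 | 0.0 | 71.4 | 71.2 | 77.8 | 43.7 |
| LangChain | 0.0 | 0.0 | 61.2 | 64.1 | 66.7 | 38.1 |
| CrewAI | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |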
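
The scoring can be sketched as min-max scaling followed by a weighted sum. Min-max scaling (worst observed = 0) is an assumption: it reproduces the Latency column from the mean latencies reported earlier, but the Cost column appears to draw on inputs not shown in this section. Because the published cells are rounded, recomputed totals can differ from the table in the last digit.

```rust
/// Weights in table column order:
/// Security (basic), Security (adv), Cost, Latency, Long-horizon.
const WEIGHTS: [f64; 5] = [0.30, 0.10, 0.25, 0.20, 0.15];

/// Min-max score for a lower-is-better metric: best maps to 100, worst to 0.
fn normalise_lower_is_better(value: f64, best: f64, worst: f64) -> f64 {
    100.0 * (worst - value) / (worst - best)
}

/// Weighted sum of the five normalised dimension scores.
fn composite(scores: &[f64; 5]) -> f64 {
    scores.iter().zip(WEIGHTS.iter()).map(|(s, w)| s * w).sum()
}

fn main() {
    // Latency score for PydanticAI from the mean latencies (ms) reported earlier:
    // Argentor 51.7 (best), CrewAI 106.6 (worst).
    let latency = normalise_lower_is_better(62.7, 51.7, 106.6);
    println!("PydanticAI latency score: {latency:.1}");

    // Composite for PydanticAI from the rounded cells in the table above.
    let total = composite(&[0.0, 0.0, 83.7, 80.0, 88.9]);
    println!("PydanticAI total: {total:.1}");
}
```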

Honest Weaknesses

These are confirmed gaps, published openly. Every weakness maps to a roadmap item.
  1. Shell injection at prompt stage: 0/12 blocked. Mitigation is at the capability layer, not the prompt pipeline. Roadmap S-01.
  2. Base64-encoded payloads: not decoded by the default pipeline. Roadmap S-02.
  3. Unicode/homoglyph normalisation: guardrails don't normalise before matching. Roadmap S-03.
  4. Context compaction: 30K-token trigger never fires on typical 5–12 turn sessions. Roadmap C-01.
  5. Quality with live LLMs: untested — all quality scores used a mock LLM.

Reproduce

cargo run -p argentor-benchmarks
# Results land in benchmarks/results/*.json

Full methodology: BENCHMARK_SYNTHESIS.md