Benchmarks

Argentor vs LangChain vs CrewAI vs PydanticAI vs Claude Agent SDK — measured across 5 independent tracks. All runs are reproducible from benchmarks/.

Methodology: every framework runs against the same mock LLM (50 ms simulated delay), the same tasks, and the same hardware, in back-to-back runs. Latency comparisons use paired t-tests (N=10); all are significant at p < 0.0001 with Cohen's d > 0.8. Numbers are verbatim from committed run artifacts.
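As a rough illustration of the statistics named above, the following Rust sketch computes a paired t statistic and a Cohen's d from two equally long latency samples. It assumes d is computed on the paired differences (mean difference divided by the standard deviation of the differences); the benchmark suite may use a different variant. The function names and the sample values in main are hypothetical and are not taken from the run artifacts.

```rust
/// Arithmetic mean of a slice.
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

/// Sample standard deviation (n - 1 denominator).
fn sample_std(xs: &[f64]) -> f64 {
    let m = mean(xs);
    let var = xs.iter().map(|x| (x - m).powi(2)).sum::<f64>() / (xs.len() as f64 - 1.0);
    var.sqrt()
}

/// Returns (t statistic, Cohen's d on the paired differences) for samples `a` and `b`.
/// Assumption: d = mean(diff) / std(diff); other definitions of d exist.
fn paired_t_and_d(a: &[f64], b: &[f64]) -> (f64, f64) {
    assert_eq!(a.len(), b.len(), "paired samples must have equal length");
    assert!(a.len() >= 2, "need at least two pairs");
    let diffs: Vec<f64> = a.iter().zip(b).map(|(x, y)| x - y).collect();
    let n = diffs.len() as f64;
    let mean_diff = mean(&diffs);
    let std_diff = sample_std(&diffs);
    (mean_diff / (std_diff / n.sqrt()), mean_diff / std_diff)
}

fn main() {
    // Hypothetical latencies in ms for two frameworks, N = 10 paired runs.
    // These values are illustrative only, not measured data.
    let other = [62.1, 63.0, 62.5, 62.9, 62.4, 63.2, 62.6, 62.8, 62.3, 63.1];
    let baseline = [51.5, 51.9, 51.6, 51.8, 51.7, 52.0, 51.4, 51.8, 51.6, 51.9];
    let (t, d) = paired_t_and_d(&other, &baseline);
    println!("t = {:.2}, Cohen's d = {:.2}", t, d);
}
```

The p-value would then come from a t distribution with N - 1 degrees of freedom, which this sketch leaves out.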

Latency — Framework Overhead

Argentor adds ~2 ms of framework overhead per request, where overhead is mean latency minus the 50 ms simulated LLM delay. All differences are statistically significant (p < 0.0001).

Mean end-to-end latency (ms) — lower is better

Framework	Mean latency	Framework overhead	vs Argentor
Argentor	51.7 ms	~2 ms	—
PydanticAI	62.7 ms	~13 ms	+11 ms
Claude Agent SDK	67.5 ms	~17 ms	+16 ms
LangChain	71.4 ms	~21 ms	+20 ms
CrewAI	106.6 ms	~57 ms	+55 ms
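The overhead and delta columns above are consistent with simple subtraction: mean latency minus the 50 ms mock delay, and mean latency minus Argentor's mean. A minimal Rust sketch of that arithmetic follows, using the table's mean latencies; the function and constant names are illustrative and not part of the Argentor codebase.

```rust
/// Simulated LLM delay applied by the mock in every run (from the methodology).
const MOCK_LLM_DELAY_MS: f64 = 50.0;

/// Framework overhead: mean end-to-end latency minus the simulated LLM delay.
fn framework_overhead_ms(mean_latency_ms: f64) -> f64 {
    mean_latency_ms - MOCK_LLM_DELAY_MS
}

/// Extra latency relative to a baseline framework.
fn delta_vs_baseline_ms(mean_latency_ms: f64, baseline_ms: f64) -> f64 {
    mean_latency_ms - baseline_ms
}

fn main() {
    // Mean latencies in ms, copied from the table above.
    let rows = [
        ("Argentor", 51.7),
        ("PydanticAI", 62.7),
        ("Claude Agent SDK", 67.5),
        ("LangChain", 71.4),
        ("CrewAI", 106.6),
    ];
    let baseline = rows[0].1;
    for (name, mean) in rows.iter() {
        println!(
            "{:<17} overhead {:>5.1} ms, vs Argentor {:+.1} ms",
            name,
            framework_overhead_ms(*mean),
            delta_vs_baseline_ms(*mean, baseline)
        );
    }
}
```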

Security — Default Posture

Argentor is the only framework in the comparison set that ships security guardrails out of the box. Results come from a 15-prompt adversarial test set, with zero false positives (precision 1.00).

Security block rate (%) — higher is better

Framework	Block rate	Precision	F1
Argentor	58.3%	1.00	0.74
Claude Agent SDK	0.0%	—	0.00
CrewAI	0.0%	—	0.00
LangChain	0.0%	—	0.00
PydanticAI	0.0%	—	0.00
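The F1 column can be checked from the other two, assuming the block rate is used as recall (the share of attack prompts blocked). The Rust sketch below treats an undefined precision, shown as a dash in the table, as an F1 of 0.0; the function name is illustrative.

```rust
/// F1 as the harmonic mean of precision and recall.
/// `precision` is None when nothing was blocked, so precision is undefined.
fn f1_score(precision: Option<f64>, recall: f64) -> f64 {
    match precision {
        Some(p) if p + recall > 0.0 => 2.0 * p * recall / (p + recall),
        _ => 0.0,
    }
}

fn main() {
    // Assumption: the 58.3% block rate is the recall over attack prompts.
    let argentor = f1_score(Some(1.00), 0.583);
    // Frameworks that block nothing have no defined precision.
    let no_guardrails = f1_score(None, 0.0);
    println!("Argentor F1 = {:.2}", argentor);
    println!("No guardrails F1 = {:.2}", no_guardrails);
}
```

With precision 1.00 and recall 0.583 this gives roughly 0.74, matching the table.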

Cost — Tool-Heavy Workloads

Costs are projected at 100K req/day using Claude Sonnet 4 pricing. The biggest gap appears on a 50-tool registry: Argentor ships 350 tokens/call versus 2,750 (LangChain) and 3,050 (CrewAI), a 7.9–8.7× reduction achieved through tool-discovery filtering.

Annual cost at 100K req/day ($M/year) — lower is better

Framework	Tokens/task	$/day	$/year	vs Argentor
Argentor	21,517	$7,153	$2.61 M	—
PydanticAI	22,282	$7,382	$2.69 M	+$85 K/yr
Claude Agent SDK	22,747	$7,521	$2.75 M	+$135 K/yr
LangChain	23,212	$7,661	$2.80 M	+$185 K/yr
CrewAI	26,002	$8,498	$3.10 M	+$491 K/yr
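The annual figures are consistent with multiplying the daily cost by 365, and the reduction factor with dividing the competitor's tokens/call by Argentor's. A short Rust sketch of both calculations follows; the function names are illustrative. Because the $/day values in the table are rounded, deltas recomputed this way may differ slightly from the published "vs Argentor" column.

```rust
/// Annual cost from a daily cost, assuming a 365-day year.
fn annual_cost_usd(daily_cost_usd: f64) -> f64 {
    daily_cost_usd * 365.0
}

/// How many times more tokens per call the unfiltered registry ships.
fn reduction_factor(unfiltered_tokens: f64, filtered_tokens: f64) -> f64 {
    unfiltered_tokens / filtered_tokens
}

fn main() {
    // Daily cost in USD, copied from the table above.
    let argentor_daily = 7153.0;
    let crewai_daily = 8498.0;
    println!("Argentor: ${:.2} M/year", annual_cost_usd(argentor_daily) / 1e6);
    println!("CrewAI:   ${:.2} M/year", annual_cost_usd(crewai_daily) / 1e6);

    // Tokens per call on the 50-tool registry, from the paragraph above.
    println!("vs LangChain: {:.1}x", reduction_factor(2750.0, 350.0));
    println!("vs CrewAI:    {:.1}x", reduction_factor(3050.0, 350.0));
}
```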

Composite Score

Each dimension is normalised to 0–100 (100 = best observed), then combined with these weights: Security (basic) 30%, Cost 25%, Latency 20%, Long-horizon 15%, Security (adversarial) 10%.

Framework	Security (basic)	Security (adv)	Cost	Latency	Long-horizon	Total
Argentor	100.0	100.0	100.0	100.0	100.0	100.0
PydanticAI	0.0	0.0	83.7	80.0	88.9	50.3
Claude Agent SDK	0.0	0.0	71.4	71.2	77.8	43.7
LangChain	0.0	0.0	61.2	64.1	66.7	38.1
CrewAI	0.0	0.0	0.0	0.0	0.0	0.0
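The Total column can be reproduced as a weighted sum of the per-dimension scores. The Rust sketch below assumes the 30% weight applies to the Security (basic) column and the 10% weight to Security (adv). It also includes a min-max normalisation helper, which matches the Latency column when applied to the mean latencies reported earlier; whether the other dimensions are normalised the same way is not stated here. All names are illustrative.

```rust
/// Per-dimension scores, each already normalised to 0..=100.
struct Scores {
    security_basic: f64,
    security_adv: f64,
    cost: f64,
    latency: f64,
    long_horizon: f64,
}

/// Weighted composite using the weights stated above.
/// Assumption: 30% maps to Security (basic), 10% to Security (adv).
fn composite(s: &Scores) -> f64 {
    0.30 * s.security_basic
        + 0.25 * s.cost
        + 0.20 * s.latency
        + 0.15 * s.long_horizon
        + 0.10 * s.security_adv
}

/// Min-max normalisation for a lower-is-better metric:
/// the best observed value maps to 100, the worst to 0.
fn normalise_lower_is_better(value: f64, best: f64, worst: f64) -> f64 {
    100.0 * (worst - value) / (worst - best)
}

fn main() {
    // PydanticAI row from the table above.
    let pydantic_ai = Scores {
        security_basic: 0.0,
        security_adv: 0.0,
        cost: 83.7,
        latency: 80.0,
        long_horizon: 88.9,
    };
    println!("PydanticAI total: {:.1}", composite(&pydantic_ai));

    // Latency score for PydanticAI from the mean latencies (ms):
    // 62.7 against best 51.7 (Argentor) and worst 106.6 (CrewAI).
    let latency_score = normalise_lower_is_better(62.7, 51.7, 106.6);
    println!("PydanticAI latency score: {:.1}", latency_score);
}
```

Totals recomputed from the rounded table entries can differ from the published ones in the last decimal place.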

Honest Weaknesses

These are confirmed gaps, published openly. Each one lists its roadmap item where one exists.

Shell injection at prompt stage: 0/12 blocked. Mitigation is at the capability layer, not the prompt pipeline. Roadmap S-01.
Base64-encoded payloads: not decoded by the default pipeline. Roadmap S-02.
Unicode/homoglyph normalisation: guardrails don't normalise before matching. Roadmap S-03.
Context compaction: 30K-token trigger never fires on typical 5–12 turn sessions. Roadmap C-01.
Quality with live LLMs: untested — all quality scores used a mock LLM.

Reproduce

cargo run -p argentor-benchmarks
# Results land in benchmarks/results/*.json

Full methodology: BENCHMARK_SYNTHESIS.md