Benchmarks
Argentor vs LangChain vs CrewAI vs PydanticAI vs Claude Agent SDK, measured across five independent tracks. All runs are reproducible from benchmarks/.
Latency — Framework Overhead
Argentor adds ~2 ms of framework overhead per request. All differences are statistically significant.
| Framework | Mean latency | Framework overhead | vs Argentor |
|---|---|---|---|
| Argentor | 51.7 ms | ~2 ms | — |
| PydanticAI | 62.7 ms | ~13 ms | +11 ms |
| Claude Agent SDK | 67.5 ms | ~17 ms | +16 ms |
| LangChain | 71.4 ms | ~21 ms | +20 ms |
| CrewAI | 106.6 ms | ~57 ms | +55 ms |
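The "vs Argentor" column is the difference in mean latency against the Argentor row. A minimal Rust sketch of that arithmetic follows, using the means from the table; the function name is illustrative and is not taken from the benchmark harness.

```rust
/// Difference in mean latency (ms) relative to a reference framework.
/// Hypothetical helper, not part of argentor-benchmarks.
fn delta_vs_reference(mean_ms: f64, reference_ms: f64) -> f64 {
    mean_ms - reference_ms
}

fn main() {
    // Mean latencies copied from the table above.
    let argentor_ms = 51.7;
    let rows = [
        ("PydanticAI", 62.7),
        ("Claude Agent SDK", 67.5),
        ("LangChain", 71.4),
        ("CrewAI", 106.6),
    ];
    for (name, mean_ms) in rows.iter() {
        let delta = delta_vs_reference(*mean_ms, argentor_ms);
        println!("{}: +{:.1} ms vs Argentor", name, delta);
    }
}
```

The table rounds these deltas to whole milliseconds.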
Security — Default Posture
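Argentor is the only framework in the comparison set that ships security guardrails out of the box. Results come from a 15-prompt adversarial test set, with zero false positives. Precision is undefined (shown as a dash) for frameworks that block nothing.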
| Framework | Block rate | Precision | False positives | F1 |
|---|---|---|---|---|
| Argentor | 58.3% | 1.00 | 0 | 0.74 |
| Claude Agent SDK | 0.0% | — | 0 | 0.00 |
| CrewAI | 0.0% | — | 0 | 0.00 |
| LangChain | 0.0% | — | 0 | 0.00 |
| PydanticAI | 0.0% | — | 0 | 0.00 |
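F1 is the harmonic mean of precision and recall. The sketch below assumes the "Block rate" column is recall on the attack prompts (the section does not define it explicitly) and treats an undefined precision as F1 = 0, which is how the rows without guardrails appear to be scored. The function name is illustrative.

```rust
/// F1 score as the harmonic mean of precision and recall.
/// `precision` is `None` when nothing was blocked (0/0); F1 is then taken as 0.0.
fn f1_score(precision: Option<f64>, recall: f64) -> f64 {
    match precision {
        Some(p) if p + recall > 0.0 => 2.0 * p * recall / (p + recall),
        _ => 0.0,
    }
}

fn main() {
    // Assumption: block rate (58.3%) is used as recall.
    let argentor = f1_score(Some(1.00), 0.583);
    // Frameworks that block nothing have no defined precision.
    let no_guardrails = f1_score(None, 0.0);
    println!("Argentor F1 = {:.2}", argentor);
    println!("No-guardrail F1 = {:.2}", no_guardrails);
}
```

With precision 1.00 and recall 0.583 this evaluates to about 0.74, in line with the Argentor row.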
Cost — Tool-Heavy Workloads
Costs are projected at 100 K req/day using Claude Sonnet 4 pricing. The biggest gap is on a 50-tool registry: Argentor sends 350 tokens/call versus 2,750 (LangChain) and 3,050 (CrewAI), a 7.9–8.7× reduction achieved through tool-discovery filtering.
| Framework | Tokens/task | $/day | $/year | vs Argentor |
|---|---|---|---|---|
| Argentor | 21,517 | $7,153 | $2.61 M | — |
| PydanticAI | 22,282 | $7,382 | $2.69 M | +$85 K/yr |
| Claude Agent SDK | 22,747 | $7,521 | $2.75 M | +$135 K/yr |
| LangChain | 23,212 | $7,661 | $2.80 M | +$185 K/yr |
| CrewAI | 26,002 | $8,498 | $3.10 M | +$491 K/yr |
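The yearly figures and the reduction factor can be checked with simple arithmetic. The sketch below assumes a 365-day year, which is consistent with the table ($7,153 × 365 ≈ $2.61 M); the helper names are illustrative and not taken from the benchmark code.

```rust
/// Annualise a daily cost in USD, assuming a 365-day year.
fn yearly_cost(daily_usd: f64) -> f64 {
    daily_usd * 365.0
}

/// Extra yearly spend of `other` relative to `reference`, both given as $/day.
fn extra_per_year(other_daily_usd: f64, reference_daily_usd: f64) -> f64 {
    yearly_cost(other_daily_usd) - yearly_cost(reference_daily_usd)
}

/// How many times more tokens per call `other` sends than `reference`.
fn reduction_factor(other_tokens: f64, reference_tokens: f64) -> f64 {
    other_tokens / reference_tokens
}

fn main() {
    // $/day values copied from the table above.
    println!("Argentor: ${:.0}/year", yearly_cost(7153.0));
    println!("CrewAI extra: ${:.0}/year", extra_per_year(8498.0, 7153.0));
    // Tokens per call on the 50-tool registry.
    println!("vs LangChain: {:.1}x", reduction_factor(2750.0, 350.0));
    println!("vs CrewAI: {:.1}x", reduction_factor(3050.0, 350.0));
}
```

Small differences from the table are possible because the published $/day values are themselves rounded.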
Composite Score
Each dimension is normalised to 0–100 (100 = best observed, 0 = worst observed) and then weighted: Security (basic) 30%, Cost 25%, Latency 20%, Long-horizon 15%, Security (adv) 10%.
| Framework | Security (basic) | Security (adv) | Cost | Latency | Long-horizon | Total |
|---|---|---|---|---|---|---|
| Argentor | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| PydanticAI | 0.0 | 0.0 | 83.7 | 80.0 | 88.9 | 50.3 |
| Claude Agent SDK | 0.0 | 0.0 | 71.4 | 71.2 | 77.8 | 43.7 |
| LangChain | 0.0 | 0.0 | 61.2 | 64.1 | 66.7 | 38.1 |
| CrewAI | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
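The Total column is a weighted sum of the per-dimension scores. The sketch below applies the stated weights, assuming the 30% weight belongs to "Security (basic)" and the 10% weight to "Security (adv)". It also includes a min-max normalisation for lower-is-better metrics; that formula is an assumption, although it does reproduce the Latency column from the mean latencies reported earlier. The type and function names are illustrative.

```rust
/// Normalised scores (0-100) for one framework, one field per table column.
struct Scores {
    security_basic: f64,
    security_adv: f64,
    cost: f64,
    latency: f64,
    long_horizon: f64,
}

/// Min-max normalisation for a lower-is-better metric: best -> 100, worst -> 0.
/// Assumption: this is how "100 = best observed" is applied.
fn normalise_lower_is_better(value: f64, best: f64, worst: f64) -> f64 {
    100.0 * (worst - value) / (worst - best)
}

/// Weighted total using the stated weights (30/10/25/20/15).
fn composite(s: &Scores) -> f64 {
    0.30 * s.security_basic
        + 0.10 * s.security_adv
        + 0.25 * s.cost
        + 0.20 * s.latency
        + 0.15 * s.long_horizon
}

fn main() {
    // PydanticAI row copied from the table above.
    let pydantic_ai = Scores {
        security_basic: 0.0,
        security_adv: 0.0,
        cost: 83.7,
        latency: 80.0,
        long_horizon: 88.9,
    };
    println!("PydanticAI total = {:.1}", composite(&pydantic_ai));
    // Latency score from mean latencies: 62.7 ms, best 51.7 ms, worst 106.6 ms.
    let latency_score = normalise_lower_is_better(62.7, 51.7, 106.6);
    println!("PydanticAI latency score = {:.1}", latency_score);
}
```

For the PydanticAI row the weighted sum comes to roughly 50.3, matching the Total column.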
Honest Weaknesses
- Shell injection at prompt stage: 0/12 blocked. Mitigation is at the capability layer, not the prompt pipeline. Roadmap S-01.
- Base64-encoded payloads: not decoded by the default pipeline. Roadmap S-02.
- Unicode/homoglyph normalisation: guardrails don't normalise before matching. Roadmap S-03.
- Context compaction: the 30K-token trigger never fires on typical 5–12 turn sessions. Roadmap C-01.
- Quality with live LLMs: untested. All quality scores used a mock LLM.
Reproduce
cargo run -p argentor-benchmarks # Results land in benchmarks/results/*.json
Full methodology: BENCHMARK_SYNTHESIS.md