Metrics
- Recall is the fraction of attack samples that Firefish catches.
- Precision is the fraction of flagged samples that are actually attacks.
- False positive rate is the fraction of benign samples that were flagged or blocked.
- p95 latency is the 95th-percentile latency for a benchmark profile, which captures tail latency rather than the average.
- Judge call rate is the fraction of samples for which the expensive judge routing ran.
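As a rough illustration of how the metrics above relate to raw per-sample results, the sketch below computes them from a list of labeled outcomes. The `Sample` fields, the `summarize` function, and the nearest-rank percentile method are assumptions made for this example; they are not taken from the Firefish code, which may compute these values differently.

```python
import math
from dataclasses import dataclass


@dataclass
class Sample:
    """Hypothetical per-sample benchmark result (field names are assumptions)."""
    is_attack: bool      # ground-truth label
    flagged: bool        # True if the sample was flagged or blocked
    latency_ms: float    # end-to-end latency for this sample
    judge_called: bool   # True if the expensive judge routing ran


def safe_div(numerator: float, denominator: float) -> float:
    """Return 0.0 instead of raising when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def p95(values: list[float]) -> float:
    """95th percentile using the nearest-rank method (one of several conventions)."""
    if not values:
        return 0.0
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def summarize(samples: list[Sample]) -> dict[str, float]:
    # Confusion-matrix counts, treating "flagged" as the positive prediction.
    tp = sum(s.is_attack and s.flagged for s in samples)
    fn = sum(s.is_attack and not s.flagged for s in samples)
    fp = sum((not s.is_attack) and s.flagged for s in samples)
    tn = sum((not s.is_attack) and not s.flagged for s in samples)
    judge_calls = sum(s.judge_called for s in samples)
    return {
        "recall": safe_div(tp, tp + fn),
        "precision": safe_div(tp, tp + fp),
        "false_positive_rate": safe_div(fp, fp + tn),
        "p95_latency_ms": p95([s.latency_ms for s in samples]),
        "judge_call_rate": safe_div(judge_calls, len(samples)),
    }
```

Note that recall and false positive rate use different denominators (attack samples and benign samples respectively), so they can move independently of precision when the attack-to-benign ratio of the dataset changes.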
Source-aware benchmarking
Firefish reports source-type breakdowns because direct user prompts, RAG chunks, webpages, PDFs, email, tool output, and agent memory have different trust boundaries, so a single aggregate score can hide weak results on one source type.
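A per-source breakdown of this kind can be sketched as a simple group-by over labeled results. The record shape `(source_type, is_attack, flagged)`, the function name `breakdown_by_source`, and the source labels used in the usage note are hypothetical and chosen only for illustration; the actual report format produced by Firefish is not shown here.

```python
from collections import defaultdict


def breakdown_by_source(records):
    """Group results by source type and compute recall and false positive rate.

    `records` is an iterable of (source_type, is_attack, flagged) tuples.
    This record shape is an assumption made for the example.
    """
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
    for source_type, is_attack, flagged in records:
        if is_attack:
            key = "tp" if flagged else "fn"
        else:
            key = "fp" if flagged else "tn"
        counts[source_type][key] += 1

    report = {}
    for source_type, c in counts.items():
        attacks = c["tp"] + c["fn"]
        benign = c["fp"] + c["tn"]
        report[source_type] = {
            # None marks a metric that is undefined for this source type
            # (for example, recall when the group has no attack samples).
            "recall": c["tp"] / attacks if attacks else None,
            "false_positive_rate": c["fp"] / benign if benign else None,
            "samples": attacks + benign,
        }
    return report
```

For example, calling `breakdown_by_source` on records labeled `"rag_chunk"` and `"email"` returns one entry per label, which makes it visible when, say, recall is acceptable overall but low for one source type.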
Run locally
Generated benchmark artifacts remain local and are ignored by Git. To run the benchmark:
python benchmarks/run_benchmark.py