This is not vendor theater. Firefish ships with reproducible security tests.
Review recall, precision, false positives, source coverage, profile latency, judge routing, and anomaly-layer usage from the latest local benchmark run.
Run the reproducible benchmark to populate this page.
Firefish will read the latest local metrics from data/benchmark_results/metrics.json when present.
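As a rough illustration of the "when present" behavior, the sketch below loads the metrics file if it exists and returns nothing otherwise. Only the path comes from this page. The function name load_latest_metrics and the assumption that the file holds a single JSON object are hypothetical, not Firefish's actual loader.

```python
import json
from pathlib import Path
from typing import Optional

# Path taken from the page; everything else here is an illustrative assumption.
METRICS_PATH = Path("data/benchmark_results/metrics.json")


def load_latest_metrics(path: Path = METRICS_PATH) -> Optional[dict]:
    """Return the parsed metrics, or None when no benchmark has been run yet."""
    if not path.is_file():
        # No local run yet: the caller can show an empty state instead of failing.
        return None
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


if __name__ == "__main__":
    metrics = load_latest_metrics()
    if metrics is None:
        print("No metrics found. Run the benchmark first.")
    else:
        print(sorted(metrics))
```

Returning None instead of raising keeps the missing-file case an ordinary state that a page can render, which matches the "run the benchmark to populate this page" behavior described above.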
What the benchmark proves and what it does not.
What is measured
Detection recall, precision, false positive rate, latency percentiles, judge call rate, anomaly routing, source-type coverage, and attack-type coverage on the bundled corpus.
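The three headline rates follow the standard confusion-matrix definitions. The sketch below shows how they could be derived from raw counts. The Confusion class, the function name, and the example counts are hypothetical and do not reflect Firefish's internal code or real results.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Hypothetical container for detector outcomes on a labeled corpus."""
    tp: int  # attacks flagged
    fp: int  # benign items flagged
    tn: int  # benign items passed
    fn: int  # attacks missed


def _ratio(numerator: int, denominator: int) -> float:
    # Avoid ZeroDivisionError on an empty class; report 0.0 instead.
    return numerator / denominator if denominator else 0.0


def detection_rates(c: Confusion) -> dict:
    return {
        "recall": _ratio(c.tp, c.tp + c.fn),
        "precision": _ratio(c.tp, c.tp + c.fp),
        "false_positive_rate": _ratio(c.fp, c.fp + c.tn),
    }


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    print(detection_rates(Confusion(tp=90, fp=5, tn=95, fn=10)))
```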
What is not measured
The benchmark does not claim universal coverage against every future attack, private corpus, hosted model, or production integration shape.
Why recall alone is not enough
A detector can catch attacks and still be hard to operate if it floods teams with false positives or hides costly routing behavior.
Why false positives matter
Security education, defensive reports, and normal developer workflows should pass through unblocked unless their content becomes an active instruction or an exfiltration path.
Why latency matters
A gateway sits in the request path. Profile comparison makes the tradeoff between security layers and p95 response time visible.
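To make the profile comparison concrete, here is a minimal sketch that computes p95 latency per profile with the nearest-rank method. The profile names and sample values in the usage part are invented for illustration, and Firefish may use a different percentile method.

```python
import math
from typing import Dict, List


def percentile(samples: List[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def p95_by_profile(latencies_ms: Dict[str, List[float]]) -> Dict[str, float]:
    """Map each profile name to its p95 latency in milliseconds."""
    return {name: percentile(vals, 95) for name, vals in latencies_ms.items()}


if __name__ == "__main__":
    # Hypothetical profiles and timings, not measured results.
    runs = {
        "fast": [4.0, 5.0, 5.5, 6.0],
        "strict": [12.0, 15.0, 18.0, 40.0],
    }
    for name, p95 in p95_by_profile(runs).items():
        print(f"{name}: p95 = {p95} ms")
```

Nearest-rank always returns an observed sample, so a single slow request in a small run shows up directly in the reported p95 rather than being smoothed by interpolation.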
Why source type matters
Direct user prompts and untrusted retrieved content carry different levels of risk. Firefish reports coverage by USER_PROMPT, RAG, web, PDF, email, tools, and memory.
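A per-source breakdown of this kind can be sketched as a simple group-by over labeled results. The record fields (source_type, is_attack, flagged) and the function name are assumptions made for this example. Only the source-type labels come from the page.

```python
from collections import defaultdict
from typing import Dict, Iterable


def recall_by_source(records: Iterable[dict]) -> Dict[str, float]:
    """Fraction of attack samples flagged, grouped by source type."""
    caught = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        if not rec["is_attack"]:
            continue  # benign samples do not count toward recall
        total[rec["source_type"]] += 1
        if rec["flagged"]:
            caught[rec["source_type"]] += 1
    return {src: caught[src] / total[src] for src in total}


if __name__ == "__main__":
    # Hypothetical records, for illustration only.
    sample = [
        {"source_type": "USER_PROMPT", "is_attack": True, "flagged": True},
        {"source_type": "RAG", "is_attack": True, "flagged": True},
        {"source_type": "RAG", "is_attack": True, "flagged": False},
        {"source_type": "email", "is_attack": False, "flagged": False},
    ]
    print(recall_by_source(sample))
```

Source types with no attack samples are omitted from the result, which makes gaps in corpus coverage visible instead of reporting them as a misleading zero.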
Reproduce the proof on your machine.
Generated benchmark artifacts stay local and are ignored by Git. Public pages show only metrics, IDs, hashes, and safe summaries.
Command
python benchmarks/run_benchmark.py
python scripts/clean_repo.py --check