Benchmark / Proof Center

This is not vendor theater. Firefish ships with reproducible security tests.

Review recall, precision, false positives, source coverage, profile latency, judge routing, and anomaly-layer usage from the latest local benchmark run.

No local benchmark artifact

Run the reproducible benchmark to populate this page.

Firefish will read the latest local metrics from data/benchmark_results/metrics.json when present.
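
The "when present" behavior can be sketched as below. This is a minimal illustration, not Firefish's actual loader: only the file path comes from the text above, while the function name `load_latest_metrics` and the fallback to `None` are assumptions. The internal schema of the metrics file is not documented here, so the sketch only lists whatever top-level keys it finds.

```python
import json
from pathlib import Path
from typing import Optional

# Path taken from the page text; everything else in this sketch is hypothetical.
METRICS_PATH = Path("data/benchmark_results/metrics.json")


def load_latest_metrics(path: Path = METRICS_PATH) -> Optional[dict]:
    """Return parsed metrics, or None if the artifact is missing or unreadable."""
    if not path.is_file():
        return None
    try:
        with path.open(encoding="utf-8") as fh:
            data = json.load(fh)
    except (OSError, json.JSONDecodeError):
        return None
    # Assumption: the artifact is a JSON object at the top level.
    return data if isinstance(data, dict) else None


metrics = load_latest_metrics()
if metrics is None:
    print("No local benchmark artifact")
else:
    print("Top-level metric keys:", sorted(metrics))
```

Returning `None` instead of raising keeps the empty state and a corrupt artifact on the same code path, which is one reasonable choice for a page that must render either way.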

python benchmarks/run_benchmark.py

Methodology

What the benchmark proves and what it does not.

What is measured

Detection recall, precision, false positive rate, latency percentiles, judge call rate, anomaly routing, source-type coverage, and attack-type coverage on the bundled corpus.
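
For reference, the three detection metrics follow the standard confusion-matrix definitions. The sketch below assumes each benchmark case is labeled attack or benign and the detector either flags it or not; the `Confusion` class and `detection_metrics` function are hypothetical names, not part of the Firefish codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    tp: int  # attacks that were flagged
    fp: int  # benign cases that were flagged
    tn: int  # benign cases that were allowed
    fn: int  # attacks that were missed


def _ratio(numerator: int, denominator: int) -> float:
    # Report 0.0 for an empty denominator instead of raising.
    return numerator / denominator if denominator else 0.0


def detection_metrics(c: Confusion) -> dict:
    return {
        "recall": _ratio(c.tp, c.tp + c.fn),
        "precision": _ratio(c.tp, c.tp + c.fp),
        "false_positive_rate": _ratio(c.fp, c.fp + c.tn),
    }


# Illustrative counts only, not results from any real run.
print(detection_metrics(Confusion(tp=90, fp=5, tn=95, fn=10)))
```

A detector that flags every input would score a recall of 1.0 and a false positive rate of 1.0 under these definitions, which is why the metrics are reported together.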

What is not measured

The benchmark does not claim universal coverage against every future attack, every private corpus, every hosted model, or every production integration shape.

Why recall alone is not enough

A detector can catch attacks and still be hard to operate if it floods teams with false positives or hides costly routing behavior.

Why false positives matter

Security education material, defensive reports, and normal developer workflows should pass through unblocked unless they contain active instructions or open exfiltration paths.

Why latency matters

A gateway sits in the request path. Profile comparison makes the tradeoff between security layers and p95 response time visible.
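
As a rough illustration of a profile comparison, the sketch below computes latency percentiles with the nearest-rank method. The profile names and latency samples are made up for the example, and Firefish may compute its percentiles differently (for instance with interpolation).

```python
def percentile(samples: list, pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range 1..100")
    ordered = sorted(samples)
    # Integer ceiling of pct/100 * n, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical per-profile latency samples in milliseconds.
latencies_ms = {
    "profile_a": [12, 14, 15, 13, 40],
    "profile_b": [30, 33, 35, 31, 120],
}

for name, samples in latencies_ms.items():
    print(name, "p50:", percentile(samples, 50), "p95:", percentile(samples, 95))
```

Looking at p95 rather than the mean surfaces the slow tail that users in the request path actually feel when an extra security layer is enabled.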

Why source type matters

Direct user prompts and untrusted retrieved content carry different levels of risk. Firefish reports coverage by USER_PROMPT, RAG, web, PDF, email, tools, and memory.
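
One way to read "coverage by source type" is recall computed separately for each source. The sketch below takes that interpretation, which is an assumption. The `(source, is_attack, flagged)` record shape and the function name are hypothetical, the source labels are borrowed from the list above, and the sample cases are illustrative only.

```python
from collections import defaultdict


def recall_by_source(cases):
    """Map each source type to the fraction of its attack cases that were flagged."""
    totals = defaultdict(int)
    caught = defaultdict(int)
    for source, is_attack, flagged in cases:
        if not is_attack:
            continue  # benign cases do not count toward recall
        totals[source] += 1
        if flagged:
            caught[source] += 1
    return {source: caught[source] / totals[source] for source in totals}


# Hypothetical labeled cases: (source type, is_attack, flagged by detector).
cases = [
    ("USER_PROMPT", True, True),
    ("USER_PROMPT", True, True),
    ("RAG", True, True),
    ("RAG", True, False),
    ("email", True, False),
    ("email", False, False),
]

for source, recall in sorted(recall_by_source(cases).items()):
    print(f"{source}: {recall:.2f}")
```

A breakdown like this shows when a strong overall score hides a weak source type, such as untrusted retrieved content being caught less often than direct prompts.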

Run locally

Reproduce the proof on your machine.

Generated benchmark artifacts stay local and are ignored by Git. Public pages show only metrics, IDs, hashes, and safe summaries.

Commands

python benchmarks/run_benchmark.py

python scripts/clean_repo.py --check