# Benchmarks
bomdrift uses criterion for benchmarking the four hot paths: parse, diff, typosquat, and render. Benchmarks are not run in CI (the variance on shared GitHub runners is ±20%, which buries any real signal); they’re a tool for validating perf-relevant changes on hardware you control.
## Running
```bash
# Run all benchmarks
cargo bench

# Run just one harness
cargo bench --bench parse
cargo bench --bench diff
cargo bench --bench typosquat
cargo bench --bench render

# Filter inside a harness
cargo bench --bench typosquat -- npm_batch
```
Criterion writes an HTML report to `target/criterion/report/index.html` on each run, with throughput plots, distribution histograms, and diff-against-previous-run charts.
## What each harness measures
### parse — SBOM parser layer
For each of the three fixture formats (CycloneDX, SPDX, Syft):
- `json_value`: cost of `serde_json::from_str` to a `Value` only. Captures the JSON-deserialization floor independent of bomdrift’s parser.
- `full_pipeline`: cost of `from_str` + `parse_with_format` to a normalized `model::Sbom`. The delta vs `json_value` is bomdrift’s parsing overhead.
A regression in the delta is the signal worth investigating — a regression in `json_value` is a `serde_json` change.
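A rough sketch of the `json_value` / `full_pipeline` split is below. The fixture path, the `parse_with_format` signature, and the `Format` variant are illustrative assumptions, not the project’s actual API; only the function names come from the description above.

```rust
use criterion::{criterion_group, criterion_main, Criterion};

fn parse_benches(c: &mut Criterion) {
    // Illustrative fixture path; the real harness presumably loads each of the
    // three fixture formats in turn.
    let input = std::fs::read_to_string("tests/fixtures/cyclonedx.json").unwrap();

    // json_value: the serde_json floor, independent of bomdrift.
    c.bench_function("cyclonedx_json_value", |b| {
        b.iter(|| serde_json::from_str::<serde_json::Value>(&input).unwrap())
    });

    // full_pipeline: from_str + parse_with_format into a normalized model::Sbom.
    // The exact parse_with_format signature is an assumption.
    c.bench_function("cyclonedx_full_pipeline", |b| {
        b.iter(|| {
            let value: serde_json::Value = serde_json::from_str(&input).unwrap();
            bomdrift::parse_with_format(value, bomdrift::Format::CycloneDx).unwrap()
        })
    });
}

criterion_group!(parse_group, parse_benches);
criterion_main!(parse_group);
```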
### diff — diff core
- `axios_fixture_pair`: realistic small-PR shape (~3 components per side). The lower bound for any diff invocation.
- `synth_monorepo_200`: 200 components per side, half of them version-changed. The realistic monorepo upper bound for a single PR.
- `synth_self_diff_200`: same input on both sides. Worst case for the BTreeMap-intersection path with no resulting work to do.
A regression on `synth_monorepo_200` likely indicates a hot-loop change in `diff_one_key`; a regression on `synth_self_diff_200` likely indicates a `ComponentKey::Ord` change.
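To make those two failure modes concrete, here is a toy version of the BTreeMap-intersection shape described above. The types are stand-ins, not bomdrift’s real model: the point is that every key goes through `ComponentKey`’s `Ord` during the map walk, even when, as in `synth_self_diff_200`, no change is ever emitted.

```rust
use std::collections::BTreeMap;

// Toy stand-ins for bomdrift's real types, just to show the shape of the work.
#[derive(PartialEq, Eq, PartialOrd, Ord)]
struct ComponentKey { ecosystem: String, name: String }

#[derive(Clone)]
struct Component { version: String }

enum Change {
    Added(Component),
    Removed(Component),
    VersionChanged { from: String, to: String },
}

fn diff(old: &BTreeMap<ComponentKey, Component>, new: &BTreeMap<ComponentKey, Component>) -> Vec<Change> {
    let mut out = Vec::new();
    for (key, new_c) in new {
        match old.get(key) {
            // Only on the new side: an addition.
            None => out.push(Change::Added(new_c.clone())),
            // Same key on both sides with a different version: the hot per-key
            // comparison (the diff_one_key-style loop).
            Some(old_c) if old_c.version != new_c.version => out.push(Change::VersionChanged {
                from: old_c.version.clone(),
                to: new_c.version.clone(),
            }),
            // Identical entry: the Ord comparisons were still paid, but nothing
            // is emitted (the synth_self_diff_200 case).
            Some(_) => {}
        }
    }
    for (key, old_c) in old {
        if !new.contains_key(key) {
            out.push(Change::Removed(old_c.clone()));
        }
    }
    out
}
```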
### typosquat — Jaro-Winkler scoring
- `one_npm_typosquat_axios`: a single candidate (`plain-crypto-js`) scored against the embedded npm top-1k list. The typosquat enricher’s per-candidate cost.
- `npm_batch10/50/100`: a batch of N candidates, exercising the per-candidate cost amortized over the batch.
- `mixed_three_ecosystems`: one candidate per ecosystem (npm + PyPI + Cargo), exercising the per-ecosystem dispatch and embedded-list load cost (after the `OnceLock` has been hit).
The first invocation for each ecosystem in a process pays the legit-list parsing + canonicalization cost (~1ms for the npm 1k list); subsequent invocations are hot. Criterion’s `iter()` measures the cached path.
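A self-contained sketch of that caching-and-scoring pattern is below. The embedded-list path, the 0.9 cutoff, and the use of the `strsim` crate are assumptions; bomdrift may well carry its own Jaro-Winkler implementation.

```rust
use std::sync::OnceLock;

static NPM_TOP_1K: OnceLock<Vec<String>> = OnceLock::new();

// First call pays the parse + canonicalization cost (~1ms for the npm 1k list);
// every later call returns the cached list, which is what criterion's iter() sees.
fn npm_legit_list() -> &'static [String] {
    NPM_TOP_1K.get_or_init(|| {
        // Embedded-list path is illustrative.
        include_str!("../data/npm_top_1k.txt")
            .lines()
            .map(|name| name.trim().to_lowercase())
            .collect()
    })
}

// Score one candidate against the legit list; return the best match above a cutoff.
// The 0.9 threshold is a placeholder, not bomdrift's actual cutoff.
fn score_candidate(candidate: &str) -> Option<(&'static str, f64)> {
    let candidate = candidate.to_lowercase();
    npm_legit_list()
        .iter()
        .map(|legit| (legit.as_str(), strsim::jaro_winkler(&candidate, legit)))
        .filter(|(_, score)| *score >= 0.9)
        .max_by(|a, b| a.1.total_cmp(&b.1))
}
```

Under these assumptions, `one_npm_typosquat_axios` measures roughly one `score_candidate` call with the list already warm, and the batch benches amortize that loop over N candidates.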
### render — output renderers
One benchmark per renderer (markdown, JSON, SARIF, terminal), each fed the same synthetic `ChangeSet` shaped like a moderate PR: 50 added / 20 removed / 30 version-changed / 5 license-changed components, plus 10 typosquats and 15 CVEs.
A regression on a single renderer is usually a string-formatting change in that file. A regression across all renderers is usually a `ChangeSet` shape change that propagated.
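To illustrate why a single-renderer regression usually points at string formatting, here is a toy markdown renderer over a toy `ChangeSet` (neither matches bomdrift’s real types): each renderer presumably does this kind of walk over the same shared `ChangeSet`, just into a different output format.

```rust
use std::fmt::Write;

// Toy ChangeSet, far simpler than the real one.
struct ChangeSet {
    added: Vec<String>,
    removed: Vec<String>,
    version_changed: Vec<(String, String, String)>, // (name, from, to)
}

// Pure string formatting over the ChangeSet: this per-file work is what a
// single-renderer regression usually lives in.
fn render_markdown(c: &ChangeSet) -> String {
    let mut out = String::new();
    writeln!(out, "### Added ({})", c.added.len()).unwrap();
    for name in &c.added {
        writeln!(out, "- `{name}`").unwrap();
    }
    writeln!(out, "### Removed ({})", c.removed.len()).unwrap();
    for name in &c.removed {
        writeln!(out, "- `{name}`").unwrap();
    }
    writeln!(out, "### Version changes ({})", c.version_changed.len()).unwrap();
    for (name, from, to) in &c.version_changed {
        writeln!(out, "- `{name}`: {from} -> {to}").unwrap();
    }
    out
}
```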
## Suggested workflow for perf-relevant PRs
- On a clean `main`, run `cargo bench` and let criterion record the baseline.
- Switch to your branch, make the change, and run `cargo bench` again.
- Criterion’s HTML report shows a “Change vs previous” column with confidence intervals. ±5% is noise on most hardware; ±10%+ is worth looking at; the statistical significance markers (criterion’s “Performance has improved” / “Performance has regressed” lines) are the first-class signal.
- If the change is intentional (e.g. a feature that adds a new pass), note the new baseline in the PR description so reviewers know to compare against the post-change number, not the pre-change one.
## Why no CI integration?
- Shared GitHub runners have ±20% variance run-over-run on these benchmarks. Real regressions are smaller than the noise floor.
- Self-hosted runners with pinned hardware would solve that, but the project doesn’t have that infrastructure (and the operational cost isn’t worth it at the project’s scale).
- For now, run benchmarks locally on a quiet machine; a future contributor can wire up a self-hosted bench runner if the project grows enough to justify it.