How the benchmark is run

A continuous, SRBench-style evaluation comparing Ufinq against PySR, gplearn, and Operon across symbolic-regression datasets — with the rules fixed in advance and the failures published, not buried.

Principles

One declared configuration per algorithm, applied uniformly across every dataset and seed — no per-dataset tuning.
One run per trial. Each algorithm trains on each dataset once per seed; quality is read from held-out predictions, never from re-training.
Beyond R². Each trial also reports symbolic recovery, complexity, stability, extrapolation, and calibration — not accuracy alone.
Negative results are first-class. Timeouts, non-finite predictions, and all-failed datasets are taxonomised and shown, never silently dropped.
Calibration separates a real performance regression from an environment change: a fixed reference set runs in every sweep, and drift is checked against a per-segment baseline.
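
As a rough illustration of how these principles could fit together, the sketch below runs one trial per (algorithm, dataset, seed), classifies failures instead of dropping them, and flags drift on a reference score. It is not the benchmark harness. Every name in it (Outcome, Trial, run_trial, sweep, all_failed, drifted) is hypothetical, each algorithm is assumed to be a plain callable that trains once and returns held-out predictions, and the 3-standard-deviation drift rule is an assumption chosen for the example rather than the benchmark's documented check.

```python
import math
from dataclasses import dataclass
from enum import Enum
from itertools import product
from statistics import mean, stdev
from typing import Callable, Dict, List, Optional, Sequence, Tuple


class Outcome(Enum):
    """Hypothetical failure taxonomy; the real categories may differ."""
    OK = "ok"
    TIMEOUT = "timeout"
    NON_FINITE = "non_finite"
    ERROR = "error"


@dataclass(frozen=True)
class Trial:
    algorithm: str
    dataset: str
    seed: int
    outcome: Outcome
    r2: Optional[float] = None  # only set when outcome is OK


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    # Coefficient of determination; assumes y_true is not constant.
    m = mean(y_true)
    ss_tot = sum((y - m) ** 2 for y in y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot


# Assumed interface: fit_predict(dataset_name, seed) trains once and
# returns predictions for that dataset's held-out targets.
FitPredict = Callable[[str, int], Sequence[float]]


def run_trial(alg: str, fit_predict: FitPredict, ds: str,
              y_test: Sequence[float], seed: int) -> Trial:
    """One training run; every failure becomes a recorded Trial."""
    try:
        y_pred = fit_predict(ds, seed)
    except TimeoutError:
        return Trial(alg, ds, seed, Outcome.TIMEOUT)
    except Exception:
        return Trial(alg, ds, seed, Outcome.ERROR)
    if not all(math.isfinite(p) for p in y_pred):
        return Trial(alg, ds, seed, Outcome.NON_FINITE)
    return Trial(alg, ds, seed, Outcome.OK, r2_score(y_test, y_pred))


def sweep(algorithms: Dict[str, FitPredict],
          datasets: Dict[str, Sequence[float]],
          seeds: Sequence[int]) -> List[Trial]:
    """Same callable (one configuration) for every dataset and seed."""
    return [
        run_trial(alg, fit, ds, y_test, seed)
        for (alg, fit), (ds, y_test), seed
        in product(algorithms.items(), datasets.items(), seeds)
    ]


def all_failed(trials: Sequence[Trial]) -> List[Tuple[str, str]]:
    """(algorithm, dataset) pairs where no seed produced a usable result."""
    outcomes: Dict[Tuple[str, str], List[Outcome]] = {}
    for t in trials:
        outcomes.setdefault((t.algorithm, t.dataset), []).append(t.outcome)
    return sorted(pair for pair, outs in outcomes.items()
                  if all(o is not Outcome.OK for o in outs))


def drifted(baseline: Sequence[float], current: float, k: float = 3.0) -> bool:
    """Assumed rule: flag a reference score more than k standard
    deviations from the baseline mean. Needs at least two baseline points."""
    return abs(current - mean(baseline)) > k * stdev(baseline)
```

In this sketch a timeout or a NaN prediction still yields a Trial row, so an all-failed dataset shows up in the output of all_failed rather than vanishing from the averages. The drifted helper would be applied only to the fixed reference set, with one baseline per segment, so that a shift there points to the environment rather than to the algorithm under test.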

The numbers

Every sweep is a point-in-time snapshot with a stable, citable URL. To compare environments over time, see calibration & drift; to cite a result, see how to cite.

The full methodology — metric definitions, the hyperparameter policy, the negative-results policy, and the governance for versioning it — is maintained as the canonical reference alongside the benchmark harness.