How the benchmark is run
A continuous, SRBench-style evaluation comparing Ufinq against PySR, gplearn, and Operon across symbolic-regression datasets — with the rules fixed in advance and the failures published, not buried.
Principles
- One declared configuration per algorithm, applied uniformly across every dataset and seed — no per-dataset tuning.
- One fit per trial. Each algorithm fits each dataset once per seed; quality is read from held-out predictions, never from re-fits.
- Beyond R². Each trial also reports symbolic recovery, complexity, stability, extrapolation, and calibration — not accuracy alone.
- Negative results are first-class. Timeouts, non-finite predictions, and all-failed datasets are taxonomised and shown, never silently dropped.
- Calibration separates regressions from environment changes. A fixed reference set runs in every sweep, and drift is checked against a per-segment baseline.
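The one-fit-per-trial and negative-results principles above can be illustrated with a minimal sketch. It is not the harness's actual code: the names `Outcome`, `TrialResult`, `run_trial`, `all_failed`, and the `fit_predict` callable are hypothetical, and it assumes an algorithm signals a timeout by raising `TimeoutError`.

```python
import math
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional, Sequence


class Outcome(Enum):
    """Hypothetical failure taxonomy; the real category names may differ."""
    OK = "ok"
    TIMEOUT = "timeout"
    NON_FINITE = "non_finite_predictions"


@dataclass(frozen=True)
class TrialResult:
    algorithm: str
    dataset: str
    seed: int
    outcome: Outcome
    r2: Optional[float] = None  # only set when outcome is OK


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination; assumes y_true is not constant."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def run_trial(
    algorithm: str,
    dataset: str,
    seed: int,
    fit_predict: Callable[[int], Sequence[float]],
    y_test: Sequence[float],
) -> TrialResult:
    """Fit once for this seed and score the held-out predictions.

    A failure is recorded as a result row rather than dropped.
    """
    try:
        y_pred = fit_predict(seed)  # the single fit for this trial
    except TimeoutError:  # assumption: timeouts surface as TimeoutError
        return TrialResult(algorithm, dataset, seed, Outcome.TIMEOUT)
    if not all(math.isfinite(p) for p in y_pred):
        return TrialResult(algorithm, dataset, seed, Outcome.NON_FINITE)
    return TrialResult(algorithm, dataset, seed, Outcome.OK, r2_score(y_test, y_pred))


def all_failed(results: Sequence[TrialResult]) -> bool:
    """True when no trial in a non-empty group (e.g. one dataset) succeeded."""
    return len(results) > 0 and all(r.outcome is not Outcome.OK for r in results)
```

Because each failure is returned as an ordinary `TrialResult`, downstream tables can count and display timeouts and non-finite predictions alongside successful fits.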
The numbers
Every sweep is a point-in-time snapshot with a stable, citable URL. To compare environments over time, see calibration & drift; to cite a result, see how to cite.
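As a rough illustration of what a drift check against a per-segment baseline could look like, the sketch below flags a reference-set measurement that sits far from the baseline values recorded earlier in the same segment. The z-score rule, the threshold of 3.0, the function name `drifted`, and the reading of a "segment" as a run of sweeps that share one environment are all assumptions for illustration, not the benchmark's documented procedure.

```python
from statistics import mean, stdev
from typing import Sequence


def drifted(baseline: Sequence[float], current: float, z_threshold: float = 3.0) -> bool:
    """Flag `current` if it lies more than `z_threshold` sample standard
    deviations from the mean of the segment's baseline measurements.

    `baseline` needs at least two values so a standard deviation exists.
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # Degenerate baseline: any change at all counts as drift.
        return current != mu
    return abs(current - mu) / sigma > z_threshold


# Hypothetical reference-set measurements from earlier sweeps in one segment.
segment_baseline = [10.0, 10.2, 9.8, 10.1, 9.9]
print(drifted(segment_baseline, 10.1))
print(drifted(segment_baseline, 12.0))
```

A flagged reference set suggests the environment moved, so score changes in that sweep should not be read as algorithm regressions without further checking.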
The full methodology — metric definitions, the hyperparameter policy, the negative-results policy, and the governance for versioning it — is maintained as the canonical reference alongside the benchmark harness.
