Note  ·  2026-04-22  ·  RESEARCH

Approximate evaluation, as a research problem

Full-fidelity evaluation of evolved candidates is the dominant compute cost in symbolic regression at scale. CRESCENT treats evaluation itself as a search problem: which candidates deserve a full score, and which can be safely scored cheaply?

At cluster scale, the time spent inside the evaluation function dwarfs the time spent inside selection, mutation, and even the refinement pipeline. This is true for any non-trivial regression target: scoring a candidate means running it on the training data, computing residuals, computing a loss. The bigger the dataset, the more dominant evaluation becomes.

Most of those evaluations are wasted. The vast majority of candidates the search produces are bad — the search is supposed to produce many bad candidates so that the rare good ones can stand out. Spending full-fidelity evaluation budget on candidates that will be discarded after one generation is a structural inefficiency, not a tuning issue.

The CRESCENT idea

CRESCENT (a Ufinq research thread) treats evaluation as a two-stage decision. A cheap, binned, low-fidelity evaluation runs first; the result decides whether the candidate is promising enough to deserve a full-fidelity score. Candidates that fail the cheap test are scored on the cheap test alone; candidates that pass it pay the full cost.

Two design questions fall out:

  • What does the cheap test look like? Binning the data, sampling a subset, running a coarsened version of the loss — each has different failure modes. The cheap test must correlate strongly with the full score on the candidates that matter, while running fast enough to pay for itself.
  • What is the gating threshold? Static thresholds underweight the fact that the population’s quality distribution shifts over the run. Dynamic thresholds tied to current population statistics handle this, but introduce a feedback loop with the cheap test’s noise floor.

What we want from the experiments

The first experimental question is whether CRESCENT produces a measurable end-to-end speedup on SRBench while preserving solution quality. The second is whether the savings scale faster than linearly with dataset size, as the underlying argument suggests they should. The third is whether contamination of the cheap test by problem structure causes any regressions on a specific subset of benchmarks.

Approximate evaluation has a long history in machine learning under different names: stochastic evaluation, racing algorithms, successive halving. CRESCENT is what those ideas look like when the search produces evolved programs rather than parameter settings. The shape of the problem changes, because the cost of one full-fidelity evaluation is no longer constant across candidates: a tall, expensive expression takes much longer to evaluate than a small one. The cheap test has to adapt to that, too.