v1.2.8: Cost of Abstraction

10 April 2025. Version 1.2.8 introduces a new solving architecture called barebones, which can be activated via the option -arch barebones. It strips away some abstractions to reduce kernel code complexity and to ensure compatibility with H100 GPUs.

While less generic than the existing algorithms, barebones is significantly more efficient: it delivers up to 63% more nodes per second and, for the first time, proves optimality on three instances.

Minimalist by Design

The key design decision is to remove the dependencies on the SearchTree and BAB abstract domains. The entire search process is implemented in a single file, barebones_dive_and_solve.hpp (about 1000 LOC), compared to 700 LOC for gpu_dive_and_solve.hpp and 800 LOC for hybrid_dive_and_solve.hpp, both of which additionally rely on those shared abstractions.

Despite the loss of generality, we retain full functionality in practice. In theory, the previous architectures could add any constraint to the search tree during solving, but this feature was never used. The only constraint added dynamically is the objective bound, which is now handled directly on interval domains, reducing overhead.
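As an illustration, here is a minimal sketch of how an objective bound can be tightened directly on an interval during minimization; the Interval struct and the function name are assumptions made for this example, not Turbo's actual API.

```cpp
// Minimal sketch (assumed names, not Turbo's API): after a solution with
// objective value `best` is found, the bound obj < best is enforced by
// shrinking the objective variable's interval in place; no constraint
// object is allocated or propagated.
struct Interval {
  int lb, ub; // inclusive bounds: the domain is [lb, ub]
};

inline void tighten_objective(Interval& obj, int best) {
  if (best - 1 < obj.ub) {
    obj.ub = best - 1; // an empty interval (lb > ub) prunes the subtree
  }
}
```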

Similarly, instead of supporting arbitrary branching like x < y ∨ x = y ∨ x > y, we stick to binary strategies such as x < 10 ∨ x ≥ 10, which can be efficiently implemented with two intervals: [xl, 9] and [10, xu].
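A sketch of such a binary split, using the same illustrative interval representation as above (not Turbo's exact code):

```cpp
// Branching x < v ∨ x ≥ v on x ∈ [lb, ub] produces two child intervals;
// no branching-constraint objects are needed.
struct Interval {
  int lb, ub; // inclusive bounds
};

// Precondition: lb < v <= ub, so both children are non-empty.
inline void split(const Interval& x, int v, Interval& left, Interval& right) {
  left  = {x.lb, v - 1}; // x < v
  right = {v, x.ub};     // x >= v
}
```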

Parallel Branching & Restoration

With simplified branching and fewer abstractions, search becomes easier to parallelize.

Branching strategies (variable and value selection) now leverage parallel iteration over the variable store, e.g., to find the variable with the smallest domain. Restoring the variables' domains upon backtracking is also done in parallel.
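For instance, a first-fail selection (smallest domain) can be computed with a block-wide parallel reduction. The sketch below is a hedged illustration: the flat array-of-bounds layout and the names are assumptions, not Turbo's actual store, and it assumes blockDim.x is a power of two of at most 256.

```cuda
#include <climits>

// Each thread scans a strided slice of the variable store, then a tree
// reduction in shared memory yields the unassigned variable with the
// smallest domain.
__device__ int select_smallest_domain(const int* lb, const int* ub, int n) {
  __shared__ int best_size[256];
  __shared__ int best_var[256];
  int tid = threadIdx.x;
  int my_size = INT_MAX, my_var = -1;
  for (int i = tid; i < n; i += blockDim.x) {
    int size = ub[i] - lb[i] + 1;
    if (size > 1 && size < my_size) { my_size = size; my_var = i; }
  }
  best_size[tid] = my_size;
  best_var[tid] = my_var;
  __syncthreads();
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s && best_size[tid + s] < best_size[tid]) {
      best_size[tid] = best_size[tid + s];
      best_var[tid]  = best_var[tid + s];
    }
    __syncthreads();
  }
  return best_var[0]; // -1 if every variable is already assigned
}
```

Restoration follows the same pattern, with each thread copying back the saved bounds of its slice of variables.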

This parallelization significantly reduces search time, which is now negligible compared to propagation; that was not the case in the GPU and hybrid versions, as illustrated in previous experiments.

[Figure: Turbo Barebones v1.2.8 time profile]

As shown above, on the "roster-sickness_small-4" instance, 191 blocks remain idle at timeout, indicating load imbalance. Improving the decomposition of work across blocks is a near-term priority.

Automatic Block Configuration

Previously, the number of CUDA blocks was set manually using -or, e.g. -or 128. Now, you can enable automatic configuration via -or 0. It uses CUDA’s cudaOccupancyMaxActiveBlocksPerMultiprocessor to maximize the number of blocks per SM. We then attempt to fit the variable and propagator stores into shared memory with the selected configuration. The goal is to favor higher occupancy, even if it means using global memory in some cases.
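The occupancy query itself is a standard CUDA runtime call; the sketch below shows the general pattern, where the kernel name, block-size handling, and shared-memory budgeting are assumptions rather than Turbo's exact code.

```cpp
#include <cuda_runtime.h>

__global__ void solve_kernel() { /* placeholder for the solving kernel */ }

// Returns a block count that fills every SM at the occupancy computed by
// the CUDA runtime for this kernel, block size, and dynamic shared memory.
int auto_num_blocks(int block_size, size_t dyn_shmem_bytes) {
  int device = 0;
  cudaGetDevice(&device);
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, device);
  int blocks_per_sm = 0;
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(
      &blocks_per_sm, solve_kernel, block_size, dyn_shmem_bytes);
  return blocks_per_sm * prop.multiProcessorCount;
}
```

The shared-memory budget available per block at that occupancy (roughly prop.sharedMemPerMultiprocessor divided by blocks_per_sm) then determines whether the variable and propagator stores fit in shared memory or fall back to global memory.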

Benchmark Results (A5000)

Using -or 128 -arch barebones -fp wac1, the performance gains are substantial:

| Metrics | Normalized average [0,100] | Δ v1.2.7 | #best (/16) | Average | Δ v1.2.7 | Median | Δ v1.2.7 |
|---|---|---|---|---|---|---|---|
| Nodes per second | 99.87 | +49% | 15 | 211302.32 | +351% | 34413.66 | +22% |
| Deductions per node | 95.80 | -3% | 9 | 815118.16 | 0% | 185548.13 | -16% |
| Fixpoint iterations per second | 99.97 | +47% | 15 | 1541334.10 | +248% | 437091.95 | +34% |
| Fixpoint iterations per node | 96.45 | -2% | 9 | 14.52 | -3% | 9.29 | -9% |
| Propagators memory | 99.99 | 0% | 9 | 0.64MB | 0% | 0.25MB | 0% |
| Variables store memory | 100.00 | 0% | 0 | 288.40KB | 0% | 89.06KB | 0% |

With automatic block configuration (-or 0), we gain an additional speed-up:

| Metrics | Normalized average [0,100] | Δ v1.2.7 | #best (/16) | Average | Δ v1.2.7 | Median | Δ v1.2.7 |
|---|---|---|---|---|---|---|---|
| Nodes per second | 99.17 | +63% | 15 | 262153.55 | +459% | 43555.10 | +54% |
| Deductions per node | 96.52 | -1% | 6 | 823142.05 | +1% | 191625.76 | -13% |
| Fixpoint iterations per second | 99.62 | +65% | 15 | 1925882.39 | +334% | 502630.01 | +54% |
| Fixpoint iterations per node | 97.11 | -1% | 6 | 14.95 | 0% | 9.31 | -9% |
| Propagators memory | 99.99 | 0% | 9 | 0.64MB | 0% | 0.25MB | 0% |
| Variables store memory | 100.00 | 0% | 0 | 288.40KB | 0% | 89.06KB | 0% |

Results Summary

| Metrics | barebones | hybrid |
|---|---|---|
| #Problems at optimality | 3 | 0 |
| #Problems satisfiable | 11 | 14 |
| #Problems unknown | 2 | 2 |
| #Problems with store in shared memory | 1 | 6 |
| #Problems with prop in shared memory | 0 | 0 |
| #Problems with IDLE SMs at timeout | 1 | 2 |

This is the first time the solver has proved optimality on three problems within a 5-minute timeout. It validates the trade-off: by removing abstraction, we unlock performance and simplify the codebase, without losing what matters.