v1.2.5: Trivially Copyable

4 January 2025. If premature optimization is the root of all evil, this version demonstrates that premature generalization is the root of inefficiency on the GPU.

While profiling Turbo, I identified cuda::std::tuple as a performance bottleneck. The inefficiency stemmed from the Interval class, which was originally implemented using a CartesianProduct abstraction to represent intervals as a pair of lower and upper bounds, the goal being to stay close to the mathematical definition of an interval. For generality, CartesianProduct supports an arbitrary number of lattice types, storing them internally in a tuple. However, tuple is not trivially copyable. This meant that every time an interval was transferred from the CPU to the GPU (and vice versa), the copy constructor of each element was invoked, incurring a significant overhead.

To address this, I refactored Interval to store the lower and upper bounds directly as members, eliminating the use of CartesianProduct and tuple. As shown in the table below, this simple refactoring doubled the number of nodes explored per second for the hybrid version with 128 blocks.

| Metrics | Normalized average [0,100] | Δ v1.2.4 | #best (_/16) | Average | Δ v1.2.4 | Median | Δ v1.2.4 |
|---|---|---|---|---|---|---|---|
| Nodes per second | 99.98 | +97% | 15 | 34154.16 | +56% | 17018.07 | +180% |
| Fixpoint iterations per second | 99.96 | +104% | 15 | 1102807.71 | +30% | 419951.32 | +192% |
| Fixpoint iterations per node | 99.82 | +4% | 6 | 42.51 | +5% | 25.34 | +11% |
| Propagators memory | 100.00 | 0% | 0 | 0.85MB | 0% | 0.42MB | 0% |
| Variables store memory | 100.00 | 0% | 0 | 402.75KB | 0% | 208.20KB | 0% |

Interestingly, the full GPU architecture (-arch gpu) benefits less from this optimization.

| Metrics | Normalized average [0,100] | Δ v1.2.4 | #best (_/16) | Average | Δ v1.2.4 | Median | Δ v1.2.4 |
|---|---|---|---|---|---|---|---|
| Nodes per second | 99.98 | +8% | 15 | 15135.09 | +5% | 6002.31 | +7% |
| Fixpoint iterations per second | 100.00 | +7% | 15 | 523632.56 | +2% | 151634.36 | +11% |
| Fixpoint iterations per node | 99.43 | -1% | 13 | 40.22 | -1% | 24.03 | 0% |
| Propagators memory | 100.00 | 0% | 0 | 1.03MB | 0% | 0.51MB | 0% |
| Variables store memory | 100.00 | 0% | 0 | 402.93KB | 0% | 208.39KB | 0% |

This indicates that non-trivially-copyable types are particularly costly when data is transferred between the CPU and the GPU, and less so when it stays on the GPU.

On Helios (GH200), however, the difference between versions becomes negligible. This is likely due to the unified memory architecture of SM90, which reduces the impact of memory transfer costs.