Understanding GPU Performance: The Roofline Model
Having implemented the naive matrix multiplication, you might be wondering: How well is our kernel actually performing? Is it limited by the GPU’s computational power, or is something else holding it back?
The roofline model is your compass for GPU optimization—it reveals which hardware bottleneck limits your kernel’s performance and guides you toward the most impactful optimizations. Rather than guessing at improvements, the roofline model shows you exactly where to focus your efforts.
1. Two ceilings for every GPU kernel
Every GPU kernel operates under two fundamental constraints:
- Compute ceiling – how quickly the cores can execute floating-point operations (peak FLOPs/s)
- Memory ceiling – how quickly the memory system can feed those cores with data (peak bytes/s)
Understanding which ceiling constrains your kernel is crucial for optimization strategy. The roofline model visualizes this relationship by plotting two key metrics:
X-axis: Arithmetic Intensity – How much computation you extract per byte of data
\[\Large I = \frac{\text{Total FLOPs}}{\text{Total Bytes from Memory}} \quad [\text{FLOP/B}]\]
Y-axis: Sustained Performance – How fast your kernel actually runs
\[\Large P_{\text{sustained}} = \frac{\text{Total FLOPs}}{\text{Elapsed Time}} \quad [\text{GFLOP/s}]\]
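As a minimal sketch (the totals below are placeholders, not measurements), both axes are simple ratios of quantities a profiler can report:

```python
# Placeholder totals for one kernel launch (not real measurements).
total_flops = 1.0e9     # floating-point operations executed
total_bytes = 4.0e9     # bytes moved to/from global memory
elapsed_s = 4.0e-3      # kernel runtime in seconds

intensity = total_flops / total_bytes              # x-axis: FLOP/B
sustained_gflops = total_flops / elapsed_s / 1e9   # y-axis: GFLOP/s
print(intensity, sustained_gflops)                 # 0.25 FLOP/B, 250 GFLOP/s
```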
Two “roofs” bound all achievable performance:
| Roof | Equation | Meaning |
|---|---|---|
| Memory roof | \(P = B_{\text{peak}} \cdot I\) | Sloped line; performance limited by memory bandwidth |
| Compute roof | \(P = P_{\text{peak}}\) | Horizontal line; performance limited by compute throughput |
The critical intensity
\[\Large I^* = \frac{P_{\text{peak}}}{B_{\text{peak}}}\]
marks where a kernel transitions from memory-bound (\(I < I^* \)) to compute-bound (\(I > I^* \)).
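A minimal sketch of the model itself, with the peak numbers left as parameters rather than tied to any specific GPU:

```python
def attainable_gflops(intensity, peak_gflops, peak_gbps):
    """Roofline ceiling: the lower of the memory roof and the compute roof."""
    memory_roof = peak_gbps * intensity   # sloped line, limited by bandwidth
    compute_roof = peak_gflops            # horizontal line, limited by throughput
    return min(memory_roof, compute_roof)

def critical_intensity(peak_gflops, peak_gbps):
    """Ridge point I* = P_peak / B_peak where the two roofs meet."""
    return peak_gflops / peak_gbps
```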
2. Hardware example: NVIDIA A100 specifications
Let’s ground this theory in concrete numbers using the NVIDIA A100:
Peak FP32 throughput \[\Large P_{\text{peak}} = 19.5 \text{ TFLOP/s} = 19{,}500 \text{ GFLOP/s}\]
Peak HBM2 bandwidth \[\Large B_{\text{peak}} = 1{,}555 \text{ GB/s}\]
Critical intensity \[\Large I^* = \frac{19{,}500}{1{,}555} \approx 12.5 \text{ FLOP/B}\]
Source: NVIDIA A100 Tensor Core GPU Architecture
This means kernels with arithmetic intensity below 12.5 FLOP/B are memory-bound, while those above are compute-bound.
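Plugging the A100 numbers in reproduces the critical intensity and lets us classify a kernel by which side of it the kernel falls on (a small illustrative helper, not a vendor API):

```python
PEAK_GFLOPS = 19_500   # A100 FP32 peak (GFLOP/s)
PEAK_GBPS = 1_555      # A100 HBM2 peak bandwidth (GB/s)

i_star = PEAK_GFLOPS / PEAK_GBPS    # critical intensity
print(f"I* = {i_star:.1f} FLOP/B")  # ~12.5 FLOP/B

def classify(intensity):
    """Which side of the ridge point a kernel falls on."""
    return "memory-bound" if intensity < i_star else "compute-bound"

print(classify(0.1875))  # "memory-bound" -- our naive kernel, analyzed below
```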
3. Visualizing our matrix multiplication implementations
The animation below shows how our puzzle implementations map onto the A100’s roofline model:
The visualization demonstrates the optimization journey we’ll take in this puzzle:
- Hardware constraints – The red memory roof and blue compute roof define performance limits
- Our starting point – The naive implementation (left purple dot) sitting firmly on the memory roof
- Optimization target – The shared memory version (right purple dot) with improved arithmetic intensity
- Ultimate goal – The golden arrow pointing toward the critical intensity where kernels become compute-bound
4. Analyzing our naive implementation
Let’s examine why our naive kernel from the previous section performs as it does. For our \(2 \times 2\) matrix multiplication:
Computation per output element: \(\text{SIZE} + (\text{SIZE}-1) = 3 \text{ FLOPs}\)
Each output element requires \(\text{SIZE}\) multiplications and \(\text{SIZE} - 1\) additions: \[C_{00} = A_{00} \cdot B_{00} + A_{01} \cdot B_{10}\] For \(\text{SIZE} = 2\), that is 2 multiplications + 1 addition = 3 FLOPs.
Memory accesses per output element:
- Row from matrix A: \(2 \times 4 = 8\) bytes (FP32)
- Column from matrix B: \(2 \times 4 = 8\) bytes (FP32)
- Total: \(16\) bytes per output element
Arithmetic intensity: \[\Large I_{\text{naive}} = \frac{3 \text{ FLOPs}}{16 \text{ bytes}} = 0.1875 \text{ FLOP/B}\]
Since \(I_{\text{naive}} = 0.1875 \ll I^* = 12.5\), our naive kernel is severely memory-bound.
Expected performance: \[\Large P \approx B_{\text{peak}} \times I_{\text{naive}} = 1{,}555 \times 0.1875 \approx 292 \text{ GFLOP/s}\]
This represents only \(\frac{292}{19{,}500} \approx 1.5\%\) of the GPU’s computational potential! The visualization clearly shows this as the leftmost purple dot sitting squarely on the memory roof—we’re nowhere near the compute ceiling.
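The whole naive analysis fits in a few lines; this sketch simply reproduces the numbers above for our \(\text{SIZE} = 2\) case, using the A100 constants from earlier:

```python
SIZE = 2
FP32_BYTES = 4
PEAK_GFLOPS, PEAK_GBPS = 19_500, 1_555   # A100 peaks from the previous section

flops_per_elem = SIZE + (SIZE - 1)        # SIZE multiplies + (SIZE - 1) adds = 3
bytes_per_elem = 2 * SIZE * FP32_BYTES    # one row of A + one column of B = 16

i_naive = flops_per_elem / bytes_per_elem    # 0.1875 FLOP/B
p_expected = PEAK_GBPS * i_naive             # ~292 GFLOP/s (on the memory roof)
fraction = p_expected / PEAK_GFLOPS          # ~1.5% of FP32 peak

print(i_naive, round(p_expected), f"{fraction:.1%}")
```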
5. The path forward: shared memory optimization
The roofline model reveals our optimization strategy: increase arithmetic intensity by reducing redundant memory accesses. This is exactly what the shared memory approach accomplishes:
Shared memory benefits:
- Cooperative loading: Threads work together to load matrix blocks into fast shared memory
- Data reuse: Each loaded element serves multiple computations
- Reduced global memory traffic: Fewer accesses to slow global memory
Expected arithmetic intensity improvement: with cooperative loading, each input matrix is read from global memory only once, so the whole \(2 \times 2\) product costs 12 FLOPs (4 outputs \(\times\) 3 FLOPs each) against 32 bytes (two \(2 \times 2\) FP32 matrices): \[\Large I_{\text{shared}} = \frac{12 \text{ FLOPs}}{32 \text{ bytes}} = 0.375 \text{ FLOP/B}\]
While still memory-bound for our small \(2 \times 2\) case, this 2× improvement in arithmetic intensity scales dramatically for larger matrices where shared memory tiles can be reused many more times.
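A back-of-the-envelope sketch of that scaling, under the idealized assumption that each input matrix is read from global memory exactly once and output writes are ignored (an assumption for illustration, not what any particular kernel achieves):

```python
def ideal_intensity(n, bytes_per_elem=4):
    """Idealized FLOP/B for an n x n FP32 matmul with perfect input reuse."""
    flops = 2 * n**3 - n**2                  # n multiplies + (n-1) adds per output, n^2 outputs
    bytes_read = 2 * n**2 * bytes_per_elem   # A and B each loaded from global memory once
    return flops / bytes_read

for n in (2, 64, 1024):
    print(n, ideal_intensity(n))   # 0.375, 15.875, 255.875 FLOP/B
```

Under this idealized model, even modest matrix sizes push past the A100’s critical intensity of about 12.5 FLOP/B, which is why tiling pays off so dramatically at scale.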
6. Optimization strategies revealed by the roofline
The roofline model not only diagnoses current performance but also illuminates optimization paths. Here are the key techniques we’ll explore in later puzzles:
| Technique | Roofline effect | Implementation approach |
|---|---|---|
| Shared memory tiling | ↑ Arithmetic intensity through data reuse | Cooperative loading, block-wise computation |
| Register blocking | Reduce memory traffic with register accumulation | Loop unrolling with register variables |
| Kernel fusion | More FLOPs per byte by combining operations | Single kernel handling multiple computation stages |
| Memory coalescing | Maximize effective bandwidth utilization | Structured access patterns, proper thread organization |
| Mixed precision | Smaller data types reduce memory pressure | FP16/BF16 input with FP32 accumulation |
Each technique moves a kernel on the roofline plot: upward toward the memory roof through better bandwidth utilization, or rightward toward the compute roof through higher arithmetic intensity.
7. Beyond simple rooflines
Multi-level memory: Advanced rooflines include separate ceilings for L2 cache, shared memory, and register bandwidth to identify which memory hierarchy level constrains performance.
Communication rooflines: For multi-GPU applications, replace memory bandwidth with interconnect bandwidth (NVLink, InfiniBand) to analyze scaling efficiency.
Specialized units: Modern GPUs include tensor cores with their own performance characteristics, requiring specialized roofline analysis.
8. Using the roofline in practice
- Profile your kernel: Use tools like Nsight Compute to measure actual FLOPs and memory traffic
- Plot the data point: Calculate arithmetic intensity and sustained performance
- Identify the bottleneck: Memory-bound kernels sit on the memory roof; compute-bound kernels approach the compute roof
- Choose optimizations: Focus on bandwidth improvements for memory-bound kernels, algorithmic changes for compute-bound ones
- Measure and iterate: Verify that optimizations move kernels in the expected direction
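A minimal sketch of that workflow, with placeholder numbers standing in for what a profiler such as Nsight Compute would report:

```python
PEAK_GFLOPS, PEAK_GBPS = 19_500, 1_555   # A100 roofs

# Placeholder measurements standing in for profiler output.
measured_flops = 2.0e12    # total floating-point operations
measured_bytes = 4.0e11    # total DRAM traffic (bytes)
measured_time_s = 0.4      # kernel time (seconds)

intensity = measured_flops / measured_bytes          # 5.0 FLOP/B -> memory-bound
sustained = measured_flops / measured_time_s / 1e9   # 5,000 GFLOP/s
roof = min(PEAK_GBPS * intensity, PEAK_GFLOPS)       # attainable ceiling at this intensity

print(f"I = {intensity:.1f} FLOP/B, P = {sustained:.0f} GFLOP/s, "
      f"{sustained / roof:.0%} of the attainable roof")
```

The gap between sustained performance and the attainable roof tells you how much headroom remains before you hit the hardware limit at the current arithmetic intensity.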
Connection to our shared memory puzzle
In the next section, we’ll implement the shared memory optimization that begins moving our kernel up the roofline. As the visualization shows, this takes us from the left purple dot (naive) to the right purple dot (shared memory)—a clear performance improvement through better data reuse.
While our \(2 \times 2\) example won’t reach the compute roof, you’ll see how the same principles scale to larger matrices where shared memory becomes crucial for performance. The roofline model provides the theoretical foundation for understanding why shared memory helps and how much improvement to expect.
Understanding the roofline model transforms GPU optimization from guesswork into systematic engineering. Every optimization technique in this book can be understood through its effect on this simple but powerful performance model.