`warp.shuffle_xor()` Butterfly Communication

For warp-level butterfly communication we can use shuffle_xor() to create sophisticated tree-based communication patterns within a warp. This powerful primitive enables efficient parallel reductions, sorting networks, and advanced coordination algorithms without shared memory or explicit synchronization.

Key insight: The shuffle_xor() operation leverages SIMT execution to create XOR-based communication trees, enabling efficient butterfly networks and parallel algorithms that scale with \(O(\log n)\) complexity relative to warp size.

What are butterfly networks? Butterfly networks are communication topologies where threads exchange data based on XOR patterns of their indices. The name comes from the visual pattern when drawn - connections that look like butterfly wings. These networks are fundamental to parallel algorithms like FFT, bitonic sort, and parallel reductions because they enable \(O(\log n)\) communication complexity.

Key concepts

In this puzzle, you’ll master:

XOR-based communication patterns with shuffle_xor()
Butterfly network topologies for parallel algorithms
Tree-based parallel reductions with \(O(\log n)\) complexity
Conditional butterfly operations for advanced coordination
Hardware-optimized parallel primitives replacing complex shared memory

The shuffle_xor operation enables each lane to exchange data with lanes based on XOR patterns: \[\Large \text{shuffle_xor}(\text{value}, \text{mask}) = \text{value_from_lane}(\text{lane_id} \oplus \text{mask})\]

This transforms complex parallel algorithms into elegant butterfly communication patterns, enabling efficient tree reductions and sorting networks without explicit coordination.

1. Basic butterfly pair swap

Configuration

Vector size: SIZE = WARP_SIZE (32 or 64 depending on GPU)
Grid configuration: (1, 1) blocks per grid
Block configuration: (WARP_SIZE, 1) threads per block
Data type: DType.float32
Layout: Layout.row_major(SIZE) (1D row-major)

The `shuffle_xor` concept

Traditional pair swapping requires complex indexing and coordination:

# Traditional approach - complex and requires synchronization
shared_memory[lane] = input[global_i]
barrier()
if lane % 2 == 0:
    partner = lane + 1
else:
    partner = lane - 1
if partner < WARP_SIZE:
    swapped_val = shared_memory[partner]

Problems with traditional approach:

Memory overhead: Requires shared memory allocation
Synchronization: Needs explicit barriers
Complex logic: Manual partner calculation and bounds checking
Poor scaling: Doesn’t leverage hardware communication

With shuffle_xor(), pair swapping becomes elegant:

# Butterfly XOR approach - simple and hardware-optimized
current_val = input[global_i]
swapped_val = shuffle_xor(current_val, 1)  # XOR with 1 creates pairs
output[global_i] = swapped_val

Benefits of shuffle_xor:

Zero memory overhead: Direct register-to-register communication
No synchronization: SIMT execution guarantees correctness
Hardware optimized: Single instruction for all lanes
Butterfly foundation: Building block for complex parallel algorithms

Code to complete

Implement pair swapping using shuffle_xor() to exchange values between adjacent pairs.

Mathematical operation: Create adjacent pairs that exchange values using XOR pattern: \[\Large \text{output}[i] = \text{input}[i \oplus 1]\]

This transforms input data [0, 1, 2, 3, 4, 5, 6, 7, ...] into pairs [1, 0, 3, 2, 5, 4, 7, 6, ...], where each pair (i, i+1) swaps values through XOR communication.

alias SIZE = WARP_SIZE
alias BLOCKS_PER_GRID = (1, 1)
alias THREADS_PER_BLOCK = (WARP_SIZE, 1)
alias dtype = DType.float32
alias layout = Layout.row_major(SIZE)


fn butterfly_pair_swap[
    layout: Layout, size: Int
](
    output: LayoutTensor[mut=False, dtype, layout],
    input: LayoutTensor[mut=False, dtype, layout],
):
    """
    Basic butterfly pair swap: Exchange values between adjacent pairs using XOR pattern.
    Each thread exchanges its value with its XOR-1 neighbor, creating pairs: (0,1), (2,3), (4,5), etc.
    Uses shuffle_xor(val, 1) to swap values within each pair.
    This is the foundation of butterfly network communication patterns.
    """
    global_i = block_dim.x * block_idx.x + thread_idx.x

    # FILL ME IN (4 lines)

View full file: problems/p24/p24.mojo

Tips

1. Understanding shuffle_xor

The shuffle_xor(value, mask) operation allows each lane to exchange data with a lane whose ID differs by the XOR mask. Think about what happens when you XOR a lane ID with different mask values.

Key question to explore:

What partner does lane 0 get when you XOR with mask 1?
What partner does lane 1 get when you XOR with mask 1?
Do you see a pattern forming?

Hint: Try working out the XOR operation manually for the first few lane IDs to understand the pairing pattern.

2. XOR pair pattern

Think about the binary representation of lane IDs and what happens when you flip the least significant bit.

Questions to consider:

What happens to even-numbered lanes when you XOR with 1?
What happens to odd-numbered lanes when you XOR with 1?
Why does this create perfect pairs?

3. No boundary checking needed

Unlike shuffle_down(), shuffle_xor() operations stay within warp boundaries. Consider why XOR with small masks never creates out-of-bounds lane IDs.

Think about: What’s the maximum lane ID you can get when XORing any valid lane ID with 1?

Test the butterfly pair swap:

uv run poe p24 --pair-swap

pixi run p24 --pair-swap

Expected output when solved:

WARP_SIZE:  32
SIZE:  32
output: [1.0, 0.0, 3.0, 2.0, 5.0, 4.0, 7.0, 6.0, 9.0, 8.0, 11.0, 10.0, 13.0, 12.0, 15.0, 14.0, 17.0, 16.0, 19.0, 18.0, 21.0, 20.0, 23.0, 22.0, 25.0, 24.0, 27.0, 26.0, 29.0, 28.0, 31.0, 30.0]
expected: [1.0, 0.0, 3.0, 2.0, 5.0, 4.0, 7.0, 6.0, 9.0, 8.0, 11.0, 10.0, 13.0, 12.0, 15.0, 14.0, 17.0, 16.0, 19.0, 18.0, 21.0, 20.0, 23.0, 22.0, 25.0, 24.0, 27.0, 26.0, 29.0, 28.0, 31.0, 30.0]
✅ Butterfly pair swap test passed!

Solution

fn butterfly_pair_swap[
    layout: Layout, size: Int
](
    output: LayoutTensor[mut=False, dtype, layout],
    input: LayoutTensor[mut=False, dtype, layout],
):
    """
    Basic butterfly pair swap: Exchange values between adjacent pairs using XOR pattern.
    Each thread exchanges its value with its XOR-1 neighbor, creating pairs: (0,1), (2,3), (4,5), etc.
    Uses shuffle_xor(val, 1) to swap values within each pair.
    This is the foundation of butterfly network communication patterns.
    """
    global_i = block_dim.x * block_idx.x + thread_idx.x

    if global_i < size:
        current_val = input[global_i]

        # Exchange with XOR-1 neighbor using butterfly pattern
        # Lane 0 exchanges with lane 1, lane 2 with lane 3, etc.
        swapped_val = shuffle_xor(current_val, 1)

        # For demonstration, we'll store the swapped value
        # In real applications, this might be used for sorting, reduction, etc.
        output[global_i] = swapped_val

This solution demonstrates how shuffle_xor() creates perfect pair exchanges through XOR communication patterns.

Algorithm breakdown:

if global_i < size:
    current_val = input[global_i]              # Each lane reads its element
    swapped_val = shuffle_xor(current_val, 1)  # XOR creates pair exchange

    # For demonstration, store the swapped value
    output[global_i] = swapped_val

SIMT execution deep dive:

Cycle 1: All lanes load their values simultaneously
  Lane 0: current_val = input[0] = 0
  Lane 1: current_val = input[1] = 1
  Lane 2: current_val = input[2] = 2
  Lane 3: current_val = input[3] = 3
  ...
  Lane 31: current_val = input[31] = 31

Cycle 2: shuffle_xor(current_val, 1) executes on all lanes
  Lane 0: receives from Lane 1 (0⊕1=1) → swapped_val = 1
  Lane 1: receives from Lane 0 (1⊕1=0) → swapped_val = 0
  Lane 2: receives from Lane 3 (2⊕1=3) → swapped_val = 3
  Lane 3: receives from Lane 2 (3⊕1=2) → swapped_val = 2
  ...
  Lane 30: receives from Lane 31 (30⊕1=31) → swapped_val = 31
  Lane 31: receives from Lane 30 (31⊕1=30) → swapped_val = 30

Cycle 3: Store results
  Lane 0: output[0] = 1
  Lane 1: output[1] = 0
  Lane 2: output[2] = 3
  Lane 3: output[3] = 2
  ...

Mathematical insight: This implements perfect pair exchange using XOR properties: \[\Large \text{XOR}(i, 1) = \begin{cases} i + 1 & \text{if } i \bmod 2 = 0 \\ i - 1 & \text{if } i \bmod 2 = 1 \end{cases}\]

Why shuffle_xor is superior:

Perfect symmetry: Every lane participates in exactly one pair
No coordination: All pairs exchange simultaneously
Hardware optimized: Single instruction for entire warp
Butterfly foundation: Building block for complex parallel algorithms

Performance characteristics:

Latency: 1 cycle (hardware register exchange)
Bandwidth: 0 bytes (no memory traffic)
Parallelism: All WARP_SIZE lanes exchange simultaneously
Scalability: \(O(1)\) complexity regardless of data size

2. Butterfly parallel maximum

Configuration

Vector size: SIZE = WARP_SIZE (32 or 64 depending on GPU)
Grid configuration: (1, 1) blocks per grid
Block configuration: (WARP_SIZE, 1) threads per block

Code to complete

Implement parallel maximum reduction using butterfly shuffle_xor with decreasing offsets.

Mathematical operation: Compute the maximum across all warp lanes using tree reduction: \[\Large \text{max_result} = \max_{i=0}^{\small\text{WARP_SIZE}-1} \text{input}[i]\]

Butterfly reduction pattern: Use XOR offsets starting from WARP_SIZE/2 down to 1 to create a binary tree where each step halves the active communication range:

Step 1: Compare with lanes WARP_SIZE/2 positions away (covers full warp)
Step 2: Compare with lanes WARP_SIZE/4 positions away (covers remaining range)
Step 3: Compare with lanes WARP_SIZE/8 positions away
Step 4: Continue halving until offset = 1

After \(\log_2(\text{WARP_SIZE})\) steps, all lanes have the global maximum. This works for any WARP_SIZE (32, 64, etc.).

fn butterfly_parallel_max[
    layout: Layout, size: Int
](
    output: LayoutTensor[mut=False, dtype, layout],
    input: LayoutTensor[mut=False, dtype, layout],
):
    """
    Parallel maximum reduction using butterfly pattern.
    Uses shuffle_xor with decreasing offsets starting from WARP_SIZE/2 down to 1.
    Each step reduces the active range by half until all threads have the maximum value.
    This implements an efficient O(log n) parallel reduction algorithm that works
    for any WARP_SIZE (32, 64, etc.).
    """
    global_i = block_dim.x * block_idx.x + thread_idx.x

    # FILL ME IN (roughly 7 lines)

Tips

1. Understanding butterfly reduction

The butterfly reduction creates a binary tree communication pattern. Think about how you can systematically reduce the problem size at each step.

Key questions:

What should be your starting offset to cover the maximum range?
How should the offset change between steps?
When should you stop the reduction?

Hint: The name “butterfly” comes from the communication pattern - try sketching it out for a small example.

2. XOR reduction properties

XOR creates non-overlapping communication pairs at each step. Consider why this is important for parallel reductions.

Think about:

How does XOR with different offsets create different communication patterns?
Why don’t lanes interfere with each other at the same step?
What makes XOR particularly well-suited for tree reductions?

3. Accumulating maximum values

Each lane needs to progressively build up knowledge of the maximum value in its “region”.

Algorithm structure:

Start with your own value
At each step, compare with a neighbor’s value
Keep the maximum and continue

Key insight: After each step, your “region of knowledge” doubles in size.

After final step: Each lane knows global maximum

4. Why this pattern works

The butterfly reduction guarantees that after \(\log_2(\text{WARP\_SIZE})\) steps:

Every lane has seen every other lane’s value indirectly
No redundant communication: Each pair exchanges exactly once per step
Optimal complexity: \(O(\log n)\) steps instead of \(O(n)\) sequential comparison

Trace example (4 lanes, values [3, 1, 7, 2]):

Initial: Lane 0=3, Lane 1=1, Lane 2=7, Lane 3=2

Step 1 (offset=2): 0 ↔ 2, 1 ↔ 3
  Lane 0: max(3, 7) = 7
  Lane 1: max(1, 2) = 2
  Lane 2: max(7, 3) = 7
  Lane 3: max(2, 1) = 2

Step 2 (offset=1): 0 ↔ 1, 2 ↔ 3
  Lane 0: max(7, 2) = 7
  Lane 1: max(2, 7) = 7
  Lane 2: max(7, 2) = 7
  Lane 3: max(2, 7) = 7

Result: All lanes have global maximum = 7

Test the butterfly parallel maximum:

uv run poe p24 --parallel-max

pixi run p24 --parallel-max

Expected output when solved:

WARP_SIZE:  32
SIZE:  32
output: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
expected: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
✅ Butterfly parallel max test passed!

Solution

fn butterfly_parallel_max[
    layout: Layout, size: Int
](
    output: LayoutTensor[mut=False, dtype, layout],
    input: LayoutTensor[mut=False, dtype, layout],
):
    """
    Parallel maximum reduction using butterfly pattern.
    Uses shuffle_xor with decreasing offsets (16, 8, 4, 2, 1) to perform tree-based reduction.
    Each step reduces the active range by half until all threads have the maximum value.
    This implements an efficient O(log n) parallel reduction algorithm.
    """
    global_i = block_dim.x * block_idx.x + thread_idx.x

    if global_i < size:
        max_val = input[global_i]

        # Butterfly reduction tree: dynamic for any WARP_SIZE (32, 64, etc.)
        # Start with half the warp size and reduce by half each step
        offset = WARP_SIZE // 2
        while offset > 0:
            max_val = max(max_val, shuffle_xor(max_val, offset))
            offset //= 2

        # All threads now have the maximum value across the entire warp
        output[global_i] = max_val

This solution demonstrates how shuffle_xor() creates efficient parallel reduction trees with \(O(\log n)\) complexity.

Complete algorithm analysis:

if global_i < size:
    max_val = input[global_i]  # Start with local value

    # Butterfly reduction tree: dynamic for any WARP_SIZE
    offset = WARP_SIZE // 2
    while offset > 0:
        max_val = max(max_val, shuffle_xor(max_val, offset))
        offset //= 2

    output[global_i] = max_val  # All lanes have global maximum

Butterfly execution trace (8-lane example, values [0,2,4,6,8,10,12,1000]):

Initial state:
  Lane 0: max_val = 0,    Lane 1: max_val = 2
  Lane 2: max_val = 4,    Lane 3: max_val = 6
  Lane 4: max_val = 8,    Lane 5: max_val = 10
  Lane 6: max_val = 12,   Lane 7: max_val = 1000

Step 1: shuffle_xor(max_val, 4) - Halves exchange
  Lane 0↔4: max(0,8)=8,     Lane 1↔5: max(2,10)=10
  Lane 2↔6: max(4,12)=12,   Lane 3↔7: max(6,1000)=1000
  Lane 4↔0: max(8,0)=8,     Lane 5↔1: max(10,2)=10
  Lane 6↔2: max(12,4)=12,   Lane 7↔3: max(1000,6)=1000

Step 2: shuffle_xor(max_val, 2) - Quarters exchange
  Lane 0↔2: max(8,12)=12,   Lane 1↔3: max(10,1000)=1000
  Lane 2↔0: max(12,8)=12,   Lane 3↔1: max(1000,10)=1000
  Lane 4↔6: max(8,12)=12,   Lane 5↔7: max(10,1000)=1000
  Lane 6↔4: max(12,8)=12,   Lane 7↔5: max(1000,10)=1000

Step 3: shuffle_xor(max_val, 1) - Pairs exchange
  Lane 0↔1: max(12,1000)=1000,  Lane 1↔0: max(1000,12)=1000
  Lane 2↔3: max(12,1000)=1000,  Lane 3↔2: max(1000,12)=1000
  Lane 4↔5: max(12,1000)=1000,  Lane 5↔4: max(1000,12)=1000
  Lane 6↔7: max(12,1000)=1000,  Lane 7↔6: max(1000,12)=1000

Final result: All lanes have max_val = 1000

Mathematical insight: This implements the parallel reduction operator with butterfly communication: \[\Large \text{Reduce}(\oplus, [a_0, a_1, \ldots, a_{n-1}]) = a_0 \oplus a_1 \oplus \cdots \oplus a_{n-1}\]

Where \(\oplus\) is the max operation and the butterfly pattern ensures optimal \(O(\log n)\) complexity.

Why butterfly reduction is superior:

Logarithmic complexity: \(O(\log n)\) vs \(O(n)\) for sequential reduction
Perfect load balancing: Every lane participates equally at each step
No memory bottlenecks: Pure register-to-register communication
Hardware optimized: Maps directly to GPU butterfly networks

Performance characteristics:

Steps: \(\log_2(\text{WARP_SIZE})\) (e.g., 5 for 32-thread, 6 for 64-thread warp)
Latency per step: 1 cycle (register exchange + comparison)
Total latency: \(\log_2(\text{WARP_SIZE})\) cycles vs \((\text{WARP_SIZE}-1)\) cycles for sequential
Parallelism: All lanes active throughout the algorithm

3. Butterfly conditional maximum

Configuration

Vector size: SIZE_2 = 64 (multi-block scenario)
Grid configuration: BLOCKS_PER_GRID_2 = (2, 1) blocks per grid
Block configuration: THREADS_PER_BLOCK_2 = (WARP_SIZE, 1) threads per block

Code to complete

Implement conditional butterfly reduction where even lanes store the maximum and odd lanes store the minimum.

Mathematical operation: Perform butterfly reduction for both maximum and minimum, then conditionally output based on lane parity: \[\Large \text{output}[i] = \begin{cases} \max_{j=0}^{\text{WARP_SIZE}-1} \text{input}[j] & \text{if } i \bmod 2 = 0 \\ \min_{j=0}^{\text{WARP_SIZE}-1} \text{input}[j] & \text{if } i \bmod 2 = 1 \end{cases}\]

Dual reduction pattern: Simultaneously track both maximum and minimum values through the butterfly tree, then conditionally output based on lane ID parity. This demonstrates how butterfly patterns can be extended for complex multi-value reductions.

alias SIZE_2 = 64
alias BLOCKS_PER_GRID_2 = (2, 1)
alias THREADS_PER_BLOCK_2 = (WARP_SIZE, 1)
alias layout_2 = Layout.row_major(SIZE_2)


fn butterfly_conditional_max[
    layout: Layout, size: Int
](
    output: LayoutTensor[mut=False, dtype, layout],
    input: LayoutTensor[mut=False, dtype, layout],
):
    """
    Conditional butterfly maximum: Perform butterfly max reduction, but only store result
    in even-numbered lanes. Odd-numbered lanes store the minimum value seen.
    Demonstrates conditional logic combined with butterfly communication patterns.
    """
    global_i = block_dim.x * block_idx.x + thread_idx.x
    lane = lane_id()

    if global_i < size:
        current_val = input[global_i]
        min_val = current_val

        # FILL ME IN (roughly 11 lines)

Tips

1. Dual-track butterfly reduction

This puzzle requires tracking TWO different values simultaneously through the butterfly tree. Think about how you can run multiple reductions in parallel.

Key questions:

How can you maintain both maximum and minimum values during the reduction?
Can you use the same butterfly pattern for both operations?
What variables do you need to track?

2. Conditional output logic

After completing the butterfly reduction, you need to output different values based on lane parity.

Consider:

How do you determine if a lane is even or odd?
Which lanes should output the maximum vs minimum?
How do you access the lane ID?

3. Butterfly reduction for both min and max

The challenge is efficiently computing both min and max in parallel using the same butterfly communication pattern.

Think about:

Do you need separate shuffle operations for min and max?
Can you reuse the same neighbor values for both operations?
How do you ensure both reductions complete correctly?

4. Multi-block boundary considerations

This puzzle uses multiple blocks. Consider how this affects the reduction scope.

Important considerations:

What’s the scope of each butterfly reduction?
How does the block structure affect lane numbering?
Are you computing global or per-block min/max values?

Test the butterfly conditional maximum:

uv run poe p24 --conditional-max

pixi run p24 --conditional-max

Expected output when solved:

WARP_SIZE:  32
SIZE_2:  64
output: [9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0]
expected: [9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0]
✅ Butterfly conditional max test passed!

Solution

fn butterfly_conditional_max[
    layout: Layout, size: Int
](
    output: LayoutTensor[mut=False, dtype, layout],
    input: LayoutTensor[mut=False, dtype, layout],
):
    """
    Conditional butterfly maximum: Perform butterfly max reduction, but only store result
    in even-numbered lanes. Odd-numbered lanes store the minimum value seen.
    Demonstrates conditional logic combined with butterfly communication patterns.
    """
    global_i = block_dim.x * block_idx.x + thread_idx.x
    lane = lane_id()

    if global_i < size:
        current_val = input[global_i]
        min_val = current_val

        # Butterfly reduction for both maximum and minimum: dynamic for any WARP_SIZE
        offset = WARP_SIZE // 2
        while offset > 0:
            neighbor_val = shuffle_xor(current_val, offset)
            current_val = max(current_val, neighbor_val)

            min_neighbor_val = shuffle_xor(min_val, offset)
            min_val = min(min_val, min_neighbor_val)

            offset //= 2

        # Conditional output: max for even lanes, min for odd lanes
        if lane % 2 == 0:
            output[global_i] = current_val  # Maximum
        else:
            output[global_i] = min_val  # Minimum

This solution demonstrates advanced butterfly reduction with dual tracking and conditional output.

Complete algorithm analysis:

if global_i < size:
    current_val = input[global_i]
    min_val = current_val  # Track minimum separately

    # Butterfly reduction for both max and min log_2(WARP_SIZE}) steps)
    offset = WARP_SIZE // 2
    while offset > 0:
        neighbor_val = shuffle_xor(current_val, offset)
        current_val = max(current_val, neighbor_val)    # Max reduction

        min_neighbor_val = shuffle_xor(min_val, offset)
        min_val = min(min_val, min_neighbor_val)        # Min reduction

        offset //= 2

    # Conditional output based on lane parity
    if lane % 2 == 0:
        output[global_i] = current_val  # Even lanes: maximum
    else:
        output[global_i] = min_val      # Odd lanes: minimum

Dual reduction execution trace (4-lane example, values [3, 1, 7, 2]):

Initial state:
  Lane 0: current_val=3, min_val=3
  Lane 1: current_val=1, min_val=1
  Lane 2: current_val=7, min_val=7
  Lane 3: current_val=2, min_val=2

Step 1: shuffle_xor(current_val, 2) and shuffle_xor(min_val, 2) - Halves exchange
  Lane 0↔2: max_neighbor=7, min_neighbor=7 → current_val=max(3,7)=7, min_val=min(3,7)=3
  Lane 1↔3: max_neighbor=2, min_neighbor=2 → current_val=max(1,2)=2, min_val=min(1,2)=1
  Lane 2↔0: max_neighbor=3, min_neighbor=3 → current_val=max(7,3)=7, min_val=min(7,3)=3
  Lane 3↔1: max_neighbor=1, min_neighbor=1 → current_val=max(2,1)=2, min_val=min(2,1)=1

Step 2: shuffle_xor(current_val, 1) and shuffle_xor(min_val, 1) - Pairs exchange
  Lane 0↔1: max_neighbor=2, min_neighbor=1 → current_val=max(7,2)=7, min_val=min(3,1)=1
  Lane 1↔0: max_neighbor=7, min_neighbor=3 → current_val=max(2,7)=7, min_val=min(1,3)=1
  Lane 2↔3: max_neighbor=2, min_neighbor=1 → current_val=max(7,2)=7, min_val=min(3,1)=1
  Lane 3↔2: max_neighbor=7, min_neighbor=3 → current_val=max(2,7)=7, min_val=min(1,3)=1

Final result: All lanes have current_val=7 (global max) and min_val=1 (global min)

Dynamic algorithm (works for any WARP_SIZE):

offset = WARP_SIZE // 2
while offset > 0:
    neighbor_val = shuffle_xor(current_val, offset)
    current_val = max(current_val, neighbor_val)

    min_neighbor_val = shuffle_xor(min_val, offset)
    min_val = min(min_val, min_neighbor_val)

    offset //= 2

Mathematical insight: This implements dual parallel reduction with conditional demultiplexing: \[\Large \begin{align} \text{max_result} &= \max_{i=0}^{n-1} \text{input}[i] \\ \text{min_result} &= \min_{i=0}^{n-1} \text{input}[i] \\ \text{output}[i] &= \text{lane_parity}(i) \; \text{?} \; \text{min_result} : \text{max_result} \end{align}\]

Why dual butterfly reduction works:

Independent reductions: Max and min reductions are mathematically independent
Parallel execution: Both can use the same butterfly communication pattern
Shared communication: Same shuffle operations serve both reductions
Conditional output: Lane parity determines which result to output

Performance characteristics:

Communication steps: \(\log_2(\text{WARP_SIZE})\) (same as single reduction)
Computation per step: 2 operations (max + min) vs 1 for single reduction
Memory efficiency: 2 registers per thread vs complex shared memory approaches
Output flexibility: Different lanes can output different reduction results

Summary

The shuffle_xor() primitive enables powerful butterfly communication patterns that form the foundation of efficient parallel algorithms. Through these three problems, you’ve mastered:

Core Butterfly Patterns

Pair Exchange (shuffle_xor(value, 1)):
- Creates perfect adjacent pairs: (0,1), (2,3), (4,5), …
- \(O(1)\) complexity with zero memory overhead
- Foundation for sorting networks and data reorganization
Tree Reduction (dynamic offsets: WARP_SIZE/2 → 1):
- Logarithmic parallel reduction: \(O(\log n)\) vs \(O(n)\) sequential
- Works for any associative operation (max, min, sum, etc.)
- Optimal load balancing across all warp lanes
Conditional Multi-Reduction (dual tracking + lane parity):
- Simultaneous multiple reductions in parallel
- Conditional output based on thread characteristics
- Advanced coordination without explicit synchronization

Key Algorithmic Insights

XOR Communication Properties:

shuffle_xor(value, mask) creates symmetric, non-overlapping pairs
Each mask creates a unique communication topology
Butterfly networks emerge naturally from binary XOR patterns

Dynamic Algorithm Design:

offset = WARP_SIZE // 2
while offset > 0:
    neighbor_val = shuffle_xor(current_val, offset)
    current_val = operation(current_val, neighbor_val)
    offset //= 2

Performance Advantages:

Hardware optimization: Direct register-to-register communication
No synchronization: SIMT execution guarantees correctness
Scalable complexity: \(O(\log n)\) for any WARP_SIZE (32, 64, etc.)
Memory efficiency: Zero shared memory requirements

Practical Applications

These butterfly patterns are fundamental to:

Parallel reductions: Sum, max, min, logical operations
Prefix/scan operations: Cumulative sums, parallel sorting
FFT algorithms: Signal processing and convolution
Bitonic sorting: Parallel sorting networks
Graph algorithms: Tree traversals and connectivity

The shuffle_xor() primitive transforms complex parallel coordination into elegant, hardware-optimized communication patterns that scale efficiently across different GPU architectures.