The Art of Slicing ; Partitioning Data Across Blocks and Threads
Difficulty: Beginner
Prerequisites: Tutorial 01: Hello, Layout!, CUDA thread blocks (blockIdx, threadIdx)
1. The Problem (The "Why")
You have a big matrix ; say 512×512 ; sitting in global memory. Your GPU launches a grid of thread blocks, each of which should work on a small tile (e.g., 128×128). Inside each block, 256 threads need to split that tile further so every thread handles its own little piece.
Without CuTe, you write something like:
int row_start = blockIdx.x * TILE_M;
int col_start = blockIdx.y * TILE_N;
int thr_row = threadIdx.x % THR_M;
int thr_col = threadIdx.x / THR_M;
// ... and then more math for every matrix, every stage, every layout change
This is fragile. Change the tile size? Rewrite the math. Transpose a matrix? Rewrite it again. Add a K-dimension for GEMM? Rewrite it again.
CuTe replaces all of this with two composable operations:
| Operation | Level | What it does |
|---|---|---|
local_tile |
Block (CTA) | "Give this block its tile from the big matrix." |
local_partition |
Thread | "Give this thread its elements from the tile." |
Both work on any shape, any stride, any number of dimensions.
B200 Note: When you program the TMA engine on Hopper/Blackwell, you still need
local_tileto select which tile the TMA should fetch. The TMA descriptor handles the copy, but you tell it where to point using the exact same tiling logic described here.
2. The Mental Model (The Visual)
Step 1: The Big Picture ; Tiling for CTAs
Imagine an 8×8 matrix divided into 4×4 tiles. Four thread blocks share the work:
Global Matrix (8×8)
col 0 col 1 col 2 col 3 col 4 col 5 col 6 col 7
┌───────────────────────────┬───────────────────────────┐
row 0 │ │ │
row 1 │ Block (0,0) │ Block (0,1) │
row 2 │ rows 0-3, cols 0-3 │ rows 0-3, cols 4-7 │
row 3 │ │ │
├───────────────────────────┼───────────────────────────┤
row 4 │ │ │
row 5 │ Block (1,0) │ Block (1,1) │
row 6 │ rows 4-7, cols 0-3 │ rows 4-7, cols 4-7 │
row 7 │ │ │
└───────────────────────────┴───────────────────────────┘
local_tile(matrix, Shape<4,4>{}, make_coord(blockIdx.x, blockIdx.y))
→ extracts one 4×4 tile for each block
local_tile applies a tiler (here Shape<4,4>) to produce tiles, then uses the block coordinate to pick the right one. Under the hood it calls zipped_divide, which reshapes the matrix into ((Tile), (Rest)) ; tile elements in the first mode, tile indices in the second ; then slices the rest-mode with your coordinate.
Step 2: Inside a Tile ; Partitioning for Threads
Now zoom into Block (0,0)'s 4×4 tile. We have 4 threads arranged as a 2×2 layout. Each thread gets a 2×2 chunk:
Block (0,0)'s Tile (4×4)
col 0 col 1 col 2 col 3
┌─────────────────┬─────────────────┐
row 0 │ T0 T0 │ T1 T1 │
row 1 │ T0 T0 │ T1 T1 │
├─────────────────┼─────────────────┤
row 2 │ T2 T2 │ T3 T3 │
row 3 │ T2 T2 │ T3 T3 │
└─────────────────┴─────────────────┘
Thread layout: (2,2):(1,2) → 4 threads in column-major
Tile shape: (4,4)
Each thread: (4/2, 4/2) = (2,2) elements
local_partition(tile, Layout<Shape<2,2>>{}, threadIdx.x)
→ Thread 0 gets the top-left 2×2 subtensor
local_partition takes a thread layout ; a Layout that maps coordinates to thread indices ; and uses zipped_divide to split the tile into thread-sized pieces, then slices into the tile-mode with the thread's coordinate (derived from its flat index via the thread layout).
Key Insight:
local_tileslices the rest (picks which tile).local_partitionslices the tile (picks which thread's elements). They are two sides of the samezipped_dividecoin.
3. The Solution (The Code)
A CUDA kernel that tiles a matrix across blocks, partitions each tile across threads, and prints what each agent owns.
#include <cute/tensor.hpp>
using namespace cute;
template <class Shape_MN, class CtaTiler, class ThreadLayout>
__global__ void partition_kernel(Shape_MN shape_mn,
CtaTiler cta_tiler,
ThreadLayout thr_layout)
{
// 1. Build the full matrix as a counting tensor (value at (i,j) = flat index)
// In real code this would be make_tensor(make_gmem_ptr(ptr), shape, stride)
auto matrix = make_counting_tensor(make_layout(shape_mn)); // (M,N)
// 2. CTA-level: each block picks its tile
auto cta_coord = make_coord(blockIdx.x, blockIdx.y);
auto cta_tile = local_tile(matrix, cta_tiler, cta_coord); // (BLK_M,BLK_N)
// 3. Thread-level: each thread picks its elements from the tile
auto thr_tile = local_partition(cta_tile, thr_layout, threadIdx.x); // (THR_M,THR_N)
// 4. Print from Block (0,0), Thread 0 only
if (blockIdx.x == 0 && blockIdx.y == 0 && threadIdx.x == 0) {
printf("=== Full matrix layout ===\n");
print(matrix.layout()); printf("\n\n");
printf("=== Block (0,0) tile ===\n");
print(cta_tile.layout()); printf("\n");
for (int m = 0; m < size<0>(cta_tile); ++m) {
for (int n = 0; n < size<1>(cta_tile); ++n) {
printf("%3d ", int(cta_tile(m, n)));
}
printf("\n");
}
printf("\n=== Block (0,0), Thread 0 partition ===\n");
print(thr_tile.layout()); printf("\n");
printf("Thread 0 owns elements: ");
for (int i = 0; i < size(thr_tile); ++i) {
printf("%d ", int(thr_tile(i)));
}
printf("\n");
}
}
int main()
{
// 8×8 matrix, tiled into 4×4 blocks, 4 threads per block (2×2)
auto shape_mn = make_shape(Int<8>{}, Int<8>{});
auto cta_tiler = make_shape(Int<4>{}, Int<4>{});
auto thr_layout = make_layout(make_shape(Int<2>{}, Int<2>{})); // (m,n)->thr_idx
dim3 grid(size(ceil_div(get<0>(shape_mn), get<0>(cta_tiler))),
size(ceil_div(get<1>(shape_mn), get<1>(cta_tiler)))); // 2×2 blocks
dim3 block(size(thr_layout)); // 4 threads
partition_kernel<<<grid, block>>>(shape_mn, cta_tiler, thr_layout);
cudaDeviceSynchronize();
return 0;
}
Expected Output:
=== Full matrix layout ===
(_8,_8):(_1,_8)
=== Block (0,0) tile ===
(_4,_4):(_1,_8)
0 8 16 24
1 9 17 25
2 10 18 26
3 11 19 27
=== Block (0,0), Thread 0 partition ===
(_2,_2):(_1,_8)
Thread 0 owns elements: 0 1 8 9
Block (0,0) gets the top-left 4×4 tile (addresses 0–3 down column 0, 8–11 down column 1, etc. ; column-major). Thread 0 gets the top-left 2×2 corner of that tile.
4. Step-by-Step Explanation
Line: auto matrix = make_counting_tensor(make_layout(shape_mn));
make_counting_tensorcreates a "virtual" tensor where the value at each coordinate equals its flat index. No actual memory ; it's a trick for visualizing the Layout's mapping. In a real kernel, replace this withmake_tensor(make_gmem_ptr(ptr), shape, stride).
Line: auto cta_tile = local_tile(matrix, cta_tiler, cta_coord);
local_tileis CuTe's inner partition. Under the hood:zipped_divide(matrix, cta_tiler)reshapes the(8,8)matrix into((4,4),(2,2)); a 4×4 tile mode and a 2×2 "rest" mode (2 tiles per row, 2 tiles per column).- The
cta_coordslices the rest-mode:(blockIdx.x, blockIdx.y)picks one tile.
- Block (0,0) gets the tile at rest-coordinate (0,0) ; the top-left 4×4 region.
Line: auto thr_tile = local_partition(cta_tile, thr_layout, threadIdx.x);
local_partitionis CuTe's outer partition. Under the hood:- It takes the shape of
thr_layout; here(2,2); and uses it as a tiler on the(4,4)tile, producing((2,2),(2,2)); 2×2 elements per thread, tiled over 2×2 threads. - It converts
threadIdx.xinto a coordinate usingthr_layout. Thread 0 → coord(0,0), Thread 1 →(1,0), Thread 2 →(0,1), Thread 3 →(1,1). - That coordinate slices the tile-mode, so each thread gets the matching 2×2 block of elements.
- It takes the shape of
Line: ceil_div(get<0>(shape_mn), get<0>(cta_tiler))
ceil_div(8, 4) = 2. This gives us 2 blocks in the M-dimension (and 2 in N), for a total of 4 blocks.
cosize (used in real kernels for shared memory):
cosize(layout)returns the size of the layout's codomain ; the number of unique addresses needed. Use it to allocate shared memory:__shared__ float smem[cosize_v<SmemLayout>];.
5. Engineer's Notebook (Latent Space Notes)
Analogy: Think of partitioning as dealing cards at a poker table:
local_tile= The casino floor manager assigns a table (tile) to each dealer (CTA). "Table 0, you get the top-left quarter of the deck."local_partition= The dealer at each table divides the cards among the players (threads). "Player 0, you get these 4 cards."zipped_divide= The reorganization step that makes dealing possible. It splits the deck into((hand), (tables))so you can ask for any specific table's hand in one shot.
The key invariant: every element is owned by exactly one block and exactly one thread within that block. No overlap, no gaps (assuming tile sizes evenly divide the matrix ; see the predication tutorial for handling remainders).
Hardware Note: On Hopper/B200, the tile coordinate you compute with local_tile is the same coordinate you pass to cute::copy with a TMA atom. The TMA engine uses the descriptor's shape/stride (from Tutorial 01) and this coordinate to DMA the right tile into shared memory ; zero address math in registers.
The local_tile / local_partition Cheat-Sheet:
| Function | Level | Slices into... | You provide... | You get... |
|---|---|---|---|---|
local_tile |
CTA | Rest (which tile) | Tile shape + block coord | The tile data |
local_partition |
Thread | Tile (which thread) | Thread layout + thread idx | That thread's elements |
Under the hood ; both use zipped_divide:
zipped_divide(data, tiler) → ((Tile), (Rest))
│ │
local_partition local_tile
slices here slices here