Latest Thoughts
I write to clear my mind and share what I learn.
Cute-DSL: I Wrote a CUDA Kernel in Python and My GPU Didn't Even Cry
Welcome to the ultimate guide to cute-dsl! It brings the power of CuTe concepts like Layouts, Tilers, and vectorized memory operations into a familiar, Pythonic interface.
CUDA · WGMMA: Warpgroup MMA
How to use Warpgroup MMA (WGMMA) to feed NVIDIA Tensor Cores directly from shared memory, bypassing the register file bottleneck.
CUDA · The TMA Revolution (Async Copy)
With the Hopper architecture (carried forward into Blackwell), NVIDIA introduced the Tensor Memory Accelerator (TMA). Instead of having threads manually calculate pointers and copy data, a single thread can offload the entire tile copy to dedicated hardware.
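A heavily abridged CuTe sketch of that pattern, assuming Hopper (SM90); the GMEM tensor, shared-memory layout, mbarrier, and partitioned tensors are placeholders for what a full kernel would set up:

```cpp
#include <cute/tensor.hpp>
using namespace cute;

// Host side: build the TMA descriptor once from the GMEM tensor and the
// SMEM tile layout. SM90_TMA_LOAD is CuTe's Hopper bulk-copy op.
template <class GmemTensor, class SmemLayout>
auto build_tma_load(GmemTensor gA, SmemLayout smem_layout) {
  return make_tma_copy(SM90_TMA_LOAD{}, gA, smem_layout);
}

// Device side (inside the kernel): a single elected thread kicks off the
// whole tile copy; the mbarrier `mbar` signals the block when data lands.
//   if (threadIdx.x == 0) {
//     copy(tma_load.with(mbar), tAgA, tAsA);
//   }
```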
CUDA · The Global GEMM: Putting It All Together
Writing a complete three-level tiled GEMM kernel from scratch using CuTe's TiledCopy, TiledMMA, and swizzled shared memory.
CUDA · Hello, MMA: Your First Tensor Core Instruction
How to use CuTe's TiledMMA to execute a matrix multiply-accumulate on NVIDIA Tensor Cores.
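As a flavor of the API (a sketch assuming an Ampere-class GPU; the atom below is one of CuTe's stock SM80 MMA atoms, not necessarily the one the post uses):

```cpp
#include <cute/tensor.hpp>
using namespace cute;

int main() {
  // Wrap one m16n8k8 FP16 Tensor Core instruction in a TiledMMA that
  // spans a single warp.
  auto tiled_mma = make_tiled_mma(SM80_16x8x8_F16F16F16F16_TN{});
  print_latex(tiled_mma);  // renders which thread owns which A/B/C element
  // Inside a kernel, gemm(tiled_mma, tCrA, tCrB, tCrC) issues the MMA.
}
```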
CUDA · Swizzling: Avoiding Shared Memory Bank Conflicts
How CuTe's Swizzle XORs address bits to eliminate shared memory bank conflicts with a single line of code.
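For a taste, a minimal host-side sketch (the 8x64 tile shape and the Swizzle<3,3,3> parameters are illustrative, not necessarily the post's exact choices):

```cpp
#include <cute/tensor.hpp>
using namespace cute;

int main() {
  // Plain row-major 8x64 tile: threads walking a column all hit the
  // same bank. Composing with Swizzle<3,3,3> XORs three address bits,
  // spreading those accesses across the banks.
  auto plain    = make_layout(make_shape(Int<8>{}, Int<64>{}),
                              make_stride(Int<64>{}, Int<1>{}));
  auto swizzled = composition(Swizzle<3,3,3>{}, plain);

  print_layout(plain);            // the conflicted offsets
  print(swizzled); print("\n");   // the XOR-permuted layout
}
```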
CUDA · The Parallel Copy: Orchestrating Threads with TiledCopy
How TiledCopy bundles thread layout, copy atoms, and value layout into one declarative object for coordinated, vectorized parallel copies.
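A minimal sketch of such a declaration, mirroring CuTe's tiled-copy tutorial (the thread and value shapes are illustrative; it assumes a column-major source tile so each thread's four floats are contiguous):

```cpp
#include <cute/tensor.hpp>
using namespace cute;

int main() {
  // 256 threads arranged 32x8, each moving 4x1 floats, so every thread
  // issues a single 128-bit transaction via the uint128_t copy atom.
  auto tiled_copy = make_tiled_copy(
      Copy_Atom<UniversalCopy<uint128_t>, float>{},   // the copy instruction
      make_layout(make_shape(Int<32>{}, Int<8>{})),   // thread layout
      make_layout(make_shape(Int<4>{}, Int<1>{})));   // values per thread
  print_latex(tiled_copy);  // renders the thread/value assignment
  // In a kernel: copy(tiled_copy, thr.partition_S(gA), thr.partition_D(sA));
}
```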
CUDA · The Naive Copy: Scalar vs. Vectorized Memory Movement
Why scalar copies leave 75% of memory bandwidth on the table, and how CuTe's auto-vectorization fixes it.
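To illustrate the gap (a standalone example, not code from the post): the same copy issued as 32-bit scalar loads versus 128-bit float4 loads:

```cuda
__global__ void copy_scalar(const float* __restrict__ in,
                            float* __restrict__ out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[i];      // one 4-byte transaction per thread
}

__global__ void copy_vec4(const float4* __restrict__ in,
                          float4* __restrict__ out, int n4) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n4) out[i] = in[i];     // one 16-byte transaction per thread
}
```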
CUDA · The Art of Slicing: Partitioning Data Across Blocks and Threads
How CuTe's local_tile and local_partition replace manual index math to slice matrices across CTAs and threads.
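A sketch of the two calls, assuming a row-major MxN float matrix and illustrative 128x64 CTA tiles split across 16x16 threads:

```cuda
#include <cute/tensor.hpp>
using namespace cute;

__global__ void slice_sketch(const float* ptr, int M, int N) {
  Tensor mA = make_tensor(make_gmem_ptr(ptr),
                          make_shape(M, N), make_stride(N, Int<1>{}));
  // local_tile: carve out this CTA's 128x64 tile of the full matrix.
  Tensor gA = local_tile(mA, make_shape(Int<128>{}, Int<64>{}),
                         make_coord(blockIdx.x, blockIdx.y));
  // local_partition: hand each of the 256 threads its piece of the tile.
  Tensor tA = local_partition(gA, make_layout(make_shape(Int<16>{}, Int<16>{})),
                              threadIdx.x);
  // tA now indexes exactly the elements this thread owns -- no index math.
}
```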
CUDA · Hello, Layout! Visualizing Memory in CuTe
Understanding CuTe Layouts: how shape and stride turn flat memory into multidimensional grids.
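The core idea in a few lines (a standalone sketch; the 4x8 shape is just an example):

```cpp
#include <cute/tensor.hpp>
using namespace cute;

int main() {
  // A Layout is (shape, stride): it maps a logical coordinate to a
  // linear offset. Same 4x8 grid, two different memory orders:
  auto row_major = make_layout(make_shape(Int<4>{}, Int<8>{}),
                               make_stride(Int<8>{}, Int<1>{}));  // (i,j) -> i*8 + j
  auto col_major = make_layout(make_shape(Int<4>{}, Int<8>{}),
                               make_stride(Int<1>{}, Int<4>{}));  // (i,j) -> i + j*4

  print_layout(row_major);  // prints the 2-D grid of offsets
  print_layout(col_major);
}
```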
Life · My 2 Cents on Doing Hard Things
Reflections on why hard things are worth doing and how to keep going when it gets tough.
CUDA · Beating PyTorch: Writing a Faster Softmax Kernel in CUDA
Writing a softmax kernel in CUDA that beats PyTorch's implementation.
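For orientation, a minimal numerically stable baseline (one warp per row with shuffle reductions; a sketch, not the optimized kernel from the post):

```cuda
#include <math.h>

// Launch with <<<rows, 32>>>: one warp handles one row.
__global__ void softmax_rows(const float* __restrict__ in,
                             float* __restrict__ out, int rows, int cols) {
  int row = blockIdx.x, lane = threadIdx.x;
  if (row >= rows) return;
  const float* x = in  + (size_t)row * cols;
  float*       y = out + (size_t)row * cols;

  float m = -INFINITY;                          // 1) row max (for stability)
  for (int c = lane; c < cols; c += 32) m = fmaxf(m, x[c]);
  for (int o = 16; o > 0; o >>= 1) m = fmaxf(m, __shfl_xor_sync(0xffffffff, m, o));

  float s = 0.f;                                // 2) sum of exp(x - max)
  for (int c = lane; c < cols; c += 32) s += expf(x[c] - m);
  for (int o = 16; o > 0; o >>= 1) s += __shfl_xor_sync(0xffffffff, s, o);

  for (int c = lane; c < cols; c += 32)         // 3) normalize
    y[c] = expf(x[c] - m) / s;
}
```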
Machine Learning · Stable Diffusion 1.5: How I Optimized It
A detailed worklog on optimizing Stable Diffusion 1.5 for performance.
Logic · Propositional Logic
A deep dive into the fundamental building blocks of mathematical logic.
Machine Learning · Raw Dawgging Linear Regression
Understanding Linear Regression by building it from the ground up.