Latest Thoughts

I write to clear my mind and share what I learn.

CUDA

Cute-DSL: I Wrote a CUDA Kernel in Python and My GPU Didn't Even Cry

Welcome to the ultimate guide to cute-dsl! It brings the power of CuTe concepts like Layouts, Tilers, and vectorized memory operations into a familiar, Pythonic interface.

CUDA

WGMMA: Warpgroup MMA

How to use Warpgroup MMA (WGMMA) to feed NVIDIA Tensor Cores directly from shared memory, bypassing the register file bottleneck.
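
To give a flavor of the shape of the API (a sketch only; the 64x64x16 atom name and operand majors are my assumptions based on CUTLASS's CuTe SM90 headers, not the post's code):

```cpp
// Sketch only: warpgroup MMA sourcing both operands from shared memory.
// Compile for sm_90a with CUTLASS's CuTe headers; atom name is an assumption.
#include <cute/tensor.hpp>
using namespace cute;

__device__ void wgmma_flavor() {
  // One atom = 128 threads (a warpgroup) cooperating on a 64x64x16 MMA.
  // The _SS suffix: A and B both stream straight from shared memory, so the
  // big operand tiles never occupy the register file -- only accumulators do.
  TiledMMA mma = make_tiled_mma(
      SM90_64x64x16_F16F16F16_SS<GMMA::Major::K, GMMA::Major::K>{});
  // From here the pattern matches other TiledMMAs: get_slice(threadIdx.x),
  // partition the smem tensors, then cute::gemm(mma, tCsA, tCsB, tCrC).
}
```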

CUDA

The TMA Revolution (Async Copy)

NVIDIA introduced the Tensor Memory Accelerator (TMA) with the Hopper architecture and carried it forward into Blackwell. Instead of threads manually calculating pointers and copying data element by element, a single thread can offload the entire tile copy to dedicated hardware.
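
As a rough sketch of what that looks like in CuTe (Hopper or newer; the tile shape and types here are my assumptions, not the post's code):

```cpp
// Host-side sketch: building a TMA load with CuTe (assumed shapes and types).
#include <cute/tensor.hpp>
using namespace cute;

auto build_tma_load(half_t const* gptr, int M, int K) {
  // The full matrix in global memory, K-major.
  Tensor gA = make_tensor(make_gmem_ptr(gptr), make_shape(M, K),
                          make_stride(K, Int<1>{}));
  // The shared-memory tile the hardware will fill on each request.
  auto smem_layout = make_layout(make_shape(Int<128>{}, Int<64>{}),
                                 GenRowMajor{});
  // Encodes the TMA descriptor once on the host. In the kernel, one elected
  // thread issues the copy and an mbarrier signals when the tile has landed;
  // the other threads simply wait instead of computing addresses.
  return make_tma_copy(SM90_TMA_LOAD{}, gA, smem_layout);
}
```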

CUDA

The Global GEMM: Putting It All Together

Writing a complete three-level tiled GEMM kernel from scratch using CuTe's TiledCopy, TiledMMA, and swizzled shared memory.
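
Here is a skeleton of those three levels, following the pattern of the CuTe SGEMM tutorial (tile sizes are my assumptions; the mainloop and epilogue are elided):

```cpp
// Skeleton of the three tiling levels (simplified; not the post's full kernel).
#include <cute/tensor.hpp>
using namespace cute;

__global__ void gemm_sketch(float const* A, float const* B, float* C,
                            int M, int N, int K) {
  // Full problem, column-major, in global memory.
  Tensor mA = make_tensor(make_gmem_ptr(A), make_shape(M, K), make_stride(Int<1>{}, M));
  Tensor mB = make_tensor(make_gmem_ptr(B), make_shape(N, K), make_stride(Int<1>{}, N));
  Tensor mC = make_tensor(make_gmem_ptr(C), make_shape(M, N), make_stride(Int<1>{}, M));

  // Level 1 -- CTA tiles: this block owns a 128x128 patch of C
  // and marches over K in steps of 8.
  auto cta_tiler = make_shape(Int<128>{}, Int<128>{}, Int<8>{});
  auto cta_coord = make_coord(blockIdx.x, blockIdx.y, _);
  Tensor gA = local_tile(mA, cta_tiler, cta_coord, Step<_1,  X, _1>{}); // (128,  8, k)
  Tensor gB = local_tile(mB, cta_tiler, cta_coord, Step< X, _1, _1>{}); // (128,  8, k)
  Tensor gC = local_tile(mC, cta_tiler, cta_coord, Step<_1, _1,  X>{}); // (128, 128)

  // Level 2 -- thread tiles: 256 threads split each k-slice for the
  // gmem -> smem TiledCopy.
  auto tA = make_layout(make_shape(Int<32>{}, Int<8>{}));
  Tensor tAgA = local_partition(gA, tA, threadIdx.x); // this thread's loads

  // Level 3 -- MMA tiles: a TiledMMA would partition the swizzled smem tiles
  // and drive Tensor Cores over each k-slice; see the TiledMMA and Swizzle posts.
}
```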

CUDA

Hello, MMA: Your First Tensor Core Instruction

How to use CuTe's TiledMMA to execute a matrix multiply-accumulate on NVIDIA Tensor Cores.
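
A minimal sketch of the idea, assuming the SM80 16x8x16 fp16 atom (one warp, fp32 accumulation; the post's actual setup may differ):

```cpp
// One Tensor Core MMA via CuTe's TiledMMA (assumed SM80 atom; one warp).
#include <cute/tensor.hpp>
using namespace cute;

__global__ void mma_sketch(half_t const* A, half_t const* B, float* C) {
  // 16x8x16 fp16 Tensor Core instruction with fp32 accumulators.
  TiledMMA mma = make_tiled_mma(SM80_16x8x16_F32F16F16F32_TN{});

  // K-major (row-major) operands matching the TN atom's expectations.
  Tensor gA = make_tensor(make_gmem_ptr(A), make_shape(Int<16>{}, Int<16>{}), GenRowMajor{});
  Tensor gB = make_tensor(make_gmem_ptr(B), make_shape(Int< 8>{}, Int<16>{}), GenRowMajor{});
  Tensor gC = make_tensor(make_gmem_ptr(C), make_shape(Int<16>{}, Int< 8>{}), GenRowMajor{});

  auto thr = mma.get_slice(threadIdx.x);
  Tensor tAgA = thr.partition_A(gA);           // this thread's view of A
  Tensor tBgB = thr.partition_B(gB);           // this thread's view of B
  Tensor tCgC = thr.partition_C(gC);           // this thread's view of C
  Tensor tArA = thr.partition_fragment_A(gA);  // matching register fragments
  Tensor tBrB = thr.partition_fragment_B(gB);
  Tensor tCrC = thr.partition_fragment_C(gC);

  copy(tAgA, tArA);  copy(tBgB, tBrB);  clear(tCrC);
  gemm(mma, tArA, tBrB, tCrC);   // issues the mma.sync PTX under the hood
  copy(tCrC, tCgC);
}
```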

CUDA

Swizzling: Avoiding Shared Memory Bank Conflicts

How CuTe's Swizzle XORs address bits to eliminate shared memory bank conflicts with a single line of code.
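
A runnable taste of that one line, on a deliberately tiny 8x8 tile (real smem tiles are larger; the tile size is my assumption):

```cpp
// Swizzle<3,0,3>: XOR the 3 row bits into the 3 column bits of an
// 8x8 row-major layout, so each row is permuted differently.
#include <cute/tensor.hpp>
#include <cstdio>
using namespace cute;

int main() {
  auto naive    = make_layout(make_shape(Int<8>{}, Int<8>{}), GenRowMajor{});
  auto swizzled = composition(Swizzle<3, 0, 3>{}, naive);  // the "single line"

  // Same (row, col) coordinate, two different offsets: in the naive layout a
  // column of accesses shares banks; the swizzled layout scatters them.
  for (int r = 0; r < 8; ++r) {
    for (int c = 0; c < 8; ++c)
      printf("%2d->%2d  ", int(naive(r, c)), int(swizzled(r, c)));
    printf("\n");
  }
}
```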

CUDA

The Parallel Copy: Orchestrating Threads with TiledCopy

How TiledCopy bundles thread layout, copy atoms, and value layout into one declarative object for coordinated, vectorized parallel copies.
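
A sketch of those three ingredients bundled together (the shapes here are my assumptions, not the post's):

```cpp
// A TiledCopy from three parts: 32x8 threads, a 128-bit copy atom, and a
// 1x8 strip of halfs per thread (8 x 2 bytes = one 16-byte transaction).
#include <cute/tensor.hpp>
using namespace cute;

auto make_a_tiled_copy() {
  auto thr_layout = make_layout(make_shape(Int<32>{}, Int<8>{}));  // who copies
  auto val_layout = make_layout(make_shape(Int< 1>{}, Int<8>{}));  // what each owns
  // The copy atom fixes the instruction: one 128-bit vector load/store.
  return make_tiled_copy(Copy_Atom<UniversalCopy<uint128_t>, half_t>{},
                         thr_layout, val_layout);
  // In a kernel: auto thr = tiled_copy.get_slice(threadIdx.x);
  //              copy(tiled_copy, thr.partition_S(gA), thr.partition_D(sA));
}
```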

CUDA

The Naive Copy: Scalar vs. Vectorized Memory Movement

Why scalar copies leave 75% of memory bandwidth on the table, and how CuTe's auto-vectorization fixes it.
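
The gap is easy to see even in plain CUDA, before CuTe enters the picture (a sketch, not the post's benchmark code):

```cpp
// Scalar vs. vectorized copy: same bytes moved, 4x fewer instructions.
#include <cuda_runtime.h>

// Scalar: each thread moves one 4-byte float per load/store instruction.
__global__ void copy_scalar(const float* __restrict__ in,
                            float* __restrict__ out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[i];
}

// Vectorized: each thread moves one 16-byte float4, i.e. 4x the bytes per
// instruction -- the width CuTe's auto-vectorization targets automatically.
__global__ void copy_vec(const float4* __restrict__ in,
                         float4* __restrict__ out, int n4) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n4) out[i] = in[i];
}
```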

CUDA

The Art of Slicing: Partitioning Data Across Blocks and Threads

How CuTe's local_tile and local_partition replace manual index math to slice matrices across CTAs and threads.
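
The two calls in miniature (tile and thread-grid sizes are my assumptions):

```cpp
// Carve a matrix first across CTAs, then across threads -- no index math.
#include <cute/tensor.hpp>
using namespace cute;

__global__ void slice_sketch(float const* A, int M, int N) {
  // Whole matrix, row-major: shape (M, N), stride (N, 1).
  Tensor mA = make_tensor(make_gmem_ptr(A), make_shape(M, N),
                          make_stride(N, Int<1>{}));

  // Across CTAs: this block's 128x64 tile; no row*ld + col arithmetic anywhere.
  Tensor gA = local_tile(mA, make_shape(Int<128>{}, Int<64>{}),
                         make_coord(blockIdx.x, blockIdx.y));

  // Across threads: spread the tile over a 16x16 grid of threads;
  // each thread ends up with an 8x4 view of exactly its own elements.
  Tensor tA = local_partition(gA, make_layout(make_shape(Int<16>{}, Int<16>{})),
                              threadIdx.x);
}
```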

CUDA

Hello, Layout! Visualizing Memory in CuTe

Understanding CuTe Layouts: how shape and stride turn flat memory into multidimensional grids.
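
The core idea fits in a few lines that run on the host (the 4x8 example shape is mine):

```cpp
// Same shape, two strides: two different walks through flat memory.
#include <cute/tensor.hpp>
using namespace cute;

int main() {
  auto shape = make_shape(Int<4>{}, Int<8>{});  // 4 rows, 8 columns
  // Column-major: step 1 to move down a column, 4 to move across a row.
  auto col_major = make_layout(shape, make_stride(Int<1>{}, Int<4>{}));
  // Row-major: step 8 to move down a column, 1 to move across a row.
  auto row_major = make_layout(shape, make_stride(Int<8>{}, Int<1>{}));
  print_layout(col_major);  // (i,j) -> i*1 + j*4
  print_layout(row_major);  // (i,j) -> i*8 + j*1
}
```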

Life

My 2 Cents on Doing Hard Things

Reflections on why hard things are worth doing and how to keep going when it gets tough.

CUDA

Beating PyTorch: Writing a Faster Softmax Kernel in CUDA

Writing a CUDA Softmax kernel that outperforms PyTorch's built-in implementation.
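
For flavor, here is a minimal numerically stable version with one warp per row; the post's winning kernel is, I assume, considerably more elaborate than this sketch:

```cpp
// Minimal row-wise softmax: one warp per row, warp-shuffle reductions.
// Launch as softmax_rows<<<rows, 32>>>(x, y, rows, cols).
#include <cuda_runtime.h>
#include <math.h>

__global__ void softmax_rows(const float* x, float* y, int rows, int cols) {
  int row  = blockIdx.x;
  int lane = threadIdx.x;  // 32 threads per block = one warp
  if (row >= rows) return;
  const float* xr = x + (size_t)row * cols;
  float*       yr = y + (size_t)row * cols;

  // Pass 1: row max (for numerical stability), reduced across the warp.
  float m = -INFINITY;
  for (int c = lane; c < cols; c += 32) m = fmaxf(m, xr[c]);
  for (int o = 16; o > 0; o >>= 1) m = fmaxf(m, __shfl_xor_sync(0xffffffff, m, o));

  // Pass 2: sum of shifted exponentials, same warp reduction.
  float s = 0.f;
  for (int c = lane; c < cols; c += 32) s += expf(xr[c] - m);
  for (int o = 16; o > 0; o >>= 1) s += __shfl_xor_sync(0xffffffff, s, o);

  // Pass 3: normalize.
  for (int c = lane; c < cols; c += 32) yr[c] = expf(xr[c] - m) / s;
}
```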

Machine Learning

Stable Diffusion 1.5: How I Optimized It

A detailed worklog on optimizing Stable Diffusion 1.5 for performance.

Logic

Propositional Logic

A deep dive into the fundamental building blocks of mathematical logic.

Machine Learning

Raw Dawgging Linear Regression

Understanding Linear Regression by building it from the ground up.
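
Assuming the post follows the standard least-squares derivation (my assumption), its destination is the normal equations:

```latex
% Ordinary least squares: minimize the squared residual over \beta.
% Setting the gradient to zero yields the normal equations and the
% closed-form estimator.
\min_{\beta} \|y - X\beta\|^2
\;\xrightarrow{\;\nabla_{\beta}\, =\, 0\;}\;
X^{\top} X \hat{\beta} = X^{\top} y
\quad\Longrightarrow\quad
\hat{\beta} = (X^{\top} X)^{-1} X^{\top} y
```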