Latest Thoughts
I write to clear my mind and share what I learn.
Cute-DSL: I Wrote a CUDA Kernel in Python and My GPU Didn't Even Cry
Welcome to the ultimate guide to cute-dsl! It brings the power of CuTe concepts like Layouts, Tilers, and vectorized memory operations into a familiar, Pythonic interface.
CUDA · WGMMA: Warpgroup MMA
How to use Warpgroup MMA (WGMMA) to feed NVIDIA Tensor Cores directly from shared memory, bypassing the register file bottleneck.
CUDA · The TMA Revolution (Async Copy)
With the Hopper architecture (carried forward into Blackwell), NVIDIA introduced the Tensor Memory Accelerator (TMA). Instead of having threads manually calculate pointers and copy data, a single thread can offload the entire tile copy to dedicated hardware.
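A heavily abridged CuTe sketch of that pattern, assuming Hopper (SM90); the GMEM tensor, shared-memory layout, mbarrier, and partitioned tensors are placeholders for what a full kernel would set up:

```cpp
#include <cute/tensor.hpp>
using namespace cute;

// Host side: build the TMA descriptor once from the GMEM tensor and the
// SMEM tile layout. SM90_TMA_LOAD is CuTe's Hopper bulk-copy op.
template <class GmemTensor, class SmemLayout>
auto build_tma_load(GmemTensor gA, SmemLayout smem_layout) {
  return make_tma_copy(SM90_TMA_LOAD{}, gA, smem_layout);
}

// Device side (inside the kernel): a single elected thread kicks off the
// whole tile copy; the mbarrier `mbar` signals the block when data lands.
//   if (threadIdx.x == 0) {
//     copy(tma_load.with(mbar), tAgA, tAsA);
//   }
```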
CUDA · The Global GEMM: Putting It All Together
Writing a complete three-level tiled GEMM kernel from scratch using CuTe's TiledCopy, TiledMMA, and swizzled shared memory.
CUDA · Hello, MMA: Your First Tensor Core Instruction
How to use CuTe's TiledMMA to execute a matrix multiply-accumulate on NVIDIA Tensor Cores.
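As a flavor of the API (a sketch assuming an Ampere-class GPU; the atom below is one of CuTe's stock SM80 MMA atoms, not necessarily the one the post uses):

```cpp
#include <cute/tensor.hpp>
using namespace cute;

int main() {
  // Wrap one m16n8k8 FP16 Tensor Core instruction in a TiledMMA that
  // spans a single warp.
  auto tiled_mma = make_tiled_mma(SM80_16x8x8_F16F16F16F16_TN{});
  print_latex(tiled_mma);  // renders which thread owns which A/B/C element
  // Inside a kernel, gemm(tiled_mma, tCrA, tCrB, tCrC) issues the MMA.
}
```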
CUDA · Swizzling: Avoiding Shared Memory Bank Conflicts
How CuTe's Swizzle XORs address bits to eliminate shared memory bank conflicts with a single line of code.
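For a taste, a minimal host-side sketch (the 8x64 tile shape and the Swizzle<3,3,3> parameters are illustrative, not necessarily the post's exact choices):

```cpp
#include <cute/tensor.hpp>
using namespace cute;

int main() {
  // Plain row-major 8x64 tile: threads walking a column all hit the
  // same bank. Composing with Swizzle<3,3,3> XORs three address bits,
  // spreading those accesses across the banks.
  auto plain    = make_layout(make_shape(Int<8>{}, Int<64>{}),
                              make_stride(Int<64>{}, Int<1>{}));
  auto swizzled = composition(Swizzle<3,3,3>{}, plain);

  print_layout(plain);            // the conflicted offsets
  print(swizzled); print("\n");   // the XOR-permuted layout
}
```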
CUDA · The Parallel Copy: Orchestrating Threads with TiledCopy
How TiledCopy bundles thread layout, copy atoms, and value layout into one declarative object for coordinated, vectorized parallel copies.
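A minimal sketch of such a declaration, mirroring CuTe's tiled-copy tutorial (the thread and value shapes are illustrative; it assumes a column-major source tile so each thread's four floats are contiguous):

```cpp
#include <cute/tensor.hpp>
using namespace cute;

int main() {
  // 256 threads arranged 32x8, each moving 4x1 floats, so every thread
  // issues a single 128-bit transaction via the uint128_t copy atom.
  auto tiled_copy = make_tiled_copy(
      Copy_Atom<UniversalCopy<uint128_t>, float>{},   // the copy instruction
      make_layout(make_shape(Int<32>{}, Int<8>{})),   // thread layout
      make_layout(make_shape(Int<4>{}, Int<1>{})));   // values per thread
  print_latex(tiled_copy);  // renders the thread/value assignment
  // In a kernel: copy(tiled_copy, thr.partition_S(gA), thr.partition_D(sA));
}
```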
CUDA · The Naive Copy: Scalar vs. Vectorized Memory Movement
Why scalar copies leave 75% of memory bandwidth on the table, and how CuTe's auto-vectorization fixes it.
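To illustrate the gap (a standalone example, not code from the post): the same copy issued as 32-bit scalar loads versus 128-bit float4 loads:

```cuda
__global__ void copy_scalar(const float* __restrict__ in,
                            float* __restrict__ out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[i];      // one 4-byte transaction per thread
}

__global__ void copy_vec4(const float4* __restrict__ in,
                          float4* __restrict__ out, int n4) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n4) out[i] = in[i];     // one 16-byte transaction per thread
}
```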
CUDA · The Art of Slicing: Partitioning Data Across Blocks and Threads
How CuTe's local_tile and local_partition replace manual index math to slice matrices across CTAs and threads.
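A sketch of the two calls, assuming a row-major MxN float matrix and illustrative 128x64 CTA tiles split across 16x16 threads:

```cuda
#include <cute/tensor.hpp>
using namespace cute;

__global__ void slice_sketch(const float* ptr, int M, int N) {
  Tensor mA = make_tensor(make_gmem_ptr(ptr),
                          make_shape(M, N), make_stride(N, Int<1>{}));
  // local_tile: carve out this CTA's 128x64 tile of the full matrix.
  Tensor gA = local_tile(mA, make_shape(Int<128>{}, Int<64>{}),
                         make_coord(blockIdx.x, blockIdx.y));
  // local_partition: hand each of the 256 threads its piece of the tile.
  Tensor tA = local_partition(gA, make_layout(make_shape(Int<16>{}, Int<16>{})),
                              threadIdx.x);
  // tA now indexes exactly the elements this thread owns -- no index math.
}
```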
CUDA · Hello, Layout! Visualizing Memory in CuTe
Understanding CuTe Layouts: how shape and stride turn flat memory into multidimensional grids.
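The core idea in a few lines (a standalone sketch; the 4x8 shape is just an example):

```cpp
#include <cute/tensor.hpp>
using namespace cute;

int main() {
  // A Layout is (shape, stride): it maps a logical coordinate to a
  // linear offset. Same 4x8 grid, two different memory orders:
  auto row_major = make_layout(make_shape(Int<4>{}, Int<8>{}),
                               make_stride(Int<8>{}, Int<1>{}));  // (i,j) -> i*8 + j
  auto col_major = make_layout(make_shape(Int<4>{}, Int<8>{}),
                               make_stride(Int<1>{}, Int<4>{}));  // (i,j) -> i + j*4

  print_layout(row_major);  // prints the 2-D grid of offsets
  print_layout(col_major);
}
```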
Life · My 2 Cents on Doing Hard Things
Reflections on why hard things are worth doing and how to keep going when it gets tough.
CUDA · Beating PyTorch: Writing a Faster Softmax Kernel in CUDA
Writing a softmax kernel in CUDA that beats PyTorch's implementation.
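For orientation, a minimal numerically stable baseline (one warp per row with shuffle reductions; a sketch, not the optimized kernel from the post):

```cuda
#include <math.h>

// Launch with <<<rows, 32>>>: one warp handles one row.
__global__ void softmax_rows(const float* __restrict__ in,
                             float* __restrict__ out, int rows, int cols) {
  int row = blockIdx.x, lane = threadIdx.x;
  if (row >= rows) return;
  const float* x = in  + (size_t)row * cols;
  float*       y = out + (size_t)row * cols;

  float m = -INFINITY;                          // 1) row max (for stability)
  for (int c = lane; c < cols; c += 32) m = fmaxf(m, x[c]);
  for (int o = 16; o > 0; o >>= 1) m = fmaxf(m, __shfl_xor_sync(0xffffffff, m, o));

  float s = 0.f;                                // 2) sum of exp(x - max)
  for (int c = lane; c < cols; c += 32) s += expf(x[c] - m);
  for (int o = 16; o > 0; o >>= 1) s += __shfl_xor_sync(0xffffffff, s, o);

  for (int c = lane; c < cols; c += 32)         // 3) normalize
    y[c] = expf(x[c] - m) / s;
}
```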
Machine Learning · Stable Diffusion 1.5: How I Optimized It
A detailed worklog on optimizing Stable Diffusion 1.5 for performance.
Logic · Propositional Logic
A deep dive into the fundamental building blocks of mathematical logic.
Machine Learning · Raw Dawgging Linear Regression
Understanding Linear Regression by building it from the ground up.