MountainAI

CUDA basics

GPU architecture, memory hierarchy, and kernels: understanding the hardware in order to optimise deep learning code.

Depth levels

L0 · Intro · ~0 h

Knows GPUs have thousands of parallel cores and that `.to('cuda')` moves tensors there.

L1 · Basics · ~10 h

Understands the CUDA thread hierarchy (threads grouped into blocks, blocks grouped into a grid), shared memory, and basic host-to-device memory transfer patterns.
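To make the thread hierarchy concrete, here is a minimal sketch in plain Python that mimics how a one-dimensional CUDA launch maps threads to array elements. It is a CPU simulation, not GPU code: the function names `launch_config` and `simulate_vector_add` are hypothetical, and the default of 256 threads per block is only an assumed, commonly used value.

```python
def launch_config(n, threads_per_block=256):
    """Return (blocks, threads_per_block) so that blocks * threads_per_block >= n."""
    # Ceiling division: the last block may be only partially used.
    blocks = (n + threads_per_block - 1) // threads_per_block
    return blocks, threads_per_block


def simulate_vector_add(a, b, threads_per_block=256):
    """Sequentially replay what each (block, thread) pair would do in a vector add."""
    n = len(a)
    out = [0.0] * n
    grid_dim, block_dim = launch_config(n, threads_per_block)
    for block_idx in range(grid_dim):        # plays the role of blockIdx.x
        for thread_idx in range(block_dim):  # plays the role of threadIdx.x
            # Global index, as in blockIdx.x * blockDim.x + threadIdx.x
            i = block_idx * block_dim + thread_idx
            if i < n:  # bounds guard for the partially filled last block
                out[i] = a[i] + b[i]
    return out
```

On a real GPU the two loops do not exist: every (block, thread) pair runs the loop body concurrently, which is why the bounds guard is needed whenever the element count is not a multiple of the block size.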

L2 · Working · ~25 h

Writes and profiles simple CUDA kernels in C++; uses a CUDA profiler (Nsight, or the legacy nvprof); identifies and reduces memory-bandwidth bottlenecks.

L3 · Advanced · ~40 h

Implements optimised GEMM and attention kernels; uses Triton for custom operations; analyses kernel performance with the roofline model.
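As a rough illustration of roofline analysis, the sketch below computes the attainable throughput of a kernel as the minimum of the compute roof and the memory roof. The function names are hypothetical, and the GEMM traffic estimate assumes the idealised case where each matrix is read or written exactly once; real kernels usually move more data.

```python
def attainable_flops(peak_flops, peak_bandwidth, arithmetic_intensity):
    """Roofline bound in FLOP/s.

    peak_flops: compute roof in FLOP/s
    peak_bandwidth: memory roof in bytes/s
    arithmetic_intensity: FLOPs performed per byte moved to or from memory
    """
    return min(peak_flops, peak_bandwidth * arithmetic_intensity)


def gemm_intensity(m, n, k, bytes_per_element=4):
    """Idealised arithmetic intensity of C = A @ B with A (m x k) and B (k x n)."""
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    # Assumption: A and B are each read once and C is written once.
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


# Hypothetical device, chosen only for illustration: 10 TFLOP/s and 500 GB/s.
PEAK_FLOPS = 10e12
PEAK_BANDWIDTH = 500e9
RIDGE_POINT = PEAK_FLOPS / PEAK_BANDWIDTH  # intensity where the two roofs meet
```

With these assumed numbers, a kernel whose intensity is below `RIDGE_POINT` is memory-bound and one above it is compute-bound; an element-wise operation typically falls in the first group and a large GEMM in the second.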

L4 · Research · ~80 h

Contributes to ML compilers, sparse kernel libraries, or hardware-aware training research.

Resources