GPU Mental Model

How GPU Work Flows From Code To Cores

A conceptual map for remembering how a PyTorch operation becomes GPU work: CPU/board, PCIe, GPU command engine, VRAM, caches, SMs, warp schedulers, warps, CUDA cores, and registers.

The Big Route

Information and instructions move inward from the host system toward the small registers where actual arithmetic happens.

The GPU Company: Script, Memory, Dispatch, Workers

Scene 1

Boardroom: CPU Submits The Proposal

The CPU runs Python and calls PyTorch. PyTorch identifies the tensor operation and chooses a CUDA kernel or library routine. The CPU submits the job; it does not do the GPU math.

CPU -> PyTorch -> CUDA

proposal.py: c = torch.relu(a @ b)

Action: CPU submits a launch request.

Scene 2

Translator Pipeline Approves The Work

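The operation passes through the software stack: a PyTorch op maps to a CUDA kernel or library routine, which is compiled to PTX and then to SASS for the specific GPU. That compilation happens ahead of time, or when the driver JIT-compiles PTX at load time; it does not happen on every call. The CUDA driver (libcuda) is the official channel into the GPU.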

PyTorch op -> CUDA kernel -> PTX -> SASS -> libcuda / driver gate -> approved launch

Action: driver delivers the approved kernel launch.

Scene 3

Memory Warehouse Prepares Materials

Tensors live in VRAM. PyTorch's CUDA allocator reserves space, and kernels receive pointers to those buffers. Memory utilization is capacity pressure, not SM compute utilization.

VRAM holds tensor A, tensor B, and output C.

VRAM utilization: 6.4 GB / 24 GB

Action: memory supplies tensor buffers and addresses.

Scene 4

CEO Receives The Kernel Launch

The GPU front end receives the launch. The block scheduler assigns CTAs/thread blocks to available SMs. The CEO schedules blocks; it does not do arithmetic.

kernel -> CTA 0, CTA 1, CTA 2

Action: CEO schedules CTAs/thread blocks to SMs.

Scene 5

VP / SM Organizes The Block

An SM accepts a block, tracks registers and shared-memory limits, forms resident warps, and exposes ready warps to warp schedulers.

CTA / thread block: registers, shared memory, occupancy

Action: VP organizes resident warps.

Scene 6

Regional Managers Pick Ready Warps

Warp schedulers issue instructions for ready warps. If one warp waits on memory, another ready warp can run.

warp scheduler: warp A ready, warp B waiting on load, warp C ready

same instruction, many lanes

Action: scheduler dispatches ready warps.

Scene 7

VP Sends Work To Multiple Warps

Warps execute the same instruction across lanes. Execution units operate on register values while loads/stores move through VRAM, L2, and SM-local paths.

Warp A: threads 0-31

Warp B: threads 32-63

Warp C: threads 64-95

Within each warp the lanes are numbered 0-31.

registers <- L1/shared <- L2 <- VRAM

Action: workers compute using register materials.

Scene 8

Metaphor To Hardware Map

The story maps back to the real path: CPU/PyTorch/Driver submits, GPU scheduler assigns, SM schedules warps, execution units compute, memory supplies data.

CPU / PyTorch / Driver -> GPU Scheduler -> SM -> Warp Scheduler -> Warps + Registers

VRAM / L2 / L1 + shared memory feed the worker floor

Action: complete GPU workflow recap.

Board / CPU: Launches work

PCIe: Host to device road

CEO / GPU: Command + scheduling

VRAM: Global memory

L2: Shared cache

SMs / VPs: Block execution

Warp Schedulers: Instruction issue

L1: Local cache

Warps: 32-lane teams

Registers: Working materials

Under The Story

The metaphor is useful, but these are the actual mechanics that matter when a tensor operation turns into GPU work.

Kernel Launch Anatomy

A launch is a reference to already-compiled kernel code plus a launch configuration and arguments, typically pointers to device memory. The CPU submits it; the GPU executes it asynchronously on a CUDA stream.

kernel<<<gridDim, blockDim, sharedMemBytes, stream>>>(
    A_ptr, B_ptr, C_ptr, N
);

gridDim: how many CTAs/thread blocks exist for the whole problem.
blockDim: how many logical threads are inside each block.
sharedMemBytes: dynamic shared memory requested per block.
stream: ordered queue where the launch is submitted.
A_ptr/B_ptr/C_ptr: addresses of tensor buffers already allocated in VRAM.
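As a rough illustration of how gridDim is usually derived from the problem size, here is a minimal sketch for a 1-D elementwise kernel. The helper name launch_config and the default of 256 threads per block are assumptions for illustration; they are not something PyTorch or CUDA prescribes.

```python
def launch_config(n, threads_per_block=256):
    """Return (grid_dim, block_dim) for a 1-D problem of n elements.

    Ceiling division guarantees at least n threads in total; the kernel
    is expected to skip any surplus threads with a bounds check (i < n).
    """
    blocks = (n + threads_per_block - 1) // threads_per_block
    return blocks, threads_per_block


# 1000 elements need 4 blocks of 256 threads (1024 threads, 24 of them idle).
small = launch_config(1000)
# One thread per element of a 4096 x 4096 tensor.
large = launch_config(4096 * 4096)
print(small, large)
```

The surplus threads in the last block are the reason elementwise CUDA kernels normally start with a bounds check on the global index.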

Grid, Block, Warp, Thread

Grid: all blocks

Block / CTA: assigned to one SM

Warp: usually 32 lanes

Thread: one logical lane

Hardware groups adjacent threads into warps. A warp runs one instruction stream across multiple lanes, each with its own register state and data address.
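To make the grouping concrete, the following sketch maps a flat thread index to its block, warp, and lane. It assumes a 1-D launch and a warp size of 32; the function name locate is hypothetical.

```python
WARP_SIZE = 32  # assumption: 32 lanes per warp


def locate(global_tid, block_dim):
    """Map a flat thread index to (block, thread_in_block, warp_in_block, lane)."""
    block_idx, thread_idx = divmod(global_tid, block_dim)
    warp_in_block, lane = divmod(thread_idx, WARP_SIZE)
    return block_idx, thread_idx, warp_in_block, lane


# Thread 300 with 256-thread blocks: block 1, thread 44, warp 1, lane 12.
print(locate(300, 256))
```

Adjacent thread indices land in the same warp, which is why neighboring threads should usually touch neighboring data.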

PyTorch Dispatch Path

Python calls a PyTorch op.
ATen/dispatcher selects backend implementation.
CUDA backend calls a library kernel, generated kernel, or custom kernel.
Launch is enqueued on a CUDA stream.
CPU continues unless a synchronization point forces it to wait.
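One way to see the last two steps is to time the enqueue separately from the completion. The sketch below is an illustration only: the function name time_async_launch is hypothetical, and it returns None when PyTorch or a CUDA device is not available.

```python
import time


def time_async_launch(n=4096):
    """Return (seconds_until_enqueued, seconds_until_finished) for a matmul,
    or None if PyTorch with CUDA is not available on this machine."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None

    a = torch.randn((n, n), device="cuda")
    b = torch.randn((n, n), device="cuda")
    _ = a @ b                   # warm-up so library initialization is not timed
    torch.cuda.synchronize()    # drain everything queued so far

    t0 = time.perf_counter()
    c = a @ b                   # returns once the work is enqueued
    enqueued = time.perf_counter() - t0
    torch.cuda.synchronize()    # CPU blocks until the GPU finishes
    finished = time.perf_counter() - t0
    return enqueued, finished


print(time_async_launch())
```

On a GPU machine the first number is typically much smaller than the second, because the call returns before the matrix multiply has run.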

Memory Hierarchy

Registers: Per-thread, tiny, fastest. Execution units read operands from here.

Shared memory / L1: Per-SM, low latency. Shared memory is explicitly used by kernels; L1 caches local/global access patterns.

L2: Shared across the GPU. A common cache before going to global memory.

VRAM / HBM / GDDR: Large device memory. Tensors live here when they are on CUDA.

CPU RAM: Host memory. Moving data from here crosses PCIe or NVLink.
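For a sense of scale, this small sketch computes how much device memory a dense tensor occupies and how long a host-to-device copy would take at a given bandwidth. The helper names are hypothetical, float32 (4 bytes per element) is assumed, and the bandwidth must be supplied by the caller because it depends on the actual link.

```python
import math


def tensor_bytes(shape, itemsize=4):
    """Bytes used by a dense tensor; itemsize=4 assumes float32."""
    return math.prod(shape) * itemsize


def transfer_seconds(nbytes, bandwidth_bytes_per_s):
    """Idealized copy time: ignores latency and protocol overhead."""
    return nbytes / bandwidth_bytes_per_s


one = tensor_bytes((4096, 4096))      # 67,108,864 bytes = 64 MiB
total = 3 * one                       # a, b, and c together: 192 MiB
print(one // 2**20, "MiB per tensor,", total // 2**20, "MiB total")
```

Registers and shared memory are measured in kilobytes per SM, so data of this size has to stream through the hierarchy in tiles.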

Warp Scheduling

Each SM has warp schedulers. They pick ready warps and issue instructions. If a warp stalls on memory, another ready warp can run, which hides latency.
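Latency hiding can be illustrated with a toy model: one scheduler issues at most one instruction per cycle, and a warp that has just issued must wait a fixed number of cycles before it is ready again. This is a deliberately simplified sketch and not a model of any real GPU; the function name and the numbers are assumptions.

```python
def cycles_to_issue_all(num_warps, instrs_per_warp, latency):
    """Count cycles until every warp has issued all of its instructions.

    After issuing, a warp stalls for `latency` cycles (think: waiting on a
    load). Each cycle the scheduler picks the first ready warp, if any.
    """
    ready_at = [0] * num_warps
    remaining = [instrs_per_warp] * num_warps
    issued, total, cycle = 0, num_warps * instrs_per_warp, 0
    while issued < total:
        for w in range(num_warps):
            if remaining[w] > 0 and ready_at[w] <= cycle:
                remaining[w] -= 1
                issued += 1
                ready_at[w] = cycle + 1 + latency
                break               # only one issue slot per cycle
        cycle += 1
    return cycle


# One warp leaves most issue slots empty: 4 instructions take 13 cycles.
print(cycles_to_issue_all(1, 4, 3))
# Four warps fill the gaps: 16 instructions take 16 cycles.
print(cycles_to_issue_all(4, 4, 3))
```

With four warps the scheduler does four times the work in roughly the same time, because some warp is always ready while the others wait.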

High occupancy can help hide latency, but occupancy is not the same as performance. A kernel can have high occupancy and still be memory-bound.
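Occupancy itself is just a ratio of resident warps to the maximum an SM can hold, limited by whichever resource runs out first. The sketch below uses hypothetical per-SM limits chosen for round numbers; real limits differ by architecture, and real allocation has granularity rules that this ignores.

```python
from dataclasses import dataclass

WARP_SIZE = 32


@dataclass
class SMLimits:
    # Hypothetical values for illustration only; check your GPU's specs.
    max_threads: int = 2048
    max_blocks: int = 32
    registers: int = 65536
    shared_mem_bytes: int = 65536


def occupancy(threads_per_block, regs_per_thread, smem_per_block, sm=SMLimits()):
    """Fraction of the SM's warp slots filled by resident blocks."""
    by_threads = sm.max_threads // threads_per_block
    by_regs = sm.registers // (regs_per_thread * threads_per_block)
    by_smem = (sm.shared_mem_bytes // smem_per_block
               if smem_per_block else sm.max_blocks)
    blocks = min(by_threads, by_regs, by_smem, sm.max_blocks)
    warps_per_block = -(-threads_per_block // WARP_SIZE)  # ceiling division
    return blocks * warps_per_block / (sm.max_threads // WARP_SIZE)


# Registers and shared memory each cap this kernel at 4 resident blocks.
print(occupancy(threads_per_block=256, regs_per_thread=64, smem_per_block=16384))
```

Halving register use per thread in this example would not help on its own, since shared memory would still cap the block count at four.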

SIMT Details

One warp instruction applies across lanes.
Each lane has its own registers and usually its own data address.
Branch divergence can serialize paths inside a warp.
Coalesced memory access makes global loads/stores more efficient.
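Both effects can be counted with a few lines of code. In this sketch, coalescing is approximated by the number of distinct memory segments a warp touches, and divergence by the number of distinct branch outcomes across lanes. The 128-byte segment size and the function names are assumptions for illustration; actual transaction granularity depends on the architecture.

```python
WARP_SIZE = 32


def segments_touched(addresses, segment_bytes=128):
    """Number of distinct aligned segments covered by one warp's addresses."""
    return len({addr // segment_bytes for addr in addresses})


def serialized_passes(branch_taken):
    """Distinct branch outcomes in a warp; each outcome runs as its own pass."""
    return len({bool(branch_taken(lane)) for lane in range(WARP_SIZE)})


# Adjacent float32 elements: 32 lanes x 4 bytes fit in one 128-byte segment.
contiguous = [lane * 4 for lane in range(WARP_SIZE)]
# A stride of 32 floats puts every lane in a different segment.
strided = [lane * 32 * 4 for lane in range(WARP_SIZE)]
print(segments_touched(contiguous), segments_touched(strided))

# All lanes agree: one pass. Even/odd lanes disagree: two passes.
print(serialized_passes(lambda lane: lane < WARP_SIZE),
      serialized_passes(lambda lane: lane % 2 == 0))
```

The strided pattern moves 32 times as many segments to deliver the same 128 useful bytes, which is the cost that coalescing avoids.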

PTX vs SASS

PyTorch: c = a + b
CUDA:   C[i] = A[i] + B[i]
PTX:    ld.global, add.f32, st.global
SASS:   LDG, FADD, STG

This is conceptual. Exact SASS depends on the GPU architecture and compiler decisions.

Streams And Sync

CUDA launches are asynchronous. Operations in the same stream execute in order. Different streams can overlap copies and compute when dependencies and hardware resources allow.

torch.cuda.synchronize()
# CPU waits until queued GPU work finishes
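The ordering and waiting behavior can be mimicked without a GPU. The following toy model uses a queue and a worker thread to stand in for a stream; ToyStream is a hypothetical class written for this illustration and has nothing to do with the real torch.cuda.Stream implementation.

```python
import queue
import threading
import time


class ToyStream:
    """A queue plus one worker thread: launches return at once, run in order."""

    def __init__(self):
        self._work = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            fn = self._work.get()
            try:
                fn()
            finally:
                self._work.task_done()

    def launch(self, fn):
        self._work.put(fn)      # enqueue only; the caller does not wait

    def synchronize(self):
        self._work.join()       # block until every queued item has finished


log = []
stream = ToyStream()
for i in range(3):
    stream.launch(lambda i=i: (time.sleep(0.01), log.append(i)))
stream.synchronize()
print(log)
```

Two ToyStream objects would run their queues independently, which loosely mirrors how separate CUDA streams may overlap when nothing forces an ordering between them.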

How To Observe It

nvidia-smi: Broad GPU utilization, memory use, process list.

PyTorch memory APIs: Allocated vs reserved memory inside the CUDA caching allocator.

Nsight Systems: CPU/GPU timeline, stream overlap, launch gaps.

Nsight Compute: Occupancy, warp stalls, memory throughput, tensor core use.
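As a small example of the PyTorch memory APIs, the sketch below reads the allocated and reserved byte counts after creating one tensor. The wrapper function cuda_memory_report is hypothetical, and it returns None when PyTorch or a CUDA device is not available.

```python
def cuda_memory_report():
    """Return allocated/reserved bytes from the caching allocator, or None
    if PyTorch with CUDA is not available on this machine."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None

    x = torch.empty((1024, 1024), device="cuda")  # keep one live tensor
    report = {
        # Bytes currently occupied by live tensors.
        "allocated_bytes": torch.cuda.memory_allocated(),
        # Bytes the caching allocator holds from the driver, used or not.
        "reserved_bytes": torch.cuda.memory_reserved(),
    }
    del x
    return report


print(cuda_memory_report())
```

Reserved is normally at least as large as allocated, and nvidia-smi reports a larger figure still because it also counts the CUDA context.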

Example Script

A tiny PyTorch operation that creates GPU work. PyTorch chooses kernels and thread blocks unless you write a custom CUDA/C++ kernel.

import torch

device = "cuda"

a = torch.randn((4096, 4096), device=device)
b = torch.randn((4096, 4096), device=device)

# PyTorch dispatches GPU kernels: a matmul (typically via cuBLAS), then an elementwise ReLU.
c = torch.relu(a @ b)

torch.cuda.synchronize()
print(c.shape)

Code To Machine Path

PyTorch: High-level tensor operation: matrix multiply, ReLU, copy, reduction.
CUDA: Runtime and libraries pick or launch GPU kernels.
PTX: Portable NVIDIA assembly-like intermediate representation.
SASS: Final GPU-dependent machine instructions for your specific architecture.
libcuda.so: The driver-side bridge that talks to the GPU.

The Organization Analogy

The GPU feels easier when the hardware hierarchy is mapped to people assigning work.

CEO: GPU / GigaThread Engine

Receives launched kernels and assigns thread blocks to available SMs. The CEO does not do the arithmetic; it distributes the work.

VP: Streaming Multiprocessor

Gets assigned thread blocks, owns local execution resources, and coordinates warps, L1/shared memory, registers, and load/store paths.

Regional Manager: Warp Scheduler

Chooses ready warps and issues the same instruction across lanes while memory loads move through cache and into registers.

Manager: Warp

A warp groups 32 threads, some of which may be inactive. Each active lane runs the same instruction, usually on different data addresses.

Workers: CUDA / Tensor Cores

CUDA cores, tensor cores, and other execution units perform the low-level math once instructions and operands are ready.

Materials: Registers

Registers hold the immediate values each lane works on. Arithmetic is fast because the operands are already right next to the workers.

One Kernel Launch, Step By Step

For the line c = torch.relu(a @ b), the actual route is a chain of scheduling and data movement.

01. CPU Creates The Job

Python calls PyTorch. PyTorch selects CUDA kernels or library routines such as cuBLAS for matrix multiplication.

02. Driver Hands It To The GPU

The CUDA runtime and driver prepare the launch. The launch command and its metadata cross from host to device over PCIe.

03. CEO Assigns Thread Blocks

The GPU scheduler distributes thread blocks across SMs. Each block is a chunk of the problem: a group of threads that receives the same kernel arguments, including the data pointers.

04. VP Splits Blocks Into Warps

An SM accepts blocks, forms warps, tracks resources, and prepares the load/store paths that bring data from VRAM through cache.

05. Regional Manager Issues Instructions

Warp schedulers pick ready warps. If a warp is waiting on memory, another ready warp can run while the first waits.

06. Workers Compute In Registers

CUDA cores and tensor cores operate on values in registers, then results flow back through cache to VRAM.

Key Notes

Thread: a logical lane of execution with instructions and pointers to its data.

Thread block: a group of threads assigned to one SM. The launch configuration defines the block layout, whether it is chosen by PyTorch's kernels or by your own CUDA/C++ code.

Warp: usually 32 threads executing one instruction stream across different data lanes.

VRAM is big but far: registers are tiny and close; L1/L2 sit in between to reduce expensive memory traffic.

PTX is not the final instruction form: the final SASS depends on the actual GPU architecture.