A conceptual map for remembering how a PyTorch operation becomes GPU work: CPU/board, PCIe, GPU command engine,
VRAM, caches, SMs, warp schedulers, warps, CUDA cores, and registers.
The Big Route
Information and instructions move inward from the host system toward the registers, the small per-thread storage that feeds the execution units where the arithmetic actually happens.
The GPU Company: Script, Memory, Dispatch, Workers
Scene 1
Boardroom: CPU Submits The Proposal
The CPU runs Python and calls PyTorch. PyTorch identifies the tensor operation and chooses a CUDA kernel or library routine. The CPU submits the job; it does not do the GPU math.
CPU → PyTorch → CUDA
proposal.py: c = torch.relu(a @ b)
Action: CPU submits a launch request.
Scene 2
Translator Pipeline Approves The Work
The operation passes through the software stack: a PyTorch op maps to a CUDA kernel or library routine, which is compiled to PTX (a virtual instruction set) and then to SASS, the machine code for the specific GPU. The CUDA driver (libcuda) is the official channel into the GPU.
PyTorch op → CUDA kernel → PTX → SASS → libcuda / driver gate → approved launch
Action: driver delivers the approved kernel launch.
Scene 3
Memory Warehouse Prepares Materials
Tensors live in VRAM. PyTorch's CUDA allocator reserves space, and kernels receive pointers to those buffers. Memory utilization is capacity pressure, not SM compute utilization.
tensor A
tensor B
output C
VRAM utilization: 6.4 GB / 24 GB
Allocated/reserved memory shows capacity pressure; it is not the same as SM compute utilization.
Action: memory supplies tensor buffers and addresses.
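To make "capacity pressure" concrete, here is a minimal sketch of the footprint arithmetic for the three buffers in the scene. The function names and the 4096 x 4096 shapes are illustrative assumptions, and the result is payload size only; the caching allocator may reserve more than this.

```python
import math

# Bytes per element for a few common dtypes (standard sizes).
DTYPE_BYTES = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def tensor_bytes(shape, dtype="float32"):
    """Payload size of a dense tensor: element count times element size."""
    return math.prod(shape) * DTYPE_BYTES[dtype]


def matmul_footprint_bytes(m, k, n, dtype="float32"):
    """Bytes needed for A (m x k), B (k x n), and the output C (m x n)."""
    return (
        tensor_bytes((m, k), dtype)
        + tensor_bytes((k, n), dtype)
        + tensor_bytes((m, n), dtype)
    )


# Hypothetical shapes: three 4096 x 4096 float32 matrices.
total = matmul_footprint_bytes(4096, 4096, 4096)
print(total / 2**20, "MiB")  # 192.0 MiB
```

A figure like this says how much VRAM the operation occupies. It says nothing about how busy the SMs are while the kernel runs.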
Scene 4
CEO Receives The Kernel Launch
The GPU front end receives the launch. The block scheduler assigns CTAs/thread blocks to available SMs. The CEO schedules blocks; it does not do arithmetic.
kernel
CTA 0, CTA 1, CTA 2
Action: CEO schedules CTAs/thread blocks to SMs.
Scene 5
VP / SM Organizes The Block
An SM accepts a block, tracks registers and shared-memory limits, forms resident warps, and exposes ready warps to warp schedulers.
CTA / thread block
registers, shared memory, occupancy
Action: VP organizes resident warps.
Scene 6
Regional Managers Pick Ready Warps
Warp schedulers issue instructions for ready warps. If one warp waits on memory, another ready warp can run.
warp scheduler
warp A: ready; warp B: waiting on load; warp C: ready
same instruction, many lanes
Action: scheduler dispatches ready warps.
Scene 7
VP Sends Work To Multiple Warps
Warps execute the same instruction across lanes. Execution units operate on register values while loads/stores move through VRAM, L2, and SM-local paths.
Warp A: threads 0-31
Warp B: threads 32-63
Warp C: threads 64-95
(within each warp, lanes are numbered 0-31)
registers, L1/shared, L2, VRAM
Action: workers compute using register materials.
Scene 8
Metaphor To Hardware Map
The story maps back to the real path: the CPU, PyTorch, and the driver submit the work; the GPU scheduler assigns blocks; each SM schedules warps; execution units compute; memory supplies data.
CPU / PyTorch / Driver → GPU Scheduler → SM → Warp Scheduler → Warps + Registers
The metaphor is useful, but these are the actual mechanics that matter when a tensor operation turns into GPU work.
Kernel Launch Anatomy
A launch consists of the kernel to run, a launch configuration, and arguments such as pointers to device memory. The CPU submits it to a CUDA stream; the GPU executes it asynchronously with respect to the CPU.
kernel<<<gridDim, blockDim, sharedMemBytes, stream>>>(
A_ptr, B_ptr, C_ptr, N
);
gridDim: how many CTAs/thread blocks exist for the whole problem.
blockDim: how many logical threads are inside each block.
sharedMemBytes: dynamic shared memory requested per block.
stream: ordered queue where the launch is submitted.
A_ptr/B_ptr/C_ptr: addresses of tensor buffers already allocated in VRAM.
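For a one-dimensional elementwise kernel, gridDim is usually derived from the problem size by ceiling division. The sketch below shows that arithmetic in Python; `launch_config` and the default of 256 threads per block are illustrative assumptions, not values taken from any particular kernel.

```python
def launch_config(n, threads_per_block=256):
    """Return (grid_dim, block_dim) covering n elements with one thread each."""
    if n <= 0 or threads_per_block <= 0:
        raise ValueError("n and threads_per_block must be positive")
    # Ceiling division: enough blocks so that grid_dim * block_dim >= n.
    grid_dim = (n + threads_per_block - 1) // threads_per_block
    return grid_dim, threads_per_block


grid_dim, block_dim = launch_config(1_000_000)
surplus = grid_dim * block_dim - 1_000_000
print(grid_dim, block_dim, surplus)  # 3907 256 192
```

The surplus threads in the last block are why elementwise kernels typically guard their body with a bounds check such as `if (i < N)`.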
Grid, Block, Warp, Thread
Grid: all blocks
Block / CTA: assigned to one SM
Warp: usually 32 lanes
Thread: one logical lane
Hardware groups adjacent threads into warps. A warp runs one instruction stream across multiple lanes, each with its own register state and data address.
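The hierarchy can be read as plain index arithmetic. This sketch assumes a one-dimensional grid and a warp size of 32; the `locate` helper is a hypothetical name used only for illustration.

```python
WARP_SIZE = 32  # assumed warp width


def locate(global_thread_id, block_dim):
    """Map a flat 1-D thread index to its block, warp within block, and lane."""
    block = global_thread_id // block_dim
    thread_in_block = global_thread_id % block_dim
    return {
        "block": block,
        "thread_in_block": thread_in_block,
        "warp_in_block": thread_in_block // WARP_SIZE,
        "lane": thread_in_block % WARP_SIZE,
    }


print(locate(300, block_dim=256))
# {'block': 1, 'thread_in_block': 44, 'warp_in_block': 1, 'lane': 12}
```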
PyTorch Dispatch Path
Python calls a PyTorch op.
ATen/dispatcher selects backend implementation.
CUDA backend calls a library kernel, generated kernel, or custom kernel.
Launch is enqueued on a CUDA stream.
CPU continues unless a synchronization point forces it to wait.
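The selection step can be pictured as a table lookup keyed by operation and device. The toy registry below is purely illustrative: it is not PyTorch's dispatcher, every name in it is hypothetical, and the "cuda" entry is a stand-in that runs on the CPU instead of enqueuing a real launch.

```python
from typing import Callable, Dict, List, Tuple

# (op name, device type) -> implementation. A toy stand-in for a dispatch table.
_KERNELS: Dict[Tuple[str, str], Callable] = {}


def register(op: str, device: str):
    """Decorator that records an implementation for one (op, device) pair."""
    def wrap(fn: Callable) -> Callable:
        _KERNELS[(op, device)] = fn
        return fn
    return wrap


@register("relu", "cpu")
def relu_cpu(xs: List[float]) -> List[float]:
    return [max(0.0, x) for x in xs]


@register("relu", "cuda")
def relu_cuda_stub(xs: List[float]) -> List[float]:
    # Stand-in only: a real CUDA backend would enqueue a kernel on a stream.
    return [max(0.0, x) for x in xs]


def dispatch(op: str, device: str, *args):
    """Look up the backend implementation for this op and device, then call it."""
    try:
        kernel = _KERNELS[(op, device)]
    except KeyError:
        raise NotImplementedError(f"no {op!r} kernel for {device!r}") from None
    return kernel(*args)


print(dispatch("relu", "cuda", [-1.0, 2.0]))  # [0.0, 2.0]
```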
Memory Hierarchy
Registers: Per-thread, tiny, fastest. Execution units read operands from here.
Shared memory / L1: Per-SM, low latency. Shared memory is managed explicitly by kernels; L1 caches global and local memory accesses automatically.
L2: Shared by all SMs. The last cache level before device memory.
VRAM / HBM / GDDR: Large device memory. Tensors live here when they are on CUDA.
CPU RAM: Host memory. Moving data from here crosses PCIe or NVLink.
Warp Scheduling
Each SM has warp schedulers. They pick ready warps and issue instructions. If a warp stalls on memory, another ready warp can run, which hides latency.
High occupancy can help hide latency, but occupancy is not the same as performance. A kernel can have high occupancy and still be memory-bound.
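Occupancy is the ratio of resident warps to the maximum an SM can hold, and it is capped by whichever resource runs out first. The sketch below is a simplified estimate: the `SMLimits` values are illustrative assumptions, not the limits of any specific GPU, and real hardware also rounds register and shared-memory allocations to a granularity that is ignored here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SMLimits:
    # Illustrative per-SM limits; real values depend on the architecture.
    max_warps: int = 64
    max_blocks: int = 32
    registers: int = 65536
    shared_mem_bytes: int = 98304


def occupancy(block_dim, regs_per_thread, smem_per_block,
              sm=SMLimits(), warp_size=32):
    """Estimate resident warps / max warps for one kernel configuration."""
    warps_per_block = -(-block_dim // warp_size)  # ceiling division
    by_warps = sm.max_warps // warps_per_block
    by_regs = sm.registers // (regs_per_thread * warps_per_block * warp_size)
    by_smem = (sm.max_blocks if smem_per_block == 0
               else sm.shared_mem_bytes // smem_per_block)
    # The scarcest resource decides how many blocks fit on the SM.
    blocks = min(sm.max_blocks, by_warps, by_regs, by_smem)
    return blocks * warps_per_block / sm.max_warps


print(occupancy(256, 32, 0))      # 1.0  (limited by warp slots)
print(occupancy(256, 64, 0))      # 0.5  (limited by registers)
print(occupancy(256, 32, 49152))  # 0.25 (limited by shared memory)
```

A result of 1.0 only means the scheduler has the most warps to choose from. A kernel at full occupancy can still spend most of its time waiting on memory.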
SIMT Details
One warp instruction applies across lanes.
Each lane has different registers and usually different data addresses.
Branch divergence can serialize paths inside a warp.
Coalesced memory access makes global loads/stores more efficient.
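Coalescing and divergence can both be illustrated by counting. This sketch assumes a 32-lane warp, 128-byte aligned memory segments, and a two-way branch; actual transaction sizes vary by architecture, and both helper names are hypothetical.

```python
WARP = range(32)  # lane ids of one warp


def segments_touched(addresses, segment_bytes=128):
    """Count distinct aligned segments a warp's byte addresses fall into."""
    return len({addr // segment_bytes for addr in addresses})


def branch_passes(predicates):
    """Passes needed for an if/else: 1 if all lanes agree, 2 if they diverge."""
    return len(set(predicates))


# Each lane loads one 4-byte float.
contiguous = [lane * 4 for lane in WARP]     # adjacent floats
strided = [lane * 4 * 32 for lane in WARP]   # one float every 128 bytes

print(segments_touched(contiguous))  # 1
print(segments_touched(strided))     # 32

print(branch_passes([True for _ in WARP]))            # 1
print(branch_passes([lane % 2 == 0 for lane in WARP]))  # 2
```

Under these assumptions the strided pattern touches 32 times as many segments to deliver the same 128 bytes of useful data, and the even/odd branch makes the warp execute both paths one after the other.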
PTX vs SASS
PyTorch: c = a + b
CUDA: C[i] = A[i] + B[i]
PTX: ld.global, add.f32, st.global
SASS: LDG, FADD, STG
This is conceptual. Exact SASS depends on the GPU architecture and compiler decisions.
Streams And Sync
CUDA launches are asynchronous. Operations in the same stream execute in order. Different streams can overlap copies and compute when dependencies and hardware resources allow.
torch.cuda.synchronize()
# CPU waits until queued GPU work finishes
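The stream semantics described here (submission returns immediately, work in one stream runs in order, synchronize blocks the caller) can be mimicked with a queue and a worker thread. `ToyStream` is a hypothetical pure-Python model for building intuition; it does not use CUDA.

```python
import queue
import threading
import time


class ToyStream:
    """Ordered work queue drained by one worker, loosely modelling a stream."""

    def __init__(self):
        self._q = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            work = self._q.get()
            try:
                work()
            finally:
                self._q.task_done()

    def launch(self, work):
        """Enqueue work and return immediately, like an asynchronous launch."""
        self._q.put(work)

    def synchronize(self):
        """Block the caller until everything queued so far has finished."""
        self._q.join()


results = []
stream = ToyStream()
for i in range(3):
    # Each "kernel" sleeps briefly, then records its id.
    stream.launch(lambda i=i: (time.sleep(0.01), results.append(i)))

stream.synchronize()  # the caller waits here
print(results)  # [0, 1, 2]
```

Because a single worker drains the queue in FIFO order, the results always come out in submission order, which mirrors the in-order guarantee within one stream.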
How To Observe It
nvidia-smi: Broad GPU utilization, memory use, process list.
PyTorch memory APIs: Allocated vs reserved memory inside the CUDA caching allocator.
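As a small sketch of the PyTorch side, `torch.cuda.memory_allocated` and `torch.cuda.memory_reserved` report the two allocator figures in bytes. The `cuda_memory_snapshot` wrapper is a hypothetical helper, and the import guard is an assumption added so the snippet also loads on machines without PyTorch or a CUDA device.

```python
try:
    import torch
except ImportError:  # allows the sketch to load without PyTorch installed
    torch = None


def cuda_memory_snapshot(device=0):
    """Return allocated/reserved bytes for a device, or None without CUDA."""
    if torch is None or not torch.cuda.is_available():
        return None
    return {
        # Bytes currently occupied by live tensors.
        "allocated": torch.cuda.memory_allocated(device),
        # Bytes held by the caching allocator, including cached free blocks.
        "reserved": torch.cuda.memory_reserved(device),
    }


snapshot = cuda_memory_snapshot()
if snapshot is None:
    print("CUDA not available")
else:
    print(f"allocated: {snapshot['allocated'] / 2**20:.1f} MiB")
    print(f"reserved:  {snapshot['reserved'] / 2**20:.1f} MiB")
```

Reserved memory is at least as large as allocated memory, which is one reason nvidia-smi can show more memory in use than the live tensors account for.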