Many modern systems (e.g., autonomous systems, cloud infrastructure, gaming devices, machine learning applications, and scientific computing systems) demand unprecedented levels of computing power, speed and efficiency. Unlike traditional software, which often relies on sequential processing, these systems are driven by the need for massively parallel processing. As systems become increasingly complex, addressing their specific computing needs (e.g., dynamic scalability, high-throughput data processing, or low-latency responses) forms the basis for system design.
This is where the need for a deeper understanding of CPUs (Central Processing Units) and GPUs (Graphics Processing Units) becomes apparent.
CPUs & GPUs: A Brief Comparison
While both CPUs and GPUs are integral to modern computing systems, they are designed with distinct internal structures and optimized for different types of operations.
CPUs have been the traditional cornerstone of computing. They not only control the overall functionality of a system but also specialize in handling complex and sequential tasks, such as running operating systems and executing single-threaded applications. GPUs, on the other hand, were initially developed for rendering high-quality graphics and subsequently evolved into powerful processors capable of highly parallelized processing. Today, GPUs form an integral part of systems used for gaming, machine learning, real-time analytics, scientific computation, and other computationally intensive fields.
Image Source: NVIDIA via Wikimedia Commons
Modern CPUs commonly have multiple cores (e.g., dual-core, quad-core, octa-core), enabling parallel processing and multitasking. The CPU core is the fundamental unit of processing power and usually comprises an arithmetic logic unit (ALU), a control unit, and registers. CPUs also have multiple levels of cache for faster data access, and cores generally communicate through shared caches and direct interconnects.
GPUs possess a large number of smaller ALUs/cores arranged in a grid structure. Each core is optimized to execute simple, repetitive calculations, but with such high parallelism that thousands of threads can be handled simultaneously. Instead of large, dedicated caches, GPUs rely on small on-chip shared memory accessible by groups of cores, facilitating parallel data access. The GPU control unit is relatively simpler than its CPU counterpart and is optimized for managing many simultaneous operations without complex branching. Finally, data flows are optimized for parallelism, allowing each core to execute the same operation on different data independently.
The Internals of the CPU
Let’s dive deep into the internal components of a CPU.
Control Unit (CU): This is the command center within a CPU. It directs the operations of the other components by interpreting program instructions and guiding data flow, which is achieved through the following steps (a simplified software model of this cycle appears after the list).
- Fetching instructions from the main memory (or cache), and decoding them to understand the specific operation required.
- Generating control signals that direct other components (e.g., ALU or registers) to ensure orderly data flow, thus enabling smooth instruction execution, and maintaining the integrity of computations.
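To make the fetch-decode-execute cycle concrete, here is a minimal software model of a control unit driving a toy instruction set. It is purely illustrative (the `Op`, `Instr`, and accumulator names are invented for this sketch, not taken from any real architecture), but it mirrors the fetch, decode, and execute steps described above.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Toy instruction set: each instruction is an opcode plus one operand.
enum class Op : uint8_t { LOAD_IMM, ADD, HALT };
struct Instr { Op op; int32_t operand; };

int main() {
    // "Program memory": load 5 into the accumulator, add 7, then halt.
    std::vector<Instr> program = {
        {Op::LOAD_IMM, 5}, {Op::ADD, 7}, {Op::HALT, 0}
    };

    int32_t acc = 0;  // accumulator register
    size_t  pc  = 0;  // program counter register

    while (true) {
        Instr instr = program[pc++];              // fetch (and advance the program counter)
        switch (instr.op) {                       // decode
            case Op::LOAD_IMM: acc = instr.operand;  break;  // execute
            case Op::ADD:      acc += instr.operand; break;
            case Op::HALT:     printf("acc = %d\n", acc); return 0;
        }
    }
}
```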
Arithmetic Logic Unit (ALU): This component performs arithmetic operations (addition, subtraction, multiplication, and division) and logical/bitwise operations (AND, OR, NOT, and XOR). It receives data from registers, executes the relevant operations, and then returns the results to the registers (or writes them to memory). After executing an operation, it sets condition codes or flags (e.g., the zero flag or carry flag) that describe the result, enabling the CPU to make decisions for branching operations.
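The flag-setting behaviour can likewise be modelled in a few lines. The sketch below is a hypothetical 8-bit ALU addition that reports a zero flag and a carry flag; real hardware produces these signals alongside the result rather than computing them in software.

```cpp
#include <cstdint>
#include <cstdio>

// Minimal model of an ALU addition that also sets condition flags.
struct AluResult {
    uint8_t value;
    bool zero;   // result was zero
    bool carry;  // unsigned overflow out of the top bit
};

AluResult alu_add(uint8_t a, uint8_t b) {
    uint16_t wide = static_cast<uint16_t>(a) + b;  // compute in a wider type to observe the carry
    AluResult r;
    r.value = static_cast<uint8_t>(wide);
    r.zero  = (r.value == 0);
    r.carry = (wide > 0xFF);
    return r;
}

int main() {
    AluResult r = alu_add(200, 100);  // 300 does not fit in 8 bits
    printf("value=%u zero=%d carry=%d\n", r.value, r.zero, r.carry);
    // A real CPU would consult these flags for conditional branches (e.g., "jump if zero").
}
```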
Cache/Cache Hierarchy: The cache is a small, high-speed memory within the CPU for storing frequently accessed data and instructions. It reduces the need to access the main memory, thus minimizing latency and improving processing speed. The cache hierarchy is divided into different levels (L1, L2, and sometimes L3), each with varying sizes and access speeds.
- L1 Cache: This is the smallest and fastest cache, closest to the core, and usually split into two sections – instruction cache, and data cache.
- L2 Cache: This is larger and slower than the L1 cache, and serves as an intermediary between L1 and the main memory. L2 can either be dedicated to a core, or shared between multiple cores.
- L3 Cache: This is an additional cache level found in high-performance CPUs. It is larger and slower than L1 and L2, and is shared across all cores.
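The practical effect of the cache hierarchy shows up even in simple code. The sketch below sums the same matrix twice: once along rows (sequential addresses that stay within cached lines) and once along columns (strided addresses that keep missing the cache). On typical hardware the second traversal is several times slower, though exact numbers depend on the CPU, compiler, and optimization flags.

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

// Sums a matrix row-by-row (cache-friendly, sequential addresses) and
// column-by-column (cache-unfriendly, strided addresses) to show the
// effect of the cache hierarchy on otherwise identical work.
int main() {
    const int N = 4096;
    std::vector<int> m(static_cast<size_t>(N) * N, 1);

    auto time_sum = [&](bool row_major) {
        auto start = std::chrono::steady_clock::now();
        long long sum = 0;
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                sum += row_major ? m[i * N + j] : m[j * N + i];
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                      std::chrono::steady_clock::now() - start).count();
        printf("%s traversal: sum=%lld, %lld ms\n",
               row_major ? "row-major" : "column-major", sum, (long long)ms);
    };

    time_sum(true);   // touches consecutive cache lines; L1/L2 hits dominate
    time_sum(false);  // jumps N*sizeof(int) bytes per step; far more cache misses
}
```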
Registers: These are small storage locations within the CPU that hold data and instructions temporarily during execution, or carry out specific tasks. Their primary function is to provide the quickest possible access to data needed for computations (much faster than even the L1 cache). Registers are of two types:
- general-purpose registers, which temporarily store data and instructions used by the CPU (e.g., accumulator registers, data registers, and index registers);
- special-purpose registers, which carry out specific tasks, such as tracking the address of the next instruction (program counter), storing the address of the last item pushed onto the stack (stack pointer), and holding condition flags that guide the flow of program control (status register).
Interconnects, Buses & Other Components: These components facilitate communication between the CPU and other system components. Modern CPUs often use high-speed interconnects (e.g., Intel’s QuickPath Interconnect (QPI) or AMD’s Infinity Fabric) to reduce latency and improve the bandwidth between cores, memory controllers, and other components. In older CPU designs, these functions were handled by the Front-Side Bus (FSB). Moreover, many CPUs now include an integrated memory controller, reducing latency by allowing the CPU to communicate directly with memory rather than relying on an external chipset. Data buses transfer data, address buses hold memory addresses, and control buses carry control signals.
Furthermore, modern CPUs incorporate two techniques for maximizing instruction throughput and efficiency: pipelining and out-of-order execution. Pipelining divides the instruction cycle into stages, allowing multiple instructions to be processed simultaneously at different stages. Out-of-order execution rearranges the order of instructions, enabling instructions that do not depend on earlier ones to execute sooner. These techniques reduce idle time in the pipeline and improve performance.
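A small experiment hints at what pipelining and out-of-order execution buy. Both loops below perform the same floating-point additions, but the first forms a single dependency chain (each add must wait for the previous result), while the second keeps four independent accumulators that the hardware can overlap. The speedup and exact timings vary by CPU and compiler; this is a sketch, not a rigorous benchmark.

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

// Sums the same float array twice: once through a single accumulator (one long
// dependency chain, limited by floating-point add latency) and once through four
// independent accumulators, which gives the pipeline independent instructions to
// overlap. Compile with optimizations (e.g., -O2) but without -ffast-math, so the
// compiler keeps the dependency chain intact.
int main() {
    std::vector<float> data(1 << 24, 1.0f);  // ~16M elements

    auto t0 = std::chrono::steady_clock::now();
    float single = 0.0f;
    for (float x : data) single += x;        // each add depends on the previous result

    auto t1 = std::chrono::steady_clock::now();
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (size_t i = 0; i < data.size(); i += 4) {
        s0 += data[i];     s1 += data[i + 1];   // four independent dependency chains
        s2 += data[i + 2]; s3 += data[i + 3];
    }
    float multi = (s0 + s1) + (s2 + s3);
    auto t2 = std::chrono::steady_clock::now();

    auto ms = [](auto d) { return std::chrono::duration_cast<std::chrono::milliseconds>(d).count(); };
    printf("1 accumulator:  %lld ms (sum=%.0f)\n", (long long)ms(t1 - t0), single);
    printf("4 accumulators: %lld ms (sum=%.0f)\n", (long long)ms(t2 - t1), multi);
}
```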
The Internals of the GPU
Let’s dive deep into the internal components of a GPU.
Streaming Multiprocessors (SMs): These are foundational GPU components, analogous to CPU cores but with a difference. While CPU cores are optimized for complex, sequential tasks, SMs are optimized for the parallel execution of a large number of simple instructions. Under the hood, each SM consists of dozens of small cores, often referred to as CUDA cores (NVIDIA) or stream processors (AMD), and the many SMs on a GPU together execute thousands of simple arithmetic and logic operations simultaneously.
SMs process threads in groups called warps (NVIDIA) or wavefronts (AMD). An NVIDIA warp contains 32 threads (AMD wavefronts contain 32 or 64, depending on the architecture), and all the threads within a warp execute the same instruction simultaneously, i.e., Single Instruction, Multiple Threads (SIMT). This allows the GPU to handle parallel computations across large datasets. SMs also contain instruction pipelines, allowing multiple instructions to be in flight at different stages at the same time.
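A minimal CUDA kernel makes the SIMT model concrete: every thread runs the same code, but the computed index steers each one to its own element. The kernel and buffer names here are arbitrary, and unified memory is used only to keep the sketch short; it assumes a CUDA-capable GPU and compilation with nvcc.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Each thread executes the same instruction stream on its own element (SIMT).
// Threads are launched in blocks, and the hardware executes them in warps of 32.
__global__ void scale(const float *in, float *out, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique index per thread
    if (i < n) out[i] = in[i] * factor;             // same operation, different data
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));      // unified memory keeps the example short
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = float(i);

    int threadsPerBlock = 256;                      // 8 warps per block
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocks, threadsPerBlock>>>(in, out, 2.0f, n);
    cudaDeviceSynchronize();

    printf("out[12345] = %.1f\n", out[12345]);      // expect 24690.0
    cudaFree(in);
    cudaFree(out);
}
```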
Memory/Memory Hierarchy: GPUs have a sophisticated memory hierarchy that is key to maximizing data throughput and minimizing latency for large datasets.
- Global Memory (or Device Memory): This is the primary memory in a GPU that is relatively large (often in gigabytes) and accessible by all SMs but with a relatively high latency. It is off-chip, and used for storing large datasets, textures, and other information required for computations.
- Shared Memory: Each SM has a limited amount of on-chip shared memory accessible by all the cores within that SM. Shared memory is much faster than global memory, providing low-latency data access to threads within the same block. As a result, threads within the same block can communicate and share data without needing to access global memory (see the sketch after this list).
- Registers: Each SM also has a large number of registers for storing temporary data for each thread, providing the fastest data access. These registers are allocated per thread, and allow the cores to perform computations without the need to access slower memory.
- Specialized Memory (e.g., constant & texture memory): These are memory units optimized for specialized tasks—for instance, read-only access for values that do not change during execution (constant memory) and tasks that revolve around spatial locality properties (texture memory).
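As promised in the shared-memory item above, here is a CUDA sketch of the hierarchy in action: each block copies its slice of global memory into fast shared memory once, performs a cooperative reduction entirely on-chip, and writes a single result back. It assumes a CUDA-capable GPU; the kernel name and sizes are arbitrary.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Each block stages its slice of global memory in on-chip shared memory, then the
// threads of the block cooperate on a tree reduction without touching global
// memory again until the single per-block result is written back.
__global__ void block_sum(const float *in, float *block_results, int n) {
    __shared__ float tile[256];                     // shared memory: visible to this block only

    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;
    tile[tid] = (i < n) ? in[i] : 0.0f;             // one read from slow global memory
    __syncthreads();                                // wait until the whole tile is loaded

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) tile[tid] += tile[tid + stride];  // all traffic stays in shared memory
        __syncthreads();
    }
    if (tid == 0) block_results[blockIdx.x] = tile[0];      // one write back to global memory
}

int main() {
    const int n = 1 << 20, threads = 256, blocks = n / threads;
    float *in, *partial;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&partial, blocks * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    block_sum<<<blocks, threads>>>(in, partial, n);
    cudaDeviceSynchronize();

    float total = 0.0f;
    for (int b = 0; b < blocks; ++b) total += partial[b];   // CPU finishes the per-block sums
    printf("total = %.0f (expected %d)\n", total, n);
    cudaFree(in);
    cudaFree(partial);
}
```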
Memory Buses & Memory Coalescing: Modern GPUs are designed with wide memory buses (e.g., 256-bit, 384-bit) and specialized memory technologies, e.g., Graphics Double Data Rate (GDDR) memory, or High Bandwidth Memory (HBM). The wide memory buses enable GPUs to move larger data volumes in each cycle. Moreover, memory coalescing techniques are used to optimize memory access patterns. For instance, when threads in a warp access adjacent memory locations, these accesses are coalesced into a single transaction, reducing the number of memory requests and improving bandwidth utilization.
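The following sketch contrasts a coalesced copy with a strided one. In the first kernel, the 32 threads of a warp read 32 consecutive floats, which the hardware services with a few wide transactions; in the second, a stride of 32 pushes each active thread onto a different cache line, so even though far fewer elements are touched, the launch typically takes a comparable or longer time. Timings are illustrative and depend on the GPU.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Coalesced vs. strided global-memory access with otherwise identical kernels.
__global__ void copy_coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];                   // thread k reads address base + k
}

__global__ void copy_strided(const float *in, float *out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];                   // thread k reads address base + k*stride
}

int main() {
    const int n = 1 << 24, threads = 256, stride = 32;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms;

    cudaEventRecord(start);
    copy_coalesced<<<n / threads, threads>>>(in, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("coalesced: %.3f ms\n", ms);

    cudaEventRecord(start);
    // Same grid, but only 1/32 of the threads fall inside the array; the scattered
    // accesses still need many more memory transactions.
    copy_strided<<<n / threads, threads>>>(in, out, n, stride);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("strided (stride=%d): %.3f ms\n", stride, ms);

    cudaFree(in); cudaFree(out);
}
```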
Data Interconnects: Modern GPUs use advanced interconnects (e.g., crossbar switches) to facilitate the rapid transfer of data between SMs, memory controllers, and other components. The crossbar switch is a network of interconnections that allows any SM to access any part of the memory subsystem. The Peripheral Component Interconnect Express (PCIe) is traditionally used to connect GPUs to the CPU, allowing data to flow between system memory and the GPU. NVIDIA uses NVLink to facilitate high-bandwidth and low-latency GPU-to-GPU communication, or GPU-to-CPU data transfer.
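The CUDA runtime exposes these interconnects through its peer-to-peer API. The sketch below checks whether GPU 0 can address GPU 1 directly (over NVLink or PCIe peer-to-peer, depending on the system) and copies a buffer between them without staging it in host memory; it assumes a machine with at least two CUDA GPUs, and the buffer size is arbitrary.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    if (deviceCount < 2) { printf("This sketch needs at least two GPUs.\n"); return 0; }

    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);       // can GPU 0 address GPU 1 directly?
    printf("GPU 0 -> GPU 1 peer access: %s\n", canAccess ? "yes" : "no");

    const size_t bytes = 64u << 20;                  // 64 MB test buffer
    float *buf0, *buf1;
    cudaSetDevice(0); cudaMalloc(&buf0, bytes);
    cudaSetDevice(1); cudaMalloc(&buf1, bytes);

    if (canAccess) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);            // enable direct access from GPU 0 to GPU 1
    }
    // cudaMemcpyPeer takes the direct NVLink/PCIe peer path when available,
    // and otherwise stages the transfer through host memory.
    cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);
    cudaDeviceSynchronize();
    printf("Copied %zu MB from GPU 0 to GPU 1.\n", bytes >> 20);

    cudaSetDevice(0); cudaFree(buf0);
    cudaSetDevice(1); cudaFree(buf1);
}
```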
Control Logic & Warp Scheduling: GPUs achieve efficient parallelism through warp scheduling and lightweight control mechanisms. The SMs schedule warps (or wavefronts) to keep the cores active as much as possible, using a technique called latency hiding to minimize idle time and improve utilization and throughput: when one warp is waiting for data (e.g., from memory), the scheduler switches to another warp that is ready to execute.
Moreover, the control logic within a GPU is much more streamlined than that within a CPU. This is due to the Single Instruction, Multiple Threads (SIMT) execution model, which allows GPUs to execute the same instruction across multiple threads simultaneously, thus greatly simplifying the control logic. However, if different threads within a warp need to execute different instructions (divergence), the GPU performance can suffer as the warp must serialize these operations. Another important process is thread block scheduling in which GPUs organize threads into blocks, which are scheduled across available SMs. Each SM can manage multiple blocks simultaneously, allowing for scalability in the number of tasks handled.
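Divergence is easy to provoke, and to avoid, in code. In the first kernel below, even and odd threads of the same warp take different branches, so the warp executes both paths one after the other; in the second, the condition is uniform within each warp, so the branch costs essentially nothing. The kernel names and the sin/cos workload are arbitrary choices for illustration.

```cpp
#include <cuda_runtime.h>

__global__ void divergent(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)            // even and odd lanes of the same warp disagree...
        data[i] = sinf(data[i]);         // ...so the warp runs this path first
    else
        data[i] = cosf(data[i]);         // ...and then this one, serially
}

__global__ void warp_uniform(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x / 32) % 2 == 0)     // all 32 threads of a warp take the same path
        data[i] = sinf(data[i]);
    else
        data[i] = cosf(data[i]);
}

int main() {
    const int n = 1 << 24;
    float *data;
    cudaMalloc(&data, n * sizeof(float));
    divergent<<<n / 256, 256>>>(data);     // both paths executed by every warp
    warp_uniform<<<n / 256, 256>>>(data);  // one path per warp, no serialization
    cudaDeviceSynchronize();
    cudaFree(data);
}
```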
Tensor Cores & Ray Tracing Cores: Modern GPU architectures (e.g., NVIDIA’s Volta and later models) include specialized units called tensor cores that accelerate the matrix multiplication operations at the heart of deep learning. Tensor cores perform mixed-precision matrix multiply-accumulate operations far more efficiently than general-purpose CUDA cores. Modern GPUs also include ray tracing cores for real-time ray tracing in graphics. These cores accelerate the calculations required for realistic lighting and reflections, enhancing visual fidelity in graphics-intensive applications.
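Tensor cores can be programmed directly through CUDA’s warp-level matrix API in <mma.h>, in which a whole warp cooperates on small matrix tiles; most applications reach them indirectly through libraries such as cuBLAS or cuDNN. The sketch below multiplies one 16x16 half-precision tile pair and accumulates in float; it assumes a GPU of compute capability 7.0 or newer and compilation with an appropriate architecture flag (e.g., -arch=sm_70).

```cpp
#include <cstdio>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <mma.h>
using namespace nvcuda;

// One warp cooperatively multiplies a 16x16 half-precision tile by another and
// accumulates into a 16x16 float tile on the tensor cores.
__global__ void tensor_core_tile(const half *a, const half *b, float *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);              // C tile starts at zero
    wmma::load_matrix_sync(a_frag, a, 16);       // the whole warp loads the A tile
    wmma::load_matrix_sync(b_frag, b, 16);       // ...and the B tile
    wmma::mma_sync(acc, a_frag, b_frag, acc);    // one tensor-core matrix-multiply-accumulate
    wmma::store_matrix_sync(c, acc, 16, wmma::mem_row_major);
}

int main() {
    half *a, *b; float *c;
    cudaMallocManaged(&a, 16 * 16 * sizeof(half));
    cudaMallocManaged(&b, 16 * 16 * sizeof(half));
    cudaMallocManaged(&c, 16 * 16 * sizeof(float));
    for (int i = 0; i < 16 * 16; ++i) { a[i] = __float2half(1.0f); b[i] = __float2half(1.0f); }

    tensor_core_tile<<<1, 32>>>(a, b, c);        // a single warp drives the tensor cores
    cudaDeviceSynchronize();
    printf("c[0] = %.0f (expected 16)\n", c[0]); // each output is a dot product of 16 ones
    cudaFree(a); cudaFree(b); cudaFree(c);
}
```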
How do CPUs & GPUs work together?
The integration of both CPU and GPU in modern systems enables unprecedented computing power, allowing for complex simulations, faster data processing, and advancements in artificial intelligence. Understanding the roles and interactions of CPUs and GPUs in today’s technology landscape provides insight into how they enable innovations across diverse industries, from gaming and visualization to scientific research and autonomous systems.
As explained earlier, the CPU is the general-purpose processor responsible for managing overall system operations, while the GPU is a specialized processor optimized for handling parallel tasks. Their collaboration involves dividing computational workloads based on each processor’s strengths. The CPU manages high-level system control, task scheduling, and coordination between hardware components; it determines which tasks to offload to the GPU and which to execute itself. The CPU sends data-intensive, parallelizable tasks (e.g., matrix multiplications, image rendering, and simulations) to the GPU, with the data transfer generally happening over PCIe and related technologies.
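In code, that division of labour follows a familiar pattern: the CPU prepares data, copies it to the GPU, launches a kernel, and copies the results back, staying free for other work in between. The sketch below uses the standard CUDA runtime calls; the kernel and sizes are arbitrary.

```cpp
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical example: the CPU stages data, the GPU squares it in parallel,
// and the CPU collects the results.
__global__ void square(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= x[i];
}

int main() {
    const int n = 1 << 20;
    std::vector<float> host(n);
    for (int i = 0; i < n; ++i) host[i] = float(i);

    float *dev;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);  // CPU -> GPU (typically over PCIe)

    square<<<(n + 255) / 256, 256>>>(dev, n);     // the GPU handles the data-parallel work
    // ...the CPU is free to do unrelated sequential work here...

    cudaMemcpy(host.data(), dev, n * sizeof(float), cudaMemcpyDeviceToHost);  // GPU -> CPU (waits for the kernel)
    printf("host[1000] = %.0f (expected 1000000)\n", host[1000]);
    cudaFree(dev);
}
```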
Some modern systems use unified memory architectures, e.g., Unified Memory (NVIDIA) or Heterogeneous System Architecture (AMD), that enable the CPU and GPU to access the same memory space, thus reducing the need to copy data between the two. Moreover, high-bandwidth interconnects, e.g., NVLink (NVIDIA), enable GPUs to communicate directly with each other, reducing the need for CPU intervention in certain data exchanges. So, once the CPU assigns tasks, the GPUs share data directly among themselves to perform those tasks, improving processing efficiency.
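With CUDA Unified Memory, the explicit copy calls disappear: a single allocation is visible to both processors, and pages migrate on demand. A minimal sketch, assuming a GPU and driver that support managed memory:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// One managed allocation is shared by the CPU and GPU; the driver migrates pages
// on demand, so no explicit cudaMemcpy between host and device is needed.
__global__ void add_one(int *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1;
}

int main() {
    const int n = 1 << 20;
    int *x;
    cudaMallocManaged(&x, n * sizeof(int));      // one allocation visible to CPU and GPU

    for (int i = 0; i < n; ++i) x[i] = i;        // CPU writes directly
    add_one<<<(n + 255) / 256, 256>>>(x, n);     // GPU updates the same memory
    cudaDeviceSynchronize();                     // wait before the CPU reads it back

    printf("x[42] = %d (expected 43)\n", x[42]); // CPU reads directly, no copy-back call
    cudaFree(x);
}
```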
Image Source: "Optimizing MRI Data Processing by Exploiting GPU Acceleration for Efficient Image Analysis and Reconstruction," Irfan Ullah & Hammad Omer, 2023
In high-performance computing and large data centers, the CPU often orchestrates multiple GPUs, distributing tasks across them and managing the data flow. Frameworks like CUDA and OpenCL are used to develop applications, particularly in machine learning, that split workloads between the CPUs and GPUs based on their computational nature. Similarly, DirectX and Vulkan allow efficient CPU-GPU interaction and collaboration in graphics.
Closing Comments
Developing a deep understanding of the architectural components of CPUs and GPUs (e.g., the arrangement and function of cores, cache hierarchies, and control units) is key to fully exploiting their unique capabilities, including their computational complementarity. The collaboration between CPUs and GPUs creates a powerful and efficient computing environment capable of handling a wide range of modern applications. As technology continues to evolve, innovations in CPUs and GPUs will remain a crucial factor in unlocking new possibilities in gaming, machine learning, scientific research, and beyond.
PS: 10 – 15% of this paper was written with the help of generative AI.