GPU-Accelerated Computing in Python
❯ Why do you need a GPU?
A GPU, also called a “video card” or “graphics processor,” is the component of a computer responsible for rendering images and video. Unlike a CPU, a GPU excels at splitting a task into subtasks and executing them in parallel. A GPU always has many cores, so suitable calculations run more efficiently on it, which makes the GPU ideal for highly parallel workloads. The following table provides a simplified comparison of CPUs and GPUs.
| CPU | GPU |
| --- | --- |
| The main component of the computer; performs all general-purpose computing work | A specialized component; processes graphics and is responsible for displaying video |
| Number of cores: from 2 to 64 (for most CPUs) | Number of cores: thousands |
| Executes processes sequentially | Executes processes in parallel |
| Does better with one big task at a time | Does better with many tasks at once |
Difference between CPU and GPU. Source.
❯ PyCuda
CUDA, NVIDIA’s toolkit for general-purpose GPU computing, provides a driver API and a runtime environment that make it possible to accelerate suitable tasks on the GPU using various tools and libraries. Python is widely recognized as the most popular language for scientific computing, engineering, data analysis, and deep learning, but because it is an interpreted language it is often criticized for its relatively low performance. This is where the PyCuda library stands out: it provides a straightforward, Pythonic way to access NVIDIA’s CUDA API for parallel computing. PyCuda has a number of key advantages, including:
- Completeness: PyCuda provides access to the full range of capabilities contained in the CUDA API, allowing you to use these capabilities to their full potential.
- Automated error checking: any CUDA errors encountered at runtime are automatically converted to Python exceptions. This makes it more convenient to handle errors and debug the code.
- Speed: at the heart of PyCuda is a layer implemented in C++, so you can take advantage of all the above benefits without sacrificing performance.
You can learn more about this topic from the useful documentation provided here.
❯ Basic CUDA terminology
Here is the basic CUDA programming terminology you need to know to get involved with GPU-based computing:
Figure: GPU grid architecture, with its main components labeled: grid, block, thread, registers, shared memory, local memory, global memory, constant memory, and texture memory.
- Kernel: a function executed in parallel by many threads on the GPU. This is a fundamental concept of GPU programming: the kernel is the code that executes on the GPU device.
- Grid: a collection of thread blocks. It represents the overall layout of the threads executing a kernel on the GPU.
- Thread: the smallest unit of execution, which solves a specific part of the task within the kernel. It represents a separate instance of the code running on the GPU (a sketch of how to inspect the limits on these structures follows this list).
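Grids and blocks cannot be made arbitrarily large. As a small sketch (the attributes shown are standard CUDA device attributes exposed through PyCuda), you can query what your particular device allows:

import pycuda.autoinit
import pycuda.driver as cuda

dev = pycuda.autoinit.device
print(dev.name())
# Maximum number of threads a single block may contain
print(dev.get_attribute(cuda.device_attribute.MAX_THREADS_PER_BLOCK))
# Maximum block and grid sizes along the x axis
print(dev.get_attribute(cuda.device_attribute.MAX_BLOCK_DIM_X))
print(dev.get_attribute(cuda.device_attribute.MAX_GRID_DIM_X))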
❯ Running your first program on the GPU
Before you start, make sure you have an NVIDIA GPU. You can check which GPU is on your machine at https://www.pcmag.com/how-to/what-graphics-card-do-i-have
If you already have Python installed on your machine, you can run the following command to install PyCuda:
pip install pycuda
After you have successfully installed PyCuda and configured your computing environment, you can run the following program that doubles each of the elements in the given array:
import pycuda.driver as cuda
import pycuda.autoinit
import numpy
from pycuda.compiler import SourceModule
mod = SourceModule("""
__global__ void doublify(float *a)
{
    int idx = threadIdx.x + threadIdx.y*4;
    a[idx] *= 2;
}
""")
# Initialize the array
a = numpy.random.randn(4,4).astype(numpy.float32)
# Allocate memory on the device (GPU)
a_gpu = cuda.mem_alloc(a.nbytes)
# Transfer the data from the host (CPU) to the device (GPU)
cuda.memcpy_htod(a_gpu, a)
# Get a reference to our kernel function
func = mod.get_function("doublify")
# Call the function with a block size of (4,4)
func(a_gpu, block=(4,4,1))
# Fetch the result back from the GPU
a_doubled = numpy.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
# Further processing of the new array
a_doubled = ...
In the first step of this code, we import PyCuda and initialize CUDA with pycuda.autoinit. The kernel code is then compiled with SourceModule, checked for errors, and, if none are found, loaded onto the device. Finally, we use the class pycuda.driver.Function to obtain a reference to our kernel and call it with the required arguments. The following sections explain in more detail, step by step, what you need to do to make a kernel run on the GPU.
❯ Host and device code
In CUDA, the decision about where a function is executed, on the CPU or on the GPU, is made on the basis of its signature, so the placement is clear both to the developer and to the compiler.
Code intended to run on the CPU is called host code, and code intended to run on the GPU is called device code. As mentioned above, the two kinds of code differ in their function signatures: a device-code function begins with the keyword __global__ or __device__, whereas host-code functions carry no such qualifier (optionally, they can be marked with the keyword __host__).
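A minimal sketch of how these qualifiers look in practice (the helper name twice is made up for illustration): a __device__ function can only be called from other device code, while a __global__ function is the kernel entry point that the host launches.

import pycuda.autoinit
from pycuda.compiler import SourceModule

mod_qualifiers = SourceModule("""
// __device__: a helper that can only be called from device code
__device__ float twice(float x)
{
    return 2.0f * x;
}

// __global__: a kernel, the entry point launched from the host
__global__ void doublify(float *a)
{
    int idx = threadIdx.x;
    a[idx] = twice(a[idx]);
}
""")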
❯ How does indexing work in device code?
In CUDA, array/vector indexing can be done in one, two, or three dimensions, depending on the dimension of the array. Let’s consider how array indexing is arranged in CUDA for 1D, 2D and 3D cases:
1D:
- Indexing a one-dimensional array is straightforward: threadIdx.x gives each thread a unique index that identifies it along the x direction.
- That index can then be used to address the corresponding element of the one-dimensional array. Returning to the previous example (1D case):
mod = SourceModule("""
__global__ void doublify(float *a)
{
    int idx = threadIdx.x;
    a[idx] *= 2;
}
""")
2D:
- In a two-dimensional array, threads are arranged as a flat grid, where both the thread blocks and the threads within them get IDs along the x and y axes.
- To compute the index into a two-dimensional array, you usually multiply the row index (threadIdx.y or blockDim.y * blockIdx.y + threadIdx.y) by the width of the array and add the column index (threadIdx.x or blockDim.x * blockIdx.x + threadIdx.x).
- That index can then be used to address the desired element of the two-dimensional array. Returning to the previous example (2D case):
mod = SourceModule("""
__global__ void doublify(float a[4][4])
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    int j = blockDim.y * blockIdx.y + threadIdx.y;
    a[i][j] *= 2;
}
""")
3D:
- Similarly, for a 3D array, threads are arranged in a three-dimensional grid, and both thread blocks and individual threads get IDs along the x, y, and z axes.
- To compute the index into a 3D array, you multiply the depth index (threadIdx.z or blockDim.z * blockIdx.z + threadIdx.z) by the product of the array’s height and width.
- Then you multiply the row index (threadIdx.y or blockDim.y * blockIdx.y + threadIdx.y) by the width of the array and add the column index (threadIdx.x or blockDim.x * blockIdx.x + threadIdx.x).
- Using the resulting index, you can address the desired element of the 3D array. Returning to the previous example (3D case):
mod = SourceModule("""
__global__ void doublify(float a[4][4][4])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;
    a[i][j][k] *= 2;
}
""")
Based on these indexing schemes, CUDA provides efficient parallel access to array elements from the different threads running a GPU kernel. Read more about thread indexing here.
❯ Common mistakes when programming with PyCuda
Invalid memory access: here you may see the error message “RuntimeError: CUDA error: an illegal memory access was encountered”. This error is the CUDA-specific counterpart of “index out of range”. You can address it as follows:
- Make sure the block size and the data size match, i.e. that the chunk of data you operate on actually fits the threads of a block. If the block is larger than the data, threads will index past the end of the array, which causes invalid memory accesses. To prevent this, adjust the block size to the size of the data.
- Introduce a range check for indexes: if you cannot match the block size to the data size as described above, you can add if conditions that stop a thread from touching memory as soon as its index goes beyond the valid range. By checking the index against the data size, you prevent illegal memory accesses (see the sketch after this list).
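A minimal sketch of such a guard (the kernel name doublify_guarded and the sizes are illustrative; the imports are the same as in the first program). The kernel receives the array length as an extra argument, and each thread checks its index before touching memory:

mod_guard = SourceModule("""
__global__ void doublify_guarded(float *a, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)          // threads whose index falls outside the array do nothing
        a[idx] *= 2;
}
""")

n = 100                               # array length, not a multiple of the block size
a = numpy.random.randn(n).astype(numpy.float32)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)

func = mod_guard.get_function("doublify_guarded")
# 4 blocks of 32 threads = 128 threads, more than n; the guard keeps the extra 28 in bounds
func(a_gpu, numpy.int32(n), block=(32, 1, 1), grid=(4, 1))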
“Out of memory” error: you may encounter the message “RuntimeError: CUDA out of memory”. This sometimes happens when working with large datasets or complex models that require a lot of GPU memory. To solve this problem, you can do the following:
- Reduce the data size: for example, you can reduce the size of the data being processed. In particular, try downsampling the data or selecting a smaller subset of it for analysis. By reducing the data size, you free up GPU memory and avoid such errors.
- Apply batch processing: instead of processing the entire dataset at once, divide it into smaller batches. Processing the data in these reduced portions minimizes GPU memory consumption at any given time. This approach is particularly convenient, for example, when training deep learning models, where the data can be processed incrementally, batch by batch (a rough sketch follows this list).
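A rough sketch of the batching idea (the batch size and array length are arbitrary; func is the guarded kernel from the previous sketch, and the imports are the same as in the first program). Instead of allocating GPU memory for the whole array, allocate it for one batch and reuse it:

batch_size = 1024
data = numpy.random.randn(1_000_000).astype(numpy.float32)
out = numpy.empty_like(data)

# Allocate device memory for a single batch only, and reuse it on every iteration
batch_gpu = cuda.mem_alloc(batch_size * data.dtype.itemsize)

for start in range(0, len(data), batch_size):
    batch = numpy.ascontiguousarray(data[start:start + batch_size])
    cuda.memcpy_htod(batch_gpu, batch)
    # Enough 256-thread blocks to cover the batch; the guard handles the remainder
    func(batch_gpu, numpy.int32(len(batch)),
         block=(256, 1, 1), grid=((len(batch) + 255) // 256, 1))
    cuda.memcpy_dtoh(out[start:start + len(batch)], batch_gpu)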
Other errors encountered during compilation in CUDA may be due to incorrect environment settings or programming errors.
❯ Supplement: strides
The concept of array strides also comes up in several sources, although so far I have not run into errors caused by failing to take strides into account.
In NumPy arrays, strides are an indexing scheme that specifies how many bytes must be skipped to move to the next element. For a two-dimensional matrix, the strides tuple contains two values: the number of bytes to move to the next row and the number of bytes to move to the next column.
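A quick illustration of what that strides tuple looks like for a small float32 matrix:

import numpy

a = numpy.zeros((4, 4), dtype=numpy.float32)
print(a.strides)      # (16, 4): 16 bytes to the next row, 4 bytes to the next column
# A transposed view changes the strides without copying the data
print(a.T.strides)    # (4, 16)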
When computing with the GPU in mind, understanding strides becomes important, and here is why.
In CUDA programming, data layout and strides play a crucial role in efficient data access and memory coalescing. Coalescing means that accesses to adjacent memory locations are combined as if they targeted one contiguous block. In parallel computations, this significantly increases the effective memory bandwidth and the overall performance of the system.
In CUDA, threads are grouped into blocks, and memory accesses are coordinated across the threads of each block. When reading data from global memory, each thread usually handles its own portion of the data. With suitable strides, threads access memory locations that are contiguous or close to each other, which produces more efficient memory access patterns.
By carefully designing these access patterns and choosing the right strides when working with CUDA, you can optimize memory coalescing and achieve higher performance when computing on the GPU. This kind of optimization is especially important when working with large datasets or when performing memory-intensive operations in parallel.
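One practical consequence of this in PyCuda (a small sketch, not a universal recipe): make sure the array you transfer to the device is C-contiguous, otherwise the layout the kernel sees will not match the strided view you had on the host.

import numpy
import pycuda.autoinit
import pycuda.driver as cuda

a = numpy.random.randn(1024, 1024).astype(numpy.float32)
sub = a[:, ::2]                       # a strided view (every second column), not C-contiguous
print(sub.flags['C_CONTIGUOUS'])      # False
# Make the data contiguous before copying it to the GPU
sub_c = numpy.ascontiguousarray(sub)
sub_gpu = cuda.mem_alloc(sub_c.nbytes)
cuda.memcpy_htod(sub_gpu, sub_c)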
❯ Conclusion
Developing programs that run on the GPU is sometimes harder than regular programming in C++ or Python, because a different programming model is involved and you need a deep understanding of the hardware the program runs on. Fortunately, many high-level languages, and Python in particular, offer libraries that let you program for the GPU without getting bogged down in low-level details. Still, I recommend that you thoroughly study the basics of parallel computing and GPU architecture before trying your hand at this kind of programming.
Link:
It is known how difficult it is to program for GPUs due to the unique architecture of these processors, which, moreover, is constantly evolving. This topic is discussed in more detail here: https://dl.acm.org/doi/10.1145/3611643.3616365.