Configuring Linux for training models with GPU

Well, the hardware collection is complete! My GPU rig stands ready and waits for the command to act. But of course, just assembling a PC is only the beginning of the journey. Now you need to teach the system to work with this beast by installing Linux, drivers, CUDA and other joys. And this, as we know, can turn into another quest: if everything doesn’t work perfectly right away, the “show of unpredictable problems” will definitely begin. I’m not a big fan of configuring and reconfiguring, and I’m not a big expert in Linux, but I periodically have to dig into settings, so I decided to write everything down as scripts right away, to simplify the process and be able to roll back. The result is a set of scripts that “will do everything for you”, and you can see their description here! With any luck, they won’t even break the system (joke: they will definitely break it).

Three steps to success

I will not cover the Linux installation itself; it is well documented elsewhere. I will only say that I chose Ubuntu 24.04 Desktop as the base (sometimes a desktop environment is needed), and then configured the system to my needs.

For ease of setup, I have divided the installation into three parts, each of which solves specific tasks, making the process more flexible and convenient:

  1. Setting up remote access – SSH and security hardening so you can connect to the machine.

  2. Installing drivers and CUDA – the key to harnessing the power of the GPU; without it, your hardware is simply useless.

  3. Development tools – Docker, Jupyter and other nice little things to make writing and testing code comfortable and safe.

For each step, I wrote scripts that install, remove, or manage the installed components. The settings for each step live in config.env files; the details are in the readme.
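To give a flavor of the approach, a config.env for the remote-access stage could look something like this. The variable names here are illustrative, not necessarily the ones my scripts actually use:

```shell
# config.env – example settings for the remote-access stage
# (variable names are hypothetical, for illustration only)
SSH_PORT=2222               # move SSH off the default port 22
ENABLE_UFW=true             # turn the firewall on after opening SSH_PORT
ENABLE_RDP=true             # install xrdp for a full remote desktop
ENABLE_VNC=false            # RDP and VNC rarely both make sense
SAMBA_SHARE_DIR=/srv/share  # directory exported over the network
```

Keeping these choices in one file is what makes the install/remove scripts symmetric: both read the same config, so a rollback removes exactly what was installed.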

First step: remote access

I use the PC as a home server, but sometimes I need it as a desktop environment; otherwise I would have installed the server version of Linux. In general, the PC sits in the dark without a monitor, and everything running on it must be accessible remotely. Therefore, the first step is to configure remote access. For this, the following is provided:

  • SSH – for a secure connection to the server.

  • UFW (Uncomplicated Firewall) – to protect the network.

  • RDP – for remote desktop access.

  • VNC – an alternative for graphical access.

  • Samba – for sharing files on the network.
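For the curious, here is a minimal sketch of what this stage boils down to on Ubuntu 24.04. This is not the actual script from the repo, just the core commands; the ports and the RDP choice are up to you:

```shell
#!/usr/bin/env bash
set -euo pipefail

# SSH server, remote desktop, and file sharing
sudo apt-get update
sudo apt-get install -y openssh-server xrdp samba
sudo systemctl enable --now ssh

# UFW: deny everything incoming except the services we expose
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow 22/tcp       # SSH
sudo ufw allow 3389/tcp     # RDP (xrdp)
sudo ufw allow Samba        # app profile installed with the samba package
sudo ufw --force enable
```

The order matters: open the SSH port before enabling UFW, or a remote session will saw off the branch it is sitting on.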

Detailed readme for the first stage.

Second step: NVIDIA and CUDA drivers

Now let’s get to the moment it all started for. After all, I needed the GPU, and that means I couldn’t do without the NVIDIA drivers.

So, what do we install:

  • NVIDIA drivers – so that the video card finally understands what is wanted from it.

  • CUDA – the magic of parallel computing; neural network training cannot do without it.

  • cuDNN – NVIDIA’s library of deep learning primitives.

  • Python – for development; in my case the Ubuntu distribution already included Python 3.12, but I also needed a second, older version, 3.11.
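As a hedged sketch (not the actual script from the repo), the second stage on Ubuntu 24.04 can look roughly like this; the CUDA toolkit version below is an example, so check NVIDIA’s current installation guide before copying:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Let Ubuntu pick the recommended proprietary driver
sudo ubuntu-drivers install

# CUDA toolkit and cuDNN from NVIDIA's apt repository
# (the toolkit version 12-4 is only an example)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get install -y cuda-toolkit-12-4 cudnn

# Python 3.11 alongside the system's 3.12 (via the deadsnakes PPA)
sudo add-apt-repository -y ppa:deadsnakes/ppa
sudo apt-get update
sudo apt-get install -y python3.11 python3.11-venv

# A reboot is needed before nvidia-smi will see the card
```

Note the last comment: a black screen immediately after this stage is usually the driver, so keep a recovery plan (or at least SSH from another machine) before rebooting.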

We adjust the config and run the scripts; if you’re lucky, you won’t get a sudden reboot into a black screen (which, by the way, also looks quite minimalist and stylish). If it does happen, then maybe you are simply Malevich?

Let’s continue with those whose installation succeeded. Let’s check the NVIDIA software installation:

$ nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Jan__6_16:45:21_PST_2023
Cuda compilation tools, release 12.0, V12.0.140
Build cuda_12.0.r12.0/compiler.32267302_0

If the output of the following command shows exactly your GPU, then your karma is clean and everything is ahead of you; if not, it’s time to reconsider your life priorities. I was lucky.

$ nvidia-smi

Fri Sep 27 17:01:20 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090 Ti     Off | 00000000:01:00.0 Off |                  Off |
|  0%   41C    P8              15W / 450W |   4552MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2441      C   python                                     4546MiB |
+---------------------------------------------------------------------------------------+

And the icing on the cake: let’s check whether your GPU is really ready to work for the benefit of science. Use the following code (don’t forget to install PyTorch first):

import torch
print("CUDA available: ", torch.cuda.is_available())
print("Number of available GPUs: ", torch.cuda.device_count())

The result should be:

python test_gpu.py
CUDA available:  True
Number of available GPUs:  1

If the output confirms that CUDA is available, then the setup was successful and you are ready to dive into the world of GPU-accelerated deep learning. Well, or at least to start figuring out what else went wrong.

Detailed readme for the second stage.

Step Three: Development Tools

After the first two stages, we have remote access configured, drivers installed, and CUDA working. What’s next? Next you need an environment to work in, so that you can train your models, run them for testing, and generally fully load all the CPU/GPU cores and memory the hardware has. The scripts here install the minimum set of components I need, namely:

  • Git – version control system.

  • Docker – containerization platform.

  • Jupyter – isn’t it every developer’s dream to see their mistakes right away in the browser?

  • Ray – a platform for those who have decided that a single GPU is boring and it’s time to scale up.
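A sketch of the third stage, again with caveats: Docker comes from Ubuntu’s docker.io package here for brevity (the upstream docker-ce repository is the alternative), the NVIDIA Container Toolkit assumes NVIDIA’s apt repository for it is already configured, and Jupyter with Ray go into a virtual environment:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Git and Docker (docker.io is Ubuntu's packaged Docker Engine)
sudo apt-get update
sudo apt-get install -y git docker.io
sudo usermod -aG docker "$USER"   # run docker without sudo (re-login required)

# NVIDIA Container Toolkit so containers can see the GPU
# (assumes NVIDIA's container-toolkit apt repository is configured)
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Jupyter and Ray in an isolated virtual environment
python3 -m venv "$HOME/mlenv"
"$HOME/mlenv/bin/pip" install jupyterlab "ray[default]"
```

A quick smoke test afterwards: `docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi` should print the same table as on the host.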

Detailed readme for the third stage.

Conclusion

You can probably do it better, cooler, and so on, but I hope my scripts will help someone save time preparing a PC for training models, and will provoke in someone a healthy or unhealthy reaction. I will be happy for the first, thank the second, and feel sorry for the third. Next time I plan to talk about installing LLM models.
