Thursday, January 23, 2025

Optimizing AI Workloads with NVIDIA GPUs, Time Slicing, and Karpenter

Maximizing GPU efficiency in your Kubernetes environment

In this article, we will explore how to deploy GPU-based workloads in an EKS cluster using the Nvidia Device Plugin, and ensuring efficient GPU utilization through features like Time Slicing. We will also discuss setting up node-level autoscaling to optimize GPU resources with solutions like Karpenter. By implementing these strategies, you can maximize GPU efficiency and scalability in your Kubernetes environment.

Additionally, we will delve into practical configurations for integrating Karpenter with an EKS cluster, and discuss best practices for balancing GPU workloads. This approach will help in dynamically adjusting resources based on demand, leading to cost-effective and high-performance GPU management. The diagram below illustrates an EKS cluster with CPU and GPU-based node groups, along with the implementation of Time Slicing and Karpenter functionalities. Let’s discuss each item in detail.

AI nvidia 1

Basics of GPU and LLM

A Graphics Processing Unit (GPU) was originally designed to accelerate image processing tasks. However, due to its parallel processing capabilities, it can handle numerous tasks concurrently. This versatility has expanded its use beyond graphics, making it highly effective for applications in Machine Learning and Artificial Intelligence.

AI nvidia 5

When a process is launched on GPU-based instances these are the steps involved at the OS and hardware level:

  • Shell interprets the command and creates a new process using fork (create new process) and exec (Replace the process’s memory space with a new program) system calls.
  • Allocate memory for the input data and the results using cudaMalloc(memory is allocated in the GPU’s VRAM)
  • Process interacts with GPU Driver to initialize the GPU context here GPU driver manages resources including memory, compute units and scheduling
  • Data is transferred from CPU memory to GPU memory
  • Then the process instructs GPU to start computations using CUDA kernels and the GPU schedular manages the execution of the tasks
  • CPU waits for the GPU to finish its task, and the results are transferred back to the CPU for further processing or output.
  • GPU memory is freed, and GPU context gets destroyed and all resources are released. The process exits as well, and the OS reclaims the resource

Compared to a CPU which executes instructions in sequence, GPUs process the instructions simultaneously. GPUs are also more optimized for high performance computing because they don’t have the overhead a CPU has, like handling interrupts and virtual memory that is necessary to run an operating system. GPUs were never designed to run an OS, and thus their processing is more specialized and faster.

AI nvidia 2

Large Language Models

A Large Language Model refers to:

  • “Large”: Large Refers to the model’s extensive parameters and data volume with which it is trained on
  • “Language”: Model can understand and generate human language
  • “Model”: Model refers to neural networks

AI nvidia 3

Run LLM Model

Ollama is the tool to run open-source Large Language Models and can be download here https://ollama.com/download

Pull the example model llama3:8b using ollama cli

ollama -h
Large language model runner
​
Usage:
  ollama [flags]
  ollama [command]
​
Available Commands:
  serve Start ollama
  create Create a model from a Modelfile
  show Show information for a model
  run Run a model
  pull Pull a model from a registry
  push Push a model to a registry
  list List models
  ps List running models
  cp Copy a model
  rm Remove a model
  help Help about any command
​
Flags:
  -h, --help help for ollama
  -v, --version Show version information
​
Use "ollama [command] --help" for more information about a command.

ollama pull llama3:8b: Pull the model

ollama pull llama3:8b
pulling manifest 
pulling 6a0746a1ec1a... 100% ▕█████████████████████████████████████████████████████████████████████▏ 4.7 GB 
pulling 4fa551d4f938... 100% ▕█████████████████████████████████████████████████████████████████████▏ 12 KB 
pulling 8ab4849b038c... 100% ▕█████████████████████████████████████████████████████████████████████▏ 254 B 
pulling 577073ffcc6c... 100% ▕█████████████████████████████████████████████████████████████████████▏ 110 B 
pulling 3f8eb4da87fa... 100% ▕█████████████████████████████████████████████████████████████████████▏ 485 B 
verifying sha256 digest 
writing manifest 
removing any unused layers 
success

ollama list: List the models

developer:src > ollama show llama3:8b
  Model 
        arch llama 
        parameters 8.0B 
        quantization Q4_0 
        context length 8192 
        embedding length 4096 

  Parameters 
        num_keep 24 
        stop "<|start_header_id|>" 
        stop "<|end_header_id|>" 
        stop "<|eot_id|>" 

  License 
        META LLAMA 3 COMMUNITY LICENSE AGREEMENT 
        Meta Llama 3 Version Release Date: April 18, 2024

ollama run llama3:8b: Run the model

developer:src > ollama run llama3:8b
>>> print all primes between 1 and n
Here is a Python solution that prints all prime numbers between 1 and `n`:
​
```Python
def print_primes(n):
    for possiblePrime in range(2, n + 1):
        # Assume number is prime until shown it is not. 
        isPrime = True
        for num in range(2, int(possiblePrime ** 0.5) + 1):
            if possiblePrime % num == 0:
                isPrime = False
                break
        if isPrime:
            print(possiblePrime)
​
n = int(input("Enter the number: "))
print_primes(n)
```
​
In this code, we loop through all numbers from `2` to `n`. For each number, we assume it's prime and then check if it has any 
divisors other than `1` and itself. If it does, then it's not a prime number. If it doesn't have any divisors, then it is a 
prime number.
​
The reason why we only need to check up to the square root of the number is because a larger factor of the number would be a 
multiple of smaller factor that has already been checked.
​
Please note that this code might take some time for large values of `n` because it's not very efficient. There are more 
efficient algorithms to find prime numbers, but they are also more complex.

In the next post…

Hosting LLMs on a CPU takes more time because some Large Language model images are very big, slowing inference speed. So, in the next post let’s look into the solution to host these LLM on an EKS cluster using Nvidia Device Plugin and Time Slicing.

Questions of comments? Please leave me a comment below.

Share:

Related Articles

Latest Articles