GPU and Special Hardware in Kubernetes

The Story of the Super Workshop 🏭

Imagine you have a giant workshop with many worker tables (these are your Kubernetes nodes). Most tables are good for regular work: cutting paper, writing, drawing. But sometimes you need special tools: a super-powerful laser cutter, a 3D printer, or a microscope.

These special tools are like GPUs and special hardware in Kubernetes. Not every table has them, and you need a smart way to:

  1. Tell Kubernetes which tables have special tools (Node Feature Discovery)
  2. Let your projects use those tools properly (Device Plugins)

Let's explore this magical workshop!


What is a GPU? 🎮

A GPU (Graphics Processing Unit) is like a super-brain that's really good at doing many small tasks at once.

Simple Example:

  • Your regular brain (CPU): Solves one hard math problem at a time
  • GPU brain: Solves 1000 easy math problems ALL AT ONCE!

Why Do We Need GPUs?

Task              CPU (Regular Brain)     GPU (Super Brain)
Training AI       🐢 Slow (days)          🚀 Fast (hours)
Video editing     😴 Sluggish             ⚡ Smooth
Scientific math   📚 One by one           🎆 Thousands together

Device Plugins: The Tool Librarians 📚

The Problem

Kubernetes is smart, but it doesn't automatically know about special hardware. It's like having a librarian who knows about books but not about the 3D printer in the corner.

The Solution: Device Plugins!

A Device Plugin is like a special helper that tells Kubernetes:

"Hey! This node has 2 GPUs ready to use!"

graph TD
  A["🖥️ Node with GPU"] --> B["Device Plugin"]
  B --> C["📢 Tells Kubernetes"]
  C --> D["✅ GPU Available!"]
  D --> E["🚀 Pods Can Use GPU"]

How Device Plugins Work

Step 1: Discovery. The device plugin finds all the GPUs on the node.

Step 2: Registration. It tells the kubelet: "I manage GPUs!"

Step 3: Allocation. When a pod asks for a GPU, the plugin gives it one.
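You can see the result of steps 1 and 2 on the Node object itself. A quick check, assuming an NVIDIA plugin and a placeholder node name of your own:

kubectl get node <node-name> -o jsonpath='{.status.allocatable}'
# After registration, the output includes an entry like "nvidia.com/gpu":"2"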

Real Example: NVIDIA Device Plugin

This is the most popular GPU plugin. It lets your pods use NVIDIA graphics cards.

# Installing NVIDIA device plugin (simplified example)
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin
spec:
  selector:
    matchLabels:
      name: nvidia-plugin
  template:
    metadata:
      labels:
        name: nvidia-plugin   # must match the selector above
    spec:
      containers:
      - name: nvidia-plugin
        image: nvidia/k8s-device-plugin
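After applying a manifest like this, it is worth confirming the plugin pods are actually running on every GPU node. A minimal check, assuming the label and (default) namespace used above:

kubectl get pods -l name=nvidia-plugin -o wide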

Requesting a GPU in Your Pod

Once the plugin is running, asking for a GPU is easy!

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: cuda-app
    image: my-ml-app
    resources:
      limits:
        nvidia.com/gpu: 1

🎯 Key Point: The nvidia.com/gpu: 1 line is like saying "I need 1 special tool from the GPU shelf!"
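One detail worth knowing: nvidia.com/gpu is an extended resource, so you typically set it only under limits (the request defaults to the same value), and GPUs are not overcommitted or shared between pods by default. You can confirm the allocation on the running pod (gpu-pod is the name from the example above):

kubectl describe pod gpu-pod | grep -A 2 Limits
# Expect nvidia.com/gpu: 1 under both Limits and Requests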


Node Feature Discovery: The Detective 🔍

The Problem

Your cluster has 100 nodes. Some have GPUs. Some have fast SSDs. Some have special Intel features. How does Kubernetes know what each node can do?

Enter: Node Feature Discovery (NFD)!

NFD is like a detective that visits every node and creates a detailed report of its special abilities.

graph TD
  A["🔍 NFD Visits Node"] --> B["Checks Hardware"]
  B --> C["Finds: GPU ✅"]
  B --> D["Finds: Fast SSD ✅"]
  B --> E["Finds: Intel AVX ✅"]
  C --> F["🏷️ Adds Labels to Node"]
  D --> F
  E --> F
  F --> G["Scheduler Knows Everything!"]

What NFD Discovers

Category   Examples
CPU        Intel, AMD, number of cores, special instructions
Memory     How much RAM, memory speed
Storage    SSD, NVMe, rotational drives
Network    Speed, SR-IOV capability
GPU        NVIDIA, AMD, model, memory
Custom     Your own special features!

NFD Labels: The Name Tags

After NFD runs, your nodes get labels like name tags:

feature.node.kubernetes.io/cpu-cpuid.AVX512F=true
feature.node.kubernetes.io/pci-1234.present=true
feature.node.kubernetes.io/system-os_release.ID=ubuntu
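These labels are ordinary Kubernetes node labels, so kubectl can filter on them directly. A small sketch, reusing the AVX-512 label from the list above (your nodes may expose different features):

# List only the nodes that NFD labeled with AVX-512 support
kubectl get nodes -l feature.node.kubernetes.io/cpu-cpuid.AVX512F=true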

Installing Node Feature Discovery

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nfd-worker
  namespace: node-feature-discovery
spec:
  selector:
    matchLabels:
      app: nfd-worker
  template:
    metadata:
      labels:
        app: nfd-worker   # must match the selector above
    spec:
      containers:
      - name: nfd-worker
        image: registry.k8s.io/nfd/node-feature-discovery:v0.14.0
        command: ["nfd-worker"]   # the image ships several binaries
        args:
          - "-feature-sources=all"

Using NFD Labels for Scheduling

Now you can tell Kubernetes: "Run this pod ONLY on nodes with GPUs!"

apiVersion: v1
kind: Pod
metadata:
  name: ml-training
spec:
  nodeSelector:
    feature.node.kubernetes.io/pci-10de.present: "true"
  containers:
  - name: trainer
    image: my-ml-trainer

🧠 Fun Fact: 10de is NVIDIA's PCI vendor ID. NFD found it automatically!
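nodeSelector is the simplest way to target a label, but the same labels also work with node affinity when you need richer rules (for example, matching any one of several features). A sketch reusing the GPU label above; the pod name ml-training-affinity is just illustrative:

apiVersion: v1
kind: Pod
metadata:
  name: ml-training-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: feature.node.kubernetes.io/pci-10de.present
            operator: In
            values: ["true"]
  containers:
  - name: trainer
    image: my-ml-trainer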


How They Work Together 🤝

Device Plugins and NFD are best friends!

graph TD
  A["Node Feature Discovery"] -->|Finds GPUs| B["Adds Labels"]
  B --> C["Scheduler Sees Labels"]
  D["Device Plugin"] -->|Registers GPUs| E["Kubelet Knows Count"]
  E --> F["Pods Can Request GPUs"]
  C --> G["Smart Scheduling!"]
  F --> G

The Complete Flow

  1. NFD scans the node and adds labels
  2. Device Plugin tells kubelet about GPU count
  3. You write a pod asking for GPU
  4. Scheduler finds nodes with GPU label
  5. Kubelet allocates an actual GPU to your pod
  6. Your pod runs with GPU power! 🚀
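You can watch this flow from the outside. One way to see how many GPUs each node advertises (assuming the NVIDIA resource name nvidia.com/gpu; the dots in the resource name must be escaped):

kubectl get nodes -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'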

Common Device Plugins 🔌

Plugin   Hardware   What It Does
NVIDIA   GPU        Exposes NVIDIA graphics cards
AMD      GPU        Exposes AMD graphics cards
Intel    GPU/FPGA   Intel accelerators
SR-IOV   Network    Fast network cards
RDMA     Network    Ultra-fast networking

Practice Example: ML Training Setup

Let's set up a cluster for machine learning!

Step 1: Install NFD

kubectl apply -k "https://github.com/kubernetes-sigs/node-feature-discovery/deployment/overlays/default?ref=v0.14.0"

Step 2: Install NVIDIA Device Plugin

kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml

Step 3: Check Your Nodes

kubectl get nodes -o json | jq '.items[].metadata.labels' | grep feature
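To narrow that down to GPU nodes only, you can filter on the same label used in the pod specs here (the exact label name depends on how NFD is configured):

kubectl get nodes -l feature.node.kubernetes.io/pci-10de.present=true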

Step 4: Run Your ML Pod

apiVersion: v1
kind: Pod
metadata:
  name: pytorch-training
spec:
  nodeSelector:
    feature.node.kubernetes.io/pci-10de.present: "true"
  containers:
  - name: pytorch
    image: pytorch/pytorch:latest
    resources:
      limits:
        nvidia.com/gpu: 2
    command: ["python", "train.py"]

Key Takeaways 🎯

  1. Device Plugins = Librarians that manage special hardware
  2. NFD = Detective that discovers what each node can do
  3. Labels = Name tags that help scheduling
  4. Together = Smart placement of GPU workloads!

Remember This Analogy:

๐Ÿญ Workshop = Cluster ๐Ÿช‘ Tables = Nodes ๐Ÿ”ง Special Tools = GPUs/Hardware ๐Ÿ“‹ Tool Inventory = Device Plugin ๐Ÿ” Inspector = Node Feature Discovery ๐Ÿท๏ธ Labels = What tools each table has


Quick Reference

Request 1 GPU:

resources:
  limits:
    nvidia.com/gpu: 1

Target GPU Nodes:

nodeSelector:
  feature.node.kubernetes.io/pci-10de.present: "true"

Check Available GPUs:

kubectl describe node <node-name> | grep nvidia

You now understand how Kubernetes finds and uses special hardware! 🎉

The detective (NFD) discovers what's special about each node, and the librarian (Device Plugin) makes sure your pods can use those special tools. Together, they make GPU workloads on Kubernetes magical! ✨
