guide | homelab | 2026-06-28 | nvidia, gpu, vllm, llm

NVIDIA Driver + vLLM Complete Setup Guide

Ubuntu 24.04 LTS | RTX 3090 | Qwen2.5-7B-Instruct

Verified: 2026-02-17

Verified Environment

Component	Detail
GPU	NVIDIA GeForce RTX 3090 (GA102, Compute Capability 8.6, Ampere)
OS	Ubuntu 24.04.4 LTS, Linux 6.17.0-14-generic
NVIDIA Driver	nvidia-driver-590-open, 590.48.01 (CUDA 13.1)
Docker	29.2.1
NVIDIA Container Toolkit	1.18.2
vLLM Image	vllm/vllm-openai:latest (v0.15.1, ~29.5GB)
Model	Qwen/Qwen2.5-7B-Instruct (bf16; ~14GB weights, ~22GB VRAM in use once vLLM preallocates its KV cache)

Prerequisites

The following must be installed before starting:

# Docker
sudo apt install docker.io
sudo usermod -aG docker $USER
# Log out and back in (or run: newgrp docker) so the group change takes effect

# NVIDIA Container Toolkit (enables --gpus flag)
# https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt install nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
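
Before moving on to the driver, it can help to confirm that the tools installed above are actually on PATH. The following is a minimal Python sketch, not part of the original setup; the helper name `missing_tools` and the `REQUIRED_TOOLS` checklist are hypothetical and only cover the two CLIs installed in this section.

```python
import shutil
from typing import Callable, Iterable, List, Optional

# Hypothetical checklist: the two CLIs installed by the commands above.
REQUIRED_TOOLS = ("docker", "nvidia-ctk")


def missing_tools(
    tools: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]
```

Calling `missing_tools(REQUIRED_TOOLS)` should return an empty list once both packages are installed. The `which` parameter exists only so the check can be exercised without touching the real PATH.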

Part 1: NVIDIA Driver Installation

Step 1. Check recommended driver

ubuntu-drivers devices

Look for the recommended line:

driver   : nvidia-driver-590-open - distro non-free recommended

Step 2. Install driver

sudo apt update
sudo apt install nvidia-driver-590-open

Alternative (auto-selects recommended):

sudo ubuntu-drivers install

Step 3. Remove old drivers (if any)

Multiple driver versions can conflict. Remove previous versions:

# Check installed drivers
dpkg -l | grep nvidia-driver

# Remove old versions (adjust package names as needed)
sudo apt remove --purge nvidia-driver-580-open nvidia-driver-575-open
sudo apt autoremove

Step 4. Reboot

sudo reboot

Step 5. Verify driver after reboot

nvidia-smi

Expected output:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 590.48.01              Driver Version: 590.48.01      CUDA Version: 13.1     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:06:00.0  On |                  N/A |
+-----------------------------------------+------------------------+----------------------+

If nvidia-smi fails, the driver did not load correctly. Check dmesg | grep -i nvidia for errors.
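
For scripted checks, the same information can be read in machine-friendly form. The sketch below assumes the `--query-gpu` and `--format=csv,noheader` options of nvidia-smi; the function names are illustrative, and `query_gpus` only works on a host where the driver has loaded.

```python
import subprocess
from typing import Dict, List

# Fields requested from nvidia-smi, in the order they are parsed below.
QUERY_FIELDS = "name,driver_version,memory.total"


def parse_gpu_line(line: str) -> Dict[str, str]:
    """Parse one CSV line such as 'NVIDIA GeForce RTX 3090, 590.48.01, 24576 MiB'."""
    name, driver, memory = [field.strip() for field in line.split(",")]
    return {"name": name, "driver_version": driver, "memory_total": memory}


def query_gpus() -> List[Dict[str, str]]:
    """Run nvidia-smi and return one dict per GPU (raises if the command fails)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    )
    return [parse_gpu_line(line) for line in result.stdout.splitlines() if line.strip()]
```

On the verified machine, `query_gpus()[0]["driver_version"]` would be expected to match the 590.48.01 shown in the table above.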

Part 2: vLLM Server Deployment

Step 6. Clean up old containers

docker rm -f vllm-qwen 2>/dev/null

Step 7. Pull vLLM image (optional, docker run will pull automatically)

docker pull vllm/vllm-openai:latest

Step 8. Start vLLM server

docker run -d --gpus all \
  -e LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:/usr/local/cuda/lib64 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --name vllm-qwen \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-7B-Instruct \
  --max-model-len 4096 \
  --enforce-eager

Option breakdown:

Option	Purpose
`--gpus all`	GPU passthrough to container
`-e LD_LIBRARY_PATH=...`	CRITICAL - Bypasses CUDA compat library issue on GeForce GPUs (see Troubleshooting)
`-v ~/.cache/huggingface:...`	Share model cache with host (avoids re-downloading ~14GB)
`-p 8000:8000`	Expose OpenAI-compatible API
`--enforce-eager`	Skip torch.compile/CUDAGraph for fast startup (~1-2 min instead of 10+ min)
`--max-model-len 4096`	Limit context length to fit in 24GB VRAM

Step 9. Wait for server ready

docker logs -f vllm-qwen

Wait for this line (takes ~1-2 minutes):

INFO:     Application startup complete.

Press Ctrl+C to stop following logs.
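
Instead of watching the logs by hand, a script can poll the API until it answers. This is a rough sketch that assumes the /v1/models endpoint (used later in this guide) returns HTTP 200 once startup has finished; the timeout and interval values are arbitrary choices, not vLLM defaults.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str) -> bool:
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused or reset: the server is not up yet.
        return False


def wait_until_ready(
    url: str = "http://localhost:8000/v1/models",
    timeout_s: float = 300.0,
    interval_s: float = 5.0,
    probe: Callable[[str], bool] = http_ok,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `url` until `probe` succeeds or `timeout_s` seconds have passed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

`wait_until_ready()` returns True as soon as the server responds and False if the deadline passes first, which makes it easy to chain with the API test in the next step.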

Part 3: Verification

Step 10. API test

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 32
  }'

Expected response:

{
  "model": "Qwen/Qwen2.5-7B-Instruct",
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "Hello! How can I assist you today?"
    },
    "finish_reason": "stop"
  }]
}
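
The same test can be run from Python with only the standard library. This is a minimal sketch that mirrors the curl request above; `build_chat_request`, `extract_reply`, and `chat` are illustrative helper names, and the base URL assumes the port mapping from Step 8.

```python
import json
import urllib.request


def build_chat_request(prompt: str, model: str = "Qwen/Qwen2.5-7B-Instruct",
                       max_tokens: int = 32) -> dict:
    """Build the same JSON body that the curl command sends."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def extract_reply(response: dict) -> str:
    """Return the assistant text from a chat completion response."""
    choices = response.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0]["message"]["content"]


def chat(prompt: str, base_url: str = "http://localhost:8000/v1") -> str:
    """POST a single-turn chat request and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=60) as resp:
        return extract_reply(json.load(resp))
```

With the server running, `print(chat("Hello"))` should print a short greeting similar to the expected response above; the exact wording varies between runs.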

Step 11. Check GPU memory usage

docker exec vllm-qwen nvidia-smi

Expected:

|   0  NVIDIA GeForce RTX 3090    |   22190MiB /  24576MiB |
|    0   N/A  N/A    131   C   VLLM::EngineCore    22112MiB |
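
The ~22 GiB figure is larger than the model weights because vLLM reserves a fixed fraction of GPU memory up front and uses what is left after loading weights for the KV cache. The arithmetic below is a back-of-the-envelope sketch under two assumptions not stated in this guide: the default `--gpu-memory-utilization` of 0.9, and the 7.61B parameter count listed on the Qwen2.5-7B model card.

```python
def vllm_budget_mib(total_vram_mib: float, gpu_memory_utilization: float = 0.9) -> float:
    """Memory vLLM is allowed to claim, assuming the default utilization of 0.9."""
    return total_vram_mib * gpu_memory_utilization


def weights_mib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight size in MiB; bf16 stores 2 bytes per parameter."""
    return num_params * bytes_per_param / 2**20


budget = vllm_budget_mib(24576)   # RTX 3090: 24576 MiB total
weights = weights_mib(7.61e9)     # assumed parameter count from the model card
print(f"vLLM budget:        {budget:.0f} MiB")
print(f"bf16 weights:       {weights:.0f} MiB")
print(f"left for KV cache and activations: {budget - weights:.0f} MiB")
```

The budget works out to roughly 22100 MiB, close to the process size reported by nvidia-smi, and the weights to roughly 14 GiB, in line with the download size mentioned in the option table.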

Step 12. Check model list

curl -s http://localhost:8000/v1/models | python3 -m json.tool

Container Management

# Follow logs
docker logs -f vllm-qwen

# Stop server
docker stop vllm-qwen

# Restart stopped server
docker start vllm-qwen

# Remove container
docker stop vllm-qwen && docker rm vllm-qwen

Quick Restart After Reboot

After a system reboot, run these commands to get the LLM server back up:

# 1. Verify driver loaded
nvidia-smi

# 2. Remove stale container and start fresh
docker rm -f vllm-qwen 2>/dev/null
docker run -d --gpus all \
  -e LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:/usr/local/cuda/lib64 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --name vllm-qwen \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-7B-Instruct \
  --max-model-len 4096 \
  --enforce-eager

# 3. Wait for startup (~1-2 min)
docker logs -f vllm-qwen
# Look for: "Application startup complete."
# Ctrl+C to exit logs

# 4. Test
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen/Qwen2.5-7B-Instruct","messages":[{"role":"user","content":"Hello"}],"max_tokens":32}'
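
If the restart is scripted, building the argument list in one place avoids the two docker run commands in this guide drifting apart. This is a sketch only; `build_vllm_run_args` is a hypothetical helper whose defaults reproduce the command above.

```python
import os
from typing import List


def build_vllm_run_args(
    model: str = "Qwen/Qwen2.5-7B-Instruct",
    host_port: int = 8000,
    max_model_len: int = 4096,
    name: str = "vllm-qwen",
    image: str = "vllm/vllm-openai:latest",
) -> List[str]:
    """Return the docker run command from this guide as an argument list."""
    cache_dir = os.path.expanduser("~/.cache/huggingface")
    return [
        "docker", "run", "-d", "--gpus", "all",
        # Put the host driver libraries ahead of the image's compat directory.
        "-e", "LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:/usr/local/cuda/lib64",
        "-v", f"{cache_dir}:/root/.cache/huggingface",
        "-p", f"{host_port}:8000",
        "--name", name,
        image,
        "--model", model,
        "--max-model-len", str(max_model_len),
        "--enforce-eager",
    ]
```

The list can be passed to `subprocess.run(build_vllm_run_args(), check=True)` after the stale container has been removed.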

Troubleshooting

Error 803/804: forward compatibility failure

RuntimeError: Unexpected error from cudaGetDeviceCount().
Error 804: forward compatibility was attempted on non supported HW
Error 803: system has unsupported display driver / cuda driver combination

Root cause: The vLLM Docker image includes /usr/local/cuda-XX.X/compat/libcuda.so (the NVIDIA Forward Compatibility library). Inside the container, this compat library is found before the host driver library that the NVIDIA Container Toolkit mounts in. Forward Compatibility is only supported on datacenter GPUs (Tesla, A100, H100, etc.), so loading the compat library fails on GeForce/RTX consumer GPUs with the errors above.

Library loading order (the problem):

Container compat:  /usr/local/cuda-13.0/compat/libcuda.so.580.82.07   ← loaded first (BROKEN on GeForce)
Host driver:       /usr/lib/x86_64-linux-gnu/libcuda.so.590.48.01      ← should be loaded instead

Solution: Set LD_LIBRARY_PATH to prioritize host driver libraries over the compat directory:

-e LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:/usr/local/cuda/lib64

This is already included in the docker run command above.
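
Why the variable helps can be illustrated with a simplified model of the dynamic linker's search order: directories in LD_LIBRARY_PATH are consulted before the system-wide defaults. The sketch below is a toy model, not the real loader; it ignores RPATH/RUNPATH and the linker cache format, and it assumes the image's compat directory otherwise comes first among the defaults.

```python
import os
from typing import Callable, Iterable, Optional


def resolve_library(
    name: str,
    ld_library_path: str,
    default_dirs: Iterable[str],
    exists: Callable[[str], bool] = os.path.exists,
) -> Optional[str]:
    """Return the first path where `name` is found, searching LD_LIBRARY_PATH first."""
    env_dirs = [d for d in ld_library_path.split(":") if d]
    for directory in [*env_dirs, *default_dirs]:
        candidate = os.path.join(directory, name)
        if exists(candidate):
            return candidate
    return None
```

With an empty LD_LIBRARY_PATH the compat copy wins in this model; once /usr/lib/x86_64-linux-gnu is listed first, the host driver's library is found before it.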

Model loading takes 10+ minutes

Without --enforce-eager, vLLM runs torch.compile and CUDAGraph capture on first startup. Add --enforce-eager to skip both, at the cost of somewhat lower inference performance.

nvidia-smi works but torch.cuda fails inside container

This usually means the LD_LIBRARY_PATH environment variable is missing. Ensure -e LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:/usr/local/cuda/lib64 is in your docker run command.

nvidia-smi fails after reboot

# Check if kernel module loaded
lsmod | grep nvidia

# Check for errors
dmesg | grep -i nvidia

# Reinstall driver if needed
sudo apt install --reinstall nvidia-driver-590-open
sudo reboot

Port 8000 already in use

# Find what's using the port
sudo lsof -i :8000

# Or use a different port
docker run -d --gpus all ... -p 8001:8000 ...
# Then access via http://localhost:8001
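
Picking an alternative port can also be automated. This is a small sketch using only the standard library; it treats a port as busy when something accepts a TCP connection on it, which is an approximation, and both helper names are hypothetical.

```python
import socket
from typing import Callable


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        return sock.connect_ex((host, port)) == 0


def first_free_port(
    start: int = 8000,
    attempts: int = 10,
    in_use: Callable[[int], bool] = port_in_use,
) -> int:
    """Return the first port in [start, start + attempts) that is not in use."""
    for port in range(start, start + attempts):
        if not in_use(port):
            return port
    raise RuntimeError(f"no free port found in {start}..{start + attempts - 1}")
```

The returned value can be used as the host side of the `-p` mapping, for example `-p 8001:8000` when `first_free_port()` returns 8001.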

Driver Version Reference

Package	Version	CUDA	Notes
nvidia-driver-535-open	535.x	12.2	LTS
nvidia-driver-550-open	550.x	12.4	Stable
nvidia-driver-570-open	570.211.01	12.8	Stable
nvidia-driver-580-open	580.126.09	13.0	-
nvidia-driver-590-open	590.48.01	13.1	recommended

RTX 3090 (Ampere, CC 8.6) supports CUDA 11.1+. All drivers above are compatible.

API Usage Examples

Chat completion

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain Docker in one sentence."}
    ],
    "max_tokens": 128,
    "temperature": 0.7
  }'

Streaming

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "Tell me a joke"}],
    "max_tokens": 128,
    "stream": true
  }'
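
With "stream": true the server answers with server-sent events: each line starts with `data: ` followed by a JSON chunk, and the stream ends with `data: [DONE]`. The sketch below shows one way to reassemble the text from such lines, assuming the OpenAI-style chunk layout (`choices[0].delta.content`); `iter_stream_text` is an illustrative name.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text fragments carried by OpenAI-style streaming lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text
```

Joining the yielded fragments with `"".join(...)` gives the full reply; printing each fragment as it arrives gives the familiar token-by-token effect.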

Python client (OpenAI SDK compatible)

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
