---
title: NVIDIA Driver + vLLM Complete Setup Guide
date: 2026-06-28
type: guide
domain: homelab
tags: [nvidia, gpu, vllm, llm]
---


**Ubuntu 24.04 LTS | RTX 3090 | Qwen2.5-7B-Instruct**

Verified: 2026-02-17

---

## Verified Environment

| Component | Detail |
|---|---|
| GPU | NVIDIA GeForce RTX 3090 (GA102, Compute Capability 8.6, Ampere) |
| OS | Ubuntu 24.04.4 LTS, Linux 6.17.0-14-generic |
| NVIDIA Driver | nvidia-driver-590-open, 590.48.01 (CUDA 13.1) |
| Docker | 29.2.1 |
| NVIDIA Container Toolkit | 1.18.2 |
| vLLM Image | vllm/vllm-openai:latest (v0.15.1, ~29.5GB) |
| Model | Qwen/Qwen2.5-7B-Instruct (bf16, ~22GB VRAM in use, including vLLM's pre-allocated KV cache) |

---

## Prerequisites

The following must be installed before starting:

```bash
# Docker
sudo apt install docker.io
sudo usermod -aG docker $USER
# Log out and back in (or run `newgrp docker`) so the group change takes effect

# NVIDIA Container Toolkit (enables --gpus flag)
# https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt install nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```

---

## Part 1: NVIDIA Driver Installation

### Step 1. Check recommended driver

```bash
ubuntu-drivers devices
```

Look for the `recommended` line:

```
driver   : nvidia-driver-590-open - distro non-free recommended
```

### Step 2. Install driver

```bash
sudo apt update
sudo apt install nvidia-driver-590-open
```

Alternatively, let `ubuntu-drivers` pick and install the recommended driver automatically:

```bash
sudo ubuntu-drivers install
```

### Step 3. Remove old drivers (if any)

Multiple driver versions can conflict. Remove previous versions:

```bash
# Check installed drivers
dpkg -l | grep nvidia-driver

# Remove old versions (adjust package names as needed)
sudo apt remove --purge nvidia-driver-580-open nvidia-driver-575-open
sudo apt autoremove
```

### Step 4. Reboot

```bash
sudo reboot
```

### Step 5. Verify driver after reboot

```bash
nvidia-smi
```

Expected output:

```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 590.48.01              Driver Version: 590.48.01      CUDA Version: 13.1     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:06:00.0  On |                  N/A |
+-----------------------------------------+------------------------+----------------------+
```

If `nvidia-smi` fails, the driver did not load correctly. Check `dmesg | grep -i nvidia` for errors.
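
For scripted checks, the same verification can be done from Python. The sketch below is a minimal example, not part of the original setup: it assumes `nvidia-smi` is on `PATH` and uses its `--query-gpu` CSV mode. The helper names `parse_gpu_csv` and `query_gpus` are hypothetical.

```python
import subprocess

def parse_gpu_csv(text):
    """Parse `nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader` output."""
    gpus = []
    for line in text.strip().splitlines():
        name, driver, memory = [field.strip() for field in line.split(",")]
        gpus.append({"name": name, "driver": driver, "memory": memory})
    return gpus

def query_gpus():
    """Run nvidia-smi and return one dict per GPU.

    Raises if nvidia-smi is missing or exits with an error,
    which is what happens when the driver did not load.
    """
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,driver_version,memory.total",
         "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_gpu_csv(out)

# Sample line in the CSV format, using the values from the verified environment
sample = "NVIDIA GeForce RTX 3090, 590.48.01, 24576 MiB\n"
gpus = parse_gpu_csv(sample)
print(gpus[0]["driver"])  # 590.48.01
```

On the target machine, calling `query_gpus()` after the reboot should return a single entry whose `driver` field matches the installed package version.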

---

## Part 2: vLLM Server Deployment

### Step 6. Clean up old containers

```bash
docker rm -f vllm-qwen 2>/dev/null
```

### Step 7. Pull vLLM image (optional, docker run will pull automatically)

```bash
docker pull vllm/vllm-openai:latest
```

### Step 8. Start vLLM server

```bash
docker run -d --gpus all \
  -e LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:/usr/local/cuda/lib64 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --name vllm-qwen \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-7B-Instruct \
  --max-model-len 4096 \
  --enforce-eager
```

**Option breakdown:**

| Option | Purpose |
|---|---|
| `--gpus all` | GPU passthrough to container |
| `-e LD_LIBRARY_PATH=...` | **CRITICAL** - Bypasses CUDA compat library issue on GeForce GPUs (see Troubleshooting) |
| `-v ~/.cache/huggingface:...` | Share model cache with host (avoids re-downloading ~14GB) |
| `-p 8000:8000` | Expose OpenAI-compatible API |
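| `--max-model-len 4096` | Cap context length at 4096 tokens, reducing the KV cache each sequence can require on the 24GB card |
| `--enforce-eager` | Skip torch.compile and CUDA graph capture for fast startup (~1-2 min instead of 10+ min), at some cost to inference speed |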

### Step 9. Wait for server ready

```bash
docker logs -f vllm-qwen
```

Wait for this line (about 1-2 minutes once the model is cached; the first run also has to download the ~14GB of weights):

```
INFO:     Application startup complete.
```

Press `Ctrl+C` to stop following logs.
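
If you would rather not watch the logs, readiness can also be polled over HTTP. The following is a sketch under assumptions: it treats an HTTP 200 from the `/v1/models` endpoint (the one used in Step 12) as "ready", and the function names and default timeouts are illustrative choices, not values from vLLM.

```python
import time
import urllib.error
import urllib.request

def models_endpoint_ok(url="http://localhost:8000/v1/models"):
    """Return True if the server answers the model-list endpoint with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused or reset: the server is not listening yet
        return False

def wait_until_ready(probe, timeout=600.0, interval=5.0, sleep=time.sleep):
    """Call probe() every `interval` seconds until it returns True or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() + interval > deadline:
            return False
        sleep(interval)
```

A typical call would be `wait_until_ready(models_endpoint_ok)`, which returns `True` as soon as the API responds and `False` if it does not come up within the timeout. Passing the probe as a parameter keeps the polling loop testable without a running server.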

---

## Part 3: Verification

### Step 10. API test

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 32
  }'
```

Expected response (abridged; the actual response also includes fields such as `id`, `created`, and `usage`):

```json
{
  "model": "Qwen/Qwen2.5-7B-Instruct",
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "Hello! How can I assist you today?"
    },
    "finish_reason": "stop"
  }]
}
```

### Step 11. Check GPU memory usage

```bash
docker exec vllm-qwen nvidia-smi
```

Expected:

```
|   0  NVIDIA GeForce RTX 3090    |   22190MiB /  24576MiB |
|    0   N/A  N/A    131   C   VLLM::EngineCore    22112MiB |
```
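
The ~22GB figure is larger than the model weights alone. A plausible explanation is that vLLM reserves a fixed fraction of GPU memory up front and uses whatever the weights leave free for KV cache. The arithmetic below is a rough sketch under assumptions that do not come from this guide: a `--gpu-memory-utilization` default of 0.90, a parameter count of about 7.6 billion, and the layer and head counts from the model's published config.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

# Assumed values, not taken from this guide:
GPU_MEMORY_UTILIZATION = 0.90            # assumed vLLM default for --gpu-memory-utilization
PARAMS = 7.6e9                           # approximate Qwen2.5-7B parameter count
LAYERS, KV_HEADS, HEAD_DIM = 28, 4, 128  # assumed from the model's published config
BYTES_BF16 = 2

total_mib = 24576                                 # RTX 3090 total, as reported by nvidia-smi
budget_mib = total_mib * GPU_MEMORY_UTILIZATION   # memory vLLM is allowed to claim
weights_gib = PARAMS * BYTES_BF16 / GIB

# K and V each store KV_HEADS * HEAD_DIM values per layer per token
kv_bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_BF16
kv_mib_per_4096 = kv_bytes_per_token * 4096 / MIB

print(f"vLLM budget:        {budget_mib:.0f} MiB")
print(f"bf16 weights:       {weights_gib:.1f} GiB")
print(f"KV cache per token: {kv_bytes_per_token // 1024} KiB")
print(f"KV for 4096 tokens: {kv_mib_per_4096:.0f} MiB")
```

Under these assumptions the budget comes to about 22118 MiB, close to the 22112MiB shown for `VLLM::EngineCore`, so the reported usage reflects the reservation rather than what a single 4096-token request needs.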

### Step 12. Check model list

```bash
curl -s http://localhost:8000/v1/models | python3 -m json.tool
```

---

## Container Management

```bash
# Follow logs
docker logs -f vllm-qwen

# Stop server
docker stop vllm-qwen

# Restart stopped server
docker start vllm-qwen

# Remove container
docker stop vllm-qwen && docker rm vllm-qwen
```

---

## Quick Restart After Reboot

After a system reboot, run these commands to get the LLM server back up:

```bash
# 1. Verify driver loaded
nvidia-smi

# 2. Remove stale container and start fresh
docker rm -f vllm-qwen 2>/dev/null
docker run -d --gpus all \
  -e LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:/usr/local/cuda/lib64 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --name vllm-qwen \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-7B-Instruct \
  --max-model-len 4096 \
  --enforce-eager

# 3. Wait for startup (~1-2 min)
docker logs -f vllm-qwen
# Look for: "Application startup complete."
# Ctrl+C to exit logs

# 4. Test
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen/Qwen2.5-7B-Instruct","messages":[{"role":"user","content":"Hello"}],"max_tokens":32}'
```
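
Because the same `docker run` line appears in several places, it can help to generate it from one function. This is an illustrative sketch only; `build_vllm_command` is a hypothetical helper, and its defaults simply mirror the command used in this guide.

```python
import os
import shlex

def build_vllm_command(model="Qwen/Qwen2.5-7B-Instruct", host_port=8000,
                       max_model_len=4096, name="vllm-qwen"):
    """Return the `docker run` argument list used in this guide."""
    cache = os.path.expanduser("~/.cache/huggingface")
    return [
        "docker", "run", "-d", "--gpus", "all",
        # Put host driver libraries ahead of the image's compat directory
        "-e", "LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:/usr/local/cuda/lib64",
        "-v", f"{cache}:/root/.cache/huggingface",
        "-p", f"{host_port}:8000",
        "--name", name,
        "vllm/vllm-openai:latest",
        "--model", model,
        "--max-model-len", str(max_model_len),
        "--enforce-eager",
    ]

cmd = build_vllm_command()
print(shlex.join(cmd))  # shell-quoted form, handy for logging or copy-paste
```

To actually start the container, pass the list to `subprocess.run(cmd, check=True)` after removing any stale container with the same name. Keeping the arguments as a list avoids shell quoting problems.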

---

## Troubleshooting

### Error 803/804: forward compatibility failure

```
RuntimeError: Unexpected error from cudaGetDeviceCount().
Error 804: forward compatibility was attempted on non supported HW
Error 803: system has unsupported display driver / cuda driver combination
```

**Root cause:** The vLLM Docker image ships `/usr/local/cuda-XX.X/compat/libcuda.so`, NVIDIA's Forward Compatibility library. Inside the container, this compat library is found *before* the host driver's `libcuda.so`. Forward Compatibility is supported only on datacenter GPUs (Tesla, A100, H100, etc.); on GeForce/RTX consumer GPUs it fails with the errors above.

**Library loading order (the problem):**

```
Container compat:  /usr/local/cuda-13.0/compat/libcuda.so.580.82.07   ← loaded first (BROKEN on GeForce)
Host driver:       /usr/lib/x86_64-linux-gnu/libcuda.so.590.48.01      ← should be loaded instead
```

**Solution:** Set `LD_LIBRARY_PATH` to prioritize host driver libraries over the compat directory:

```
-e LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:/usr/local/cuda/lib64
```

This is already included in the `docker run` command above.
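
The effect of directory order can be illustrated with a simplified model of the lookup: the first directory on the path that contains the library wins. The sketch below is only an approximation of what the dynamic linker does (it ignores RPATH and the linker cache), and `first_dir_with` is a hypothetical helper. Temporary directories stand in for the compat and host locations.

```python
import os
import tempfile

def first_dir_with(libname, search_path):
    """Return the first directory in a colon-separated path that contains libname."""
    for directory in search_path.split(":"):
        if directory and os.path.exists(os.path.join(directory, libname)):
            return directory
    return None

# Throwaway directories standing in for the compat and host driver locations
with tempfile.TemporaryDirectory() as root:
    compat = os.path.join(root, "compat")
    host = os.path.join(root, "host")
    for d in (compat, host):
        os.makedirs(d)
        open(os.path.join(d, "libcuda.so.1"), "w").close()

    compat_first = first_dir_with("libcuda.so.1", f"{compat}:{host}")
    host_first = first_dir_with("libcuda.so.1", f"{host}:{compat}")
    print(os.path.basename(compat_first), os.path.basename(host_first))  # compat host
```

Listing `/usr/lib/x86_64-linux-gnu` first in `LD_LIBRARY_PATH` has the same effect as the second lookup: the host driver's library is found before the search reaches the compat copy.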

### Model loading takes 10+ minutes

Without `--enforce-eager`, vLLM runs torch.compile and CUDA graph capture on first startup. Add `--enforce-eager` to skip both, at the cost of somewhat lower inference performance.

### nvidia-smi works but torch.cuda fails inside container

This usually means the `LD_LIBRARY_PATH` environment variable is missing from the container. Ensure `-e LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:/usr/local/cuda/lib64` is in your `docker run` command.

### nvidia-smi fails after reboot

```bash
# Check if kernel module loaded
lsmod | grep nvidia

# Check for errors
dmesg | grep -i nvidia

# Reinstall driver if needed
sudo apt install --reinstall nvidia-driver-590-open
sudo reboot
```

### Port 8000 already in use

```bash
# Find what's using the port
sudo lsof -i :8000

# Or use a different port
docker run -d --gpus all ... -p 8001:8000 ...
# Then access via http://localhost:8001
```
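
When `lsof` is not installed, a quick probe from Python can tell whether a port already has a listener and suggest a free one. This is a minimal sketch; the helper names and the 8000-8100 search range are arbitrary choices for illustration.

```python
import socket

def port_in_use(port, host="127.0.0.1"):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising
        return sock.connect_ex((host, port)) == 0

def first_free_port(start=8000, end=8100):
    """Return the first port in [start, end) with no listener, or None."""
    for port in range(start, end):
        if not port_in_use(port):
            return port
    return None
```

For example, if `port_in_use(8000)` returns `True`, the result of `first_free_port(8001)` can be used as the host side of the `-p` mapping.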

---

## Driver Version Reference

| Package | Version | CUDA | Notes |
|---|---|---|---|
| nvidia-driver-535-open | 535.x | 12.2 | LTS |
| nvidia-driver-550-open | 550.x | 12.4 | Stable |
| nvidia-driver-570-open | 570.211.01 | 12.8 | Stable |
| nvidia-driver-580-open | 580.126.09 | 13.0 | - |
| **nvidia-driver-590-open** | **590.48.01** | **13.1** | **recommended** |

RTX 3090 (Ampere, CC 8.6) supports CUDA 11.1+. All drivers above are compatible.

---

## API Usage Examples

### Chat completion

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain Docker in one sentence."}
    ],
    "max_tokens": 128,
    "temperature": 0.7
  }'
```

### Streaming

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "Tell me a joke"}],
    "max_tokens": 128,
    "stream": true
  }'
```
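
With `"stream": true`, the response arrives as server-sent events: one `data: {...}` line per chunk, ending with `data: [DONE]`. The sketch below shows one way to reassemble the text from those lines. It assumes the OpenAI-style chunk layout (`choices[].delta.content`); the sample lines are made up for illustration and are not captured server output.

```python
import json

def iter_stream_text(lines):
    """Yield text fragments from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue                      # skip blank separators and non-data lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break                         # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text

# Illustrative sample lines (not real server output)
sample = [
    'data: {"choices":[{"delta":{"role":"assistant","content":""}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"Why did"}}]}',
    'data: {"choices":[{"delta":{"content":" the chicken..."}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))  # Why did the chicken...
```

The same generator works on any iterable of decoded lines, such as the lines read from an HTTP response body.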

### Python client (OpenAI SDK compatible)

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```