NVIDIA Driver + vLLM Complete Setup Guide
Ubuntu 24.04 LTS | RTX 3090 | Qwen2.5-7B-Instruct
Verified: 2026-02-17
Verified Environment
| Component | Detail |
|---|---|
| GPU | NVIDIA GeForce RTX 3090 (GA102, Compute Capability 8.6, Ampere) |
| OS | Ubuntu 24.04.4 LTS, Linux 6.17.0-14-generic |
| NVIDIA Driver | nvidia-driver-590-open, 590.48.01 (CUDA 13.1) |
| Docker | 29.2.1 |
| NVIDIA Container Toolkit | 1.18.2 |
| vLLM Image | vllm/vllm-openai:latest (v0.15.1, ~29.5GB) |
| Model | Qwen/Qwen2.5-7B-Instruct (bf16, ~22GB VRAM in use, including the KV cache that vLLM preallocates) |
Prerequisites
The following must be installed before starting:
# Docker
sudo apt install docker.io
sudo usermod -aG docker $USER
# Log out and back in (or run: newgrp docker) so the group change takes effect
# NVIDIA Container Toolkit (enables --gpus flag)
# https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt install nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Part 1: NVIDIA Driver Installation
Step 1. Check recommended driver
ubuntu-drivers devices
Look for the recommended line:
driver : nvidia-driver-590-open - distro non-free recommended
Step 2. Install driver
sudo apt update
sudo apt install nvidia-driver-590-open
Alternative (auto-selects recommended):
sudo ubuntu-drivers install
Step 3. Remove old drivers (if any)
Multiple driver versions can conflict. Remove previous versions:
# Check installed drivers
dpkg -l | grep nvidia-driver
# Remove old versions (adjust package names as needed)
sudo apt remove --purge nvidia-driver-580-open nvidia-driver-575-open
sudo apt autoremove
Step 4. Reboot
sudo reboot
Step 5. Verify driver after reboot
nvidia-smi
Expected output:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 590.48.01 Driver Version: 590.48.01 CUDA Version: 13.1 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3090 Off | 00000000:06:00.0 On | N/A |
+-----------------------------------------+------------------------+----------------------+
If nvidia-smi fails, the driver did not load correctly. Check sudo dmesg | grep -i nvidia for errors (reading the kernel log on Ubuntu normally requires root).
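The same check can be scripted. The following is a minimal sketch, assuming nvidia-smi is on the PATH; the helper names parse_gpu_csv and query_gpus are hypothetical and only for illustration. It uses nvidia-smi's --query-gpu CSV output rather than parsing the table shown above.

```python
import subprocess


def parse_gpu_csv(line: str) -> dict:
    """Parse one CSV line of `name, driver_version, memory.total` into a dict."""
    # rsplit keeps the name intact even if it ever contains a comma.
    name, driver, memory = (field.strip() for field in line.rsplit(",", 2))
    return {"name": name, "driver": driver, "memory_mib": int(memory)}


def query_gpus() -> list[dict]:
    """Run nvidia-smi and return one dict per GPU.

    Raises FileNotFoundError if nvidia-smi is missing and
    subprocess.CalledProcessError if the driver is not loaded.
    """
    out = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=name,driver_version,memory.total",
            "--format=csv,noheader,nounits",
        ],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return [parse_gpu_csv(line) for line in out.splitlines() if line.strip()]
```

Calling query_gpus() on the verified machine should return a single entry for the RTX 3090. A raised exception corresponds to the failure case described above.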
Part 2: vLLM Server Deployment
Step 6. Clean up old containers
docker rm -f vllm-qwen 2>/dev/null
Step 7. Pull vLLM image (optional; docker run pulls it automatically if missing)
docker pull vllm/vllm-openai:latest
Step 8. Start vLLM server
docker run -d --gpus all \
-e LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:/usr/local/cuda/lib64 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--name vllm-qwen \
vllm/vllm-openai:latest \
--model Qwen/Qwen2.5-7B-Instruct \
--max-model-len 4096 \
--enforce-eager
Option breakdown:
| Option | Purpose |
|---|---|
| --gpus all | GPU passthrough to container |
| -e LD_LIBRARY_PATH=... | CRITICAL: bypasses the CUDA compat library issue on GeForce GPUs (see Troubleshooting) |
| -v ~/.cache/huggingface:... | Share model cache with host (avoids re-downloading ~14GB) |
| -p 8000:8000 | Expose OpenAI-compatible API |
| --enforce-eager | Skip torch.compile/CUDAGraph for fast startup (~1-2 min instead of 10+ min) |
| --max-model-len 4096 | Limit context length to fit in 24GB VRAM |
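To get a feel for how --max-model-len relates to memory, here is a rough back-of-the-envelope sketch of KV cache size. The architecture values (28 layers, 4 KV heads, head size 128) are assumptions based on the published Qwen2.5-7B-Instruct config and should be checked against the model's config.json. The function name is hypothetical.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache needed for one token (bf16 = 2 bytes per value)."""
    # One key vector and one value vector per layer and per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# ASSUMED model shape; verify against config.json before relying on it.
per_token = kv_cache_bytes_per_token(num_layers=28, num_kv_heads=4, head_dim=128)
per_sequence_mib = per_token * 4096 / 2**20

print(f"KV cache per token: {per_token / 1024:.0f} KiB")                    # 56 KiB
print(f"KV cache per 4096-token sequence: {per_sequence_mib:.0f} MiB")      # 224 MiB
```

Under these assumptions a single full-length sequence is small compared with the weights, so the context limit mostly determines how many concurrent sequences fit in the memory left over after the model is loaded.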
Step 9. Wait for server ready
docker logs -f vllm-qwen
Wait for this line (takes ~1-2 minutes):
INFO: Application startup complete.
Press Ctrl+C to stop following logs.
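Instead of watching the logs, readiness can also be polled over HTTP. The following is a minimal sketch using only the standard library. It probes the /v1/models endpoint used later in this guide. The function names and the attempt/delay defaults are hypothetical choices, not vLLM settings.

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET request to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error: not ready yet.
        return False


def wait_until_ready(probe, attempts: int = 60, delay: float = 5.0,
                     sleep=time.sleep) -> bool:
    """Call probe() until it returns True or the attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

A typical call would be wait_until_ready(lambda: http_ok("http://localhost:8000/v1/models")), which gives up after roughly five minutes with the defaults shown.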
Part 3: Verification
Step 10. API test
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-7B-Instruct",
"messages": [{"role": "user", "content": "Hello"}],
"max_tokens": 32
}'
Expected response:
{
"model": "Qwen/Qwen2.5-7B-Instruct",
"choices": [{
"message": {
"role": "assistant",
"content": "Hello! How can I assist you today?"
},
"finish_reason": "stop"
}]
}
Step 11. Check GPU memory usage
docker exec vllm-qwen nvidia-smi
Expected:
| 0 NVIDIA GeForce RTX 3090 | 22190MiB / 24576MiB |
| 0 N/A N/A 131 C VLLM::EngineCore 22112MiB |
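The used/total figure is easier to interpret as a fraction. A small sketch follows; the helper name is hypothetical. It is based on the assumption that vLLM reserves a fixed share of GPU memory up front through its --gpu-memory-utilization option, whose default is believed to be 0.9. Check this against the vLLM version you run.

```python
import re


def memory_fraction(usage: str) -> float:
    """Turn a 'usedMiB / totalMiB' string from nvidia-smi into a used fraction."""
    match = re.search(r"(\d+)\s*MiB\s*/\s*(\d+)\s*MiB", usage)
    if match is None:
        raise ValueError(f"no memory usage found in {usage!r}")
    used, total = map(int, match.groups())
    return used / total


print(f"{memory_fraction('22190MiB / 24576MiB'):.1%}")  # 90.3%
```

A value close to 90% therefore reflects the preallocated weights plus KV cache rather than memory the model strictly needs.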
Step 12. Check model list
curl -s http://localhost:8000/v1/models | python3 -m json.tool
Container Management
# Follow logs
docker logs -f vllm-qwen
# Stop server
docker stop vllm-qwen
# Restart stopped server
docker start vllm-qwen
# Remove container
docker stop vllm-qwen && docker rm vllm-qwen
Quick Restart After Reboot
After a system reboot, run these commands to get the LLM server back up:
# 1. Verify driver loaded
nvidia-smi
# 2. Remove stale container and start fresh
docker rm -f vllm-qwen 2>/dev/null
docker run -d --gpus all \
-e LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:/usr/local/cuda/lib64 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--name vllm-qwen \
vllm/vllm-openai:latest \
--model Qwen/Qwen2.5-7B-Instruct \
--max-model-len 4096 \
--enforce-eager
# 3. Wait for startup (~1-2 min)
docker logs -f vllm-qwen
# Look for: "Application startup complete."
# Ctrl+C to exit logs
# 4. Test
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"Qwen/Qwen2.5-7B-Instruct","messages":[{"role":"user","content":"Hello"}],"max_tokens":32}'
Troubleshooting
Error 803/804: forward compatibility failure
RuntimeError: Unexpected error from cudaGetDeviceCount().
Error 804: forward compatibility was attempted on non supported HW
Error 803: system has unsupported display driver / cuda driver combination
Root cause: The vLLM Docker image ships /usr/local/cuda-XX.X/compat/libcuda.so, NVIDIA's Forward Compatibility library. At container startup this compat library is found before the host driver's libcuda.so. Forward Compatibility is supported only on datacenter GPUs (Tesla, A100, H100, etc.), so CUDA initialization fails on GeForce/RTX consumer GPUs.
Library loading order (the problem):
Container compat: /usr/local/cuda-13.0/compat/libcuda.so.580.82.07 ← loaded first (BROKEN on GeForce)
Host driver: /usr/lib/x86_64-linux-gnu/libcuda.so.590.48.01 ← should be loaded instead
Solution: Set LD_LIBRARY_PATH to prioritize host driver libraries over the compat directory:
-e LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:/usr/local/cuda/lib64
This is already included in the docker run command above.
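The effect of the ordering can be illustrated with a simplified model of the search. This sketch only walks the LD_LIBRARY_PATH entries in order. The real dynamic linker also consults RPATH/RUNPATH, the ld.so cache, and default directories, so treat it as an illustration rather than a diagnostic tool. The function name and the fake file layout are hypothetical.

```python
import os
import posixpath


def first_match(search_path: str, filename: str, exists=os.path.exists):
    """Return the first path in a colon-separated search_path that holds filename."""
    for directory in search_path.split(":"):
        if not directory:
            continue  # skip empty entries such as a trailing colon
        candidate = posixpath.join(directory, filename)
        if exists(candidate):
            return candidate
    return None


# HYPOTHETICAL container layout, for illustration only.
fake_files = {
    "/usr/local/cuda/compat/libcuda.so.1",
    "/usr/lib/x86_64-linux-gnu/libcuda.so.1",
}
fixed_path = "/usr/lib/x86_64-linux-gnu:/usr/local/cuda/lib64"
print(first_match(fixed_path, "libcuda.so.1", exists=fake_files.__contains__))
```

With the host driver directory listed first, the host's libcuda wins. If a compat directory came first in the list, the compat copy would be returned instead.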
Model loading takes 10+ minutes
Without --enforce-eager, vLLM runs torch.compile and CUDA graph capture on first startup. Add --enforce-eager to skip both, at the cost of somewhat lower inference throughput.
nvidia-smi works but torch.cuda fails inside container
The LD_LIBRARY_PATH environment variable is missing, so the container picks up the compat libcuda instead of the host driver's. Ensure -e LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:/usr/local/cuda/lib64 is in your docker run command.
nvidia-smi fails after reboot
# Check if kernel module loaded
lsmod | grep nvidia
# Check for errors
sudo dmesg | grep -i nvidia
# Reinstall driver if needed
sudo apt install --reinstall nvidia-driver-590-open
sudo reboot
Port 8000 already in use
# Find what's using the port
sudo lsof -i :8000
# Or use a different port
docker run -d --gpus all ... -p 8001:8000 ...
# Then access via http://localhost:8001
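If lsof is not available, a port can also be probed from Python. This is a minimal sketch using the standard socket module. The function names and the 8000-8010 range are hypothetical choices for illustration.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0


def first_free_port(start: int = 8000, end: int = 8010) -> int:
    """Return the first port in [start, end] with no listener."""
    for port in range(start, end + 1):
        if not port_in_use(port):
            return port
    raise RuntimeError(f"no free port between {start} and {end}")
```

The result of first_free_port() can then be used as the host side of the -p mapping, for example -p 8001:8000 if it returns 8001.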
Driver Version Reference
| Package | Version | CUDA | Notes |
|---|---|---|---|
| nvidia-driver-535-open | 535.x | 12.2 | LTS |
| nvidia-driver-550-open | 550.x | 12.4 | Stable |
| nvidia-driver-570-open | 570.211.01 | 12.8 | Stable |
| nvidia-driver-580-open | 580.126.09 | 13.0 | - |
| nvidia-driver-590-open | 590.48.01 | 13.1 | recommended |
The RTX 3090 (Ampere, CC 8.6) requires CUDA 11.1 or later, so all drivers listed above are compatible.
API Usage Examples
Chat completion
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-7B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain Docker in one sentence."}
],
"max_tokens": 128,
"temperature": 0.7
}'
Streaming
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-7B-Instruct",
"messages": [{"role": "user", "content": "Tell me a joke"}],
"max_tokens": 128,
"stream": true
}'
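With "stream": true the server answers with server-sent events: lines of the form data: {json}, ending with data: [DONE]. The following is a minimal sketch of a parser for that format, assuming the OpenAI-style chunk layout (choices[0].delta.content). The function name is hypothetical and the sample lines are hand-written, not captured server output.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # blank separator lines and comments carry no payload
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample lines, for illustration only.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant", "content": ""}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))  # Hello
```

The same generator can be fed the lines of a live HTTP response, since it accepts both str and bytes lines.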
Python client (OpenAI SDK compatible)
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
model="Qwen/Qwen2.5-7B-Instruct",
messages=[{"role": "user", "content": "Hello"}],
max_tokens=128,
)
print(response.choices[0].message.content)