---
title: NVIDIA Driver + vLLM Complete Setup Guide
date: 2026-06-28
type: guide
domain: homelab
tags: [nvidia, gpu, vllm, llm]
---

**Ubuntu 24.04 LTS | RTX 3090 | Qwen2.5-7B-Instruct**

Verified: 2026-02-17

---

## Verified Environment

| Component | Detail |
|---|---|
| GPU | NVIDIA GeForce RTX 3090 (GA102, Compute Capability 8.6, Ampere) |
| OS | Ubuntu 24.04.4 LTS, Linux 6.17.0-14-generic |
| NVIDIA Driver | nvidia-driver-590-open, 590.48.01 (CUDA 13.1) |
| Docker | 29.2.1 |
| NVIDIA Container Toolkit | 1.18.2 |
| vLLM Image | vllm/vllm-openai:latest (v0.15.1, ~29.5GB) |
| Model | Qwen/Qwen2.5-7B-Instruct (bf16, ~22GB VRAM in use, including vLLM's preallocated KV cache) |

---

## Prerequisites

Install Docker and the NVIDIA Container Toolkit before starting:

```bash
# Docker
sudo apt install docker.io
sudo usermod -aG docker $USER
# Log out and back in for the group change to take effect

# NVIDIA Container Toolkit (enables --gpus flag)
# https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt install nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```

---

## Part 1: NVIDIA Driver Installation

### Step 1. Check recommended driver

```bash
ubuntu-drivers devices
```

Look for the `recommended` line:

```
driver   : nvidia-driver-590-open - distro non-free recommended
```

### Step 2. Install driver

```bash
sudo apt update
sudo apt install nvidia-driver-590-open
```

Alternative (auto-selects recommended):

```bash
sudo ubuntu-drivers install
```

### Step 3. Remove old drivers (if any)

Multiple driver versions can conflict.
Remove previous versions:

```bash
# Check installed drivers
dpkg -l | grep nvidia-driver

# Remove old versions (adjust package names as needed)
sudo apt remove --purge nvidia-driver-580-open nvidia-driver-575-open
sudo apt autoremove
```

### Step 4. Reboot

```bash
sudo reboot
```

### Step 5. Verify driver after reboot

```bash
nvidia-smi
```

Expected output:

```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 590.48.01              Driver Version: 590.48.01      CUDA Version: 13.1     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:06:00.0  On |                  N/A |
+-----------------------------------------+------------------------+----------------------+
```

If `nvidia-smi` fails, the driver did not load correctly. Check `dmesg | grep -i nvidia` for errors.

---

## Part 2: vLLM Server Deployment

### Step 6. Clean up old containers

```bash
docker rm -f vllm-qwen 2>/dev/null
```

### Step 7. Pull vLLM image (optional, docker run will pull automatically)

```bash
docker pull vllm/vllm-openai:latest
```

### Step 8. Start vLLM server

```bash
docker run -d --gpus all \
  -e LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:/usr/local/cuda/lib64 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --name vllm-qwen \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-7B-Instruct \
  --max-model-len 4096 \
  --enforce-eager
```

**Option breakdown:**

| Option | Purpose |
|---|---|
| `--gpus all` | GPU passthrough to container |
| `-e LD_LIBRARY_PATH=...` | **CRITICAL** - Bypasses CUDA compat library issue on GeForce GPUs (see Troubleshooting) |
| `-v ~/.cache/huggingface:...` | Share model cache with host (avoids re-downloading ~14GB) |
| `-p 8000:8000` | Expose OpenAI-compatible API |
| `--enforce-eager` | Skip torch.compile/CUDAGraph for fast startup (~1-2 min instead of 10+ min) |
| `--max-model-len 4096` | Limit context length to fit in 24GB VRAM |

### Step 9. Wait for server ready

```bash
docker logs -f vllm-qwen
```

Wait for this line (takes ~1-2 minutes):

```
INFO:     Application startup complete.
```

Press `Ctrl+C` to stop following logs.

---

## Part 3: Verification

### Step 10. API test

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 32
  }'
```

Expected response:

```json
{
  "model": "Qwen/Qwen2.5-7B-Instruct",
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "Hello! How can I assist you today?"
    },
    "finish_reason": "stop"
  }]
}
```

### Step 11. Check GPU memory usage

```bash
docker exec vllm-qwen nvidia-smi
```

Expected:

```
| 0 NVIDIA GeForce RTX 3090 | 22190MiB / 24576MiB |
| 0 N/A N/A 131 C VLLM::EngineCore 22112MiB |
```

### Step 12. Check model list

```bash
curl -s http://localhost:8000/v1/models | python3 -m json.tool
```

---

## Container Management

```bash
# Follow logs
docker logs -f vllm-qwen

# Stop server
docker stop vllm-qwen

# Restart stopped server
docker start vllm-qwen

# Remove container
docker stop vllm-qwen && docker rm vllm-qwen
```

---

## Quick Restart After Reboot

After a system reboot, run these commands to get the LLM server back up:

```bash
# 1. Verify driver loaded
nvidia-smi

# 2. Remove stale container and start fresh
docker rm -f vllm-qwen 2>/dev/null
docker run -d --gpus all \
  -e LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:/usr/local/cuda/lib64 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --name vllm-qwen \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-7B-Instruct \
  --max-model-len 4096 \
  --enforce-eager

# 3. Wait for startup (~1-2 min)
docker logs -f vllm-qwen
# Look for: "Application startup complete."
# Ctrl+C to exit logs

# 4. Test
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen/Qwen2.5-7B-Instruct","messages":[{"role":"user","content":"Hello"}],"max_tokens":32}'
```

---

## Troubleshooting

### Error 803/804: forward compatibility failure

```
RuntimeError: Unexpected error from cudaGetDeviceCount().
Error 804: forward compatibility was attempted on non supported HW
Error 803: system has unsupported display driver / cuda driver combination
```

**Root cause:** The vLLM Docker image includes `/usr/local/cuda-XX.X/compat/libcuda.so` (NVIDIA Forward Compatibility library). On container startup, this compat library loads *before* the host driver library. Forward Compatibility only works on datacenter GPUs (Tesla, A100, H100, etc.); it fails on GeForce/RTX consumer GPUs.

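
The "loads before" behavior can be illustrated with a small diagnostic sketch. It uses a deliberately simplified model: it walks a colon-separated search path in order and reports the first directory that contains a `libcuda.so*` file. The real dynamic loader also consults RPATH entries and `ld.so.cache`, so treat the result as a hint, not as proof of what gets loaded. The function name `first_libcuda` is hypothetical.

```python
import glob
import os
from typing import Optional


def first_libcuda(search_path: str) -> Optional[str]:
    """Return the first libcuda.so* found when walking search_path in order.

    Simplified model (assumption): only the given colon-separated
    directories are searched; RPATH and ld.so.cache are ignored.
    """
    for directory in search_path.split(":"):
        if not directory:
            continue  # skip empty entries such as a trailing ':'
        matches = sorted(glob.glob(os.path.join(directory, "libcuda.so*")))
        if matches:
            return matches[0]
    return None


if __name__ == "__main__":
    # Reports which libcuda the current LD_LIBRARY_PATH would reach first.
    print(first_libcuda(os.environ.get("LD_LIBRARY_PATH", "")))
```

Run inside the container, a result under a `compat` directory would suggest the Forward Compatibility library is winning the search.
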

**Library loading order (the problem):**

```
Container compat: /usr/local/cuda-13.0/compat/libcuda.so.580.82.07  ← loaded first (BROKEN on GeForce)
Host driver:      /usr/lib/x86_64-linux-gnu/libcuda.so.590.48.01    ← should be loaded instead
```

**Solution:** Set `LD_LIBRARY_PATH` to prioritize host driver libraries over the compat directory:

```
-e LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:/usr/local/cuda/lib64
```

This is already included in the `docker run` command above.

### Model loading takes 10+ minutes

Without `--enforce-eager`, vLLM runs torch.compile and CUDAGraph capture on first startup. Add `--enforce-eager` to skip both, at the cost of slightly slower inference.

### nvidia-smi works but torch.cuda fails inside container

The `LD_LIBRARY_PATH` environment variable is missing. Ensure `-e LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:/usr/local/cuda/lib64` is in your `docker run` command.

### nvidia-smi fails after reboot

```bash
# Check if kernel module loaded
lsmod | grep nvidia

# Check for errors
dmesg | grep -i nvidia

# Reinstall driver if needed
sudo apt install --reinstall nvidia-driver-590-open
sudo reboot
```

### Port 8000 already in use

```bash
# Find what's using the port
sudo lsof -i :8000

# Or use a different port
docker run -d --gpus all ... -p 8001:8000 ...
# Then access via http://localhost:8001
```

---

## Driver Version Reference

| Package | Version | CUDA | Notes |
|---|---|---|---|
| nvidia-driver-535-open | 535.x | 12.2 | LTS |
| nvidia-driver-550-open | 550.x | 12.4 | Stable |
| nvidia-driver-570-open | 570.211.01 | 12.8 | Stable |
| nvidia-driver-580-open | 580.126.09 | 13.0 | - |
| **nvidia-driver-590-open** | **590.48.01** | **13.1** | **recommended** |

RTX 3090 (Ampere, CC 8.6) supports CUDA 11.1+. All drivers above are compatible.

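
The compatibility claim can be checked mechanically. The sketch below encodes only the rows of the table above and the CUDA 11.1 minimum from the note; the names `DRIVER_CUDA`, `max_cuda`, and `supports_rtx3090` are hypothetical, and driver branches not listed in the table are reported as unknown and not guessed.

```python
from typing import Optional, Tuple

# Driver branch -> highest CUDA version it supports, copied from the table above.
DRIVER_CUDA = {
    535: (12, 2),
    550: (12, 4),
    570: (12, 8),
    580: (13, 0),
    590: (13, 1),
}

# Minimum CUDA version for Compute Capability 8.6 (RTX 3090), per the note above.
MIN_CUDA_RTX3090 = (11, 1)


def max_cuda(driver_version: str) -> Optional[Tuple[int, int]]:
    """Map a driver version such as '590.48.01' to its CUDA ceiling.

    Returns None for branches that are not listed in the table.
    """
    branch = int(driver_version.split(".")[0])
    return DRIVER_CUDA.get(branch)


def supports_rtx3090(driver_version: str) -> Optional[bool]:
    """True/False for listed branches, None when the branch is unknown."""
    cuda = max_cuda(driver_version)
    if cuda is None:
        return None
    return cuda >= MIN_CUDA_RTX3090


print(max_cuda("590.48.01"))           # (13, 1)
print(supports_rtx3090("570.211.01"))  # True
```
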

---

## API Usage Examples

### Chat completion

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain Docker in one sentence."}
    ],
    "max_tokens": 128,
    "temperature": 0.7
  }'
```

### Streaming

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "Tell me a joke"}],
    "max_tokens": 128,
    "stream": true
  }'
```

### Python client (OpenAI SDK compatible)

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```
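
A streaming request returns server-sent events, not a single JSON body. As a rough sketch, assuming the OpenAI-style chunk format (`data: {...}` lines carrying `choices[0].delta.content`, terminated by `data: [DONE]`), the text can be reassembled with the standard library alone. The helper name `collect_stream_text` is hypothetical, and the sample lines are illustrative, not captured server output.

```python
import json
from typing import Iterable


def collect_stream_text(lines: Iterable[str]) -> str:
    """Concatenate delta.content fragments from OpenAI-style SSE lines."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not data
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts)


# Illustrative input only; real chunks carry more fields (id, model, ...).
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print(collect_stream_text(sample))  # Hello
```

With the OpenAI SDK, the equivalent is to pass `stream=True` to `client.chat.completions.create(...)` and read `chunk.choices[0].delta.content` from each yielded chunk.
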