VLLM对比Ollama,6卡A5000 部署VLLM + Dify​​的详细教程

vllm 对比Ollama

​一、硬件与基础环境准备​

​1. 服务器配置要求​

  • ​GPU​​:6×NVIDIA A5000(24GB显存/卡,共144GB显存)
  • ​内存​​:≥64GB RAM
  • ​存储​​:≥500GB SSD(推荐NVMe)
  • ​系统​​:Ubuntu 22.04 LTS / Debian 12

​2. 环境初始化​

# 安装基础工具
sudo apt update && sudo apt install -y docker.io nvidia-container-toolkit
# 配置Docker使用NVIDIA GPU
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

​二、VLLM多卡部署(6卡优化)​

​1. 安装vLLM​

# 创建虚拟环境
conda create -n vllm python=3.10 -y && conda activate vllm
# 安装vLLM(推荐0.5.4+)
pip install vllm -i https://pypi.tuna.tsinghua.edu.cn/simple

​2. 启动6卡推理服务​

vllm serve --model /path/to/model \  
   --tensor-parallel-size 6 \          # 并行数=GPU数量
   --gpu-memory-utilization 0.85 \     # 显存利用率阈值(6卡建议0.8~0.9)
   --max-num-seqs 64 \                 # 高并发优化
   --enforce-eager \                   # 避免多卡兼容问题
   --port 8000 \                       # 服务端口
   --api-key "your-token"              # 访问令牌(增强安全性)

​三、Dify部署与对接VLLM​

​1. 部署Dify服务​

# 拉取Dify代码
git clone https://github.com/langgenius/dify.git
cd dify/docker

# 修改配置(关键步骤)
cp .env.example .env
nano .env  # 修改以下参数:
# 模型端点指向VLLM服务
MODEL_PROVIDER=vllm
VLLM_API_BASE=http://localhost:8000/v1  # VLLM的OpenAI兼容API地址
VLLM_MODEL_NAME=your-model-name         # 与vLLM启动时的模型名一致

​2. 启动Dify​

docker compose up -d  # 自动构建容器

​四、外部应用API调用方法​

​1. 通过Dify调用(业务层)​

  • ​Dify API地址​​:http://<服务器IP>:80/v1(默认端口)
  • ​认证​​:Header中添加 Authorization: Bearer {DIFY_API_KEY}
  • ​请求示例​​(生成文本):
import requests
url = "http://<服务器IP>/v1/completion"
data = {
  "inputs": "你好,介绍一下vLLM",
  "response_mode": "blocking"
}
headers = {"Authorization": "Bearer dify-api-key"}
response = requests.post(url, json=data, headers=headers)

​2. 直接调用VLLM(高性能场景)​

# 使用OpenAI兼容API(Python示例)
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="your-token")
response = client.chat.completions.create(
  model="your-model-name",
  messages=[{"role": "user", "content": "解释量子力学"}]
)

​五、VLLM对比Ollama的核心优势​

​维度​​VLLM​​Ollama​
​多卡支持​✅ 原生6卡张量并行(--tensor-parallel-size=6❌ 仅支持单卡,多卡需手动切换
​吞吐量​⭐ 连续批处理技术,6卡并发提升5-10倍⚠️ 单请求处理,并发能力弱
​生产就绪​✅ 工业级部署(API密钥、监控、扩缩容)❌ 定位开发测试,无企业级特性
​显存管理​✅ PagedAttention动态分配,支持百亿模型⚠️ 全模型加载,易OOM
​安全性​✅ 内置API密钥认证❌ 默认无认证,需Nginx反向代理

💡 ​​关键结论​​:
VLLM是​​生产级AI服务​​的首选,尤其适合高并发、低延迟场景(如API服务);
Ollama更适合​​本地快速原型验证​​,但在多卡利用率和安全性上存在明显短板。


​六、常见问题排查​

  1. ​多卡启动失败​​: export VLLM_WORKER_MULTIPROC_METHOD=spawn # 解决多进程卡死
  2. ​显存不足​​:
    • 降低--gpu-memory-utilization至0.7
    • 添加--swap-space 16 使用主机内存扩展
  3. ​Dify连接VLLM失败​​:
    • 检查.envVLLM_API_BASE是否含/v1路径
    • 确保vLLM启动参数含--api-key且与Dify配置一致

部署完成后,可通过 nvidia-smi 监控GPU利用率,正常运行时6卡负载应均衡(±5%差异)。

英文参考原文:Based on the information available, here’s a comparison of vLLM and Ollama, two popular frameworks for running large language models (LLMs) locally: 

vLLM 

  • Focus: High-throughput, low-latency LLM inference and serving, particularly suited for production environments.
  • Key Features:
    • PagedAttention: A memory management technique that optimizes GPU memory usage for faster inference speeds, especially with long sequences and large models.
    • Continuous Batching: Processes incoming requests dynamically to maximize hardware utilization.
    • High Performance: Consistently delivers superior throughput and lower latency, particularly for concurrent requests.
    • Scalability: Designed for scalability, including support for tensor parallelism and pipeline parallelism for distributed inference across multiple GPUs or nodes.
    • OpenAI-compatible API: Simplifies integration with applications.
  • Hardware Requirements: Optimized for high-end, CUDA-enabled NVIDIA GPUs, although it technically supports CPU inference (less optimized).
  • Ease of Use: Offers more control and optimization options but has a steeper learning curve, requiring more technical knowledge for setup. 

Ollama 

  • Focus: User-friendly, local deployment and management of LLMs, prioritizing simplicity and accessibility.
  • Key Features:
    • Ease of Use: Offers a streamlined workflow for downloading, running, and managing models with a simple command-line interface (CLI) and an OpenAI-compatible API.
    • Broad Hardware Compatibility: Works well on both GPUs and CPUs, making it accessible to users with consumer-grade hardware.
    • Local Deployment with Privacy: Ensures data privacy and control by keeping data processing within your local environment.
    • Adaptable: Supports various model types and offers token streaming for faster responses.
    • Growing Performance: While potentially slower than vLLM on high-end GPUs, recent updates have significantly improved its performance.
  • Hardware Requirements: Designed to work reasonably well even on consumer-grade hardware.
  • Ease of Use: Prioritizes simplicity, making it easy to install and run models with just a few commands. 

In Summary:

  • Choose vLLM when: You need maximum performance and scalability in production environments, especially when utilizing high-end GPUs for high-throughput workloads.
  • Choose Ollama when: You prioritize ease of use, broad hardware compatibility (including CPU-only setups), and local data privacy for development, prototyping, or simpler projects. 

Hybrid Approach:

It’s worth considering a hybrid approach where you use Ollama for development and prototyping and then deploy with vLLM in production for optimal performance.