I. Hardware and Base Environment Preparation
1. Server Requirements
- GPU: 6× NVIDIA A5000 (24 GB VRAM per card, 144 GB total)
- RAM: ≥64 GB
- Storage: ≥500 GB SSD (NVMe recommended)
- OS: Ubuntu 22.04 LTS / Debian 12
2. Environment Initialization
```bash
# Install base tooling
sudo apt update && sudo apt install -y docker.io nvidia-container-toolkit
# Configure Docker to use NVIDIA GPUs
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```
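Before moving on, it is worth confirming that containers can actually see the GPUs through the NVIDIA runtime. A minimal check, assuming any recent nvidia/cuda base image (the exact tag below is only an example):
```bash
# Host-side check: all 6 A5000s should be listed
nvidia-smi -L
# Container-side check: the same GPUs should appear from inside Docker
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```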
II. vLLM Multi-GPU Deployment (6-GPU Setup)
1. Install vLLM
```bash
# Create a virtual environment
conda create -n vllm python=3.10 -y && conda activate vllm
# Install vLLM (0.5.4+ recommended)
pip install vllm -i https://pypi.tuna.tsinghua.edu.cn/simple
```
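A quick sanity check that the install succeeded before starting the server:
```bash
# Confirm vLLM imports cleanly and print its version
python -c "import vllm; print(vllm.__version__)"
```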
2. Start the 6-GPU Inference Service
```bash
# --tensor-parallel-size    parallel degree = number of GPUs
# --gpu-memory-utilization  VRAM utilization threshold (0.8–0.9 recommended with 6 cards)
# --max-num-seqs            tuning for high concurrency
# --enforce-eager           avoids multi-GPU compatibility issues
# --port                    service port
# --api-key                 access token (improves security)
vllm serve /path/to/model \
  --tensor-parallel-size 6 \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 64 \
  --enforce-eager \
  --port 8000 \
  --api-key "your-token"
```
Note: vLLM's tensor parallelism requires the model's attention head count to be divisible by --tensor-parallel-size, so a value of 6 does not fit every model; if startup fails with a divisibility error, choose a compatible value instead.
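Once the server is up, the OpenAI-compatible API can be smoke-tested from the same host. Note that when a local path is served, vLLM registers the model under that path by default unless --served-model-name is given; the token below is the placeholder used above.
```bash
# List the loaded models; a JSON response confirms the endpoint and API key work
curl http://localhost:8000/v1/models \
  -H "Authorization: Bearer your-token"
```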
III. Deploying Dify and Connecting It to vLLM
1. Deploy the Dify Service
```bash
# Pull the Dify source
git clone https://github.com/langgenius/dify.git
cd dify/docker
# Edit the configuration (key step)
cp .env.example .env
nano .env  # set the following parameters:
```
```env
# Point the model endpoint at the vLLM service
MODEL_PROVIDER=vllm
VLLM_API_BASE=http://localhost:8000/v1   # vLLM's OpenAI-compatible API address
VLLM_MODEL_NAME=your-model-name          # must match the model name used when starting vLLM
```
2. Start Dify
```bash
docker compose up -d  # builds and starts the containers automatically
```
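After the stack comes up, check that all services are healthy. One caveat worth flagging: because Dify runs inside Docker while vLLM runs on the host, localhost inside the Dify containers refers to the container itself, so the vLLM endpoint usually has to be given as the host's LAN IP (or host.docker.internal on Docker Desktop) rather than http://localhost:8000/v1. A quick status check, assuming the default compose service names:
```bash
# All Dify services should show as running/healthy
docker compose ps
# Tail the API service logs if anything fails to start
docker compose logs -f api
```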
IV. Calling the APIs from External Applications
1. Via Dify (application layer)
- Dify API base URL: http://<server-IP>:80/v1 (default port)
- Authentication: add the header Authorization: Bearer {DIFY_API_KEY}
- Example request (text generation):
```python
import requests

# Dify Service API: text-generation apps expose the /v1/completion-messages endpoint
url = "http://<server-IP>/v1/completion-messages"
headers = {"Authorization": "Bearer dify-api-key"}
data = {
    # "query" is the prompt variable of the Dify app; use the variable
    # names defined in your own application
    "inputs": {"query": "Hello, please introduce vLLM"},
    "response_mode": "blocking",
    "user": "demo-user",
}
response = requests.post(url, json=data, headers=headers)
print(response.json())
```
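The same request can be issued from the shell for quick testing; a sketch reusing the placeholders above (server IP, API key) and Dify's completion-messages endpoint for text-generation apps:
```bash
curl -X POST "http://<server-IP>/v1/completion-messages" \
  -H "Authorization: Bearer dify-api-key" \
  -H "Content-Type: application/json" \
  -d '{
        "inputs": {"query": "Hello, please introduce vLLM"},
        "response_mode": "blocking",
        "user": "demo-user"
      }'
```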
2. Calling vLLM Directly (high-performance scenarios)
```python
# Use the OpenAI-compatible API (Python example)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="your-token")
response = client.chat.completions.create(
    model="your-model-name",
    messages=[{"role": "user", "content": "Explain quantum mechanics"}],
)
print(response.choices[0].message.content)
```
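For latency-sensitive clients, the same endpoint also supports token streaming. A minimal sketch with curl, reusing the model name and token placeholders from above:
```bash
# Stream the completion as server-sent events instead of waiting for the full response
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-token" \
  -d '{
        "model": "your-model-name",
        "messages": [{"role": "user", "content": "Explain quantum mechanics"}],
        "stream": true
      }'
```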
V. Core Advantages of vLLM over Ollama

| Dimension | vLLM | Ollama |
|---|---|---|
| Multi-GPU support | ✅ Native 6-GPU tensor parallelism (--tensor-parallel-size=6) | ❌ Single-GPU only; multi-GPU use requires manual switching |
| Throughput | ⭐ Continuous batching; 5–10× higher concurrent throughput across 6 GPUs | ⚠️ Per-request processing; weak concurrency |
| Production readiness | ✅ Industrial-grade deployment (API keys, monitoring, scaling) | ❌ Aimed at development/testing; no enterprise features |
| VRAM management | ✅ PagedAttention with dynamic allocation; handles models with tens of billions of parameters | ⚠️ Loads the full model; prone to OOM |
| Security | ✅ Built-in API-key authentication | ❌ No authentication by default; needs an Nginx reverse proxy |

💡 Key takeaways:
vLLM is the first choice for production-grade AI services, especially high-concurrency, low-latency scenarios such as API serving;
Ollama is better suited to quick local prototyping, but falls clearly short on multi-GPU utilization and security.
VI. Troubleshooting
- Multi-GPU startup failure: set export VLLM_WORKER_MULTIPROC_METHOD=spawn to resolve multi-process hangs
- Out of VRAM:
  - Lower --gpu-memory-utilization to 0.7
  - Add --swap-space 16 to spill over into host memory
- Dify cannot reach vLLM (see the connectivity check below):
  - Check that VLLM_API_BASE in .env includes the /v1 path
  - Make sure vLLM was started with --api-key and that the key matches the Dify configuration
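A quick way to confirm connectivity is to call the vLLM endpoint from inside the Dify API container, since that is the network context Dify actually uses. A sketch, assuming the compose service is named api, curl is available in the image, and <server-IP> is the host running vLLM:
```bash
# localhost inside the container would point at the container itself,
# so the host's address must be used here
docker compose exec api curl -s http://<server-IP>:8000/v1/models \
  -H "Authorization: Bearer your-token"
```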
After deployment, GPU utilization can be monitored with nvidia-smi; during normal operation the load across the 6 cards should stay balanced (within roughly ±5% of each other).
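A compact way to watch per-GPU load and memory usage, using standard nvidia-smi query fields:
```bash
# Refresh every 2 seconds: index, utilization and memory for each GPU
watch -n 2 'nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv'
```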
English reference (original): Based on the information available, here's a comparison of vLLM and Ollama, two popular frameworks for running large language models (LLMs) locally:
vLLM
- Focus: High-throughput, low-latency LLM inference and serving, particularly suited for production environments.
- Key Features:
- PagedAttention: A memory management technique that optimizes GPU memory usage for faster inference speeds, especially with long sequences and large models.
- Continuous Batching: Processes incoming requests dynamically to maximize hardware utilization.
- High Performance: Consistently delivers superior throughput and lower latency, particularly for concurrent requests.
- Scalability: Designed for scalability, including support for tensor parallelism and pipeline parallelism for distributed inference across multiple GPUs or nodes.
- OpenAI-compatible API: Simplifies integration with applications.
- Hardware Requirements: Optimized for high-end, CUDA-enabled NVIDIA GPUs, although it technically supports CPU inference (less optimized).
- Ease of Use: Offers more control and optimization options but has a steeper learning curve, requiring more technical knowledge for setup.
Ollama
- Focus: User-friendly, local deployment and management of LLMs, prioritizing simplicity and accessibility.
- Key Features:
- Ease of Use: Offers a streamlined workflow for downloading, running, and managing models with a simple command-line interface (CLI) and an OpenAI-compatible API.
- Broad Hardware Compatibility: Works well on both GPUs and CPUs, making it accessible to users with consumer-grade hardware.
- Local Deployment with Privacy: Ensures data privacy and control by keeping data processing within your local environment.
- Adaptable: Supports various model types and offers token streaming for faster responses.
- Growing Performance: While potentially slower than vLLM on high-end GPUs, recent updates have significantly improved its performance.
- Hardware Requirements: Designed to work reasonably well even on consumer-grade hardware.
- Ease of Use: Prioritizes simplicity, making it easy to install and run models with just a few commands.
In Summary:
- Choose vLLM when: You need maximum performance and scalability in production environments, especially when utilizing high-end GPUs for high-throughput workloads.
- Choose Ollama when: You prioritize ease of use, broad hardware compatibility (including CPU-only setups), and local data privacy for development, prototyping, or simpler projects.
Hybrid Approach:
It’s worth considering a hybrid approach where you use Ollama for development and prototyping and then deploy with vLLM in production for optimal performance.