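vLLM High-Performance Deployment Guide

Introduction to vLLM and Its Core Features

vLLM is an open-source large language model inference engine developed at the Sky Computing Lab at UC Berkeley, designed for high-throughput, low-latency LLM serving. Since its release in 2023, it has become one of the most widely used inference frameworks in the industry.

Why vLLM?

Feature	Traditional inference frameworks	vLLM
Memory management	Pre-allocated contiguous memory	Dynamic management with PagedAttention
Batching	Static batching	Continuous batching
Throughput	Moderate	Up to 10-20x higher, depending on the baseline and workload
GPU utilization	Typically 40-60%	Typically 80-95%
Latency	Higher	Significantly lower under load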

Core Architecture Components

┌─────────────────────────────────────────────────────────┐
│                    vLLM Architecture                     │
├─────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐     │
│  │   API Layer │  │   Scheduler │  │   Worker    │     │
│  │  (OpenAI)   │  │  (Continuous│  │   (GPU)     │     │
│  │             │  │   Batching) │  │             │     │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘     │
│         │                │                │             │
│  ┌──────┴────────────────┴────────────────┴──────┐     │
│  │            PagedAttention Manager              │     │
│  │         (Block Table & KV Cache)               │     │
│  └────────────────────────────────────────────────┘     │
└─────────────────────────────────────────────────────────┘

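PagedAttention: Rethinking Memory Management

Memory Problems with Traditional Attention

In traditional Transformer inference, each request's KV cache is pre-allocated as one contiguous region sized for the maximum sequence length, which causes severe fragmentation and waste:

# Illustration of the problem with the traditional approach
# Assume a maximum sequence length of 2048, with the full range reserved for every request
request_1: [■■■■■■------------------]  # uses 512, wastes 1536
request_2: [■■■■■■■■■■■■------------]  # uses 1024, wastes 1024
request_3: [■■■---------------------]  # uses 256, wastes 1792
# Total reserved: 6144 tokens; actually used: 1792 tokens; wasted: 4352 tokens (about 71%)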
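
The totals in this illustration can be checked with a few lines of arithmetic. The sketch below assumes the same three requests and 2048-token reservations; the variable names are illustrative only.

```python
# Contiguous pre-allocation: every request reserves max_seq_len slots up front.
max_seq_len = 2048
used_tokens = [512, 1024, 256]  # tokens actually stored by each request

reserved = max_seq_len * len(used_tokens)
used = sum(used_tokens)
wasted = reserved - used

print(f"reserved={reserved} used={used} wasted={wasted} ({wasted / reserved:.0%})")
```

The shorter the typical request is relative to the maximum length, the larger the wasted fraction becomes.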

How PagedAttention Works

PagedAttention borrows the idea of virtual memory paging from operating systems and splits the KV cache into fixed-size blocks:

# PagedAttention memory management
block_size = 16  # tokens per block

# Logical view vs. physical view
Logical view (each request's block table, listing physical blocks in token order):
Request 1: [Block 0] -> [Block 1] -> [Block 3]
Request 2: [Block 0] -> [Block 2] -> [Block 4]
Request 3: [Block 0] -> [Block 5]

Physical view (GPU memory; Block 0 holds a prefix shared by all three requests):
Physical blocks:
┌─────────┬─────────────────────────────────────┐
│ Block 0 │ tokens 0-15   (Requests 1, 2, 3)    │
│ Block 1 │ tokens 16-31  (Request 1)           │
│ Block 2 │ tokens 16-31  (Request 2)           │
│ Block 3 │ tokens 32-47  (Request 1)           │
│ Block 4 │ tokens 32-47  (Request 2)           │
│ Block 5 │ tokens 16-31  (Request 3)           │
└─────────┴─────────────────────────────────────┘
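
The logical-to-physical mapping can be sketched with a toy allocator. This is a minimal illustration, not vLLM's block manager: `BlockAllocator` and `Sequence` are hypothetical names, blocks are plain integers, and no tensors are involved.

```python
BLOCK_SIZE = 16  # tokens per KV cache block, as in the example above

class BlockAllocator:
    """Toy allocator that hands out physical block ids from a free list."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self):
        if not self.free:
            raise MemoryError("no free KV cache blocks")
        return self.free.pop(0)

class Sequence:
    """Tracks one request's block table (logical index -> physical block id)."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []
        self.num_tokens = 0

    def append_token(self):
        # A new physical block is needed only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

allocator = BlockAllocator(num_blocks=8)
seq_a, seq_b = Sequence(allocator), Sequence(allocator)

# Interleave two requests: seq_b generates 40 tokens, seq_a only 20.
for i in range(40):
    seq_b.append_token()
    if i < 20:
        seq_a.append_token()

print("seq_a:", seq_a.block_table)
print("seq_b:", seq_b.block_table)
```

Because blocks are claimed on demand, each request's physical blocks end up interleaved rather than contiguous, and at most one partially filled block per request is left unused.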

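Memory Sharing

PagedAttention supports several memory-sharing strategies:

Prompt sharing: requests with the same system prompt share its KV cache
Copy-on-write: forked requests share blocks, and a block is copied only when one of them writes to it
Parallel sampling: multiple output sequences share the KV cache of the same prompt

# Copy-on-write example
# Original sequence (assume Block 2 is already full)
Sequence A: [Block 0] -> [Block 1] -> [Block 2]

# The two sequences after the fork (both share Blocks 0-2)
Sequence A: [Block 0] -> [Block 1] -> [Block 2] -> [Block 4]
Sequence B: [Block 0] -> [Block 1] -> [Block 2] -> [Block 5]
# Blocks 0, 1, and 2 stay shared; only newly generated tokens need new blocks
# If Block 2 were only partly filled, the first write to it would trigger a copy instead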
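
Copy-on-write is usually implemented with per-block reference counts. The sketch below is a simplified illustration under that assumption; `BlockPool`, `fork`, and `writable` are hypothetical names, and the actual copy of tensor data is omitted.

```python
class BlockPool:
    """Toy reference-counted pool of KV cache blocks (ids only)."""
    def __init__(self):
        self.ref_count = {}
        self.next_id = 0

    def allocate(self):
        block_id = self.next_id
        self.next_id += 1
        self.ref_count[block_id] = 1
        return block_id

    def share(self, block_id):
        self.ref_count[block_id] += 1

    def writable(self, block_id):
        """Return a block the caller may write to, copying it if it is shared."""
        if self.ref_count[block_id] == 1:
            return block_id
        self.ref_count[block_id] -= 1   # drop this sequence's reference
        return self.allocate()          # private copy (data copy omitted)

def fork(pool, block_table):
    """The child shares every parent block; nothing is copied yet."""
    for block_id in block_table:
        pool.share(block_id)
    return list(block_table)

pool = BlockPool()
seq_a = [pool.allocate() for _ in range(3)]   # blocks 0, 1, 2
seq_b = fork(pool, seq_a)

# Sequence B writes into its last, partially filled block: copy-on-write.
seq_b[-1] = pool.writable(seq_b[-1])
print(seq_a, seq_b, pool.ref_count)
```

Here only the block that is actually written gets duplicated; the two leading blocks remain shared with a reference count of 2.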

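Continuous Batching: Throughput Optimization

Limits of Static Batching

Timeline: ──────────────────────────────────────────────>

Static batching:
Batch 1: [Req1 ████] [Req2 ██████] [Req3 ██]
         Batch 2 cannot start until every request in Batch 1 has finished
Batch 2: [Req4 ███] [Req5 ██████] [Req6 ████]

Problem: the short request (Req3) must wait for the long request (Req2) to finish, and its slot sits idle in the meantime

How Continuous Batching Works

Continuous batching (also called in-flight batching or iteration-level scheduling) lets the scheduler change the batch composition at every iteration:

Timeline: ──────────────────────────────────────────────>

Continuous batching:
Iter 1: [Req1 █] [Req2 █] [Req3 █] [Req4 █]           ← Req3 finishes
Iter 2: [Req1 █] [Req2 █] [Req4 █] [Req5 █]           ← Req5 takes Req3's slot
Iter 3: [Req1 █] [Req2 █] [Req4 █] [Req5 █]           ← Req1 finishes
Iter 4: [Req2 █] [Req4 █] [Req5 █] [Req6 █]           ← Req6 takes Req1's slot
Iter 5: [Req2 █] [Req4 █] [Req5 █] [Req6 █]

Benefit: freed slots are refilled immediately, so the GPU stays busy and short requests return quickly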
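
The difference between the two approaches can be quantified with a small simulation. This is a sketch under simplifying assumptions: every request needs a fixed number of decode iterations (taken from the bar lengths in the static batching diagram), prefill cost is ignored, and the batch has three slots.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request has finished."""
    return sum(
        max(lengths[i:i + batch_size])
        for i in range(0, len(lengths), batch_size)
    )

def continuous_batching_steps(lengths, batch_size):
    """Finished requests free their slot at the end of every iteration."""
    waiting = deque(lengths)
    running = []
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode iteration for every running request, then retire finished ones.
        running = [remaining - 1 for remaining in running]
        running = [remaining for remaining in running if remaining > 0]
        steps += 1
    return steps

lengths = [4, 6, 2, 3, 6, 4]  # decode iterations needed by Req1..Req6
static_steps = static_batching_steps(lengths, batch_size=3)
continuous_steps = continuous_batching_steps(lengths, batch_size=3)
print(static_steps, continuous_steps)
```

With these lengths the static schedule needs 12 iterations and the continuous one 10; the gap widens as request lengths become more uneven.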

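Scheduling Policies

The vLLM scheduler combines a request-ordering policy with preemption:

# 1. FCFS (First-Come-First-Served), the default
# Requests are processed in arrival order: fair, but prone to head-of-line blocking

# 2. Priority-based
# Higher-priority requests are scheduled first (availability depends on the vLLM release)

# 3. Preemption
# When KV cache memory runs out, lower-priority requests can be preempted
# A preempted request is either recomputed later or has its KV blocks swapped out to CPU memory and restored afterwards

from collections import deque

class SchedulingPolicy:
    """Toy three-queue scheduler for illustration; not vLLM's actual class."""
    def __init__(self, max_running=4):
        self.max_running = max_running
        self.waiting_queue = deque()   # requests not yet started
        self.running_queue = []        # requests currently being decoded
        self.swapped_queue = deque()   # preempted requests swapped out to CPU

    def schedule(self):
        # Resume swapped-out requests first, then admit new ones from the waiting queue
        for queue in (self.swapped_queue, self.waiting_queue):
            while queue and len(self.running_queue) < self.max_running:
                self.running_queue.append(queue.popleft())
        # The returned batch is what one iteration would execute
        return list(self.running_queue)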

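Installation and Configuration

System Requirements

Component	Minimum	Recommended
GPU	NVIDIA GPU (Compute Capability >= 7.0)	A100 / H100 / RTX 4090
CUDA	11.8	12.1+
Python	3.8	3.10+
System memory	1.5x the model size	2x the model size
Disk	100GB	500GB+ SSD

Installation Methods

Method 1: pip install (recommended)

# Basic installation
pip install vllm

# Resolve PyTorch wheels for a specific CUDA version
# Note: this index only affects PyTorch; the vLLM wheel itself is built against a fixed CUDA version
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu121

# Development (pre-release) version
pip install vllm --pre --extra-index-url https://download.pytorch.org/whl/nightly/cu121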

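Method 2: Docker

# Pull the official image
docker pull vllm/vllm-openai:latest

# Run the container
# --ipc=host gives the container access to host shared memory, which PyTorch uses between processes
docker run --gpus all \
    --ipc=host \
    -p 8000:8000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=<your_token>" \
    vllm/vllm-openai:latest \
    --model meta-llama/Llama-2-7b-hf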

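Method 3: Build from source

# Clone the repository
git clone https://github.com/vllm-project/vllm.git
cd vllm

# Build and install vLLM in editable mode (this also compiles the CUDA kernels)
pip install -e .

# Rebuild the CUDA extensions in place (optional, for custom kernels)
python setup.py build_ext --inplace

Environment Configuration

# Set the Hugging Face cache directory
export HF_HOME=/path/to/hf_cache
export HUGGINGFACE_HUB_CACHE=/path/to/hf_cache

# Set a model download mirror (for users in mainland China)
export HF_ENDPOINT=https://hf-mirror.com

# Visible CUDA devices
export CUDA_VISIBLE_DEVICES=0,1,2,3

# vLLM-specific settings
export VLLM_WORKER_MULTIPROC_METHOD=spawn  # how worker processes are started
export VLLM_ATTENTION_BACKEND=FLASH_ATTN   # attention backend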

Model Deployment in Practice

Basic Deployment

Python API

from vllm import LLM, SamplingParams

# Load the model
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    tensor_parallel_size=1,      # tensor parallel degree
    gpu_memory_utilization=0.9,  # fraction of GPU memory vLLM may use
    max_model_len=4096,          # maximum sequence length
)

# Configure sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=512,
)

# Run inference
prompts = [
    "The future of artificial intelligence is",
    "Once upon a time",
    "In the world of technology",
]
outputs = llm.generate(prompts, sampling_params)

# Print the results
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

OpenAI-Compatible Server

# Start the OpenAI-compatible server
# The chat endpoint needs a model with a chat template, so the chat-tuned variant is used here
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --port 8000 \
    --host 0.0.0.0 \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 4096 \
    --max-num-seqs 256

# Client call
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"  # any value works unless the server was started with an API key
)

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",  # must match the model the server was started with
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is machine learning?"}
    ],
    temperature=0.7,
    max_tokens=512
)

print(response.choices[0].message.content)

Multi-GPU Deployment

Tensor Parallelism

# Tensor parallelism across 4 GPUs
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-70b-hf \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.95

Pipeline Parallelism

# Combine tensor and pipeline parallelism (2 x 2 = 4 GPUs in total)
# Pipeline parallelism requires a vLLM release that supports it for the chosen model
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-70b-hf \
    --tensor-parallel-size 2 \
    --pipeline-parallel-size 2 \
    --gpu-memory-utilization 0.9
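
Before launching, it helps to check that a layout is valid. The sketch below encodes two facts: the total GPU count is the product of the tensor-parallel and pipeline-parallel sizes, and the number of attention heads must be divisible by the tensor-parallel size. `plan_parallelism` is a hypothetical helper, not part of vLLM.

```python
def plan_parallelism(num_attention_heads, tensor_parallel_size, pipeline_parallel_size=1):
    """Return the number of GPUs a TP x PP layout needs, or raise if TP is invalid."""
    if num_attention_heads % tensor_parallel_size != 0:
        raise ValueError(
            f"{num_attention_heads} attention heads cannot be split evenly "
            f"across {tensor_parallel_size} tensor-parallel ranks"
        )
    return tensor_parallel_size * pipeline_parallel_size

# Llama-2-70B has 64 attention heads.
print(plan_parallelism(64, tensor_parallel_size=4))
print(plan_parallelism(64, tensor_parallel_size=2, pipeline_parallel_size=2))
```

Both commands above therefore occupy four GPUs; a tensor-parallel size of 3 would be rejected for this model.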

Quantized Deployment

AWQ Quantization

# Use AWQ 4-bit quantization
python -m vllm.entrypoints.openai.api_server \
    --model TheBloke/Llama-2-7B-AWQ \
    --quantization awq \
    --dtype float16

GPTQ Quantization

# Use GPTQ 4-bit quantization
python -m vllm.entrypoints.openai.api_server \
    --model TheBloke/Llama-2-7B-GPTQ \
    --quantization gptq \
    --dtype float16

FP8 Quantization (Hopper GPUs)

# Use FP8 quantization (needs a GPU with FP8 support, such as H100/H200)
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-hf \
    --quantization fp8 \
    --kv-cache-dtype fp8
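
A rough estimate of weight memory shows why quantization matters. The sketch below assumes a nominal 7 billion parameters and counts only the weights; quantization scales and zero points, the KV cache, and activations are ignored, so real usage will be somewhat higher.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Approximate weight storage in GiB for a given precision."""
    return num_params * bits_per_weight / 8 / 2**30

params_7b = 7e9  # nominal parameter count, an assumption for illustration
for name, bits in [("FP16", 16), ("FP8", 8), ("4-bit (AWQ/GPTQ)", 4)]:
    print(f"{name}: {weight_memory_gib(params_7b, bits):.1f} GiB")
```

Halving the bits per weight halves the weight footprint, which frees GPU memory for a larger KV cache.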

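Long-Context Deployment

from vllm import LLM

# Extend Llama 2's native 4K context to 16K with dynamic RoPE scaling (4096 x 4.0 = 16384)
# Reaching 128K calls for a model trained for long context rather than a larger scaling factor
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    max_model_len=16384,
    rope_scaling={
        "type": "dynamic",
        "factor": 4.0
    },
    gpu_memory_utilization=0.95,
    # Prefix caching reuses KV cache blocks across requests that share a long prefix
    enable_prefix_caching=True,
)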

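Performance Optimization Strategies

1. Memory Optimization

# Tuned engine arguments
import os

# The attention backend is selected through an environment variable, not an LLM() argument
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"

from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",

    # GPU memory
    gpu_memory_utilization=0.95,  # raise the fraction of GPU memory vLLM may use

    # Batching and KV cache limits
    max_num_batched_tokens=4096,  # must be >= max_model_len unless chunked prefill is enabled
    max_num_seqs=256,             # maximum number of concurrent sequences
    max_model_len=4096,           # Llama 2's native context length

    # Prefix caching
    enable_prefix_caching=True,   # reuse KV cache for shared prefixes
)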

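2. Batching Optimization

from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

# Sampling configuration
sampling_params = SamplingParams(
    # ignore_eos=True forces generation up to max_tokens; useful for benchmarking only
    ignore_eos=False,

    # Set a reasonable max_tokens
    max_tokens=1024,

    # Sampling parameters
    temperature=0.7,
    top_p=0.95,
    top_k=50,
)

# Use the asynchronous engine for higher concurrency
engine_args = AsyncEngineArgs(model="meta-llama/Llama-2-7b-hf")

engine = AsyncLLMEngine.from_engine_args(
    engine_args,
    start_engine_loop=True,
)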

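3. Inference Acceleration Techniques

Speculative Decoding

# Speculative decoding (draft, then verify)
# Argument names follow vLLM 0.4.x; newer releases configure speculation differently
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    speculative_model="[ngram]",   # draft tokens by n-gram lookup in the prompt
    num_speculative_tokens=5,      # number of draft tokens per step
    ngram_prompt_lookup_max=4,     # longest n-gram to match (required for "[ngram]")
    use_v2_block_manager=True,     # required by speculative decoding in 0.4.x
)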
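
The idea behind n-gram speculation can be shown in a few lines. This is a simplified sketch of prompt-lookup drafting, not vLLM's implementation: `ngram_propose` is a hypothetical function that looks for the most recent earlier occurrence of the last n tokens and proposes the tokens that followed it.

```python
def ngram_propose(token_ids, ngram_size, num_speculative_tokens):
    """Propose draft tokens by matching the trailing n-gram earlier in the context."""
    if len(token_ids) <= ngram_size:
        return []
    tail = token_ids[-ngram_size:]
    # Search backwards for the most recent earlier match (excluding the tail itself).
    for start in range(len(token_ids) - ngram_size - 1, -1, -1):
        if token_ids[start:start + ngram_size] == tail:
            follow = start + ngram_size
            return token_ids[follow:follow + num_speculative_tokens]
    return []  # no match: fall back to normal decoding for this step

context = [5, 8, 13, 21, 34, 55, 7, 8, 13]
draft = ngram_propose(context, ngram_size=2, num_speculative_tokens=3)
print(draft)
```

The target model then verifies the draft in a single forward pass and keeps the longest accepted prefix, so the approach pays off when outputs repeat spans of the input, as in summarization or code editing.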

Prefix Caching

# Enable prefix caching so requests can share the system prompt
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    enable_prefix_caching=True,
)

# Requests with the same prefix automatically share its KV cache
prompts = [
    "[SYSTEM] You are a helpful assistant. [USER] What is AI?",
    "[SYSTEM] You are a helpful assistant. [USER] Explain ML.",
    "[SYSTEM] You are a helpful assistant. [USER] Define DL.",
]
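
One common way to detect shared prefixes is to hash each full block together with everything that precedes it. The sketch below illustrates that idea with a tiny block size; `block_hashes` and `count_cached_blocks` are hypothetical names, and the cache stores only hashes rather than KV tensors.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per block; kept small so the example is easy to follow

def block_hashes(token_ids):
    """Hash every full block chained with the hash of the blocks before it."""
    hashes, prev = [], ""
    num_full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for i in range(0, num_full, BLOCK_SIZE):
        block = token_ids[i:i + BLOCK_SIZE]
        prev = hashlib.sha256((prev + repr(block)).encode()).hexdigest()
        hashes.append(prev)
    return hashes

def count_cached_blocks(cache, token_ids):
    """Return how many leading blocks were already cached, then cache all blocks."""
    hashes = block_hashes(token_ids)
    hits = 0
    for h in hashes:
        if h not in cache:
            break
        hits += 1
    cache.update(hashes)
    return hits

cache = set()
system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]  # shared prefix: two full blocks
first = count_cached_blocks(cache, system_prompt + [9, 10, 11, 12])
second = count_cached_blocks(cache, system_prompt + [13, 14, 15, 16])
print(first, second)
```

The first request misses entirely, while the second reuses the two blocks covering the system prompt and only computes its own final block.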

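4. Performance Tuning Parameters

Parameter	Description	Recommended value
`gpu_memory_utilization`	Fraction of GPU memory vLLM may use	0.85-0.95
`max_num_seqs`	Maximum number of concurrent sequences	256-512
`max_num_batched_tokens`	Maximum number of tokens per batch	2048-4096
`max_model_len`	Maximum sequence length	As required
`tensor_parallel_size`	Tensor parallel degree	Based on the number of GPUs
`pipeline_parallel_size`	Pipeline parallel degree	Use for multi-node deployments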
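
These parameters interact through the KV cache budget. The sketch below estimates how many tokens of KV cache fit on one GPU for Llama-2-7B (32 layers, 32 KV heads, head dimension 128, FP16). The 40 GiB GPU size and 13 GiB weight footprint are assumptions for illustration, and activation memory is ignored.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def kv_token_budget(gpu_mem_gib, gpu_memory_utilization, weight_gib, per_token_bytes):
    """Tokens of KV cache that fit after the weights are loaded."""
    usable_gib = gpu_mem_gib * gpu_memory_utilization - weight_gib
    return int(usable_gib * 2**30 // per_token_bytes)

per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
budget = kv_token_budget(
    gpu_mem_gib=40, gpu_memory_utilization=0.90, weight_gib=13.0,
    per_token_bytes=per_token,
)
print(per_token, budget)
```

Dividing the budget by `max_model_len` gives a lower bound on how many full-length sequences can run concurrently, which is a useful sanity check for `max_num_seqs`.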

Production Configuration

Kubernetes Deployment

# vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-server
  template:
    metadata:
      labels:
        app: vllm-server
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        command:
        - python3
        - -m
        - vllm.entrypoints.openai.api_server
        - --model
        - meta-llama/Llama-2-7b-hf
        - --tensor-parallel-size
        - "1"
        - --gpu-memory-utilization
        - "0.9"
        - --max-model-len
        - "4096"
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: "1"
            memory: "32Gi"
            cpu: "8"
          requests:
            nvidia.com/gpu: "1"
            memory: "16Gi"
            cpu: "4"
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token
              key: token
        volumeMounts:
        - name: model-cache
          mountPath: /root/.cache/huggingface
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: model-cache-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm-server
  ports:
  - port: 8000
    targetPort: 8000
  type: ClusterIP

Load Balancing

# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: vllm-ingress
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "100m"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
spec:
  rules:
  - host: vllm.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: vllm-service
            port:
              number: 8000

Autoscaling (HPA)

# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-server
  minReplicas: 1
  maxReplicas: 10
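  metrics:
  # Pods metrics must be served through the custom metrics API (for example by Prometheus Adapter).
  # vLLM does not export GPU utilization itself, so this series has to come from another exporter.
  - type: Pods
    pods:
      metric:
        name: vllm:gpu_utilization
      target:
        type: AverageValue
        averageValue: "80"
  # Queue depth; vLLM's own series for waiting requests is vllm:num_requests_waiting.
  - type: Pods
    pods:
      metric:
        name: vllm:queue_size
      target:
        type: AverageValue
        averageValue: "10"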
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
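
The replica count the HPA aims for follows the documented rule desired = ceil(current x currentValue / targetValue), skipped when the ratio is within a tolerance of 1.0 (0.1 by default). The sketch below reproduces only that core rule; `desired_replicas` is a hypothetical helper, and the `behavior` policies and stabilization windows above are not modeled.

```python
import math

def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Core HPA scaling rule, clamped to the configured replica bounds."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to the target: do nothing
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))

# Two replicas with an average queue of 25 against a target of 10.
print(desired_replicas(2, current_value=25, target_value=10))
```

With the scale-up policy above (at most 100% per 60 seconds), a jump from 2 to 5 replicas would be spread over more than one period.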

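Monitoring and Operations

Key Metrics

Category	Metric	Description
Performance	`time_per_output_token`	Time per output token
Performance	`time_to_first_token`	Time to first token (TTFT)
Throughput	`tokens_per_second`	Tokens generated per second
Resources	`gpu_utilization`	GPU utilization
Resources	`gpu_memory_usage`	GPU memory usage
Queue	`num_waiting_requests`	Number of waiting requests
Queue	`num_running_requests`	Number of running requests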
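
The latency and throughput metrics can be derived from token arrival timestamps on the client side. The sketch below is one reasonable set of definitions, with hypothetical names: TTFT is the delay to the first token, time per output token averages the gaps between subsequent tokens, and throughput divides the token count by the total request time.

```python
def latency_metrics(request_start, token_times):
    """Compute TTFT, time per output token, and throughput for one request.

    token_times holds the arrival time of each output token, in seconds.
    """
    n = len(token_times)
    ttft = token_times[0] - request_start
    total = token_times[-1] - request_start
    # Average gap between consecutive tokens after the first one.
    tpot = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    return {"ttft_s": ttft, "tpot_s": tpot, "tokens_per_s": n / total}

metrics = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
print(metrics)
```

Separating TTFT from the per-token time matters because the first is dominated by prefill and queueing, while the second reflects decode speed.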

Prometheus Monitoring

# metrics_exporter.py
# Custom application-level metrics; the vLLM OpenAI server also exposes its own /metrics endpoint
from prometheus_client import start_http_server, Gauge, Counter

# Define the metrics
tokens_generated = Counter('vllm_tokens_generated_total', 'Total tokens generated')
requests_total = Counter('vllm_requests_total', 'Total requests')
gpu_utilization = Gauge('vllm_gpu_utilization', 'GPU utilization percentage')
queue_size = Gauge('vllm_queue_size', 'Number of requests in queue')

# Start the metrics server (9090 is also Prometheus's default port; pick another if both share a host)
start_http_server(9090)

# Update the metrics from the service
def on_request_complete(num_tokens):
    tokens_generated.inc(num_tokens)
    requests_total.inc()

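Logging Configuration

import logging

# Configure logging for the vLLM service
# (the log directory must already exist and be writable)
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('/var/log/vllm/server.log'),
        logging.StreamHandler()
    ]
)

logger = logging.getLogger('vllm')

# Placeholder values for illustration; in a real service they come from the request
request_id, num_input_tokens, num_output_tokens = "req-0001", 128, 256

# Log the request lifecycle (lazy %-style arguments skip formatting when a level is disabled)
logger.info("Request received: %s", request_id)
logger.debug("Input tokens: %d", num_input_tokens)
logger.info("Request completed: %s, output tokens: %d", request_id, num_output_tokens)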

Health Checks

# health_check.py
from fastapi import FastAPI, HTTPException
import httpx

app = FastAPI()
VLLM_URL = "http://localhost:8000"

async def probe(path):
    """Return True if the vLLM server answers `path` with HTTP 200."""
    try:
        async with httpx.AsyncClient(timeout=5.0) as client:
            response = await client.get(f"{VLLM_URL}{path}")
    except httpx.HTTPError:
        return False
    return response.status_code == 200

@app.get("/health")
async def health_check():
    """Liveness endpoint"""
    # Check the status of the vLLM service
    if await probe("/health"):
        return {"status": "healthy"}
    raise HTTPException(status_code=503, detail="Service unhealthy")

@app.get("/ready")
async def readiness_check():
    """Readiness endpoint"""
    # The model list is served only once the model has finished loading
    if await probe("/v1/models"):
        return {"status": "ready"}
    raise HTTPException(status_code=503, detail="Model not loaded")
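
Deployment scripts often need to block until the server is ready, since loading a large model can take minutes. The sketch below polls a probe with a deadline using only the standard library; `wait_until_ready` and `http_probe` are hypothetical helpers, and the URL assumes the default port used in this guide.

```python
import time
import urllib.request

def http_probe(url="http://localhost:8000/health"):
    """Return True if the endpoint answers with HTTP 200."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.status == 200

def wait_until_ready(probe, timeout_s=300.0, interval_s=2.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll `probe` until it returns True or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        try:
            if probe():
                return True
        except OSError:
            pass  # connection refused or HTTP error while the server is starting
        if clock() >= deadline:
            return False
        sleep(interval_s)

# Typical use in a deployment script:
#     if not wait_until_ready(http_probe):
#         raise SystemExit("vLLM did not become ready in time")
```

Injecting `sleep` and `clock` keeps the helper testable without real waiting.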

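Common Problems and Solutions

1. OOM (Out of Memory)

Symptom: CUDA out of memory errors

Solution:

# Lower the share of GPU memory vLLM claims and reduce KV cache demand
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    gpu_memory_utilization=0.7,  # lowered from 0.9 to 0.7 (helps when other processes share the GPU)
    max_model_len=2048,          # cap the maximum sequence length
    max_num_seqs=128,            # reduce concurrency
)

# Use a quantized model
# AWQ/GPTQ 4-bit quantization cuts weight memory by roughly 75% relative to FP16 (the KV cache is not reduced)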

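2. High Time to First Token

Symptom: Time to First Token (TTFT) exceeds 1 second

Solution:

# Enable prefix caching
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    enable_prefix_caching=True,
)

# Use chunked prefill
# In 0.4.x this is opt-in (enable_chunked_prefill=True); it splits long prompts into chunks that are batched with decode steps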

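3. Insufficient Throughput

Symptom: tokens/second is lower than expected

Solution:

# Tune the batching parameters
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    max_num_batched_tokens=4096,  # larger batches
    max_num_seqs=512,             # more concurrent sequences
    gpu_memory_utilization=0.95,  # more GPU memory for the KV cache
)

# Use speculative decoding (argument names follow vLLM 0.4.x)
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    speculative_model="[ngram]",
    num_speculative_tokens=5,
    ngram_prompt_lookup_max=4,    # required for "[ngram]"
    use_v2_block_manager=True,    # required by speculative decoding in 0.4.x
)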

4. Model Fails to Load

Symptom: the model cannot be downloaded or loaded

Solution:

# 1. Check the Hugging Face token
export HUGGING_FACE_HUB_TOKEN=your_token_here

# 2. Use a mirror site (for users in mainland China)
export HF_ENDPOINT=https://hf-mirror.com

# 3. Download the model in advance
huggingface-cli download meta-llama/Llama-2-7b-hf \
    --local-dir /path/to/local/model \
    --local-dir-use-symlinks False

# 4. Serve from the local path
python -m vllm.entrypoints.openai.api_server \
    --model /path/to/local/model

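5. Multi-GPU Communication Problems

Symptom: NCCL errors, or multi-GPU inference hangs

Solution:

# Set NCCL environment variables
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=1  # disable InfiniBand and fall back to TCP sockets (only if IB is unavailable or misconfigured)
export NCCL_SOCKET_IFNAME=eth0  # pin the network interface

# Start worker processes with spawn instead of fork
export VLLM_WORKER_MULTIPROC_METHOD=spawn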

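Best Practices Summary

Deployment Checklist

Choose a suitable quantization scheme (AWQ/GPTQ/FP8)
Set a reasonable gpu_memory_utilization (0.85-0.95)
Enable prefix caching to avoid repeated computation
Configure health checks and monitoring metrics
Set reasonable timeouts and retry policies
Implement request rate limiting and priority queues
Configure an autoscaling policy
Back up models and configuration regularly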
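
Request rate limiting from the checklist is commonly implemented with a token bucket in front of the inference server. The sketch below is a minimal single-process version; `TokenBucket` is a hypothetical class, and a multi-replica deployment would need a shared store or gateway-level limiting instead.

```python
import time

class TokenBucket:
    """Allow bursts up to `capacity` requests, refilled at `rate_per_s`."""
    def __init__(self, rate_per_s, capacity, clock=time.monotonic):
        self.rate = rate_per_s
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self, cost=1.0):
        """Return True and consume `cost` tokens if the request may proceed."""
        now = self.clock()
        # Refill in proportion to the elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

bucket = TokenBucket(rate_per_s=5, capacity=10)
accepted = sum(bucket.allow() for _ in range(20))
print(f"accepted {accepted} of 20 back-to-back requests")
```

Setting `cost` to the requested `max_tokens` instead of 1 would limit by generation budget rather than by request count.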

Performance Benchmarks

Model	GPU	Quantization	Throughput (tokens/s)	Latency (ms/token)
Llama-2-7B	A100 40GB	FP16	1200	35
Llama-2-7B	A100 40GB	AWQ-4bit	1800	25
Llama-2-70B	A100 80GB x8	FP16	800	55
Llama-2-70B	H100 80GB x4	FP8	1500	30

References

This document was last updated on 2024-05-07 and is based on vLLM 0.4.x.