LLM工程化 高级 量化 INT8 INT4 GPTQ

LLM 量化技术详解:从 INT8 到 GPTQ、AWQ

AIEng Hub
阅读约 30 分钟

什么是模型量化?

模型量化(Model Quantization)是将模型参数从高精度(如 FP32/FP16)转换为低精度(如 INT8/INT4)表示的技术,可以显著降低模型的存储和计算需求。

┌─────────────────────────────────────────────────────────────┐
│                    模型量化原理                              │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   FP32 (32位浮点)                                           │
│   ┌─────────────────────────────────────┐                  │
│   │ 符号位 │ 指数位(8) │ 尾数位(23)     │  4字节          │
│   └─────────────────────────────────────┘                  │
│                                                             │
│        ↓ 量化                                                │
│                                                             │
│   INT8 (8位整数)                                            │
│   ┌────────────────┐                                       │
│   │ 0-255 整数范围  │  1字节                                │
│   └────────────────┘                                       │
│                                                             │
│   量化公式:                                                 │
│   Q = round((R - Z) / S)                                    │
│   R = S * (Q - Z)                                           │
│                                                             │
│   其中:                                                     │
│   - R: 原始浮点值                                           │
│   - Q: 量化后的整数值                                        │
│   - S: 缩放因子 (Scale)                                      │
│   - Z: 零点 (Zero Point)                                     │
│                                                             │
└─────────────────────────────────────────────────────────────┘

量化的优势:

  • 存储减少:INT8 比 FP32 减少 75% 存储
  • 内存带宽:降低数据传输压力
  • 推理速度:整数运算比浮点更快
  • 功耗降低:适合边缘设备部署

量化方法对比

方法精度压缩比适用场景主要工具
INT88-bit4x通用推理TensorRT, ONNX Runtime
GPTQ4-bit8xGPU 推理AutoGPTQ, ExLlama
AWQ4-bit8xGPU 推理AutoAWQ, vLLM
GGUF2-8 bit4-16xCPU/边缘llama.cpp, ollama
SmoothQuant8-bit4x大模型vLLM, TensorRT-LLM

INT8 量化实战

1. 动态量化(Post-Training)

# int8_quantization.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch.quantization

def apply_dynamic_quantization(model_path: str, output_path: str):
    """应用动态 INT8 量化"""
    
    # 加载模型
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.float32,
        device_map="cpu"
    )
    
    # 应用动态量化
    quantized_model = torch.quantization.quantize_dynamic(
        model,
        {torch.nn.Linear},  # 量化线性层
        dtype=torch.qint8
    )
    
    # 保存量化模型
    torch.save(quantized_model.state_dict(), f"{output_path}/pytorch_model.bin")
    
    # 计算压缩率
    original_size = sum(p.numel() * p.element_size() for p in model.parameters())
    quantized_size = sum(p.numel() * p.element_size() for p in quantized_model.parameters())
    
    print(f"原始模型大小: {original_size / 1024**2:.2f} MB")
    print(f"量化后大小: {quantized_size / 1024**2:.2f} MB")
    print(f"压缩比: {original_size / quantized_size:.2f}x")
    
    return quantized_model

# 使用示例
# quantized_model = apply_dynamic_quantization("meta-llama/Llama-2-7b", "./quantized_model")

2. 静态量化(Calibration-based)

# static_quantization.py
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

def apply_static_quantization(model_path: str, calibration_data: list):
    """应用静态 INT8 量化"""
    
    model = AutoModelForCausalLM.from_pretrained(model_path)
    model.eval()
    
    # 准备量化配置
    model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
    
    # 插入观察者
    torch.quantization.prepare(model, inplace=True)
    
    # 校准(使用代表性数据)
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    with torch.no_grad():
        for text in calibration_data:
            inputs = tokenizer(text, return_tensors="pt")
            model(**inputs)
    
    # 转换为量化模型
    torch.quantization.convert(model, inplace=True)
    
    return model

# 校准数据示例
calibration_data = [
    "人工智能是计算机科学的一个分支",
    "机器学习是AI的重要技术",
    "深度学习使用神经网络",
    # ... 更多代表性数据
]

GPTQ 量化实战

GPTQ(Gradient-based Post-training Quantization)是一种针对大语言模型的 4-bit 量化方法。

1. 使用 AutoGPTQ

# gptq_quantization.py
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

def quantize_with_gptq(
    model_name: str,
    output_path: str,
    bits: int = 4,
    group_size: int = 128
):
    """使用 GPTQ 量化模型"""
    
    # 量化配置
    quantize_config = BaseQuantizeConfig(
        bits=bits,                    # 4-bit 量化
        group_size=group_size,        # 分组大小
        desc_act=False,               # 是否降序激活
        static_groups=False,
    )
    
    # 加载模型
    model = AutoGPTQForCausalLM.from_pretrained(
        model_name,
        quantize_config,
        device_map="auto",
        trust_remote_code=True
    )
    
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    
    # 准备校准数据
    calibration_data = [
        "auto-gptq is an easy-to-use model quantization library",\n        "with user-friendly apis based on GPTQ algorithm",
        "it enables you to quantize and deploy large language models",
    ]
    
    # 量化
    print("开始量化...")
    model.quantize(calibration_data)
    
    # 保存
    model.save_quantized(output_path)
    tokenizer.save_pretrained(output_path)
    
    print(f"量化完成,保存到: {output_path}")
    return model

# 加载量化模型
def load_gptq_model(model_path: str):
    """加载 GPTQ 量化模型"""
    from auto_gptq import AutoGPTQForCausalLM
    
    model = AutoGPTQForCausalLM.from_quantized(
        model_path,
        device_map="auto",
        use_safetensors=True,
        trust_remote_code=True
    )
    
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    return model, tokenizer

# 使用示例
# quantize_with_gptq("meta-llama/Llama-2-7b-hf", "./llama-7b-gptq")

2. GPTQ 参数调优

# gptq_advanced.py
def quantize_with_different_configs(model_name: str):
    """测试不同 GPTQ 配置"""
    
    configs = [
        {"bits": 4, "group_size": 128, "desc_act": False},  # 标准配置
        {"bits": 4, "group_size": 64, "desc_act": False},   # 更高精度
        {"bits": 4, "group_size": 128, "desc_act": True},   # 更好精度,更慢
        {"bits": 3, "group_size": 128, "desc_act": False},  # 更高压缩
    ]
    
    results = []
    
    for config in configs:
        output_path = f"./quantized_{config['bits']}bit_g{config['group_size']}"
        
        quantize_config = BaseQuantizeConfig(**config)
        model = AutoGPTQForCausalLM.from_pretrained(
            model_name,
            quantize_config,
            device_map="auto"
        )
        
        # 量化并评估
        # ... 评估代码
        
        results.append({
            "config": config,
            "perplexity": ppl,
            "size_mb": size
        })
    
    return results

AWQ 量化实战

AWQ(Activation-aware Weight Quantization)是一种激活感知的权重量化方法,通常比 GPTQ 精度更高。

1. 使用 AutoAWQ

# awq_quantization.py
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

def quantize_with_awq(
    model_path: str,
    output_path: str,
    zero_point: bool = True,
    q_group_size: int = 128,
    w_bit: int = 4,
    version: str = "GEMM"
):
    """使用 AWQ 量化模型"""
    
    # 加载模型
    model = AutoAWQForCausalLM.from_pretrained(
        model_path,
        device_map="auto",
        safetensors=True
    )
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    
    # 准备校准数据
    calibration_data = [
        "AWQ is a quantization method that sets smaller weights to 0",
        "and rounds the rest to the nearest quantization threshold",
        "to reduce the quantization error",
    ]
    
    # 量化配置
    quant_config = {
        "zero_point": zero_point,
        "q_group_size": q_group_size,
        "w_bit": w_bit,
        "version": version,
    }
    
    # 量化
    model.quantize(
        tokenizer,
        quant_config=quant_config,
        calib_data=calibration_data
    )
    
    # 保存
    model.save_quantized(output_path)
    tokenizer.save_pretrained(output_path)
    
    print(f"AWQ 量化完成,保存到: {output_path}")
    return model

# 加载 AWQ 模型
def load_awq_model(model_path: str):
    """加载 AWQ 量化模型"""
    from awq import AutoAWQForCausalLM
    
    model = AutoAWQForCausalLM.from_quantized(
        model_path,
        device_map="auto",
        safetensors=True
    )
    
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    return model, tokenizer

GGUF 量化(llama.cpp)

GGUF 是 llama.cpp 使用的量化格式,支持从 2-bit 到 8-bit 的多种量化级别。

1. 转换和量化

# 1. 克隆 llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# 2. 编译
make

# 3. 安装 Python 依赖
pip install -r requirements.txt

# 4. 转换 HuggingFace 模型为 GGUF
python convert_hf_to_gguf.py \
    --input-dir /path/to/your/model \
    --output-dir ./models \
    --outfile model-f16.gguf

# 5. 量化(Q4_K_M 推荐)
./quantize ./models/model-f16.gguf ./models/model-Q4_K_M.gguf Q4_K_M

2. Python 量化脚本

# gguf_quantize.py
import subprocess
import os

def quantize_to_gguf(
    model_path: str,
    output_dir: str,
    quant_types: list = ["Q4_K_M", "Q5_K_M", "Q8_0"]
):
    """将模型转换为 GGUF 并量化"""
    
    os.makedirs(output_dir, exist_ok=True)
    
    # 步骤1: 转换为 FP16 GGUF
    f16_path = os.path.join(output_dir, "model-f16.gguf")
    
    print("步骤1: 转换为 FP16 GGUF...")
    subprocess.run([
        "python", "convert_hf_to_gguf.py",
        "--input-dir", model_path,
        "--output-dir", output_dir,
        "--outfile", "model-f16.gguf"
    ], check=True)
    
    # 步骤2: 应用不同级别的量化
    for quant_type in quant_types:
        output_path = os.path.join(output_dir, f"model-{quant_type}.gguf")
        
        print(f"步骤2: 应用 {quant_type} 量化...")
        subprocess.run([
            "./quantize",
            f16_path,
            output_path,
            quant_type
        ], check=True)
        
        # 获取文件大小
        size_mb = os.path.getsize(output_path) / (1024 * 1024)
        print(f"  {quant_type}: {size_mb:.2f} MB")
    
    print("量化完成!")

# 量化类型说明
QUANT_TYPES = {
    "Q4_0": "4-bit,最小,质量较低",
    "Q4_K_M": "4-bit,推荐,平衡质量和大小",
    "Q4_K_S": "4-bit,更小,质量稍低",
    "Q5_K_M": "5-bit,更高质量",
    "Q6_K": "6-bit,高质量",
    "Q8_0": "8-bit,接近无损",
}

3. 使用量化模型

# use_gguf.py
from llama_cpp import Llama

def load_gguf_model(model_path: str, **kwargs):
    """加载 GGUF 模型"""
    
    default_params = {
        "n_ctx": 4096,          # 上下文长度
        "n_threads": 8,         # CPU 线程数
        "n_gpu_layers": 0,      # GPU 层数(0=纯CPU)
        "verbose": False,
    }
    default_params.update(kwargs)
    
    model = Llama(model_path=model_path, **default_params)
    return model

def generate_with_gguf(model, prompt: str, **kwargs):
    """使用 GGUF 模型生成文本"""
    
    default_params = {
        "max_tokens": 512,
        "temperature": 0.7,
        "top_p": 0.9,
        "stop": ["</s>", "Human:", "Assistant:"],
    }
    default_params.update(kwargs)
    
    output = model(prompt, **default_params)
    return output["choices"][0]["text"]

# 使用示例
# model = load_gguf_model("./models/model-Q4_K_M.gguf", n_gpu_layers=35)
# response = generate_with_gguf(model, "你好,请介绍一下自己")

量化质量评估

1. 困惑度评估

# evaluate_quantization.py
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import numpy as np

def evaluate_perplexity(model, tokenizer, test_data: list):
    """评估模型困惑度"""
    
    model.eval()
    total_loss = 0
    total_tokens = 0
    
    with torch.no_grad():
        for text in test_data:
            inputs = tokenizer(text, return_tensors="pt")
            
            outputs = model(**inputs, labels=inputs["input_ids"])
            loss = outputs.loss
            
            total_loss += loss.item() * inputs["input_ids"].size(1)
            total_tokens += inputs["input_ids"].size(1)
    
    avg_loss = total_loss / total_tokens
    perplexity = np.exp(avg_loss)
    
    return perplexity

# 对比原始模型和量化模型
def compare_models(original_path: str, quantized_path: str, test_data: list):
    """对比原始模型和量化模型"""
    
    tokenizer = AutoTokenizer.from_pretrained(original_path)
    
    # 加载原始模型
    original_model = AutoModelForCausalLM.from_pretrained(
        original_path,
        torch_dtype=torch.float16,
        device_map="auto"
    )
    
    # 加载量化模型
    quantized_model = AutoModelForCausalLM.from_pretrained(
        quantized_path,
        device_map="auto"
    )
    
    # 评估
    print("评估原始模型...")
    original_ppl = evaluate_perplexity(original_model, tokenizer, test_data)
    
    print("评估量化模型...")
    quantized_ppl = evaluate_perplexity(quantized_model, tokenizer, test_data)
    
    print(f"\n原始模型困惑度: {original_ppl:.2f}")
    print(f"量化模型困惑度: {quantized_ppl:.2f}")
    print(f"相对增长: {(quantized_ppl/original_ppl - 1)*100:.2f}%")

2. 生成质量对比

def compare_generation_quality(
    original_model,
    quantized_model,
    tokenizer,
    test_prompts: list
):
    """对比生成质量"""
    
    results = []
    
    for prompt in test_prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        
        # 原始模型生成
        with torch.no_grad():
            original_output = original_model.generate(
                **inputs,
                max_new_tokens=100,
                temperature=0.7
            )
        original_text = tokenizer.decode(original_output[0], skip_special_tokens=True)
        
        # 量化模型生成
        with torch.no_grad():
            quantized_output = quantized_model.generate(
                **inputs,
                max_new_tokens=100,
                temperature=0.7
            )
        quantized_text = tokenizer.decode(quantized_output[0], skip_special_tokens=True)
        
        results.append({
            "prompt": prompt,
            "original": original_text,
            "quantized": quantized_text
        })
    
    return results

量化选择指南

场景推荐方法配置说明
GPU 推理AWQ/GPTQ4-bit, group_size=128速度最快,显存占用低
CPU 推理GGUFQ4_K_M 或 Q5_K_Mllama.cpp 优化好
边缘设备GGUFQ4_0 或 Q3_K_M最小体积
高精度要求GGUFQ8_0接近 FP16 精度
大模型部署GPTQ4-bit, desc_act=True70B+ 模型推荐

最佳实践

1. 量化流程

1. 选择基座模型(FP16/BF16)

2. 准备校准数据(代表性样本)

3. 选择量化方法(AWQ/GPTQ/GGUF)

4. 执行量化

5. 评估质量(困惑度、生成测试)

6. 部署上线

2. 校准数据准备

def prepare_calibration_data(
    source_files: list,
    num_samples: int = 128,
    max_length: int = 2048
):
    """准备校准数据"""
    
    calibration_data = []
    
    for file_path in source_files:
        with open(file_path, 'r', encoding='utf-8') as f:
            text = f.read()
            
        # 分割为样本
        samples = text.split('\n\n')
        
        for sample in samples[:num_samples // len(source_files)]:
            if len(sample) > 100:  # 过滤太短的内容
                calibration_data.append(sample[:max_length])
    
    return calibration_data[:num_samples]

3. 混合精度部署

# mixed_precision.py
class MixedPrecisionConfig:
    """混合精度配置"""
    
    # 关键层保持高精度
    KEEP_FP16_LAYERS = [
        "lm_head",
        "embed_tokens",
        "norm",
    ]
    
    @classmethod
    def apply(cls, model):
        """应用混合精度"""
        for name, module in model.named_modules():
            if any(layer in name for layer in cls.KEEP_FP16_LAYERS):
                module = module.to(torch.float16)
                print(f"保持 FP16: {name}")

总结

方法精度损失速度提升显存节省推荐场景
INT81.5-2x50%通用推理
GPTQ-4bit2-3x75%GPU 部署
AWQ-4bit2-3x75%GPU 部署(推荐)
GGUF-Q4_K_M3-4x (CPU)75%CPU/边缘
GGUF-Q8_0极低1.5-2x (CPU)50%高精度CPU

选择建议:

  • GPU 环境优先选择 AWQ 4-bit
  • CPU 环境选择 GGUF Q4_K_M
  • 对精度敏感选择 GGUF Q8_0 或 INT8
  • 极致压缩选择 GGUF Q3_K_M

相关资源: