
With the smaller deepseek-r1:32b model, a single A100 delivers about 1.8 QPS and answers a single question within 7 seconds, which is enough for a small team. Note that single-card inference on the A100 is actually weaker than on a 4090.


  • Server hardware configuration:
    ○ CPU: Intel(R) Xeon(R) Platinum 8336C CPU @ 2.30GHz * 2
    ○ GPU: NVIDIA A100-SXM4-80GB * 8
    ○ MEM: 1960G
  • Network: local loopback
  • Test tool: Apache Benchmark (ab)
  • Model: deepseek-r1:32b
ab -n 1000 -c 10 -s 30000 -T "application/json" -p payload.json -v 4 http://1.1.1.1:11434/v1/completions > ab_detailed_log01.txt 2>&1

# payload.json
{
  "model": "deepseek-r1:32b",
  "prompt": "你好,你是谁?"
}


| Concurrency | Total requests | Successful | Failed | Throughput (req/s) | Mean response (ms) | 95th pct (ms) | Max (ms) |
|---|---|---|---|---|---|---|---|
| 10 | 1000 | 1000 | 0 | 1.82 | 5504.495 | 7001 | 8423 |
| 50 | 1000 | 1000 | 0 | 1.82 | 27520.991 | 29506 | 32275 |
| 100 | 1000 | 1000 | 0 | 1.83 | 54613.030 | 57132 | 58628 |
| 150 | 1000 | 1000 | 0 | 1.81 | 82645.294 | 87170 | 90249 |
| 200 | 1000 | 1000 | 0 | 1.82 | 110118.858 | 113820 | 117009 |
| 400 | 1000 | 1000 | 0 | 1.82 | 220315.946 | 222405 | 224113 |
| 800 | 1000 | 1000 | 0 | 1.24 | 644384.462 | 688739 | 698507 |
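Throughput stays flat at roughly 1.8 req/s no matter the concurrency, which is what a saturated single GPU looks like: extra concurrency only piles up as queueing latency. A quick Little's-law sanity check (concurrency ≈ throughput × mean latency) over the table's numbers confirms they are internally consistent:

```python
# Little's law for a closed-loop benchmark like ab:
# concurrency ≈ throughput (req/s) * mean response time (s).
# Rows below are (concurrency, throughput req/s, mean response ms) from the table.
rows = [
    (10, 1.82, 5504.495),
    (50, 1.82, 27520.991),
    (100, 1.83, 54613.030),
    (150, 1.81, 82645.294),
    (200, 1.82, 110118.858),
    (400, 1.82, 220315.946),
    (800, 1.24, 644384.462),
]
for concurrency, qps, mean_ms in rows:
    estimated = qps * mean_ms / 1000  # throughput * latency, in "requests in flight"
    print(f"c={concurrency:>3}  qps*latency={estimated:7.1f}")
```

Each estimate lands within about 1% of the configured concurrency, so the latency growth is pure queueing, not measurement noise.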

An 8-card A100 server can run deepseek-r1:671b (671b-q4_K_M, 404 GB), but responses are noticeably slow.

Background

Deploy DeepSeek-Coder-V2-Lite-Instruct:14b locally, with basic requirements for high availability, monitoring, and security. By default Ollama only uses the first GPU, and invoking multiple models concurrently hits a bug (ollama ps reports 100% GPU, but inference actually falls back to the CPU); a single instance also offers no high availability.

Solution

A multi-GPU Ollama deployment: each GPU gets its own systemd service, so the four RTX 4090 cards are used in parallel, with nginx load balancing at the edge.

  • Server name: AIGC-01
  • Server hardware configuration:
    • CPU: AMD Ryzen Threadripper PRO 3955WX 16-Cores
    • GPU: 4 x NVIDIA RTX 4090
    • MEM: 128G
  • Model: DeepSeek-Coder-V2-Lite-Instruct:14b

ollama configuration

# Back up the stock ollama unit
cd /etc/systemd/system/
mv ollama.service ollama.service.bak

# Create 4 independent service units (one port per GPU)
for i in {0..3}; do
sudo tee /etc/systemd/system/ollama-gpu${i}.service > /dev/null <<EOF
[Unit]
Description=Ollama Service (GPU $i)

[Service]
# Key settings: pin one GPU and one port per instance
Environment="CUDA_VISIBLE_DEVICES=$i"
Environment="OLLAMA_HOST=0.0.0.0:$((11434+i))"
ExecStart=/usr/local/bin/ollama serve

Restart=always
User=ollama
Group=ollama

[Install]
WantedBy=multi-user.target
EOF
done


# Reload systemd unit files
sudo systemctl daemon-reload

# Start all GPU instances
sudo systemctl start ollama-gpu{0..3}.service

# Enable start on boot
sudo systemctl enable ollama-gpu{0..3}.service
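With the four units running, each instance can be probed on its own port. Ollama's /api/tags endpoint returns the models known to an instance and doubles as a cheap liveness check; the port arithmetic below mirrors the loop above:

```shell
#!/bin/sh
# Probe each per-GPU Ollama instance on its own port (11434..11437).
for i in 0 1 2 3; do
  port=$((11434 + i))
  if curl -sf --max-time 2 "http://127.0.0.1:${port}/api/tags" > /dev/null; then
    echo "ollama-gpu${i} (port ${port}) OK"
  else
    echo "ollama-gpu${i} (port ${port}) DOWN"
  fi
done
```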

nginx configuration

nginx needs to be built with an extra module, nginx_upstream_check_module, for active upstream health checks:

root@sunmax-AIGC-01:/etc/systemd/system# nginx -V
nginx version: nginx/1.24.0
built by gcc 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.2) 
built with OpenSSL 1.1.1f  31 Mar 2020
TLS SNI support enabled
configure arguments: --with-http_ssl_module --add-module=./nginx_upstream_check_module
# /etc/nginx/sites-available/mga.maxiot-inc.com.conf

# Add inside the http block (if placed outside a server block, make sure it is still in the http context)
log_format detailed '$remote_addr - $remote_user [$time_local] '
                    '"$request" $status $body_bytes_sent '
                    '"$http_referer" "$http_user_agent" '
                    'RT=$request_time URT=$upstream_response_time '
                    'Host=$host Proto=$server_protocol '
                    'Header={\"X-Forwarded-For\": \"$proxy_add_x_forwarded_for\", '
                    '\"X-Real-IP\": \"$remote_addr\", '
                    '\"User-Agent\": \"$http_user_agent\", '
                    '\"Content-Type\": \"$content_type\"} '
                    'SSL=$ssl_protocol/$ssl_cipher '
                    'Upstream=$upstream_addr '
                    'Request_Length=$request_length '
                    'Request_Method=$request_method '
                    'Server_Name=$server_name '
                    'Server_Port=$server_port ';

upstream ollama_backend {
    server 127.0.0.1:11436;
    server 127.0.0.1:11437;
}
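The health-check module is compiled in above, but the upstream block does not yet declare any checks. A sketch of how the module's check directives could be wired in; the probe interval and rise/fall thresholds are illustrative assumptions, not values from the original setup:

```nginx
upstream ollama_backend {
    server 127.0.0.1:11436;
    server 127.0.0.1:11437;

    # nginx_upstream_check_module (values are assumptions): probe every 3s,
    # mark a backend down after 3 failed probes, back up after 2 successes.
    # Ollama answers GET / with a 200 "Ollama is running" response.
    check interval=3000 rise=2 fall=3 timeout=2000 type=http;
    check_http_send "GET / HTTP/1.0\r\n\r\n";
    check_http_expect_alive http_2xx http_3xx;
}
```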

server {
    listen 443 ssl;
    server_name mga.maxiot-inc.com;

    ssl_certificate /etc/nginx/ssl/maxiot-inc.com.pem;
    ssl_certificate_key /etc/nginx/ssl/maxiot-inc.com.key;
    # Access log
    access_log /var/log/nginx/mga_maxiot_inc_com_access.log detailed;

    # Error log
    error_log /var/log/nginx/mga_maxiot_inc_com_error.log;

    # Load balancing: forward requests to the ollama_backend upstream
    location / {
        proxy_pass http://ollama_backend;  # round-robins between the two backends
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }

}
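One caveat when proxying LLM endpoints: nginx's default proxy_read_timeout (60s) can cut off long generations, and response buffering delays streamed tokens. A sketch of directives that could be added inside the location block above; the timeout values are assumptions, not part of the original configuration:

```nginx
    location / {
        proxy_pass http://ollama_backend;
        # long generations can exceed nginx's 60s default read timeout
        proxy_read_timeout 600s;
        proxy_send_timeout 600s;
        # flush streamed tokens to the client as they arrive
        proxy_buffering off;
    }
```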