
Multi‑Node Private Deployment of DeepSeek-r1:671B Full Version on K8s + SGLang

Application Prospects#

As the 671B-parameter DeepSeek-R1 model delivers breakthrough results on complex tasks such as code generation and mathematical reasoning, demand for enterprise-grade private deployment is growing rapidly. In today’s market, Ollama—with its lightweight architecture and cross-platform compatibility (supporting the full NVIDIA/AMD GPU stack and mainstream model formats)—offers developers an out-of-the-box local debugging solution. However, its single-node architecture and naive scheduling strategy leave it more than 30% behind specialized inference frameworks such as vLLM and SGLang in throughput under production-grade, high-concurrency inference workloads.

Solution Overview#

This article uses the full DeepSeek-r1-671B model as the baseline and dives into a cloud-native inference acceleration architecture built on Kubernetes + SGLang. By combining the LeaderWorkerSet controller for distributed workload orchestration and the Volcano batch scheduling system for preemptive GPU resource allocation, we build an enterprise-grade deployment solution with the following characteristics:

  1. Performance Leap: RadixAttention in SGLang boosts KV cache reuse by 60%+

  2. Elastic Topology: Supports dynamic scaling for multi-node, multi-GPU setups (compatible with heterogeneous H100/A100 clusters)

  3. Production Ready: Integrated Prometheus + Grafana real-time inference monitoring, with TP99 latency controlled within 200 ms

Why Choose the SGLang Inference Engine#

SGLang vs Ollama: Key Capability Comparison Matrix#

| Capability | SGLang (Production-Grade Engine) | Ollama (Developer Tool) |
| --- | --- | --- |
| Architecture Design | ✅ Distributed inference architecture (multi-node, multi-GPU collaboration) | ❌ Single-node only (local GPU only) |
| Performance | 🔥 300%+ throughput boost (RadixAttention optimizations) | ⏳ Suited for low-concurrency scenarios (naive scheduling strategy) |
| Production Readiness | 📊 Built-in Prometheus metrics + circuit breaking & degradation | ❌ No monitoring or high-availability guarantees |
| Scalability | ⚡ Dynamic scaling + heterogeneous cluster management (K8s/Volcano integration) | ❌ Fixed resource configuration (no cluster support) |
| Enterprise Features | 🔒 Commercial SLA support + custom OP development | ❌ Community-only maintenance |
| Typical Scenarios | Large-scale model deployment in production (e-commerce/finance high-concurrency workloads) | Local debugging for individual developers (fast small-model validation) |

Why SGLang?#

  1. Crushing Performance Advantage

    • Ollama single-GPU QPS ≤ 20, SGLang distributed cluster QPS ≥ 200 (10x improvement)

    • In 32k context scenarios, SGLang inference latency stays stable within 300 ms, while Ollama frequently runs into OOM

  2. Cost Efficiency

    • With KV cache reuse, cluster resource utilization reaches 85%+ (Ollama only 40%–50%)

    • Supports FP8 quantization, reducing hardware costs by up to 60% for the same throughput

  3. Risk Control

    • Ollama lacks circuit breaking/degradation mechanisms; burst traffic can easily cause cascading failures

    • SGLang has built-in multi-level traffic control, ensuring uninterrupted SLA for core business workloads


Decision Recommendations:

  • Choose SGLang when you need to handle online production traffic, run models with tens of billions of parameters and above, and maximize resource utilization.

  • Choose Ollama only for personal learning/research, quick validation of small models, and local testing with no SLA requirements.

Through architecture-level optimizations and production-hardened design, SGLang achieves a generation-level lead over Ollama in performance, stability, and scalability.

Environment Preparation#

In this setup we deploy the full DeepSeek-r1:671B model.

Hardware Configuration

| Server | Count | CPU (cores) | Memory (TB) | OS Version |
| --- | --- | --- | --- | --- |
| NVIDIA A800 80GB | 2 | 128 | 2 | Ubuntu 22.04.5 LTS |

Software Stack

| Software | Version | Notes |
| --- | --- | --- |
| Kubernetes | v1.30.6 | Container orchestration engine |
| GPU Operator | v24.9.1 | Automated GPU driver management and configuration |
| Volcano | v1.9.0 | Scheduling engine |
| NVIDIA Driver | 550.127.05 | GPU driver |
| NVIDIA Fabric Manager | 550.127.05 | NVSwitch interconnect |
| CUDA | 12.4 | CUDA toolkit |
| MLNX_OFED | 24.10-0.7.0.0 | InfiniBand driver |
| NCCL | 2.21.5 | Multi-GPU communication |
| SGLang | v0.4.3.post2-cu124 | LLM inference engine |
| LeaderWorkerSet | v0.5.1 | PodGroup deployment API |
| open-webui | v0.5.14 | AI chat UI tool |

Model Preparation#

Option 1: Download via HuggingFace
Repo: https://huggingface.co/deepseek-ai/DeepSeek-R1
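
If you go with Hugging Face, a minimal command-line sketch (assuming the huggingface_hub CLI is installed and using the same target directory as the ModelScope example below) looks like this:

Terminal window
# Install the Hugging Face CLI and pull the full repository (~642 GB)
pip3 install -U "huggingface_hub[cli]"
huggingface-cli download deepseek-ai/DeepSeek-R1 --local-dir /mnt/catcat_data/model/DeepSeek-R1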

Option 2: Download via ModelScope (recommended for users in mainland China)
Repo: https://modelscope.cn/models/deepseek-ai/DeepSeek-R1/files

Terminal window
# 1. Install ModelScope
pip3 install modelscope
# 2. Download the full model repo
mkdir -p /mnt/catcat_data/model/DeepSeek-R1
nohup modelscope download --model deepseek-ai/DeepSeek-R1 --local_dir /mnt/catcat_data/model/DeepSeek-R1 &

In practice, the full model takes about 642 GB on Linux.
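
A quick sanity check after the download (same path as above; the exact figure may vary slightly):

Terminal window
# The model directory should be roughly 642 GB once all weight shards are present
du -sh /mnt/catcat_data/model/DeepSeek-R1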

Deployment#

Deploying the LWS API#

GitHub repo: https://github.com/kubernetes-sigs/lws

The main benefits of using the LWS API include:

  • Simplified deployment of distributed inference: LWS exposes a declarative API. You only need to define the configurations for Leaders and Workers; the Kubernetes controller then automatically manages their lifecycle. This lets you deploy complex distributed inference workloads without manually managing dependencies and replica counts between Leaders and Workers.

  • Seamless horizontal scaling: As mentioned above, distributed inference services need multiple Pods to work together. When scaling out, these Pods must be treated as an atomic group. LWS integrates seamlessly with the K8s HPA and can be used directly as the HPA scaling target, enabling group-based scaling of the inference service (see the example manifest after this list).

  • Topology-aware scheduling: In distributed inference, different Pods need to exchange large volumes of data. To minimize communication latency, the LWS API incorporates topology-aware scheduling so that Leader and Worker Pods are scheduled onto nodes that are as close as possible in the RDMA network topology.
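
To make the HPA integration concrete, here is a minimal sketch that targets the LeaderWorkerSet deployed later in this article; the HPA name, replica bounds, and CPU threshold are illustrative values rather than part of the original setup:

Terminal window
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sglang-hpa          # illustrative name
  namespace: deepseek
spec:
  scaleTargetRef:
    apiVersion: leaderworkerset.x-k8s.io/v1
    kind: LeaderWorkerSet
    name: sglang            # the LWS created below
  minReplicas: 1
  maxReplicas: 2            # each replica is a full leader + worker group
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80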

Terminal window
# Install the LWS API CRDs and controller
VERSION=v0.5.1
kubectl apply --server-side -f https://github.com/kubernetes-sigs/lws/releases/download/$VERSION/manifests.yaml
# Check the LWS resources
kubectl get pods -n lws-system
kubectl get svc -n lws-system
kubectl api-resources | grep -i lws

Deploy DeepSeek-R1#

deepseek-r1-lws-sglang.yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: sglang
  labels:
    app: sglang
spec:
  replicas: 1
  startupPolicy: LeaderCreated
  rolloutStrategy:
    type: RollingUpdate
    rollingUpdateConfiguration:
      maxSurge: 0
      maxUnavailable: 2
  leaderWorkerTemplate:
    size: 2
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          role: leader
      spec:
        containers:
          - name: sglang-head
            image: lmsysorg/sglang:v0.4.3.post2-cu124
            imagePullPolicy: IfNotPresent
            workingDir: /sgl-workspace
            command: ["sh", "-c"]
            args:
              - >
                cd /sgl-workspace && python3 -m sglang.launch_server
                --model-path /mnt/catcat_data/model/DeepSeek-R1
                --served-model-name deepseek-r1
                --tp 16
                --dist-init-addr $LWS_LEADER_ADDRESS:20000
                --nnodes $LWS_GROUP_SIZE
                --node-rank 0
                --trust-remote-code
                --context-length 131072
                --enable-metrics
                --host 0.0.0.0
                --port 8000
            env:
              - name: GLOO_SOCKET_IFNAME
                value: eth0
              - name: NCCL_IB_HCA
                value: "mlx5_0,mlx5_1,mlx5_4,mlx5_5"
              - name: NCCL_P2P_LEVEL
                value: "NVL"
              - name: NCCL_IB_GID_INDEX
                value: "0"
              - name: NCCL_IB_CUDA_SUPPORT
                value: "1"
              - name: NCCL_IB_DISABLE
                value: "0"
              - name: NCCL_SOCKET_IFNAME
                value: "eth0"
              - name: NCCL_DEBUG
                value: "INFO"
              - name: NCCL_NET_GDR_LEVEL
                value: "2"
              - name: POD_NAME
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.name
              - name: SGLANG_USE_MODELSCOPE
                value: "true"
            ports:
              - containerPort: 8000
                name: http
                protocol: TCP
              - containerPort: 20000
                name: distributed
                protocol: TCP
            resources:
              limits:
                cpu: "128"
                memory: "1Ti"
                nvidia.com/gpu: "8"
                rdma/ib: "4"
              requests:
                cpu: "128"
                memory: "1Ti"
                nvidia.com/gpu: "8"
                rdma/ib: "4"
            securityContext:
              capabilities:
                add:
                  - IPC_LOCK
                  - SYS_PTRACE
            volumeMounts:
              - name: model-volume
                mountPath: /mnt/catcat_data/model
              - name: shm-volume
                mountPath: /dev/shm
              - name: localtime
                mountPath: /etc/localtime
                readOnly: true
            readinessProbe:
              tcpSocket:
                port: 8000
              initialDelaySeconds: 120
              periodSeconds: 30
        volumes:
          - name: model-volume
            hostPath:
              path: /mnt/catcat_data/model
          - name: shm-volume
            emptyDir:
              sizeLimit: 512Gi
              medium: Memory
          - name: localtime
            hostPath:
              path: /etc/localtime
              type: File
        schedulerName: volcano
    workerTemplate:
      metadata:
        name: sglang-worker
      spec:
        containers:
          - name: sglang-worker
            image: lmsysorg/sglang:v0.4.3.post2-cu124
            imagePullPolicy: IfNotPresent
            workingDir: /sgl-workspace
            command: ["sh", "-c"]
            args:
              - >
                cd /sgl-workspace && python3 -m sglang.launch_server
                --model-path /mnt/catcat_data/model/DeepSeek-R1
                --served-model-name deepseek-r1
                --tp 16
                --dist-init-addr $LWS_LEADER_ADDRESS:20000
                --nnodes $LWS_GROUP_SIZE
                --node-rank $LWS_WORKER_INDEX
                --trust-remote-code
                --context-length 131072
                --enable-metrics
                --host 0.0.0.0
                --port 8000
            env:
              - name: GLOO_SOCKET_IFNAME
                value: eth0
              - name: NCCL_IB_HCA
                value: "mlx5_0,mlx5_1,mlx5_4,mlx5_5"
              - name: NCCL_P2P_LEVEL
                value: "NVL"
              - name: NCCL_IB_GID_INDEX
                value: "0"
              - name: NCCL_IB_CUDA_SUPPORT
                value: "1"
              - name: NCCL_IB_DISABLE
                value: "0"
              - name: NCCL_SOCKET_IFNAME
                value: "eth0"
              - name: NCCL_DEBUG
                value: "INFO"
              - name: NCCL_NET_GDR_LEVEL
                value: "2"
              - name: SGLANG_USE_MODELSCOPE
                value: "true"
              - name: LWS_WORKER_INDEX
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']
            ports:
              - containerPort: 8000
                name: http
                protocol: TCP
              - containerPort: 20000
                name: distributed
                protocol: TCP
            resources:
              limits:
                cpu: "128"
                memory: "1Ti"
                nvidia.com/gpu: "8"
                rdma/ib: "4"
              requests:
                cpu: "128"
                memory: "1Ti"
                nvidia.com/gpu: "8"
                rdma/ib: "4"
            securityContext:
              capabilities:
                add:
                  - IPC_LOCK
                  - SYS_PTRACE
            volumeMounts:
              - name: model-volume
                mountPath: /mnt/catcat_data/model
              - name: shm-volume
                mountPath: /dev/shm
              - name: localtime
                mountPath: /etc/localtime
                readOnly: true
        volumes:
          - name: model-volume
            hostPath:
              path: /mnt/catcat_data/model
          - name: shm-volume
            emptyDir:
              sizeLimit: 512Gi
              medium: Memory
          - name: localtime
            hostPath:
              path: /etc/localtime
              type: File
        schedulerName: volcano
Terminal window
kubectl apply -f deepseek-r1-lws-sglang.yaml -n deepseek
kubectl get lws -n deepseek
NAME     AGE
sglang   1h
kubectl get pods -n deepseek | grep sglang
sglang-0     1/1   Running   0   2h
sglang-0-1   1/1   Running   0   2h
Terminal window
# View the leader logs
~# kubectl logs -n deepseek sglang-0
[2025-02-16 12:25:49] server_args=ServerArgs(model_path='deepseek-ai/DeepSeek-R1', tokenizer_path='deepseek-ai/DeepSeek-R1', tokenizer_mode='auto', load_format='auto', trust_remote_code=True, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, quantization=None, context_length=None, device='cuda', served_model_name='deepseek-ai/DeepSeek-R1', chat_template=None, is_embedding=False, revision=None, skip_tokenizer_init=False, host='0.0.0.0', port=30000, mem_fraction_static=0.81, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=4096, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=0.3, cpu_offload_gb=0, prefill_only_one_req=False, tp_size=8, stream_interval=1, stream_output=False, random_seed=773491082, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_pth='sglang_storage', enable_cache_report=False, dp_size=8, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', speculative_draft_model_path=None, speculative_algorithm=None, speculative_num_steps=5, speculative_num_draft_tokens=64, speculative_eagle_topk=8, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_nccl_nvls=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=True, enable_ep_moe=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, return_hidden_states=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False, enable_flashinfer_mla=False)
Downloading Model to directory: /root/.cache/modelscope/hub/deepseek-ai/DeepSeek-R1
Downloading Model to directory: /root/.cache/modelscope/hub/deepseek-ai/DeepSeek-R1
INFO 02-16 12:25:53 __init__.py:190] Automatically detected platform cuda.
INFO 02-16 12:25:53 __init__.py:190] Automatically detected platform cuda.
INFO 02-16 12:26:01 __init__.py:190] Automatically detected platform cuda.
INFO 02-16 12:26:01 __init__.py:190] Automatically detected platform cuda.
INFO 02-16 12:26:01 __init__.py:190] Automatically detected platform cuda.
INFO 02-16 12:26:01 __init__.py:190] Automatically detected platform cuda.
INFO 02-16 12:26:01 __init__.py:190] Automatically detected platform cuda.
INFO 02-16 12:26:01 __init__.py:190] Automatically detected platform cuda.
INFO 02-16 12:26:01 __init__.py:190] Automatically detected platform cuda.
INFO 02-16 12:26:01 __init__.py:190] Automatically detected platform cuda.

Check GPU Memory Usage#
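
Once the model is loaded on both nodes, you can check GPU memory from inside the Pods; a simple check using the Pod names shown above:

Terminal window
# Each of the 8 GPUs per node should show most of its 80 GB in use
kubectl exec -n deepseek sglang-0 -- nvidia-smi
kubectl exec -n deepseek sglang-0-1 -- nvidia-smi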

Service Access#

Create a Service:

deepseek-r1-svc.yaml
apiVersion: v1
kind: Service
metadata:
  name: sglang-api-svc
  labels:
    app: sglang
spec:
  selector:
    leaderworkerset.sigs.k8s.io/name: sglang
    role: leader
  ports:
    - protocol: TCP
      port: 8000
      targetPort: http
      name: http
  type: NodePort

Deploy the Service:

kubectl apply -f deepseek-r1-svc.yaml -n deepseek
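
Because the Service type is NodePort, look up the port Kubernetes assigned before calling the API; use any node IP plus this port in place of ip:port below:

Terminal window
kubectl get svc sglang-api-svc -n deepseek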

Curl Test#

Terminal window
curl -X POST http://ip:port/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "deepseek-r1",
  "messages": [
    {
      "role": "user",
      "content": "What model are you?"
    }
  ],
  "stream": false,
  "temperature": 0.8
}'
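
Since the server was launched with --enable-metrics, two further quick checks are the OpenAI-compatible model list and the Prometheus metrics endpoint exposed by SGLang (same ip:port as above):

Terminal window
# List the served models; the name should match --served-model-name
curl http://ip:port/v1/models
# Prometheus metrics for the monitoring stack mentioned earlier
curl http://ip:port/metrics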

Deploy Open WebUI#

Below is the YAML manifest; not much additional explanation is needed here.

Terminal window
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: open-webui-data-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: nfs-client
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: open-webui-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: open-webui
  template:
    metadata:
      labels:
        app: open-webui
    spec:
      containers:
        - name: open-webui
          image: ghcr.sakiko.de/open-webui/open-webui:main
          imagePullPolicy: Always
          ports:
            - containerPort: 8080
          env:
            - name: OPENAI_API_BASE_URL
              value: "http://ip:port/v1" # Replace with the SGLang API address
            - name: ENABLE_OLLAMA_API # Disable the Ollama API and keep only the OpenAI API
              value: "False"
          volumeMounts:
            - name: open-webui-data
              mountPath: /app/backend/data
      volumes:
        - name: open-webui-data
          persistentVolumeClaim:
            claimName: open-webui-data-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: open-webui-service
spec:
  type: ClusterIP
  ports:
    - port: 3000
      targetPort: 8080
  selector:
    app: open-webui
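
After applying the manifest, the UI is exposed only inside the cluster (ClusterIP). For a quick test from a workstation you can port-forward the Service; the filename below is simply whatever you saved the manifest as:

Terminal window
kubectl apply -f open-webui.yaml
# Forward the Service to http://localhost:3000
kubectl port-forward svc/open-webui-service 3000:3000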
Author: 猫猫博客 · Published: 2025-02-19 · License: CC BY-NC-SA 4.0
Source: https://catcat.blog/en/deepseek-r1-671b-k8ssglang-install.html