Multi‑Node Private Deployment of DeepSeek-r1:671B Full Version on K8s + SGLang

Application Prospects#

As the trillion-parameter DeepSeek-r1 model delivers breakthrough results in complex tasks like code generation and mathematical reasoning, enterprise-grade private deployment demands are growing exponentially. In today’s market, Ollama—with its lightweight architecture and cross-platform compatibility (supporting the full NVIDIA/AMD GPU stack and mainstream model formats)—offers developers an out‑of‑the‑box local debugging solution. However, its single-node architecture and naive scheduling strategy result in more than a 30% performance gap in throughput compared with specialized inference frameworks such as vLLM and SGLang under production-grade, high-concurrency inference workloads.

Solution Overview#

This article uses the full DeepSeek-r1-671B model as the baseline and dives into a cloud-native inference acceleration architecture built on Kubernetes + SGLang. By combining the LeaderWorkerSet controller for distributed workload orchestration and the Volcano batch scheduling system for preemptive GPU resource allocation, we build an enterprise-grade deployment solution with the following characteristics:

Performance Leap: RadixAttention in SGLang boosts KV cache reuse by 60%+
Elastic Topology: Supports dynamic scaling for multi-node, multi-GPU setups (compatible with heterogeneous H100/A100 clusters)
Production Ready: Integrated Prometheus + Grafana real-time inference monitoring, with TP99 latency controlled within 200 ms

Why Choose the SGLang Inference Engine#

SGLang vs Ollama: Key Capability Comparison Matrix#

Capability	SGLang (Production-Grade Engine)	Ollama (Developer Tool)
Architecture Design	✅ Distributed inference architecture (multi-node, multi-GPU collaboration)	❌ Single-node only (local GPU only)
Performance	🔥 300%+ throughput boost (RadixAttention optimizations)	⏳ Suited for low-concurrency scenarios (naive scheduling strategy)
Production Readiness	📊 Built-in Prometheus metrics + circuit breaking & degradation	❌ No monitoring or high availability guarantees
Scalability	⚡ Dynamic scaling + heterogeneous cluster management (K8s/Volcano integration)	❌ Fixed resource configuration (no cluster support)
Enterprise Features	🔒 Commercial SLA support + custom OP development	❌ Community-only maintenance
Typical Scenarios	Trillion-parameter model deployment in production (e-commerce/finance high-concurrency workloads)	Local debugging for individual developers (fast small-model validation)

Why SGLang?#

Crushing Performance Advantage
- Ollama single-GPU QPS ≤ 20, SGLang distributed cluster QPS ≥ 200 (10x improvement)
- In 32k context scenarios, SGLang inference latency stays stable within 300 ms, while Ollama frequently runs into OOM
Cost Efficiency
- With KV cache reuse, cluster resource utilization reaches 85%+ (Ollama only 40%–50%)
- Supports FP8 quantization, reducing hardware costs by up to 60% for the same throughput
Risk Control
- Ollama lacks circuit breaking/degradation mechanisms; burst traffic can easily cause cascading failures
- SGLang has built-in multi-level traffic control, ensuring uninterrupted SLA for core business workloads

Decision Recommendations:

Choose SGLang when you need to handle online production traffic, run models with tens of billions of parameters and above, and maximize resource utilization.
Choose Ollama only for personal learning/research, quick validation of small models, and local testing with no SLA requirements.

Through architecture-level optimizations and production-hardened design, SGLang achieves a generation-level lead over Ollama in performance, stability, and scalability.

Environment Preparation#

In this setup we deploy the full DeepSeek-r1:671B model.

Hardware Configuration

Server	Count	CPU (cores)	Memory (TB)	OS Version
NVIDIA A800 80GB	2	128	2	Ubuntu 22.04.5 LTS

Software Stack

Software Version Notes
Kubernetes v1.30.6 Container orchestration engine
GPU Operator v24.9.1 Automated GPU driver management and configuration
Volcano v1.9.0 Scheduling engine
NVIDIA Driver 550.127.05 GPU driver
NVIDIA-Fabric Manager 550.127.05 NVSwitch interconnect
CUDA 12.4 CUDA
MLNX_OFED 24.10-0.7.0.0 IB driver
NCCL 2.21.5 Multi-GPU communication
SGLang v0.4.3.post2-cu124 LLM inference engine
LeaderWorkerSet v0.5.1 PodGroup Deploy API
open-webui v0.5.14 AI chat UI tool

Model Preparation#
Option 1: Download via HuggingFace
Repo: https://huggingface.co/deepseek-ai/DeepSeek-R1

Software	Version	Notes
Kubernetes	v1.30.6	Container orchestration engine
GPU Operator	v24.9.1	Automated GPU driver management and configuration
Volcano	v1.9.0	Scheduling engine
NVIDIA Driver	550.127.05	GPU driver
NVIDIA-Fabric Manager	550.127.05	NVSwitch interconnect
CUDA	12.4	CUDA
MLNX_OFED	24.10-0.7.0.0	IB driver
NCCL	2.21.5	Multi-GPU communication
SGLang	v0.4.3.post2-cu124	LLM inference engine
LeaderWorkerSet	v0.5.1	PodGroup Deploy API
open-webui	v0.5.14	AI chat UI tool

Option 2: Download via ModelScope (recommended for users in mainland China)
Repo: https://modelscope.cn/models/deepseek-ai/DeepSeek-R1/files

1
1、安装ModelScope
2
pip3 install modelscope
3

4
2、下载完整模型repo
5
mkdir /mnt/catcat_data/model/DeepSeek-R1 -p
6
nohup modelscope download --model deepseek-ai/DeepSeek-R1 --local_dir /mnt/catcat_data/model/DeepSeek-R1 &

In practice, the full model takes about 642 GB on Linux.

Deployment#

Deploying the LWS API#

GitHub repo: https://github.com/kubernetes-sigs/lws

The main benefits of using the LWS API include:

Simplified deployment of distributed inference: LWS exposes a declarative API. You only need to define the configurations for Leaders and Workers; the Kubernetes controller then automatically manages their lifecycle. This lets you deploy complex distributed inference workloads without manually managing dependencies and replica counts between Leaders and Workers.
Seamless horizontal scaling: As mentioned above, distributed inference services need multiple Pods to work together. When scaling out, these Pods must be treated as an atomic group. LWS can integrate seamlessly with K8s HPA and be used directly as the HPA scaling target, enabling group-based scaling of the inference service.
Topology-aware scheduling: In distributed inference, different Pods need to exchange large volumes of data. To minimize communication latency, the LWS API incorporates topology-aware scheduling so that Leader and Worker Pods are scheduled onto nodes that are as close as possible in the RDMA network topology.

1
安装 LWS API 的 CRD
2
VERSION=v0.5.1
3
kubectl apply --server-side -f https://github.com/kubernetes-sigs/lws">https://github.com/kubernetes-sigs/lws/releases/download/$VERSION/manifests.yaml
4

5
检查LWS 资源
6
kubectl get pods -n lws-system
7
kubectl get svc -n lws-system
8
kubectl api-resources |grep -i lws

Deploy DeepSeek-R1#

1
apiVersion: leaderworkerset.x-k8s.io/v1
2
kind: LeaderWorkerSet
3
metadata:
4
  name: sglang
5
  labels:
6
    app: sglang
7
spec:
8
  replicas: 1
9
  startupPolicy: LeaderCreated
10
  rolloutStrategy:
11
    type: RollingUpdate
12
    rollingUpdateConfiguration:
13
      maxSurge: 0
14
      maxUnavailable: 2
15
  leaderWorkerTemplate:
16
    size: 2
17
    restartPolicy: RecreateGroupOnPodRestart
18
    leaderTemplate:
19
      metadata:
20
        labels:
21
          role: leader
22
      spec:
23
        containers:
24
          - name: sglang-head
25
            image: lmsysorg/sglang:v0.4.3.post2-cu124
26
            imagePullPolicy: IfNotPresent
27
            workingDir: /sgl-workspace
28
            command: ["sh", "-c"]
29
            args:
30
            - >
31
              cd /sgl-workspace && python3 -m sglang.launch_server
32
              --model-path /mnt/catcat_data/model/DeepSeek-R1
33
              --served-model-name deepseek-r1
34
              --tp 16
35
              --dist-init-addr $LWS_LEADER_ADDRESS:20000
36
              --nnodes $LWS_GROUP_SIZE
37
              --node-rank 0
38
              --trust-remote-code
39
              --context-length 131072
40
              --enable-metrics
41
              --host 0.0.0.0
42
              --port 8000
43
            env:
44
              - name: GLOO_SOCKET_IFNAME
45
                value: eth0
46
              - name: NCCL_IB_HCA
47
                value: "mlx5_0,mlx5_1,mlx5_4,mlx5_5"
48
              - name: NCCL_P2P_LEVEL
49
                value: "NVL"
50
              - name: NCCL_IB_GID_INDEX
51
                value: "0"
52
              - name: NCCL_IB_CUDA_SUPPORT
53
                value: "1"
54
              - name: NCCL_IB_DISABLE
55
                value: "0"
56
              - name: NCCL_SOCKET_IFNAME
57
                value: "eth0"
58
              - name: NCCL_DEBUG
59
                value: "INFO"
60
              - name: NCCL_NET_GDR_LEVEL
61
                value: "2"
62
              - name: POD_NAME
63
                valueFrom:
64
                  fieldRef:
65
                    fieldPath: metadata.name
66
              - name: SGLANG_USE_MODELSCOPE
67
                value: "true"
68
            ports:
69
            - containerPort: 8000
70
              name: http
71
              protocol: TCP
72
            - containerPort: 20000
73
              name: distributed
74
              protocol: TCP
75
            resources:
76
              limits:
77
                cpu: "128"
78
                memory: "1Ti"
79
                nvidia.com/gpu: "8"
80
                rdma/ib: "4"
81
              requests:
82
                cpu: "128"
83
                memory: "1Ti"
84
                nvidia.com/gpu: "8"
85
                rdma/ib: "4"
86
            securityContext:
87
              capabilities:
88
                add:
89
                - IPC_LOCK
90
                - SYS_PTRACE
91
            volumeMounts:
92
              - mountPath: /mnt/catcat_data/model
93
                name: model-volume
94
              - mountPath: /dev/shm
95
                name: shm-volume
96
              - name: localtime
97
                mountPath: /etc/localtime
98
                readOnly: true
99
            readinessProbe:
100
              tcpSocket:
101
                port: 8000
102
              initialDelaySeconds: 120
103
              periodSeconds: 30
104
        volumes:
105
          - name: model-volume
106
            hostPath:
107
              path: /mnt/catcat_data/model
108
          - name: shm-volume
109
            emptyDir:
110
              sizeLimit: 512Gi
111
              medium: Memory
112
          - name: localtime
113
            hostPath:
114
              path: /etc/localtime
115
              type: File
116
        schedulerName: volcano
117
    workerTemplate:
118
      metadata:
119
        name: sglang-worker
120
      spec:
121
        containers:
122
          - name: sglang-worker
123
            image: lmsysorg/sglang:v0.4.3.post2-cu124
124
            imagePullPolicy: IfNotPresent
125
            workingDir: /sgl-workspace
126
            command: ["sh", "-c"]
127
            args:
128
            - >
129
              cd /sgl-workspace && python3 -m sglang.launch_server
130
              --model-path /mnt/catcat_data/model/DeepSeek-R1
131
              --served-model-name deepseek-r1
132
              --tp 16
133
              --dist-init-addr $LWS_LEADER_ADDRESS:20000
134
              --nnodes $LWS_GROUP_SIZE
135
              --node-rank $LWS_WORKER_INDEX
136
              --trust-remote-code
137
              --context-length 131072
138
              --enable-metrics
139
              --host 0.0.0.0
140
              --port 8000
141
            env:
142
              - name: GLOO_SOCKET_IFNAME
143
                value: eth0
144
              - name: NCCL_IB_HCA
145
                value: "mlx5_0,mlx5_1,mlx5_4,mlx5_5"
146
              - name: NCCL_P2P_LEVEL
147
                value: "NVL"
148
              - name: NCCL_IB_GID_INDEX
149
                value: "0"
150
              - name: NCCL_IB_CUDA_SUPPORT
151
                value: "1"
152
              - name: NCCL_IB_DISABLE
153
                value: "0"
154
              - name: NCCL_SOCKET_IFNAME
155
                value: "eth0"
156
              - name: NCCL_DEBUG
157
                value: "INFO"
158
              - name: NCCL_NET_GDR_LEVEL
159
                value: "2"
160
              - name: SGLANG_USE_MODELSCOPE
161
                value: "true"
162
              - name: LWS_WORKER_INDEX
163
                valueFrom:
164
                  fieldRef:
165
                    fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']
166
            ports:
167
            - containerPort: 8000
168
              name: http
169
              protocol: TCP
170
            - containerPort: 20000
171
              name: distributed
172
              protocol: TCP
173
            resources:
174
              limits:
175
                cpu: "128"
176
                memory: "1Ti"
177
                nvidia.com/gpu: "8"
178
                rdma/ib: "4"
179
              requests:
180
                cpu: "128"
181
                memory: "1Ti"
182
                nvidia.com/gpu: "8"
183
                rdma/ib: "4"
184
            securityContext:
185
              capabilities:
186
                add:
187
                - IPC_LOCK
188
                - SYS_PTRACE
189
            volumeMounts:
190
              - mountPath: /mnt/catcat_data/model
191
                name: model-volume
192
              - mountPath: /dev/shm
193
                name: shm-volume
194
              - name: localtime
195
                mountPath: /etc/localtime
196
                readOnly: true
197
        volumes:
198
          - name: model-volume
199
            hostPath:
200
              path: /mnt/catcat_data/model
201
          - name: shm-volume
202
            emptyDir:
203
              sizeLimit: 512Gi
204
              medium: Memory
205
          - name: localtime
206
            hostPath:
207
              path: /etc/localtime
208
              type: File
209
        schedulerName: volcano

1
kubectl apply -f deepseek-r1-lws-sglang.yaml
2

3
kubectl get lws -n deepseek
4
NAME     AGE
5
sglang   1h
6

7
kubectl get pods -n deepseek |grep sglang
8
sglang-0                                 1/1     Running   0          2h
9
sglang-0-1                               1/1     Running   0         2h

1
##查看日志
2
~# kubectl logs -n deepseek sglang-0
3
[2025-02-16 12:25:49] server_args=ServerArgs(model_path='deepseek-ai/DeepSeek-R1', tokenizer_path='deepseek-ai/DeepSeek-R1', tokenizer_mode='auto', load_format='auto', trust_remote_code=True, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, quantization=None, context_length=None, device='cuda', served_model_name='deepseek-ai/DeepSeek-R1', chat_template=None, is_embedding=False, revision=None, skip_tokenizer_init=False, host='0.0.0.0', port=30000, mem_fraction_static=0.81, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=4096, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=0.3, cpu_offload_gb=0, prefill_only_one_req=False, tp_size=8, stream_interval=1, stream_output=False, random_seed=773491082, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_pth='sglang_storage', enable_cache_report=False, dp_size=8, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', speculative_draft_model_path=None, speculative_algorithm=None, speculative_num_steps=5, speculative_num_draft_tokens=64, speculative_eagle_topk=8, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_nccl_nvls=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=True, enable_ep_moe=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, return_hidden_states=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False, enable_flashinfer_mla=False)
4
Downloading Model to directory: /root/.cache/modelscope/hub/deepseek-ai/DeepSeek-R1
5
Downloading Model to directory: /root/.cache/modelscope/hub/deepseek-ai/DeepSeek-R1
6
INFO 02-16 12:25:53 __init__.py:190] Automatically detected platform cuda.
7
INFO 02-16 12:25:53 __init__.py:190] Automatically detected platform cuda.
8
INFO 02-16 12:26:01 __init__.py:190] Automatically detected platform cuda.
9
INFO 02-16 12:26:01 __init__.py:190] Automatically detected platform cuda.
10
INFO 02-16 12:26:01 __init__.py:190] Automatically detected platform cuda.
11
INFO 02-16 12:26:01 __init__.py:190] Automatically detected platform cuda.
12
INFO 02-16 12:26:01 __init__.py:190] Automatically detected platform cuda.
13
INFO 02-16 12:26:01 __init__.py:190] Automatically detected platform cuda.
14
INFO 02-16 12:26:01 __init__.py:190] Automatically detected platform cuda.
15
INFO 02-16 12:26:01 __init__.py:190] Automatically detected platform cuda.

Check GPU Memory Usage#

Service Access#

Create a Service:

1
apiVersion: v1
2
kind: Service
3
metadata:
4
  name: sglang-api-svc
5
  labels:
6
    app: sglang
7
spec:
8
  selector:
9
      leaderworkerset.sigs.k8s.io/name: sglang
10
      role: leader
11
  ports:
12
    - protocol: TCP
13
      port: 8000
14
      targetPort: http
15
      name: http
16
  type: NodePort

Deploy the Service:

kubectl apply -f deepseek-r1-svc.yaml -n deepseek

Curl Test#

1
curl -X POST http://ip:port/v1/chat/completions -H "Content-Type: application/json" -d '{
2
    "model": "/model",
3
    "messages": [
4
        {
5
            "role": "user",
6
            "content": "你是什么模型"
7
        }
8
    ],
9
    "stream": false,
10
    "temperature": 0.8
11
}'

Deploy Open WebUI#

Below is the YAML manifest; not much additional explanation is needed here.

1
apiVersion: v1
2
kind: PersistentVolumeClaim
3
metadata:
4
  name: open-webui-data-pvc
5
spec:
6
  accessModes:
7
    - ReadWriteOnce
8
  resources:
9
    requests:
10
      storage: 100Gi
11
  storageClassName: nfs-client
12

13
---
14
apiVersion: apps/v1
15
kind: Deployment
16
metadata:
17
  name: open-webui-deployment
18
spec:
19
  replicas: 1
20
  selector:
21
    matchLabels:
22
      app: open-webui
23
  template:
24
    metadata:
25
      labels:
26
        app: open-webui
27
    spec:
28
      containers:
29
      - name: open-webui
30
        image: ghcr.sakiko.de/open-webui/open-webui:main
31
        imagePullPolicy: Always
32
        ports:
33
        - containerPort: 8080
34
        env:
35
        - name: OPENAI_API_BASE_URL
36
          value: "http://ip:port/v1"   # 替换为SGLang API
37
        - name: ENABLE_OLLAMA_API # 禁用 Ollama API，只保留 OpenAI API
38
          value: "False"
39
        volumeMounts:
40
        - name: open-webui-data
41
          mountPath: /app/backend/data
42
      volumes:
43
      - name: open-webui-data
44
        persistentVolumeClaim:
45
          claimName: open-webui-data-pvc
46

47
---
48
apiVersion: v1
49
kind: Service
50
metadata:
51
  name: open-webui-service
52
spec:
53
  type: ClusterIP
54
  ports:
55
    - port: 3000
56
      targetPort: 8080
57
  selector:
58
    app: open-webui