<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>LLM on 🌲Treetopia🌲</title>
    <link>https://tree2601.github.io/en/tags/llm/</link>
    <description>Recent content in LLM on 🌲Treetopia🌲</description>
    <generator>Hugo -- 0.154.2</generator>
    <language>en</language>
    <lastBuildDate>Tue, 06 Jan 2026 11:16:30 +0800</lastBuildDate>
    <atom:link href="https://tree2601.github.io/en/tags/llm/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>DeepSeek-671B Distributed Deployment</title>
      <link>https://tree2601.github.io/en/posts/2026/deepseek-671b/</link>
      <pubDate>Tue, 06 Jan 2026 11:16:30 +0800</pubDate>
      <guid>https://tree2601.github.io/en/posts/2026/deepseek-671b/</guid>
      <description>&lt;h3 id=&#34;1-overview&#34;&gt;1. Overview&lt;/h3&gt;
&lt;p&gt;a. This guide describes the deployment of the DeepSeek-671B model across two servers, each equipped with 8x NVIDIA L20 GPUs. The stack uses Docker for containerization, vLLM as the high-performance inference engine, and Ray as the distributed computing framework.&lt;/p&gt;
&lt;p&gt;b. Official Documentation: &lt;a href=&#34;https://docs.vllm.ai/en/v0.8.1/serving/distributed_serving.html&#34;&gt;vLLM-Distributed&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;c. The official tutorial involves complex steps and requires frequent switching between multiple SSH sessions. To simplify the process, this article consolidates and optimizes the official workflow into a systematic, one-stop deployment guide; a minimal launch sketch follows below.&lt;/p&gt;
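&lt;p&gt;As a minimal sketch (not the exact commands from the article), a two-node launch via the vLLM Python API could look like the following; the model path is a placeholder.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Minimal sketch: serve one model across 2 nodes x 8 GPUs with vLLM on Ray.
# Assumes the Ray cluster is already up: `ray start --head` on node 1 and
# `ray start --address=...` on node 2; this script runs on the head node.
from vllm import LLM, SamplingParams

llm = LLM(
    model=&#39;/models/DeepSeek-671B&#39;,        # hypothetical local model path
    tensor_parallel_size=8,               # split each layer across a node&#39;s 8 GPUs
    pipeline_parallel_size=2,             # split the layer stack across the 2 nodes
    distributed_executor_backend=&#39;ray&#39;,   # run workers on the Ray cluster
)

outputs = llm.generate([&#39;Hello&#39;], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
&lt;/code&gt;&lt;/pre&gt;</description>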
    </item>
    <item>
      <title>L20 8-GPU Server Deep Dive: Integrated Deployment Guide for Multimodal AI Systems (LLM &#43; VLM &#43; RAG &#43; ASR &#43; Dify &#43; MinerU)</title>
      <link>https://tree2601.github.io/en/posts/2026/l20/</link>
      <pubDate>Mon, 05 Jan 2026 16:56:40 +0800</pubDate>
      <guid>https://tree2601.github.io/en/posts/2026/l20/</guid>
      <description>&lt;h3 id=&#34;overview&#34;&gt;Overview&lt;/h3&gt;
&lt;p&gt;This guide provides a step-by-step walkthrough for deploying a full-stack multimodal AI system on a single server equipped with 8x NVIDIA L20 GPUs. The stack includes an LLM, a VLM, Embedding/Reranker models (RAG), ASR, Dify (an LLM orchestration and agent platform), and MinerU (PDF extraction).&lt;/p&gt;
&lt;h3 id=&#34;vram-estimation-for-llms&#34;&gt;VRAM Estimation for LLMs&lt;/h3&gt;
&lt;p&gt;&lt;img alt=&#34;formula&#34; loading=&#34;lazy&#34; src=&#34;https://tree2601.github.io/images/l20/formula.png&#34;&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt=&#34;parameter&#34; loading=&#34;lazy&#34; src=&#34;https://tree2601.github.io/images/l20/parameter.png&#34;&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Key Strategy:&lt;/strong&gt; Since Large Language Model (LLM) performance correlates more strongly with parameter count (in billions, B) than with quantization level, we prioritize models with more parameters over less aggressive quantization. For this deployment, we selected the &lt;strong&gt;int4 AWQ versions&lt;/strong&gt; of &lt;strong&gt;Qwen3-235B&lt;/strong&gt; and &lt;strong&gt;GLM-4.5V-106B&lt;/strong&gt; to maximize overall capability within the available VRAM; a worked estimate follows below.&lt;/p&gt;
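&lt;p&gt;As a rough worked example (a common rule-of-thumb heuristic, not necessarily the exact formula pictured above), the weight footprint at a given quantization level can be estimated as follows:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Rule-of-thumb VRAM estimate for model weights (KV cache excluded).
# The formula and the 1.2 overhead factor are assumed heuristics, not
# necessarily the exact formula shown in the image above.
def estimate_vram_gb(params_b, quant_bits, overhead=1.2):
    # params_b billion parameters at 4 bytes (fp32), scaled by 32 / quant_bits
    return params_b * 4 / (32 / quant_bits) * overhead

print(estimate_vram_gb(235, 4))  # Qwen3-235B at int4: about 141 GB
print(estimate_vram_gb(106, 4))  # GLM-4.5V-106B at int4: about 64 GB
# Both fit within 8x L20 = 8 x 48 GB = 384 GB of total VRAM.
&lt;/code&gt;&lt;/pre&gt;</description>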
    </item>
  </channel>
</rss>
