Overview

This guide provides a step-by-step walkthrough for deploying a full-stack multimodal AI system on a single server equipped with 8x NVIDIA L20 GPUs (48 GB each). The stack includes an LLM, a VLM, embedding and reranker models for RAG, ASR, Dify (an LLM orchestration and agent platform), and MinerU (PDF extraction).

VRAM Estimation for LLMs

A common rule of thumb for the VRAM needed to hold model weights is:

VRAM (GB) ≈ Parameters (B) × (Quantization bits ÷ 8) × 1.2

where Parameters (B) is the parameter count in billions, the bits term converts the quantization level into bytes per parameter (e.g. 0.5 bytes for int4), and the 1.2 factor adds roughly 20% headroom for the KV cache and runtime buffers.

Key Strategy: LLM capability correlates more strongly with parameter count than with quantization precision, so we prioritize larger models at lower precision. For this deployment, we selected the int4 AWQ versions of Qwen3-235B and GLM-4.5V-106B to maximize overall capability within the available VRAM (8 × 48 GB = 384 GB).

Estimated VRAM occupancy for Qwen3-235B (int4 AWQ): 235 × 0.5 × 1.2 ≈ 141 GB, so it gets four dedicated L20s (192 GB).

Estimated VRAM occupancy for GLM-4.5V-106B (int4 AWQ): 106 × 0.5 × 1.2 ≈ 64 GB, which fits on the remaining four L20s alongside the lighter services.
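The estimates above can be reproduced with a small helper. Note the 1.2 overhead factor is a rule of thumb, not an exact figure:

```python
def estimate_vram_gb(params_b: float, quant_bits: int, overhead: float = 1.2) -> float:
    """Rough VRAM needed to hold model weights, in GB.

    params_b:   parameter count in billions
    quant_bits: bits per parameter (16 for fp16/bf16, 4 for int4 AWQ)
    overhead:   headroom for KV cache and runtime buffers (rule of thumb)
    """
    return params_b * (quant_bits / 8) * overhead

# int4 AWQ estimates for this deployment (8x L20 = 8 x 48 GB):
print(f"Qwen3-235B:    {estimate_vram_gb(235, 4):.0f} GB")  # ~141 GB -> 4 GPUs
print(f"GLM-4.5V-106B: {estimate_vram_gb(106, 4):.0f} GB")  # ~64 GB  -> 4 GPUs
```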

Note: Models can be downloaded via HuggingFace or ModelScope.


1. Large Language Model (LLM)

Prerequisites: Docker installed, vLLM image (v0.11.0 or later), and Qwen3-235B-AWQ weights.

Deployment (Docker Compose):

version: '3'
services:
  qwen3-235b-instruct:
    image: vllm/vllm-openai:v0.11.0
    container_name: qwen3-235b-instruct
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
              device_ids: ['0','1','2','3'] # Dedicated 4 GPUs
    volumes:
      - /your/path/to/qwen3-235b-instruct-awq:/root/models
    ports:
      - "10000:8000"
    shm_size: '2g'
    command: >
      --model /root/models
      --host 0.0.0.0
      --port 8000
      --trust-remote-code 
      --served-model-name LLM
      --gpu-memory-utilization 0.75
      --enable-auto-tool-choice
      --tool-call-parser hermes
      --max-model-len 65536
      --tensor-parallel-size 4
      --api-key your-api-key
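Once the container is healthy, the endpoint speaks the OpenAI-compatible chat API. A minimal stdlib-only smoke test; the host, port, model name, and key below match the Compose file above, so adjust them to your environment:

```python
import json
import urllib.request

def chat_request(prompt: str, base_url: str = "http://localhost:10000",
                 api_key: str = "your-api-key") -> urllib.request.Request:
    """Build an OpenAI-compatible chat completion request for the vLLM server."""
    payload = {
        "model": "LLM",  # matches --served-model-name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
        method="POST",
    )

# Usage (requires the server to be up):
#   with urllib.request.urlopen(chat_request("Say hello."), timeout=120) as resp:
#       print(json.loads(resp.read())["choices"][0]["message"]["content"])
```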

2. Vision Language Model (VLM)

Prerequisites: Docker installed, vLLM image (v0.11.0+), and GLM-4.5V-106B-AWQ weights.

Deployment (Docker Compose):

version: '3'
services:
  glm-4.5v-106b:
    image: vllm/vllm-openai:v0.11.0
    container_name: glm-4.5v-106b
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
              device_ids: ['4', '5', '6', '7'] # Dedicated 4 GPUs
    command: >
      --model /root/models
      --host 0.0.0.0
      --port 8000
      --trust-remote-code
      --enable-auto-tool-choice
      --enable-expert-parallel 
      --served-model-name VLM
      --tool-call-parser glm45
      --reasoning-parser glm45
      --gpu-memory-utilization 0.45
      --max-model-len 42000
      --tensor-parallel-size 4
      --api-key your-api-key
    volumes:
      - /your/path/to/glm-4.5v-awq:/root/models
    ports:
      - "10001:8000"
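The VLM endpoint accepts the same chat API with image content parts. A sketch of the multimodal message format, attaching a local image as a base64 data URL (port, model name, and key taken from the Compose file above):

```python
import base64
import json
import urllib.request

def vision_request(image_path: str, question: str,
                   base_url: str = "http://localhost:10001",
                   api_key: str = "your-api-key") -> urllib.request.Request:
    """Build an OpenAI-compatible chat request with one image attached."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    payload = {
        "model": "VLM",  # matches --served-model-name
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
                {"type": "text", "text": question},
            ],
        }],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
        method="POST",
    )

# Usage (requires the server to be up):
#   req = vision_request("chart.jpg", "Describe this image.")
#   with urllib.request.urlopen(req, timeout=300) as resp:
#       print(json.loads(resp.read())["choices"][0]["message"]["content"])
```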

3. Embedding & Reranker (RAG)

These models are lightweight and can share GPU resources with other services. The following definitions belong under the top-level services: key of a Compose file (they are shown here without the parent key for brevity):

services:
  qwen3-4b-embedding:
    image: vllm/vllm-openai:v0.11.0
    container_name: qwen3-4b-embedding
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
              device_ids: ['4','5','6','7']
    volumes:
      - /your/path/to/qwen3-4b-embedding:/root/models
    ports:
      - "10004:8000"
    shm_size: '1g'
    command: >
      --model /root/models
      --host 0.0.0.0
      --port 8000
      --trust-remote-code 
      --served-model-name EMBEDDING
      --gpu-memory-utilization 0.08
      --max-model-len 8192
      --tensor-parallel-size 4
      --api-key your-api-key
      
  qwen3-4b-reranker:
    image: vllm/vllm-openai:v0.11.0
    container_name: qwen3-4b-reranker
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
              device_ids: ['4','5','6','7']
    volumes:
      - /your/path/to/qwen3-4b-reranker:/root/models
    ports:
      - "10005:8000"
    shm_size: '1g'
    command: >
      --model /root/models
      --host 0.0.0.0
      --port 8000
      --trust-remote-code 
      --served-model-name RERANKER
      --task score
      --gpu-memory-utilization 0.08
      --max-model-len 8192
      --tensor-parallel-size 4
      --api-key your-api-key
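The embedding service exposes the OpenAI-compatible /v1/embeddings route; the reranker (started with --task score) exposes a rerank route, which is /v1/rerank in recent vLLM versions but should be checked against your version's docs. Stdlib-only request builders, with ports and keys taken from the Compose file above:

```python
import json
import urllib.request

def _post(url: str, payload: dict, api_key: str) -> urllib.request.Request:
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
        method="POST",
    )

def embed_request(texts, base_url="http://localhost:10004", api_key="your-api-key"):
    """OpenAI-compatible embeddings request against the EMBEDDING service."""
    return _post(f"{base_url}/v1/embeddings",
                 {"model": "EMBEDDING", "input": texts}, api_key)

def rerank_request(query, documents, base_url="http://localhost:10005",
                   api_key="your-api-key"):
    """Rerank request against the RERANKER service (route may vary by vLLM version)."""
    return _post(f"{base_url}/v1/rerank",
                 {"model": "RERANKER", "query": query, "documents": documents}, api_key)

# Usage (requires the services to be up):
#   with urllib.request.urlopen(embed_request(["hello world"]), timeout=60) as resp:
#       print(len(json.loads(resp.read())["data"][0]["embedding"]))
```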

4. Dify (Orchestration Platform)

Clone the Dify repository, then configure the environment variables. We strongly recommend changing the default Nginx proxy ports to avoid conflicts with other services on the host:

git clone https://github.com/langgenius/dify.git
cd dify/docker
cp .env.example .env

# Edit .env and modify these parameters:
# EXPOSE_NGINX_PORT=10080
# EXPOSE_NGINX_SSL_PORT=10443

docker compose up -d

5. Automatic Speech Recognition (ASR)

Prerequisites: Conda environment and FunASR model weights.

conda activate your-conda-env

# Required Models:
# 1. ASR: speech_paraformer-large_asr_nat-zh-cn-16k
# 2. VAD: speech_fsmn_vad_zh-cn-16k
# 3. Punctuation: punc_ct-transformer_zh-cn-common
# 4. Speaker Diarization: speech_campplus_sv_zh-cn_16k

Minimal Inference Script (asr_minimal.py):

import argparse
from funasr import AutoModel

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("audio", type=str)
    args = parser.parse_args()

    model = AutoModel(
        model="./speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404", # Main ASR
        vad_model="./speech_fsmn_vad_zh-cn-16k-common-pytorch",              # Voice Activity Detection
        punc_model="./punc_ct-transformer_zh-cn-common-vocab272727-pytorch", # Punctuation Restoration
        spk_model="./speech_campplus_sv_zh-cn_16k-common",                   # Speaker Diarization
    )
    
    result = model.generate(input=args.audio)
    print("Transcription Result:")
    print(result[0]["text"])

if __name__ == "__main__":
    main()
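With spk_model set, FunASR's result typically includes per-sentence segments with speaker labels. The exact schema can vary between FunASR versions, so the sentence_info / spk field names below are assumptions to verify against your output. A small formatter sketch:

```python
def format_diarized(result: dict) -> str:
    """Render FunASR output as 'speaker N: text' lines.

    Assumes each entry of result["sentence_info"] carries 'spk' and 'text'
    fields, as produced when a speaker model is configured (verify against
    your FunASR version); falls back to the plain transcript otherwise.
    """
    lines = [f"speaker {seg['spk']}: {seg['text']}"
             for seg in result.get("sentence_info", [])]
    return "\n".join(lines) if lines else result.get("text", "")

# Example with a mocked result:
mock = {"text": "hello there",
        "sentence_info": [{"spk": 0, "text": "hello"},
                          {"spk": 1, "text": "there"}]}
print(format_diarized(mock))
# speaker 0: hello
# speaker 1: there
```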

6. MinerU (Document Parsing)

Prerequisites: MinerU v2.5 Docker image.

services:
  mineru-server:
    image: mineru:v2.5
    container_name: mineru-server
    ports:
      - 30000:30000
    environment:
      MINERU_MODEL_SOURCE: local
    ipc: host
    entrypoint: mineru-vllm-server
    command:
      --host 0.0.0.0
      --port 30000
      --data-parallel-size 2 
      --gpu-memory-utilization 0.1
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['4','5']
              capabilities: [gpu]

  mineru-api:
    image: mineru:v2.5
    container_name: mineru-api
    profiles: ["api"]
    ports:
      - 10018:8000
    environment:
      MINERU_MODEL_SOURCE: local
    entrypoint: mineru-api
    command:
      --host 0.0.0.0
      --port 8000
      --data-parallel-size 2 
      --gpu-memory-utilization 0.1
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['6','7']
              capabilities: [gpu]

  mineru-gradio:
    image: mineru:v2.5
    container_name: mineru-gradio
    profiles: ["gradio"]
    ports:
      - 30002:7860
    environment:
      MINERU_MODEL_SOURCE: local
    entrypoint: mineru-gradio
    command:
      --server-name 0.0.0.0
      --server-port 7860
      --enable-vllm-engine true
      --data-parallel-size 2
      --gpu-memory-utilization 0.1
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["6","7"] 
              capabilities: [gpu]
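A client sketch for the mineru-api service on port 10018. The /file_parse route and the "files" field name are assumptions based on MinerU's FastAPI server and should be checked against your version's API docs; the multipart body is built by hand so only the standard library is needed:

```python
import urllib.request
import uuid

def build_multipart(field, filename, data, boundary=None):
    """Build a single-file multipart/form-data body by hand (stdlib only)."""
    boundary = boundary or uuid.uuid4().hex
    head = (f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
            "Content-Type: application/pdf\r\n\r\n").encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + data + tail, boundary

def parse_pdf(path, url="http://localhost:10018/file_parse"):
    # NOTE: route and field name are assumptions; check the MinerU docs.
    with open(path, "rb") as f:
        body, boundary = build_multipart("files", path, f.read())
    req = urllib.request.Request(url, data=body, method="POST")
    req.add_header("Content-Type", f"multipart/form-data; boundary={boundary}")
    with urllib.request.urlopen(req, timeout=600) as resp:
        return resp.read().decode()

# Usage (requires the mineru-api container to be up):
#   print(parse_pdf("sample.pdf"))
```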