Breaking the Memory and Throughput Bottlenecks of Diffusion Model Inference

Diffusion models for image and video generation, such as Wan and Qwen Image, keep growing in capability, but their adoption in production is often blocked by a simple reality: the models don’t fit. A single Qwen-Image-2512 pipeline needs ~58 GB of GPU memory in BF16, more than an A100-40GB provides. On consumer GPUs like the RTX 4090 (24 GB), the common workaround is CPU offloading, which makes inference possible but adds substantial latency.

In this post, we introduce disaggregated deployment in LightX2V, a three-stage architecture that splits the inference pipeline into Encoder, Transformer (DiT), and Decoder microservices. The stages are connected via the Mooncake RDMA engine. Combined with a novel decentralized queue scheduler, this approach delivers:

32.2x Text Encoder speedup on RTX 4090 (offload eliminated)
Up to 1.54x DiT per-step speedup
Up to 1.89x throughput (QPS) on 8 GPUs
< 0.2% network overhead (Mooncake RDMA)

Table of contents:

Background: The Memory Wall in Diffusion Inference
Three-Stage Disaggregated Architecture
Mooncake Engine: Near-Zero Communication Overhead with RDMA
Theoretical Analysis: Optimal Encoder-Transformer Ratios
Decentralized Queue Scheduling
Benchmark Results
Conclusion

Background: The Memory Wall in Diffusion Inference

Modern diffusion pipelines are composed of several heavyweight components that must all reside in GPU memory simultaneously. The table below shows the parameter counts and memory footprints of the major components:

Component	Parameters	BF16 Weight
Qwen-Image-2512 DiT	20.43B	40.9 GB
Qwen2.5-VL (Text Encoder)	8.29B	16.6 GB
Qwen Image VAE	0.13B	0.3 GB
Wan2.1 DiT-14B	14.0B	28.0 GB
T5-XXL (Wan Text Encoder)	~4.7B	9.4 GB
CLIP ViT-H/14 (Wan I2V)	~0.63B	1.3 GB
Wan VAE	~0.10B	0.2 GB
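
The BF16 column follows from the parameter counts at 2 bytes per parameter. The short sketch below reproduces that arithmetic for the Qwen-Image-2512 components; the helper name `weight_gb` is ours, and the result covers weights only (activations and runtime buffers are not included).

```python
# Weight-only footprint: parameters x bytes per parameter.
# `weight_gb` is an illustrative helper, not part of LightX2V.
BYTES_PER_PARAM = {"bf16": 2, "fp16": 2, "int8": 1, "fp32": 4}


def weight_gb(params_billion: float, dtype: str = "bf16") -> float:
    """Weight memory in GB (10^9 bytes), ignoring activations and buffers."""
    return params_billion * 1e9 * BYTES_PER_PARAM[dtype] / 1e9


# Parameter counts (in billions) taken from the table above.
qwen_components = {
    "Qwen-Image-2512 DiT": 20.43,
    "Qwen2.5-VL (Text Encoder)": 8.29,
    "Qwen Image VAE": 0.13,
}

for name, params in qwen_components.items():
    print(f"{name}: {weight_gb(params):.1f} GB")

total = sum(weight_gb(p) for p in qwen_components.values())
print(f"Total BF16 weights: {total:.1f} GB")
```

The three weights sum to roughly 58 GB, which is why the monolithic pipeline cannot fit on a 24 GB or 40 GB card.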

When these components are loaded together, the total memory requirement quickly exceeds what most GPUs can provide:

Configuration	Weight + Activation	RTX 4090 (24 GB)	A100 (40 GB)	A100 (80 GB)	H100 (80 GB)
Qwen-2512 BF16 Baseline	~62–68 GB	OOM	OOM	Tight	OK
Qwen-2512 BF16 Disagg Encoder	~18–20 GB	OK	OK	OK	OK
Wan 14B BF16 Baseline	~42–48 GB	OOM	OOM	OK	OK
Wan 14B BF16 Disagg Encoder	~11–13 GB	OK	OK	OK	OK
Wan 14B INT8 Disagg DiT	~17–20 GB	OK	OK	OK	OK

The standard workaround on memory-constrained GPUs is cpu_offload, which swaps model weights between CPU and GPU memory on-the-fly. While this makes inference possible, it comes at a severe performance cost: on the RTX 4090, the Qwen-2512 Text Encoder latency balloons from 0.40 s (disaggregated, no offload needed) to 12.89 s (baseline with offload)—a 32x slowdown. The larger the model, the greater this latency increase.

Three-Stage Disaggregated Architecture

LightX2V splits the monolithic pipeline into three independent services, each loading only its own subset of model weights:

┌──────────┐  HTTP POST   ┌──────────────────┐ Phase1 RDMA  ┌───────────────┐ Phase2 RDMA  ┌──────────────┐
│  Client  │ ──────────→  │     Encoder      │ ──────────→  │  Transformer  │ ──────────→  │   Decoder    │
└──────────┘              │ (Text/Image/VAE) │              │    (DiT)      │              │ (VAE Decode) │
                          │   ~17–20 GB      │              │  ~28–41 GB    │              │   ~0.3 GB    │
                          └──────────────────┘              └───────────────┘              └──────────────┘

Encoder (disagg_mode="encoder"): Loads Text Encoder, Image Encoder (for I2V/I2I), and VAE Encoder. Sends feature tensors to Transformer via Mooncake Phase1.
Transformer (disagg_mode="transformer"): Loads only the DiT model. Receives Phase1 data, runs denoising, sends latents to Decoder via Mooncake Phase2.
Decoder (disagg_mode="decode"): Loads only the VAE Decoder. Receives latents, decodes to pixels, and saves output.

Measured Peak Memory

Model	Mode	GPU	Peak Memory
Qwen-2512 BF16	Baseline	H100	~58 GB
Qwen-2512 BF16+offload	Disagg Encoder	H100	~18 GB
Qwen-2512 BF16+offload	Disagg Transformer	H100	~40 GB
Wan T2V 14B	Baseline	H100	~39 GB
Wan T2V 14B	Disagg Encoder	H100	~11 GB
Wan T2V 14B	Disagg Transformer	H100	~28 GB
Wan I2V 14B 480P	Baseline	H100	~48 GB
Wan I2V 14B 480P	Disagg Encoder	H100	~13 GB
Wan I2V 14B 480P	Disagg Transformer	H100	~32 GB

By splitting components across GPUs, each node’s memory footprint drops to a fraction of the baseline. Critically, the Encoder stage fits comfortably on an RTX 4090 without any offloading, unlocking the full 32x Text Encoder speedup.

Mooncake Engine: Near-Zero Communication Overhead with RDMA

A natural concern with disaggregated deployment is communication overhead. LightX2V integrates the Mooncake Transfer Engine, which provides zero-copy RDMA transport with microsecond-level latency.

End-to-End Latency Breakdown

We profiled a single Qwen-2512 T2I request (50 steps) on H100 to measure where time is spent:

Stage	Latency	% of Total
Encoder: Text Encoder computation	~0.26 s	1.0%
Encoder: Phase1 send (serialize + RDMA)	~0.025 s	0.1%
Transformer: DiT inference (50 steps)	~25.3 s	96.2%
Transformer: Phase2 send (serialize + RDMA)	~0.024 s	0.09%
Decoder: VAE Decode	~0.31 s	1.2%
Network total (Phase1 + Phase2)	~0.05 s	0.2%
Pipeline total	~26.3 s

The network overhead is dominated by serialization (~4 ms) and Mooncake RDMA transfer (~20 ms) per phase. On H100 with InfiniBand (400 GB/s theoretical bandwidth), even 100 MB transfers complete in sub-millisecond time.

Transfer Sizes Across Models

Transfer Phase	Data Content	Typical Size	Model
Phase1 (Wan T2V)	context (512×4096 BF16)	~4 MB	Wan2.1-14B
Phase1 (Wan I2V)	context + clip_out + vae_enc	~20 MB	Wan2.1-14B
Phase1 (Qwen T2I)	prompt_embeds (4096×3584 BF16)	~28 MB	Qwen-2512
Phase2 (Wan 480P 81f)	latent (16×21×60×104 BF16)	~4 MB	Wan2.1-14B
Phase2 (Wan 720P 81f)	latent (16×21×90×160 BF16)	~9 MB	Wan2.1-14B
Phase2 (Qwen T2I 16:9)	latent (16×1×104×58 BF16)	~0.2 MB	Qwen-2512
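
Payload sizes can be checked directly from the tensor shapes, since a dense BF16 tensor takes 2 bytes per element. The sketch below is a minimal illustration; `tensor_mib` is a hypothetical helper, and it reports sizes in MiB (2^20 bytes).

```python
from math import prod

BF16_BYTES = 2  # bytes per BF16 element


def tensor_mib(shape, bytes_per_elem=BF16_BYTES):
    """Payload size in MiB for a dense tensor of the given shape."""
    return prod(shape) * bytes_per_elem / 2**20


# Shapes taken from the table above.
payloads = {
    "Phase1 Wan T2V context": (512, 4096),
    "Phase1 Qwen T2I prompt_embeds": (4096, 3584),
    "Phase2 Wan 480P 81f latent": (16, 21, 60, 104),
    "Phase2 Wan 720P 81f latent": (16, 21, 90, 160),
    "Phase2 Qwen T2I 16:9 latent": (16, 1, 104, 58),
}

for name, shape in payloads.items():
    print(f"{name}: {tensor_mib(shape):.2f} MiB")
```

Even the largest payload here stays in the tens of megabytes, which is small next to the tens of seconds spent in DiT inference.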

Cross-Model Network Overhead

Model	Encoder	DiT Total	VAE Decoder	Pipeline Total	Network Overhead
Qwen-2512 T2I 50 step (H100)	0.30 s	22.04 s	~2.7 s	~25.0 s	< 0.1%
Wan 14B T2V 50 step (H100)	0.89 s	252.8 s	2.38 s	~253.5 s	< 0.01%
Wan 14B I2V 40 step (H100)	3.17 s	207.7 s	2.22 s	~210.9 s	< 0.02%

The communication cost is negligible across all measured models and configurations: with Mooncake’s RDMA transport, disaggregation adds at most about 0.2% to end-to-end latency.

Aspect	RDMA	TCP
Transport	Zero-copy, kernel bypass	Kernel network stack
CPU overhead	Very low	Higher
Latency	Microseconds	Milliseconds
Hardware	InfiniBand / RoCE NIC	Any network
Recommendation	Production, multi-node	Single-node testing, no RDMA hardware

Theoretical Analysis: Optimal Encoder-Transformer Ratios

For simplicity, the Encoder and Decoder are currently deployed on the same GPU, sharing compute resources.

Throughput Model

Consider a system with E Encoder GPUs and T Transformer GPUs (the Decoder shares a GPU with the Encoder since its footprint is negligible). Let t_e be the Encoder processing time per request and t_t be the Transformer processing time per request.

The throughput of each stage is:

\[R_e = \frac{E}{t_e}, \quad R_t = \frac{T}{t_t}\]

The system throughput is bounded by the bottleneck:

\[R_{\text{system}} = \min(R_e, R_t)\]

To maximize resource utilization, we balance the two stages:

\[R_e = R_t \implies \frac{E}{t_e} = \frac{T}{t_t} \implies \boxed{\frac{T}{E} = \frac{t_t}{t_e}}\]

The optimal Transformer-to-Encoder ratio equals the ratio of their per-request processing times.
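
The throughput model above is small enough to write down as code. The following is a minimal sketch with illustrative function names of our own; the example timings (0.40 s Encoder, 15.0 s for 8 DiT steps) are the RTX 4090 disaggregated measurements reported later in this post.

```python
def stage_throughput(num_gpus: int, seconds_per_request: float) -> float:
    """Requests per second a stage can sustain with `num_gpus` workers."""
    return num_gpus / seconds_per_request


def system_throughput(E: int, T: int, t_e: float, t_t: float) -> float:
    """R_system = min(R_e, R_t): the slower stage bounds the pipeline."""
    return min(stage_throughput(E, t_e), stage_throughput(T, t_t))


def balanced_ratio(t_e: float, t_t: float) -> float:
    """Transformer-to-Encoder GPU ratio T/E at which R_e == R_t."""
    return t_t / t_e


t_e, t_t = 0.40, 15.0  # seconds per request (Encoder, 8-step DiT)
print(f"Balanced T:E ratio: {balanced_ratio(t_e, t_t):.1f} : 1")
print(f"7T : 1E throughput: {system_throughput(1, 7, t_e, t_t):.3f} req/s")
```

With only 8 GPUs the balanced ratio cannot be reached, so the Transformer stage stays the bottleneck and the single Encoder has spare capacity.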

How Disaggregation Changes the Ratio

A critical insight is that disaggregation itself dramatically shifts this ratio by accelerating the Encoder stage (eliminating offload overhead):

Scenario	Text Encoder	DiT (8 steps)	DiT:Enc Ratio
Baseline (4090 + offload)	12.89 s	23.2 s	1.8 : 1
Disagg (4090)	0.40 s	15.0 s	37.5 : 1
Baseline (4090 + offload)	12.89 s	287.5 s (50 steps)	22.3 : 1
Disagg (4090)	0.40 s	188 s (50 steps)	470 : 1

By reducing Encoder time by 32x, disaggregation makes the ratio far more extreme, underscoring the importance of allocating as many GPUs as possible to the Transformer stage.

Ratio vs. Number of Inference Steps

The step count is the primary factor driving the optimal ratio (Qwen-2512, 4090):

Steps	T_encoder	T_DiT	DiT:Encoder Ratio	Recommended
50	0.4 s	188 s	470 : 1	Max Transformer GPUs
8	0.4 s	15.04 s	37.6 : 1	~38 : 1
4	0.4 s	7.52 s	18.8 : 1	~19 : 1
2	0.4 s	3.76 s	9.4 : 1	~10 : 1
1	0.4 s	1.88 s	4.7 : 1	~5 : 1

Multi-Scale Optimal Ratios

8 GPUs (common single-node setup):

Model	Steps	Theoretical Ratio	Optimal Config	Analysis
Qwen-2512	8	37.5 : 1	7T : 1E	1 Encoder @ 2.5 req/s ≫ 7T bottleneck @ 0.47 req/s
Qwen-2512	50	470 : 1	7T : 1E	Encoder is never the bottleneck

16 GPUs (8-step distilled model):

Config	T Throughput	E Throughput	Bottleneck	E Utilization
15T : 1E	1.0 req/s	2.5 req/s	Transformer	40%
14T : 2E	0.93 req/s	5.0 req/s	Transformer	18.7% (wasteful)

Conclusion: 15:1 is optimal—adding a second Encoder wastes a GPU.

32 GPUs (50-step model):

31T : 1E yields 0.165 req/s throughput. The single Encoder (2.5 req/s capacity) runs at only 6.6% utilization—still not the bottleneck.

800 GPUs (8-step distilled model)—the first scale where Encoder matters:

Config	T Throughput	E Throughput	System Throughput	E Utilization	T Utilization
790T : 10E	52.67 req/s	25 req/s	25 req/s	100%	47.5%
780T : 20E	52 req/s	50 req/s	50 req/s	100%	96.2%
779T : 21E	51.93 req/s	52.5 req/s	51.93 req/s	98.9%	100%

The theoretical optimum is $E = N / (1 + t_t/t_e) = 800 / 38.5 \approx 21$.
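
The closed form can be cross-checked by brute force: try every integer split of the GPU budget and keep the one with the highest bottleneck throughput. This is a sketch under the same simplified model (fixed per-request times, no queuing effects); `best_split` is an illustrative name.

```python
def best_split(total_gpus: int, t_e: float, t_t: float):
    """Return (encoders, transformers, req_per_s) maximizing min(R_e, R_t)."""
    best = None
    for encoders in range(1, total_gpus):
        transformers = total_gpus - encoders
        rate = min(encoders / t_e, transformers / t_t)
        if best is None or rate > best[2]:
            best = (encoders, transformers, rate)
    return best


t_e, t_t = 0.40, 15.0  # 8-step distilled model on RTX 4090 (from the tables above)
for n in (16, 800):
    e, t, rate = best_split(n, t_e, t_t)
    print(f"N={n}: {t}T : {e}E -> {rate:.2f} req/s")
```

For 800 GPUs the search lands on 779T : 21E, in line with the closed-form estimate, and for 16 GPUs it keeps a single Encoder.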

Practical Configuration Guide

Scale	8-step Distilled	50-step Full
8 GPUs	7T : 1E	7T : 1E
16–64 GPUs	1 Encoder per 30–40 GPUs	1 Encoder per 60–100 GPUs
100+ GPUs	$E = \lceil N / (1 + t_t/t_e) \times 1.2 \rceil$	Same formula
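
For large clusters, the sizing rule in the last row can be wrapped in a small helper. The sketch below assumes the 1.2 factor is headroom applied to the balanced Encoder count; the function name and the floor of one Encoder are our own additions.

```python
import math


def recommended_encoders(total_gpus: int, t_e: float, t_t: float,
                         headroom: float = 1.2) -> int:
    """E = ceil(N / (1 + t_t / t_e) * headroom), with at least one Encoder."""
    balanced = total_gpus / (1 + t_t / t_e)
    return max(1, math.ceil(balanced * headroom))


t_e, t_t = 0.40, 15.0  # 8-step distilled model on RTX 4090
for n in (8, 100, 800):
    e = recommended_encoders(n, t_e, t_t)
    print(f"N={n}: {n - e}T : {e}E")
```

At 800 GPUs this gives 25 Encoders, a few more than the balanced 21, which trades a small amount of peak throughput for slack during bursts.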

When to add more Encoders:

Measured Encoder utilization exceeds 80%
P95 latency is significantly higher than P50 (queuing is severe)
Concurrent request count exceeds the number of Transformer workers

Otherwise, always prioritize adding more Transformer GPUs.

Decentralized Queue Scheduling

In the standard three-stage deployment, the client must send requests to Decoder → Transformer → Encoder in sequence, which is operationally complex and introduces scheduling overhead. LightX2V introduces a decentralized queue scheduler that simplifies this to a single HTTP POST:

┌──────────┐  HTTP POST   ┌──────────┐ Phase1 RDMA ┌─────────────┐ Phase2 RDMA ┌──────────┐
│  Client  │ ──────────→  │ Encoder  │ ──────────→ │ Transformer │ ──────────→ │ Decoder  │
└──────────┘              │ (GPU 0)  │             │ (GPU 1/2/3) │             │ (GPU 0)  │
                          └──────────┘             └─────────────┘             └──────────┘
                                ↑                        ↑                          ↑
                          lightx2v.server          pull worker ×N              pull worker
                          HTTP port 8002
                                │
                          ┌──────────┐
                          │Controller│  ← RDMA metadata ring buffer (always-on)
                          └──────────┘

Key Design Decisions

Controller: Maintains three RDMA ring buffers (request / phase1 / phase2) for metadata dispatch. Loads no models and performs no inference.
Encoder: Runs as an HTTP service, performs Text Encoder inference, then writes dispatch metadata to the Phase1 RDMA ring.
Transformer and Decoder: Run as pull-based workers that consume from their respective RDMA rings automatically.
Multiple Transformer workers can be deployed on different GPUs. The client specifies disagg_phase1_receiver_engine_rank to target a specific worker, enabling round-robin or explicit routing.
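
The pull-by-rank pattern can be illustrated with an in-process stand-in. The sketch below is not LightX2V code: a bounded Python deque plays the role of the Phase1 RDMA metadata ring, and `Meta`, `MetadataRing`, and the round-robin producer are hypothetical names chosen for the example.

```python
from collections import deque
from dataclasses import dataclass
from itertools import cycle


@dataclass
class Meta:
    room_id: int
    receiver_rank: int  # which Transformer worker should pull this entry


class MetadataRing:
    """Bounded FIFO standing in for an RDMA metadata ring (in-process only)."""

    def __init__(self, slots: int):
        self.slots = slots
        self.entries = deque()

    def push(self, meta: Meta) -> bool:
        if len(self.entries) >= self.slots:
            return False  # ring full: the producer has to retry later
        self.entries.append(meta)
        return True

    def pull(self, rank: int):
        """Remove and return the oldest entry addressed to `rank`, else None."""
        for index, meta in enumerate(self.entries):
            if meta.receiver_rank == rank:
                break
        else:
            return None
        del self.entries[index]
        return meta


phase1 = MetadataRing(slots=128)
ranks = cycle([1, 2, 3])  # round-robin over three Transformer workers
for room_id in range(6):
    phase1.push(Meta(room_id, next(ranks)))

handled = {rank: [] for rank in (1, 2, 3)}
for rank in handled:
    while (meta := phase1.pull(rank)) is not None:
        handled[rank].append(meta.room_id)
print(handled)
```

Because each worker only takes entries addressed to its own rank, adding a Transformer means adding one more consumer and one more rank to the rotation, with no change on the client side.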

Comparison with Standard Three-Stage

Aspect	Standard Three-Stage	Decentralized Queue
Client calls	Must POST to Decoder → Transformer → Encoder	Single POST to Encoder HTTP
Transformer	HTTP server, one request at a time	Pull worker, multiple instances in parallel
Decoder	HTTP server	Pull worker, auto-consumes Phase2
Request routing	Client explicitly specifies all endpoints	Encoder writes RDMA ring, workers pull by rank
Result retrieval	Poll Decoder HTTP	Poll Encoder HTTP
Scaling	Fixed 1:1:1 ratio	Flexible N Transformer : M Encoder

Impact on Throughput and Tail Latency

The decentralized scheduler eliminates sequential request dispatch overhead and reduces queue contention. On 8× RTX 4090 with 7:1 ratio:

Scheduling	QPS	P50	P95	P99
Centralized three-stage (7:1)	0.24	17 s	25 s	28 s
Decentralized queue (7:1)	0.34	17 s	20 s	22 s

The decentralized scheduler improves QPS by 1.42x over the centralized three-stage approach (and 1.89x over baseline) while significantly reducing tail latency—P95 drops from 25 s to 20 s, and P99 from 28 s to 22 s.

Benchmark Results

How to Run (Qwen-2512 T2I Decentralized Deployment)

Taking a 4-GPU (1 Encoder + 3 Transformer) Qwen-2512 T2I decentralized deployment as an example, a single command starts all services:

git clone git@github.com:ModelTC/LightX2V.git
cd LightX2V
bash scripts/server/disagg/qwen/start_qwen_t2i_disagg_decentralized.sh

The script launches Controller → Encoder → Decoder → Transformer×3 (6 processes total) with the default GPU layout:

Role	GPU	Notes
Controller	CPU	RDMA ring buffer only, no GPU needed
Encoder + Decoder	GPU 0	Shared GPU (both have small memory footprint)
Transformer 1	GPU 1	DiT worker
Transformer 2	GPU 2	DiT worker
Transformer 3	GPU 3	DiT worker

GPU assignments and the number of Transformers can be customized via environment variables:

GPU_ENCODER=0 GPU_DECODER=0 \
GPU_TRANSFORMER_1=1 GPU_TRANSFORMER_2=2 GPU_TRANSFORMER_3=3 \
NUM_TRANSFORMERS=3 \
bash scripts/server/disagg/qwen/start_qwen_t2i_disagg_decentralized.sh

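Once the services are up, send requests and query results via the Encoder’s HTTP API, the single client-facing entry point in decentralized mode (adjust the port to match your deployment):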

# Send a generation request
curl -X POST http://localhost:8080/v1/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "A cute cat on a table", "seed": 42, "aspect_ratio": "16:9"}'

# Check result status
curl http://localhost:8080/v1/status/{room_id}
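
The same two calls can be issued from Python with the standard library. This is a sketch only: the URL paths and JSON fields mirror the curl commands above, while the helper names are hypothetical and the response schema is not documented here, so the reply should be inspected to find the room identifier.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # same address as the curl example above


def build_generate_request(prompt, seed=42, aspect_ratio="16:9", base_url=BASE_URL):
    """Build (but do not send) the POST request for a generation task."""
    body = json.dumps(
        {"prompt": prompt, "seed": seed, "aspect_ratio": aspect_ratio}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def status_url(room_id, base_url=BASE_URL):
    """URL to poll for the status of a submitted task."""
    return f"{base_url}/v1/status/{room_id}"


def send(request, timeout=30):
    """Send a request and decode the JSON reply (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


request = build_generate_request("A cute cat on a table")
print(request.get_method(), request.full_url)
# With the services running: reply = send(request), then poll status_url(...)
```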

Configuration Walkthrough

Decentralized deployment configs are in configs/disagg/qwen/, with one JSON file per role. Below are the key fields for each:

Controller (qwen_image_t2i_disagg_controller.json):

{
  "disagg_mode": "controller",
  "disagg_config": {
    "protocol": "rdma",
    "rdma_buffer_slots": 128,
    "rdma_buffer_slot_size": 4096,
    "rdma_request_handshake_port": 5566,
    "rdma_phase1_handshake_port": 5567,
    "rdma_phase2_handshake_port": 5568
  }
}

The Controller loads no model weights—it only initializes three RDMA ring buffers (request / phase1 / phase2) for metadata dispatch. rdma_buffer_slots controls queue depth, and rdma_*_handshake_port specifies the RDMA handshake port for each ring.

Encoder (qwen_image_t2i_disagg_encoder_decentralized.json):

{
  "disagg_mode": "encoder",
  "disagg_config": {
    "decentralized_queue": true,
    "sender_engine_rank": 0,
    "receiver_engine_rank": 1,
    "rdma_phase1_host": "127.0.0.1",
    "rdma_phase1_handshake_port": 5567
  }
}

decentralized_queue: true activates the decentralized scheduling mode. After inference, the Encoder writes feature metadata into the Phase1 RDMA ring for Transformer workers to pull.

Transformer (qwen_image_t2i_disagg_transformer_decentralized.json):

{
  "disagg_mode": "transformer",
  "disagg_config": {
    "decentralized_queue": true,
    "transformer_engine_rank": 1,
    "decoder_engine_rank": 4,
    "rdma_phase1_handshake_port": 5567,
    "rdma_phase2_handshake_port": 5568
  }
}

The Transformer runs as a pull worker, consuming tasks from the Phase1 ring. After DiT inference, it writes latents to the Phase2 ring. When deploying multiple Transformers, each worker uses a different transformer_engine_rank (handled automatically by the startup script).

Decoder (qwen_image_t2i_disagg_decoder_decentralized.json):

{
  "disagg_mode": "decode",
  "disagg_config": {
    "decoder_engine_rank": 4,
    "rdma_phase2_handshake_port": 5568
  }
}

The Decoder loads only the VAE Decoder (~0.3 GB), receiving latents from the Phase2 ring and decoding them into the final image.
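
Since every role has to reach the Controller’s rings, a quick consistency check on the handshake ports can catch copy-paste mistakes before launch. The sketch below assumes that each role’s `rdma_*_handshake_port` must equal the Controller’s value for the same ring, which is our reading of the configs above rather than a documented rule; `port_mismatches` is a hypothetical helper. In practice the dictionaries would come from the `disagg_config` section of each JSON file.

```python
# disagg_config port fields copied from the walkthrough above.
controller = {
    "rdma_request_handshake_port": 5566,
    "rdma_phase1_handshake_port": 5567,
    "rdma_phase2_handshake_port": 5568,
}
roles = {
    "encoder": {"rdma_phase1_handshake_port": 5567},
    "transformer": {
        "rdma_phase1_handshake_port": 5567,
        "rdma_phase2_handshake_port": 5568,
    },
    "decoder": {"rdma_phase2_handshake_port": 5568},
}


def port_mismatches(controller_cfg, role_cfgs):
    """List handshake ports that differ from the Controller's setting."""
    problems = []
    for role, cfg in role_cfgs.items():
        for key, value in cfg.items():
            if not key.endswith("_handshake_port"):
                continue
            expected = controller_cfg.get(key)
            if expected != value:
                problems.append(f"{role}: {key}={value}, controller uses {expected}")
    return problems


print(port_mismatches(controller, roles) or "all handshake ports agree")
```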

Per-Stage Speedup on Memory-Constrained GPUs

The most dramatic speedups occur on memory-constrained GPUs where baseline inference requires CPU offloading. On RTX 4090 with Qwen-2512 (BF16 + block offload):

Component	Baseline (offload)	Disagg (no offload on Encoder)	Speedup
Text Encoder	12.89 s	0.40 s	32.22x
DiT per step (50 steps)	5.75 s/step	3.76 s/step	1.53x
DiT per step (8 steps, distilled)	2.90 s/step	1.88 s/step	1.54x

The Text Encoder speedup comes from eliminating weight offloading entirely—the Encoder node’s ~17 GB footprint fits on a 4090 without any offloading. The DiT speedup comes from exclusive VRAM bandwidth: in baseline mode, DiT competes with other components for PCIe and memory bandwidth during offload transfers.

For Wan2.1-I2V-14B on RTX 4090 (BF16, 40 steps, block offload):

Metric	Baseline (1×4090)	Disagg (2×4090)
DiT per step (480P)	24.24 s/step	19.02 s/step
DiT per step (720P)	90.71 s/step	62.80 s/step
Text Encoder	2.14 s	0.20 s
Image Encoder (480P)	0.57 s	0.28 s

As models grow larger, the offload penalty grows with them, making disaggregation’s advantage more pronounced. Qwen-2512 (~58 GB baseline) shows a 32x Text Encoder speedup, compared with roughly 10.7x (2.14 s to 0.20 s) for the smaller Wan2.1-I2V-14B (~48 GB baseline).

Multi-GPU Throughput Scaling

We benchmarked 100 requests of Qwen-2512 T2I on RTX 4090 GPUs, comparing baseline (all components on each GPU) against disaggregated deployment with different DiT:Encoder ratios.

4 GPUs — Centralized Three-Stage Scheduling:

Mode	DiT:Enc	Concurrency	Total Time	P50	P95	P99	QPS
Baseline	4:0	4	1079 s	41 s	55 s	61 s	0.092
Disagg	2:2	2	1037 s	35 s	62 s	63 s	0.096
Disagg	3:1	3	705 s	35 s	38 s	39 s	0.14

8 GPUs — Decentralized Queue Scheduling:

Mode	DiT:Enc	Concurrency	Total Time	P50	P95	P99	QPS
Baseline	8:0	8	550 s	22 s	28 s	30 s	0.18
Disagg	4:4	4	497 s	20 s	23 s	25 s	0.20
Disagg	5:3	5	402 s	19 s	20 s	23 s	0.25
Disagg	6:2	6	335 s	18 s	20 s	22 s	0.30
Disagg	7:1	7	294 s	17 s	20 s	22 s	0.34

RTX 4090, Qwen-2512 T2I, BF16 + block offload, 100 requests. Decentralized scheduling for disagg rows.

At the optimal 7:1 ratio on 8 GPUs, disaggregated deployment achieves 0.34 QPS vs baseline 0.18 QPS—a 1.89x throughput improvement. The decentralized scheduler also significantly reduces tail latency: P95 drops from 28 s (baseline) to 20 s (disagg 7:1), and P99 from 30 s to 22 s.

Sensitivity to Inference Parameters

Prompt length shifts the DiT:Encoder ratio (between 18:1 and 49:1 in the table below), but Encoder latency stays far below DiT latency at every length, so a 7T : 1E split remains Transformer-bound (Qwen-2512, H100):

Prompt Length	Encoder Latency	DiT 4-step Total	Ratio
16 tokens	35 ms	1702 ms	49:1
256 tokens	53 ms	1735 ms	33:1
1024 tokens	102 ms	1855 ms	18:1
4096 tokens	81 ms	1877 ms	23:1

Resolution affects DiT latency but not Encoder latency, so the ratio changes with the output shape (Qwen-2512, 4090 + offload, 50 steps):

Resolution	Aspect Ratio	DiT Total	DiT Per-Step
1664×928	16:9	188 s	3.76 s
1328×1328	1:1	204 s	4.07 s
1472×1140	4:3	215 s	4.29 s

Conclusion

In this post, we presented disaggregated deployment in LightX2V—a three-stage architecture that physically separates Encoder, Transformer, and Decoder onto independent GPU nodes. By integrating Mooncake’s RDMA transport (< 0.2% overhead) and a decentralized queue scheduler, we achieve:

Memory decoupling: Each node loads only its own component, so the Encoder stage of models like Qwen-2512 (~58 GB as a monolith) runs on an RTX 4090 (24 GB) without offloading.
Massive Encoder acceleration: Eliminating offload yields a 32x Text Encoder speedup on memory-constrained GPUs.
Flexible scaling: Optimal Transformer:Encoder ratios follow $T:E = t_t : t_e$, with practical configs ranging from 7:1 (8 GPUs) to 779:21 (800 GPUs).
Production-ready throughput: 1.89x QPS improvement with decentralized scheduling on 8 GPUs.

As diffusion models continue to grow—20B, 50B, and beyond—the gap between monolithic and disaggregated deployment will only widen. The Encoder stage stays small, while the DiT stage grows with model parameters, making the case for disaggregation stronger with every generation.