In LightLLM, multimodal inference is mainly divided into two stages: first, the input images are preprocessed and then passed through the vision encoder to obtain image embeddings; next, these image embeddings are concatenated with text embeddings and fed into the LLM for generation.
During the integration of MinuerU, we optimized the communication layer, reworked the ViT batching and scheduling logic, and streamlined the image preprocessing pipeline. These changes brought noticeable performance gains under various resolutions and hardware configurations.
MinuerU Multimodal Inference Workflow in LightLLM
- Image preprocessing (resize, normalization, and other operations following the visual spec).
- Use RPyC to call the remote ViT and generate image embeddings.
- Embedding fusion: concatenate image embeddings with text embeddings.
- LLM decoding: feed the combined sequence into the LLM for generation.
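The four stages above can be sketched end to end. All function names below are hypothetical stand-ins for illustration, not LightLLM's actual API:

```python
# Illustrative sketch of the four-stage multimodal pipeline; every name
# here is a hypothetical stub, not LightLLM's real interface.
from typing import List

Embedding = List[float]

def preprocess(image_bytes: bytes) -> bytes:
    """Stage 1: resize/normalize per the visual spec (stub)."""
    return image_bytes

def remote_vit_encode(image_bytes: bytes) -> List[Embedding]:
    """Stage 2: stands in for the RPyC call to the remote ViT
    (stub returning a single dummy embedding vector)."""
    return [[0.0, 0.0, 0.0, 0.0]]

def fuse(image_emb: List[Embedding], text_emb: List[Embedding]) -> List[Embedding]:
    """Stage 3: concatenate image embeddings ahead of text embeddings."""
    return image_emb + text_emb

def run_pipeline(image_bytes: bytes, text_emb: List[Embedding]) -> List[Embedding]:
    """Stages 1-3; the combined sequence would then go to the LLM (stage 4)."""
    return fuse(remote_vit_encode(preprocess(image_bytes)), text_emb)
```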
During integration, we observed that the TCP behavior of RPyC, as well as the strong coupling between ViT batch size and downstream scheduling, were major sources of latency—especially for small images and high-concurrency workloads.
Reducing RPyC Overhead with TCP_NODELAY
We noticed that default RPyC operations introduce about 20 ms of fixed delay due to TCP’s Nagle algorithm, which buffers small packets and delays sending. This adds unnecessary wait time to certain RPyC calls.
To avoid this, we explicitly enabled TCP_NODELAY on the RPyC connection.
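The option itself is a one-line socket call. The minimal sketch below shows it on a plain TCP socket; in RPyC the same `setsockopt` call is applied to the connection's underlying socket object (the attribute path into RPyC internals varies by version, so it is not shown here):

```python
import socket

def make_low_latency_socket() -> socket.socket:
    """Create a TCP socket with Nagle's algorithm disabled.

    Sketch only: in RPyC, apply the same setsockopt call to the
    connection's underlying socket after connecting.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY=1: flush small writes immediately instead of letting
    # Nagle's algorithm buffer them waiting for an ACK.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock
```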
After enabling TCP_NODELAY, we ran batch inference of 448×448 low-resolution images on an H200 over a fixed test set of around 1000 images: QPS improved from 30 req/s to 60 req/s, doubling throughput and significantly reducing time-to-first-token latency.
This optimization is particularly beneficial in scenarios with many small requests or strict latency SLAs.
Optimizing ViT Batching and Scheduling Behavior
Previously, the ViT batch size was determined solely by the parameter `visual_infer_batch_size`: ViT performed forward passes at that batch size, accumulated `visual_infer_batch_size` embeddings, and once the threshold was reached, executed `infer_imgs` and immediately dispatched the corresponding requests downstream.
On GPUs with small memory (e.g., 4090D), visual_infer_batch_size can only be set to 1, as larger values easily cause OOM. This leads to:
- Frequent small `send_pyobj` calls, increasing overhead.
- ViT sending embedding batches of size 1 to the LLM side, fixing the LLM's prefill batch size at 1 and underutilizing GPU compute.
We reworked the main loop and decoupled ViT batching from embedding dispatch. This allows:
- Fewer small RPyC messages, lowering `send_pyobj` overhead.
- High ViT utilization via `visual_infer_batch_size`, while `visual_send_batch_size` builds efficient downstream prefill batches.
- Lower end-to-end latency jitter under high concurrency.
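The decoupling can be sketched as two independent batch sizes in one loop. The parameter names follow the text; the loop itself is a simplified illustration, not LightLLM's actual main loop:

```python
# Simplified sketch: ViT forwards run at visual_infer_batch_size (bounded
# by GPU memory), while embeddings are dispatched downstream at the
# independent visual_send_batch_size. Not LightLLM's real main loop.
from typing import Callable, List

def vit_loop(images: List[str],
             infer_fn: Callable[[List[str]], List[str]],
             send_fn: Callable[[List[str]], None],
             visual_infer_batch_size: int,
             visual_send_batch_size: int) -> None:
    pending: List[str] = []
    for i in range(0, len(images), visual_infer_batch_size):
        # ViT forward pass at the memory-bound infer batch size.
        pending.extend(infer_fn(images[i:i + visual_infer_batch_size]))
        # Dispatch only once enough embeddings have accumulated, so the
        # LLM side sees larger prefill batches and fewer RPyC messages.
        while len(pending) >= visual_send_batch_size:
            send_fn(pending[:visual_send_batch_size])
            pending = pending[visual_send_batch_size:]
    if pending:  # flush the remainder at end of input
        send_fn(pending)
```

Even with `visual_infer_batch_size=1` (the 4090D case), the downstream side now receives prefill batches of `visual_send_batch_size` instead of size 1.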
Accelerating Image Preprocessing
We also found that when image resolution is large (e.g., 4K or 8K), image preprocessing becomes a significant contributor to overall multimodal inference latency. To address this, we streamlined several operations in the preprocessing pipeline.
In the transformers library, Qwen2-VL models perform preprocessing on the CPU. We found that operations such as resize run significantly faster on GPU than on CPU: for a 4K image, resize takes about 20 ms on CPU but only about 3 ms on GPU. Based on this, we moved some preprocessing steps onto the GPU, greatly reducing image preprocessing overhead.
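A minimal sketch of GPU-side resize plus normalization is shown below. This is an illustration of the idea, not Qwen2-VL's exact preprocessing; the mean/std values are the ImageNet-style constants, used here only as placeholders for the model's actual normalization config:

```python
import torch
import torch.nn.functional as F

def gpu_preprocess(img_u8: torch.Tensor, size=(448, 448)) -> torch.Tensor:
    """Resize + normalize an HWC uint8 image tensor on its current device.

    Sketch only: mean/std are ImageNet-style placeholder constants, not
    necessarily the model's real preprocessing config.
    """
    # HWC uint8 -> 1xCxHxW float, then bilinear resize (fast on GPU).
    x = img_u8.permute(2, 0, 1).unsqueeze(0).float() / 255.0
    x = F.interpolate(x, size=size, mode="bilinear", align_corners=False)
    mean = torch.tensor([0.485, 0.456, 0.406], device=x.device).view(1, 3, 1, 1)
    std = torch.tensor([0.229, 0.224, 0.225], device=x.device).view(1, 3, 1, 1)
    return (x - mean) / std
```

Moving the uint8 image to the GPU first and resizing there also shrinks the host-to-device copy, since the raw bytes are transferred instead of a float tensor.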
Flash-Attention Kernel Selection
As is well known, attention is one of the most time-consuming components in large-model inference, and many vendors provide their own Flash-Attention implementations. On NVIDIA H-series GPUs, Tri Dao’s FlashAttention-3 is unquestionably the fastest. However, on the 4090 series, implementations differ between vendors.
We benchmarked several common open-source Flash-Attention implementations on the 4090. Using MinuerU-2.5 as an example, with shape `B=1, H=16, L=7956, D=80` and `dtype=torch.bfloat16`:
Performance results:
- sgl_kernel: 2.711 ms
- xFormers: 2.791 ms
- torch.sdpa: 2.906 ms
Therefore, we replaced the Flash-Attention implementation with the one from sgl_kernel (which uses Flashinfer’s Flash-Attention on 4090D).
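A comparison like this can be reproduced with a small timing harness. The sketch below is a generic harness, not our exact benchmark script; the sgl_kernel and xFormers entry points would be passed in the same way as `torch.sdpa` here:

```python
import time
import torch

def bench_attention(fn, B=1, H=16, L=7956, D=80,
                    dtype=torch.bfloat16, device="cuda", iters=20) -> float:
    """Average wall-clock time of fn(q, k, v) in ms.

    Generic sketch: pass any attention callable taking (q, k, v) in
    BxHxLxD layout, e.g. torch.nn.functional.scaled_dot_product_attention.
    """
    q = torch.randn(B, H, L, D, dtype=dtype, device=device)
    k, v = torch.randn_like(q), torch.randn_like(q)
    fn(q, k, v)  # warm-up call (kernel selection, caches)
    if device == "cuda":
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(q, k, v)
    if device == "cuda":
        torch.cuda.synchronize()  # wait for queued kernels before stopping the clock
    return (time.perf_counter() - t0) * 1000 / iters
```

Synchronizing before reading the clock matters on CUDA: kernel launches are asynchronous, so without it the harness would measure launch overhead rather than kernel time.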
Performance Evaluation
On RTX 4090D hardware, we compared LightLLM with vLLM using the same MinuerU model and comparable configurations.
RTX 4090D, 10 concurrent requests, test set with 1000 images
| Metric | vLLM | LightLLM |
|---|---|---|
| QPS (req/s) | 1.40 | 1.56 |
| Prefill P50 (ms) | 1140 | 640 |
| Decode P50 (ms) | 5.88 | 5.80 |
Overall, under comparable settings, MinuerU running on LightLLM achieves slightly higher QPS than vLLM, and the optimized communication, batching, and preprocessing strategies help stabilize and enhance end-to-end performance.