CVPR 2026 Project Page

MPDiT: Multi-Patch Global-to-Local Transformer Architecture

Efficient flow matching and diffusion generation by spending cheap large-patch tokens early and high-resolution local tokens only near the end.

Quan Dao¹    Dimitris N. Metaxas¹
¹Rutgers University
Figure 1. Class-conditional ImageNet samples generated by MPDiT-XL.

TL;DR

MPDiT is a global-to-local diffusion transformer. It runs most transformer blocks on coarse, large-patch tokens to capture global structure, then upsamples and uses only a few high-resolution blocks for local refinement. The design reduces GFLOPs by up to 50% while preserving strong generative quality.

Global first

Early blocks use patch size 4, reducing the latent token count from 256 to 64.

Local refinement

Only the final local blocks use patch size 2, recovering details at the end.
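
The token counts quoted above follow from simple arithmetic. The sketch below assumes a 32×32 latent grid, which this page does not state explicitly; `token_count` and `attention_pairs` are illustrative helper names, not part of any released code.

```python
def token_count(latent_size: int, patch_size: int) -> int:
    """Number of tokens when a square latent is split into square patches."""
    if latent_size % patch_size != 0:
        raise ValueError("patch size must divide the latent size")
    return (latent_size // patch_size) ** 2


def attention_pairs(num_tokens: int) -> int:
    """Token pairs scored by full self-attention (quadratic in sequence length)."""
    return num_tokens ** 2


LATENT_SIZE = 32  # assumption: 32x32 latent grid, not stated on this page

fine_tokens = token_count(LATENT_SIZE, 2)    # patch size 2, local blocks
coarse_tokens = token_count(LATENT_SIZE, 4)  # patch size 4, global blocks

print(fine_tokens, coarse_tokens)  # 256 64
print(attention_pairs(fine_tokens) // attention_pairs(coarse_tokens))  # 16
```

Under this assumption a coarse block sees 4× fewer tokens than a fine block, and its attention map has 16× fewer entries.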

Better conditioning

Shared AdaIN, multi-token class conditioning, and FNO time embedding improve training convergence.

59.3 GFLOPs for MPDiT-XL
7.36 FID without CFG
2.05 FID with CFG
49.9% GFLOPs vs dense XL

Method

Instead of using the same token resolution in every transformer block, MPDiT treats the network as a coarse-to-fine hierarchy. The first stage models global context on large-patch tokens. An upsample module then expands the token sequence and adds a skip connection from a fine patch embedding before the final high-resolution refinement blocks.
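
The data flow described above can be traced at the level of tensor shapes. This is a minimal NumPy sketch under several assumptions that are not taken from the paper: a 4×32×32 latent, random linear maps standing in for the patch embeddings, nearest-neighbour repetition standing in for the upsample module, and the transformer blocks themselves omitted. The names `patchify` and `upsample_tokens` are hypothetical.

```python
import numpy as np


def patchify(latent, p):
    """Split a (C, H, W) latent into (H//p * W//p, C*p*p) row-major patch tokens."""
    c, h, w = latent.shape
    gh, gw = h // p, w // p
    x = latent.reshape(c, gh, p, gw, p)
    x = x.transpose(1, 3, 0, 2, 4)  # (gh, gw, c, p, p)
    return x.reshape(gh * gw, c * p * p)


def upsample_tokens(tokens, grid, factor=2):
    """Nearest-neighbour upsampling of a (grid*grid, D) token sequence.

    Stand-in for the upsample module; the actual module may differ.
    """
    d = tokens.shape[1]
    x = tokens.reshape(grid, grid, d)
    x = x.repeat(factor, axis=0).repeat(factor, axis=1)
    return x.reshape(grid * factor * grid * factor, d)


rng = np.random.default_rng(0)
latent = rng.standard_normal((4, 32, 32))  # assumed latent shape
DIM = 16                                   # toy model width for illustration

# Random projections standing in for learned patch embeddings.
w_coarse = rng.standard_normal((4 * 4 * 4, DIM)) * 0.02  # patch size 4
w_fine = rng.standard_normal((4 * 2 * 2, DIM)) * 0.02    # patch size 2

coarse = patchify(latent, 4) @ w_coarse    # (64, DIM) global tokens
# ... the global (coarse) transformer blocks would run here ...

fine_skip = patchify(latent, 2) @ w_fine   # (256, DIM) fine patch embedding
fine = upsample_tokens(coarse, grid=8) + fine_skip  # (256, DIM)
# ... the final k local (fine) transformer blocks would run here ...

print(coarse.shape, fine.shape)
```

The point of the sketch is only the sequence lengths: most of the network would operate on the 64-token sequence, and the 256-token sequence appears only after the upsample step.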

Figure 2. MPDiT architecture overview: global-local multi-patch transformer, shared-conditioning DiT block, upsample module, and FNO time embedding.

Results

MPDiT improves the quality-compute trade-off on ImageNet generation. The table below highlights representative ImageNet 256 results from the paper.

Table 1. Quantitative performance on ImageNet 256. Lower FID and GFLOPs are better.
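
| Model | Epochs | GFLOPs | FID | sFID | IS | Precision | Recall |
|---|---|---|---|---|---|---|---|
| SiT-B/2 | 80 | 23.02 | 34.84 | 6.59 | 41.53 | 0.52 | 0.64 |
| DiCo-B | 80 | 16.88 | 27.20 | - | 56.52 | 0.60 | 0.61 |
| MPDiT-B | 80 | 16.60 | 24.74 | 6.32 | 57.40 | 0.58 | 0.65 |
| SiT-XL/2 | 80 | 118.66 | 18.04 | 5.07 | 73.90 | 0.63 | 0.64 |
| DiCo-XL | 80 | 87.30 | 11.67 | - | 100.42 | 0.71 | 0.61 |
| MPDiT-XL | 80 | 59.30 | 9.92 | 5.05 | 102.79 | 0.70 | 0.64 |
| DiG-XL/2-G | 240 | 89.40 | 2.07 | 4.53 | 278.95 | 0.82 | 0.60 |
| MPDiT-XL-G | 240 | 59.30 | 2.05 | 4.51 | 278.73 | 0.82 | 0.61 |
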
Table 2. Model configuration and computational cost.

| Model | N | k | Model Dim D | GFLOPs | GFLOPs ratio vs DiT |
|---|---|---|---|---|---|
| MPDiT-B | 12 | 6 | 768 | 16.6 | 72.1% |
| MPDiT-XL | 28 | 6 | 1152 | 59.3 | 49.9% |

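
The ratio column can be approximately reproduced from GFLOPs figures given on this page. Table 2 does not list the dense baseline costs, so the sketch below assumes they match the SiT-B/2 and SiT-XL/2 rows of Table 1 (23.02 and 118.66 GFLOPs). Small differences from the published percentages are expected, because the inputs are already rounded.

```python
# MPDiT GFLOPs as reported in Table 2.
mpdit = {"B": 16.6, "XL": 59.3}
# Assumption: dense baseline cost taken from the SiT rows of Table 1.
dense = {"B": 23.02, "XL": 118.66}

ratios = {size: mpdit[size] / dense[size] for size in mpdit}

for size, ratio in ratios.items():
    print(f"MPDiT-{size}: {ratio:.1%} of dense GFLOPs, {1 - ratio:.1%} saved")
```
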
Table 3. Ablation on MPDiT components after 80 epochs.

| Method | Params (M) | GFLOPs | FID |
|---|---|---|---|
| DiT-B/2 | 130.0 | 23.0 | 34.84 |
| + Shared AdaIN | 90.3 | 22.9 | 35.31 |
| + Multi-token class embedding | 101.9 | 24.3 | 28.56 |
| + FNO time embedding | 101.2 | 24.3 | 24.52 |
| + MPDiT, k = 6 | 104.8 | 16.6 | 24.74 |

Table 4. Ablation on the number of local high-resolution blocks k.
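
| Configuration | Method | Params (M) | GFLOPs | FID |
|---|---|---|---|---|
| B | DiT-B/2† | 101.2 | 24.3 | 24.52 |
| B | MPDiT k = 4 | 104.8 | 13.9 | 26.94 |
| B | MPDiT k = 6 | 104.8 | 16.6 | 24.74 |
| B | MPDiT k = 8 | 104.8 | 19.3 | 24.62 |
| XL | DiT-XL/2† | 473.1 | 125.5 | 9.22 |
| XL | MPDiT k = 4 | 481.2 | 53.2 | 11.11 |
| XL | MPDiT k = 6 | 481.2 | 59.3 | 9.92 |
| XL | MPDiT k = 8 | 481.2 | 65.4 | 9.73 |
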
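
One way to read Table 4 is as a marginal cost per local block. The sketch below uses only the k, GFLOPs, and FID values from the table; `marginal_steps` is an illustrative helper, and the per-block figures are plain differences between rows, not numbers reported by the paper.

```python
# (k, GFLOPs, FID) rows copied from Table 4.
table4 = {
    "B": [(4, 13.9, 26.94), (6, 16.6, 24.74), (8, 19.3, 24.62)],
    "XL": [(4, 53.2, 11.11), (6, 59.3, 9.92), (8, 65.4, 9.73)],
}


def marginal_steps(rows):
    """Per-local-block change in GFLOPs and FID between consecutive rows."""
    steps = []
    for (k0, g0, f0), (k1, g1, f1) in zip(rows, rows[1:]):
        dk = k1 - k0
        steps.append((k0, k1, round((g1 - g0) / dk, 3), round((f1 - f0) / dk, 3)))
    return steps


for size, rows in table4.items():
    for k0, k1, d_gflops, d_fid in marginal_steps(rows):
        print(f"{size}: k {k0}->{k1}: {d_gflops:+.2f} GFLOPs, "
              f"{d_fid:+.3f} FID per extra local block")
```

On these numbers each extra local block costs the same amount of compute within a model size, while the FID gain from k = 6 to k = 8 is much smaller than the gain from k = 4 to k = 6.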

Qualitative samples

Additional ImageNet samples from the project assets.

Figure 3. ImageNet 512 qualitative samples generated by MPDiT.

Class 113 - snail

Class 33 - loggerhead turtle

Class 84 - peacock

Class 37 - box turtle

Class 88 - macaw

Class 207 - golden retriever

Class 417 - balloon

Class 947 - mushroom

Class 980 - volcano

Class 971 - bubble

Video talk

A short presentation video is included with the page assets.

BibTeX

@inproceedings{dao2026mpdit,
  title     = {MPDiT: Multi-Patch Global-to-Local Transformer Architecture For Efficient Flow Matching and Diffusion Model},
  author    = {Dao, Quan and Metaxas, Dimitris N.},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2026}
}