In online incremental learning, data continuously arrives with substantial distributional shifts, creating a significant challenge because previous samples have limited replay value when learning a new task. Prior research has typically relied on either a single adaptive centroid or multiple fixed centroids to represent each class in the latent space. However, such methods struggle when class data streams are inherently multimodal and require continual centroid updates. To overcome this, we introduce an online Mixture Model learning framework grounded in Optimal Transport theory (MMOT), where centroids evolve incrementally with new data. This approach offers two main advantages: (i) it provides a more precise characterization of complex data streams, and (ii) it enables improved class similarity estimation for unseen samples during inference through MMOT-derived centroids. Furthermore, to strengthen representation learning and mitigate catastrophic forgetting, we design a Dynamic Preservation strategy that regulates the latent space and maintains class separability over time. Experimental evaluations on benchmark datasets confirm the superior effectiveness of our proposed method.
@inproceedings{Tran_2026_CVPR,author={Tran*, Quyen and Nguyen*, Hai and Dao, Quan and Phan*, Hoang and Van, Linh and Than, Khoat and Phung, Dinh and Metaxas, Dimitris and Le, Trung},title={An Optimal Transport-driven Approach for Cultivating Latent Space in Online Incremental Learning},booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},month=jun,year={2026},pages={10851-10862},}
Multi-Patch Global-to-Local Transformer Architecture For Efficient Flow Matching and Diffusion Model
Quan Dao, and Dimitris Metaxas
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2026
Transformer architectures, particularly Diffusion Transformers (DiTs), have become widely used in diffusion and flow-matching models due to their strong performance compared to convolutional UNets. However, the isotropic design of DiTs processes the same number of patchified tokens in every block, leading to relatively heavy computation during training. In this work, we introduce a multi-patch transformer design in which early blocks operate on larger patches to capture coarse global context, while later blocks use smaller patches to refine local details. This hierarchical design can reduce computational cost by up to 50% in GFLOPs while achieving good generative performance. In addition, we propose improved designs for time and class embeddings that accelerate training convergence. Extensive experiments on the ImageNet dataset demonstrate the effectiveness of our architectural choices.
@inproceedings{Dao_2026_CVPR,author={Dao, Quan and Metaxas, Dimitris},title={Multi-Patch Global-to-Local Transformer Architecture For Efficient Flow Matching and Diffusion Model},booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},month=jun,year={2026},pages={33000-33011},}
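The abstract above attributes the savings to early blocks seeing fewer, larger patches. The following is a minimal sketch of that token-count arithmetic only, not the paper's architecture: the helper names `num_tokens` and `attention_cost`, the 32x32 latent size, and the patch sizes 4 and 2 are all illustrative assumptions.

```python
def num_tokens(latent_size: int, patch_size: int) -> int:
    """Tokens produced by splitting a square input into non-overlapping square patches."""
    if latent_size % patch_size != 0:
        raise ValueError("latent_size must be divisible by patch_size")
    return (latent_size // patch_size) ** 2


def attention_cost(tokens: int, width: int) -> int:
    """Rough self-attention cost, proportional to tokens^2 * width (constants dropped)."""
    return tokens * tokens * width


# Hypothetical setting: a 32x32 latent with a coarse and a fine patch size.
coarse_tokens = num_tokens(32, 4)  # larger patches, fewer tokens
fine_tokens = num_tokens(32, 2)    # smaller patches, more tokens

print(coarse_tokens, fine_tokens)
print(attention_cost(fine_tokens, 8) // attention_cost(coarse_tokens, 8))
```

Doubling the patch side divides the token count by four, so under this rough model the attention term of a coarse block is sixteen times cheaper than that of a fine block of the same width. The overall saving reported in the abstract depends on how many blocks use each patch size, which this sketch does not model.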
Discrete Noise Inversion for Next-scale Autoregressive Text-based Image Editing
Quan Dao*, Xiaoxiao He*, Ligong Han, Ngan Hoai Nguyen, Amin Heyrani Nobar, Faez Ahmed, Han Zhang, Viet Anh Nguyen, and Dimitris Metaxas
In European Conference on Computer Vision, Jun 2026
Discrete diffusion models have achieved success in tasks like image generation and masked language modeling but face limitations in controlled content editing. We introduce DICE (Discrete Inversion for Controllable Editing), the first approach to enable precise inversion for discrete diffusion models, including multinomial diffusion and masked generative models. By recording noise sequences and masking patterns during the reverse diffusion process, DICE enables accurate reconstruction and flexible editing of discrete data without the need for predefined masks or attention manipulation. We demonstrate the effectiveness of DICE across both image and text domains, evaluating it on models such as VQ-Diffusion, Paella, and RoBERTa. Our results show that DICE preserves high data fidelity while enhancing editing capabilities, offering new opportunities for fine-grained content manipulation in discrete spaces.
@inproceedings{dao2025discrete,title={Discrete Noise Inversion for Next-scale Autoregressive Text-based Image Editing},author={Dao*, Quan and He*, Xiaoxiao and Han, Ligong and Nguyen, Ngan Hoai and Nobar, Amin Heyrani and Ahmed, Faez and Zhang, Han and Nguyen, Viet Anh and Metaxas, Dimitris},booktitle={European Conference on Computer Vision},year={2026},}
2025
AutoEdit: Automatic Hyperparameter Tuning for Image Editing
Chau Pham, Quan Dao, Mahesh Bhosale, Yunjie Tian, Dimitris Metaxas, and David Doermann
In The Thirty-ninth Annual Conference on Neural Information Processing Systems, Jun 2025
Recent advances in diffusion models have revolutionized text-guided image editing, yet existing editing methods face critical challenges in hyperparameter identification. To achieve reasonable editing performance, these methods often require the user to brute-force tune multiple interdependent hyperparameters, such as inversion timesteps and attention modification. This process incurs high computational costs due to the huge hyperparameter search space. We frame the search for optimal editing hyperparameters as a sequential decision-making task within the diffusion denoising process. Specifically, we propose a reinforcement learning framework, which establishes a Markov Decision Process that dynamically adjusts hyperparameters across denoising steps, integrating editing objectives into a reward function. The method achieves time efficiency through proximal policy optimization while maintaining optimal hyperparameter configurations. Experiments demonstrate significant reduction in search time and computational overhead compared to existing brute-force approaches, advancing the practical deployment of diffusion-based image editing frameworks in the real world.
@inproceedings{pham2025autoedit,title={AutoEdit: Automatic Hyperparameter Tuning for Image Editing},author={Pham, Chau and Dao, Quan and Bhosale, Mahesh and Tian, Yunjie and Metaxas, Dimitris and Doermann, David},booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},year={2025},}
Improved Training Technique for Latent Consistency Models
Quan Dao*, Khanh Doan*, Di Liu, Trung Le, and Dimitris Metaxas
In International Conference on Learning Representations, Jun 2025
Consistency models are a new family of generative models capable of producing high-quality samples in either a single step or multiple steps. Recently, consistency models have demonstrated impressive performance, achieving results on par with diffusion models in the pixel space. However, the success of scaling consistency training to large-scale datasets, particularly for text-to-image and video generation tasks, is determined by performance in the latent space. In this work, we analyze the statistical differences between pixel and latent spaces, discovering that latent data often contains highly impulsive outliers, which significantly degrade the performance of iCT in the latent space. To address this, we replace Pseudo-Huber losses with Cauchy losses, effectively mitigating the impact of outliers. Additionally, we introduce a diffusion loss at early timesteps and employ optimal transport (OT) coupling to further enhance performance. Lastly, we introduce the adaptive scaling-c scheduler to manage the robust training process and adopt Non-scaling LayerNorm in the architecture to better capture the statistics of the features and reduce outlier impact. With these strategies, we successfully train latent consistency models capable of high-quality sampling with one or two steps, significantly narrowing the performance gap between latent consistency and diffusion models.
@inproceedings{dao2024improvedict,title={Improved Training Technique for Latent Consistency Models},author={Dao*, Quan and Doan*, Khanh and Liu, Di and Le, Trung and Metaxas, Dimitris},booktitle={International Conference on Learning Representations},year={2025},}
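The abstract above motivates swapping Pseudo-Huber for Cauchy losses by their behavior on outliers. The sketch below compares commonly used scalar forms of the two losses. The exact parameterization and the role of the scale `c` in the paper are assumptions here, and the function names are illustrative.

```python
import math


def pseudo_huber(residual: float, c: float) -> float:
    """Pseudo-Huber loss: quadratic near zero, roughly linear for large residuals."""
    return math.sqrt(residual * residual + c * c) - c


def cauchy(residual: float, c: float) -> float:
    """Cauchy (Lorentzian) loss in a common form: grows only logarithmically."""
    return math.log(1.0 + (residual * residual) / (2.0 * c * c))


# Compare how each loss reacts as a residual becomes an extreme outlier.
for r in (0.1, 1.0, 10.0, 100.0, 1000.0):
    print(r, pseudo_huber(r, 1.0), cauchy(r, 1.0))
```

For large residuals the Pseudo-Huber value keeps growing in proportion to the residual, while the Cauchy value grows only with its logarithm, so a single impulsive outlier contributes far less to the total loss.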
Self-Corrected Flow Distillation for Consistent One-Step and Few-Step Text-to-Image Generation
Quan Dao*, Hao Phung*, Trung Dao, Dimitris Metaxas, and Anh Tran
In Association for the Advancement of Artificial Intelligence, Jun 2025
Flow matching has emerged as a promising framework for training generative models, demonstrating impressive empirical performance while offering relative ease of training compared to diffusion-based models. However, this method still requires numerous function evaluations in the sampling process. To address this limitation, we introduce a self-corrected flow distillation method that effectively integrates consistency models and adversarial training within the flow-matching framework. This work is among the first to achieve consistent generation quality in both few-step and one-step sampling. Our extensive experiments validate the effectiveness of our method, yielding superior results both quantitatively and qualitatively on CelebA-HQ and zero-shot benchmarks on the COCO dataset.
@inproceedings{dao2024self,title={Self-Corrected Flow Distillation for Consistent One-Step and Few-Step Text-to-Image Generation},author={Dao*, Quan and Phung*, Hao and Dao, Trung and Metaxas, Dimitris and Tran, Anh},booktitle={Association for the Advancement of Artificial Intelligence},year={2025},}
2024
DICE: Discrete Inversion Enabling Controllable Editing for Multinomial Diffusion and Masked Generative Models
Xiaoxiao He*, Quan Dao*, Ligong Han, Song Wen, Minhao Bai, Di Liu, Han Zhang, Martin Renqiang Min, Felix Juefei-Xu, Chaowei Tan, et al.
Discrete diffusion models have achieved success in tasks like image generation and masked language modeling but face limitations in controlled content editing. We introduce DICE (Discrete Inversion for Controllable Editing), the first approach to enable precise inversion for discrete diffusion models, including multinomial diffusion and masked generative models. By recording noise sequences and masking patterns during the reverse diffusion process, DICE enables accurate reconstruction and flexible editing of discrete data without the need for predefined masks or attention manipulation. We demonstrate the effectiveness of DICE across both image and text domains, evaluating it on models such as VQ-Diffusion, Paella, and RoBERTa. Our results show that DICE preserves high data fidelity while enhancing editing capabilities, offering new opportunities for fine-grained content manipulation in discrete spaces.
@article{he2024dice,title={DICE: Discrete Inversion Enabling Controllable Editing for Multinomial Diffusion and Masked Generative Models},author={He*, Xiaoxiao and Dao*, Quan and Han, Ligong and Wen, Song and Bai, Minhao and Liu, Di and Zhang, Han and Min, Martin Renqiang and Juefei-Xu, Felix and Tan, Chaowei and others},journal={arXiv preprint arXiv:2410.08207},year={2024},}
DiMSUM: Diffusion Mamba - A Scalable and Unified Spatial-Frequency Method for Image Generation
Hao Phung*, Quan Dao*, Trung Dao, Hoang Phan, Dimitris Metaxas, and Anh Tran
In The Thirty-eighth Annual Conference on Neural Information Processing Systems, Jun 2024
We introduce a novel state-space architecture for diffusion models, effectively harnessing spatial and frequency information to enhance the inductive bias towards local features in input images for image generation tasks. While state-space networks, including Mamba, a revolutionary advancement in recurrent neural networks, typically scan input sequences from left to right, they face difficulties in designing effective scanning strategies, especially in the processing of image data. Our method demonstrates that integrating wavelet transformation into Mamba enhances the local structure awareness of visual inputs and better captures long-range relations of frequencies by disentangling them into wavelet subbands, representing both low- and high-frequency components. These wavelet-based outputs are then processed and fused with the original Mamba outputs through a cross-attention fusion layer, combining both spatial and frequency information to optimize the order awareness of state-space models, which is essential for the details and overall quality of image generation. In addition, we introduce a globally-shared transformer to boost the performance of Mamba, harnessing its ability to capture global relationships. Through extensive experiments on standard benchmarks, our method demonstrates superior results compared to DiT and DIFFUSSM, achieving faster training convergence and delivering high-quality outputs.
@inproceedings{phung2024dimsum,title={DiMSUM: Diffusion Mamba - A Scalable and Unified Spatial-Frequency Method for Image Generation},author={Phung*, Hao and Dao*, Quan and Dao, Trung and Phan, Hoang and Metaxas, Dimitris and Tran, Anh},booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},year={2024},}
A High-Quality Robust Diffusion Framework for Corrupted Dataset
Quan Dao*, Binh Ta*, Tung Pham, and Anh Tran
In European Conference on Computer Vision, Jun 2024
Developing image-generative models that are robust to outliers in the training process has recently drawn attention from the research community. Because unbalanced optimal transport (UOT) is easy to integrate into an adversarial framework, existing works focus mainly on developing robust frameworks for generative adversarial models (GANs). Meanwhile, diffusion models have recently outperformed GANs on various tasks and datasets. However, to our knowledge, none of them is robust to corrupted datasets. Motivated by DDGAN, our work introduces the first robust-to-outlier diffusion model. We suggest substituting a UOT-based generative model for the GAN in DDGAN to learn the backward diffusion process. Additionally, we demonstrate that the Lipschitz property of divergence in our framework contributes to more stable training convergence. Remarkably, our method not only exhibits robustness to corrupted datasets but also achieves superior performance on clean datasets.
@inproceedings{dao2025high,title={A High-Quality Robust Diffusion Framework for Corrupted Dataset},author={Dao*, Quan and Ta*, Binh and Pham, Tung and Tran, Anh},booktitle={European Conference on Computer Vision},pages={107--123},year={2024},organization={Springer},}
2023
Flow Matching in Latent Space
Quan Dao*, Hao Phung*, Binh Nguyen, and Anh Tran
Flow matching is a recent framework for training generative models that exhibits impressive empirical performance while being relatively easy to train compared with diffusion-based models. Despite its advantageous properties, prior methods still face the challenges of expensive computation and a large number of function evaluations by off-the-shelf solvers in the pixel space. Furthermore, although latent-based generative methods have shown great success in recent years, this particular model type remains underexplored in this area. In this work, we propose to apply flow matching in the latent spaces of pretrained autoencoders, which offers improved computational efficiency and scalability for high-resolution image synthesis. This enables flow-matching training on constrained computational resources while maintaining quality and flexibility. Additionally, our work stands as a pioneering contribution in the integration of various conditions into flow matching for conditional generation tasks, including label-conditioned image generation, image inpainting, and semantic-to-image generation. Through extensive experiments, our approach demonstrates its effectiveness in both quantitative and qualitative results on various datasets, such as CelebA-HQ, FFHQ, LSUN Church & Bedroom, and ImageNet. We also provide a theoretical control of the Wasserstein-2 distance between the reconstructed latent flow distribution and true data distribution, showing it is upper-bounded by the latent flow matching objective.
@article{dao2023flow,title={Flow Matching in Latent Space},author={Dao*, Quan and Phung*, Hao and Nguyen, Binh and Tran, Anh},journal={arXiv preprint arXiv:2307.08698},year={2023},}
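To make the flow-matching objective and the cost of "function evaluations" mentioned in the abstract above more concrete, here is a minimal sketch using one common choice, the straight-line path between a noise sample and a data (or latent) sample. Whether the paper uses exactly this parameterization is an assumption, and the names `interpolate`, `target_velocity`, and `euler_sample` are hypothetical.

```python
def interpolate(x0, x1, t):
    """Point on the straight-line path from x0 (noise) to x1 (data latent) at time t."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target for the velocity network along the straight-line path."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(x0, velocity_fn, steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed Euler steps.

    Each step costs one call to velocity_fn, i.e. one function evaluation.
    """
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy check with an oracle velocity standing in for a trained network.
noise = [0.0, 1.0]
latent = [2.0, -1.0]
oracle = lambda x, t: target_velocity(noise, latent)
print(interpolate(noise, latent, 0.5))
print(euler_sample(noise, oracle, steps=4))
```

In a latent setup, `x1` would be the autoencoder encoding of an image and the final sample would be passed through the decoder. Both the encoder and decoder are omitted here.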
Anti-DreamBooth: Protecting users from personalized text-to-image synthesis
Thanh Van Le*, Hao Phung*, Thuan Hoang Nguyen*, Quan Dao*, Ngoc Tran, and Anh Tran
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Oct 2023
Text-to-image diffusion models are nothing short of a revolution, allowing anyone, even without design skills, to create realistic images from simple text inputs. With powerful personalization tools like DreamBooth, they can generate images of a specific person just by learning from a few reference images of that person. However, when misused, such a powerful and convenient tool can produce fake news or disturbing content targeting any individual victim, posing a severe negative social impact. In this paper, we explore a defense system called Anti-DreamBooth against such malicious use of DreamBooth. The system aims to add subtle noise perturbation to each user’s image before publishing in order to disrupt the generation quality of any DreamBooth model trained on these perturbed images. We investigate a wide range of algorithms for perturbation optimization and extensively evaluate them on two facial datasets over various text-to-image model versions. Despite the complicated formulation of DreamBooth and diffusion-based text-to-image models, our methods effectively defend users from the malicious use of those models. Their effectiveness withstands even adverse conditions, such as model or prompt/term mismatching between training and testing.
@inproceedings{van2023anti,title={Anti-DreamBooth: Protecting users from personalized text-to-image synthesis},author={Van Le*, Thanh and Phung*, Hao and Nguyen*, Thuan Hoang and Dao*, Quan and Tran, Ngoc and Tran, Anh},booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},month=oct,year={2023},}
Wavelet Diffusion Models Are Fast and Scalable Image Generators
Hao Phung*, Quan Dao*, and Anh Tran
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2023
Diffusion models are rising as a powerful solution for high-fidelity image generation, exceeding GANs in quality in many circumstances. However, their slow training and inference speed is a huge bottleneck, blocking them from being used in real-time applications. A recent DiffusionGAN method significantly decreases the models’ running time by reducing the number of sampling steps from thousands to several, but its speed still lags far behind that of its GAN counterparts. This paper aims to reduce the speed gap by proposing a novel wavelet-based diffusion scheme. We extract low- and high-frequency components from both image and feature levels via wavelet decomposition and adaptively handle these components for faster processing while maintaining good generation quality. Furthermore, we propose to use a reconstruction term, which effectively boosts the model training convergence. Experimental results on CelebA-HQ, CIFAR-10, LSUN-Church, and STL-10 datasets show that our solution is a stepping stone toward real-time and high-fidelity diffusion models.
@inproceedings{Phung_2023_CVPR,author={Phung*, Hao and Dao*, Quan and Tran, Anh},title={Wavelet Diffusion Models Are Fast and Scalable Image Generators},booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},month=jun,year={2023},pages={10199-10208},}
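The wavelet decomposition mentioned in the abstract above can be illustrated with a single-level 2D Haar transform, which splits an image into one low-frequency subband and three high-frequency subbands at half the resolution. This is a generic sketch, not the paper's implementation: the Haar basis, the subband naming (conventions for LH and HL vary), and the function names are assumptions.

```python
def haar_2d(image):
    """Single-level orthonormal 2D Haar transform of an even-sized grid.

    Returns four half-resolution subbands (LL, LH, HL, HH).
    """
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("image height and width must be even")
    ll, lh, hl, hh = [], [], [], []
    for r in range(0, rows, 2):
        ll_row, lh_row, hl_row, hh_row = [], [], [], []
        for c in range(0, cols, 2):
            a, b = image[r][c], image[r][c + 1]
            d, e = image[r + 1][c], image[r + 1][c + 1]
            ll_row.append((a + b + d + e) / 2.0)  # local average (low frequency)
            lh_row.append((a - b + d - e) / 2.0)  # difference across columns
            hl_row.append((a + b - d - e) / 2.0)  # difference across rows
            hh_row.append((a - b - d + e) / 2.0)  # diagonal difference
        ll.append(ll_row)
        lh.append(lh_row)
        hl.append(hl_row)
        hh.append(hh_row)
    return ll, lh, hl, hh


def inverse_haar_2d(ll, lh, hl, hh):
    """Invert haar_2d, rebuilding the full-resolution grid from the four subbands."""
    image = []
    for r in range(len(ll)):
        top, bottom = [], []
        for c in range(len(ll[0])):
            s, x, y, z = ll[r][c], lh[r][c], hl[r][c], hh[r][c]
            top.extend([(s + x + y + z) / 2.0, (s - x + y - z) / 2.0])
            bottom.extend([(s + x - y - z) / 2.0, (s - x - y + z) / 2.0])
        image.append(top)
        image.append(bottom)
    return image


grid = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
bands = haar_2d(grid)
print(bands[0])
print(inverse_haar_2d(*bands))
```

Each subband has a quarter of the original pixels, so a network operating on the stacked subbands sees a four times smaller spatial grid, and the transform is exactly invertible.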