Quan Dao

I am a second-year PhD student at Rutgers University, supervised by Distinguished Prof. Dimitris Metaxas. My research focuses on generative models, specifically diffusion models and visual autoregressive models, with a primary emphasis on fundamental research. For diffusion models, I concentrate on developing efficient and robust training methodologies. During my PhD, I was very lucky to do an internship at Apple MLR. Previously, I was a Research Resident under the supervision of Dr. Tuan Anh Tran at Qualcomm AI Research, Vietnam (formerly VinAI Research), where I spent two wonderful years. I received a bachelor's degree in computer science from Monash University in 2020.

news

Feb 28, 2025	AutoEdit got accepted at NeurIPS 2025. This paper proposes an RL-based method for selecting the hyperparameters of diffusion editing techniques.
Feb 28, 2025	DICE got accepted at CVPR 2025. This paper proposes an editing technique for discrete diffusion models.
Jan 22, 2025	Improved Latent Consistency Model got accepted at ICLR 2025. This paper proposes a series of novel techniques, including Cauchy loss, OT coupling, an adaptive robust scale scheduler, and a diffusion loss at early timesteps, to efficiently train latent consistency models from scratch. Our technique bridges the performance gap between LDM and LCM training. (This is the first work to discover the instability of consistency models in latent space caused by impulsive outliers.)
Dec 10, 2024	SCFlow got accepted at AAAI 2025. This is the first work attempting to distill a flow matching model into one-step and few-step generation. With SCFlow, we achieve consistent one-step and few-step generation: starting from the same noise, the final generated image is identical no matter how many NFEs are used for sampling.
Sep 23, 2024	Yummy DiMSUM got accepted at NeurIPS 2024. DiMSUM proposes a novel hybrid transformer-Mamba architecture that allows faster training convergence for diffusion/flow matching models and achieves SoTA image generation.
Jul 21, 2024	RDUOT got accepted at ECCV 2024. This paper combines the UOT generative framework with diffusion noising to train a fast-converging and robust generative framework.
Jul 13, 2023	Anti-DreamBooth got accepted at ICCV 2023. Anti-DreamBooth adds small, imperceptible noise to your images to break the malicious exploitation of DreamBooth on your images.
Feb 26, 2023	My first paper, WaveDiff, got accepted at CVPR 2023. WaveDiff proposes a frequency-aware U-Net architecture that allows fast training convergence for the DiffusionGAN framework.

selected publications

AutoEdit: Automatic Hyperparameter Tuning for Image Editing

Chau Pham, Quan Dao, Mahesh Bhosale, Yunjie Tian, Dimitris Metaxas, and David Doermann

In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

Recent advances in diffusion models have revolutionized text-guided image editing, yet existing editing methods face critical challenges in hyperparameter identification. To achieve reasonable editing performance, these methods often require the user to brute-force tune multiple interdependent hyperparameters, such as inversion timesteps and attention modification. This process incurs high computational costs due to the huge hyperparameter search space. We cast the search for optimal editing hyperparameters as a sequential decision-making task within the diffusion denoising process. Specifically, we propose a reinforcement learning framework, which establishes a Markov Decision Process that dynamically adjusts hyperparameters across denoising steps, integrating editing objectives into a reward function. The method achieves time efficiency through proximal policy optimization while maintaining optimal hyperparameter configurations. Experiments demonstrate a significant reduction in search time and computational overhead compared to existing brute-force approaches, advancing the practical deployment of diffusion-based image editing frameworks in the real world.
@inproceedings{pham2025autoedit,
  title = {AutoEdit: Automatic Hyperparameter Tuning for Image Editing},
  author = {Pham, Chau and Dao, Quan and Bhosale, Mahesh and Tian, Yunjie and Metaxas, Dimitris and Doermann, David},
  booktitle = {The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year = {2025},
}
Discrete Noise Inversion for Next-scale Autoregressive Text-based Image Editing

Quan Dao, Xiaoxiao He, Ligong Han, Ngan Hoai Nguyen, Amin Heyrani Nobar, Faez Ahmed, Han Zhang, Viet Anh Nguyen, and Dimitris Metaxas

arXiv preprint arXiv:2509.01984, 2025

Discrete diffusion models have achieved success in tasks like image generation and masked language modeling but face limitations in controlled content editing. We introduce DICE (Discrete Inversion for Controllable Editing), the first approach to enable precise inversion for discrete diffusion models, including multinomial diffusion and masked generative models. By recording noise sequences and masking patterns during the reverse diffusion process, DICE enables accurate reconstruction and flexible editing of discrete data without the need for predefined masks or attention manipulation. We demonstrate the effectiveness of DICE across both image and text domains, evaluating it on models such as VQ-Diffusion, Paella, and RoBERTa. Our results show that DICE preserves high data fidelity while enhancing editing capabilities, offering new opportunities for fine-grained content manipulation in discrete spaces.
@article{dao2025discrete,
  title = {Discrete Noise Inversion for Next-scale Autoregressive Text-based Image Editing},
  author = {Dao, Quan and He, Xiaoxiao and Han, Ligong and Nguyen, Ngan Hoai and Nobar, Amin Heyrani and Ahmed, Faez and Zhang, Han and Nguyen, Viet Anh and Metaxas, Dimitris},
  journal = {arXiv preprint arXiv:2509.01984},
  year = {2025},
}
Improved Training Technique for Latent Consistency Models

Quan Dao*, Khanh Doan*, Di Liu, Trung Le, and Dimitris Metaxas

In International Conference on Learning Representations, 2025

Consistency models are a new family of generative models capable of producing high-quality samples in either a single step or multiple steps. Recently, consistency models have demonstrated impressive performance, achieving results on par with diffusion models in the pixel space. However, the success of scaling consistency training to large-scale datasets, particularly for text-to-image and video generation tasks, is determined by performance in the latent space. In this work, we analyze the statistical differences between pixel and latent spaces, discovering that latent data often contains highly impulsive outliers, which significantly degrade the performance of iCT in the latent space. To address this, we replace Pseudo-Huber losses with Cauchy losses, effectively mitigating the impact of outliers. Additionally, we introduce a diffusion loss at early timesteps and employ optimal transport (OT) coupling to further enhance performance. Lastly, we introduce the adaptive scaling-c scheduler to manage the robust training process and adopt Non-scaling LayerNorm in the architecture to better capture the statistics of the features and reduce outlier impact. With these strategies, we successfully train latent consistency models capable of high-quality sampling with one or two steps, significantly narrowing the performance gap between latent consistency and diffusion models.
@inproceedings{dao2024improvedict,
  title = {Improved Training Technique for Latent Consistency Models},
  author = {Dao*, Quan and Doan*, Khanh and Liu, Di and Le, Trung and Metaxas, Dimitris},
  booktitle = {International Conference on Learning Representations},
  year = {2025},
}
DICE: Discrete Inversion Enabling Controllable Editing for Multinomial Diffusion and Masked Generative Models

Xiaoxiao He, Ligong Han, Quan Dao, Song Wen, Minhao Bai, Di Liu, Han Zhang, Martin Renqiang Min, Felix Juefei-Xu, Chaowei Tan, and 1 more author

arXiv preprint arXiv:2410.08207, 2024

Discrete diffusion models have achieved success in tasks like image generation and masked language modeling but face limitations in controlled content editing. We introduce DICE (Discrete Inversion for Controllable Editing), the first approach to enable precise inversion for discrete diffusion models, including multinomial diffusion and masked generative models. By recording noise sequences and masking patterns during the reverse diffusion process, DICE enables accurate reconstruction and flexible editing of discrete data without the need for predefined masks or attention manipulation. We demonstrate the effectiveness of DICE across both image and text domains, evaluating it on models such as VQ-Diffusion, Paella, and RoBERTa. Our results show that DICE preserves high data fidelity while enhancing editing capabilities, offering new opportunities for fine-grained content manipulation in discrete spaces.
@article{he2024dice,
  title = {DICE: Discrete Inversion Enabling Controllable Editing for Multinomial Diffusion and Masked Generative Models},
  author = {He, Xiaoxiao and Han, Ligong and Dao, Quan and Wen, Song and Bai, Minhao and Liu, Di and Zhang, Han and Min, Martin Renqiang and Juefei-Xu, Felix and Tan, Chaowei and others},
  journal = {arXiv preprint arXiv:2410.08207},
  year = {2024},
}
DiMSUM: Diffusion Mamba - A Scalable and Unified Spatial-Frequency Method for Image Generation

Hao Phung*, Quan Dao*, Trung Dao, Hoang Phan, Dimitris Metaxas, and Anh Tran

In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

We introduce a novel state-space architecture for diffusion models, effectively harnessing spatial and frequency information to enhance the inductive bias towards local features in input images for image generation tasks. While state-space networks, including Mamba, a revolutionary advancement in recurrent neural networks, typically scan input sequences from left to right, they face difficulties in designing effective scanning strategies, especially in the processing of image data. Our method demonstrates that integrating wavelet transformation into Mamba enhances the local structure awareness of visual inputs and better captures long-range relations of frequencies by disentangling them into wavelet subbands, representing both low- and high-frequency components. These wavelet-based outputs are then processed and seamlessly fused with the original Mamba outputs through a cross-attention fusion layer, combining both spatial and frequency information to optimize the order awareness of state-space models which is essential for the details and overall quality of image generation. Besides, we introduce a globally-shared transformer to supercharge the performance of Mamba, harnessing its exceptional power to capture global relationships. Through extensive experiments on standard benchmarks, our method demonstrates superior results compared to DiT and DIFFUSSM, achieving faster training convergence and delivering high-quality outputs.
@inproceedings{phung2024dimsum,
  title = {DiMSUM: Diffusion Mamba - A Scalable and Unified Spatial-Frequency Method for Image Generation},
  author = {Phung*, Hao and Dao*, Quan and Dao, Trung and Phan, Hoang and Metaxas, Dimitris and Tran, Anh},
  booktitle = {The Thirty-eighth Annual Conference on Neural Information Processing Systems},
  year = {2024},
}
Self-Corrected Flow Distillation for Consistent One-Step and Few-Step Text-to-Image Generation

Quan Dao*, Hao Phung*, Trung Dao, Dimitris Metaxas, and Anh Tran

In Proceedings of the AAAI Conference on Artificial Intelligence, 2025

Flow matching has emerged as a promising framework for training generative models, demonstrating impressive empirical performance while offering relative ease of training compared to diffusion-based models. However, this method still requires numerous function evaluations in the sampling process. To address these limitations, we introduce a self-corrected flow distillation method that effectively integrates consistency models and adversarial training within the flow-matching framework. This work is a pioneer in achieving consistent generation quality in both few-step and one-step sampling. Our extensive experiments validate the effectiveness of our method, yielding superior results both quantitatively and qualitatively on CelebA-HQ and zero-shot benchmarks on the COCO dataset.
@inproceedings{dao2024self,
  title = {Self-Corrected Flow Distillation for Consistent One-Step and Few-Step Text-to-Image Generation},
  author = {Dao*, Quan and Phung*, Hao and Dao, Trung and Metaxas, Dimitris and Tran, Anh},
  booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
  year = {2025},
}
A High-Quality Robust Diffusion Framework for Corrupted Dataset

Quan Dao*, Binh Ta*, Tung Pham, and Anh Tran

In European Conference on Computer Vision, 2024

Developing image-generative models that are robust to outliers in the training process has recently drawn attention from the research community. Due to the ease of integrating unbalanced optimal transport (UOT) into adversarial frameworks, existing works focus mainly on developing robust frameworks for generative adversarial networks (GANs). Meanwhile, diffusion models have recently outperformed GANs in various tasks and datasets. However, to our knowledge, none of them are robust to corrupted datasets. Motivated by DDGAN, our work introduces the first robust-to-outlier diffusion. We suggest replacing the GAN in DDGAN with a UOT-based generative model to learn the backward diffusion process. Additionally, we demonstrate that the Lipschitz property of divergence in our framework contributes to more stable training convergence. Remarkably, our method not only exhibits robustness to corrupted datasets but also achieves superior performance on clean datasets.
@inproceedings{dao2025high,
  title = {A High-Quality Robust Diffusion Framework for Corrupted Dataset},
  author = {Dao*, Quan and Ta*, Binh and Pham, Tung and Tran, Anh},
  booktitle = {European Conference on Computer Vision},
  pages = {107--123},
  year = {2024},
  organization = {Springer},
}
Flow Matching in Latent Space

Quan Dao*, Hao Phung*, Binh Nguyen, and Anh Tran

arXiv preprint arXiv:2307.08698, 2023

Flow matching is a recent framework to train generative models that exhibits impressive empirical performance while being relatively easier to train compared with diffusion-based models. Despite its advantageous properties, prior methods still face the challenges of expensive computing and a large number of function evaluations of off-the-shelf solvers in the pixel space. Furthermore, although latent-based generative methods have shown great success in recent years, this particular model type remains underexplored in this area. In this work, we propose to apply flow matching in the latent spaces of pretrained autoencoders, which offers improved computational efficiency and scalability for high-resolution image synthesis. This enables flow-matching training on constrained computational resources while maintaining their quality and flexibility. Additionally, our work stands as a pioneering contribution in the integration of various conditions into flow matching for conditional generation tasks, including label-conditioned image generation, image inpainting, and semantic-to-image generation. Through extensive experiments, our approach demonstrates its effectiveness in both quantitative and qualitative results on various datasets, such as CelebA-HQ, FFHQ, LSUN Church & Bedroom, and ImageNet. We also provide a theoretical control of the Wasserstein-2 distance between the reconstructed latent flow distribution and true data distribution, showing it is upper-bounded by the latent flow matching objective.
@article{dao2023flow,
  title = {Flow Matching in Latent Space},
  author = {Dao*, Quan and Phung*, Hao and Nguyen, Binh and Tran, Anh},
  journal = {arXiv preprint arXiv:2307.08698},
  year = {2023},
}
Anti-DreamBooth: Protecting users from personalized text-to-image synthesis

Thanh Van Le*, Hao Phung*, Thuan Hoang Nguyen*, Quan Dao*, Ngoc Tran, and Anh Tran

In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Oct 2023

Text-to-image diffusion models are nothing short of a revolution, allowing anyone, even without design skills, to create realistic images from simple text inputs. With powerful personalization tools like DreamBooth, they can generate images of a specific person just by learning from a few of his/her reference images. However, when misused, such a powerful and convenient tool can produce fake news or disturbing content targeting any individual victim, posing a severe negative social impact. In this paper, we explore a defense system called Anti-DreamBooth against such malicious use of DreamBooth. The system aims to add subtle noise perturbation to each user's image before publishing in order to disrupt the generation quality of any DreamBooth model trained on these perturbed images. We investigate a wide range of algorithms for perturbation optimization and extensively evaluate them on two facial datasets over various text-to-image model versions. Despite the complicated formulation of DreamBooth and diffusion-based text-to-image models, our methods effectively defend users from the malicious use of those models. Their effectiveness withstands even adverse conditions, such as model or prompt/term mismatching between training and testing.
@inproceedings{van2023anti,
  title = {Anti-DreamBooth: Protecting users from personalized text-to-image synthesis},
  author = {Van Le*, Thanh and Phung*, Hao and Nguyen*, Thuan Hoang and Dao*, Quan and Tran, Ngoc and Tran, Anh},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  month = oct,
  year = {2023},
}
Wavelet Diffusion Models Are Fast and Scalable Image Generators

Hao Phung*, Quan Dao*, and Anh Tran

In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2023

Diffusion models are rising as a powerful solution for high-fidelity image generation, exceeding GANs in quality in many circumstances. However, their slow training and inference speed is a huge bottleneck, blocking them from being used in real-time applications. A recent DiffusionGAN method significantly decreases the models' running time by reducing the number of sampling steps from thousands to several, but its speed still largely lags behind that of its GAN counterparts. This paper aims to reduce the speed gap by proposing a novel wavelet-based diffusion scheme. We extract low- and high-frequency components from both image and feature levels via wavelet decomposition and adaptively handle these components for faster processing while maintaining good generation quality. Furthermore, we propose to use a reconstruction term, which effectively boosts the model training convergence. Experimental results on CelebA-HQ, CIFAR-10, LSUN-Church, and STL-10 datasets prove our solution is a stepping-stone to offering real-time and high-fidelity diffusion models.
@inproceedings{Phung_2023_CVPR,
  author = {Phung*, Hao and Dao*, Quan and Tran, Anh},
  title = {Wavelet Diffusion Models Are Fast and Scalable Image Generators},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = jun,
  year = {2023},
  pages = {10199--10208},
}