Recent advances in large-scale text-to-image models have revolutionized creative fields by generating visually captivating outputs from textual prompts. However, while traditional photography offers precise control over camera settings that shape visual aesthetics, such as depth of field, current diffusion models typically rely on prompt engineering to mimic such effects. This approach often yields crude approximations and inadvertently alters the scene content.
In this work, we propose Bokeh Diffusion, a scene-consistent bokeh control framework that explicitly conditions a diffusion model on a physical defocus blur parameter. By grounding depth-of-field adjustments, our method preserves the underlying scene structure as the level of blur is varied. To overcome the scarcity of paired real-world images captured under different camera settings, we introduce a hybrid training pipeline that aligns in-the-wild images with synthetic blur augmentations.
Extensive experiments demonstrate that our approach not only achieves flexible, lens-like blur control but also supports applications such as real image editing via inversion.
Bokeh Diffusion combines three key components to produce lens-like bokeh without altering scene structure:
(1) Hybrid Dataset Pipeline: We merge real in-the-wild images (for realistic bokeh and diverse scenes) with synthetic blur augmentations (for contrastive pairs). This anchors defocus realism while providing aligned training examples that differ only in blur level (see the augmentation sketch after this list).
(2) Defocus Blur Conditioning: We inject a physically interpretable blur parameter (ranging from 0 to 30) via decoupled cross-attention at the deeper layers of the U-Net, preserving semantic features while controlling the defocus level (see the conditioning sketch after this list).
(3) Grounded Self-Attention: We designate a "pivot" image to anchor the scene layout, ensuring consistent object placement across different blur levels and preventing unintended content shifts when adjusting defocus (see the grounded-attention sketch after this list).
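To make the synthetic side of the hybrid dataset concrete, below is a minimal sketch of a defocus augmentation: a sharp in-the-wild image is split into depth layers, and each layer is convolved with a disc kernel whose radius grows with its distance from the focus plane and with the target blur level. The function names (synthetic_bokeh, disc_kernel), the layered approximation, and the reliance on a monocular depth map are assumptions for illustration; the actual augmentation pipeline may use a dedicated bokeh renderer.

import numpy as np
import cv2  # OpenCV, used here for 2D convolution

def disc_kernel(radius: int) -> np.ndarray:
    # Normalized disc (circle-of-confusion) kernel with the given pixel radius.
    if radius < 1:
        return np.array([[1.0]], dtype=np.float32)
    y, x = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    kernel = (x ** 2 + y ** 2 <= radius ** 2).astype(np.float32)
    return kernel / kernel.sum()

def synthetic_bokeh(image, depth, focus_depth, blur_level, num_layers=8):
    # image: HxWx3 float array; depth: HxW array (larger = farther);
    # focus_depth: depth of the in-focus plane; blur_level: target defocus (0-30).
    out = np.zeros_like(image, dtype=np.float32)
    weight = np.zeros(depth.shape, dtype=np.float32)
    edges = np.linspace(depth.min(), depth.max(), num_layers + 1)
    for i in range(num_layers):
        mask = ((depth >= edges[i]) & (depth <= edges[i + 1])).astype(np.float32)
        if mask.sum() == 0:
            continue
        layer_depth = 0.5 * (edges[i] + edges[i + 1])
        radius = int(round(blur_level * abs(layer_depth - focus_depth)))
        k = disc_kernel(radius)
        blurred_mask = cv2.filter2D(mask, -1, k)
        out += cv2.filter2D(image.astype(np.float32), -1, k) * blurred_mask[..., None]
        weight += blurred_mask
    return (out / np.maximum(weight, 1e-6)[..., None]).astype(image.dtype)

Pairing the original sharp image with its blurred counterpart gives a same-scene, different-blur example of the kind the conditioning module can be trained on.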
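The blur conditioning can be pictured as an IP-Adapter-style decoupled cross-attention: the scalar defocus level is encoded into a few condition tokens, and each conditioned attention layer adds a second key/value branch for those tokens on top of the usual text cross-attention. The PyTorch sketch below is an illustration under those assumptions (the class names, the Fourier-feature encoder, and the number of tokens are hypothetical), not the exact implementation.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlurEmbedding(nn.Module):
    # Maps a scalar defocus level in [0, 30] to a small set of condition tokens.
    def __init__(self, dim, num_tokens=4, num_freqs=16):
        super().__init__()
        self.register_buffer("freqs", 2.0 ** torch.arange(num_freqs, dtype=torch.float32))
        self.mlp = nn.Sequential(
            nn.Linear(2 * num_freqs, dim), nn.SiLU(),
            nn.Linear(dim, num_tokens * dim),
        )
        self.num_tokens, self.dim = num_tokens, dim

    def forward(self, blur):
        # blur: (B,) tensor of defocus levels, normalized to [0, 1] below.
        x = (blur[:, None] / 30.0) * self.freqs[None, :] * math.pi
        feats = torch.cat([x.sin(), x.cos()], dim=-1)
        return self.mlp(feats).view(-1, self.num_tokens, self.dim)

class DecoupledBlurCrossAttention(nn.Module):
    # Text cross-attention plus an extra key/value branch for the blur tokens;
    # the query projection is shared and the two attention outputs are summed.
    def __init__(self, dim, heads=8):
        super().__init__()
        self.heads = heads
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)       # text keys
        self.to_v = nn.Linear(dim, dim, bias=False)       # text values
        self.to_k_blur = nn.Linear(dim, dim, bias=False)  # blur keys
        self.to_v_blur = nn.Linear(dim, dim, bias=False)  # blur values
        self.to_out = nn.Linear(dim, dim)

    def _attend(self, q, k, v):
        b, n, d = q.shape
        h = self.heads
        q, k, v = (t.view(b, -1, h, d // h).transpose(1, 2) for t in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v)
        return out.transpose(1, 2).reshape(b, n, d)

    def forward(self, hidden, text_tokens, blur_tokens, blur_scale=1.0):
        q = self.to_q(hidden)
        out_text = self._attend(q, self.to_k(text_tokens), self.to_v(text_tokens))
        out_blur = self._attend(q, self.to_k_blur(blur_tokens), self.to_v_blur(blur_tokens))
        return self.to_out(out_text + blur_scale * out_blur)

In this picture, only the deeper U-Net blocks would swap in DecoupledBlurCrossAttention, and the blur_scale knob lets the blur branch be attenuated or disabled at inference.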
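Grounded self-attention can be sketched as reference-style attention: while denoising the target image, the self-attention queries also attend to keys and values cached from the pivot image's pass through the same layer and timestep, so the pivot's layout anchors the new sample. The helper below is an illustrative sketch; which layers and timesteps are grounded, and how the pivot pass is obtained (e.g., generation at a reference blur level, or inversion of a real photo), follow the paper rather than this snippet.

import torch
import torch.nn.functional as F

def grounded_self_attention(q, k, v, k_pivot, v_pivot):
    # q, k, v: (batch, heads, tokens, head_dim) from the image being generated.
    # k_pivot, v_pivot: cached from the pivot image at the same layer/timestep.
    # Concatenating the pivot's keys/values lets every query also look at the
    # pivot's features, anchoring object placement across blur levels.
    k_joint = torch.cat([k, k_pivot], dim=2)
    v_joint = torch.cat([v, v_pivot], dim=2)
    return F.scaled_dot_product_attention(q, k_joint, v_joint)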
A user can directly sample an image at the desired bokeh level.
To adjust the bokeh of an existing scene, a pivot image is chosen to anchor its layout, achieving scene-consistent generation.
To introduce bokeh conditioning into the baselines SD3.5 and FLUX, we append short blur descriptors to the text prompt.
"Generative Photography: Scene-Consistent Camera Control for Realistic Text-to-Image Synthesis" is a concurrent work that models intrinsic camera parameters across the temporal axis of a text-to-video diffusion model.
@article{fortes2025bokeh,
title = {Bokeh Diffusion: Defocus Blur Control in Text-to-Image Diffusion Models},
author = {Fortes, Armando and Wei, Tianyi and Zhou, Shangchen and Pan, Xingang},
journal = {arXiv preprint arXiv:2503.08434},
year = {2025},
}