Rongyao Fang 1 Chengqi Duan 2 ∗ Kun Wang 3 Linjiang Huang 6 Hao Li 1,4 Shilin Yan
Hao Tian 3 Xingyu Zeng 3 Rui Zhao 3 Jifeng Dai 4,5 Xihui Liu 2 Hongsheng Li 1 †
1 CUHK MMLab 2 HKU 3 SenseTime 4 Shanghai AI Laboratory 5 THU 6 BUAA. ∗ Equal Contribution. † Corresponding Authors.
Abstract
Current image generation and editing methods primarily process textual prompts as direct inputs without reasoning about visual composition and explicit operations. We present Generation Chain-of-Thought (GoT), a novel paradigm that enables generation and editing through an explicit language reasoning process before outputting images. This approach transforms conventional text-to-image generation and editing into a reasoning-guided framework that analyzes semantic relationships and spatial arrangements. We define the formulation of GoT and construct large-scale GoT datasets containing over 9M samples with detailed reasoning chains capturing semantic-spatial relationships. To leverage the advantages of GoT, we implement a unified framework that integrates Qwen2.5-VL for reasoning chain generation with an end-to-end diffusion model enhanced by our novel Semantic-Spatial Guidance Module. Experiments show our GoT framework achieves excellent performance on both generation and editing tasks, with significant improvements over baselines. Additionally, our approach enables interactive visual generation, allowing users to explicitly modify reasoning steps for precise image adjustments. GoT pioneers a new direction for reasoning-driven visual generation and editing, producing images that better align with human intent. To facilitate future research, we make our datasets, code, and pretrained models publicly available at https://github.com/rongyaofang/GoT.
![[Uncaptioned image]](https://arxiv.org/html/2503.10639v1/x1.png)
Figure 1: Generation Chain-of-Thought (GoT) with Semantic-Spatial Reasoning. Our approach transforms input prompts into explicit reasoning chains with coordinates (middle), which guides vivid image generation and precise editing (right). This reasoning-based generation paradigm unifies spatial understanding across visual tasks: semantically-grounded visual generation (top), controllable interactive generation (middle), and localized image editing (bottom).
1 Introduction
Language provides the primary interface for expressing human intent in visual content generation. Traditional image generation systems 1 2 3, particularly diffusion models, process textual prompts by mapping semantic concepts to visual elements without explicit reasoning. These approaches struggle with complex scenes requiring precise spatial arrangements and object interactions that humans naturally consider when constructing scenes. Meanwhile, multimodal large language models (MLLMs) 4 5 6 excel at sophisticated reasoning tasks, including analyzing semantic structures, inferring relationships, grounding visual concepts, and processing detailed contexts through explicit reasoning chains. This gap between MLLMs’ advanced reasoning capabilities and the limited reasoning in current generation systems raises a key question: How can we integrate the reasoning mechanisms that have revolutionized language understanding into visual generation and editing?
Prior work has attempted to leverage LLMs for image generation from different perspectives. One line of research 7 8 uses LLMs as text encoders for better prompt interpretation; however, it does not exploit the reasoning capabilities of LLMs. Another line of work develops multimodal LLMs to unify understanding and generation 9 10 11 12. Although these present unified models for different tasks, there is no evidence that generation benefits from the models' strong understanding and reasoning abilities; they merely combine independent tasks rather than truly fusing language reasoning with visual generation. Additionally, layout-based methods like GLIGEN 13, LayoutGPT 14, and RPG 15 incorporate LLMs for layout planning and diffusion models for layout-guided generation. However, these methods treat planning and generation as separate stages rather than integrating reasoning throughout an end-to-end process. Consequently, current image generation methods lack reasoning capabilities, underscoring the need for a framework that seamlessly combines reasoning with visual generation and editing.
Inspired by chain-of-thought (CoT) reasoning of the LLMs, we introduce Generation Chain-of-Thought (GoT), a novel paradigm that enables visual generation to first output step-by-step reasoning in natural language before producing images. However, implementing GoT poses two significant challenges. First, different from CoT in LLMs, the reasoning chain for visual generation and editing requires both semantic and spatial information. It requires a new formulation and collecting training data in this new format. Second, existing diffusion-based models cannot leverage explicit language reasoning chains during visual generation. We need to design a framework supporting end-to-end language reasoning and visual generation.
To address the first challenge, we formulate GoT as a multimodal reasoning chain that integrates semantic and spatial analyses to enhance image generation and editing tasks. For visual generation, GoT provides precise control over object layout, relationships, and attributes, while for editing, it leverages semantic and spatial understanding to decompose user requests into coherent grounding and modification steps. We utilize advanced MLLMs and LLMs to construct complex annotation pipelines, which capture semantic-spatial interactions across diverse visual contexts. We assembled extensive datasets comprising 8.4M images for text-to-image generation (from Laion-Aesthetics 16, JourneyDB 17, and FLUX 3) and 920K examples for image editing (from OmniEdit 18 and SEED-Edit-Multiturn 19). This computationally intensive effort produced the first large-scale dataset of reasoning chains for image generation and editing.
To tackle the second challenge of architecture design supporting reasoning and generation, we construct a unified end-to-end framework. Our GoT framework integrates the reasoning capabilities of MLLMs with the high-fidelity generation qualities of diffusion models. The proposed framework leverages an MLLM to generate reasoning steps and visual tokens, providing explicit guidance that incorporates semantic relationships and spatial configurations. This guidance flows into our novel Semantic-Spatial Guidance Module (SSGM), which conditions the diffusion process to ensure that generated images are closely guided by the reasoning process. This design supports end-to-end training and inference for visual generation and editing guided by explicit reasoning chains.
By effectively integrating reasoning into visual generation, our GoT framework demonstrates significant improvements in both text-to-image generation quality and image editing accuracy. Additionally, GoT enables interactive generation, allowing users to control the generated image by directly modifying the explicit reasoning process according to their preferences. These advantages represent a substantial advancement in reasoning-guided visual synthesis.
The main contributions can be summarized as follows:
- We propose Generation Chain-of-Thought (GoT), a paradigm that integrates reasoning into visual generation and editing tasks, enabling explicit semantic and spatial reasoning for these tasks.
- We define the formulation of semantic and spatial reasoning chains for visual generation and editing, and collect the first large-scale GoT datasets comprising 8.4M image generation samples and 920K image editing samples.
- We develop a unified end-to-end framework that leverages multimodal language models and diffusion models, with a novel Semantic-Spatial Guidance Module that ensures generated images follow the reasoning process.
- Our experimental results demonstrate significant improvements in both text-to-image generation and editing.
2 Related Work

Figure 2: GoT Dataset Construction Process. Left: Text-to-image GoT annotation pipeline that labels detailed GoT with semantic content and spatial coordinates. Right: Editing GoT annotation pipeline that processes source image, target image, and instruction to generate entity-aware reasoning GoT with precise spatial grounding. Both pipelines leverage Qwen2-VL 46 and Qwen2.5 51 models for various stages of the annotation process.
2.1 Diffusion Models
Diffusion models have revolutionized visual content creation. Early approaches 20 21 demonstrated this paradigm’s potential, while Stable Diffusion 1 improved efficiency through latent space compression. Recent models 22 23 24 2 3 25 have further advanced photorealism through architectural innovations and larger-scale training. Various efforts to extend diffusion models’ capabilities include controllable generation methods 26 27 and instruction-based editing frameworks 28 29. While some researchers have explored unifying vision tasks 30 31, these primarily focus on traditional computer vision tasks rather than general image generation. Despite these advances, current models typically process prompts through direct mapping, using text encoders like CLIP 32 or T5 33 to condition the diffusion process via cross-attention 34. This approach treats text as a static representation without explicit reasoning about scene composition or object relationships. The fundamental limitation becomes evident when generating complex scenes with multiple objects and specific spatial arrangements, necessitating more sophisticated reasoning-based approaches.
2.2 Large Language Models and Reasoning
Large Language Models (LLMs) have demonstrated remarkable reasoning capabilities through chain-of-thought (CoT) 35, enabling complex problem decomposition. This paradigm extends to MLLMs 36 5, which integrate visual and textual understanding. Some advanced works 6 37 have enhanced spatial understanding by grounding textual concepts to image regions, enabling analysis of object relationships. Despite these capabilities, MLLMs remain underutilized for visual generation. While models like Chameleon 9 and Emu2 38 incorporate image generation, they lack mechanisms to decompose user intent into semantic-spatial reasoning steps.
2.3 Layout-guided Image Generation and Editing
Recent research has explored layout-guided approaches for spatial control in visual synthesis. GLIGEN 13 incorporates bounding boxes through gated cross-attention layers, enhancing object placement. LayoutGPT 14 proposes a two-stage pipeline converting text into scene layouts before generation. RPG 15 advances this through recurrent planning, alternating between layout refinement and synthesis. SmartEdit 39 adapts the LLaVA 40 model to specialize in image editing tasks. FlexEdit 41 employs an MLLM to comprehend the image content, mask, and user instructions. Despite these advances, existing approaches treat layouts as static constraints or sequential plans generated before synthesis, disconnecting spatial planning from generation.
3 Generation Chain-of-Thought (GoT)
During visual generation and editing, humans naturally reason about object relationships and spatial arrangements. In contrast, most current models process prompts without explicit reasoning, making it difficult to interpret complex human intentions for generating scenes with detailed object relationships and spatial configurations.
Motivated by chain-of-thought (CoT) in language models, we propose Generation Chain-of-Thought (GoT), shifting visual generation from direct mapping to a reasoning-guided process. Unlike language generation, which operates primarily within a semantic space, visual generation requires an integrated understanding of both semantic relationships and spatial configurations. To address this complexity, GoT employs a multi-modal reasoning formulation that bridges conceptual understanding and spatial reasoning. This formulation incorporates explicit coordinate information in the format (x1,y1),(x2,y2), with values in the range [0,1000), ensuring precise control over the placement of visual elements. This unified semantic-spatial reasoning chain enables fine-grained control of object placement, attributes, and inter-object relationships, ultimately supporting robust and coherent visual generation.
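As a concrete illustration, the coordinate convention above can be consumed programmatically. The sketch below is our own illustrative helper (not part of the GoT codebase): it extracts (x1,y1),(x2,y2) pairs from a reasoning chain and rescales them from the [0,1000) range to pixel coordinates.

```python
import re

# Matches the "(x1,y1),(x2,y2)" coordinate pairs used in GoT chains,
# whose values lie in the normalized range [0, 1000).
BOX_PATTERN = re.compile(r"\((\d+),(\d+)\),\((\d+),(\d+)\)")

def parse_got_boxes(chain: str, width: int, height: int):
    """Return pixel-space (x1, y1, x2, y2) boxes found in a GoT chain."""
    boxes = []
    for x1, y1, x2, y2 in BOX_PATTERN.findall(chain):
        boxes.append((
            int(x1) * width // 1000, int(y1) * height // 1000,
            int(x2) * width // 1000, int(y2) * height // 1000,
        ))
    return boxes

chain = "A red car (100,400),(450,800) parked beside a tree (500,100),(700,900)."
print(parse_got_boxes(chain, width=1024, height=1024))
```

This hypothetical parser is all that downstream spatial guidance needs: the chain remains human-readable text, yet its layout is machine-recoverable.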
To illustrate the formulation of GoT, Fig. 1 presents examples of both text-to-image generation and editing tasks. For text-to-image, GoT generates a detailed reasoning chain specifying precise coordinates of elements. This explicit spatial reasoning enables a proper arrangement of all constituents while maintaining their semantic relationships, resulting in a coherent and visually appealing composition.
The image editing example in Fig. 1 demonstrates how GoT handles manipulation tasks through structured reasoning. When tasked with "replace the giant leaf with an umbrella", GoT first analyzes the scene and then plans edits with precise coordinates. Finally, GoT describes what the image shows after editing. This decomposition into sequential steps with explicit spatial reasoning streamlines complex manipulations, contrasting with traditional editing methods that lack spatial awareness and reasoning.
GoT endows image generation and editing with reasoning benefits. By decomposing complex instructions into clearly defined, sequential steps, GoT delivers results that more accurately fulfill human requests. Its transparent process explains the intermediate reasoning behind each change and enables both image generation and editing within a unified system.
Implementing GoT requires two key components:
- A Comprehensive Dataset: This dataset must consist of detailed reasoning chains that align with visual content, capturing both semantic relationships and spatial configurations. Such data provide the necessary foundation for the reasoning process.
- A Compatible Visual Generation Model: The model needs to accommodate chain input to integrate semantic analysis and spatial reasoning, ensuring effective execution of the reasoning steps derived from the dataset.
In the following sections, we elaborate on these components and discuss how they contribute to the robust performance of the GoT framework.
4 GoT Dataset: Semantic-Spatial Reasoning Chains for Visual Generation and Editing

Figure 3: GoT Framework with Semantic-Spatial Guidance. Left: Our dual-task framework handling both text-to-image generation (T2I) and image editing. Right: The SSGM Diffusion Module, which combines spatial layout guidance G_s, reference image guidance G_r, and semantic guidance G_t to generate the final image with precise content and spatial control.
Based on the formulation presented previously, we construct large-scale training datasets using advanced LLMs and MLLMs. Our GoT dataset features meticulously crafted semantic-spatial reasoning chains for both generation and editing tasks, with each sample containing instructions, reasoning chain annotations, and corresponding images. The construction requires careful design of task-specific annotation pipelines to ensure quality. The prompts used in the pipelines are attached in Appendix Sec. 11.
4.1 Automated Data Creation Pipeline
As illustrated in Fig. 2, our annotation pipeline demonstrates the multiple stages of processing required to generate these high-quality annotations. For text-to-image, we utilize Qwen2-VL 42 to generate concise prompts that serve as text-to-image generation prompts and detailed visual descriptions that form the semantic component of GoT. Qwen2.5 43 then performs object entity extraction, followed by Qwen2-VL establishing spatial relationships through object grounding. The detailed visual descriptions merged with precise object groundings together constitute the complete GoT annotation for text-to-image generation.
For the image editing pipeline, we employ Qwen2-VL to generate comprehensive descriptions of source and target images, precisely localize editing regions through bounding boxes, and generate detailed descriptions of edited objects after cropping. We then leverage Qwen2.5 with carefully designed in-context prompting to synthesize coherent GoT reasoning chains, ensuring logical flow and completeness of the editing process. From this pipeline, we derive concise editing instructions as editing inputs while using the detailed semantic-spatial reasoning steps as GoT annotations. For the complex multi-turn editing dataset, we developed a related but more sophisticated protocol with Qwen2-VL and Qwen2.5 to obtain intricate step-by-step reasoning chains with multiple spatial coordinates and transformation descriptions, capturing complex editing sequences.
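The text-to-image annotation stages above can be summarized schematically. In the sketch below, the model-call functions are stubs standing in for the Qwen2-VL and Qwen2.5 invocations (the real pipeline prompts the models; these return canned outputs purely to show the data flow and how groundings are merged into the GoT annotation):

```python
# Schematic sketch of the text-to-image GoT annotation pipeline. The three
# stage functions are hypothetical stand-ins, NOT the real Qwen APIs.
def describe_image(image):          # stage 1 (Qwen2-VL): prompt + description
    return ("A dog sits on a red sofa.",
            "A small brown dog sits on a red sofa by a window.")

def extract_entities(description):  # stage 2 (Qwen2.5): entity extraction
    return ["dog", "sofa", "window"]

def ground_entities(image, entities):  # stage 3 (Qwen2-VL): spatial grounding
    return {"dog": "(300,450),(550,800)",
            "sofa": "(100,400),(900,950)",
            "window": "(600,50),(950,400)"}

def build_got_annotation(image):
    prompt, description = describe_image(image)
    entities = extract_entities(description)
    boxes = ground_entities(image, entities)
    # Merge the detailed description with groundings into the final GoT chain.
    got = description
    for name, box in boxes.items():
        got = got.replace(name, f"{name} {box}", 1)
    return prompt, got

prompt, got = build_got_annotation(image=None)
print(got)
```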
4.2 Dataset Construction
For text-to-image generation, we construct the dataset from three sources: Laion-Aesthetics-High-Resolution (LAHR) 16 with 3.77M samples filtered for images larger than 512 pixels, JourneyDB 17 with 4.09M samples, and 600K FLUX.1-generated 3 images using LAHR prompts. The final datasets yield rich annotations: LAHR-GoT samples have prompts averaging 110.81 characters, GoT descriptions averaging 811.56 characters, and 3.78 bounding boxes per image. Similarly, JourneyDB-GoT annotations average 149.78 characters for prompts, 906.01 characters for GoT descriptions, and 4.09 boxes per image.
For the single-turn image editing dataset, we build on OmniEdit 18, a premier open-source image editing dataset with high-fidelity images, processing 736,691 samples covering editing operations (addition, removal, swap, changing expression/color/weather/lighting, and style transfer). The multi-turn image editing dataset is built upon SEED-Edit-Multiturn 19, resulting in 180,190 samples.
The entire data creation process demanded substantial computational resources, requiring 100 NVIDIA A100 GPUs for over a month. This comprehensive approach ensures our dataset provides the robust foundation necessary for training models capable of sophisticated image generation and editing tasks.
5 GoT Framework: Reasoning-guided Visual Generation and Editing
We present the GoT framework, a unified end-to-end approach embedding reasoning-guided processes into visual generation and editing tasks. GoT integrates two primary components: a semantic-spatial aware MLLM generating structured reasoning chains with spatial information, and a multi-guided diffusion model leveraging these reasoning outputs through our proposed Semantic-Spatial Guidance Module (SSGM) in an end-to-end manner. This design ensures that generated images precisely follow logical reasoning steps, allowing detailed control over both semantic content and spatial relationships.
5.1 Semantic-Spatial MLLM Design
Our framework utilizes the state-of-the-art MLLM Qwen2.5-VL-3B as its backbone, chosen for its outstanding visual understanding and grounding capabilities. The MLLM functions as a reasoning engine, handling both generation and editing tasks through a unified architecture.
| Method | Architecture | Overall | Single Obj. | Two Obj. | Counting | Colors | Position | Attr. Binding |
|---|---|---|---|---|---|---|---|---|
| Frozen Text Encoder Mapping Methods | ||||||||
| SDv1.5 37 | Unet+CLIP | 0.43 | 0.97 | 0.38 | 0.35 | 0.76 | 0.04 | 0.06 |
| SDv2.1 37 | Unet+CLIP | 0.50 | 0.98 | 0.51 | 0.44 | 0.85 | 0.07 | 0.17 |
| SD-XL 32 | Unet+CLIP | 0.55 | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 |
| DALLE-2 36 | Unet+CLIP | 0.52 | 0.94 | 0.66 | 0.49 | 0.77 | 0.10 | 0.19 |
| SD3 (d=24) 6 | MMDIT+CLIP+T5 | 0.62 | 0.98 | 0.74 | 0.63 | 0.67 | 0.34 | 0.36 |
| LLMs/MLLMs Enhanced Methods | ||||||||
| LlamaGen 42 | Autoregressive | 0.32 | 0.71 | 0.34 | 0.21 | 0.58 | 0.07 | 0.04 |
| Chameleon 44 | Autoregressive | 0.39 | - | - | - | - | - | - |
| LWM 26 | Autoregressive | 0.47 | 0.93 | 0.41 | 0.46 | 0.79 | 0.09 | 0.15 |
| SEED-X 13 | Unet+Llama | 0.49 | 0.97 | 0.58 | 0.26 | 0.80 | 0.19 | 0.14 |
| Emu3-Gen 47 | Autoregressive | 0.54 | 0.98 | 0.71 | 0.34 | 0.81 | 0.17 | 0.21 |
| Janus 50 | Autoregressive | 0.61 | 0.97 | 0.68 | 0.30 | 0.84 | 0.46 | 0.42 |
| JanusFlow 27 | Autoregressive | 0.63 | 0.97 | 0.59 | 0.45 | 0.83 | 0.53 | 0.42 |
| GoT Framework | Unet+Qwen2.5-VL | 0.64 | 0.99 | 0.69 | 0.67 | 0.85 | 0.34 | 0.27 |
Table 1: Evaluation of text-to-image generation on the GenEval benchmark 44. Obj.: Object. Attr.: Attribute.
As shown in Fig. 3, the MLLM’s pipeline begins with task-specific input handling. For editing tasks, it processes reference images through the vision encoder to understand the source content. For both generation and editing, the MLLM produces GoT reasoning chains, capturing object attributes, relationships, modifications, and bounding box information. Following reasoning chain generation, the model processes an image start token followed by special [IMG] tokens, generating semantic guidance embeddings that encapsulate information from the previous reasoning chains. Additionally, the spatial guidance is derived by parsing and converting the explicit spatial information in the generated reasoning chains.
This semantic-spatial aware design enables the MLLM to direct the SSGM Diffusion Module with precise control over content and layout. During training, the MLLM receives supervision through two pathways: cross-entropy loss on GoT reasoning tokens and gradient signals backpropagated from the end-to-end SSGM diffusion module through the semantic guidance embeddings.
5.2 Semantic-spatial Guided Diffusion Generation
The end-to-end diffusion module builds upon SDXL’s 24 architecture, incorporating an innovative triple-guidance mechanism that integrates semantic understanding, spatial awareness, and reference knowledge through our Semantic-Spatial Guidance Module (SSGM). In SSGM, the semantic guidance pathway enhances the diffusion model by channeling MLLM-generated embeddings through cross-attention layers, replacing conventional CLIP embeddings for more precise semantic control.
For spatial guidance in SSGM, we extract coordinate information from the generated GoT to create color-coded masks where each object or editing region receives a distinct color based on a predefined order in the GoT sequence. These colored masks are processed through a VAE encoder 45 and averaged to produce spatial latent features G_s, which are concatenated with the diffusion model's latent representations, enabling precise spatial control during both generation and editing tasks.
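A minimal sketch of this spatial pathway is shown below, with a toy average-pooling "encoder" standing in for the VAE; the palette, canvas size, and downsampling factor are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Each GoT box is drawn as a distinct color (by its order in the chain) on its
# own canvas; the encoded canvases are then averaged into one spatial feature map.
PALETTE = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 0)]  # predefined order

def boxes_to_masks(boxes, size=64):
    masks = []
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        canvas = np.zeros((size, size, 3), dtype=np.float32)
        canvas[y1:y2, x1:x2] = PALETTE[i % len(PALETTE)]
        masks.append(canvas)
    return masks

def toy_encode(img, factor=8):
    # Stand-in for the VAE encoder: average-pool by `factor`.
    h, w, c = img.shape
    return img.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))

def spatial_guidance(boxes, size=64):
    latents = [toy_encode(m) for m in boxes_to_masks(boxes, size)]
    return np.mean(latents, axis=0)  # averaged spatial latent features

G_s = spatial_guidance([(8, 8, 32, 32), (40, 16, 60, 56)])
print(G_s.shape)
```

The resulting map has the same spatial resolution as the diffusion latents, so it can simply be concatenated along the channel dimension.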
Following InstructPix2Pix 28, we incorporate reference image guidance as the third SSGM pathway. For editing tasks, the source image serves as the reference, while for text-to-image generation, we use a black reference image for architectural consistency. This design enables a seamless transition between generation and editing tasks without architectural modifications. All references are processed through the VAE encoder to extract visual features G_r.

Figure 4: Text-to-Image samples generated by our model. The GoT framework can plan object placement based on the input caption and generate highly aligned and aesthetic images accordingly.
5.3 Multi-Guidance Strategy
We employ a classifier-free guidance strategy integrating semantic, spatial, and reference image guidance. During diffusion, the score estimate is computed through a weighted combination:

ε̃_θ(z_t, G_t, G_s, G_r) = ε_θ(z_t, ∅, ∅, ∅) + λ_t · [ε_θ(z_t, G_t, ∅, ∅) − ε_θ(z_t, ∅, ∅, ∅)] + λ_s · [ε_θ(z_t, G_t, G_s, ∅) − ε_θ(z_t, G_t, ∅, ∅)] + λ_r · [ε_θ(z_t, G_t, G_s, G_r) − ε_θ(z_t, G_t, G_s, ∅)]

where z_t is the noisy latent, G_t denotes the semantic guidance embeddings, G_s indicates the spatial guidance features, and G_r represents the reference image features. The guidance scales λ_t, λ_s, and λ_r control the strength of each guidance type, while ∅ denotes null conditioning. During training, we randomly sample partial conditioning combinations, each with a probability of 5% and excluding the fully-conditioned case, to enhance robustness. Optimal guidance parameters are introduced in Sec. 6.
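Numerically, a cascaded three-condition classifier-free guidance combination of this kind can be sketched as follows; the noise predictor and the guidance-scale values here are toy stand-ins of our own, not the paper's.

```python
import numpy as np

# Toy noise predictor: each non-null condition shifts the prediction by a
# fixed amount, so the effect of each guidance scale is easy to trace.
def eps(z_t, g_t, g_s, g_r):
    out = np.copy(z_t)
    if g_t is not None: out += 1.0   # semantic guidance contribution
    if g_s is not None: out += 0.5   # spatial guidance contribution
    if g_r is not None: out += 0.25  # reference guidance contribution
    return out

def guided_eps(z_t, g_t, g_s, g_r, lam_t=7.5, lam_s=4.0, lam_r=1.5):
    # Cascaded combination: each term isolates one guidance pathway.
    e_null = eps(z_t, None, None, None)
    e_t    = eps(z_t, g_t, None, None)
    e_ts   = eps(z_t, g_t, g_s, None)
    e_tsr  = eps(z_t, g_t, g_s, g_r)
    return (e_null
            + lam_t * (e_t - e_null)
            + lam_s * (e_ts - e_t)
            + lam_r * (e_tsr - e_ts))

z = np.zeros(4)
print(guided_eps(z, g_t="sem", g_s="spa", g_r="ref"))
```

With all scales set to 1 the combination telescopes back to the fully-conditioned prediction, which is a quick sanity check on the cascaded form.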
5.4 Training Procedure
Our training process implements a two-phase approach: pretraining using LAHR-GoT, JourneyDB-GoT, and OmniEdit-GoT datasets (60,000 steps), followed by finetuning with FLUX-GoT, OmniEdit-GoT, and SEED-Edit-MultiTurn-GoT (10,000 steps). We employ low-rank adaptation (LoRA) 46 to efficiently update the Qwen2.5-VL decoder's parameters while fully optimizing the SDXL-based diffusion module. The process operates end-to-end, jointly optimizing the MLLM's GoT cross-entropy token loss and the diffusion MSE loss with equal weighting, demonstrating robustness without complex hyperparameter tuning.
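The joint objective can be illustrated with toy values; the loss functions below are simplified stand-ins for the actual token and noise-prediction losses, and only the equal 1:1 weighting reflects the text.

```python
import math

def cross_entropy(probs, target_idx):
    # Negative log-likelihood of the target token.
    return -math.log(probs[target_idx])

def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

ce = cross_entropy([0.1, 0.7, 0.2], target_idx=1)   # GoT reasoning-token loss
diff = mse([0.2, 0.4], [0.0, 0.5])                  # diffusion noise-prediction loss
total = ce + diff                                   # equal weighting (1 : 1)
print(round(total, 4))
```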
6 Experiments
We evaluate GoT framework on text-to-image generation, interactive image generation, and image editing. Experiments show quantitative improvements and qualitative benefits of our reasoning-guided approach, with ablation studies validating our design choices.
6.1 Text-to-Image Generation
6.1.1 Quantitative Results

Figure 5: Samples on interactive generation with GoT framework. By modifying GoT content (description and bounding box position), user can customize their text-to-image process with: 1. Object replacement 2. Object position adjustment 3. Object attribute modification.
Tab. 1 presents an evaluation of text-to-image generation (T2I) on GenEval 44. The comparison spans two main categories of models: those employing frozen text encoders for direct prompt-to-image generation (primarily diffusion-based approaches) and those leveraging LLMs or MLLMs to enhance the generation process. For the T2I task, the GoT framework adopts fixed guidance-scale settings; further discussion of their tuning is given in Appendix Sec. 9.2.
As shown in Tab. 1, our framework achieves the highest overall score of 0.64, outperforming both frozen text encoder methods and LLM/MLLM-enhanced approaches. GoT framework excels particularly in single object (0.99), counting tasks (0.67), and color tasks (0.85), demonstrating the effectiveness of our reasoning-guided generation paradigm. While methods like JanusFlow 47 perform better in position and attribute binding tasks, GoT framework’s balanced performance across all metrics validates that incorporating explicit reasoning mechanisms enhances compositional generation abilities.
Among the LLM/MLLM-enhanced methods, our approach outperforms recent systems like Janus 11 and JanusFlow 47 in overall performance despite their advantages in specific areas. This suggests that while autoregressive models excel in certain spatial tasks, our GoT framework’s structured reasoning provides more consistent performance across diverse generation requirements.
6.1.2 Qualitative Results
In addition to the outstanding compositional text-to-image generation capability, GoT framework also exhibits high generation quality. In Fig. 4, we showcase the generation results of our model across a diverse set of prompts. We present samples from compositional prompts containing multiple objects, incorporating object attributes, relationships, and relative spatial positions. Our model effectively plans the placement of different objects, producing coherent and aesthetically pleasing images.
6.2 Interactive Generation
In our experiments, we further demonstrate the interactive capabilities of the GoT framework, as illustrated in Fig. 5. This approach enables user control over the generation process by modifying the GoT content, including both textual descriptions and bounding box positions. Users can customize their text-to-image generation through three primary interaction types: object replacement, object position adjustment, and object attribute modification. The examples showcase how the framework maintains overall scene coherence while precisely implementing the requested changes. This interactive flexibility provides an interpretable and manipulable interface for text-to-image generation that traditional black-box systems lack, allowing for precise control over the output without requiring expertise.
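Such interactions amount to string-level edits of the GoT chain before it is re-fed to the diffusion stage. Below is a hypothetical helper of our own for the position-adjustment case, assuming the coordinate format defined earlier:

```python
import re

# Moves a named object's bounding box inside a GoT chain; the edited chain
# would then be passed back through the diffusion stage to regenerate the image.
def move_object(chain: str, name: str, new_box: str) -> str:
    pattern = re.compile(rf"({re.escape(name)}\s*)\(\d+,\d+\),\(\d+,\d+\)")
    return pattern.sub(rf"\g<1>{new_box}", chain, count=1)

chain = "A lighthouse (600,100),(800,700) on a rocky shore (0,600),(999,999)."
print(move_object(chain, "lighthouse", "(100,100),(300,700)"))
```

Object replacement and attribute modification work analogously: the user edits the description text while leaving the coordinates intact.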
| Method | Params. | Emu-Edit CLIP-I | Emu-Edit CLIP-T | ImagenHub GPT-4o Eval. | Reason-Edit GPT-4o Eval. |
|---|---|---|---|---|---|
| IP2P 5 | 0.9B+0.1B | 0.834 | 0.219 | 0.308 | 0.286 |
| MagicBrush 53 | 0.9B+0.1B | 0.838 | 0.222 | 0.513 | 0.334 |
| MGIE 10 | 0.9B+7B | 0.783 | 0.253 | 0.392 | 0.264 |
| Emu-Edit 40 | - | 0.859 | 0.231 | - | - |
| SEED-X 13 | 2.8B+14B | 0.825 | 0.272 | 0.166 | 0.239 |
| SmartEdit † 17 | 0.9B+7B | - | - | - | 0.572 |
| CosXL-Edit 4 | - | 0.860 | 0.274 | 0.464 | 0.325 |
| GoT Framework | 2.8B+3B | 0.864 | 0.276 | 0.533 | 0.561 |
Table 2: Quantitative comparison on image editing benchmarks. † denotes that SmartEdit mainly supports removing and replacing operation and is not designed for general editing operations.
6.3 Image Editing

Figure 6: Qualitative results of image editing. Our GoT framework demonstrates superior performance in settings that require semantic-spatial reasoning. Red bounding boxes indicate the coordinates predicted by MLLM within the GoT framework.
6.3.1 Quantitative Results
As shown in Tab. 2, we evaluate our GoT framework against state-of-the-art image editing methods across multiple benchmarks. On the Emu-Edit benchmark 29, the GoT framework achieves the highest scores on both the CLIP-I (0.864) and CLIP-T (0.276) metrics, outperforming previous methods including CosXL-Edit 48 and Emu-Edit 29. Since CLIP-I and CLIP-T cannot fully reflect editing accuracy, we also evaluate with GPT-4o 36, which aligns better with human evaluation 49. On ImagenHub 50, our approach attains the highest score of 0.533. On the reasoning-based Reason-Edit benchmark 39, our model achieves a strong score of 0.561, second only to SmartEdit (0.572) 39, which is specially designed for reasoning-based removal and replacement operations. This demonstrates our method's strong editing ability, especially in complex reasoning settings. The GoT framework shows consistently superior performance while maintaining competitive parameter efficiency (2.8B+3B) compared to approaches like SEED-X (2.8B+14B) 51. In the editing task, the GoT framework adopts task-specific settings of the guidance scales λ_t, λ_s, and λ_r. The evaluation prompt for GPT-4o is shown in Appendix Sec. 11.1.
6.3.2 Qualitative Results
We present a qualitative comparison of image editing with other models in Fig. 6. Our approach demonstrates superior performance across diverse editing scenarios that require semantic-spatial reasoning. The examples highlight our framework's distinctive capabilities: First, our model accurately identifies and localizes objects referenced through indirect descriptions. Second, our approach handles complex spatial instructions effectively, such as removing specific signage or adding delicate elements to precise locations. Third, our framework excels at multi-step editing operations, as demonstrated in the bottom example. The red bounding boxes visible in our results indicate the coordinates predicted by the MLLM within the GoT framework, providing interpretable insight into how our system reasons about spatial relationships during the editing process.
6.4 Ablation Study on Framework Design
We conduct an ablation study to analyze the impact of different components in our framework. Table 3 presents the results of our study, where we progressively integrate different components into the baseline and evaluate their effects on GenEval and ImagenHub benchmarks.
The baseline model leverages Qwen2.5-VL-3B and SDXL but does not incorporate GoT reasoning chains. It is trained on FLUX-GoT and OmniEdit-GoT for 10,000 steps. Adding GoT reasoning chains to the baseline enables stronger semantic guidance: the reasoning process helps the LLM plan the guidance used during generation.
Introducing the Semantic-Spatial Guidance Module (SSGM) further enhances model performance, particularly in image editing. SSGM provides spatial control over the diffusion model, ensuring that object placement aligns more accurately with the reasoning process. This enables fine-grained editing, as reflected by the significant improvement in the ImagenHub evaluation. However, in GenEval, only the position category is notably affected by SSGM, which explains the relatively minor performance gain.
Our final framework, which includes GoT reasoning, SSGM, and an extensive 60,000-step pretraining phase, achieves the highest scores, demonstrating the significant benefits of prolonged pretraining and the full model design. The ablation study confirms that each added component contributes positively to the overall performance, validating our framework design choices.
| Method | GoT | SSGM | Pretrain | GenEval | ImagenHub |
|---|---|---|---|---|---|
| Baseline | | | | 0.38 | 0.176 |
| + GoT | ✓ | | | 0.40 | 0.181 |
| + SSGM | ✓ | ✓ | | 0.42 | 0.370 |
| GoT Framework | ✓ | ✓ | ✓ | 0.64 | 0.533 |
Table 3: Ablation study of our GoT framework. GenEval reports the overall score; ImagenHub uses GPT-4o evaluation.
7 Conclusion
We presented Generation Chain-of-Thought (GoT), a paradigm that integrates MLLM reasoning capabilities into visual generation through explicit semantic-spatial reasoning chains. Our approach transforms visual generation from direct mapping into a reasoning-guided process with precise spatial control, addressing limitations in existing methods that lack explicit understanding of object relationships and arrangements. Through large-scale dataset construction (9M+ examples), a novel Semantic-Spatial Guidance Module, and an end-to-end training framework, GoT achieves state-of-the-art performance on text-to-image generation and editing benchmarks while enabling unprecedented interactive control through modifiable reasoning chains. By bridging the gap between human reasoning and visual creation, GoT introduces a more intuitive and powerful approach to visual synthesis that aligns with natural cognitive processes.
References
Supplementary Material
8 Training Details
We pretrain our model for 60,000 steps on LAHR-GoT, JourneyDB-GoT, and OmniEdit-GoT. We adopt a cosine learning rate scheduler with 500 warmup steps and a maximum learning rate of .
During the fine-tuning stage, we train the model on FLUX-GoT, OmniEdit-GoT, and SEED-Edit-MultiTurn-GoT for 10,000 steps. In this phase, we set the warmup steps to 200 and the maximum learning rate to .
For both stages, we use the Adam optimizer with , , and . We also apply a weight decay of 0.05 during training. The batch size is set to 128.
The LLM is fine-tuned using LoRA with , LoRA alpha set to 32, and a LoRA dropout rate of 0.05. For diffusion, we introduce a noise offset of 0.1.
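The cosine schedule with linear warmup used in both stages can be sketched as a plain function of the step index. This is a minimal illustration, not our training code; `max_lr` is passed in as a parameter (the value 1.0 below is purely illustrative, since the actual learning rates are given above), and only the warmup/step counts match the configuration described.

```python
import math

def lr_at_step(step, max_lr, warmup_steps, total_steps):
    """Cosine learning-rate schedule with linear warmup."""
    if step < warmup_steps:
        # Linear warmup from 0 to max_lr over the warmup steps.
        return max_lr * step / warmup_steps
    # Cosine decay from max_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * progress))

# Pretraining stage: 60,000 steps with 500 warmup steps
# (max_lr = 1.0 is illustrative only).
schedule = [lr_at_step(s, 1.0, 500, 60_000) for s in range(60_000)]
```

The fine-tuning stage uses the same shape with 10,000 total steps and 200 warmup steps.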
9 Visualization Results
9.1 Qualitative Analysis of Image Editing and Interactive Generation
We provide additional examples to demonstrate the capabilities of the GoT framework. Figure 7 illustrates the image editing performance of our model. Additionally, we present the corresponding GoT content generated alongside each sample. Further examples of interactive generation using our model are shown in Figure 8.
9.2 Visualization of Multi-Guidance Strategy Hyperparameter Selection
We analyze the effect of hyperparameter selection in the Multi-Guidance Strategy on the generated images, as depicted in Figure 9. The definitions of these hyperparameters are provided in Section 5.3.
10 GoT Format and Examples
This section presents examples of the GoT format in our dataset. The GoT structure varies across different tasks, including text-to-image (T2I) generation, single-turn editing, and multi-turn editing.
For text-to-image generation, Figure 10 showcases examples from FLUX-GoT, JourneyDB-GoT, and LAHR-GoT. Our GoT format represents the structured planning process of the upstream model in generating image content. It provides a detailed breakdown of the various components within an image and their spatial relationships. To enhance spatial understanding, we append location information to key objects within the GoT representation.
Figure 11 illustrates the GoT format for image editing within our dataset. For single-turn editing, GoT represents the reasoning plan of the upstream model for a specific editing action. It consists of a description of the source image, the object to be modified, the specific editing operation, and the resulting edited image. This structured process ensures a step-by-step transformation, beginning with the original image, identifying the target object, applying the specified modification, and generating the edited image.
For multi-turn editing, GoT follows a more complex structure, as it must encapsulate the breakdown of an instruction into a sequence of consecutive steps. In practice, we first generate a description of the source image, then decompose the multi-turn instruction into a series of step-by-step editing commands. At each step, GoT operates as a single-turn editing process, specifying the object to be modified along with the corresponding transformation. Finally, the process concludes with a description of the fully edited image.
Furthermore, for image editing tasks, positional information is appended to each object to enhance spatial comprehension.
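The multi-turn structure described above can be summarized by a simple schema: a source description, a sequence of single-turn edit steps, and a final description. The dataclasses below are a hypothetical illustration of that structure; the field names are ours, and the actual GoT text format is shown in Figure 11.

```python
from dataclasses import dataclass

@dataclass
class EditStep:
    """One single-turn edit within a GoT chain (illustrative schema)."""
    target_object: str        # object to be modified
    box: tuple                # appended positional info: (x1, y1, x2, y2)
    operation: str            # e.g. "replace", "remove", "add"
    result_description: str   # description of the image after this step

@dataclass
class MultiTurnGoT:
    """A multi-turn instruction decomposed into consecutive single-turn edits."""
    source_description: str
    steps: list               # list[EditStep], applied in order
    final_description: str

chain = MultiTurnGoT(
    source_description="a park bench under a tree",
    steps=[
        EditStep("tree", (0.1, 0.0, 0.6, 0.8), "remove",
                 "a park bench on an open lawn"),
        EditStep("bench", (0.3, 0.5, 0.7, 0.9), "replace",
                 "a red chair on an open lawn"),
    ],
    final_description="a red chair on an open lawn",
)
```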
11 Prompts for Evaluation and Dataset Construction
11.1 Prompts for Evaluating Image Editing Performance
We provide the prompts used for evaluating image editing performance with GPT-4o (GPT-4o-2024-11-20) in Figure 12. For each sample, we take the minimum of the two scores; the final score is the average of these minima across samples.
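The min-then-average aggregation can be written out directly. A minimal sketch (the function name is ours; the two scores per sample come from the GPT-4o prompts in Figure 12):

```python
def final_editing_score(per_sample_scores):
    """per_sample_scores: list of (score_a, score_b) pairs, one pair per sample.
    Take the minimum of the two scores for each sample, then average the minima."""
    minima = [min(a, b) for a, b in per_sample_scores]
    return sum(minima) / len(minima)

# e.g. final_editing_score([(0.8, 0.6), (0.5, 0.9)]) -> (0.6 + 0.5) / 2 = 0.55
```

Taking the per-sample minimum penalizes an edit that satisfies one criterion while failing the other.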
11.2 Prompts for Text-to-Image Data Construction
Figures 13, 14, and 16 present the key prompts utilized in text-to-image data preparation.
11.3 Prompts for Image Editing Data Construction
Figures 15–20 illustrate the prompts employed at each key step of image editing data preparation.

Figure 7: More samples on image editing with the GoT content generated by our model.

Figure 8: More examples on interactive generation.

Figure 9: Visualization of Multi-Guidance Strategy hyperparameter selection. Text-to-image samples generated by the GoT framework under different hyperparameter settings.

Figure 10: Examples of GoT dataset for text-to-image generation, including FLUX-GoT, JourneyDB-GoT, and Laion-Aesthetics-High-Resolution-GoT.

Figure 11: Examples of GoT dataset for image editing, including OmniEdit-GoT for single-turn editing and SEED-Edit-Multiturn-GoT for multi-turn editing.
Figure 12: Prompt for GPT-4o image editing evaluation. We use GPT-4o-2024-11-20. For each sample we take the minimum of the two scores, and the final score is the average of these minima.
Figure 13: Prompt for detailed recaption for text-to-image data.
Figure 14: Prompt for identifying objects in text-to-image caption.
Figure 15: An example prompt for parsing the edited object, used when the task type is 'replace'.
Figure 16: Prompt for grounding object. This works for both text-to-image and image editing data.
Figure 17: Prompt for image description for image editing data.
Figure 18: Prompt for cropped image object description for image editing.
Figure 19: Prompt for reinstruction for image editing data.
Figure 20: In-context assembling GoT prompt for image editing data.
Footnotes
- Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
- Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-First International Conference on Machine Learning, 2024.
- Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024.
- Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
- Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023.
- Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge, 2024.
- Long Lian, Boyi Li, Adam Yala, and Trevor Darrell. LLM-grounded diffusion: Enhancing prompt understanding of text-to-image diffusion models with large language models. arXiv preprint arXiv:2305.13655, 2023.
- Le Zhuo, Ruoyi Du, Han Xiao, Yangguang Li, Dongyang Liu, Rongjie Huang, Wenze Liu, Lirui Zhao, Fu-Yun Wang, Zhanyu Ma, et al. Lumina-Next: Making Lumina-T2X stronger and faster with Next-DiT. arXiv preprint arXiv:2406.18583, 2024.
- Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818, 2024.
- Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024.
- Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation. arXiv preprint arXiv:2410.13848, 2024.
- Rongyao Fang, Chengqi Duan, Kun Wang, Hao Li, Hao Tian, Xingyu Zeng, Rui Zhao, Jifeng Dai, Hongsheng Li, and Xihui Liu. PUMA: Empowering unified MLLM with multi-granular visual generation. arXiv preprint arXiv:2410.13861, 2024.
- Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. GLIGEN: Open-set grounded text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22511–22521, 2023.
- Weixi Feng, Wanrong Zhu, Tsu-jui Fu, Varun Jampani, Arjun Akula, Xuehai He, Sugato Basu, Xin Eric Wang, and William Yang Wang. LayoutGPT: Compositional visual planning and generation with large language models. Advances in Neural Information Processing Systems, 36:18225–18250, 2023.
- Ling Yang, Zhaochen Yu, Chenlin Meng, Minkai Xu, Stefano Ermon, and Bin Cui. Mastering text-to-image diffusion: Recaptioning, planning, and generating with multimodal LLMs. In Forty-First International Conference on Machine Learning, 2024.
- Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
- Keqiang Sun, Junting Pan, Yuying Ge, Hao Li, Haodong Duan, Xiaoshi Wu, Renrui Zhang, Aojun Zhou, Zipeng Qin, Yi Wang, et al. JourneyDB: A benchmark for generative image understanding. Advances in Neural Information Processing Systems, 36:49659–49678, 2023.
- Cong Wei, Zheyang Xiong, Weiming Ren, Xinrun Du, Ge Zhang, and Wenhu Chen. OmniEdit: Building image editing generalist models through specialist supervision. arXiv preprint arXiv:2411.07199, 2024.
- Yuying Ge, Sijie Zhao, Chen Li, Yixiao Ge, and Ying Shan. SEED-Data-Edit technical report: A hybrid dataset for instructional image editing. arXiv preprint arXiv:2405.04007, 2024.
- Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021.
- Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
- Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
- Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L. Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
- Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
- Kaiyi Huang, Chengqi Duan, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. T2I-CompBench++: An enhanced and comprehensive benchmark for compositional text-to-image generation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.
- Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
- Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, and Ying Shan. T2I-Adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4296–4304, 2024.
- Tim Brooks, Aleksander Holynski, and Alexei A. Efros. InstructPix2Pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023.
- Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, and Yaniv Taigman. Emu Edit: Precise image editing via recognition and generation tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8871–8879, 2024.
- Yulu Gan, Sungwoo Park, Alexander Schubert, Anthony Philippakis, and Ahmed M. Alaa. InstructCV: Instruction-tuned text-to-image diffusion models as vision generalists. arXiv preprint arXiv:2310.00390, 2023.
- Rongyao Fang, Shilin Yan, Zhaoyang Huang, Jingqiu Zhou, Hao Tian, Jifeng Dai, and Hongsheng Li. InstructSeq: Unifying vision tasks with instruction-conditioned multi-modal sequence generation. arXiv preprint arXiv:2311.18835, 2023.
- Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V. Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
- Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824, 2023.
- Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative multimodal models are in-context learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14398–14409, 2024.
- Yuzhou Huang, Liangbin Xie, Xintao Wang, Ziyang Yuan, Xiaodong Cun, Yixiao Ge, Jiantao Zhou, Chao Dong, Rui Huang, Ruimao Zhang, et al. SmartEdit: Exploring complex instruction-based image editing with multimodal large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8362–8371, 2024.
- Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023.
- Trong-Tung Nguyen, Duc-Anh Nguyen, Anh Tran, and Cuong Pham. FlexEdit: Flexible and controllable diffusion-based object-centric image editing. arXiv preprint arXiv:2403.18605, 2024.
- Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.
- An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.
- Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. GenEval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems, 36:52132–52152, 2023.
- Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes, 2013.
- Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. In ICLR, 2022.
- Yiyang Ma, Xingchao Liu, Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zizheng Pan, Zhenda Xie, Haowei Zhang, Liang Zhao, et al. JanusFlow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation. arXiv preprint arXiv:2411.07975, 2024.
- Frederic Boesel and Robin Rombach. Improving image editing models with generative data refinement. In The Second Tiny Papers Track at ICLR 2024, 2024.
- Max Ku, Dongfu Jiang, Cong Wei, Xiang Yue, and Wenhu Chen. VIEScore: Towards explainable metrics for conditional image synthesis evaluation. arXiv preprint arXiv:2312.14867, 2023.
- Max Ku, Tianle Li, Kai Zhang, Yujie Lu, Xingyu Fu, Wenwen Zhuang, and Wenhu Chen. ImagenHub: Standardizing the evaluation of conditional image generation models. arXiv preprint arXiv:2310.01596, 2023.
- Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, and Ying Shan. SEED-X: Multimodal models with unified multi-granularity comprehension and generation. arXiv preprint arXiv:2404.14396, 2024.