Jinxiang Lai 2∗, Zexin Lu 1∗, Jiajun He 1, Rongwei Quan 1, Wenzhe Zhao 1, Qinyu Yang 1, Qi Chen 1, Qin Lin 1†, Chuyue Li 1, Tao Gao 1, Yuhao Shan 1, Shuai Shao 1,
Song Guo 2§, Qinglin Lu 1†
1 Tencent Hunyuan, 2 Hong Kong University of Science and Technology
∗ Equal contribution, § Corresponding Author, † Project lead
Abstract
Visual content creation tasks demand a nuanced understanding of design conventions and creative workflows-capabilities challenging for general models, while workflow-based agents lack specialized knowledge for autonomous creative planning. To overcome these challenges, we propose VisionCreator, a native visual-generation agentic model that unifies Understanding, Thinking, Planning, and Creation (UTPC) capabilities within an end-to-end learnable framework. Our work introduces four key contributions: (i) VisGenData-4k and its construction methodology using metacognition-based VisionAgent to generate high-quality creation trajectories with explicit UTPC structures; (ii) The VisionCreator agentic model, optimized through Progressive Specialization Training (PST) and Virtual Reinforcement Learning (VRL) within a high-fidelity simulated environment, enabling stable and efficient acquisition of UTPC capabilities for complex creation tasks; (iii) VisGenBench, a comprehensive benchmark featuring 1.2k test samples across diverse scenarios for standardized evaluation of multi-step visual creation capabilities; (iv) Remarkably, our VisionCreator-8B/32B models demonstrate superior performance over larger closed-source models across multiple evaluation dimensions. Overall, this work provides a foundation for future research in visual-generation agentic systems.
1 Introduction
AI-assisted visual content creation has revolutionized workflows from professional design to social media. The field has evolved from single-image generation 1 2 3 4 5 to complex multi-modal synthesis 6 7 8 9 10, demanding systems that can understand creative intent, plan multi-step operations, and autonomously execute intricate workflows. As shown in Fig.2, current approaches to autonomous visual creation can be categorized into three main paradigms, each with distinct limitations: (a) General-purpose Unified Multimodal Models (UMM) 11 12 13 14 leverage large-scale pre-training to achieve impressive visual understanding, but lack the domain-specific knowledge required for autonomous creative planning and struggle to decompose complex objectives without extensive prompt engineering. (b) Workflow-specific Agent 6 7 8 employ predefined pipelines for specific domains like movie generation or story creation, but their rigid architectures cannot adapt to diverse creative tasks or handle unexpected outcomes during execution. (c) Workflow-guided Agent 9 10 orchestrate external tools through carefully designed prompts and coordination logic, leveraging general language models to interpret requests and sequence operations. However, this approach faces several limitations: (i) Reliance on prompt engineering rather than learned domain knowledge, limiting creative understanding; (ii) Explicitly programmed coordination logic that restricts adaptability to diverse tasks; and (iii) Inability to be jointly optimized end-to-end for creative task performance.

Figure 1: Human Evaluation results on VisGenBench-Image and VisGenBench-Video.
To overcome these limitations, we propose VisionCreator, a native visual-generation agentic model that unifies Understanding, Thinking, Planning, and Creation (UTPC) capabilities in an end-to-end learnable framework, as shown in Fig.2 (d). Unlike existing approaches that rely on predefined workflows or external template workflows, our native architecture intrinsically integrates the capabilities of Understanding design conventions and user intent, Thinking through complex creative constraints, Planning multi-step execution trajectories, and Creation of high-quality and diverse visual creation tasks. However, realizing this new paradigm faces several critical challenges:
① Data Bottleneck: Currently, no comprehensive datasets exist for training agents to perform visual content creation through tool invocation. The lack of high-quality trajectories prevents supervised learning of the UTPC capabilities.
② Task Complexity: How to develop models that can handle the full spectrum of visual creation challenges, which encompass (i) Diverse task types, (ii) Varying difficulty levels from basic generation to advanced composition, and (iii) Complex creation tasks requiring 20+ execution steps? Existing approaches face significant limitations: specialized systems excel in narrow domains but fail to generalize across diverse tasks, while general models lack the depth for sophisticated creative reasoning and struggle with long-horizon consistency and adaptive strategy adjustment.
③ Training Difficulty: How to establish an effective and efficient training paradigm for such a native agent? The conventional SFT+RL framework faces significant obstacles: (i) SFT phase struggles to balance general capability preservation with domain-specific specialization, often leading to catastrophic forgetting or insufficient expertise; (ii) Direct online RL training with real tools incurs prohibitive costs and instability due to expensive API invocation and limited concurrency. Furthermore, designing accurate reward signals for multi-step creative trajectories is particularly challenging, as imperfect reward functions are highly vulnerable to reward hacking.

Figure 2: Framework comparisons. (a) UMM. (b) Workflow-specific Agent. (c) Workflow-guided Agent. (d) Our Native VisionCreator.
To address these challenges, we propose:
(i) VisGenData-4k with UTPC Structure: We design a metacognition-based VisionAgent to generate a comprehensive dataset following the UTPC structure, featuring diverse visual creation tasks across multiple difficulty levels. Through rigorous human quality inspection, we meticulously filter and retain only the highest-quality data samples. The resulting VisGenData-4k provides diverse and high-quality execution trajectories that explicitly capture Understanding of design conventions, Thinking through creative constraints, Planning of multi-step trajectories, and Creation of visual content, offering rich supervision signals for complex creative workflows.
(ii) Progressive Specialization Training (PST): We introduce a novel Progressive Specialization Training methodology that cultivates UTPC capabilities through two-stage optimization. PST effectively addresses the generalization-specialization trade-off by first establishing robust Understanding and Thinking capacities through general foundation learning, followed by targeted domain specialization to enhance Planning and Creation expertise. This progressive strategy not only prevents catastrophic forgetting of general abilities but also efficiently identifies optimal data composition for stagewise specialization, enabling the model to develop comprehensive UTPC capacities while maintaining strong cross-domain reasoning abilities.
(iii) Virtual VisGenEnv Construction: We construct VisGenEnv, a virtual environment for VRL. It features 36 tools with high-fidelity simulation of their behaviors. Multimodal outputs are simulated by returning random samples from a media database, providing correct physical attributes. This design enables effective learning of workflow planning through accurate tool behavior simulation.
(iv) Virtual Reinforcement Learning (VRL) with LtrReward: We develop an innovative Virtual Reinforcement Learning (VRL) paradigm that conducts the entire reinforcement learning using Long Trajectory Reasoning Reward (LtrReward) within the high-fidelity VisGenEnv. This approach bypasses the prohibitive cost of thousands of GPUs by leveraging simulated tool-call behaviors and functional logic, enabling stable and scalable learning of high-quality planning and action trajectories. Moreover, we provide a theoretical analysis that establishes formal guarantees on sim-to-real transfer and real-world performance improvement.
Finally, we introduce VisGenBench, a comprehensive benchmark designed for evaluating visual generation agentic models that operate through multi-step tool invocation to accomplish complex image and video creation tasks. Our benchmark encompasses: (i) Comprehensive Test Suite - featuring 1.2k test samples including 400 image-generation tasks and 800 video-generation tasks; (ii) Diverse Applications - spanning 10 evaluation dimensions across 35+ real-world scenarios; (iii) Standardized Protocol - ensuring reproducible evaluation through structured scoring rubrics.
Overall, our contributions are: (i) The VisionCreator, a novel native visual-generation agentic model that unifies UTPC capabilities in an end-to-end learnable framework; (ii) VisGenData-4k and its construction framework using metacognition-based VisionAgent to generate high-quality creation trajectories with UTPC structures; (iii) A progressive training methodology combining Progressive Specialization Training (PST) and Virtual Reinforcement Learning (VRL) with LtrReward, enabling stable and efficient learning of complex creation trajectories entirely within a virtual environment VisGenEnv; (iv) VisGenBench benchmark with 1.2k test samples across diverse scenarios for standardized evaluation of multi-step visual creation capabilities.
2 Related Works
2.1 Image Generation
Current image generation models primarily fall into two categories: Autoregressive 15 16 17 18 19 20 models and Diffusion 1 2 3 4 5 models. While these models provide powerful single-step image generation capabilities, they primarily focus on the Creation aspect of visual content generation. Our VisionCreator builds upon these fundamental generation technologies but extends them by integrating comprehensive Understanding, Thinking, and Planning capabilities. This allows our agent to not only generate individual images but also reason about complex creative requirements and plan multi-step visual creation workflows that leverage these underlying generation models as tools.
2.2 Video Generation
Video generation methods build on image models by adding time-based processing. Approaches like Make-A-Video 21 and SVD 22 extend image generation to video, while newer architectures like DiT 23 and MMDiT 24 in models such as CogvideoX 25 show progress in handling longer videos. These video generation tools are important for our agent’s creation ability, but they work separately. Our VisionCreator connects these tools through planning and reasoning to handle complete creation tasks.
2.3 Visual Generation Agents
Current approaches to autonomous visual creation include three main agent paradigms: (i) Workflow-specific Agents (e.g., MovieAgent 6, Captain Cinema 7, MM-StoryAgent 8) employ predefined pipelines for specialized domains but lack adaptability to diverse creative tasks. (ii) ComfyUI Workflow Generation methods (e.g., ComfyAgent 9, ComfyMind 10, ComfyUI-R1 26, ComfyGPT 27) specialize in generating ComfyUI-format workflows, which limits their visual creation capability in general API scenarios. (iii) Workflow-guided Agents 9 10 orchestrate external tools through prompt engineering but face limitations in creative understanding depth and end-to-end optimization. These limitations motivate our native visual-generation agent that intrinsically integrates UTPC capabilities in an end-to-end learnable framework.

Figure 3: VisGenData-4k construction pipeline.
3 VisGenData-4k with UTPC Structure
To tackle the data bottleneck in training visual creation agents, we design VisionAgent, a dataset generation framework based on a Metacognition paradigm. To construct a high-quality VisGenData dataset, VisionAgent employs commercial proprietary models (such as GPT-5, GPT-4o, Veo3, Sora2, etc) for multimodal data generation, and we further filter low-quality trajectories with algorithms and human experts. As shown in Fig. 3, the construction pipeline is as follows: (i) VisionAgent first generates 16k trajectories from 20k queries covering 42 scenarios. (ii) With the rigorous LtrReward and VLM-Grader methods, we remove 10k low-quality trajectories and obtain 6k candidate trajectories. (iii) These subsequently undergo a manual review by human experts, where 2k undesired trajectories are filtered out, resulting in high-quality 4k trajectories.

Figure 4: VisionAgent framework for dataset generation.
3.1 VisionAgent for Dataset Construction
As shown in Fig.4, our VisionAgent generates high-quality execution trajectories that capture the complete reasoning process for complex visual creation tasks. VisionAgent with metacognition achieves a 72% task success rate, representing a 30% improvement over the baseline method that relies solely on thinking.
Dual-Agent Architecture. Our framework employs a dual-agent architecture that separates task understanding from execution reasoning: (1) TaskAgent: Serves as the task classifier and router. It analyzes user inputs and performs fine-grained task classification across the 21 distinct task types, then selects appropriate predefined workflow templates and tools pool for specific task categories. (2) MetaAgent: Functions as the core reasoning engine with metacognitive capabilities. It receives both the selected workflow and tools pool as inputs, then executes structured reasoning through four standardized reasoning types defined in metacognition.

Figure 5: VisGenData-4k dataset statistics.
Metacognitive Reasoning Process. The metacognition defines four reasoning types: the
Reference Workflow Integration. We incorporate 15 predefined workflow templates as best-practice guides, ensuring planning remains flexible yet stays on track. These workflows provide domain-specific execution patterns for various visual creation tasks, representing 15 distinct application scenarios from storyboards to animated short films.
3.2 Dataset Composition and Statistics
Fig.5 shows the statistics of our VisGenData-4k, which exhibits the following key features: (1) Diverse Task Types: Encompassing 21 distinct task types (including storyboards, marketing posters, product marketing videos, animated short films, etc), this diversity is crucial for training agents to handle a broad range of real-world creative demands, significantly enhancing their adaptability and practical applicability. (2) Complex Trajectory Structure: With a mean of 15 steps and 64% of trajectories exceeding 20 steps, this complexity is crucial for training agents to decompose and plan long-horizon tasks, fostering robust problem-solving capabilities in visual creation. (3) Rich Contextual Information: The substantial token length (mean: 29k, 43% over 32k) equips agents with the ability to process and utilize extensive contextual cues, significantly enhancing their capacity for detailed and context-aware generation.
4 Agentic Post-Training
4.1 Agentic Framework
As shown in Fig. 6, VisionCreator is formulated as a unified agent that integrates Understanding, Thinking, Planning, and Creation (UTPC) capabilities to accomplish complex visual generation tasks. Formally, we model the agent as a policy operating over long-horizon multimodal trajectories: , where denotes multimodal observations (textual instructions, intermediate tool feedback, and virtual visual states), and denotes agent actions including reasoning tokens, planning steps, and tool invocations. The training process follows a two-stage agentic post-training paradigm: (1) Progressive Specialization Training (PST), which initializes a strong policy prior via supervised learning over expert UTPC trajectories. (2) Virtual Reinforcement Learning (VRL), which further optimizes long-horizon planning and tool-use strategies through large-scale exploration in a simulated environment.
4.2 Progressive Specialization Training
The goal of Progressive Specialization Training (PST) is to learn an initial policy that simultaneously preserves general reasoning competence while acquiring domain-specific visual creation ability, thereby enabling a functional visual content creation agent rather than a narrowly tuned generator. Let the supervised dataset be , where contains large-scale general reasoning and tool-use trajectories, and contains expert-curated visual creation trajectories (VisGenData-4k). Standard supervised fine-tuning (SFT) minimizes
However, naive single-stage SFT exhibits two fundamental failure modes. Training only on leads to catastrophic forgetting of general reasoning and planning ability, resulting in nearly zero agent competence; empirically, Tab. 4 shows performance dropping to 0.007, indicating the model is unable to function as a visual creation agent. Conversely, one-stage mixed SFT on avoids catastrophic forgetting but yields suboptimal specialization, since the dominance of suppresses learning of visual-creation behaviors and degrades downstream agent performance. These observations reveal a necessary condition for visual agents:
which neither naive SFT strategies can satisfy simultaneously.
PST resolves this conflict through a controlled two-stage curriculum that induces a gradual distribution shift. In Stage 1 (general foundation learning),
establishing robust reasoning, planning, and tool-use capabilities while lightly anchoring the policy to the visual generation agent domain. In Stage 2 (targeted specialization),
the increased effective influence of drives specialization toward visual content creation, while continued exposure to prevents catastrophic forgetting. Overall, PST learns a structured initialization
which constrains downstream reinforcement learning (RL) to a policy region that already satisfies both general competence and visual specialization. Experimental results further validate the necessity of PST. Compared with one-stage SFT, PST achieves substantially stronger performance on visual creation agent tasks, demonstrating that progressive specialization is essential for learning effective UTPC behaviors. Moreover, PST provides a significantly better initialization for RL: the initial reward score before RL training increases from 0.64 (one-stage SFT) to 0.87 (PST), a gain of +0.23. This improved starting point directly translates into optimization efficiency—RL convergence is accelerated by approximately 50%. These findings confirm that PST not only improves final agent capability, but also fundamentally reduces the difficulty of downstream reinforcement learning.

Figure 6: Our Native VisionCreator framework.
4.3 Virtual Reinforcement Learning
Building upon the robust foundation established by PST, we refine the model’s UTPC capabilities through Virtual Reinforcement Learning (VRL) based on the GRPO algorithm. To enable scalable long-horizon learning without invoking real-world tools, we first construct a high-fidelity virtual environment VisGenEnv that simulates the behavior of visual creation tools. Within this environment, LtrReward components are designed to supervise agent trajectories and guide both planning and execution. To understand that policies learned under these rewards transfer effectively to real-world scenarios, we provide a theoretical analysis of VRL. Building upon these insights, we then introduce a plan-driven reward that integrates planning and execution signals to optimize robust long-horizon visual creation performance.

Figure 7: Comparison of the real environment and our virtual VisGenEnv environment, with an example of using a video generation tool.
4.3.1 Virtual VisGenEnv Environment
To enable scalable long-horizon learning without invoking real-world tools, we first construct a high-fidelity virtual environment called VisGenEnv. This environment serves as a sandbox where the agent can safely explore planning and tool-use strategies, laying the foundation for subsequent reward design and theoretical analysis. VisGenEnv integrates a comprehensive suite of 36 visual creation tools (see Appendix for full list). The core of its design lies in a procedural simulation that accurately replicates the functional logic and behavioral patterns of real tools, including state transitions, parameter validation, and output specifications such as image resolution and video duration. To simulate multimodal outputs, the environment returns media files randomly sampled from a database while ensuring physically correct attributes consistent with tool specifications. This high-fidelity simulation of tool behaviors enables the agent to effectively learn the causal structure of the workflow and master robust planning policies through extensive practice within the virtual setting.
Training agent models by reinforcement learning in the real environment is prohibitively expensive. As illustrated in Fig. 7, supporting a training batch size of 24 with 4 rollouts (i.e., 96 concurrent rollouts in total) quickly becomes computationally intractable. Video tools are particularly costly: each instance requires 8 GPUs and roughly 30 seconds per video, meaning 96 concurrent rollouts would require GPUs. Deploying multiple real image and video generation tools would require several thousand GPUs, while our virtual environment VisGenEnv enables long-horizon exploration with only a few GPUs—thus saving thousands of GPU resources.

Figure 8: LtrReward Components.
4.3.2 LtrReward Components
With the virtual environment in place, as shown in Fig. 8, we now define LtrReward components (i.e., virtual reward applicable to VisGenEnv) as reward signals that guide the agent’s learning, which consist of Plan Reward and Fine-grained Reward .
Plan Reward evaluates the overall quality of the task plan using a proposed vPlanJudger, an expert-informed LLM evaluator that leverages a curated repository of expert reference plans to provide in-context guidance. By performing cross-referenced reasoning between the candidate plan and expert-authored strategies, the vPlanJudger computes a multidimensional alignment score focusing on five key facets: (1) Requirement Fulfillment, a binary check on whether the output’s modality and quantity align with the user request; (2) Logical Coherence, verifying the causal validity of sub-task sequencing; (3) Pragmatic Executability, ensuring each step is grounded within the available toolset or LLM capabilities to avoid hallucinatory actions; (4) Decomposition Atomicity, which evaluates whether the plan is partitioned into actionable atomic tasks; and (5) Expert-Guided Optimality, which rewards task-specific best practices such as identity consistency for multi-shot content, beat-aligned audio-visual synchronization, and the strategic minimization of complexity.
The Fine-grained Reward integrates both rule-based and effect-based signals to ensure structurally valid execution and successful task realization. Specifically: (1) Rule-based components include Format Compliance , which validates UTPC structural correctness via parsing of tags, ordering, content, and JSON validity; Tool Invocation , which scores execution success with graded penalties for intermediate or final failures; and Visual Consistency , which rewards appropriate use of reference-based generation when consistency is required. (2) Effect-based components include Result Achievement , which verifies output constraints such as image count and video duration within tolerance bounds, and Trajectory Coherence , which evaluates alignment between planning intent and executed actions through an LLM-evaluator. Together, these rewards provide trajectory-level supervision that encourages correct agentic structure, reliable tool usage, and coherent visual creation outcomes.
4.3.3 Theoretical Foundations of Virtual Reinforcement Learning
Based on the constructed virtual environment and the LtrReward components, we provide a theoretical analysis to explain the effectiveness of VRL when transferred to real-world execution. The theoretical legitimacy of VRL rests on its ability to maintain policy efficacy despite the intrinsic discrepancies between virtual simulation and real-world execution. Specifically, VRL operates under a Rollout Gap, where the agent lacks real visual feedback to rectify its trajectory, and an Objective Inconsistency, caused by substituting the vision reward (which measures perceptual quality across multiple visual dimensions) with a structural proxy . To evaluate how these discrepancies affect policy transfer, we model the sim-to-real transition as a function of four synergistic variables: (i) Tool Capability (), quantifying the reliability of the generative engine; (ii) PST Prior (), anchoring the agent’s initial reasoning within a distribution derived from real expert data; (iii) Plan Sufficiency (), measuring the causal link between logical correctness and visual quality; and (iv) Result Reward (), ensuring the structural completion of tasks.
The following theorems establish the mathematical foundation of VRL: Theorem 4.1 provides an error bound guarantee, proving that the sim-to-real gap remains controllable under the joint constraint of these variables; Theorem 4.2 characterizes the real-world performance gain as a competition between Causal Improvement and Transfer Loss, showing that VRL yields non-negative improvement whenever the causal reward gain dominates the bounded sim-to-real error.
Theorem 4.1 (Virtual-to-Real Error Bound).
Let and be the expected returns of policy in real and virtual environments. And are environment-specific scaling factors. The transfer error is bounded by:
Theorem 4.1 quantifies how the sim-to-real divergence is suppressed: (i) Dynamics Gap is minimized by , ensuring virtual procedural logic mirrors real API behavior; (ii) Action Bias Bound is constrained by the PST prior, which prevents policy drift in the absence of real visual feedback by maintaining consistency with expert decision-making; (iii) Goal Alignment Error is mitigated by the coupling of and , ensuring the virtual completion objective serves as a reliable proxy for real-world success.
Theorem 4.2 (Real-World Improvement of VRL).
Under the error bound , the real-world performance gain depends on the dominance of Causal Improvement over Transfer Loss:
where is the effectiveness coefficient, and denotes the anchoring strength of the PST prior in constraining policy exploration. Virtual reward consisting of and , and denotes the expected increment of virtual reward, representing the agent’s logic optimization in planning and execution.
The practical transferability of VRL is validated by the convergence behavior in our experiments, where the agent achieves an average virtual reward exceeding 95%. This saturation of total virtual reward indicates that the Causal Improvement term is maximized, providing a substantial logical buffer to offset transfer discrepancies. By substituting these empirical results into Theorem 4.1, we observe that the Action Bias Bound is strictly suppressed by the PST prior, while the Goal Alignment Error is mitigated by the coupling of and , remaining stable as the agent masters structural completion. Consequently, the Transfer Loss is primarily governed by the Dynamics Gap . This reveals a critical insight: VRL efficacy is fundamentally a function of generative tool quality. As increases—meaning the underlying visual creation tools become more reliable and follow procedural logic more closely—the transfer loss diminishes, allowing the massive logical gains from virtual training to translate effectively into superior real-world visual quality. Therefore, we derive the following corollary:
Corollary 4.3 (Fidelity-Anchored Transfer).
Provided the virtual reward reaches a near-optimal level, the real-world gain of VRL is monotonically non-decreasing with respect to .
4.3.4 Plan-Driven Reward Design
Theorems 4.1 and 4.2 indicate that real-world improvement critically depends on planning quality. Motivated by this insight, we adopt a plan-driven reward that enforces causal dependency between planning and execution:
Here, measures plan correctness, while captures execution-level structural validity. The multiplicative coupling ensures that execution alone cannot achieve high reward without a valid plan, and maximal reward is obtained only when a correct plan is faithfully executed. This mechanism directly aligns with Theorem 4.2, promoting robust long-horizon planning and tool-use strategies within virtual training.
5 Experiment
5.1 VisGenBench
Existing video generation benchmark VBench-2.0 28 has made significant contributions to evaluating the quality of individual-generated videos. But it lacks the capability to evaluate multi-step visual creation trajectories that involve complex tool invocation and long-horizon planning. While ComfyBench 9 attempts to assess multi-step trajectories, it is specifically designed for ComfyUI 29 and evaluates agent performance based solely on ComfyUI execution success, making it unsuitable for general API-based tool invocation scenarios. To address this critical gap, we introduce VisGenBench, a comprehensive benchmark designed for evaluating visual generation agentic models that operate through multi-step tool invocation to accomplish complex image and video creation tasks.
Table 1: Test dataset composition of VisGenBench, with 400 image tasks and 800 video tasks.
| Type | Content | Content | Object | Scene | Style | Variety | Visual | Video | Video |
| Creative | Match | Consistency | Consistency | Consistency | Amount | Duration | Storyboard | ||
| Image Tasks | 50 | 50 | 50 | 50 | 50 | 50 | 100 | – | – |
| Video Tasks | 50 | 50 | 50 | 50 | 50 | 50 | 100 | 200 | 200 |
| Total | 100 | 100 | 100 | 100 | 100 | 100 | 200 | 200 | 200 |
5.1.1 Test Dataset Composition
As shown in Tab. 1, the VisGenBench consists of a total of 1.2k test samples, including 400 image-generation tasks and 800 video-generation tasks. Each task is designed to reflect multi-step creation trajectories, requiring to generation of many images and videos. The benchmark spans 10 evaluation dimensions and covers 35+ real-world application scenarios, encompassing domains such as advertising, storytelling, entertainment, animation, etc.
5.1.2 Evaluation Framework
The VisGenBench evaluation framework integrates both objective and subjective assessments to measure an agent’s ability to perform multi-step visual generation tasks.
Objective Evaluation Objective evaluation focuses on quantifiable and automatically measurable aspects of the generated content. Specifically, it consists of two components: (1) Success Rate: Measures whether the model successfully returns valid images/videos when requested by user. A generation containing the correct modality is counted as Success. (2) Basic Visual Attributes: Quantitative evaluation of the generated results, including visual quantity, video storyboard count, and video duration. These attributes are automatically assessed using standardized tools.
Subjective Evaluation Subjective aspects such as visual consistency, diversity, storytelling quality, and audio perception cannot be fully captured through traditional metrics. We therefore introduce a VLM-Grader with pre-defined fine-grained scoring rubrics, implemented using the Gemini2.5-Pro model. For each subjective evaluation dimension, we define a tailored meta evaluation list—a structured rubric containing detailed scoring items (e.g., character consistency, style coherence, narrative flow, audio synchronization, etc). Gemini2.5-Pro provides a meta-evaluation score for each meta-item, and the aggregated score forms the overall result for that dimension. To align automated scoring with human judgment, we calibrate Gemini2.5-Pro’s meta-evaluation intensity on VisGenBench. This ensures that both mean scores and relative rankings evaluated by Gemini2.5-Pro remain consistent with expert human assessments, achieving a human-aligned evaluation process.
Table 2: Comparisons on VisGenBench by VLM Evaluation. S-Rate: Success Rate, O-Score: Overall Score. The best and second-best results are highlighted.
| Method | Creative | Match | Object | Scene | Style | Variety | Amount | Duration | Storyboard | S-Rate | O-Score |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-5 | 0.683 | 0.641 | 0.593 | 0.579 | 0.638 | 0.232 | 0.620 | 0.263 | 0.660 | 0.863 | 0.577 |
| Gemini2.5-Pro | 0.777 | 0.802 | 0.625 | 0.602 | 0.573 | 0.345 | 0.540 | 0.376 | 0.700 | 0.933 | 0.627 |
| Qwen3-VL-8B-Tk | 0.104 | 0.078 | 0.100 | 0.065 | 0.109 | 0.014 | 0.160 | 0.034 | 0.040 | 0.142 | 0.085 |
| VisionCreator-8B | 0.651 | 0.661 | 0.645 | 0.638 | 0.595 | 0.211 | 0.480 | 0.429 | 0.580 | 0.925 | 0.581 |
Table 3: Comparisons on VisGenBench by Human Evaluation. All models use the new version of the system prompt, which differs from the one used in Tab. 2. Overall Score = (Success Rate of Image × Human Evaluation of Image + Success Rate of Video × Human Evaluation of Video) / 2. The performance comparisons of all detailed human evaluation dimensions are shown in Fig. 1.
| Model | Success Rate (Image) | Success Rate (Video) | Human Evaluation (Image) | Human Evaluation (Video) | Overall Score |
|---|---|---|---|---|---|
| GPT-5 | 95.95% | 93.00% | 3.52 | 3.25 | 3.19 |
| Gemini2.5-Pro | 91.00% | 84.00% | 3.53 | 3.35 | 3.01 |
| Qwen3-VL-32B-Thinking | 97.00% | 93.00% | 3.47 | 3.23 | 3.18 |
| Qwen3-VL-32B-RL | 91.00% | 87.00% | 3.51 | 3.40 | 3.07 |
| Qwen3-VL-32B-SFT | 96.00% | 94.00% | 3.53 | 3.37 | 3.27 |
| VisionCreator-32B | 99.00% | 96.00% | 3.53 | 3.49 | 3.42 |
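The Overall Score in Tab. 3 can be recomputed from the other four columns. The sketch below copies the figures from the table and uses illustrative function names; the reported values are reproduced when the result is truncated (rather than rounded) to two decimals, which is an observation about the table and not a documented convention.

```python
import math


def overall_score(sr_image, sr_video, he_image, he_video):
    """Mean over modalities of (success rate x human evaluation score)."""
    return (sr_image * he_image + sr_video * he_video) / 2


def truncate2(x):
    # Drop everything after the second decimal place (no rounding).
    return math.floor(x * 100) / 100


# model: (image SR, video SR, image human score, video human score, reported overall)
table3 = {
    "GPT-5": (0.9595, 0.93, 3.52, 3.25, 3.19),
    "Gemini2.5-Pro": (0.91, 0.84, 3.53, 3.35, 3.01),
    "Qwen3-VL-32B-Thinking": (0.97, 0.93, 3.47, 3.23, 3.18),
    "Qwen3-VL-32B-RL": (0.91, 0.87, 3.51, 3.40, 3.07),
    "Qwen3-VL-32B-SFT": (0.96, 0.94, 3.53, 3.37, 3.27),
    "VisionCreator-32B": (0.99, 0.96, 3.53, 3.49, 3.42),
}

for model, (sri, srv, hei, hev, reported) in table3.items():
    computed = overall_score(sri, srv, hei, hev)
    print(f"{model}: computed={computed:.4f} reported={reported}")
```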
5.2 Results on VisGenBench by VLM Evaluation
As shown in Tab. 2, VisionCreator-8B is competitive with much larger commercial models (Overall Score 0.581 vs. 0.577 for GPT-5 and 0.627 for Gemini2.5-Pro), while significantly outperforming its base model Qwen3-VL-8B-Thinking (0.085). The key findings are: (1) High Success Rate and Reliability: VisionCreator-8B achieves a success rate of 0.925, surpassing GPT-5 (0.863) and approaching Gemini2.5-Pro (0.933). This demonstrates the effectiveness of our UTPC framework in ensuring task completion reliability, a crucial requirement for practical visual creation applications. (2) Strong Consistency Performance: VisionCreator-8B achieves the highest scores in object consistency (0.645) and scene consistency (0.638) among all compared models, including the much larger Gemini2.5-Pro and GPT-5. This validates our model’s strong capability in maintaining visual coherence throughout multi-step creation processes, a core benefit of the native agentic architecture. (3) Implications: The results support our core hypothesis that a specialized native visual creation agent, even with significantly fewer parameters, can achieve performance competitive with general-purpose commercial models through targeted architectural design and training methodology. VisionCreator’s particular strengths in success rate and consistency metrics underscore the practical advantages of our UTPC framework for real-world visual content creation applications.
Table 4: Ablation study with VisionCreator-8B on VisGenBench-104 comparing different training strategies. VisGenBench-104 is a sampled subset of VisGenBench. Model configurations: RL1: PST + (Result+Format) reward; RL2: PST + Plan×(Result+Format) reward; RL3: Qwen3-VL + Plan×(Result+Format) reward; RL4: PST + Plan×Fine reward; v1: 3×VisGenData-4k; v2: 3×VisGenData-4k + General-1M; v3: 20×VisGenData-4k + General-1M; v4: PST + 3×VisGenData-4k + General-1%; v5: PST + 3×VisGenData-4k + General-5%; v6: PST + 3×VisGenData-4k + General-10%; v7: PST + 3×VisGenData-4k + General-20%.
| Method | Creative | Match | Object | Scene | Style | Variety | Amount | Duration | Storyboard | S-Rate | O-Score |
|---|---|---|---|---|---|---|---|---|---|---|---|
| RL1 | 0.534 | 0.817 | 0.694 | 0.547 | 0.579 | 0.249 | 1.000 | 0.397 | 0.625 | 0.904 | 0.634 |
| RL2 | 0.579 | 0.808 | 0.677 | 0.479 | 0.558 | 0.265 | 0.800 | 0.478 | 0.875 | 0.942 | 0.644 |
| RL3 | 0.671 | 0.674 | 0.621 | 0.622 | 0.555 | 0.217 | 0.800 | 0.513 | 0.750 | 0.885 | 0.631 |
| RL4 | 0.573 | 0.794 | 0.672 | 0.696 | 0.569 | 0.150 | 1.000 | 0.534 | 0.625 | 0.925 | 0.654 |
| v1 | 0.000 | 0.050 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.019 | 0.007 |
| v2 | 0.230 | 0.334 | 0.382 | 0.339 | 0.396 | 0.163 | 0.600 | 0.134 | 0.500 | 0.490 | 0.357 |
| v3 | 0.262 | 0.422 | 0.300 | 0.260 | 0.473 | 0.068 | 0.600 | 0.100 | 0.250 | 0.481 | 0.322 |
| v4 | 0.283 | 0.468 | 0.399 | 0.295 | 0.318 | 0.000 | 0.600 | 0.183 | 0.000 | 0.442 | 0.299 |
| v5 | 0.266 | 0.361 | 0.366 | 0.246 | 0.201 | 0.084 | 0.200 | 0.029 | 0.125 | 0.490 | 0.237 |
| v6 | 0.239 | 0.310 | 0.326 | 0.194 | 0.273 | 0.149 | 0.600 | 0.098 | 0.125 | 0.413 | 0.273 |
| v7 | 0.420 | 0.701 | 0.430 | 0.447 | 0.430 | 0.028 | 0.600 | 0.344 | 0.375 | 0.625 | 0.440 |
5.3 Results on VisGenBench by Human Evaluation
In addition to automated VLM-based evaluation (Tab. 2), we conduct a thorough human evaluation to assess the perceptual quality of multi-step visual creation tasks, including images and videos (Tab. 3), which shows that: (1) Overall Findings: VisionCreator-32B achieves the highest Overall Score of 3.42, surpassing both GPT-5 (3.19) and Gemini2.5-Pro (3.01). This indicates that our UTPC framework not only ensures task success in an automated setting but also delivers outputs that are qualitatively preferred by human evaluators. (2) Image vs. Video Performance: VisionCreator-32B excels across both modalities, with 99% image success and 96% video success, accompanied by strong human evaluation scores (3.53 for images, 3.49 for videos). This balanced performance highlights the model’s capability to maintain coherent multi-step planning and execution for both static and dynamic content. (3) Implications: The human evaluation corroborates trends observed in VLM-based metrics, validating that the model’s planning-driven reward design and VRL training not only improve automated success metrics but also enhance perceptual quality, consistency, and user satisfaction in real-world multi-step visual creation tasks.
5.4 Ablation Studies
We conduct ablation studies on VisGenBench-104, a sampled subset of VisGenBench. Key findings from Tab. 4 include: (1) Effectiveness of PST. PST with v7 (PST + 3×VisGenData-4k + General-20%) achieves a significant improvement in Overall Score over SFT with v2 (3×VisGenData-4k + General-1M) (0.440 vs. 0.357). Among the PST variants, the largest general data ratio performs best (v7: 0.440 vs. 0.299, 0.237, and 0.273 for v4, v5, and v6), although the trend across v4→v5→v6→v7 is not strictly monotonic. This suggests that balanced specialization prevents overfitting while maintaining generalization. (2) Data Configuration Strategies. Simply increasing specialized data scale does not guarantee improvement: v3 (20×VisGenData-4k + General-1M) underperforms v2 (3×VisGenData-4k + General-1M) (0.322 vs. 0.357), indicating that excessive data repetition causes overfitting. Our PST strategy achieves better performance through optimized data ratios. (3) Virtual Reinforcement Learning. All VRL models (RL1 to RL4) substantially outperform the SFT and PST variants (v1 to v7). RL4 (PST + Plan×Fine reward) improves Overall Score by 49% relative to the best PST model v7 (0.654 vs. 0.440), demonstrating VRL’s effectiveness. (4) Reward Function Designs. Building upon RL1, RL2 (PST + Plan×(Result+Format) reward), which incorporates an additional plan reward, achieves a higher Success Rate (0.942 vs. 0.904) and Overall Score (0.644 vs. 0.634). RL4 achieves the best Overall Score (0.654) and demonstrates strong comprehensive performance across multiple dimensions, indicating that fine-grained rewards enhance model capability. (5) Importance of Pre-training Foundation. RL2 (PST + Plan×(Result+Format) reward) outperforms RL3 (Qwen3-VL + Plan×(Result+Format) reward) (0.644 vs. 0.631) despite identical rewards, with a notably higher Success Rate (0.942 vs. 0.885), validating that PST provides a stronger foundation for RL training.
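The relative improvements quoted from Tab. 4 follow from a simple ratio. The sketch below copies the O-Score column from the table and uses an illustrative helper name to recompute the comparisons.

```python
def relative_gain(new: float, base: float) -> float:
    """Relative improvement of `new` over `base`, as a fraction of `base`."""
    if base == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return (new - base) / base


# Overall Score (O-Score) values copied from Tab. 4.
overall = {
    "v2": 0.357, "v3": 0.322, "v4": 0.299, "v5": 0.237, "v6": 0.273, "v7": 0.440,
    "RL1": 0.634, "RL2": 0.644, "RL3": 0.631, "RL4": 0.654,
}

for new, base in [("RL4", "v7"), ("v7", "v2"), ("RL2", "RL1"), ("RL2", "RL3")]:
    gain = relative_gain(overall[new], overall[base])
    print(f"{new} vs {base}: {gain:+.1%}")

# Check whether the PST variants improve strictly with the general data ratio.
pst = [overall[k] for k in ("v4", "v5", "v6", "v7")]
print("strictly increasing v4..v7:", all(a < b for a, b in zip(pst, pst[1:])))
```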

Figure 9: Visualization comparisons of consistency.
6 Conclusion
We present VisionCreator, a native visual-generation agent that unifies Understanding, Thinking, Planning, and Creation (UTPC) in an end-to-end framework. Our contributions include: (1) VisGenData-4k with UTPC structures via metacognition-based VisionAgent; (2) Progressive Specialization Training and Virtual Reinforcement Learning for stable capability acquisition; (3) VisGenBench for multi-step visual creation evaluation. Experiments show VisionCreator outperforms larger closed-source models, validating our approach. This work establishes a foundation for visual-generation agentic systems and autonomous creative content generation.
Detailed Theoretical Derivations of VRL Theorems
This appendix provides detailed mathematical derivations and proofs for the two VRL theorems presented in the main text. The derivation process is divided into three stages: formal modeling and definitions, derivation of the error upper bound (Theorem 4.1), and analysis of performance improvement (Theorem 4.2).
Stage 1: Formal Modeling and Definitions
We first formalize the agent’s policy, environment, and rewards to establish the foundation for subsequent derivations.
1.1 Formalization of Environment and Policy
Definition A.1 (MDP Tuple): The real-world task is modeled as a Markov Decision Process (MDP), denoted as $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \rho_0, \gamma)$.
- $\mathcal{S}$: State space, containing multimodal observations (textual instructions, tool feedback, virtual visual states).
- $\mathcal{A}$: Action space, containing reasoning tokens, planning steps, and tool invocations.
- $P$: Dynamic transition probability of the real environment.
- $R$: Real reward function, measuring the perceptual quality of generated content (e.g., aesthetics, alignment).
- $\rho_0$: Initial state distribution.
- $\gamma$: Discount factor.
Definition A.2 (Virtual Environment): The virtual environment is . Its core differences are:
- : Tool dynamics simulated by VisGenEnv, with fidelity quantified by the tool capability .
- : Virtual reward function, composed of and according to the plan-driven reward design. It is a structural proxy reward that substitutes for the computationally infeasible in the virtual environment.
Definition A.3 (Policy and Return): Let be a policy (mapping from states to actions). The expected discounted return of policy in environment is defined as:
Here, the trajectory is generated by , , and . For brevity, we denote and .
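For reference, the return in Definition A.3 takes the usual discounted form. The notation below is assumed (standard reinforcement-learning symbols: $P$ for the transition kernel, $R$ for the reward, $\rho_0$ for the initial state distribution, $\gamma$ for the discount factor, and hats for the virtual counterparts), so it should be read as a sketch and not as the exact expression of the main text:

```latex
J_{\mathcal{M}}(\pi)
  = \mathbb{E}_{\tau \sim (\pi,\, \mathcal{M})}
    \left[ \sum_{t=0}^{\infty} \gamma^{t}\, R(s_t, a_t) \right],
\qquad
s_0 \sim \rho_0,\quad
a_t \sim \pi(\cdot \mid s_t),\quad
s_{t+1} \sim P(\cdot \mid s_t, a_t).

% Shorthand for the real and virtual environments
J_{\text{real}}(\pi) := J_{\mathcal{M}}(\pi),
\qquad
J_{\text{virt}}(\pi) := J_{\hat{\mathcal{M}}}(\pi).
```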
1.2 Key Variables and Core Assumptions
Definition A.4 (Key Variables):
- Tool capability : Measures how well the virtual environment dynamics approximate the real dynamics . indicates perfect simulation.
- PST prior : The initialization policy obtained through Progressive Specialization Training (PST). Its behavioral distribution on real expert data is denoted as .
- Plan sufficiency : Measures the strength of the causal link between a “logically correct” plan and the final “high-quality visual output”.
- Result reward : A subcomponent of that evaluates whether the task is structurally completed (e.g., number of images, video duration).
Assumption A.1 (Dynamic Difference Upper Bound): There exists a constant related to environment complexity such that for all state-action pairs ,
This assumption stems from the high-fidelity simulation design of VisGenEnv: higher tool capability leads to smaller differences between virtual and real transitions.
Assumption A.2 (Reward Proxy Error): The relationship between the real visual reward and the proxy reward is modulated by plan sufficiency and result reward . There exists a constant such that for meaningful trajectories (i.e., when planning logic is correct), the reward difference satisfies:
This assumption reflects the design philosophy of LtrReward: when planning is sufficient () and the task is perfectly completed structurally (), the real visual quality also tends to be high.
Assumption A.3 (KL Constraint on Policy Deviation): The policy trained in the virtual environment differs from the PST prior in the state-action distribution. This difference can be measured by the KL divergence , and its impact on the return difference is linearly bounded. That is, there exists a constant such that the related performance difference is constrained by it.
Definition A.5 (Sim-to-Real Error): For a given policy , its sim-to-real error is defined as:
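One way to instantiate Assumptions A.1 to A.3 and Definition A.5 is sketched below. All symbols are assumed for illustration: $T \in [0,1]$ for tool capability, $S \in [0,1]$ for plan sufficiency, $R_{\text{res}} \in [0,1]$ for the result reward, $\pi_{\text{PST}}$ for the PST prior, and $C_P$, $C_R$, $C_{\mathrm{KL}}$ for the constants. The functional forms are chosen only to match the prose (each gap vanishes under perfect simulation, sufficient planning with perfect structural completion, or zero policy deviation) and may differ from the original statements.

```latex
% Assumption A.1: the dynamics gap shrinks as tool capability T approaches 1
\bigl\| P(\cdot \mid s,a) - \hat{P}(\cdot \mid s,a) \bigr\|_{1}
  \;\le\; C_P\,(1 - T)
  \qquad \forall (s,a).

% Assumption A.2: the reward proxy error shrinks as S and R_res approach 1
\bigl| R(s,a) - \hat{R}(s,a) \bigr|
  \;\le\; C_R\,\bigl(1 - S \cdot R_{\text{res}}\bigr).

% Assumption A.3: the effect of policy deviation is linear in the KL divergence
\text{(return difference due to policy deviation)}
  \;\le\; C_{\mathrm{KL}}\, D_{\mathrm{KL}}\!\bigl(\pi \,\|\, \pi_{\text{PST}}\bigr).

% Definition A.5: sim-to-real error
\epsilon(\pi) := \bigl| J_{\text{real}}(\pi) - J_{\text{virt}}(\pi) \bigr|.
```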
Stage 2: Derivation of Theorem 4.1 (Virtual-to-Real Error Upper Bound)
Theorem 4.1 (Virtual-to-Real Error Upper Bound) Restated: Under Assumptions A.1, A.2, A.3, for any policy (trained in the virtual environment, denoted as ), its sim-to-real error satisfies:
Proof:
We decompose the total error into three separately bounded components via the triangle inequality and constrain each using the above assumptions.
Step 2.1: Decompose Total Error Consider an intermediate environment , which uses the virtual environment dynamics but retains the real reward . Denote . Then:
Here represents the ideal return under dynamics and reward with the policy perfectly constrained by the PST prior (no deviation). We next upper bound each term.
Step 2.2: Bounding Term I (Dynamics Gap) Term I measures the return difference due to the difference between dynamic models and . According to Assumption A.1 and the Performance Difference Lemma, for any policy ,
Let and define , we obtain:
In the theorem statement, constant factors are absorbed into , so we have Term I .
Step 2.3: Bounding Term II (Reward Gap) Term II measures the difference between using the real reward and using the proxy reward (as the core part of ) under the same dynamics. According to Assumption A.2, for each step in the trajectory, the reward difference is bounded. Applying the Performance Difference Lemma (reward difference part) again yields:
Define , then Term II . In the theorem statement, is written as .
Step 2.4: Bounding Term III (Policy Bias) Term III measures the return loss due to the deviation of the virtually trained policy from the ideal PST prior . According to Assumption A.3, there exists a constant such that:
This assumption stems from the “anchoring” effect of the PST prior on the policy exploration space, preventing catastrophic policy drift in the absence of real visual feedback.
Step 2.5: Combining Error Upper Bounds Summing the upper bounds of Term I, II, and III, we obtain:
Relabeling constants , yields the form in Theorem 4.1.
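To make the three-term structure of the bound concrete, the sketch below evaluates a dynamics term, a reward term, and a policy term and sums them. The functional forms (`1 - tool_capability`, `1 - plan_sufficiency * result_reward`, and a term linear in the KL divergence) and the constants are assumptions chosen to match the prose of Steps 2.2 to 2.5; they are not taken from the theorem statement, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundConstants:
    """Hypothetical constants multiplying the three error terms."""
    c_dynamics: float
    c_reward: float
    c_policy: float


def sim_to_real_bound(tool_capability, plan_sufficiency, result_reward,
                      kl_to_prior, consts):
    """Upper bound on |J_real - J_virt| under the assumed functional forms."""
    for name, value in (("tool_capability", tool_capability),
                        ("plan_sufficiency", plan_sufficiency),
                        ("result_reward", result_reward)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    if kl_to_prior < 0:
        raise ValueError("KL divergence cannot be negative")

    dynamics_gap = consts.c_dynamics * (1.0 - tool_capability)               # Term I
    reward_gap = consts.c_reward * (1.0 - plan_sufficiency * result_reward)  # Term II
    policy_bias = consts.c_policy * kl_to_prior                              # Term III
    return dynamics_gap + reward_gap + policy_bias


consts = BoundConstants(c_dynamics=1.0, c_reward=1.0, c_policy=1.0)
perfect = sim_to_real_bound(1.0, 1.0, 1.0, 0.0, consts)     # ideal simulator and policy
typical = sim_to_real_bound(0.9, 0.8, 0.95, 0.05, consts)   # illustrative values only
print(perfect, typical)
```

Under these assumptions the bound is zero only when the simulator is perfect, the plan is fully sufficient, the task is structurally complete, and the policy has not drifted from the PST prior; lowering any of the first three quantities or increasing the KL term loosens it.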
Stage 3: Derivation of Theorem 4.2 (Real-World Performance Improvement Lower Bound)
Theorem 4.2 (Real-World Improvement of VRL) Restated: Let be the initial policy after PST training, and be the policy optimized through Virtual Reinforcement Learning (VRL). Define the virtual optimization gain as . Then, under the error bound of Theorem 4.1, the real-world performance improvement satisfies:
where is the effectiveness coefficient, and denotes the Anchoring Strength of the PST prior in constraining policy exploration.
Proof:
Step 3.1: Establish Inequality Based on Error Bound From Theorem 4.1, for any policy , we have . Applying this inequality to and respectively:
Subtracting the second inequality from the first yields:
Since itself is trained on real expert data, its sim-to-real error is expected to be small (aligned during PST). Therefore, the lower bound of performance improvement is mainly affected by the error of . Conservatively setting the transfer loss term as gives:
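The subtraction in Step 3.1 can be spelled out as follows. The symbols are assumed notation: $\epsilon(\pi)$ denotes any valid upper bound on $|J_{\text{real}}(\pi) - J_{\text{virt}}(\pi)|$, $\pi_{\text{PST}}$ the policy after PST, and $\pi_{\text{VRL}}$ the policy after VRL.

```latex
\begin{aligned}
J_{\text{real}}(\pi_{\text{VRL}})
  &\ge J_{\text{virt}}(\pi_{\text{VRL}}) - \epsilon(\pi_{\text{VRL}}), \\
J_{\text{real}}(\pi_{\text{PST}})
  &\le J_{\text{virt}}(\pi_{\text{PST}}) + \epsilon(\pi_{\text{PST}}), \\
J_{\text{real}}(\pi_{\text{VRL}}) - J_{\text{real}}(\pi_{\text{PST}})
  &\ge \underbrace{J_{\text{virt}}(\pi_{\text{VRL}}) - J_{\text{virt}}(\pi_{\text{PST}})}_{\Delta J_{\text{virt}}}
     \;-\; \bigl( \epsilon(\pi_{\text{VRL}}) + \epsilon(\pi_{\text{PST}}) \bigr).
\end{aligned}
```

When $\epsilon(\pi_{\text{PST}})$ is negligible, the subtracted term is dominated by $\epsilon(\pi_{\text{VRL}})$, which is the transfer loss discussed above.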
Step 3.2: Relating Virtual Gain to Real Gain (Causal Improvement) The in inequality (1) is the gain in virtual reward. We need to relate it to real performance improvement. This relies on a core idea: optimizing “planning and execution logic” in the virtual environment, as long as the simulation is sufficiently credible, causally leads to improved real-world visual quality. Define the effectiveness coefficient , which quantifies the expected increment in real reward per unit increment in virtual reward. We model it as the product of three key factors:
- : Tool capability determines the probability of logical execution being reproduced in reality.
- : Plan sufficiency determines the strength of association between correct logic and high-quality output.
- : Anchoring strength of the PST prior, indicating the degree to which the policy remains in a “reasonable” distribution region during VRL optimization, with . Strong anchoring () ensures the optimization direction remains effective in the real world.
Therefore, we assume a monotonic relationship:
When , the logical improvement brought by virtual optimization can be partially translated into real-world improvement.
Step 3.3: Derive the Final Lower Bound Substituting into inequality (2) yields the lower bound stated in Theorem 4.2:
Step 3.4: Condition for Non-Negative Improvement From the inequality in Theorem 4.2, the sufficient condition for non-negative improvement in real-world performance (i.e., ) is directly obtained as:
This means that the Causal Improvement brought by virtual training must be sufficient to cover the Transfer Loss arising from simulation imperfections. This does not require or ; as long as their product combined with the anchoring strength is large enough to make sufficiently large, and VRL can effectively increase (as shown in experiments where virtual reward exceeds 95%), positive transfer is guaranteed.
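The condition in Step 3.4 can be checked numerically with a small sketch. It assumes the effectiveness coefficient is the plain product of tool capability, plan sufficiency, and anchoring strength (as in Step 3.2), and that the lower bound has the form "coefficient times virtual gain minus transfer loss". All function names and numeric values are hypothetical.

```python
def effectiveness(tool_capability, plan_sufficiency, anchoring_strength):
    """Effectiveness coefficient, assumed to be the product of three factors in [0, 1]."""
    for name, value in (("tool_capability", tool_capability),
                        ("plan_sufficiency", plan_sufficiency),
                        ("anchoring_strength", anchoring_strength)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return tool_capability * plan_sufficiency * anchoring_strength


def real_gain_lower_bound(virtual_gain, eta, transfer_loss):
    """Lower bound on real-world improvement: causal improvement minus transfer loss."""
    return eta * virtual_gain - transfer_loss


def positive_transfer_guaranteed(virtual_gain, eta, transfer_loss):
    # Sufficient condition: causal improvement covers the transfer loss.
    return real_gain_lower_bound(virtual_gain, eta, transfer_loss) >= 0.0


eta = effectiveness(tool_capability=0.9, plan_sufficiency=0.8, anchoring_strength=0.9)
print(eta)
print(positive_transfer_guaranteed(virtual_gain=0.5, eta=eta, transfer_loss=0.2))
print(positive_transfer_guaranteed(virtual_gain=0.1, eta=eta, transfer_loss=0.2))
```

Neither the tool capability nor the plan sufficiency needs to equal 1 in this sketch: a larger virtual gain can compensate for an imperfect simulator as long as the product stays above the transfer loss.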
Summary
Through formal modeling, this derivation decomposes the challenge of sim-to-real transfer into differences at the dynamics, reward, and policy levels, and quantifies their upper bounds using key variables such as tool capability, plan sufficiency, and the PST prior. Theorem 4.1 shows that systematic error can be controlled by improving tool fidelity, strengthening PST anchoring, and optimizing plan-result alignment. Theorem 4.2 further shows that as long as virtual training can effectively enhance the agent’s logical capabilities (Causal Improvement) and this improvement outweighs the bounded systematic error (Transfer Loss), performance improvement in the real world is guaranteed. This provides a theoretical foundation for the application of virtual reinforcement learning in high-dimensional, long-horizon tasks such as visual creation.
Table 5: Human Evaluation of Detailed Dimensions on VisGenBench-Image (Score = Success Rate × Human Evaluation Score)
| Model | Semantic Matching | Style Matching | Emotion Matching | Subject Consistency | Design Integrity | Visual Integrity | Text Quality | Creativity | Overall |
|---|---|---|---|---|---|---|---|---|---|
| GPT-5 | 3.4883 | 3.6214 | 2.9656 | 3.6024 | 3.4408 | 3.4218 | 2.7565 | 3.4408 | 3.3458 |
| Gemini2.5-Pro | 3.3943 | 3.5399 | 2.8119 | 3.4034 | 3.2214 | 3.2669 | 2.7300 | 3.3215 | 3.2123 |
| Qwen3-VL-32B-Tk | 3.4435 | 3.7248 | 2.9876 | 3.5890 | 3.4047 | 3.4823 | 2.8130 | 3.4726 | 3.3659 |
| Qwen3-VL-32B-SFT | 3.3504 | 3.8016 | 2.8896 | 3.7632 | 3.4368 | 3.5040 | 2.8224 | 3.5232 | 3.3888 |
| VisionCreator-32B | 3.6432 | 3.8412 | 3.1581 | 3.7620 | 3.4452 | 3.6531 | 2.8809 | 3.5739 | 3.4947 |
Table 6: Human Evaluation of Detailed Dimensions on VisGenBench-Video (Score = Success Rate × Human Evaluation Score)
| Model | Script | Storyboard | Content Consistency | Subject Consistency | Video Effect | Visual Motion |
|---|---|---|---|---|---|---|
| GPT-5 | 3.1062 | 2.9202 | 3.1434 | 3.1713 | 3.0597 | 3.0039 |
| Gemini2.5-Pro | 2.9484 | 2.6796 | 3.0156 | 2.856 | 2.8728 | 2.6628 |
| Qwen3-VL-32B-Thinking | 3.069 | 2.9016 | 3.1713 | 3.162 | 2.9574 | 2.9388 |
| Qwen3-VL-32B-SFT | 3.6002 | 2.867 | 3.5814 | 3.3652 | 3.243 | 2.9328 |
| VisionCreator-32B | 3.5616 | 3.1872 | 3.5808 | 3.4176 | 3.4752 | 3.2256 |

| Model | Audio-Visual | Music | Dubbing | Subtitle | Transition | Editing | Overall |
|---|---|---|---|---|---|---|---|
| GPT-5 | 3.0411 | 3.2643 | 2.8644 | 2.9788 | 2.8812 | 2.8392 | 2.814 |
| Gemini2.5-Pro | 2.8056 | 2.7888 | 2.8728 | 2.8812 | 2.8392 | 2.562 | 2.814 |
| Qwen3-VL-32B-Thinking | 2.9481 | 2.9202 | 3.1341 | 3.069 | 2.9295 | 2.8737 | 3.0039 |
| Qwen3-VL-32B-SFT | 3.1772 | 3.0644 | 3.2148 | 3.0174 | 3.0268 | 2.9328 | 3.1678 |
| VisionCreator-32B | 3.3792 | 3.2928 | 3.3888 | 3.3216 | 3.2352 | 3.1104 | 3.3504 |
Table 7: General-purpose Datasets.
| Category | Name | Quantity |
|---|---|---|
| NLP | DeepSeek-R1-Distill-110k 19 | 110k |
| | LONGCOT-Refine-500K 26 | 500k |
| | alpaca-gpt4-data 24 25 | 100k |
| Multimodal | M3IT 15 | 1592k |
| Tool Calling | function-calling-chatml 8 | 112k |
| | xlam-function-calling-60k 39 | 60k |
| | ms-agent 21 | 600k |
| | ToolACE 20 | 11k |
| | ToolBench 27 | 123k |
| | AFM 17 | 76k |
Table 8: Task Distribution in VisGenData-4k Dataset
| No. | Video Generation Tasks | No. | Image Generation Tasks |
|---|---|---|---|
| 1. | Product marketing videos | 1. | Product images |
| 2. | Public service advertisements | 2. | Detail pages |
| 3. | Corporate promotion videos | 3. | Key Visual (KV) |
| 4. | Brand story videos | 4. | Landing pages / H5 graphics |
| 5. | Event promotion videos | 5. | Complete brand visual identity |
| 6. | Instructional videos | 6. | Banner graphics |
| 7. | Popular science documentaries | 7. | Official account cover images |
| 8. | Music videos (MV) | 8. | Xiaohongshu covers |
| 9. | Concert recordings | 9. | Marketing posters |
| 10. | Variety shows | 10. | Avatar design |
| 11. | Story videos | 11. | Static emoji generation |
| 12. | Video podcasts | 12. | ICON design |
| 13. | Picture books | 13. | LOGO design |
| 14. | Dynamic comics | 14. | Mini-game UI design |
| 15. | Animated short films | 15. | Character design |
| 16. | Animated movies | 16. | Character action design |
| 17. | Game adaptation films | 17. | Scene design |
| 18. | Game videos | 18. | Storyboards |
| 19. | Movies | 19. | Picture Book |
| 20. | Short dramas | 20. | Stylization |
| 21. | Story explanations | 21. | Realistic Photography |
Table 9: VisGenEnv integrates 36 visual creation tools.
| Tool Category | Tool Function | Tool Name |
|---|---|---|
| Text-to-Text | Storyboard Text Polishing (Claude) | tool_prompt_refine |
| | Storyboard Generation (Claude) | tool_video_shot_gen |
| | Script Tool (Claude) | tool_video_script_gen |
| | Storyboard Polishing (Claude) | tool_storyboard_polish |
| | Script & Storyboard Polishing | tool_script_storyboard_merge |
| | Text-to-Video (Veo3) | tool_text2video_veo |
| Text-to-Image | Text-to-Image (nano-banana) | tool_text2image_gemini |
| | Text-to-Image (hunyuan) | tool_text2image_hunyuan |
| | Text-to-Image (ByteDance) | tool_text2image_seed |
| | Text-to-Image (GPT) | tool_text2image_gpt |
| | Text-to-Image (Qwen) | tool_text2image_qwen |
| Image-to-Image | Image-to-Image (nano-banana) | tool_image_edit_gemini |
| | Image-to-Image (Qwen) | tool_image_edit_qwen |
| | Image-to-Image (GPT) | tool_image_edit_gpt |
| Image-to-Video | Image-to-Video (Keling) | tool_image2video_keling |
| | Image-to-Video (Veo3) | tool_image2video_veo3 |
| Audio Generation | Music Generation (Suno) | tool_music_suno |
| | Video Sound Effect Generation | tool_sound_fx_gen |
| | TTS Generation | tool_tts_generation |
| | Video Composition (MoviePy) | tool_video_composite |
| | Video Clip - MoviePy Post-processing | tool_video_postprocess |
| | Video Generation Automation Pipeline | tool_video_auto_pipeline |
| | Beat Detection Tool | tool_beat_detect |
| | Video Editing (Trim) | tool_video_trim_edit |
| | Video Speed Change | tool_video_speed_adjust |
| | TTS + Composition Tool | tool_tts_composite |
| | Audio Editing | tool_audio_edit_cut |
| | Add Subtitles | tool_subtitle_add_text |
| Multimodal Understanding | Video Understanding (Gemini2.5-Pro) | tool_video_analysis |
| | Audio Understanding (Gemini2.5-Pro) | tool_audio_analysis |
| | Image Understanding (Gemini2.5-Pro) | tool_image_analysis |
| Other | Tavily Search - Content Extraction | tool_search_content |
| | Inspiration Search | tool_search_inspire |
| | Summary Tool | tool_content_summary |
| | To-Do List | tool_task_manager |
| | HTML Generation Tool | tool_html_builder |

Figure 10: Visualizations for scene consistency.

Figure 11: Visualizations for object consistency.

Figure 12: Demo page.
Footnotes
- Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
- Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.
- Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024.
- Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494, 2022.
- Bin Lin, Yunyang Ge, Xinhua Cheng, Zongjian Li, Bin Zhu, Shaodong Wang, Xianyi He, Yang Ye, Shenghai Yuan, Liuhan Chen, et al. Open-sora plan: Open-source large video generation model. arXiv preprint arXiv:2412.00131, 2024.
- Weijia Wu, Zeyu Zhu, and Mike Zheng Shou. Automated movie generation via multi-agent cot planning. arXiv preprint arXiv:2503.07314, 2025.
- Junfei Xiao, Ceyuan Yang, Lvmin Zhang, Shengqu Cai, Yang Zhao, Yuwei Guo, Gordon Wetzstein, Maneesh Agrawala, Alan Yuille, and Lu Jiang. Captain cinema: Towards short movie generation. arXiv preprint arXiv:2507.18634, 2025.
- Xuenan Xu, Jiahao Mei, Chenliang Li, Yuning Wu, Ming Yan, Shaopeng Lai, Ji Zhang, and Mengyue Wu. Mm-storyagent: Immersive narrated storybook video generation with a multi-agent paradigm across text, image and audio. arXiv preprint arXiv:2503.05242, 2025a.
- Xiangyuan Xue, Zeyu Lu, Di Huang, Zidong Wang, Wanli Ouyang, and Lei Bai. Comfybench: Benchmarking llm-based agents in comfyui for autonomously designing collaborative ai systems. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 24614–24624, 2025.
- Litao Guo, Xinli Xu, Luozhou Wang, Jiantao Lin, Jinsong Zhou, Zixin Zhang, Bolan Su, and Ying-Cong Chen. Comfymind: Toward general-purpose generation via tree-based planning and reactive feedback. arXiv preprint arXiv:2505.17908, 2025.
- Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683, 2025.
- Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Zichun Liao, Yusuke Kato, Kazuki Kozuka, and Aditya Grover. Omniflow: Any-to-any generation with multi-modal rectified flows. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 13178–13188, 2025a.
- Jinxiang Lai, Jie Zhang, Jun Liu, Jian Li, Xiaocheng Lu, and Song Guo. Spider: Any-to-many multimodal llm. arXiv preprint arXiv:2411.09439, 2024.
- Yufeng Cui, Honghao Chen, Haoge Deng, Xu Huang, Xinghang Li, Jirong Liu, Yang Liu, Zhuoyan Luo, Jinsheng Wang, Wenxuan Wang, et al. Emu3.5: Native multimodal models are world learners. arXiv preprint arXiv:2510.26583, 2025.
- Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In International conference on machine learning, pp. 1691–1703. PMLR, 2020.
- Lijie Fan, Tianhong Li, Siyang Qin, Yuanzhen Li, Chen Sun, Michael Rubinstein, Deqing Sun, Kaiming He, and Yonglong Tian. Fluid: Scaling autoregressive text-to-image generative models with continuous tokens. arXiv preprint arXiv:2410.13863, 2024.
- Jian Han, Jinlai Liu, Yi Jiang, Bin Yan, Yuqi Zhang, Zehuan Yuan, Bingyue Peng, and Xiaobing Liu. Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis. arXiv preprint arXiv:2412.04431, 2024.
- Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction. arXiv preprint arXiv:2404.02905, 2024.
- Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525, 2024.
- Yatian Pang, Peng Jin, Shuo Yang, Bin Lin, Bin Zhu, Zhenyu Tang, Liuhan Chen, Francis EH Tay, Ser-Nam Lim, Harry Yang, et al. Next patch prediction for autoregressive visual generation. arXiv preprint arXiv:2412.15321, 2024.
- Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022.
- Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.
- William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4195–4205, 2023.
- Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024.
- Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024.
- Zhenran Xu, Yiyu Wang, Xue Yang, Longyue Wang, Weihua Luo, Kaifu Zhang, Baotian Hu, and Min Zhang. Comfyui-r1: Exploring reasoning models for workflow generation. arXiv preprint arXiv:2506.09790, 2025b.
- Oucheng Huang, Yuhang Ma, Zeng Zhao, Mingrui Wu, Jiayi Ji, Rongsheng Zhang, Zhipeng Hu, Xiaoshuai Sun, and Rongrong Ji. Comfygpt: A self-optimizing multi-agent system for comprehensive comfyui workflow generation. arXiv preprint arXiv:2503.17671, 2025.
- Dian Zheng, Ziqi Huang, Hongbo Liu, Kai Zou, Yinan He, Fan Zhang, Lulu Gu, Yuanhan Zhang, Jingwen He, Wei-Shi Zheng, et al. Vbench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness. arXiv preprint arXiv:2503.21755, 2025.
- comfyanonymous. Comfyui. https://github.com/comfyanonymous/ComfyUI, 2023. GitHub repository.