VisionCreator: A Native Visual-Generation Agentic Model with Understanding, Thinking, Planning and Creation

Jinxiang Lai ²∗, Zexin Lu ¹∗, Jiajun He ¹, Rongwei Quan ¹, Wenzhe Zhao ¹, Qinyu Yang ¹, Qi Chen ¹, Qin Lin ¹†, Chuyue Li ¹, Tao Gao ¹, Yuhao Shan ¹, Shuai Shao ¹,
Song Guo ²§, Qinglin Lu ¹†
¹ Tencent Hunyuan, ² Hong Kong University of Science and Technology
∗ Equal contribution, § Corresponding author, † Project lead

Abstract

Visual content creation tasks demand a nuanced understanding of design conventions and creative workflows, capabilities that are challenging for general models, while workflow-based agents lack the specialized knowledge needed for autonomous creative planning. To overcome these challenges, we propose VisionCreator, a native visual-generation agentic model that unifies Understanding, Thinking, Planning, and Creation (UTPC) capabilities within an end-to-end learnable framework. Our work introduces four key contributions: (i) VisGenData-4k and its construction methodology using metacognition-based VisionAgent to generate high-quality creation trajectories with explicit UTPC structures; (ii) The VisionCreator agentic model, optimized through Progressive Specialization Training (PST) and Virtual Reinforcement Learning (VRL) within a high-fidelity simulated environment, enabling stable and efficient acquisition of UTPC capabilities for complex creation tasks; (iii) VisGenBench, a comprehensive benchmark featuring 1.2k test samples across diverse scenarios for standardized evaluation of multi-step visual creation capabilities; (iv) Remarkably, our VisionCreator-8B/32B models demonstrate superior performance over larger closed-source models across multiple evaluation dimensions. Overall, this work provides a foundation for future research in visual-generation agentic systems.

1 Introduction

AI-assisted visual content creation has revolutionized workflows from professional design to social media. The field has evolved from single-image generation ¹ ² ³ ⁴ ⁵ to complex multi-modal synthesis ⁶ ⁷ ⁸ ⁹ ¹⁰, demanding systems that can understand creative intent, plan multi-step operations, and autonomously execute intricate workflows. As shown in Fig. 2, current approaches to autonomous visual creation fall into three main paradigms, each with distinct limitations: (a) General-purpose Unified Multimodal Models (UMM) ¹¹ ¹² ¹³ ¹⁴ leverage large-scale pre-training to achieve impressive visual understanding, but lack the domain-specific knowledge required for autonomous creative planning and struggle to decompose complex objectives without extensive prompt engineering. (b) Workflow-specific Agents ⁶ ⁷ ⁸ employ predefined pipelines for specific domains like movie generation or story creation, but their rigid architectures cannot adapt to diverse creative tasks or handle unexpected outcomes during execution. (c) Workflow-guided Agents ⁹ ¹⁰ orchestrate external tools through carefully designed prompts and coordination logic, leveraging general language models to interpret requests and sequence operations. However, this approach faces several limitations: (i) reliance on prompt engineering rather than learned domain knowledge, limiting creative understanding; (ii) explicitly programmed coordination logic that restricts adaptability to diverse tasks; and (iii) inability to be jointly optimized end-to-end for creative task performance.

Figure 1: Human Evaluation results on VisGenBench-Image and VisGenBench-Video.

To overcome these limitations, we propose VisionCreator, a native visual-generation agentic model that unifies Understanding, Thinking, Planning, and Creation (UTPC) capabilities in an end-to-end learnable framework, as shown in Fig. 2(d). Unlike existing approaches that rely on predefined workflows or external workflow templates, our native architecture intrinsically integrates four capabilities: Understanding design conventions and user intent, Thinking through complex creative constraints, Planning multi-step execution trajectories, and Creation of high-quality, diverse visual content. However, realizing this new paradigm faces several critical challenges:

① Data Bottleneck: Currently, no comprehensive datasets exist for training agents to perform visual content creation through tool invocation. The lack of high-quality trajectories prevents supervised learning of the UTPC capabilities.

② Task Complexity: How to develop models that can handle the full spectrum of visual creation challenges, which encompass (i) Diverse task types, (ii) Varying difficulty levels from basic generation to advanced composition, and (iii) Complex creation tasks requiring 20+ execution steps? Existing approaches face significant limitations: specialized systems excel in narrow domains but fail to generalize across diverse tasks, while general models lack the depth for sophisticated creative reasoning and struggle with long-horizon consistency and adaptive strategy adjustment.

③ Training Difficulty: How to establish an effective and efficient training paradigm for such a native agent? The conventional SFT+RL framework faces significant obstacles: (i) The SFT phase struggles to balance general capability preservation with domain-specific specialization, often leading to catastrophic forgetting or insufficient expertise; (ii) Direct online RL training with real tools incurs prohibitive costs and instability due to expensive API invocation and limited concurrency. Furthermore, designing accurate reward signals for multi-step creative trajectories is particularly challenging, as imperfect reward functions are highly vulnerable to reward hacking.

Figure 2: Framework comparisons. (a) UMM. (b) Workflow-specific Agent. (c) Workflow-guided Agent. (d) Our Native VisionCreator.

To address these challenges, we propose:

(i) VisGenData-4k with UTPC Structure: We design a metacognition-based VisionAgent to generate a comprehensive dataset following the UTPC structure, featuring diverse visual creation tasks across multiple difficulty levels. Through rigorous human quality inspection, we meticulously filter and retain only the highest-quality data samples. The resulting VisGenData-4k provides diverse and high-quality execution trajectories that explicitly capture Understanding of design conventions, Thinking through creative constraints, Planning of multi-step trajectories, and Creation of visual content, offering rich supervision signals for complex creative workflows.

(ii) Progressive Specialization Training (PST): We introduce a novel Progressive Specialization Training methodology that cultivates UTPC capabilities through two-stage optimization. PST effectively addresses the generalization-specialization trade-off by first establishing robust Understanding and Thinking capacities through general foundation learning, followed by targeted domain specialization to enhance Planning and Creation expertise. This progressive strategy not only prevents catastrophic forgetting of general abilities but also efficiently identifies optimal data composition for stagewise specialization, enabling the model to develop comprehensive UTPC capacities while maintaining strong cross-domain reasoning abilities.

(iii) Virtual VisGenEnv Construction: We construct VisGenEnv, a virtual environment for VRL. It features 36 tools with high-fidelity simulation of their behaviors. Multimodal outputs are simulated by returning random samples from a media database, providing correct physical attributes. This design enables effective learning of workflow planning through accurate tool behavior simulation.

(iv) Virtual Reinforcement Learning (VRL) with LtrReward: We develop an innovative Virtual Reinforcement Learning (VRL) paradigm that conducts the entire reinforcement learning using Long Trajectory Reasoning Reward (LtrReward) within the high-fidelity VisGenEnv. This approach bypasses the prohibitive cost of thousands of GPUs by leveraging simulated tool-call behaviors and functional logic, enabling stable and scalable learning of high-quality planning and action trajectories. Moreover, we provide a theoretical analysis that establishes formal guarantees on sim-to-real transfer and real-world performance improvement.

Finally, we introduce VisGenBench, a comprehensive benchmark designed for evaluating visual generation agentic models that operate through multi-step tool invocation to accomplish complex image and video creation tasks. Our benchmark encompasses: (i) Comprehensive Test Suite - featuring 1.2k test samples including 400 image-generation tasks and 800 video-generation tasks; (ii) Diverse Applications - spanning 10 evaluation dimensions across 35+ real-world scenarios; (iii) Standardized Protocol - ensuring reproducible evaluation through structured scoring rubrics.

Overall, our contributions are: (i) VisionCreator, a novel native visual-generation agentic model that unifies UTPC capabilities in an end-to-end learnable framework; (ii) VisGenData-4k and its construction framework using metacognition-based VisionAgent to generate high-quality creation trajectories with UTPC structures; (iii) A progressive training methodology combining Progressive Specialization Training (PST) and Virtual Reinforcement Learning (VRL) with LtrReward, enabling stable and efficient learning of complex creation trajectories entirely within the virtual environment VisGenEnv; (iv) The VisGenBench benchmark with 1.2k test samples across diverse scenarios for standardized evaluation of multi-step visual creation capabilities.

2.1 Image Generation

Current image generation models primarily fall into two categories: Autoregressive ¹⁵ ¹⁶ ¹⁷ ¹⁸ ¹⁹ ²⁰ models and Diffusion ¹ ² ³ ⁴ ⁵ models. While these models provide powerful single-step image generation capabilities, they primarily focus on the Creation aspect of visual content generation. Our VisionCreator builds upon these fundamental generation technologies but extends them by integrating comprehensive Understanding, Thinking, and Planning capabilities. This allows our agent to not only generate individual images but also reason about complex creative requirements and plan multi-step visual creation workflows that leverage these underlying generation models as tools.

2.2 Video Generation

Video generation methods build on image models by adding temporal modeling. Approaches like Make-A-Video ²¹ and SVD ²² extend image generation to video, while newer architectures like DiT ²³ and MMDiT ²⁴ in models such as CogVideoX ²⁵ show progress in handling longer videos. These video generation tools are important for our agent’s Creation ability, but each operates in isolation. Our VisionCreator connects these tools through planning and reasoning to handle complete creation tasks.

2.3 Visual Generation Agents

Current approaches to autonomous visual creation include three main agent paradigms: (i) Workflow-specific Agents (e.g., MovieAgent ⁶, Captain Cinema ⁷, MM-StoryAgent ⁸) employ predefined pipelines for specialized domains but lack adaptability to diverse creative tasks. (ii) ComfyUI Workflow Generation methods (e.g., ComfyAgent ⁹, ComfyMind ¹⁰, ComfyUI-R1 ²⁶, ComfyGPT ²⁷) specialize in generating ComfyUI-format workflows, which limits their visual creation capability in general API scenarios. (iii) Workflow-guided Agents ⁹ ¹⁰ orchestrate external tools through prompt engineering but face limitations in creative understanding depth and end-to-end optimization. These limitations motivate our native visual-generation agent that intrinsically integrates UTPC capabilities in an end-to-end learnable framework.

Figure 3: VisGenData-4k construction pipeline.

3 VisGenData-4k with UTPC Structure

To tackle the data bottleneck in training visual creation agents, we design VisionAgent, a dataset generation framework based on a Metacognition paradigm. To construct a high-quality VisGenData dataset, VisionAgent employs commercial proprietary models (such as GPT-5, GPT-4o, Veo3, and Sora2) for multimodal data generation, and we further filter low-quality trajectories with algorithms and human experts. As shown in Fig. 3, the construction pipeline is as follows: (i) VisionAgent first generates 16k trajectories from 20k queries covering 42 scenarios. (ii) With the rigorous LtrReward and VLM-Grader methods, we remove 10k low-quality trajectories and obtain 6k candidate trajectories. (iii) These candidates then undergo a manual review by human experts, who filter out 2k undesired trajectories, resulting in 4k high-quality trajectories.
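The filtering funnel above can be checked with a few lines of arithmetic. The sketch below only restates the counts given in the text; the helper name `retention` is introduced here for illustration and is not part of the paper.

```python
# Trajectory counts taken from the pipeline description above.
generated = 16_000
removed_by_auto_filter = 10_000    # LtrReward + VLM-Grader stage
removed_by_human_review = 2_000    # manual expert review stage

candidates = generated - removed_by_auto_filter
final = candidates - removed_by_human_review


def retention(kept: int, total: int) -> float:
    """Fraction of items that survive a filtering stage (illustrative helper)."""
    return kept / total


print(f"candidates after automatic filtering: {candidates}")
print(f"final trajectories after human review: {final}")
print(f"automatic filter keeps {retention(candidates, generated):.1%}")
print(f"human review keeps {retention(final, candidates):.1%}")
print(f"overall yield {retention(final, generated):.1%}")
```

Read this way, one in four generated trajectories survives both stages.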

Figure 4: VisionAgent framework for dataset generation.

3.1 VisionAgent for Dataset Construction

As shown in Fig.4, our VisionAgent generates high-quality execution trajectories that capture the complete reasoning process for complex visual creation tasks. VisionAgent with metacognition achieves a 72% task success rate, representing a 30% improvement over the baseline method that relies solely on thinking.

Dual-Agent Architecture. Our framework employs a dual-agent architecture that separates task understanding from execution reasoning: (1) TaskAgent: Serves as the task classifier and router. It analyzes user inputs and performs fine-grained classification across the 21 distinct task types, then selects the appropriate predefined workflow template and tool pool for the specific task category. (2) MetaAgent: Functions as the core reasoning engine with metacognitive capabilities. It receives both the selected workflow and the tool pool as inputs, then executes structured reasoning through four standardized reasoning types defined in metacognition.

Figure 5: VisGenData-4k dataset statistics.

Metacognitive Reasoning Process. The metacognition paradigm defines four reasoning types: a state-assessment phase maintains situational awareness and a todo-list through continuous state evaluation; a planning phase constructs executable task sequences by decomposing objectives and dependencies; the <tool_call> phase invokes appropriate tools based on plan blueprints and analytical reasoning; and a verification phase checks goal completion, forming a closed-loop execution process. This metacognitive reasoning process guides the MetaAgent to generate the UTPC trajectory.

Reference Workflow Integration. We incorporate 15 predefined workflow templates as best-practice guides, ensuring planning remains flexible yet stays on track. These workflows provide domain-specific execution patterns for various visual creation tasks, representing 15 distinct application scenarios from storyboards to animated short films.

3.2 Dataset Composition and Statistics

Fig. 5 shows the statistics of VisGenData-4k, which exhibits the following key features: (1) Diverse Task Types: The dataset encompasses 21 distinct task types (including storyboards, marketing posters, product marketing videos, and animated short films). This diversity is crucial for training agents to handle a broad range of real-world creative demands, enhancing their adaptability and practical applicability. (2) Complex Trajectory Structure: Trajectories have a mean of 15 steps, with 64% exceeding 20 steps. This complexity is crucial for training agents to decompose and plan long-horizon tasks, fostering robust problem-solving capabilities in visual creation. (3) Rich Contextual Information: The substantial token length (mean: 29k, 43% over 32k) trains agents to process and utilize extensive contextual cues, enhancing their capacity for detailed and context-aware generation.

4 Agentic Post-Training

4.1 Agentic Framework

As shown in Fig. 6, VisionCreator is formulated as a unified agent that integrates Understanding, Thinking, Planning, and Creation (UTPC) capabilities to accomplish complex visual generation tasks. Formally, we model the agent as a policy $π_{θ}$ operating over long-horizon multimodal trajectories: $τ = (o_{0}, a_{0}, o_{1}, a_{1}, \dots, o_{T})$ , where $o_{t}$ denotes multimodal observations (textual instructions, intermediate tool feedback, and virtual visual states), and $a_{t}$ denotes agent actions including reasoning tokens, planning steps, and tool invocations. The training process follows a two-stage agentic post-training paradigm: (1) Progressive Specialization Training (PST), which initializes a strong policy prior via supervised learning over expert UTPC trajectories. (2) Virtual Reinforcement Learning (VRL), which further optimizes long-horizon planning and tool-use strategies through large-scale exploration in a simulated environment.
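To make the trajectory notation concrete, the following is a minimal sketch of a container for $τ = (o_{0}, a_{0}, \dots, o_{T})$. The class names, field names, and action kinds are assumptions chosen for illustration; the paper does not specify a data structure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Observation:
    """o_t: a textual instruction or tool feedback, plus optional media handles."""
    text: str
    media_refs: List[str] = field(default_factory=list)  # hypothetical file handles


@dataclass
class Action:
    """a_t: reasoning tokens, a planning step, or a tool invocation."""
    kind: str      # illustrative labels: "reasoning", "plan", "tool_call"
    content: str


@dataclass
class Trajectory:
    """tau = (o_0, a_0, o_1, a_1, ..., o_T): one more observation than actions."""
    observations: List[Observation]
    actions: List[Action] = field(default_factory=list)

    @classmethod
    def start(cls, initial_observation: Observation) -> "Trajectory":
        return cls(observations=[initial_observation])

    def step(self, action: Action, next_observation: Observation) -> None:
        # Each action is followed by the observation it produces.
        self.actions.append(action)
        self.observations.append(next_observation)

    @property
    def horizon(self) -> int:
        """T, the number of actions taken so far."""
        return len(self.actions)

    def is_well_formed(self) -> bool:
        return len(self.observations) == len(self.actions) + 1


traj = Trajectory.start(Observation("Create a three-panel storyboard"))
traj.step(Action("plan", "1) draft panels 2) render each panel"),
          Observation("plan recorded"))
traj.step(Action("tool_call", '{"tool": "text_to_image", "prompt": "panel 1"}'),
          Observation("image returned", ["media_db/img_0001.png"]))
print(traj.horizon, traj.is_well_formed())
```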

4.2 Progressive Specialization Training

The goal of Progressive Specialization Training (PST) is to learn an initial policy $π_{θ_{0}}$ that simultaneously preserves general reasoning competence while acquiring domain-specific visual creation ability, thereby enabling a functional visual content creation agent rather than a narrowly tuned generator. Let the supervised dataset be $D = D_{gen} \cup D_{vis}$ , where $D_{gen}$ contains large-scale general reasoning and tool-use trajectories, and $D_{vis}$ contains expert-curated visual creation trajectories (VisGenData-4k). Standard supervised fine-tuning (SFT) minimizes

$$L_{SFT}(θ) = E_{(o, a) \sim D} \left[ -\log π_{θ}(a \mid o) \right].$$
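As a small numerical illustration of this objective, the sketch below computes the mean negative log-likelihood over a toy batch. The probabilities are made-up values standing in for $π_{θ}(a ∣ o)$ on expert actions, and the function name `sft_loss` is an assumption for this example.

```python
import math
from typing import List


def sft_loss(expert_action_probs: List[float]) -> float:
    """Monte Carlo estimate of E[-log pi_theta(a | o)] over a batch.

    Each entry is the probability the current policy assigns to the
    expert action for one (o, a) pair sampled from the dataset D.
    """
    if not expert_action_probs:
        raise ValueError("batch must contain at least one sample")
    return -sum(math.log(p) for p in expert_action_probs) / len(expert_action_probs)


# Toy batch: the policy is confident on the first pair and unsure on the last.
toy_probs = [0.9, 0.5, 0.25]
print(f"SFT loss on toy batch: {sft_loss(toy_probs):.4f}")
```

The loss is zero only when the policy assigns probability 1 to every expert action, and it grows as the assigned probabilities shrink.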

However, naive single-stage SFT exhibits two fundamental failure modes. Training only on $D_{vis}$ leads to catastrophic forgetting of general reasoning and planning ability, resulting in nearly zero agent competence; empirically, Tab. 4 shows performance dropping to 0.007, indicating the model is unable to function as a visual creation agent. Conversely, one-stage mixed SFT on $D_{gen} \cup D_{vis}$ avoids catastrophic forgetting but yields suboptimal specialization, since the dominance of $D_{gen}$ suppresses learning of visual-creation behaviors and degrades downstream agent performance. These observations reveal a necessary condition for visual agents:

General Competence Preservation + Strong Visual Agent Specialization,

which neither naive SFT strategy can satisfy.

PST resolves this conflict through a controlled two-stage curriculum that induces a gradual distribution shift. In Stage 1 (general foundation learning),

$$D^{(1)} = D_{gen}^{500K} \cup λ D_{vis},$$

establishing robust reasoning, planning, and tool-use capabilities while lightly anchoring the policy to the visual generation agent domain. In Stage 2 (targeted specialization),

$$D^{(2)} = D_{gen}^{200K} \cup λ D_{vis},$$

the increased effective influence of $D_{vis}$ drives specialization toward visual content creation, while continued exposure to $D_{gen}$ prevents catastrophic forgetting. Overall, PST learns a structured initialization

$$π_{θ_{0}} \approx \arg\min_{θ} E_{(o, a) \sim D^{(1)} \to D^{(2)}} \left[ -\log π_{θ}(a \mid o) \right],$$

which constrains downstream reinforcement learning (RL) to a policy region that already satisfies both general competence and visual specialization. Experimental results further validate the necessity of PST. Compared with one-stage SFT, PST achieves substantially stronger performance on visual creation agent tasks, demonstrating that progressive specialization is essential for learning effective UTPC behaviors. Moreover, PST provides a significantly better initialization for RL: the initial reward score before RL training increases from 0.64 (one-stage SFT) to 0.87 (PST), a gain of +0.23. This improved starting point directly translates into optimization efficiency—RL convergence is accelerated by approximately 50%. These findings confirm that PST not only improves final agent capability, but also fundamentally reduces the difficulty of downstream reinforcement learning.
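The "increased effective influence" of $D_{vis}$ in Stage 2 can be illustrated by computing the share of visual-creation samples in each stage mixture. This is a sketch under stated assumptions: $λ$ is read as a sampling weight on $D_{vis}$, its value is not given in the text (1.0 is used here as a placeholder), and $|D_{vis}|$ is taken as the 4k trajectories of VisGenData-4k.

```python
def visual_fraction(n_gen: int, n_vis: int, lam: float) -> float:
    """Share of the training mixture contributed by D_vis when it is weighted by lam."""
    weighted_vis = lam * n_vis
    return weighted_vis / (n_gen + weighted_vis)


N_VIS = 4_000   # VisGenData-4k
LAM = 1.0       # placeholder: the paper does not state the value of lambda

stage1 = visual_fraction(500_000, N_VIS, LAM)   # D_gen^{500K} in Stage 1
stage2 = visual_fraction(200_000, N_VIS, LAM)   # D_gen^{200K} in Stage 2

print(f"Stage 1 visual share: {stage1:.2%}")
print(f"Stage 2 visual share: {stage2:.2%}")
print(f"relative increase: {stage2 / stage1:.2f}x")
```

With the same $λ$ in both stages, shrinking the general pool alone raises the visual share, which is one plausible reading of how the curriculum shifts the distribution gradually.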

Figure 6: Our Native VisionCreator framework.

4.3 Virtual Reinforcement Learning

Building upon the robust foundation established by PST, we refine the model’s UTPC capabilities through Virtual Reinforcement Learning (VRL) based on the GRPO algorithm. To enable scalable long-horizon learning without invoking real-world tools, we first construct a high-fidelity virtual environment VisGenEnv that simulates the behavior of visual creation tools. Within this environment, LtrReward components are designed to supervise agent trajectories and guide both planning and execution. To show that policies learned under these rewards transfer effectively to real-world scenarios, we provide a theoretical analysis of VRL. Building upon these insights, we then introduce a plan-driven reward that integrates planning and execution signals to optimize robust long-horizon visual creation performance.

Figure 7: Comparison of the real environment and our virtual VisGenEnv environment, with an example of using a video generation tool.

4.3.1 Virtual VisGenEnv Environment

To enable scalable long-horizon learning without invoking real-world tools, we first construct a high-fidelity virtual environment called VisGenEnv. This environment serves as a sandbox where the agent can safely explore planning and tool-use strategies, laying the foundation for subsequent reward design and theoretical analysis. VisGenEnv integrates a comprehensive suite of 36 visual creation tools (see Appendix for full list). The core of its design lies in a procedural simulation that accurately replicates the functional logic and behavioral patterns of real tools, including state transitions, parameter validation, and output specifications such as image resolution and video duration. To simulate multimodal outputs, the environment returns media files randomly sampled from a database while ensuring physically correct attributes consistent with tool specifications. This high-fidelity simulation of tool behaviors enables the agent to effectively learn the causal structure of the workflow and master robust planning policies through extensive practice within the virtual setting.
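A simulated tool of the kind described above might look like the following minimal sketch. Everything concrete here is hypothetical: the tool name, the allowed durations, the file paths, and the result format are placeholders, since the paper lists the 36 tools only in its appendix.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass(frozen=True)
class MediaFile:
    path: str
    duration_s: float


# Hypothetical media database of pre-rendered clips.
MEDIA_DB: List[MediaFile] = [
    MediaFile("media_db/video_0001.mp4", 5.0),
    MediaFile("media_db/video_0002.mp4", 5.0),
    MediaFile("media_db/video_0003.mp4", 10.0),
]

# Hypothetical output specification of the simulated video tool.
ALLOWED_DURATIONS_S: Tuple[float, ...] = (5.0, 10.0)

ToolResult = Dict[str, Union[str, float]]


def virtual_video_tool(prompt: str, duration_s: float, rng: random.Random) -> ToolResult:
    """Simulate a video generation call without running a generative model."""
    # Parameter validation mirrors the error behaviour a real tool would show.
    if not prompt.strip():
        return {"status": "error", "message": "prompt must be non-empty"}
    if duration_s not in ALLOWED_DURATIONS_S:
        return {"status": "error", "message": f"unsupported duration: {duration_s}"}
    # The content is random, but the physical attribute (duration) is correct.
    matching = [m for m in MEDIA_DB if m.duration_s == duration_s]
    sample = rng.choice(matching)
    return {"status": "ok", "path": sample.path, "duration_s": sample.duration_s}


rng = random.Random(0)
print(virtual_video_tool("a cat surfing", 5.0, rng))
print(virtual_video_tool("a cat surfing", 7.0, rng))
```

The returned clip ignores the prompt entirely, which matches the stated design: the agent learns workflow structure from tool behavior rather than from visual content.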

Training agent models with reinforcement learning in the real environment is prohibitively expensive. As illustrated in Fig. 7, supporting a training batch size of 24 with 4 rollouts per sample (i.e., 96 concurrent rollouts in total) quickly becomes computationally intractable. Video tools are particularly costly: each instance requires 8 GPUs and roughly 30 seconds per video, meaning 96 concurrent rollouts would require $8 \times 96 = 768$ GPUs. Deploying multiple real image and video generation tools would require several thousand GPUs, whereas our virtual environment VisGenEnv enables long-horizon exploration with only a few GPUs, saving thousands of GPUs.
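The cost estimate can be reproduced directly from the figures in this paragraph. The sketch assumes, as the text does, that every concurrent rollout holds its own video tool instance.

```python
batch_size = 24
rollouts_per_sample = 4
gpus_per_video_tool = 8

# Every sample in the batch is rolled out several times in parallel.
concurrent_rollouts = batch_size * rollouts_per_sample

# Assumption from the text: one video tool instance per concurrent rollout.
gpus_for_video = gpus_per_video_tool * concurrent_rollouts

print(f"concurrent rollouts: {concurrent_rollouts}")
print(f"GPUs for a single real video tool: {gpus_for_video}")
```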

Figure 8: LtrReward Components.

4.3.2 LtrReward Components

With the virtual environment in place, as shown in Fig. 8, we now define LtrReward components $R_{vrt}$ (i.e., virtual reward applicable to VisGenEnv) as reward signals that guide the agent’s learning, which consist of Plan Reward $R_{plan}$ and Fine-grained Reward $R_{fine}$ .

Plan Reward $R_{plan}$ evaluates the overall quality of the task plan using a proposed vPlanJudger, an expert-informed LLM evaluator that leverages a curated repository of expert reference plans to provide in-context guidance. By performing cross-referenced reasoning between the candidate plan and expert-authored strategies, the vPlanJudger computes a multidimensional alignment score focusing on five key facets: (1) Requirement Fulfillment, a binary check on whether the output’s modality and quantity align with the user request; (2) Logical Coherence, verifying the causal validity of sub-task sequencing; (3) Pragmatic Executability, ensuring each step is grounded within the available toolset or LLM capabilities to avoid hallucinatory actions; (4) Decomposition Atomicity, which evaluates whether the plan is partitioned into actionable atomic tasks; and (5) Expert-Guided Optimality, which rewards task-specific best practices such as identity consistency for multi-shot content, beat-aligned audio-visual synchronization, and the strategic minimization of complexity.

The Fine-grained Reward $R_{fine}$ integrates both rule-based and effect-based signals to ensure structurally valid execution and successful task realization. Specifically: (1) Rule-based components include Format Compliance $R_{format}$ , which validates UTPC structural correctness via parsing of tags, ordering, content, and JSON validity; Tool Invocation $R_{tool}$ , which scores execution success with graded penalties for intermediate or final failures; and Visual Consistency $R_{cons}$ , which rewards appropriate use of reference-based generation when consistency is required. (2) Effect-based components include Result Achievement $R_{result}$ , which verifies output constraints such as image count and video duration within tolerance bounds, and Trajectory Coherence $R_{traj}$ , which evaluates alignment between planning intent and executed actions through an LLM-evaluator. Together, these rewards provide trajectory-level supervision that encourages correct agentic structure, reliable tool usage, and coherent visual creation outcomes.
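As one example of these components, Result Achievement $R_{result}$ could be sketched as a set of constraint checks. The scoring rule (fraction of satisfied constraints) and the default tolerance are assumptions made for illustration; the paper states only that image count and video duration are verified within tolerance bounds.

```python
from typing import List, Optional


def result_reward(
    requested_images: Optional[int],
    produced_images: int,
    requested_duration_s: Optional[float],
    produced_duration_s: float,
    tolerance_s: float = 0.5,   # hypothetical tolerance bound
) -> float:
    """Fraction of the requested output constraints that the result satisfies."""
    checks: List[bool] = []
    if requested_images is not None:
        # Image count must match exactly.
        checks.append(produced_images == requested_images)
    if requested_duration_s is not None:
        # Video duration may deviate by at most the tolerance.
        checks.append(abs(produced_duration_s - requested_duration_s) <= tolerance_s)
    if not checks:
        return 1.0  # nothing was constrained, so nothing can be violated
    return sum(checks) / len(checks)


print(result_reward(4, 4, 10.0, 10.3))   # both constraints satisfied
print(result_reward(4, 3, 10.0, 10.3))   # image count is off
```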

4.3.3 Theoretical Foundations of Virtual Reinforcement Learning

Based on the constructed virtual environment and the LtrReward components, we provide a theoretical analysis to explain the effectiveness of VRL when transferred to real-world execution. The theoretical legitimacy of VRL rests on its ability to maintain policy efficacy despite the intrinsic discrepancies between virtual simulation and real-world execution. Specifically, VRL operates under a Rollout Gap, where the agent lacks real visual feedback to rectify its trajectory, and an Objective Inconsistency, caused by substituting the vision reward $R_{vision}$ (which measures perceptual quality across multiple visual dimensions) with a structural proxy $R_{result}$ . To evaluate how these discrepancies affect policy transfer, we model the sim-to-real transition as a function of four synergistic variables: (i) Tool Capability ( $C_{tool}$ ), quantifying the reliability of the generative engine; (ii) PST Prior ( $π_{pst}$ ), anchoring the agent’s initial reasoning within a distribution derived from real expert data; (iii) Plan Sufficiency ( $Φ_{plan}$ ), measuring the causal link between logical correctness and visual quality; and (iv) Result Reward ( $R_{result}$ ), ensuring the structural completion of tasks.

The following theorems establish the mathematical foundation of VRL: Theorem 4.1 provides an error bound guarantee, proving that the sim-to-real gap remains controllable under the joint constraint of these variables; Theorem 4.2 characterizes the real-world performance gain as a competition between Causal Improvement and Transfer Loss, showing that VRL yields non-negative improvement whenever the causal reward gain dominates the bounded sim-to-real error.

Theorem 4.1 (Virtual-to-Real Error Bound).

Let $J_{real}(π)$ and $J_{vrt}(π)$ be the expected returns of policy $π$ in the real and virtual environments, respectively, and let $δ, α, β$ be environment-specific scaling factors. The transfer error $E(π) = |J_{real}(π) - J_{vrt}(π)|$ is bounded by:

$$E(π) \leq \underbrace{δ (1 - C_{tool})}_{\text{Dynamics Gap}} + \underbrace{α \cdot D_{KL}(π_{vrt} \,\|\, π_{pst})}_{\text{Action Bias Bound}} + \underbrace{β (1 - Φ_{plan} \cdot R_{result})}_{\text{Goal Alignment Error}}$$

Theorem 4.1 quantifies how the sim-to-real divergence is suppressed: (i) Dynamics Gap is minimized by $C_{tool}$ , ensuring virtual procedural logic mirrors real API behavior; (ii) Action Bias Bound is constrained by the PST prior, which prevents policy drift in the absence of real visual feedback by maintaining consistency with expert decision-making; (iii) Goal Alignment Error is mitigated by the coupling of $Φ_{plan}$ and $R_{result}$ , ensuring the virtual completion objective serves as a reliable proxy for real-world success.

Theorem 4.2 (Real-World Improvement of VRL).

Under the error bound $E$ , the real-world performance gain depends on the dominance of Causal Improvement over Transfer Loss:

$$J_{real}(π_{VRL}) - J_{real}(π_{pst}) \geq \underbrace{Γ \cdot E_{π}[Δ R_{vrt}]}_{\text{Causal Improvement}} - \underbrace{E(π)}_{\text{Transfer Loss}}$$

where $Γ = C_{tool} \cdot Φ_{plan} \cdot κ (π_{pst})$ is the effectiveness coefficient, and $κ (π_{pst})$ denotes the anchoring strength of the PST prior in constraining policy exploration. The virtual reward $R_{vrt}$ consists of $R_{plan}$ and $R_{fine}$, and $E_{π} [Δ R_{vrt}]$ denotes the expected increment of the virtual reward, representing the agent’s logic optimization in planning and execution.

The practical transferability of VRL is validated by the convergence behavior in our experiments, where the agent achieves an average virtual reward exceeding 95%. This saturation of total virtual reward $R_{vrt}$ indicates that the Causal Improvement term is maximized, providing a substantial logical buffer to offset transfer discrepancies. By substituting these empirical results into Theorem 4.1, we observe that the Action Bias Bound is strictly suppressed by the PST prior, while the Goal Alignment Error is mitigated by the coupling of $Φ_{plan}$ and $R_{result}$ , remaining stable as the agent masters structural completion. Consequently, the Transfer Loss $E (π)$ is primarily governed by the Dynamics Gap $δ (1 - C_{tool})$ . This reveals a critical insight: VRL efficacy is fundamentally a function of generative tool quality. As $C_{tool}$ increases—meaning the underlying visual creation tools become more reliable and follow procedural logic more closely—the transfer loss diminishes, allowing the massive logical gains from virtual training to translate effectively into superior real-world visual quality. Therefore, we derive the following corollary:

Corollary 4.3 (Fidelity-Anchored Transfer).

Provided the virtual reward $R_{vrt}$ reaches a near-optimal level, the real-world gain of VRL is monotonically non-decreasing with respect to $C_{tool}$ .
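One way to see why the corollary is plausible is to chain the two theorems. The sketch below is an illustrative reading rather than the paper's proof: it assumes that $Φ_{plan}$, $κ (π_{pst})$, $δ$, and $E_{π} [Δ R_{vrt}]$ are non-negative and do not themselves depend on $C_{tool}$, and it introduces the shorthand $G(C_{tool})$ (not in the original) for the resulting lower bound. Substituting the bound of Theorem 4.1 for the Transfer Loss in Theorem 4.2 gives

$$J_{real}(π_{VRL}) - J_{real}(π_{pst}) \geq G(C_{tool}) := C_{tool} \, Φ_{plan} \, κ(π_{pst}) \, E_{π}[Δ R_{vrt}] - δ (1 - C_{tool}) - α \cdot D_{KL}(π_{vrt} \,\|\, π_{pst}) - β (1 - Φ_{plan} \cdot R_{result})$$

and differentiating with respect to $C_{tool}$ yields

$$\frac{\partial G}{\partial C_{tool}} = Φ_{plan} \, κ(π_{pst}) \, E_{π}[Δ R_{vrt}] + δ \geq 0.$$

Under these assumptions the guaranteed gain cannot shrink as tool capability improves. Strictly speaking, this shows monotonicity of the lower bound rather than of the gain itself.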

4.3.4 Plan-Driven Reward Design

Theorems 4.1 and 4.2 indicate that real-world improvement critically depends on planning quality. Motivated by this insight, we adopt a plan-driven reward that enforces causal dependency between planning and execution:

$$R_{vrt} = R_{plan} \times R_{fine} \left[ R_{tool} + R_{format} + R_{result} + R_{traj} + R_{cons} \right].$$

Here, $R_{plan}$ measures plan correctness, while $R_{fine}$ captures execution-level structural validity. The multiplicative coupling ensures that execution alone cannot achieve high reward without a valid plan, and maximal reward is obtained only when a correct plan is faithfully executed. This mechanism directly aligns with Theorem 4.2, promoting robust long-horizon planning and tool-use strategies within virtual training.
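The effect of the multiplicative coupling can be sketched as follows. The aggregation inside $R_{fine}$ is an assumption: the equation lists the five components as a sum, and this example averages them so that every score stays in $[0, 1]$; any component weights used in the paper are not given in the text.

```python
from typing import Dict

FINE_COMPONENTS = ("tool", "format", "result", "traj", "cons")


def fine_reward(components: Dict[str, float]) -> float:
    """Aggregate the fine-grained components (assumed here: unweighted mean)."""
    return sum(components[name] for name in FINE_COMPONENTS) / len(FINE_COMPONENTS)


def virtual_reward(r_plan: float, components: Dict[str, float]) -> float:
    """R_vrt = R_plan * R_fine: execution quality is gated by plan quality."""
    return r_plan * fine_reward(components)


perfect_execution = {name: 1.0 for name in FINE_COMPONENTS}

# A flawless execution of an invalid plan earns nothing.
print(virtual_reward(0.0, perfect_execution))
# A mediocre plan caps the reward even when execution is flawless.
print(virtual_reward(0.5, perfect_execution))
# Only a correct plan that is faithfully executed reaches the maximum.
print(virtual_reward(1.0, perfect_execution))
```

An additive combination would let a well-formatted but badly planned trajectory still collect most of the reward, which is the failure mode the product is meant to rule out.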

5 Experiment

5.1 VisGenBench

The existing video generation benchmark VBench-2.0 ²⁸ has made significant contributions to evaluating the quality of individually generated videos, but it cannot evaluate multi-step visual creation trajectories that involve complex tool invocation and long-horizon planning. While ComfyBench ⁹ attempts to assess multi-step trajectories, it is specifically designed for ComfyUI ²⁹ and evaluates agent performance based solely on ComfyUI execution success, making it unsuitable for general API-based tool invocation scenarios. To address this critical gap, we introduce VisGenBench, a comprehensive benchmark designed for evaluating visual generation agentic models that operate through multi-step tool invocation to accomplish complex image and video creation tasks.

Table 1: Test dataset composition of VisGenBench, with 400 image tasks and 800 video tasks.

Type	Content Creative	Content Match	Object Consistency	Scene Consistency	Style Consistency	Variety	Visual Amount	Video Duration	Video Storyboard
Image Tasks	50	50	50	50	50	50	100	–	–
Video Tasks	50	50	50	50	50	50	100	200	200
Total	100	100	100	100	100	100	200	200	200

5.1.1 Test Dataset Composition

As shown in Tab. 1, VisGenBench consists of 1.2k test samples in total, including 400 image-generation tasks and 800 video-generation tasks. Each task is designed to reflect a multi-step creation trajectory that requires generating multiple images and videos. The benchmark spans 10 evaluation dimensions and covers 35+ real-world application scenarios, encompassing domains such as advertising, storytelling, entertainment, and animation.

5.1.2 Evaluation Framework

The VisGenBench evaluation framework integrates both objective and subjective assessments to measure an agent’s ability to perform multi-step visual generation tasks.

Objective Evaluation Objective evaluation focuses on quantifiable and automatically measurable aspects of the generated content. It consists of two components: (1) Success Rate: Measures whether the model returns valid images/videos when requested by the user. A generation containing the correct modality is counted as a Success. (2) Basic Visual Attributes: Quantitative evaluation of the generated results, including visual quantity, video storyboard count, and video duration. These attributes are automatically assessed using standardized tools.
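The Success Rate criterion can be sketched as follows. The data layout (a set of requested modalities and a set of valid returned modalities per sample) is an assumption of this sketch, as is the choice to ignore extra outputs beyond those requested.

```python
from typing import Iterable, Set, Tuple


def is_success(requested: Set[str], returned: Set[str]) -> bool:
    """True when every requested modality appears among the valid outputs."""
    return requested.issubset(returned)


def success_rate(samples: Iterable[Tuple[Set[str], Set[str]]]) -> float:
    """Fraction of samples whose outputs contain the requested modalities."""
    outcomes = [is_success(requested, returned) for requested, returned in samples]
    return sum(outcomes) / len(outcomes)


# Hypothetical samples: (requested modalities, valid modalities returned).
samples = [
    ({"image"}, {"image"}),           # image requested, image returned
    ({"video"}, {"image"}),           # wrong modality returned
    ({"video"}, {"video", "image"}),  # extra outputs are ignored here
    ({"image"}, set()),               # nothing valid returned
]
rate = success_rate(samples)
```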

Subjective Evaluation Subjective aspects such as visual consistency, diversity, storytelling quality, and audio perception cannot be fully captured by traditional metrics. We therefore introduce a VLM-Grader with pre-defined fine-grained scoring rubrics, implemented using the Gemini2.5-Pro model. For each subjective evaluation dimension, we define a tailored meta-evaluation list, i.e., a structured rubric containing detailed scoring items (e.g., character consistency, style coherence, narrative flow, audio synchronization, etc.). Gemini2.5-Pro assigns a meta-evaluation score to each meta-item, and the aggregated score forms the overall result for that dimension. To align automated scoring with human judgment, we calibrate Gemini2.5-Pro’s meta-evaluation intensity on VisGenBench, so that both the mean scores and the relative rankings it produces remain consistent with expert human assessments.

Table 2: Comparisons on VisGenBench by VLM Evaluation. S-Rate: Success Rate, O-Score: Overall Score. The best and second-best results are highlighted.

Method	Creative	Match	Object	Scene	Style	Variety	Amount	Duration	Storyboard	S-Rate	O-Score
GPT-5	0.683	0.641	0.593	0.579	0.638	0.232	0.620	0.263	0.660	0.863	0.577
Gemini2.5-Pro	0.777	0.802	0.625	0.602	0.573	0.345	0.540	0.376	0.700	0.933	0.627
Qwen3-VL-8B-Tk	0.104	0.078	0.100	0.065	0.109	0.014	0.160	0.034	0.040	0.142	0.085
VisionCreator-8B	0.651	0.661	0.645	0.638	0.595	0.211	0.480	0.429	0.580	0.925	0.581

Table 3: Comparisons on VisGenBench by Human Evaluation. All models use the new version system prompt, which differs from Tab.2. Overall Score = (Success Rate of Image $\times$ Human Evaluation of Image $+$ Success Rate of Video $\times$ Human Evaluation of Video) $/$ 2. The performance comparisons of all detailed human evaluation dimensions are shown in Fig. 1.

Model	Success Rate (Image)	Success Rate (Video)	Human Evaluation (Image)	Human Evaluation (Video)	Overall Score
GPT-5	95.95%	93.00%	3.52	3.25	3.19
Gemini2.5-Pro	91.00%	84.00%	3.53	3.35	3.01
Qwen3-VL-32B-Thinking	97.00%	93.00%	3.47	3.23	3.18
Qwen3-VL-32B-RL	91.00%	87.00%	3.51	3.40	3.07
Qwen3-VL-32B-SFT	96.00%	94.00%	3.53	3.37	3.27
VisionCreator-32B	99.00%	96.00%	3.53	3.49	3.42
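The Overall Score formula in the caption of Tab. 3 can be reproduced directly from the table entries. The sketch below is a small worked example; the function name is illustrative.

```python
def overall_score(sr_image: float, sr_video: float,
                  he_image: float, he_video: float) -> float:
    """Overall Score = (SR_image * HE_image + SR_video * HE_video) / 2."""
    return (sr_image * he_image + sr_video * he_video) / 2


# Success rates (as fractions) and human evaluation scores from Tab. 3.
vision_creator_32b = overall_score(0.99, 0.96, 3.53, 3.49)
gemini_25_pro = overall_score(0.91, 0.84, 3.53, 3.35)

print(round(vision_creator_32b, 2))  # 3.42
print(round(gemini_25_pro, 2))       # 3.01
```

Because each human score is weighted by the corresponding success rate, a model that fails more often is penalized even if its successful outputs are rated highly.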

5.2 Results on VisGenBench by VLM Evaluation

As shown in Tab. 2, VisionCreator-8B is competitive with much larger commercial models, reaching an Overall Score of 0.581 against 0.577 for GPT-5 and 0.627 for Gemini2.5-Pro, while far outperforming its base model Qwen3-VL-8B-Thinking (0.085). The key findings are: (1) Success Rate and Reliability: VisionCreator-8B achieves a success rate of 0.925, surpassing GPT-5 (0.863) and approaching Gemini2.5-Pro (0.933). This demonstrates the effectiveness of our UTPC framework in ensuring reliable task completion, a crucial requirement for practical visual creation applications. (2) Consistency Performance: VisionCreator-8B achieves the highest scores in object consistency (0.645) and scene consistency (0.638) among all compared models, including the much larger Gemini2.5-Pro and GPT-5. This supports the model’s ability to maintain visual coherence throughout multi-step creation processes, a core benefit of the native agentic architecture. (3) Overall Implication: The results support our core hypothesis that a specialized native visual creation agent, even with significantly fewer parameters, can be competitive with general-purpose commercial models through targeted architectural design and training methodology. VisionCreator’s particular strengths in success rate and consistency metrics underscore the practical advantages of our UTPC framework for real-world visual content creation applications.

Table 4: Ablation study with VisionCreator-8B on VisGenBench-104 comparing different training strategies. VisGenBench-104 is a sampled subset of VisGenBench. Model configurations: RL1: PST + Result+Format reward; RL2: PST + Plan×(Result+Format) reward; RL3: Qwen3-VL + Plan×(Result+Format) reward; RL4: PST + Plan×Fine reward; v1: 3×VisGenData-4k; v2: 3×VisGenData-4k + General-1M; v3: 20×VisGenData-4k + General-1M; v4: PST + 3×VisGenData-4k + General-1%; v5: PST + 3×VisGenData-4k + General-5%; v6: PST + 3×VisGenData-4k + General-10%; v7: PST + 3×VisGenData-4k + General-20%.

Method	Creative	Match	Object	Scene	Style	Variety	Amount	Duration	Storyboard	S-Rate	O-Score
RL1	0.534	0.817	0.694	0.547	0.579	0.249	1.000	0.397	0.625	0.904	0.634
RL2	0.579	0.808	0.677	0.479	0.558	0.265	0.800	0.478	0.875	0.942	0.644
RL3	0.671	0.674	0.621	0.622	0.555	0.217	0.800	0.513	0.750	0.885	0.631
RL4	0.573	0.794	0.672	0.696	0.569	0.150	1.000	0.534	0.625	0.925	0.654
v1	0.000	0.050	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.019	0.007
v2	0.230	0.334	0.382	0.339	0.396	0.163	0.600	0.134	0.500	0.490	0.357
v3	0.262	0.422	0.300	0.260	0.473	0.068	0.600	0.100	0.250	0.481	0.322
v4	0.283	0.468	0.399	0.295	0.318	0.000	0.600	0.183	0.000	0.442	0.299
v5	0.266	0.361	0.366	0.246	0.201	0.084	0.200	0.029	0.125	0.490	0.237
v6	0.239	0.310	0.326	0.194	0.273	0.149	0.600	0.098	0.125	0.413	0.273
v7	0.420	0.701	0.430	0.447	0.430	0.028	0.600	0.344	0.375	0.625	0.440

5.3 Results on VisGenBench by Human Evaluation

In addition to automated VLM-based evaluation (Tab. 2), we conduct a thorough human evaluation to assess the perceptual quality of multi-step visual creation tasks, including images and videos (Tab. 3), which shows that: (1) Overall Findings: VisionCreator-32B achieves the highest Overall Score of 3.42, surpassing both GPT-5 (3.19) and Gemini2.5-Pro (3.01). This indicates that our UTPC framework not only ensures task success in an automated setting but also delivers outputs that are qualitatively preferred by human evaluators. (2) Image vs. Video Performance: VisionCreator-32B excels across both modalities, with 99% image success and 96% video success, accompanied by strong human evaluation scores (3.53 for images, 3.49 for videos). This balanced performance highlights the model’s capability to maintain coherent multi-step planning and execution for both static and dynamic content. (3) Implications: The human evaluation corroborates trends observed in VLM-based metrics, validating that the model’s planning-driven reward design and VRL training not only improve automated success metrics but also enhance perceptual quality, consistency, and user satisfaction in real-world multi-step visual creation tasks.

5.4 Ablation Studies

We conduct ablation studies on the sampled VisGenBench-104 subset. Key findings from Tab. 4 (Overall Score unless stated otherwise) are: (1) Effectiveness of PST. PST with v7 (PST + 3×VisGenData-4k + General-20%) achieves a significant improvement over SFT with v2 (3×VisGenData-4k + General-1M) (0.440 vs. 0.357). Among the PST variants, the largest general-data ratio performs best (v7: 0.440, versus 0.299, 0.237, and 0.273 for v4, v5, and v6), suggesting that balanced specialization prevents overfitting while maintaining generalization. (2) Data Configuration Strategies. Simply increasing the scale of specialized data does not guarantee improvement. v3 (20×VisGenData-4k + General-1M) underperforms v2 (3×VisGenData-4k + General-1M) (0.322 vs. 0.357), indicating that excessive data repetition causes overfitting. Our PST strategy achieves better performance through optimized data ratios. (3) Virtual Reinforcement Learning. All VRL models substantially outperform the SFT variants. RL4 (PST + Plan×Fine reward) improves Overall Score by about 49% relative to the best PST model v7 (0.654 vs. 0.440), demonstrating VRL’s effectiveness. (4) Reward Function Designs. Building upon RL1, RL2 (PST + Plan×(Result+Format) reward) incorporates an additional plan reward and achieves a higher Success Rate (0.942 vs. 0.904) and Overall Score (0.644 vs. 0.634). RL4 achieves the best Overall Score (0.654) with strong performance across multiple dimensions, indicating that fine-grained rewards enhance model capability. (5) Importance of Pre-training Foundation. RL2 (PST + Plan×(Result+Format) reward) outperforms RL3 (Qwen3-VL + Plan×(Result+Format) reward) (0.644 vs. 0.631) despite identical rewards, with a notably higher Success Rate (0.942 vs. 0.885), which indicates that PST provides a stronger foundation for RL training.
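The relative gains quoted in this discussion can be recomputed from the O-Score column of Tab. 4. The short sketch below does so for two of the comparisons; the helper name is illustrative.

```python
# O-Score values copied from Tab. 4.
o_score = {"v2": 0.357, "v3": 0.322, "v7": 0.440, "RL2": 0.644, "RL3": 0.631, "RL4": 0.654}


def relative_gain(new: float, base: float) -> float:
    """Relative improvement of `new` over `base`, in percent."""
    return (new - base) / base * 100.0


vrl_over_pst = relative_gain(o_score["RL4"], o_score["v7"])  # VRL vs. best PST model
pst_over_sft = relative_gain(o_score["v7"], o_score["v2"])   # PST vs. SFT baseline

print(round(vrl_over_pst, 1))  # 48.6
print(round(pst_over_sft, 1))  # 23.2
```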

Figure 9: Visualization comparisons of consistency.

6 Conclusion

We present VisionCreator, a native visual-generation agent that unifies Understanding, Thinking, Planning, and Creation (UTPC) in an end-to-end framework. Our contributions include: (1) VisGenData-4k with UTPC structures via metacognition-based VisionAgent; (2) Progressive Specialization Training and Virtual Reinforcement Learning for stable capability acquisition; (3) VisGenBench for multi-step visual creation evaluation. Experiments show VisionCreator outperforms larger closed-source models, validating our approach. This work establishes a foundation for visual-generation agentic systems and autonomous creative content generation.

Detailed Theoretical Derivations of VRL Theorems

This appendix provides detailed mathematical derivations and proofs for the two VRL theorems presented in the main text. The derivation is divided into three stages: formal modeling and definitions, derivation of the error upper bound (Theorem 4.1), and analysis of performance improvement (Theorem 4.2).

Stage 1: Formal Modeling and Definitions

We first formalize the agent’s policy, environment, and rewards to establish the foundation for subsequent derivations.

1.1 Formalization of Environment and Policy

Definition A.1 (MDP Tuple): The real-world task is modeled as a Markov Decision Process (MDP), denoted as $M_{real} = (S, A, P_{real}, R_{vision}, ρ_{0}, γ)$ .

$S$ : State space, containing multimodal observations $o_{t}$ (textual instructions, tool feedback, virtual visual states).
$A$ : Action space, containing reasoning tokens, planning steps, and tool invocations.
$P_{real} (s^{'} ∣ s, a)$ : Dynamic transition probability of the real environment.
$R_{vision} (s, a, s^{'})$ : Real reward function, measuring the perceptual quality of generated content (e.g., aesthetics, alignment).
$ρ_{0}$ : Initial state distribution.
$γ \in (0, 1)$ : Discount factor.

Definition A.2 (Virtual Environment): The virtual environment is $M_{vrt} = (S, A, P_{vrt}, R_{vrt}, ρ_{0}, γ)$ . Its core differences are:

$P_{vrt} (s^{'} ∣ s, a)$ : Tool dynamics simulated by VisGenEnv, with fidelity quantified by the tool capability $C_{tool} \in [0, 1]$ .
$R_{vrt}$ : Virtual reward function, composed of $R_{plan}$ and $R_{fine}$ according to the plan-driven reward design. It is a structural proxy reward that substitutes for the computationally infeasible $R_{vision}$ in the virtual environment.

Definition A.3 (Policy and Return): Let $π$ be a policy (mapping from states to actions). The expected discounted return of policy $π$ in environment $M$ is defined as:

$$J(π; M) = \mathbb{E}_{τ \sim (π, M)} \left[ \sum_{t=0}^{\infty} γ^{t} R(s_{t}, a_{t}, s_{t+1}) \right].$$

Here, the trajectory $τ = (s_{0}, a_{0}, s_{1}, a_{1}, \dots)$ is generated by $s_{0} \sim ρ_{0}$ , $a_{t} \sim π (\cdot ∣ s_{t})$ , and $s_{t + 1} \sim P (\cdot ∣ s_{t}, a_{t})$ . For brevity, we denote $J_{real} (π) = J (π; M_{real})$ and $J_{vrt} (π) = J (π; M_{vrt})$ .
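In practice, $J (π; M)$ can only be estimated from sampled, finite trajectories. The following sketch shows a Monte Carlo estimate under that truncation; the reward sequences are made-up numbers and the function names are illustrative.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum of gamma**t * r_t over one finite trajectory, accumulated backwards."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def estimate_return(trajectories: Sequence[Sequence[float]], gamma: float) -> float:
    """Monte Carlo estimate of J: average discounted return over sampled trajectories."""
    returns = [discounted_return(rewards, gamma) for rewards in trajectories]
    return sum(returns) / len(returns)


# Two hypothetical reward sequences sampled from the same policy and environment.
sampled = [[1.0, 1.0, 1.0], [0.0, 0.0, 4.0]]
j_hat = estimate_return(sampled, gamma=0.5)
print(j_hat)  # 1.375
```

Running the same estimator on trajectories from $M_{real}$ and from $M_{vrt}$ gives empirical counterparts of $J_{real} (π)$ and $J_{vrt} (π)$.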

1.2 Key Variables and Core Assumptions

Definition A.4 (Key Variables):

Tool capability $C_{tool}$ : Measures how well the virtual environment dynamics $P_{vrt}$ approximate the real dynamics $P_{real}$ . $C_{tool} = 1$ indicates perfect simulation.
PST prior $π_{pst}$ : The initialization policy obtained through Progressive Specialization Training (PST). Its behavioral distribution on real expert data is denoted as $d_{pst} (s, a)$ .
Plan sufficiency $Φ_{plan} \in [0, 1]$ : Measures the strength of the causal link between a “logically correct” plan and the final “high-quality visual output”.
Result reward $R_{result} \in [0, 1]$ : A subcomponent of $R_{fine}$ that evaluates whether the task is structurally completed (e.g., number of images, video duration).

Assumption A.1 (Dynamic Difference Upper Bound): There exists a constant $δ > 0$ related to environment complexity such that for all state-action pairs $(s, a)$ ,

$$\| P_{real}(\cdot \mid s, a) - P_{vrt}(\cdot \mid s, a) \|_{1} \leq δ (1 - C_{tool}).$$

This assumption stems from the high-fidelity simulation design of VisGenEnv: higher tool capability $C_{tool}$ leads to smaller differences between virtual and real transitions.
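For a finite next-state space, Assumption A.1 can be checked numerically for a single state-action pair. The sketch below is illustrative only: the two distributions and the values of $δ$ and $C_{tool}$ are made-up numbers, not measurements from VisGenEnv.

```python
from typing import Sequence


def l1_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """L1 distance between two distributions over the same finite next-state set."""
    return sum(abs(a - b) for a, b in zip(p, q))


def satisfies_assumption_a1(p_real: Sequence[float], p_vrt: Sequence[float],
                            delta: float, c_tool: float) -> bool:
    """Check ||P_real - P_vrt||_1 against the bound delta * (1 - C_tool)."""
    bound = delta * (1.0 - c_tool)
    return bound >= l1_distance(p_real, p_vrt)


# Hypothetical next-state distributions for one (s, a) pair.
p_real = [0.7, 0.2, 0.1]
p_vrt = [0.6, 0.3, 0.1]

gap = l1_distance(p_real, p_vrt)                                   # about 0.2
ok_loose = satisfies_assumption_a1(p_real, p_vrt, 2.0, c_tool=0.80)  # bound 0.4
ok_tight = satisfies_assumption_a1(p_real, p_vrt, 2.0, c_tool=0.95)  # bound 0.1
```

The example uses $δ = 2$ because the L1 distance between two probability distributions never exceeds 2. A claimed tool capability of 0.95 would then be too optimistic for this pair, while 0.80 is compatible with the observed gap.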

Assumption A.2 (Reward Proxy Error): The relationship between the real visual reward $R_{vision}$ and the proxy reward is modulated by plan sufficiency $Φ_{plan}$ and result reward $R_{result}$ . There exists a constant $β > 0$ such that for meaningful trajectories (i.e., when planning logic is correct), the reward difference satisfies:

$$| R_{vision}(s, a, s') - Φ_{plan} \cdot R_{result}(s, a, s') | \leq β (1 - Φ_{plan} \cdot R_{result}).$$

This assumption reflects the design philosophy of LtrReward: when planning is sufficient ( $Φ_{plan} \approx 1$ ) and the task is perfectly completed structurally ( $R_{result} \approx 1$ ), the real visual quality also tends to be high.

Assumption A.3 (KL Constraint on Policy Deviation): The policy $π_{vrt}$ trained in the virtual environment differs from the PST prior $π_{pst}$ in its state-action distribution. This difference is measured by the KL divergence $D_{KL} (π_{vrt} \| π_{pst})$, and its impact on the return difference is linearly bounded: there exists a constant $α > 0$ such that the associated return difference is at most $α \cdot D_{KL} (π_{vrt} \| π_{pst})$.

Definition A.5 (Sim-to-Real Error): For a given policy $π$ , its sim-to-real error is defined as:

$$E(π) = | J_{real}(π) - J_{vrt}(π) |.$$

Stage 2: Derivation of Theorem 4.1 (Virtual-to-Real Error Upper Bound)

Theorem 4.1 (Virtual-to-Real Error Upper Bound), Restated: Under Assumptions A.1, A.2, and A.3, for any policy $π$ (trained in the virtual environment, denoted as $π_{vrt}$), its sim-to-real error $E (π)$ satisfies:

$$E(π) \leq \underbrace{δ (1 - C_{tool})}_{\text{Dynamics Gap}} + \underbrace{α \cdot D_{KL}(π_{vrt} \| π_{pst})}_{\text{Action Bias Bound}} + \underbrace{β (1 - Φ_{plan} \cdot R_{result})}_{\text{Goal Alignment Error}}.$$
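To see how the three terms of the bound above trade off, the following sketch evaluates each one separately. All numeric values are hypothetical placeholders chosen for easy arithmetic; none of them are constants estimated in this work.

```python
from typing import NamedTuple


class ErrorBound(NamedTuple):
    dynamics_gap: float
    action_bias: float
    goal_alignment: float

    @property
    def total(self) -> float:
        return self.dynamics_gap + self.action_bias + self.goal_alignment


def sim_to_real_bound(delta: float, c_tool: float, alpha: float, kl: float,
                      beta: float, phi_plan: float, r_result: float) -> ErrorBound:
    """Evaluate the three terms of the virtual-to-real error upper bound."""
    return ErrorBound(
        dynamics_gap=delta * (1.0 - c_tool),
        action_bias=alpha * kl,
        goal_alignment=beta * (1.0 - phi_plan * r_result),
    )


# Hypothetical constants and measurements.
bound = sim_to_real_bound(delta=1.0, c_tool=0.75, alpha=0.5, kl=0.5,
                          beta=2.0, phi_plan=0.5, r_result=0.5)
print(bound.dynamics_gap, bound.action_bias, bound.goal_alignment, bound.total)
# 0.25 0.25 1.5 2.0
```

In this made-up setting the goal alignment term dominates, so raising plan sufficiency or structural completion would tighten the bound more than improving tool fidelity.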

Proof:

We decompose the total error into three separately bounded components via the triangle inequality and constrain each using the above assumptions.

Step 2.1: Decompose Total Error Consider an intermediate environment $M_{hybrid} = (S, A, P_{vrt}, R_{vision}, ρ_{0}, γ)$, which uses the virtual environment dynamics $P_{vrt}$ but retains the real reward $R_{vision}$. Denote $J_{hybrid} (π) = J (π; M_{hybrid})$. Then:

$$\begin{aligned}
E(π) &= | J_{real}(π) - J_{vrt}(π) | \\
&\leq \underbrace{| J_{real}(π) - J_{hybrid}(π) |}_{\text{Term I}} + \underbrace{| J_{hybrid}(π) - J_{vrt}^{ideal}(π) |}_{\text{Term II}} + \underbrace{| J_{vrt}^{ideal}(π) - J_{vrt}(π) |}_{\text{Term III}}.
\end{aligned}$$

Here $J_{vrt}^{ideal} (π)$ represents the ideal return under dynamics $P_{vrt}$ and reward $R_{vrt}$ with the policy perfectly constrained by the PST prior (no deviation). We next upper bound each term.

Step 2.2: Bounding Term I (Dynamics Gap) Term I measures the return difference due to the difference between dynamic models $P_{real}$ and $P_{vrt}$ . According to Assumption A.1 and the Performance Difference Lemma, for any policy $π$ ,

$$| J_{real}(π) - J_{hybrid}(π) | \leq \frac{γ \cdot δ (1 - C_{tool})}{(1 - γ)^{2}} \cdot \max_{s, a} | R_{vision}(s, a) |.$$

Letting $R_{max} = \max_{s, a} | R_{vision} (s, a) |$ and defining $δ' = \frac{γ R_{max}}{(1 - γ)^{2}} δ$, we obtain:

$$\text{Term I} \leq δ' (1 - C_{tool}).$$

In the theorem statement, constant factors are absorbed into $δ$ , so we have Term I $\leq δ (1 - C_{tool})$ .

Step 2.3: Bounding Term II (Reward Gap) Term II measures the difference between using the real reward $R_{vision}$ and using the proxy reward $Φ_{plan} \cdot R_{result}$ (as the core part of $R_{vrt}$ ) under the same dynamics. According to Assumption A.2, for each step in the trajectory, the reward difference is bounded. Applying the Performance Difference Lemma (reward difference part) again yields:

$$| J_{hybrid}(π) - J_{vrt}^{ideal}(π) | \leq \frac{β (1 - Φ_{plan} \cdot R_{result})}{1 - γ}.$$

Define $β^{'} = β / (1 - γ)$ , then Term II $\leq β^{'} (1 - Φ_{plan} \cdot R_{result})$ . In the theorem statement, $β^{'}$ is written as $β$ .
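The factor $1 / (1 - γ)$ comes from summing the per-step bound of Assumption A.2 over the discounted horizon. A sketch of this step, assuming that both returns are taken over the same trajectory distribution, that the virtual reward is represented by its core part $Φ_{plan} \cdot R_{result}$, and that $β (1 - Φ_{plan} \cdot R_{result})$ is treated as a uniform per-step bound:

$$\begin{aligned}
| J_{hybrid}(π) - J_{vrt}^{ideal}(π) | &= \left| \mathbb{E}_{τ} \left[ \sum_{t=0}^{\infty} γ^{t} \left( R_{vision}(s_{t}, a_{t}, s_{t+1}) - Φ_{plan} \cdot R_{result}(s_{t}, a_{t}, s_{t+1}) \right) \right] \right| \\
&\leq \mathbb{E}_{τ} \left[ \sum_{t=0}^{\infty} γ^{t} \left| R_{vision}(s_{t}, a_{t}, s_{t+1}) - Φ_{plan} \cdot R_{result}(s_{t}, a_{t}, s_{t+1}) \right| \right] \\
&\leq β (1 - Φ_{plan} \cdot R_{result}) \sum_{t=0}^{\infty} γ^{t} = \frac{β (1 - Φ_{plan} \cdot R_{result})}{1 - γ}.
\end{aligned}$$

The last equality is the geometric series $\sum_{t=0}^{\infty} γ^{t} = 1 / (1 - γ)$, valid for $γ \in (0, 1)$.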

Step 2.4: Bounding Term III (Policy Bias) Term III measures the return loss due to the deviation of the virtually trained policy $π_{vrt}$ from the ideal PST prior $π_{pst}$ . According to Assumption A.3, there exists a constant $α > 0$ such that:

$$| J_{vrt}^{ideal}(π) - J_{vrt}(π) | \leq α \cdot D_{KL}(π_{vrt} \| π_{pst}).$$

This assumption stems from the “anchoring” effect of the PST prior on the policy exploration space, preventing catastrophic policy drift in the absence of real visual feedback.

Step 2.5: Combining Error Upper Bounds Summing the upper bounds of Term I, II, and III, we obtain:

$$E(π) \leq δ' (1 - C_{tool}) + β' (1 - Φ_{plan} \cdot R_{result}) + α \cdot D_{KL}(π_{vrt} \| π_{pst}).$$

Relabeling constants $δ' \to δ$, $β' \to β$ yields the form in Theorem 4.1. $■$
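Because the relabeling hides the dependence on the discount factor, it can help to compute the absorbed constants explicitly. The sketch below evaluates $δ'$ and $β'$ from Steps 2.2 and 2.3 for two discount factors; the input values are hypothetical.

```python
from typing import Tuple


def absorbed_constants(delta: float, beta: float,
                       gamma: float, r_max: float) -> Tuple[float, float]:
    """Return (delta', beta') with
    delta' = gamma * R_max / (1 - gamma)**2 * delta and beta' = beta / (1 - gamma).
    """
    if gamma >= 1.0 or 0.0 >= gamma:
        raise ValueError("discount factor must lie in (0, 1)")
    delta_prime = gamma * r_max / (1.0 - gamma) ** 2 * delta
    beta_prime = beta / (1.0 - gamma)
    return delta_prime, beta_prime


# Hypothetical raw constants, evaluated for a short and a long effective horizon.
d_short, b_short = absorbed_constants(delta=0.1, beta=0.1, gamma=0.5, r_max=1.0)
d_long, b_long = absorbed_constants(delta=0.1, beta=0.1, gamma=0.9, r_max=1.0)
```

With these inputs, moving from $γ = 0.5$ to $γ = 0.9$ raises $δ'$ from 0.2 to about 9 and $β'$ from 0.2 to about 1, so the dynamics term grows faster with the horizon than the reward term.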

Stage 3: Derivation of Theorem 4.2 (Real-World Performance Improvement Lower Bound)

Theorem 4.2 (Real-World Improvement of VRL), Restated: Let $π_{pst}$ be the initial policy after PST training, and $π_{VRL}$ be the policy optimized through Virtual Reinforcement Learning (VRL). Define the virtual optimization gain as $Δ_{vrt} = J_{vrt} (π_{VRL}) - J_{vrt} (π_{pst}) = \mathbb{E}_{π} [Δ (R_{plan} + R_{fine})]$. Then, under the error bound of Theorem 4.1, the real-world performance improvement satisfies:

$$J_{real}(π_{VRL}) - J_{real}(π_{pst}) \geq \underbrace{Γ \cdot Δ_{vrt}}_{\text{Causal Improvement}} - \underbrace{E(π_{VRL})}_{\text{Transfer Loss}},$$

where $Γ = C_{tool} \cdot Φ_{plan} \cdot κ (π_{pst})$ is the effectiveness coefficient, and $κ (π_{pst}) \in (0, 1]$ denotes the Anchoring Strength of the PST prior in constraining policy exploration.

Proof:

Step 3.1: Establish Inequality Based on Error Bound By Definition A.5, for any policy $π$ we have $J_{vrt} (π) - E (π) \leq J_{real} (π) \leq J_{vrt} (π) + E (π)$, where $E (π)$ is bounded by Theorem 4.1. Applying the lower bound to $π_{VRL}$ and the upper bound to $π_{pst}$:

$$\begin{aligned}
J_{real}(π_{VRL}) &\geq J_{vrt}(π_{VRL}) - E(π_{VRL}), \\
J_{real}(π_{pst}) &\leq J_{vrt}(π_{pst}) + E(π_{pst}).
\end{aligned}$$

Subtracting the second inequality from the first yields:

$$J_{real}(π_{VRL}) - J_{real}(π_{pst}) \geq [J_{vrt}(π_{VRL}) - J_{vrt}(π_{pst})] - [E(π_{VRL}) + E(π_{pst})].$$

Since $π_{pst}$ itself is trained on real expert data, its sim-to-real error $E (π_{pst})$ is expected to be small (aligned during PST). The lower bound on performance improvement is therefore governed mainly by the error $E (π_{VRL})$ of $π_{VRL}$. Treating $E (π_{pst})$ as negligible and keeping $E (π_{VRL})$ as the transfer loss term gives:

$$J_{real}(π_{VRL}) - J_{real}(π_{pst}) \geq Δ_{vrt} - E(π_{VRL}). \tag{1}$$

Step 3.2: Relating Virtual Gain to Real Gain (Causal Improvement) The $Δ_{vrt}$ in inequality (1) is the gain in virtual reward. We need to relate it to real performance improvement. This relies on a core idea: optimizing “planning and execution logic” in the virtual environment, as long as the simulation is sufficiently credible, causally leads to improved real-world visual quality. Define the effectiveness coefficient $Γ$ , which quantifies the expected increment in real reward per unit increment in virtual reward. We model it as the product of three key factors:

$C_{tool}$ : Tool capability determines the probability of logical execution being reproduced in reality.
$Φ_{plan}$ : Plan sufficiency determines the strength of association between correct logic and high-quality output.
$κ (π_{pst})$ : Anchoring strength of the PST prior, indicating the degree to which the policy remains in a “reasonable” distribution region during VRL optimization, with $κ \in (0, 1]$ . Strong anchoring ( $κ \approx 1$ ) ensures the optimization direction remains effective in the real world.

Therefore, we assume a monotonic relationship:

$$J_{real}(π_{VRL}) - J_{real}(π_{pst}) \geq Γ \cdot Δ_{vrt} - E(π_{VRL}), \quad \text{where } Γ = C_{tool} \cdot Φ_{plan} \cdot κ(π_{pst}). \tag{2}$$

When $Γ > 0$ , the logical improvement brought by virtual optimization can be partially translated into real-world improvement.

Step 3.3: Derive the Final Lower Bound Substituting $Δ_{vrt} = \mathbb{E}_{π} [Δ R_{vrt}]$ into inequality (2) yields the lower bound stated in Theorem 4.2:

$$J_{real}(π_{VRL}) - J_{real}(π_{pst}) \geq Γ \cdot \mathbb{E}_{π}[Δ R_{vrt}] - E(π_{VRL}).$$

Step 3.4: Condition for Non-Negative Improvement From the inequality in Theorem 4.2, a sufficient condition for non-negative improvement in real-world performance (i.e., $J_{real} (π_{VRL}) \geq J_{real} (π_{pst})$) follows directly:

$$Γ \cdot \mathbb{E}_{π}[Δ R_{vrt}] \geq E(π_{VRL}).$$

This means that the Causal Improvement brought by virtual training must be sufficient to cover the Transfer Loss arising from simulation imperfections. This does not require $C_{tool} = 1$ or $Φ_{plan} = 1$ ; as long as their product combined with the anchoring strength is large enough to make $Γ$ sufficiently large, and VRL can effectively increase $Δ R_{vrt}$ (as shown in experiments where virtual reward exceeds 95%), positive transfer is guaranteed. $■$
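The sufficient condition can be turned into a simple numeric check. The sketch below computes the effectiveness coefficient, the resulting lower bound, and whether non-negative improvement is guaranteed; every input value is a hypothetical placeholder rather than a quantity measured in the experiments.

```python
def effectiveness(c_tool: float, phi_plan: float, kappa: float) -> float:
    """Effectiveness coefficient: C_tool * Phi_plan * kappa(pi_pst)."""
    return c_tool * phi_plan * kappa


def improvement_lower_bound(coeff: float, delta_vrt: float, transfer_loss: float) -> float:
    """Lower bound on J_real(pi_VRL) - J_real(pi_pst): coeff * delta_vrt - E(pi_VRL)."""
    return coeff * delta_vrt - transfer_loss


def guarantees_non_negative(coeff: float, delta_vrt: float, transfer_loss: float) -> bool:
    """Sufficient condition: causal improvement covers the transfer loss."""
    return improvement_lower_bound(coeff, delta_vrt, transfer_loss) >= 0.0


coeff = effectiveness(c_tool=0.5, phi_plan=0.5, kappa=1.0)              # 0.25
small_loss = improvement_lower_bound(coeff, delta_vrt=2.0, transfer_loss=0.25)
large_loss = improvement_lower_bound(coeff, delta_vrt=2.0, transfer_loss=1.0)
```

Here the same virtual gain yields a positive bound (0.25) when the transfer loss is small and a negative one (-0.5) when it is large. In the second case the condition is not met, which means only that the bound gives no guarantee, not that real performance must degrade.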

Summary

Through formal modeling, this derivation decomposes the challenge of sim-to-real transfer into differences at the dynamics, reward, and policy levels, and quantifies their upper bounds using key variables such as tool capability, plan sufficiency, and the PST prior. Theorem 4.1 shows that the systematic error can be controlled by improving tool fidelity, strengthening PST anchoring, and optimizing plan-result alignment. Theorem 4.2 further shows that as long as virtual training effectively enhances the agent’s logical capabilities (Causal Improvement) and this improvement outweighs the bounded systematic error (Transfer Loss), performance improvement in the real world is guaranteed. This provides a theoretical foundation for applying virtual reinforcement learning to high-dimensional, long-horizon tasks such as visual creation.

Table 5: Human Evaluation of Detailed Dimensions on VisGenBench-Image (Score = Success Rate $\times$ Human Evaluation Score)

Model	Semantic Matching	Style Matching	Emotion Matching	Subject Consistency	Design Integrity	Visual Integrity	Text Quality	Creativity	Overall
GPT-5	3.4883	3.6214	2.9656	3.6024	3.4408	3.4218	2.7565	3.4408	3.3458
Gemini2.5-Pro	3.3943	3.5399	2.8119	3.4034	3.2214	3.2669	2.7300	3.3215	3.2123
Qwen3-VL-32B-Tk	3.4435	3.7248	2.9876	3.5890	3.4047	3.4823	2.8130	3.4726	3.3659
Qwen3-VL-32B-SFT	3.3504	3.8016	2.8896	3.7632	3.4368	3.5040	2.8224	3.5232	3.3888
VisionCreator-32B	3.6432	3.8412	3.1581	3.7620	3.4452	3.6531	2.8809	3.5739	3.4947

Table 6: Human Evaluation of Detailed Dimensions on VisGenBench-Video (Score = Success Rate $\times$ Human Evaluation Score)

Model	Script	Storyboard	Content Consistency	Subject Consistency	Video Effect	Visual Motion
GPT-5	3.1062	2.9202	3.1434	3.1713	3.0597	3.0039
Gemini2.5-Pro	2.9484	2.6796	3.0156	2.856	2.8728	2.6628
Qwen3-VL-32B-Thinking	3.069	2.9016	3.1713	3.162	2.9574	2.9388
Qwen3-VL-32B-SFT	3.6002	2.867	3.5814	3.3652	3.243	2.9328
VisionCreator-32B	3.5616	3.1872	3.5808	3.4176	3.4752	3.2256

Model	Audio-Visual	Music	Dubbing	Subtitle	Transition	Editing	Overall
GPT-5	3.0411	3.2643	2.8644	2.9788	2.8812	2.8392	2.814
Gemini2.5-Pro	2.8056	2.7888	2.8728	2.8812	2.8392	2.562	2.814
Qwen3-VL-32B-Thinking	2.9481	2.9202	3.1341	3.069	2.9295	2.8737	3.0039
Qwen3-VL-32B-SFT	3.1772	3.0644	3.2148	3.0174	3.0268	2.9328	3.1678
VisionCreator-32B	3.3792	3.2928	3.3888	3.3216	3.2352	3.1104	3.3504

Table 7: General-purpose Datasets.

Category	Name	Quantity
NLP	DeepSeek-R1-Distill-110k ¹⁹	110k
	LONGCOT-Refine-500K ²⁶	500k
	alpaca-gpt4-data ²⁴ ²⁵	100k
Multimodal	M3IT ¹⁵	1592k
Tool Calling	function-calling-chatml ⁸	112k
	xlam-function-calling-60k ³⁹	60k
	ms-agent ²¹	600k
	ToolACE ²⁰	11k
	ToolBench ²⁷	123k
	AFM ¹⁷	76k

Table 8: Task Distribution in VisGenData-4k Dataset

Video Generation Tasks		Image Generation Tasks
No.	Task Type	No.	Task Type
1.	Product marketing videos	1.	Product images
2.	Public service advertisements	2.	Detail pages
3.	Corporate promotion videos	3.	Key Visual (KV)
4.	Brand story videos	4.	Landing pages / H5 graphics
5.	Event promotion videos	5.	Complete brand visual identity
6.	Instructional videos	6.	Banner graphics
7.	Popular science documentaries	7.	Official account cover images
8.	Music videos (MV)	8.	Xiaohongshu covers
9.	Concert recordings	9.	Marketing posters
10.	Variety shows	10.	Avatar design
11.	Story videos	11.	Static emoji generation
12.	Video podcasts	12.	ICON design
13.	Picture books	13.	LOGO design
14.	Dynamic comics	14.	Mini-game UI design
15.	Animated short films	15.	Character design
16.	Animated movies	16.	Character action design
17.	Game adaptation films	17.	Scene design
18.	Game videos	18.	Storyboards
19.	Movies	19.	Picture Book
20.	Short dramas	20.	Stylization
21.	Story explanations	21.	Realistic Photography

Table 9: VisGenEnv integrates 36 visual creation tools.

Tool Category	Tool Function	Tool Name
Text-to-Text	Storyboard Text Polishing (Claude)	tool_prompt_refine
	Storyboard Generation (Claude)	tool_video_shot_gen
	Script Tool (Claude)	tool_video_script_gen
	Storyboard Polishing (Claude)	tool_storyboard_polish
	Script & Storyboard Polishing	tool_script_storyboard_merge
Text-to-Video	Text-to-Video (Veo3)	tool_text2video_veo
Text-to-Image	Text-to-Image (nano-banana)	tool_text2image_gemini
	Text-to-Image (hunyuan)	tool_text2image_hunyuan
	Text-to-Image (ByteDance)	tool_text2image_seed
	Text-to-Image (GPT)	tool_text2image_gpt
	Text-to-Image (Qwen)	tool_text2image_qwen
Image-to-Image	Image-to-Image (nano-banana)	tool_image_edit_gemini
	Image-to-Image (Qwen)	tool_image_edit_qwen
	Image-to-Image (GPT)	tool_image_edit_gpt
Image-to-Video	Image-to-Video (Keling)	tool_image2video_keling
	Image-to-Video (Veo3)	tool_image2video_veo3
Audio Generation	Music Generation (Suno)	tool_music_suno
	Video Sound Effect Generation	tool_sound_fx_gen
	TTS Generation	tool_tts_generation
	Video Composition (MoviePy)	tool_video_composite
	Video Clip - MoviePy Post-processing	tool_video_postprocess
	Video Generation Automation Pipeline	tool_video_auto_pipeline
	Beat Detection Tool	tool_beat_detect
	Video Editing (Trim)	tool_video_trim_edit
	Video Speed Change	tool_video_speed_adjust
	TTS + Composition Tool	tool_tts_composite
	Audio Editing	tool_audio_edit_cut
	Add Subtitles	tool_subtitle_add_text
Multimodal Understanding	Video Understanding (Gemini2.5-Pro)	tool_video_analysis
	Audio Understanding (Gemini2.5-Pro)	tool_audio_analysis
	Image Understanding (Gemini2.5-Pro)	tool_image_analysis
Other	Tavily Search - Content Extraction	tool_search_content
	Inspiration Search	tool_search_inspire
	Summary Tool	tool_content_summary
	To-Do List	tool_task_manager
	HTML Generation Tool	tool_html_builder

Figure 10: Visualizations for scene consistency.

Figure 11: Visualizations for object consistency.

Figure 12: Demo page.

References

[1] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
[2] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.
[3] Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024.
[4] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494, 2022.
[5] Bin Lin, Yunyang Ge, Xinhua Cheng, Zongjian Li, Bin Zhu, Shaodong Wang, Xianyi He, Yang Ye, Shenghai Yuan, Liuhan Chen, et al. Open-sora plan: Open-source large video generation model. arXiv preprint arXiv:2412.00131, 2024.
[6] Weijia Wu, Zeyu Zhu, and Mike Zheng Shou. Automated movie generation via multi-agent cot planning. arXiv preprint arXiv:2503.07314, 2025.
[7] Junfei Xiao, Ceyuan Yang, Lvmin Zhang, Shengqu Cai, Yang Zhao, Yuwei Guo, Gordon Wetzstein, Maneesh Agrawala, Alan Yuille, and Lu Jiang. Captain cinema: Towards short movie generation. arXiv preprint arXiv:2507.18634, 2025.
[8] Xuenan Xu, Jiahao Mei, Chenliang Li, Yuning Wu, Ming Yan, Shaopeng Lai, Ji Zhang, and Mengyue Wu. Mm-storyagent: Immersive narrated storybook video generation with a multi-agent paradigm across text, image and audio. arXiv preprint arXiv:2503.05242, 2025a.
[9] Xiangyuan Xue, Zeyu Lu, Di Huang, Zidong Wang, Wanli Ouyang, and Lei Bai. Comfybench: Benchmarking llm-based agents in comfyui for autonomously designing collaborative ai systems. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 24614–24624, 2025.
[10] Litao Guo, Xinli Xu, Luozhou Wang, Jiantao Lin, Jinsong Zhou, Zixin Zhang, Bolan Su, and Ying-Cong Chen. Comfymind: Toward general-purpose generation via tree-based planning and reactive feedback. arXiv preprint arXiv:2505.17908, 2025.
[11] Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683, 2025.
[12] Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Zichun Liao, Yusuke Kato, Kazuki Kozuka, and Aditya Grover. Omniflow: Any-to-any generation with multi-modal rectified flows. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 13178–13188, 2025a.
[13] Jinxiang Lai, Jie Zhang, Jun Liu, Jian Li, Xiaocheng Lu, and Song Guo. Spider: Any-to-many multimodal llm. arXiv preprint arXiv:2411.09439, 2024.
[14] Yufeng Cui, Honghao Chen, Haoge Deng, Xu Huang, Xinghang Li, Jirong Liu, Yang Liu, Zhuoyan Luo, Jinsheng Wang, Wenxuan Wang, et al. Emu3.5: Native multimodal models are world learners. arXiv preprint arXiv:2510.26583, 2025.
[15] Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In International conference on machine learning, pp. 1691–1703. PMLR, 2020.
[16] Lijie Fan, Tianhong Li, Siyang Qin, Yuanzhen Li, Chen Sun, Michael Rubinstein, Deqing Sun, Kaiming He, and Yonglong Tian. Fluid: Scaling autoregressive text-to-image generative models with continuous tokens. arXiv preprint arXiv:2410.13863, 2024.
[17] Jian Han, Jinlai Liu, Yi Jiang, Bin Yan, Yuqi Zhang, Zehuan Yuan, Bingyue Peng, and Xiaobing Liu. Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis. arXiv preprint arXiv:2412.04431, 2024.
[18] Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction. arXiv preprint arXiv:2404.02905, 2024.
[19] Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525, 2024.
[20] Yatian Pang, Peng Jin, Shuo Yang, Bin Lin, Bin Zhu, Zhenyu Tang, Liuhan Chen, Francis EH Tay, Ser-Nam Lim, Harry Yang, et al. Next patch prediction for autoregressive visual generation. arXiv preprint arXiv:2412.15321, 2024.
[21] Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022.
[22] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.
[23] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4195–4205, 2023.
[24] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024.
[25] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024.
[26] Zhenran Xu, Yiyu Wang, Xue Yang, Longyue Wang, Weihua Luo, Kaifu Zhang, Baotian Hu, and Min Zhang. Comfyui-r1: Exploring reasoning models for workflow generation. arXiv preprint arXiv:2506.09790, 2025b.
[27] Oucheng Huang, Yuhang Ma, Zeng Zhao, Mingrui Wu, Jiayi Ji, Rongsheng Zhang, Zhipeng Hu, Xiaoshuai Sun, and Rongrong Ji. Comfygpt: A self-optimizing multi-agent system for comprehensive comfyui workflow generation. arXiv preprint arXiv:2503.17671, 2025.
[28] Dian Zheng, Ziqi Huang, Hongbo Liu, Kai Zou, Yinan He, Fan Zhang, Lulu Gu, Yuanhan Zhang, Jingwen He, Wei-Shi Zheng, et al. Vbench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness. arXiv preprint arXiv:2503.21755, 2025.
[29] comfyanonymous. Comfyui. https://github.com/comfyanonymous/ComfyUI, 2023. GitHub repository.
