Jinheng Xie 1 Zhenheng Yang 2 Mike Zheng Shou 1∗

1 Show Lab, National University of Singapore  2 ByteDance

Abstract

This paper presents improved native unified multimodal models, i.e., Show-o2, that leverage autoregressive modeling and flow matching. Built upon a 3D causal variational autoencoder space, unified visual representations are constructed through a dual-path of spatial (-temporal) fusion, enabling scalability across image and video modalities while ensuring effective multimodal understanding and generation. Based on a language model, autoregressive modeling and flow matching are natively applied to the language head and flow head, respectively, to facilitate text token prediction and image/video generation. A two-stage training recipe is designed to effectively learn and scale to larger models. The resulting Show-o2 models demonstrate versatility in handling a wide range of multimodal understanding and generation tasks across diverse modalities, including text, images, and videos. Code and models are released at https://github.com/showlab/Show-o.

1 Introduction

Large language models (LLMs) 1 2 have achieved unprecedented performance levels, fueled by extensive web-scale text resources, substantial computational power, and billions of parameters. In the multimodal domain, large multimodal models (LMMs) 3 4 5 and visual generative models 6 7 8, have also demonstrated exceptional capabilities in tasks such as general-purpose visual question answering and text-to-image/video generation. Given their success, unified multimodal models (UMMs) 9 10 11 have been investigated to unify multimodal understanding and generation within a single model or system. In addition to multimodal understanding capability, this line of approaches seeks to simultaneously cultivate multimodal understanding and generation abilities in the model/system through pre-training, fine-tuning, or connecting tailored models.

Here, we provide a comparative analysis of selected UMMs in Table 1, focusing on two perspectives: i) the visual representations used for understanding and generation, and ii) the type of unified modeling. Generally, there are two approaches to incorporating visual representations for multimodal understanding and generation: i) a unified representation for both understanding and generation, as seen in works like Chameleon 10, Transfusion 12, and Show-o 11; and ii) decoupled representations, utilizing CLIP 13 for multimodal understanding and a variational autoencoder (VAE) for visual generation. To equip models with both multimodal understanding and generation capabilities, two primary methods have been explored: i) natively applying multimodal understanding and generation objectives within a single model, and ii) tuning adapters to assemble tailored models. We refer to the first type as native unified multimodal models, distinguishing it from the second type that assembles tailored models. These principles, combined with autoregressive modeling, diffusion modeling, or both, contribute to the development of unified multimodal models.

Compared to existing UMMs that primarily focus on text and image, our approach explores model designs that offer substantial potential and scalability for natively unifying text, image, and video modalities. An overview of our approach is presented in Fig. 1. Specifically, for visual inputs, we operate within the 3D causal VAE 14 space, which is capable of accommodating both images and videos. Recognizing the distinct feature dependencies between multimodal understanding and generation, we construct unified visual representations that simultaneously capture rich semantic information and low-level features with intrinsic structures and textural details from the visual latents. This is achieved through a dual-path mechanism consisting of semantic layers, a projector, and a spatial (-temporal) fusion process. As the fusion process occurs within the 3D causal VAE space, when it comes to videos, semantic and low-level features are temporally aligned and fused with full-frame video information.

Text embeddings and unified visual representations are structured into a sequence to go through a pre-trained language model and are modeled by a specific language head and flow head, respectively. Specifically, autoregressive modeling with causal attention is performed on the language head when dealing with text token prediction, and flow matching with full attention is applied to the flow head for image/video generation. Since the base language model lacks visual generation capabilities, we propose a two-stage training recipe to effectively learn such an ability while retaining the language knowledge, without requiring a massive text corpus. In the first stage, we mainly focus on pre-training the flow head for visual generation using (interleaved) text, image, and video data. In the second stage, the full model is fine-tuned with high-quality multimodal understanding and generation data.

Extensive experimental results have demonstrated that our model surpasses the existing methods in terms of most metrics across multimodal understanding and visual generation benchmarks. Collectively, the main contributions of this paper can be summarized as:

  • We present an improved native unified multimodal model that seamlessly integrates autoregressive modeling and flow matching, enabling a wide range of multimodal understanding and generation across (interleaved) text, images, and videos.
  • Based on the 3D causal VAE space, we construct unified visual representations that scale across both multimodal understanding and generation as well as image and video modalities, by combining semantic and low-level features through a dual-path of spatial (-temporal) fusion mechanism.
  • We design a two-stage training pipeline that effectively and efficiently learns unified multimodal models, retaining language knowledge and enabling effective scaling up to larger models, without requiring a massive text corpus.
  • The proposed model demonstrates state-of-the-art performance on multimodal understanding and visual generation benchmarks, surpassing existing methods across various metrics.

Table 1: Comparative analysis of selected unified multimodal models based on the type of visual representations and unified modeling for multimodal understanding and generation. In this context, native und. & gen. refers to the direct decoding of output sequences into texts, images, and videos, as opposed to serving as conditions for decoding using external pre-trained decoders like Stable Diffusion. * indicates that the method adopts two distinct models for multimodal understanding and generation, respectively. Diff. denotes diffusion modeling. Please refer to the complete table in the appendix.

| Methods | Paradigm |
| Chameleon 102 | AR |
| Transfusion 147 | AR + Diff. |
| Show-o 128 | AR + Diff. |
| VILA-U 123 | AR |
| Emu3 114 | AR |
| LlamaFusion 95 | AR + Diff. |
| Show-o2 (Ours) | AR + Diff. |
| Janus-Series 26 79 27 | AR (+Diff.) |
| UniFluid 38 | AR + MAR |
| Mogao 65 | AR + Diff. |
| BAGEL 32 | AR + Diff. |
| NExT-GPT 120 | AR + Diff. |
| SEED-X 40 | AR + Diff. |
| ILLUME 111 | AR + Diff. |
| MetaMorph 106 | AR + Diff. |
| MetaQueries 83 | AR + Diff. |
| TokenFlow 89 | AR |

2 Related Work

2.1 Large Multimodal Models

Building upon the advancements of large language models (LLMs) 1 2, large multimodal models (LMMs) 15 5 4 3 have showcased remarkable capabilities in general-purpose visual question answering. These approaches typically leverage pre-trained vision encoders to project visual features and align them within the embedding space of LLMs. Meanwhile, a growing number of encoder-free LMMs 11 16 17 aim to directly align raw visual features within the LLM embedding space. However, these encoder-free methods often fall behind models that utilize image-text-aligned visual features in terms of performance. Beyond model architecture, recent studies 18 19 4 have highlighted the critical role of high-quality instructional data in enhancing multimodal capabilities.

2.2 Visual Generative Models

Two prominent paradigms for visual generation, namely diffusion 20 21 22 23 24 25 26 7 8 27 28 and autoregressive modeling 29 30 31 32 33, have been extensively studied in image and video generation in recent years. Diffusion-based methods typically employ optimized architectures that integrate pre-trained text encoders with denoising networks. In contrast, autoregressive methods often utilize LLM-based architectures and are trained through next-token prediction. Recently, several studies 34 35 36 have explored hybrid approaches that combine diffusion and autoregressive modeling to further advance visual generation capabilities.

2.3 Unified Multimodal Models

Building on the success of large multimodal and visual generative models, pioneering unified multimodal models (UMMs) such as Chameleon 10, Show-o 11, and Transfusion 12 aim to integrate these capabilities into a single model through autoregressive or diffusion modeling or both. Further advancements 37 38 39 40 41 42 have focused on optimizing the training pipeline and enhancing the semantics of discrete tokens, leading to improved performance. We refer to these approaches as native unified multimodal models, as they inherently combine multimodal understanding and generation objectives within a unified architecture.

An alternative and promising direction 43 44 45 46 47 48 49 for unifying multimodal understanding and generation involves assembling off-the-shelf specialized LMMs and visual generative models by tuning adapters or learnable tokens. Representative works 9 46 48 49 have demonstrated the promising capabilities and intriguing properties of such assembled unified frameworks, highlighting their potential for further exploration.

3 Methodology

In this section, we introduce the overall framework (Section 3.1), which consists of two key components: i) the design of unified visual representations for multimodal understanding and generation, applicable to both images and videos, and ii) the native learning of multimodal understanding and generation capabilities. Subsequently, we present a two-stage training recipe (Section 3.2), which is designed to progressively learn and effectively scale up the unified multimodal model.

3.1 Overall Framework

Overall Architecture. An overview of our proposed unified model is depicted in Fig. 1. Given (interleaved) texts, images, or videos, a text tokenizer with an embedding layer and a 3D causal VAE encoder accordingly process them into continuous text embeddings and visual latent representations. Subsequently, the visual latent representations undergo a dual-path extraction of spatial (-temporal) fusion to create the unified visual representations. These representations are then structured into a sequence, which is fed into a language model equipped with language and flow heads to model the sequence via autoregressive modeling and flow matching accordingly. Finally, a text de-tokenizer in conjunction with a 3D causal VAE decoder is employed to decode the final output. Next, we will delve into the fundamental design principles behind the unified visual representation and flow head.


Figure 1: Our approach begins by encoding input texts, images, and videos into continuous embeddings and visual latents. The visual latents are processed through a dual-path extraction and spatial (-temporal) fusion mechanism to construct unified visual representations that are scalable for both multimodal understanding and generation, image and video modalities. These text embeddings and unified visual representations are then structured into a sequence for the base language model, equipped with dedicated heads. Specifically, text tokens are modeled autoregressively by a language head, while image and video latents are handled by a flow head using flow matching. We employ the omni-attention mechanism 128 147 to enable causal attention along the sequence while maintaining full attention within the unified visual representations. This design empowers our model to effectively tackle tasks such as image/video understanding, generation, and mixed-modality generation.

Unified Visual Representation. To scalably support image and video modalities, we employ a 3D causal VAE encoder to extract image/video latents. As multimodal understanding and generation differ in feature dependency, we propose a dual-path architecture comprising semantic layers $S(\cdot)$, which extract high-level representations of rich semantic contextual information, and a projector $P(\cdot)$, which retains the complete low-level information from the extracted visual latents. Specifically, the semantic layers share the same vision transformer blocks as SigLIP 50 with a new patch embedding layer. Given visual latents $x_1$ at a noise level $t$:

$$x_t = t x_1 + (1 - t) x_0, \tag{1}$$

where $x_0 \sim \mathcal{N}(0, I)$ and $t \in [0, 1]$, we load the pre-trained weights of SigLIP and pre-distill $S(\cdot)$ as follows:

$$\max \; \cos\big(S(x_t), F(X)\big), \tag{2}$$

where $X$ is the input image, $F(\cdot)$ extracts the image patch features, and $\cos(\cdot,\cdot)$ indicates the cosine similarity calculator. In this way, the semantic layers can mimic extracting semantic features from both clean and noised visual latents $x_t$. The projector $P(\cdot)$ is simply composed of a 2D patch embedding layer. The extracted high- and low-level representations are spatially (and temporally, when it comes to videos) fused by concatenating them along the feature dimension and applying RMSNorm 51 with two MLP layers to obtain the unified visual representations $u$:

$$u = \mathrm{STF}\big(S(x_t), P(x_t)\big) = \mathrm{MLP}\big(\mathrm{RMSNorm}([S(x_t); P(x_t)])\big), \tag{3}$$

where STF indicates the spatial (-temporal) fusion mechanism and $[\cdot;\cdot]$ denotes feature-wise concatenation. In addition, we prepend a time step embedding to the unified visual representations for generative modeling; $t$ is set as 1.0 to get the time step embedding for clean images.
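To make the dual-path fusion concrete, below is a minimal PyTorch sketch of the spatial (-temporal) fusion in Eq. 3. The class and dimension names are illustrative assumptions rather than the released implementation, and it assumes a recent PyTorch version that provides nn.RMSNorm.

```python
import torch
import torch.nn as nn

class SpatialTemporalFusion(nn.Module):
    """Fuse high-level (semantic-layer) and low-level (projector) features:
    concatenate along the feature dimension, then RMSNorm and two MLP layers."""

    def __init__(self, sem_dim: int, low_dim: int, hidden_dim: int):
        super().__init__()
        self.norm = nn.RMSNorm(sem_dim + low_dim)
        self.mlp = nn.Sequential(
            nn.Linear(sem_dim + low_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, h_sem: torch.Tensor, h_low: torch.Tensor) -> torch.Tensor:
        # h_sem, h_low: (batch, num_tokens, dim). The two paths are token-wise
        # aligned because both read the same 3D causal VAE latents.
        u = torch.cat([h_sem, h_low], dim=-1)  # feature-wise concatenation
        return self.mlp(self.norm(u))          # unified visual tokens u
```

In practice, a time step embedding would be prepended to the returned tokens before they enter the sequence, as described above.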

We structure the text embeddings and unified visual representations into a sequence following a general interleaved image-text format below:

[BOS] {text embeddings} [BOI] {unified visual representations u} [EOI] {text embeddings} ⋯ [EOS]

where [BOI] and [EOI] mark the beginning and end of an image/video segment. The sequence format above is flexible and can be adapted to various input types. We adopt the omni-attention mechanism 11 12 to let the sequence modeling be causal overall but with full attention within the unified visual representations.
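As a rough illustration of this attention pattern (not the authors' code), the sketch below builds a boolean mask that is causal over the whole sequence but grants full attention within each contiguous span of visual tokens; `is_visual` marks positions holding unified visual representations.

```python
import torch

def omni_attention_mask(is_visual: torch.Tensor) -> torch.Tensor:
    """is_visual: (seq_len,) bool; True at unified visual token positions.
    Returns a (seq_len, seq_len) bool mask; True = attention allowed."""
    n = is_visual.numel()
    allowed = torch.tril(torch.ones(n, n, dtype=torch.bool))  # causal base
    # Assign an id to each contiguous visual span so that tokens of the
    # same image/video can attend to each other bidirectionally.
    starts = is_visual & ~torch.cat([is_visual.new_zeros(1), is_visual[:-1]])
    span_id = torch.cumsum(starts.long(), dim=0) * is_visual.long()
    same_span = (
        (span_id[:, None] == span_id[None, :])
        & is_visual[:, None]
        & is_visual[None, :]
    )
    return allowed | same_span

# e.g., a text-image-text sequence: the three visual tokens see each other.
mask = omni_attention_mask(torch.tensor([0, 0, 1, 1, 1, 0], dtype=torch.bool))
```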

Flow Head. Apart from the language head for text token prediction, we employ a flow head to predict the defined velocity via flow matching 26 52. Specifically, the flow head simply consists of several transformer layers with time step modulation via the adaLN-Zero blocks, as seen in DiT 53.

During training, we natively apply next token prediction to the language head and flow matching to the flow head for predicting velocity, respectively:

$$\mathcal{L}_{\mathrm{NTP}} = -\sum_i \log p_\theta(w_i \mid w_{<i}), \qquad \mathcal{L}_{\mathrm{Flow}} = \mathbb{E}_{t, x_0, x_1} \big\| v_\theta(x_t, t) - (x_1 - x_0) \big\|^2, \qquad \mathcal{L} = \mathcal{L}_{\mathrm{NTP}} + \alpha \mathcal{L}_{\mathrm{Flow}}, \tag{4}$$

where $w_i$ denotes the $i$-th text token, $v_\theta$ is the velocity predicted by the flow head, and $\alpha$ balances the two objectives.
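A minimal sketch of this combined objective, assuming the standard rectified-flow convention above (x0 is Gaussian noise, x1 the clean latents) and hypothetical tensor shapes:

```python
import torch.nn.functional as F

def unified_loss(text_logits, text_targets, v_pred, x0, x1, alpha):
    """text_logits: (B, L, V); text_targets: (B, L) with -100 on
    non-text positions; v_pred/x0/x1: (B, N, D) latent tensors."""
    # Next-token prediction on the language head.
    l_ntp = F.cross_entropy(
        text_logits.reshape(-1, text_logits.size(-1)),
        text_targets.reshape(-1),
        ignore_index=-100,  # skip visual and prompt positions
    )
    # Flow matching on the flow head: regress the velocity x1 - x0.
    l_flow = F.mse_loss(v_pred, x1 - x0)
    return l_ntp + alpha * l_flow  # alpha = 0.2 (stage 1) / 1.0 (stage 2)
```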

3.2 Training Recipe

Table 2: Trainable components and datasets in the training stages.

| Stage | Trainable Components | # Image-Text | # Video-Text | # Interleaved Data |
| Stage-1 | Projector; Spatial (-Temporal) Fusion; Flow Head | 66M | WebVid 8; Pandas 23 | OmniCorpus 60 |
| Stage-2 | Full Model (w/o VAE) | 9M HQ Und.; 16M HQ Gen. | OpenVid-1M 80 (Gen.); 1.5M Internal Data (Gen.); 1.6M Video Und. | VIST 47; CoMM 24 |

Existing UMMs, such as Show-o 11, Janus-Pro 54, Transfusion 12, Chameleon 10, and Emu3 37, are typically trained from LLMs, LMMs, or from scratch. These approaches aim to cultivate visual generative modeling capabilities while preserving language modeling proficiency. However, this process often relies on web-scale, high-quality text corpora, which are prohibitively expensive to collect. Consequently, the lack of such resources can lead to a degradation in language knowledge and modeling performance. To address this challenge, we adopt a two-stage training recipe (as shown in Table 2) that effectively retains language knowledge while simultaneously developing visual generation capabilities, without requiring a massive text corpus.

Stage-1. Before the two-stage training, we pre-distill the semantic layers (implementation details can be found in Section 4). The first stage involves only the trainable components of the projector, spatial (-temporal) fusion, and flow head. In this stage, we train these components with autoregressive modeling and flow matching on around 66M image-text pairs, progressively adding interleaved data and video-text pairs.

Stage-2. Subsequently, we tune the full model using 9M high-quality multimodal understanding instruction data, 16M high-quality visual generation data filtered from the 66M image-text pairs, and 1.6M video understanding data.

Scaling Up. After the training of the small-sized model with approximately 1.5B LLM parameters, we resume the pre-trained flow head for the larger model with 7B LLM parameters and introduce a lightweight MLP transformation to align the hidden size, allowing it to quickly adapt to the larger model and converge.
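A hypothetical sketch of this hidden-size alignment is given below; the widths follow the published hidden sizes of Qwen2.5-1.5B (1536) and Qwen2.5-7B (3584), but the adapter depth and names are assumptions rather than the released design.

```python
import torch.nn as nn

class FlowHeadAdapter(nn.Module):
    """Map 7B-model hidden states down to the width expected by the
    flow head pre-trained alongside the 1.5B model."""

    def __init__(self, llm_dim: int = 3584, flow_dim: int = 1536):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(llm_dim, flow_dim),
            nn.SiLU(),
            nn.Linear(flow_dim, flow_dim),
        )

    def forward(self, hidden_states):
        # Output feeds the resumed flow head unchanged.
        return self.proj(hidden_states)
```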

4 Experiments

4.1 Experimental Setup

Datasets. The curated approximately 66M image-text pairs consist of images with a resolution of at least 512 pixels in width and height. The images are filtered from CC12M 55, COYO 56, LAION-Aesthetic-12M, and AI synthetic data. The images are recaptioned by LMMs, except for the synthetic data. The 9M high-quality multimodal understanding instruction data is curated from DenseFusion-1M 57 and LLaVA-OneVision 4.

Implementation Details. The semantic layers are pre-distilled from SigLIP-so400m-patch14-384 over 200K iterations, using a batch size of 512 and a cosine-scheduled learning rate of 2e-5. During distillation, Eq. 1 is applied to the visual latents with a probability of 0.3 during only the last 20K iterations. The input image resolution of the 3D causal VAE encoder with the patch embedding layer is set as 432 × 432 to get 27 × 27 visual latents, which matches the number of patch features extracted by SigLIP. Once distilled, the semantic layers are capable of extracting rich semantic features from both clean and noised visual latents. Statistically, the features extracted from clean visual latents by the distilled semantic layers have converged to an average cosine similarity of around 0.9 with those extracted by the original SigLIP on the curated 66M image-text pairs. We interpolate the position embeddings in the bicubic mode when involving other image/video resolutions.
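The pre-distillation objective of Eq. 2 can be sketched as follows; `sem_layers`, `proj` (a lightweight alignment head we assume here), and the frozen `siglip` teacher are placeholders for the actual modules.

```python
import torch
import torch.nn.functional as F

def distill_loss(sem_layers, proj, siglip, x_t, image):
    """Align semantic-layer features from (possibly noised) VAE latents
    with frozen SigLIP patch features via cosine similarity."""
    pred = proj(sem_layers(x_t))        # (B, 729, D): from 27x27 latents
    with torch.no_grad():
        target = siglip(image)          # (B, 729, D): SigLIP patch features
    cos = F.cosine_similarity(pred, target, dim=-1)
    return 1.0 - cos.mean()             # minimizing maximizes similarity
```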

Our models build upon two LLM variants, i.e., Qwen2.5-1.5B-Instruct 2 and Qwen2.5-7B-Instruct 2, respectively. We adopt the 3D causal VAE proposed in Wan2.1 14 with spatial and temporal compression factors of 8 and 4, respectively. In stage 1, we first train the 1.5B variant for 150K iterations using the AdamW optimizer with a constant learning rate of 0.0001 on the curated 66M image-text pairs at a resolution of 432 × 432. The context length of single image-text pairs is set as 1024. The total batch sizes for multimodal understanding and generation are 128 and 384, respectively. The loss weight α in Eq. 4 is set as 0.2. For visual generation data, the caption is dropped with a probability of 0.1 to enable classifier-free guidance. This training process takes roughly one and a half days using 64 H100 GPUs. Subsequently, we replace the generation data with the 16M high-quality data (filtered from the 66M image-text pairs) and continue training for 40K iterations. In stage 2, we follow the training strategies in LLaVA-OneVision 4 to train the 1.5B model using around 9M multimodal instructional and 16M high-quality generation data for a total of around 35K iterations. α in Eq. 4 is set as 1.0. The stage 2 training process takes around 15 hours. For models with mixed-modality and video generation capabilities, we progressively add video-text and interleaved data in stage 1. For video data, we randomly sample a 2-second 480p or 432 × 432 clip with 17 frames from each video at an interval of 3 frames. The context length in this case is set as 7006. In stage 2, high-quality video-text and interleaved data are added to further improve video and mixed-modality generation capabilities.

To further improve the image generation and text rendering quality, we additionally train the small-scale model on images with higher resolutions (up to 1024 × 1024) and involve additional text-rich image data, i.e., a subset of TextAtlas 58.

Building on the pre-trained image-level Show-o2 models, we enhance their video understanding capabilities by further fine-tuning on 1.6M video samples from 59, together with 1.1M image-level samples from the earlier stage. We adopt the same video training and inference settings as LLaVA-OneVision. The evaluation results are shown in Table 4.

Table 3: Evaluation on multimodal understanding benchmarks. # Params. indicates the number of parameters of the base LLM. * indicates that the method uses two distinct models or sets of parameters for multimodal understanding and generation, respectively. † indicates the Show-o2 models fine-tuned using video understanding data. Und. indicates “understanding”. Results in gray indicate the performance of und.-only models or models with more than 13B total parameters.

| Types | Models | # Params. | MME (p) ↑ | GQA ↑ | SEED (all) ↑ | MMB (en) ↑ | MMMU (val) ↑ | MMStar ↑ | AI2D ↑ |
| Und. Only | LLaVA-v1.5 71 | 7B | 1510.7 | 62.0 | 58.6 | 64.3 | - | - | - |
| | Qwen-VL-Chat 6 | 7B | 1487.6 | 57.5 | 58.2 | 60.6 | - | - | 57.7 |
| | LLaVA-OV 56 | 7B | 1580.0 | - | - | 80.8 | 48.8 | 57.5 | 81.4 |
| Unify via Assembling Tailored Models | NExT-GPT 128 | 13B | - | - | 57.5 | 58.0 | - | - | - |
| | SEED-X 40 | 17B | 1457.0 | 49.1 | 66.5 | 70.1 | 35.6 | - | - |
| | MetaMorph 106 | 8B | - | - | 71.8 | 75.2 | - | - | - |
| | TokenFlow-XL 89 | 14B | 1551.1 | 62.5 | 72.6 | 76.8 | 43.2 | - | 75.9 |
| | ILLUME 111 | 7B | 1445.3 | - | 72.9 | 75.1 | 38.2 | - | 71.4 |
| Native Unified | BAGEL 32 | 14B | 1687.0 | - | - | 85.0 | 55.3 | - | - |
| | Show-o 128 | 1.3B | 1097.2 | 58.0 | 51.5 | - | 27.4 | - | - |
| | JanusFlow 79 | 1.5B | 1333.1 | 60.3 | 70.5 | 74.9 | 29.3 | - | - |
| | SynerGen-VL 58 | 2.4B | 1381.0 | - | - | 53.7 | 34.2 | - | - |
| | Janus-Pro 26 | 1.5B | 1444.0 | 59.3 | 68.3 | 75.5 | 36.3 | - | - |
| | Show-o2 (Ours) | 1.5B | 1450.9 | 60.0 | 65.6 | 67.4 | 37.1 | 43.4 | 69.0 |
| | Emu3 114 | 8B | - | 60.3 | 68.2 | 58.5 | 31.6 | - | 70.0 |
| | VILA-U 123 | 7B | 1401.8 | 60.8 | 59.0 | - | - | - | - |
| | MUSE-VL 129 | 7B | - | - | 69.1 | 72.1 | 39.7 | 49.6 | 69.8 |
| | Liquid 118 | 8B | 1448.0 | 61.1 | - | - | - | - | - |
| | Janus-Pro 26 | 7B | 1567.1 | 62.0 | 72.1 | 79.2 | 41.0 | - | - |
| | Mogao 65 | 7B | 1592.0 | 60.9 | 74.6 | 75.0 | 44.2 | - | - |
| | Show-o2 (Ours) | 7B | 1620.5 | 63.1 | 69.8 | 79.3 | 48.9 | 56.6 | 78.6 |

In the training of our model based on the 7B LLM variant, we resume the flow head pre-trained based on the 1.5B model and additionally train the newly initialized spatial (-temporal) fusion, projector, and MLP transformations for 3K iterations with 2K warm-up steps to align the hidden size and then further train spatial (-temporal) fusion, the projector, MLP transformations, and the flow head together. Following that, we conduct the training stages 1 and 2 in the same manner as those of the 1.5B model. The whole training process of our 7B model takes approximately 2 and a half days using 128 H100 GPUs. We do not include interleaved and video data in the training stages of the larger model due to the huge computational cost and training duration.

4.2 Multimodal Understanding on Images and Videos

Quantitative Results. Table 3 highlights the performance of our models on multimodal understanding benchmarks, evaluated across metrics such as MME 60, GQA 61, SEED-Bench 62, MMBench 63, MMMU 64, MMStar 65, and AI2D 66. As shown in the table, both the 1.5B and 7B variants of our model consistently outperform state-of-the-art models across many metrics. Among models with similar parameter sizes (1.5B), our model achieves the best scores on the MME-p and MMMU-val benchmarks while delivering competitive performance on the GQA and SEED-Bench metrics. When compared to larger models with approximately 7B parameters, our models surpass state-of-the-art models such as Janus-Pro and even the significantly larger TokenFlow-XL model (14B parameters) in metrics including MME-p, GQA, MMMU-val, MMStar, and AI2D, while maintaining competitive performance on SEED-Bench and MMBench. These results underscore the robust perception capabilities of our unified visual representations, demonstrating their effectiveness in multimodal understanding tasks and the promising potential in this domain. In addition, we present the video understanding performance of Show-o2 in Table 4.

Qualitative Results. Fig. 2 showcases the multimodal understanding capabilities of our model. As demonstrated, the model excels at answering general-purpose questions about an image. Specifically, it can provide detailed descriptions of an image, count objects, and recognize text within the image. Besides, the model can leverage its world knowledge to offer step-by-step instructions for preparing daily drinks such as an avocado milkshake. Further, our model supports multimodal understanding in both English and Chinese, enabling bilingual question answering and highlighting its versatility and practical utility.

Table 4: Evaluation on video understanding benchmarks. # Params. denotes the number of parameters in the base LLM, while # Frames represents the maximum number of video frames used during training and inference. Und. stands for understanding. † marks the Show-o2 models that have been fine-tuned on video understanding data. All results are reported in terms of zero-shot accuracy.

| Model | # Params. | # Frames | ActNet-QA (test) | MVBench (test) | NExT-QA (mc) | PerceptionTest (val) | LongVideoBench (val) | VideoMME (wo/w-subs) |
| Proprietary Und. Only Models |
| GPT-4V 81 | - | - | 57.0 | 43.5 | - | - | 61.3 | 59.9/63.3 |
| GPT-4o 82 | - | - | - | - | - | - | 66.7 | 71.9/77.2 |
| Gemini-1.5-Flash 103 | - | - | 55.3 | - | - | - | 61.6 | 70.3/75.0 |
| Gemini-1.5-Pro 103 | - | - | 57.5 | - | - | - | 64.0 | 75.0/81.3 |
| Open-source Und. Only Models |
| VILA 69 | 40B | - | 58.0 | - | 67.9 | 54.0 | - | 60.1/61.1 |
| PLLaVA 131 | 34B | 16 / 16 | 60.9 | 58.1 | - | - | 53.2 | - |
| LongVA 143 | 7B | - | 50.0 | - | 68.3 | - | - | 52.6/54.3 |
| IXC-2.5 142 | 7B | 64 / 64 | 52.8 | 69.1 | 71.0 | 34.4 | - | 55.8/58.8 |
| LLaVA-OV 56 | 7B | 32 / 32 | 56.6 | 56.7 | 79.4 | 57.1 | 56.5 | 58.2/61.5 |
| VideoLLaMA2 30 | 7B | 16 / 16 | 50.2 | 54.6 | - | 51.4 | - | 47.9/50.3 |
| Unified Multimodal Models |
| Show-o2 (Ours) | 1.5B | 32 / 32 | 52.7 | 49.8 | 72.1 | 56.1 | 49.2 | 48.0/51.6 |
| Show-o2 (Ours) | 7B | 16 / 32 | 56.4 | 55.8 | 79.0 | 61.9 | 55.5 | 57.4/60.9 |

Table 5: Evaluation on the GenEval 67 benchmark. Gen. denotes “generation”. # Params. indicates the number of parameters of base LLM. # Data. indicates the number of image-text pairs used for visual generation during training. * means the method uses two distinct models for multimodal understanding and generation, respectively. Obj.: Object. Attri.: Attribute. Our results are obtained using rewritten prompts. + indicates the additional data required by the pretrained diffusion models.

| Type | Method | # Params. | # Data | Single Obj. | Two Obj. | Counting | Colors | Position | Color Attri. | Overall ↑ |
| Gen. Only | SD3-Medium 37 | - | - | 0.99 | 0.94 | 0.72 | 0.89 | 0.33 | 0.60 | 0.74 |
| Unifying via Assembling Tailored Models | SEED-X 40 | 17B | 158M+ | 0.97 | 0.58 | 0.26 | 0.80 | 0.19 | 0.14 | 0.49 |
| | TokenFlow-XL 89 | 14B | 60M | 0.95 | 0.60 | 0.41 | 0.81 | 0.16 | 0.24 | 0.55 |
| | ILLUME 111 | 7B | 15M+ | 0.99 | 0.86 | 0.45 | 0.71 | 0.39 | 0.28 | 0.61 |
| | MetaQuery-XL 83 | 7B | 28M+ | - | - | - | - | - | - | 0.80 |
| Native Unified | Show-o 128 | 1.3B | 2.0B | 0.98 | 0.80 | 0.66 | 0.84 | 0.31 | 0.50 | 0.68 |
| | Emu3 114 | 8B | - | - | - | - | - | - | - | 0.66 |
| | MUSE-VL 129 | 7B | 24M | - | - | - | - | - | - | 0.57 |
| | Transfusion 147 | 7B | 3.5B | - | - | - | - | - | - | 0.63 |
| | D-DiT 63 | 2B | 40M | 0.97 | 0.80 | 0.54 | 0.76 | 0.32 | 0.50 | 0.65 |
| | Janus-Pro 26 | 7B | 144M | 0.99 | 0.89 | 0.59 | 0.90 | 0.79 | 0.66 | 0.80 |
| | BAGEL 32 | 14B | 1600M | 0.98 | 0.95 | 0.84 | 0.95 | 0.78 | 0.77 | 0.88 |
| | Mogao 65 | 7B | - | 1.00 | 0.97 | 0.83 | 0.93 | 0.84 | 0.80 | 0.89 |
| | Show-o2 (Ours) | 1.5B | 66M | 0.99 | 0.86 | 0.55 | 0.86 | 0.46 | 0.63 | 0.73 |
| | Show-o2 (Ours) | 7B | 66M | 1.00 | 0.87 | 0.58 | 0.92 | 0.52 | 0.62 | 0.76 |

Table 6: Evaluation on the DPG-Bench 68 benchmark. Gen. denotes “generation”. # Params. indicates the number of parameters of base LLM. # Data. indicates the number of image-text pairs used for visual generation during training.

| Type | Method | # Params. | # Data | Global | Entity | Attribute | Relation | Other | Overall ↑ |
| Gen. Only | Hunyuan-DiT 64 | 1.5B | - | 84.59 | 80.59 | 88.01 | 74.36 | 86.41 | 78.87 |
| | Playground v2.5 57 | - | - | 83.06 | 82.59 | 81.20 | 84.08 | 83.50 | 75.47 |
| | PixArt-Σ 17 | - | - | 86.89 | 82.89 | 88.94 | 86.59 | 87.68 | 80.54 |
| | DALL-E 3 10 | - | - | 90.97 | 89.61 | 88.39 | 90.58 | 89.83 | 83.50 |
| | SD3-Medium 37 | 2B | - | 87.90 | 91.01 | 88.83 | 80.70 | 88.68 | 84.08 |
| Native Unified | Emu3-DPO 114 | 8B | - | - | - | - | - | - | 81.60 |
| | Janus-Pro 26 | 7B | 144M | 86.90 | 88.90 | 89.40 | 89.32 | 89.48 | 84.19 |
| | Mogao 65 | 7B | - | 82.37 | 90.03 | 88.26 | 93.18 | 85.40 | 84.33 |
| | Show-o2 (Ours) | 1.5B | 66M | 87.53 | 90.38 | 91.34 | 90.30 | 91.21 | 85.02 |
| | Show-o2 (Ours) | 7B | 66M | 89.00 | 91.78 | 89.96 | 91.81 | 91.64 | 86.14 |

Table 7: Overall quantitative comparison of different methods on OneIG-Bench. Gen. denotes “generation”. # Params. indicates the number of parameters of base LLM. # Data. indicates the number of image-text pairs used for visual generation during training.

| Type | Method | # Params. | # Data | Alignment ↑ | Text ↑ | Reasoning ↑ | Style ↑ | Diversity ↑ |
| Gen. Only | SD3.5-Large 37 | 8B | - | 0.809 | 0.629 | 0.294 | 0.353 | 0.225 |
| | Flux.1-dev 54 | 12B | - | 0.786 | 0.523 | 0.253 | 0.368 | 0.238 |
| | SANA-1.5 (PAG) 126 | 4.8B | - | 0.765 | 0.069 | 0.217 | 0.401 | 0.216 |
| | Lumina-Image 2.0 88 | 2.6B | 110M | 0.819 | 0.106 | 0.270 | 0.354 | 0.216 |
| | HiDream-I1-Full 44 | 17B | - | 0.829 | 0.707 | 0.317 | 0.347 | 0.186 |
| Unified Models | Show-o-512 128 | 1.3B | 2B | 0.702 | 0.002 | 0.213 | 0.361 | 0.241 |
| | Janus-Pro 27 | 7B | 144M | 0.553 | 0.001 | 0.139 | 0.276 | 0.365 |
| | BLIP3-o 18 | 8B | 55M | 0.711 | 0.013 | 0.223 | 0.361 | 0.229 |
| | BAGEL 32 | 14B | 1600M | 0.769 | 0.244 | 0.173 | 0.367 | 0.251 |
| | OmniGen2 117 | 7B | 150M | 0.804 | 0.680 | 0.271 | 0.377 | 0.242 |
| | Show-o2 (Ours) | 1.5B | 66M | 0.798 | 0.002 | 0.219 | 0.317 | 0.186 |
| | Show-o2-1024×1024 (Ours) | 1.5B | 66M | 0.798 | 0.125 | 0.274 | 0.351 | 0.186 |
| | Show-o2 (Ours) | 7B | 66M | 0.817 | 0.002 | 0.226 | 0.317 | 0.177 |

4.3 Visual Generation

Image Generation. We compare our model with the state-of-the-art approaches on GenEval 67, DPG-Bench 68, and OneIG 69 benchmarks in Tables 5, 6, and 7. One can observe that our model surpasses most approaches, including TokenFlow-XL, Show-o, Emu3, and Transfusion, on the GenEval benchmark. Compared to Janus-Pro, which was trained on a significantly larger dataset of 144M image-text pairs, our model achieves promising results with only 66M image-text pairs. On DPG-Bench evaluation, our model has demonstrated the best overall score compared to generation-only models such as SD3-Medium and unified models, including Emu3-DPO and Janus-Pro. On OneIG-Bench, our models also achieve competitive performance. We also show qualitative results in Fig. 2 to illustrate that our model can generate high-quality and realistic images.

Video Generation. We compare our model with text-to-video and image-to-video generation models in Tables 8 and 9. One can observe that, with only 2B parameters, our model outperforms models such as Show-1, Emu3, and VILA-U, which have more than 6B parameters. Besides, our model demonstrates competitive performance compared to CogVideoX and Step-Video-T2V. We also provide qualitative results of the text-to-video and image-to-video generation capabilities of our model in the middle of Fig. 2.


Figure 2: Multimodal understanding and generation examples.

One can observe that, given text prompts or an input image, our model can generate consistent video frames with reasonable motions, such as the smiling girl, lapping waves, and floating clouds.

4.4 Mixed-Modality Generation

We demonstrate the mixed-modality generation capabilities of our model using the downstream visual storytelling dataset 70 in Fig. 2. During fine-tuning, given an interleaved image-text sequence, we apply noise to all images in the sequence with a probability of 0.3. Otherwise, we randomly retain a number of the earlier images in the sequence and only apply noise to the later ones. Benefiting from the general interleaved sequence format described in Section 3.1, our model can predict the [BOI] token once it begins to generate an image. Upon detecting the [BOI] token, noise latents are appended to the sequence to gradually generate an image. The generated text tokens and images then serve as context to continue generating the following output. Fig. 2 includes two examples demonstrating our model’s ability to generate coherent interleaved text and images, vividly narrating a story.
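The decoding loop can be sketched as below; `sample_next_token`, `append_visual`, the token-id attributes, and `denoise_fn` (which runs the flow-matching steps with the flow head) are all hypothetical placeholders, not the released API.

```python
import torch

def generate_interleaved(model, seq, num_image_tokens, denoise_fn, max_len=4096):
    """Autoregressively sample text; on [BOI], append noise latents and
    denoise them into an image, which then serves as further context."""
    while seq.size(1) < max_len:
        tok = model.sample_next_token(seq)              # hypothetical helper
        seq = torch.cat([seq, tok], dim=1)
        if tok.item() == model.boi_token_id:            # start of an image
            noise = torch.randn(1, num_image_tokens, model.latent_dim)
            latents = denoise_fn(model, seq, noise)     # flow-matching steps
            seq = model.append_visual(seq, latents)     # hypothetical helper
        elif tok.item() == model.eos_token_id:
            break
    return seq
```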

Table 8: Comparison with text-to-video models on the VBench 71 benchmark. # Params. indicates the number of total parameters for video generation including the base LLM and flow head. QS: Quality Score, SS: Semantic Score, SC: Subject Consistency, BC: Background Consistency, TF: Temporal Flickering, MS: Motion Smoothness, DD: Dynamic Degree, AQ: Aesthetic Quality, IQ: Imaging Quality, OC: Object Class, MO: Multiple Objects, HA: Human Action, C: Color, SR: Spatial Relationship, S: Scene, AS: Appearance style, TS: Temporal Style, OC’: Overall Consistency.

| Models | # Params. | Total | QS | SS | SC | BC | TF | MS | DD | AQ | IQ | OC | MO | HA | C | SR | S | AS | TS | OC' |
| ModelScope 72 | 1.7B | 75.75 | 78.05 | 66.54 | 89.87 | 95.29 | 98.28 | 95.79 | 66.39 | 52.06 | 58.57 | 82.25 | 38.98 | 92.40 | 81.72 | 33.68 | 39.26 | 23.39 | 25.37 | 25.67 |
| LaVie 73 | 3B | 77.08 | 78.78 | 70.31 | 91.41 | 97.47 | 98.30 | 96.38 | 49.72 | 54.94 | 61.90 | 91.82 | 33.32 | 96.80 | 86.39 | 34.09 | 52.69 | 23.56 | 25.93 | 26.41 |
| OpenSoraPlan V1.3 74 | - | 77.23 | 80.14 | 65.62 | 97.79 | 97.24 | 99.20 | 99.05 | 30.28 | 60.42 | 56.21 | 85.56 | 43.58 | 86.80 | 79.30 | 51.61 | 36.73 | 20.03 | 22.47 | 24.47 |
| Show-1 27 | 6B | 78.93 | 80.42 | 72.98 | 95.53 | 98.02 | 99.12 | 98.24 | 44.44 | 57.35 | 58.66 | 93.07 | 45.47 | 95.60 | 86.35 | 53.50 | 47.03 | 23.06 | 25.28 | 27.46 |
| AnimateDiff-V2 75 | - | 80.27 | 82.90 | 69.75 | 95.30 | 97.68 | 98.75 | 97.76 | 40.83 | 67.16 | 70.10 | 90.90 | 36.88 | 92.60 | 87.47 | 34.60 | 50.19 | 22.42 | 26.03 | 27.04 |
| Gen-2 76 | - | 80.58 | 82.47 | 73.03 | 97.61 | 97.61 | 99.56 | 99.58 | 18.89 | 66.96 | 67.42 | 90.92 | 55.47 | 89.20 | 89.49 | 66.91 | 48.91 | 19.34 | 24.12 | 26.17 |
| Pika-1.0 77 | - | 80.69 | 82.92 | 71.77 | 96.94 | 97.36 | 99.74 | 99.50 | 47.50 | 62.04 | 61.87 | 88.72 | 43.08 | 86.20 | 90.57 | 61.03 | 49.83 | 22.26 | 24.22 | 25.94 |
| VideoCrafter-2.0 78 | - | 80.44 | 82.20 | 73.42 | 96.85 | 98.22 | 98.41 | 97.73 | 42.50 | 63.13 | 67.22 | 92.55 | 40.66 | 95.00 | 92.92 | 35.86 | 55.29 | 25.13 | 25.84 | 28.23 |
| CogVideoX 79 | 5B | 81.61 | 82.75 | 77.04 | 96.23 | 96.52 | 98.66 | 96.92 | 70.97 | 61.98 | 62.90 | 85.23 | 62.11 | 99.40 | 82.81 | 66.35 | 53.20 | 24.91 | 25.38 | 27.59 |
| Kling 80 | - | 81.85 | 83.39 | 75.68 | 98.33 | 97.60 | 99.30 | 99.40 | 46.94 | 61.21 | 65.62 | 87.24 | 68.05 | 93.40 | 89.90 | 73.03 | 50.86 | 19.62 | 24.17 | 26.42 |
| Step-Video-T2V 81 | 30B | 81.83 | 84.46 | 71.28 | 98.05 | 97.67 | 99.40 | 99.08 | 53.06 | 61.23 | 70.63 | 80.56 | 50.55 | 94.00 | 88.25 | 71.47 | 24.38 | 23.17 | 26.01 | 27.12 |
| Gen-3 82 | - | 82.32 | 84.11 | 75.17 | 97.10 | 96.62 | 98.61 | 99.23 | 60.14 | 63.34 | 66.82 | 87.81 | 53.64 | 96.40 | 80.90 | 65.09 | 54.57 | 24.31 | 24.71 | 26.69 |
| Emu3 37 | 8B | 80.96 | - | - | 95.32 | 97.69 | - | 98.93 | 79.27 | 59.64 | - | 86.17 | 44.64 | 77.71 | - | 68.73 | 37.11 | 20.92 | - | - |
| VILA-U 38 | 7B | 74.01 | 76.26 | 65.04 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| HaploOmni 83 | 9B | 78.10 | - | - | 96.40 | 97.60 | - | 96.80 | 65.30 | - | - | - | - | - | - | - | 34.60 | - | - | - |
| Show-o2 (Ours) | 2B | 81.34 | 82.10 | 78.31 | 97.28 | 96.78 | 97.68 | 98.25 | 40.83 | 65.15 | 67.06 | 94.81 | 76.01 | 95.20 | 80.89 | 62.61 | 57.67 | 23.29 | 25.27 | 27.00 |

Table 9: Comparison with image-to-video models on the VBench 71 benchmark.

| Models | I2V Subject | I2V Background | Camera Motion | Subject Consistency | Background Consistency | Temporal Flickering | Motion Smoothness | Dynamic Degree | Aesthetic Quality | Imaging Quality |
| DynamiCrafter-1024 84 | 96.71 | 96.05 | 35.44 | 95.69 | 97.38 | 97.63 | 97.38 | 47.40 | 66.46 | 69.34 |
| SEINE-512x320 85 | 94.85 | 94.02 | 23.36 | 94.20 | 97.26 | 96.72 | 96.68 | 34.31 | 58.42 | 70.97 |
| I2VGen-XL 86 | 96.74 | 95.44 | 13.32 | 96.36 | 97.93 | 98.48 | 98.31 | 24.96 | 65.33 | 69.85 |
| Animate-Anything 87 | 98.54 | 96.88 | 12.56 | 98.90 | 98.19 | 98.14 | 98.61 | 2.68 | 67.12 | 72.09 |
| ConsistI2V 88 | 94.69 | 94.57 | 33.60 | 95.27 | 98.28 | 97.56 | 97.38 | 18.62 | 59.00 | 66.92 |
| VideoCrafter-I2V 89 | 90.97 | 90.51 | 33.58 | 97.86 | 98.79 | 98.19 | 98.00 | 22.60 | 60.78 | 71.68 |
| SVD-XT-1.1 90 | 97.51 | 97.62 | - | 95.42 | 96.77 | 99.17 | 98.12 | 43.17 | 60.23 | 70.23 |
| MarDini 35 | 98.78 | 96.46 | - | - | - | - | - | - | - | - |
| Show-o2 (Ours) | 96.94 | 98.83 | 28.41 | 93.83 | 97.45 | - | 97.76 | 25.85 | 61.92 | 69.87 |

4.5 Ablation Studies

Table 10: Impact of spatial (-temporal) fusion.

| | MME | GQA | POPE | FID-5K |
| w/o Fusion | 1164.7 | 56.2 | 82.6 | 21.8 |
| w/ Fusion | 1187.8 | 57.6 | 82.6 | 20.5 |

We show the pilot study results in Table 10, which validate the effect of spatial (-temporal) fusion on multimodal understanding and generation performance. For efficiency, we adopt LLaMA-3.2-1B as the base language model and use only around 1M multimodal understanding samples and the ImageNet-1K generation data 91. Under the same training settings, fusion brings improvements on both multimodal understanding and generation metrics, including MME-p, GQA, and FID-5K. This validates that combining semantic and low-level features in the fusion mechanism benefits both multimodal understanding and generation capabilities to some extent.

Table 11: Effect of CFG guidance and inference steps.

| CFG guidance | Inference steps | GenEval | DPG-Bench |
| 2.5 | 50 | 0.65 | 81.6 |
| 5.0 | 50 | 0.71 | 83.9 |
| 7.5 | 50 | 0.71 | 84.8 |
| 10 | 50 | 0.71 | 85.0 |
| 7.5 | 25 | 0.71 | 84.6 |
| 7.5 | 100 | 0.73 | 84.7 |

We perform ablation studies to examine the effect of classifier-free guidance (CFG) and the number of inference steps using the 1.5B model. As shown in Table 11, increasing the CFG guidance scale and the number of inference steps (within a range) improves the GenEval and DPG-Bench scores. However, the improvement in the GenEval score is marginal once the CFG guidance scale exceeds 5.0.
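For reference, CFG for the flow head combines conditional and unconditional velocity predictions at every inference step; a sketch (attribute names assumed) is:

```python
def cfg_velocity(model, x_t, t, cond, scale):
    """Classifier-free guidance for flow matching: extrapolate from the
    unconditional velocity toward the conditional one by `scale`."""
    v_cond = model.flow_head(x_t, t, cond)               # caption-conditioned
    v_uncond = model.flow_head(x_t, t, model.null_cond)  # dropped caption
    return v_uncond + scale * (v_cond - v_uncond)
```

Training-time caption dropping (probability 0.1, as noted in Section 4.1) is what makes the unconditional branch available.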

Table 12: Effect of training stages.

| Stage-1 | Stage-2 | GenEval | DPG-Bench |
| ✓ | ✗ | 0.63 | 83.28 |
| ✓ | ✓ | 0.73 | 84.70 |

Table 12 provides the effect of training stages on the generation performance on the GenEval and DPG-Bench benchmarks. One can observe that stage-2 training consistently and significantly improves both metrics, which validates the importance of the second stage.

Table 13: Impact of the training recipe on text-only performance. One-stage training denotes full-parameter co-training on image-text pairs and the text-only RefinedWeb 92 data. Note that the curated multimodal understanding data contains text-only instructional data. We perform the evaluation under the same setting using the lm-evaluation-harness tool.

| Models | # Params. | Training Recipe | MMLU | GPQA | GSM8K | HumanEval |
| Qwen2.5 Instruct 2 | 1.5B | - | 60.20 ± 0.39 | 28.12 ± 2.13 | 51.86 ± 1.38 | 35.37 ± 3.74 |
| Show-o2 (Ours) | 1.5B | One-stage training with RefinedWeb | 28.25 ± 0.38 | 25.00 ± 2.05 | 4.55 ± 0.57 | 3.05 ± 1.35 |
| Show-o2 (Ours) | 1.5B | Our two-stage training | 56.75 ± 1.37 | 29.24 ± 2.15 | 49.43 ± 1.38 | 35.54 ± 3.70 |
| Qwen2.5 Instruct 2 | 7B | - | 71.75 ± 0.36 | 32.37 ± 2.21 | 82.49 ± 1.05 | 65.24 ± 3.73 |
| Show-o2 (Ours) | 7B | One-stage training with RefinedWeb | 28.43 ± 0.21 | 26.34 ± 2.08 | 1.52 ± 0.34 | 4.01 ± 1.25 |
| Show-o2 (Ours) | 7B | Our two-stage training | 70.73 ± 0.36 | 31.47 ± 2.22 | 75.28 ± 1.19 | 70.73 ± 3.56 |

Table 13 shows that our models effectively preserve language knowledge and achieve performance comparable to the original Qwen2.5-1.5B and Qwen2.5-7B Instruct models. In contrast, direct one-stage full-parameter-co-training with textual data such as RefinedWeb results in substantial performance degradation, highlighting the necessity of the two-stage training approach when high-quality corpora are unavailable.

Table 14: Impact of image token count on chart, text, and document VQA.

| Models | # Params. | # Image tokens | ChartQA | DocVQA | InfoVQA | TextVQA |
| LLaVA-OV | 7B | 729 | 56.24 | 62.71 | 39.59 | 66.19 |
| Show-o2 | 7B | 729 | 48.00 | 59.34 | 42.31 | 62.92 |
| Show-o2 | 7B | 5 × 729 | 66.92 | 77.26 | 45.80 | 71.54 |

As shown in Table 14, our ablation study reveals that increasing the number of image tokens significantly boosts performance across all tasks, even though the model was trained with a fixed image resolution. Using the AnyRes strategy at inference time consistently improves results, highlighting the benefit of higher token counts for capturing fine-grained details. When compared to the baseline LLaVA-OV-7B, our model achieves comparable results on DocVQA, InfoVQA, and TextVQA validation sets, but underperforms on ChartQA. We attribute this gap to the limited chart-related data available during semantic layer distillation, which constrains the model’s ability to capture chart-specific information. We believe that incorporating more OCR and document-centric data into the distillation process will further strengthen the unified model’s OCR and document understanding capabilities.

5 Limitations and Broader Impacts

We find that our model is not good at rendering text on images. Investigating our generation datasets, we observed that the proportion of images with rendered text is relatively small, which likely leads to poor text rendering. In addition, generated images can lack detail in small objects because of the limited image resolution. To address these limitations, as outlined in the implementation details, we have enhanced the model by training it on higher-resolution data and incorporating image datasets rich in textual information.

Our models possess the ability to generate text and images, which may carry the risk of unintended misuse, such as creating fake information or profiles. Additionally, our large-scale dataset includes content featuring celebrities and copyrighted materials, which could potentially result in intellectual property infringement.

6 Conclusion

This paper proposed native unified multimodal models, i.e., Show-o2, scalable across multimodal understanding and generation as well as image and video modalities, by integrating a 3D causal VAE, autoregressive modeling, and flow matching. A dual-path of spatial (-temporal) fusion mechanism guided the construction of unified visual representations with both high- and low-level features. A two-stage training recipe enabled effective learning of unified capabilities, resulting in a versatile model capable of handling diverse tasks, including multimodal understanding and image/video generation. Extensive experiments demonstrated the model’s state-of-the-art performance across various benchmarks.

Acknowledgments and Disclosure of Funding

We thank Haozhe Liu for his valuable input and discussions throughout this project. We are also grateful to Meng Wei and Weihao Wang for their assistance in preparing and organizing the datasets for image and video generation.

Appendix A Technical Appendices and Supplementary Material

Table 15: Comparative analysis of selected unified multimodal models based on the utilization of visual representations and the type of unified modeling for multimodal understanding and generation. In this context, native und. & gen. refers to the direct decoding of output sequences into texts, images, and videos, as opposed to serving as conditions for decoding using external pre-trained decoders like Stable Diffusion. * indicates that the method uses two distinct models for multimodal understanding and generation, respectively.

| Methods | Paradigm |
| Chameleon 102 | AR |
| Show-o 128 | AR + Diff. |
| Transfusion 147 | AR + Diff. |
| VILA-U 123 | AR |
| Emu3 114 | AR |
| MonoFormer 146 | AR + Diff. |
| Dual-Diffusion 63 | Diff. |
| SynerGen-VL 58 | AR |
| MMAR 134 | AR + MAR |
| MUSE-VL 129 | AR |
| Orthus 53 | AR + Diff. |
| Liquid 118 | AR |
| LlamaFusion 95 | AR + Diff. |
| UGen 99 | AR |
| UniDisc 98 | Diff. |
| UniToken 50 | AR |
| Harmon 122 | AR + MAR |
| DualToken 96 | AR |
| UniTok 77 | AR |
| Selftok 110 | AR |
| Muddit 94 | Diff. |
| MMaDA 135 | Diff. |
| HaploOmni 124 | AR + Diff. |
| TokLIP 68 | AR |
| Show-o2 (Ours) | AR + Diff. |
| Janus-Series 26 79 27 | AR (+Diff.) |
| VARGPT 148 | AR |
| UniFluid 38 | AR + MAR |
| OmniMamba 149 | AR |
| Mogao 65 | AR + Diff. |
| BAGEL 32 | AR + Diff. |
| Fudoki 112 | Diff. |
| UniGen 104 | AR + Diff. |
| NExT-GPT 120 | AR + Diff. |
| CoDI 101 | AR + Diff. |
| DreamLLM 36 | AR + Diff. |
| SEED-X 40 | AR + Diff. |
| MIO 116 | AR + Diff. |
| CoDI-2 100 | AR + Diff. |
| MetaMorph 106 | AR + Diff. |
| ILLUME 111 | AR + Diff. |
| ILLUME+ 46 | AR + Diff. |
| MetaQueries 83 | AR + Diff. |
| Nexus-Gen 141 | AR + Diff. |
| Ming-Lite-Uni 42 | AR + Diff. |
| BLIP3-o 18 | AR + Diff. |
| OpenUni 121 | AR + Diff. |
| UniWorld 67 | AR + Diff. |
| Ming-Omni 5 | AR + Diff. |
| Pisces 132 | AR + Diff. |
| TokenFlow 89 | AR |
| SemHiTok 28 | AR |

A.1 More Qualitative Results


Figure 3: Text-to-video and image-to-video generation examples.

A.2 Text Prompts

We provide the text prompts for image generation used in Fig. 2 below:

“Hyper-detailed image of a mature man with short, graying hair and deep blue eyes. He has a rugged, weathered face with a strong jawline and a slight beard. His expression is thoughtful and introspective. The lighting is dramatic, highlighting the contours of his face. The photo is in 8K resolution, capturing every wrinkle and pore. ”

“A soft, natural portrait photograph captures a young woman with fair skin and long, ash-blonde hair cascading gently over her shoulders, her striking light blue eyes subtly enhanced with natural makeup and a gentle, calm smile playing on her lips. She wears a cozy, cream-colored winter sweater and a delicate woolen scarf adorned with subtle snowflake patterns, positioned slightly off-center, creating a sense of relaxed elegance. Behind her, a softly blurred snowy Moscow street scene unfolds, with traditional architecture and the diffused, golden glow of a winter afternoon contributing to a serene and contemplative atmosphere. At the very bottom of the frame, in simple, elegant lettering, appears the phrase “BE KIND”. ”

“A vibrant, highly detailed close-up of a colorful parrot perched on a branch, featuring intricate feather textures, vivid colors (red, blue, green, yellow), and a tropical rainforest background. The parrot’s eyes are sharp and expressive, with a natural glint of light. The image is photorealistic, ultra HD (8K resolution), with soft natural lighting and a shallow depth of field, creating a blurred bokeh effect in the background. The scene is peaceful and lush, showcasing the beauty of nature. ”

“A dark, moody room with a glowing neon sign on the wall that spells out ’SHOW O2’ in bold, vibrant pink and blue colors. The neon light reflects softly on the polished concrete floor, creating a futuristic and artistic vibe. ”

References

  1. Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. CoRR, abs/2302.13971, 2023.

  2. An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu Xia, Xingzhang Ren, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yu Wan, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, and Zihan Qiu. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.

  3. Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025.

  4. Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024.

  5. Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024.

  6. Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In ICML, 2024.

  7. Enze Xie, Junsong Chen, Yuyang Zhao, Jincheng Yu, Ligeng Zhu, Yujun Lin, Zhekai Zhang, Muyang Li, Junyu Chen, Han Cai, et al. Sana 1.5: Efficient scaling of training-time and inference-time compute in linear diffusion transformer, 2025.

  8. Qi Qin, Le Zhuo, Yi Xin, Ruoyi Du, Zhen Li, Bin Fu, Yiting Lu, Xinyue Li, Dongyang Liu, Xiangyang Zhu, Will Beddow, Erwann Millon, Wenhai Wang Victor Perez, Yu Qiao, Bo Zhang, Xiaohong Liu, Hongsheng Li, Chang Xu, and Peng Gao. Lumina-image 2.0: A unified and efficient image generative framework, 2025.

  9. Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua. Next-gpt: Any-to-any multimodal llm. arXiv preprint arXiv:2309.05519, 2023.

  10. Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818, 2024.

  11. Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation. In ICLR, 2025.

  12. Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, and Omer Levy. Transfusion: Predict the next token and diffuse images with one multi-modal model. In ICLR, 2025.

  13. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763, 2021.

  14. Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, Tianxing Wang, Tianyi Gui, Tingyu Weng, Tong Shen, Wei Lin, Wei Wang, Wei Wang, Wenmeng Zhou, Wente Wang, Wenting Shen, Wenyuan Yu, Xianzhong Shi, Xiaoming Huang, Xin Xu, Yan Kou, Yangyu Lv, Yifei Li, Yijing Liu, Yiming Wang, Yingya Zhang, Yitong Huang, Yong Li, You Wu, Yu Liu, Yulin Pan, Yun Zheng, Yuntao Hong, Yupeng Shi, Yutong Feng, Zeyinzi Jiang, Zhen Han, Zhi-Fan Wu, and Ziyu Liu. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.

  15. Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. NeurIPS, 36, 2024.

  16. Haiwen Diao, Yufeng Cui, Xiaotong Li, Yueze Wang, Huchuan Lu, and Xinlong Wang. Unveiling encoder-free vision-language models. arXiv preprint arXiv:2406.11832, 2024.

  17. Haiwen Diao, Xiaotong Li, Yufeng Cui, Yueze Wang, Haoge Deng, Ting Pan, Wenxuan Wang, Huchuan Lu, and Xinlong Wang. Evev2: Improved baselines for encoder-free vision-language models. arXiv preprint arXiv:2502.06788, 2025.

  18. Lin Chen, Jisong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793, 2023.

  19. Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, Austin Wang, Rob Fergus, Yann LeCun, and Saining Xie. Cambrian-1: A fully open, vision-centric exploration of multimodal llms, 2024.

  20. Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, pages 10684–10695, 2022.

  21. William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, pages 4195–4205, 2023.

  22. Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. All are worth words: A vit backbone for diffusion models. In CVPR, 2023.

  23. Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Zhongdao Wang, James T. Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. PixArt-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. In ICLR. OpenReview.net, 2024.

  24. Jinheng Xie, Yuexiang Li, Yawen Huang, Haozhe Liu, Wentian Zhang, Yefeng Zheng, and Mike Zheng Shou. Boxdiff: Text-to-image synthesis with training-free box-constrained diffusion. In ICCV, pages 7452–7461, 2023.

  25. Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In ICCV, 2023.

  26. Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2023.

  27. David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu, Difei Gao, and Mike Zheng Shou. Show-1: Marrying pixel and latent diffusion models for text-to-video generation. arXiv preprint arXiv:2309.15818, 2023.

  28. Team Seawead, Ceyuan Yang, Zhijie Lin, Yang Zhao, Shanchuan Lin, Zhibei Ma, Haoyuan Guo, Hao Chen, Lu Qi, Sen Wang, et al. Seaweed-7b: Cost-effective training of video generation foundation model. arXiv preprint arXiv:2504.08685, 2025.

  29. Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525, 2024.

  30. Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Rachel Hornung, Hartwig Adam, Hassan Akbari, Yair Alon, Vighnesh Birodkar, et al. Videopoet: A large language model for zero-shot video generation. arXiv preprint arXiv:2312.14125, 2023.

  31. Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In ICML, pages 1691–1703, 2020.

  32. Ziqi Pang, Tianyuan Zhang, Fujun Luan, Yunze Man, Hao Tan, Kai Zhang, William T. Freeman, and Yu-Xiong Wang. Randar: Decoder-only autoregressive visual generation in random orders. arXiv preprint arXiv:2412.01827, 2024.

  33. Haopeng Li, Jinyue Yang, Guoqi Li, and Huan Wang. Autoregressive image generation with randomized parallel decoding. arXiv preprint arXiv:2503.10568, 2025.

  34. Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization. arXiv preprint arXiv:2406.11838, 2024.

  35. Haozhe Liu, Shikun Liu, Zijian Zhou, Mengmeng Xu, Yanping Xie, Xiao Han, Juan C. Pérez, Ding Liu, Kumara Kahatapitiya, Menglin Jia, Jui-Chieh Wu, Sen He, Tao Xiang, Jürgen Schmidhuber, and Juan-Manuel Pérez-Rúa. Mardini: Masked autoregressive diffusion for video generation at scale. arXiv preprint arXiv:2410.20280, 2024.

  36. Lijie Fan, Luming Tang, Siyang Qin, Tianhong Li, Xuan Yang, Siyuan Qiao, Andreas Steiner, Chen Sun, Yuanzhen Li, Tao Zhu, et al. Unified autoregressive visual generation and understanding with continuous tokens. arXiv preprint arXiv:2503.13436, 2025.

  37. Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024.

  38. Yecheng Wu, Zhuoyang Zhang, Junyu Chen, Haotian Tang, Dacheng Li, Yunhao Fang, Ligeng Zhu, Enze Xie, Hongxu Yin, Li Yi, et al. Vila-u: a unified foundation model integrating visual understanding and generation. arXiv preprint arXiv:2409.04429, 2024.

  39. Chuofan Ma, Yi Jiang, Junfeng Wu, Jihan Yang, Xin Yu, Zehuan Yuan, Bingyue Peng, and Xiaojuan Qi. Unitok: A unified tokenizer for visual generation and understanding. arXiv preprint arXiv:2502.20321, 2025.

  40. Yang Jiao, Haibo Qiu, Zequn Jie, Shaoxiang Chen, Jingjing Chen, Lin Ma, and Yu-Gang Jiang. Unitoken: Harmonizing multimodal understanding and generation through unified visual encoding. arXiv preprint arXiv:2504.04423, 2025.

  41. Zisheng Chen, Chunwei Wang, Xiuwei Chen, Hang Xu, Jianhua Han, and Xiaodan Liang. Semhitok: A unified image tokenizer via semantic-guided hierarchical codebook for multimodal understanding and generation. arXiv preprint arXiv:2503.06764, 2025.

  42. Wei Song, Yuran Wang, Zijia Song, Yadong Li, Haoze Sun, Weipeng Chen, Zenan Zhou, Jianhua Xu, Jiaqi Wang, and Kaicheng Yu. Dualtoken: Towards unifying visual understanding and generation with dual visual vocabularies. arXiv preprint arXiv:2503.14324, 2025.

  43. Zineng Tang, Ziyi Yang, Chenguang Zhu, Michael Zeng, and Mohit Bansal. Any-to-any generation via composable diffusion. NeurIPS, 36, 2024.

  44. Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, and Aniruddha Kembhavi. Unified-io 2: Scaling autoregressive multimodal models with vision, language, audio, and action. arXiv preprint arXiv:2312.17172, 2023.

  45. Runpei Dong, Chunrui Han, Yuang Peng, Zekun Qi, Zheng Ge, Jinrong Yang, Liang Zhao, Jianjian Sun, Hongyu Zhou, Haoran Wei, Xiangwen Kong, Xiangyu Zhang, Kaisheng Ma, and Li Yi. DreamLLM: Synergistic multimodal comprehension and creation. In ICLR, 2024.

  46. Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, and Ying Shan. Seed-x: Multimodal models with unified multi-granularity comprehension and generation. arXiv preprint arXiv:2404.14396, 2024.

  47. Shengbang Tong, David Fan, Jiachen Zhu, Yunyang Xiong, Xinlei Chen, Koustuv Sinha, Michael Rabbat, Yann LeCun, Saining Xie, and Zhuang Liu. Metamorph: Multimodal understanding and generation via instruction tuning. arXiv preprint arXiv:2412.14164, 2024.

  48. Xichen Pan, Satya Narayan Shukla, Aashu Singh, Zhuokai Zhao, Shlok Kumar Mishra, Jialiang Wang, Zhiyang Xu, Jiuhai Chen, Kunpeng Li, Felix Juefei-Xu, Ji Hou, and Saining Xie. Transfer between modalities with metaqueries. arXiv preprint arXiv:2504.06256, 2025.

  49. Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, et al. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset. arXiv preprint arXiv:2505.09568, 2025.

  50. Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training, 2023.

  51. Biao Zhang and Rico Sennrich. Root mean square layer normalization. In NeurIPS, 2019.

  52. Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022.

  53. William Peebles and Saining Xie. Scalable diffusion models with transformers. arXiv preprint arXiv:2212.09748, 2022.

  54. Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811, 2025.

  55. Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In CVPR, pages 3558–3568, 2021.

  56. Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim. Coyo-700m: Image-text pair dataset. https://github.com/kakaobrain/coyo-dataset, 2022.

  57. Xiaotong Li, Fan Zhang, Haiwen Diao, Yueze Wang, Xinlong Wang, and Ling-Yu Duan. Densefusion-1m: Merging vision experts for comprehensive multimodal perception. arXiv preprint arXiv:2407.08303, 2024.

  58. Alex Jinpeng Wang, Dongxing Mao, Jiawei Zhang, Weiming Han, Zhuobai Dong, Linjie Li, Yiqi Lin, Zhengyuan Yang, Libo Qin, Fuwei Zhang, et al. Textatlas5m: A large-scale dataset for dense text image generation. arXiv preprint arXiv:2502.07870, 2025.

  59. Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713, 2024.

  60. Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, and Rongrong Ji. MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023.

  61. Drew A. Hudson and Christopher D. Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, pages 6700–6709, 2019.

  62. Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023.

  63. Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? In ECCV, pages 216–233. Springer, 2024.

  64. Xiang Yue, Yuansheng Ni, Tianyu Zheng, Kai Zhang, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. In CVPR, pages 9556–9567, 2024.

  65. Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models? arXiv preprint arXiv:2403.20330, 2024.

  66. Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In ECCV, pages 235–251. Springer, 2016.

  67. Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. In NeurIPS, 2023.

  68. Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment, 2024.

  69. Jingjing Chang, Yixiao Fang, Peng Xing, Shuhan Wu, Wei Cheng, Rui Wang, Xianfang Zeng, Gang Yu, and Hai-Bao Chen. Oneig-bench: Omni-dimensional nuanced evaluation for image generation. arXiv preprint arXiv:2506.07977, 2025.

  70. Ting-Hao K. Huang, Francis Ferraro, Nasrin Mostafazadeh, Ishan Misra, Jacob Devlin, Aishwarya Agrawal, Ross Girshick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, et al. Visual storytelling. In NAACL, 2016.

  71. Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench: Comprehensive benchmark suite for video generative models. In CVPR, 2024.

  72. Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023.

  73. Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, et al. Lavie: High-quality video generation with cascaded latent diffusion models. IJCV, 2024.

  74. Bin Lin, Yunyang Ge, Xinhua Cheng, Zongjian Li, Bin Zhu, Shaodong Wang, Xianyi He, Yang Ye, Shenghai Yuan, Liuhan Chen, et al. Open-sora plan: Open-source large video generation model. arXiv preprint arXiv:2412.00131, 2024.

  75. Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. International Conference on Learning Representations, 2024.

  76. Gen-2. Accessed September 25, 2023 [Online] https://research.runwayml.com/gen2, 2023.

  77. Pika 1.0. Accessed December 28, 2023 [Online] https://www.pika.art/, 2023.

  78. Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models, 2024.

  79. Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024.

  80. Kling. Accessed June 6, 2024 [Online] https://klingai.kuaishou.com/, 2024.

  81. Guoqing Ma, Haoyang Huang, Kun Yan, Liangyu Chen, Nan Duan, Shengming Yin, Changyi Wan, Ranchen Ming, Xiaoniu Song, Xing Chen, Yu Zhou, Deshan Sun, Deyu Zhou, Jian Zhou, Kaijun Tan, Kang An, Mei Chen, Wei Ji, Qiling Wu, Wen Sun, Xin Han, Yanan Wei, Zheng Ge, Aojie Li, Bin Wang, Bizhu Huang, Bo Wang, Brian Li, Changxing Miao, Chen Xu, Chenfei Wu, Chenguang Yu, Dapeng Shi, Dingyuan Hu, Enle Liu, Gang Yu, Ge Yang, Guanzhe Huang, Gulin Yan, Haiyang Feng, Hao Nie, Haonan Jia, Hanpeng Hu, Hanqi Chen, Haolong Yan, Heng Wang, Hongcheng Guo, Huilin Xiong, Huixin Xiong, Jiahao Gong, Jianchang Wu, Jiaoren Wu, Jie Wu, Jie Yang, Jiashuai Liu, Jiashuo Li, Jingyang Zhang, Junjing Guo, Junzhe Lin, Kaixiang Li, Lei Liu, Lei Xia, Liang Zhao, Liguo Tan, Liwen Huang, Liying Shi, Ming Li, Mingliang Li, Muhua Cheng, Na Wang, Qiaohui Chen, Qinglin He, Qiuyan Liang, Quan Sun, Ran Sun, Rui Wang, Shaoliang Pang, Shiliang Yang, Sitong Liu, Siqi Liu, Shuli Gao, Tiancheng Cao, Tianyu Wang, Weipeng Ming, Wenqing He, Xu Zhao, Xuelin Zhang, Xianfang Zeng, Xiaojia Liu, Xuan Yang, Yaqi Dai, Yanbo Yu, Yang Li, Yineng Deng, Yingming Wang, Yilei Wang, Yuanwei Lu, Yu Chen, Yu Luo, Yuchu Luo, Yuhe Yin, Yuheng Feng, Yuxiang Yang, Zecheng Tang, Zekai Zhang, Zidong Yang, Binxing Jiao, Jiansheng Chen, Jing Li, Shuchang Zhou, Xiangyu Zhang, Xinhao Zhang, Yibo Zhu, Heung-Yeung Shum, and Daxin Jiang. Step-video-t2v technical report: The practice, challenges, and future of video foundation model, 2025.

  82. Gen-3. Accessed June 17, 2024 [Online] https://runwayml.com/research/introducing-gen-3-alpha, 2024.

  83. Yicheng Xiao, Lin Song, Rui Yang, Cheng Cheng, Zunnan Xu, Zhaoyang Zhang, Yixiao Ge, Xiu Li, and Ying Shan. Haploomni: Unified single transformer for multimodal video understanding and generation. arXiv preprint arXiv:2506.02975, 2025.

  84. Jinbo Xing, Menghan Xia, Yong Zhang, Haoxin Chen, Xintao Wang, Tien-Tsin Wong, and Ying Shan. Dynamicrafter: Animating open-domain images with video diffusion priors. arXiv preprint arXiv:2310.12190, 2023.

  85. Xinyuan Chen, Yaohui Wang, Lingjun Zhang, Shaobin Zhuang, Xin Ma, Jiashuo Yu, Yali Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Seine: Short-to-long video diffusion model for generative transition and prediction. arXiv preprint arXiv:2310.20700, 2023.

  86. Shiwei Zhang, Jiayu Wang, Yingya Zhang, Kang Zhao, Hangjie Yuan, Zhiwu Qing, Xiang Wang, Deli Zhao, and Jingren Zhou. I2vgen-xl: High-quality image-to-video synthesis via cascaded diffusion models. arXiv preprint arXiv:2311.04145, 2023.

  87. Zuozhuo Dai, Zhenghao Zhang, Yao Yao, Bingxue Qiu, Siyu Zhu, Long Qin, and Weizhi Wang. Fine-grained open domain image animation with motion guidance. arXiv preprint arXiv:2311.12886, 2023.

  88. Weiming Ren, Harry Yang, Ge Zhang, Cong Wei, Xinrun Du, Stephen Huang, and Wenhu Chen. Consisti2v: Enhancing visual consistency for image-to-video generation. arXiv preprint arXiv:2402.04324, 2024.

  89. Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter1: Open diffusion models for high-quality video generation. arXiv preprint arXiv:2310.19512, 2023.

  90. Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.

  91. Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.

  92. Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Hamza Alobeidli, Alessandro Cappelli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The refinedweb dataset for falcon LLM: Outperforming curated corpora with web data only. In NeurIPS, 2023.