Jinheng Xie 1 Zhenheng Yang 2 Mike Zheng Shou 1∗

1 Show Lab, National University of Singapore  2 ByteDance

Abstract

This paper presents improved native unified multimodal models, i.e., Show-o2, that leverage autoregressive modeling and flow matching. Built upon a 3D causal variational autoencoder space, unified visual representations are constructed through a dual-path of spatial (-temporal) fusion, enabling scalability across image and video modalities while ensuring effective multimodal understanding and generation. Based on a language model, autoregressive modeling and flow matching are natively applied to the language head and flow head, respectively, to facilitate text token prediction and image/video generation. A two-stage training recipe is designed to effectively learn and scale to larger models. The resulting Show-o2 models demonstrate versatility in handling a wide range of multimodal understanding and generation tasks across diverse modalities, including text, images, and videos. Code and models are released at https://github.com/showlab/Show-o.

1 Introduction

Large language models (LLMs) 1 2 have achieved unprecedented performance levels, fueled by extensive web-scale text resources, substantial computational power, and billions of parameters. In the multimodal domain, large multimodal models (LMMs) 3 4 5 and visual generative models 6 7 8, have also demonstrated exceptional capabilities in tasks such as general-purpose visual question answering and text-to-image/video generation. Given their success, unified multimodal models (UMMs) 9 10 11 have been investigated to unify multimodal understanding and generation within a single model or system. In addition to multimodal understanding capability, this line of approaches seeks to simultaneously cultivate multimodal understanding and generation abilities in the model/system through pre-training, fine-tuning, or connecting tailored models.

Here, we provide a comparative analysis of selected UMMs in Table 1, focusing on two perspectives: i) the visual representations used for understanding and generation, and ii) the type of unified modeling. Generally, there are two approaches to incorporating visual representations for multimodal understanding and generation: i) a unified representation for both understanding and generation, as seen in works like Chameleon 10, Transfusion 12, and Show-o 11; and ii) decoupled representations, utilizing CLIP 13 for multimodal understanding and a variational autoencoder (VAE) for visual generation. To equip models with both multimodal understanding and generation capabilities, two primary methods have been explored: i) natively applying multimodal understanding and generation objectives within a single model, and ii) tuning adapters to assemble tailored models. We refer to the first type as native unified multimodal models, distinguishing it from the second type that assembles tailored models. These principles, combined with autoregressive modeling, diffusion modeling, or both, contribute to the development of unified multimodal models.

Compared to existing UMMs that primarily focus on text and image, our approach explores model designs that offer substantial potential and scalability for natively unifying text, image, and video modalities. An overview of our approach is presented in Fig. 1. Specifically, for visual inputs, we operate within the 3D causal VAE 14 space, which is capable of accommodating both images and videos. Recognizing the distinct feature dependencies between multimodal understanding and generation, we construct unified visual representations that simultaneously capture rich semantic information and low-level features with intrinsic structures and textural details from the visual latents. This is achieved through a dual-path mechanism consisting of semantic layers, a projector, and a spatial (-temporal) fusion process. As the fusion process occurs within the 3D causal VAE space, when it comes to videos, semantic and low-level features are temporally aligned and fused with full-frame video information.

Text embeddings and unified visual representations are structured into a sequence to go through a pre-trained language model and are modeled by a specific language head and flow head, respectively. Specifically, autoregressive modeling with causal attention is performed on the language head when dealing with text token prediction, and flow matching with full attention is applied to the flow head for image/video generation. Since the base language model lacks visual generation capabilities, we propose a two-stage training recipe to effectively learn such an ability while retaining the language knowledge, without requiring a massive text corpus. In the first stage, we mainly focus on pre-training the flow head for visual generation using (interleaved) text, image, and video data. In the second stage, the full model is fine-tuned with high-quality multimodal understanding and generation data.

Extensive experimental results have demonstrated that our model surpasses the existing methods in terms of most metrics across multimodal understanding and visual generation benchmarks. Collectively, the main contributions of this paper can be summarized as:

  • We present an improved native unified multimodal model that seamlessly integrates autoregressive modeling and flow matching, enabling a wide range of multimodal understanding and generation across (interleaved) text, images, and videos.
  • Based on the 3D causal VAE space, we construct unified visual representations that scale across both multimodal understanding and generation as well as image and video modalities, by combining semantic and low-level features through a dual-path of spatial (-temporal) fusion mechanism.
  • We design a two-stage training pipeline that effectively and efficiently learns unified multimodal models, retaining language knowledge and enabling effective scaling up to larger models, without requiring a massive text corpus.
  • The proposed model demonstrates state-of-the-art performance on multimodal understanding and visual generation benchmarks, surpassing existing methods across various metrics.

Table 1: Comparative analysis of selected unified multimodal models based on the type of visual representations and unified modeling for multimodal understanding and generation. In this context, native und. & gen. refers to the direct decoding of output sequences into texts, images, and videos, as opposed to serving as conditions for decoding using external pre-trained decoders like Stable Diffusion. * indicates that the method adopts two distinct models for multimodal understanding and generation, respectively. Diff. denotes diffusion modeling. Please refer to the complete table in the appendix.

| Methods | Paradigm |
| Chameleon 102 | AR |
| Transfusion 147 | AR + Diff. |
| Show-o 128 | AR + Diff. |
| VILA-U 123 | AR |
| Emu3 114 | AR |
| LlamaFusion 95 | AR + Diff. |
| Show-o2 (Ours) | AR + Diff. |
| Janus-Series 26 79 27 | AR (+Diff.) |
| UniFluid 38 | AR + MAR |
| Mogao 65 | AR + Diff. |
| BAGEL 32 | AR + Diff. |
| NExT-GPT 120 | AR + Diff. |
| SEED-X 40 | AR + Diff. |
| ILLUME 111 | AR + Diff. |
| MetaMorph 106 | AR + Diff. |
| MetaQueries 83 | AR + Diff. |
| TokenFlow 89 | AR |

2 Related Work

2.1 Large Multimodal Models

Building upon the advancements of large language models (LLMs) 1 2, large multimodal models (LMMs) 15 5 4 3 have showcased remarkable capabilities in general-purpose visual question answering. These approaches typically leverage pre-trained vision encoders to project visual features and align them within the embedding space of LLMs. Meanwhile, a growing number of encoder-free LMMs 11 16 17 aim to directly align raw visual features within the LLM embedding space. However, these encoder-free methods often fall behind models that utilize image-text-aligned visual features in terms of performance. Beyond model architecture, recent studies 18 19 4 have highlighted the critical role of high-quality instructional data in enhancing multimodal capabilities.

2.2 Visual Generative Models

Two prominent paradigms for visual generation, namely diffusion 20 21 22 23 24 25 26 7 8 27 28 and autoregressive modeling 29 30 31 32 33, have been extensively studied in image and video generation in recent years. Diffusion-based methods typically employ optimized architectures that integrate pre-trained text encoders with denoising networks. In contrast, autoregressive methods often utilize LLM-based architectures and are trained through next-token prediction. Recently, several studies 34 35 36 have explored hybrid approaches that combine diffusion and autoregressive modeling to further advance visual generation capabilities.

2.3 Unified Multimodal Models

Building on the success of large multimodal and visual generative models, pioneering unified multimodal models (UMMs) such as Chameleon 10, Show-o 11, and Transfusion 12 aim to integrate these capabilities into a single model through autoregressive or diffusion modeling or both. Further advancements 37 38 39 40 41 42 have focused on optimizing the training pipeline and enhancing the semantics of discrete tokens, leading to improved performance. We refer to these approaches as native unified multimodal models, as they inherently combine multimodal understanding and generation objectives within a unified architecture.

An alternative and promising direction 43 44 45 46 47 48 49 for unifying multimodal understanding and generation involves assembling off-the-shelf specialized LMMs and visual generative models by tuning adapters or learnable tokens. Representative works 9 46 48 49 have demonstrated the promising capabilities and intriguing properties of such assembled unified frameworks, highlighting their potential for further exploration.

3 Methodology

In this section, we introduce the overall framework (Section 3.1), which consists of two key components: i) the design of unified visual representations for multimodal understanding and generation, applicable to both images and videos, and ii) the native learning of multimodal understanding and generation capabilities. Subsequently, we present a two-stage training recipe (Section 3.2), which is designed to progressively learn and effectively scale up the unified multimodal model.

3.1 Overall Framework

Overall Architecture. An overview of our proposed unified model is depicted in Fig. 1. Given (interleaved) texts, images, or videos, a text tokenizer with an embedding layer and a 3D causal VAE encoder accordingly process them into continuous text embeddings and visual latent representations. Subsequently, the visual latent representations undergo a dual-path extraction of spatial (-temporal) fusion to create the unified visual representations. These representations are then structured into a sequence, which is fed into a language model equipped with language and flow heads to model the sequence via autoregressive modeling and flow matching accordingly. Finally, a text de-tokenizer in conjunction with a 3D causal VAE decoder is employed to decode the final output. Next, we will delve into the fundamental design principles behind the unified visual representation and flow head.


Figure 1: Our approach begins by encoding input texts, images, and videos into continuous embeddings and visual latents. The visual latents are processed through a dual-path extraction and spatial (-temporal) fusion mechanism to construct unified visual representations that are scalable for both multimodal understanding and generation, image and video modalities. These text embeddings and unified visual representations are then structured into a sequence for the base language model, equipped with dedicated heads. Specifically, text tokens are modeled autoregressively by a language head, while image and video latents are handled by a flow head using flow matching. We employ the omni-attention mechanism 128 147 to enable causal attention along the sequence while maintaining full attention within the unified visual representations. This design empowers our model to effectively tackle tasks such as image/video understanding, generation, and mixed-modality generation.

Unified Visual Representation. To scalably support image and video modalities, we employ a 3D causal VAE encoder to extract image/video latents. As multimodal understanding and generation differ in feature dependency, we propose a dual-path architecture comprising semantic layers $S(\cdot)$, which extract high-level representations of rich semantic contextual information, and a projector $P(\cdot)$, which retains the complete low-level information from the extracted visual latents. Specifically, the semantic layers share the same vision transformer blocks as SigLIP 50 with a new patch embedding layer. Given visual latents $x_1$ at a noise level $t$:

$$x_t = t x_1 + (1 - t) x_0, \tag{1}$$

where $x_0 \sim \mathcal{N}(0, I)$ and $t \in [0, 1]$, we load the pre-trained weights of SigLIP and pre-distill $S(\cdot)$ as follows:

$$\max \; \cos\big(S(x_t), F(X)\big), \tag{2}$$

where $X$ is the input image, $F(\cdot)$ extracts the image patch features, and $\cos(\cdot,\cdot)$ indicates the cosine similarity calculator. In this way, the semantic layers can mimic extracting semantic features from both clean and noised visual latents $x_t$. The projector $P(\cdot)$ is simply composed of a 2D patch embedding layer. The extracted high- and low-level representations are spatially (and temporally, when it comes to videos) fused by concatenating them along the feature dimension and applying RMSNorm 51 with two MLP layers to obtain the unified visual representations $u$:

$$u = \mathrm{STF}\big(S(x_t), P(x_t)\big) = \mathrm{MLP}\big(\mathrm{RMSNorm}([S(x_t); P(x_t)])\big), \tag{3}$$

where STF indicates the spatial (-temporal) fusion mechanism and $[\cdot;\cdot]$ denotes feature-wise concatenation. In addition, we prepend a time step embedding to the unified visual representations for generative modeling; $t$ is set as 1.0 to get the time step embedding for clean images.
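To make the dual-path fusion concrete, below is a minimal PyTorch sketch of the spatial (-temporal) fusion in Eq. 3. The class and dimension names are illustrative assumptions rather than the released implementation, and it assumes a recent PyTorch version that provides nn.RMSNorm.

```python
import torch
import torch.nn as nn

class SpatialTemporalFusion(nn.Module):
    """Fuse high-level (semantic-layer) and low-level (projector) features:
    concatenate along the feature dimension, then RMSNorm and two MLP layers."""

    def __init__(self, sem_dim: int, low_dim: int, hidden_dim: int):
        super().__init__()
        self.norm = nn.RMSNorm(sem_dim + low_dim)
        self.mlp = nn.Sequential(
            nn.Linear(sem_dim + low_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, h_sem: torch.Tensor, h_low: torch.Tensor) -> torch.Tensor:
        # h_sem, h_low: (batch, num_tokens, dim). The two paths are token-wise
        # aligned because both read the same 3D causal VAE latents.
        u = torch.cat([h_sem, h_low], dim=-1)  # feature-wise concatenation
        return self.mlp(self.norm(u))          # unified visual tokens u
```

In practice, a time step embedding would be prepended to the returned tokens before they enter the sequence, as described above.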

We structure the text embeddings and unified visual representations into a sequence following a general interleaved image-text format below:

[BOS] {text embeddings} [BOI] {unified visual representations u} [EOI] {text embeddings} ⋯ [EOS]

where [BOI] and [EOI] mark the beginning and end of an image/video segment. The sequence format above is flexible and can be adapted to various input types. We adopt the omni-attention mechanism 11 12 to let the sequence modeling be causal overall but with full attention within the unified visual representations.
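As a rough illustration of this attention pattern (not the authors' code), the sketch below builds a boolean mask that is causal over the whole sequence but grants full attention within each contiguous span of visual tokens; `is_visual` marks positions holding unified visual representations.

```python
import torch

def omni_attention_mask(is_visual: torch.Tensor) -> torch.Tensor:
    """is_visual: (seq_len,) bool; True at unified visual token positions.
    Returns a (seq_len, seq_len) bool mask; True = attention allowed."""
    n = is_visual.numel()
    allowed = torch.tril(torch.ones(n, n, dtype=torch.bool))  # causal base
    # Assign an id to each contiguous visual span so that tokens of the
    # same image/video can attend to each other bidirectionally.
    starts = is_visual & ~torch.cat([is_visual.new_zeros(1), is_visual[:-1]])
    span_id = torch.cumsum(starts.long(), dim=0) * is_visual.long()
    same_span = (
        (span_id[:, None] == span_id[None, :])
        & is_visual[:, None]
        & is_visual[None, :]
    )
    return allowed | same_span

# e.g., a text-image-text sequence: the three visual tokens see each other.
mask = omni_attention_mask(torch.tensor([0, 0, 1, 1, 1, 0], dtype=torch.bool))
```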

Flow Head. Apart from the language head for text token prediction, we employ a flow head to predict the defined velocity via flow matching 26 52. Specifically, the flow head simply consists of several transformer layers with time step modulation via the adaLN-Zero blocks, as seen in DiT 53.

During training, we natively apply next token prediction to the language head and flow matching to the flow head for predicting velocity, respectively:

$$\mathcal{L}_{\mathrm{NTP}} = -\sum_i \log p_\theta(w_i \mid w_{<i}), \qquad \mathcal{L}_{\mathrm{Flow}} = \mathbb{E}_{t, x_0, x_1} \big\| v_\theta(x_t, t) - (x_1 - x_0) \big\|^2, \qquad \mathcal{L} = \mathcal{L}_{\mathrm{NTP}} + \alpha \mathcal{L}_{\mathrm{Flow}}, \tag{4}$$

where $w_i$ denotes the $i$-th text token, $v_\theta$ is the velocity predicted by the flow head, and $\alpha$ balances the two objectives.
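A minimal sketch of this combined objective, assuming the standard rectified-flow convention above (x0 is Gaussian noise, x1 the clean latents) and hypothetical tensor shapes:

```python
import torch.nn.functional as F

def unified_loss(text_logits, text_targets, v_pred, x0, x1, alpha):
    """text_logits: (B, L, V); text_targets: (B, L) with -100 on
    non-text positions; v_pred/x0/x1: (B, N, D) latent tensors."""
    # Next-token prediction on the language head.
    l_ntp = F.cross_entropy(
        text_logits.reshape(-1, text_logits.size(-1)),
        text_targets.reshape(-1),
        ignore_index=-100,  # skip visual and prompt positions
    )
    # Flow matching on the flow head: regress the velocity x1 - x0.
    l_flow = F.mse_loss(v_pred, x1 - x0)
    return l_ntp + alpha * l_flow  # alpha = 0.2 (stage 1) / 1.0 (stage 2)
```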

3.2 Training Recipe

Table 2: Trainable components and datasets in the training stages.

| Stage | Trainable Components | # Image-Text | # Video-Text | # Interleaved Data |
| Stage-1 | Projector; Spatial (-Temporal) Fusion; Flow Head | 66M | WebVid 8; Pandas 23 | OmniCorpus 60 |
| Stage-2 | Full Model (w/o VAE) | 9M HQ Und.; 16M HQ Gen. | OpenVid-1M 80 (Gen.); 1.5M Internal Data (Gen.); 1.6M Video Und. | VIST 47; CoMM 24 |

Existing UMMs, such as Show-o 11, Janus-Pro 54, Transfusion 12, Chameleon 10, and Emu3 37, are typically trained from LLMs, LMMs, or from scratch. These approaches aim to cultivate visual generative modeling capabilities while preserving language modeling proficiency. However, this process often relies on web-scale, high-quality text corpora, which are prohibitively expensive to collect. Consequently, the lack of such resources can lead to a degradation in language knowledge and modeling performance. To address this challenge, we adopt a two-stage training recipe (as shown in Table 2) that effectively retains language knowledge while simultaneously developing visual generation capabilities, without requiring a massive text corpus.

Stage-1. Before the two-stage training, we pre-distill the semantic layers (implementation details can be found in Section 4). The first stage involves only the trainable components of the projector, spatial (-temporal) fusion, and flow head. In this stage, we train these components with autoregressive modeling and flow matching on around 66M image-text pairs, progressively adding interleaved data and video-text pairs.

Stage-2. Subsequently, we tune the full model using 9M high-quality multimodal understanding instruction data, 16M high-quality visual generation data filtered from the 66M image-text pairs, and 1.6M video understanding data.

Scaling Up. After the training of the small-sized model with approximately 1.5B LLM parameters, we resume the pre-trained flow head for the larger model with 7B LLM parameters and introduce a lightweight MLP transformation to align the hidden size, allowing it to quickly adapt to the larger model and converge.
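A hypothetical sketch of this hidden-size alignment is given below; the widths follow the published hidden sizes of Qwen2.5-1.5B (1536) and Qwen2.5-7B (3584), but the adapter depth and names are assumptions rather than the released design.

```python
import torch.nn as nn

class FlowHeadAdapter(nn.Module):
    """Map 7B-model hidden states down to the width expected by the
    flow head pre-trained alongside the 1.5B model."""

    def __init__(self, llm_dim: int = 3584, flow_dim: int = 1536):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(llm_dim, flow_dim),
            nn.SiLU(),
            nn.Linear(flow_dim, flow_dim),
        )

    def forward(self, hidden_states):
        # Output feeds the resumed flow head unchanged.
        return self.proj(hidden_states)
```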

4 Experiments

4.1 Experimental Setup

Datasets. The curated approximately 66M image-text pairs consist of images with a resolution of at least 512 pixels in width and height. The images are filtered from CC12M 55, COYO 56, LAION-Aesthetic-12M, and AI synthetic data. The images are recaptioned by LMMs, except for the synthetic data. The 9M high-quality multimodal understanding instruction data is curated from DenseFusion-1M 57 and LLaVA-OneVision 4.

Implementation Details. The semantic layers are pre-distilled from SigLIP-so400m-patch14-384 over 200K iterations, using a batch size of 512 and a cosine-scheduled learning rate of 2e-5. During distillation, Eq. 1 is applied to the visual latents with a probability of 0.3 during only the last 20K iterations. The input image resolution of the 3D causal VAE encoder with the patch embedding layer is set as 432 × 432 to get 27 × 27 visual latents, which matches the number of patch features extracted by SigLIP. Once distilled, the semantic layers are capable of extracting rich semantic features from both clean and noised visual latents. Statistically, the features extracted from clean visual latents by the distilled semantic layers have converged to an average cosine similarity of around 0.9 with those extracted by the original SigLIP on the curated 66M image-text pairs. We interpolate the position embeddings in the bicubic mode when involving other image/video resolutions.
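The pre-distillation objective of Eq. 2 can be sketched as follows; `sem_layers`, `proj` (a lightweight alignment head we assume here), and the frozen `siglip` teacher are placeholders for the actual modules.

```python
import torch
import torch.nn.functional as F

def distill_loss(sem_layers, proj, siglip, x_t, image):
    """Align semantic-layer features from (possibly noised) VAE latents
    with frozen SigLIP patch features via cosine similarity."""
    pred = proj(sem_layers(x_t))        # (B, 729, D): from 27x27 latents
    with torch.no_grad():
        target = siglip(image)          # (B, 729, D): SigLIP patch features
    cos = F.cosine_similarity(pred, target, dim=-1)
    return 1.0 - cos.mean()             # minimizing maximizes similarity
```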

Our models build upon two LLM variants, i.e., Qwen2.5-1.5B-Instruct 2 and Qwen2.5-7B-Instruct 2, respectively. We adopt the 3D causal VAE proposed in Wan2.1 14 with spatial and temporal compression factors of 8 and 4, respectively. In stage 1, we first train the 1.5B variant for 150K iterations using the AdamW optimizer with a constant learning rate of 0.0001 on the curated 66M image-text pairs at a resolution of 432 × 432. The context length of single image-text pairs is set as 1024. The total batch sizes for multimodal understanding and generation are 128 and 384, respectively. The loss weight α in Eq. 4 is set as 0.2. For visual generation data, the caption is dropped with a probability of 0.1 to enable classifier-free guidance. This training process takes roughly one and a half days using 64 H100 GPUs. Subsequently, we replace the generation data with the 16M high-quality data (filtered from the 66M image-text pairs) and continue training for 40K iterations. In stage 2, we follow the training strategies in LLaVA-OneVision 4 to train the 1.5B model using around 9M multimodal instructional and 16M high-quality generation data for a total of around 35K iterations. α in Eq. 4 is set as 1.0. The stage 2 training process takes around 15 hours. For models with mixed-modality and video generation capabilities, we progressively add video-text and interleaved data in stage 1. For video data, we randomly sample a 2-second 480p or 432 × 432 clip with 17 frames from each video at an interval of 3 frames. The context length in this case is set as 7006. In stage 2, high-quality video-text and interleaved data are added to further improve video and mixed-modality generation capabilities.

To further improve the image generation and text rendering quality, we additionally train the small-scale model on images with higher resolutions (up to 1024 × 1024) and involve additional text-rich image data, i.e., a subset of TextAtlas 58.

Building on the pre-trained image-level Show-o2 models, we enhance their video understanding capabilities by further fine-tuning on 1.6M video samples from 59, together with 1.1M image-level samples from the earlier stage. We adopt the same video training and inference settings as LLaVA-OneVision. The evaluation results are shown in Table 4.

Table 3: Evaluation on multimodal understanding benchmarks. # Params. indicates the number of parameters of the base LLM. * indicates that the method uses two distinct models or sets of parameters for multimodal understanding and generation, respectively. † indicates the Show-o2 models fine-tuned using video understanding data. Und. indicates “understanding”. Results in gray indicate the performance of und.-only models or models with more than 13B total parameters.

| Types | Models | # Params. | MME (p) ↑ | GQA ↑ | SEED (all) ↑ | MMB (en) ↑ | MMMU (val) ↑ | MMStar ↑ | AI2D ↑ |
| Und. Only | LLaVA-v1.5 71 | 7B | 1510.7 | 62.0 | 58.6 | 64.3 | - | - | - |
| | Qwen-VL-Chat 6 | 7B | 1487.6 | 57.5 | 58.2 | 60.6 | - | - | 57.7 |
| | LLaVA-OV 56 | 7B | 1580.0 | - | - | 80.8 | 48.8 | 57.5 | 81.4 |
| Unify via Assembling Tailored Models | NExT-GPT 128 | 13B | - | - | 57.5 | 58.0 | - | - | - |
| | SEED-X 40 | 17B | 1457.0 | 49.1 | 66.5 | 70.1 | 35.6 | - | - |
| | MetaMorph 106 | 8B | - | - | 71.8 | 75.2 | - | - | - |
| | TokenFlow-XL 89 | 14B | 1551.1 | 62.5 | 72.6 | 76.8 | 43.2 | - | 75.9 |
| | ILLUME 111 | 7B | 1445.3 | - | 72.9 | 75.1 | 38.2 | - | 71.4 |
| Native Unified | BAGEL 32 | 14B | 1687.0 | - | - | 85.0 | 55.3 | - | - |
| | Show-o 128 | 1.3B | 1097.2 | 58.0 | 51.5 | - | 27.4 | - | - |
| | JanusFlow 79 | 1.5B | 1333.1 | 60.3 | 70.5 | 74.9 | 29.3 | - | - |
| | SynerGen-VL 58 | 2.4B | 1381.0 | - | - | 53.7 | 34.2 | - | - |
| | Janus-Pro 26 | 1.5B | 1444.0 | 59.3 | 68.3 | 75.5 | 36.3 | - | - |
| | Show-o2 (Ours) | 1.5B | 1450.9 | 60.0 | 65.6 | 67.4 | 37.1 | 43.4 | 69.0 |
| | Emu3 114 | 8B | - | 60.3 | 68.2 | 58.5 | 31.6 | - | 70.0 |
| | VILA-U 123 | 7B | 1401.8 | 60.8 | 59.0 | - | - | - | - |
| | MUSE-VL 129 | 7B | - | - | 69.1 | 72.1 | 39.7 | 49.6 | 69.8 |
| | Liquid 118 | 8B | 1448.0 | 61.1 | - | - | - | - | - |
| | Janus-Pro 26 | 7B | 1567.1 | 62.0 | 72.1 | 79.2 | 41.0 | - | - |
| | Mogao 65 | 7B | 1592.0 | 60.9 | 74.6 | 75.0 | 44.2 | - | - |
| | Show-o2 (Ours) | 7B | 1620.5 | 63.1 | 69.8 | 79.3 | 48.9 | 56.6 | 78.6 |

In the training of our model based on the 7B LLM variant, we resume the flow head pre-trained based on the 1.5B model and additionally train the newly initialized spatial (-temporal) fusion, projector, and MLP transformations for 3K iterations with 2K warm-up steps to align the hidden size and then further train spatial (-temporal) fusion, the projector, MLP transformations, and the flow head together. Following that, we conduct the training stages 1 and 2 in the same manner as those of the 1.5B model. The whole training process of our 7B model takes approximately 2 and a half days using 128 H100 GPUs. We do not include interleaved and video data in the training stages of the larger model due to the huge computational cost and training duration.

4.2 Multimodal Understanding on Images and Videos

Quantitative Results. Table 3 highlights the performance of our models on multimodal understanding benchmarks, evaluated across metrics such as MME 60, GQA 61, SEED-Bench 62, MMBench 63, MMMU 64, MMStar 65, and AI2D 66. As shown in the table, both the 1.5B and 7B variants of our model consistently outperform state-of-the-art models across many metrics. Among models with similar parameter sizes (1.5B), our model achieves the best scores on the MME-p and MMMU-val benchmarks while delivering competitive performance on the GQA and SEED-Bench metrics. When compared to larger models with approximately 7B parameters, our models surpass state-of-the-art models such as Janus-Pro and even the significantly larger TokenFlow-XL model (14B parameters) in metrics including MME-p, GQA, MMMU-val, MMStar, and AI2D, while maintaining competitive performance on SEED-Bench and MMBench. These results underscore the robust perception capabilities of our unified visual representations, demonstrating their effectiveness in multimodal understanding tasks and the promising potential in this domain. In addition, we present the video understanding performance of Show-o2 in Table 4.

Qualitative Results. Fig. 2 showcases the multimodal understanding capabilities of our model. As demonstrated, the model excels at answering general-purpose questions about an image. Specifically, it can provide detailed descriptions of an image, count objects, and recognize text within the image. Besides, the model can leverage its world knowledge to offer step-by-step instructions for preparing daily drinks such as an avocado milkshake. Further, our model supports multimodal understanding in both English and Chinese, enabling bilingual question answering and highlighting its versatility and practical utility.

Table 4: Evaluation on video understanding benchmarks. # Params. denotes the number of parameters in the base LLM, while # Frames represents the maximum number of video frames used during training and inference. Und. stands for understanding. † marks the Show-o2 models that have been fine-tuned on video understanding data. All results are reported in terms of zero-shot accuracy.

| Model | # Params. | # Frames | ActNet-QA (test) | MVBench (test) | NExT-QA (mc) | PerceptionTest (val) | LongVideoBench (val) | VideoMME (wo/w-subs) |
| Proprietary Und. Only Models |
| GPT-4V 81 | - | - | 57.0 | 43.5 | - | - | 61.3 | 59.9/63.3 |
| GPT-4o 82 | - | - | - | - | - | - | 66.7 | 71.9/77.2 |
| Gemini-1.5-Flash 103 | - | - | 55.3 | - | - | - | 61.6 | 70.3/75.0 |
| Gemini-1.5-Pro 103 | - | - | 57.5 | - | - | - | 64.0 | 75.0/81.3 |
| Open-source Und. Only Models |
| VILA 69 | 40B | - | 58.0 | - | 67.9 | 54.0 | - | 60.1/61.1 |
| PLLaVA 131 | 34B | 16 / 16 | 60.9 | 58.1 | - | - | 53.2 | - |
| LongVA 143 | 7B | - | 50.0 | - | 68.3 | - | - | 52.6/54.3 |
| IXC-2.5 142 | 7B | 64 / 64 | 52.8 | 69.1 | 71.0 | 34.4 | - | 55.8/58.8 |
| LLaVA-OV 56 | 7B | 32 / 32 | 56.6 | 56.7 | 79.4 | 57.1 | 56.5 | 58.2/61.5 |
| VideoLLaMA2 30 | 7B | 16 / 16 | 50.2 | 54.6 | - | 51.4 | - | 47.9/50.3 |
| Unified Multimodal Models |
| Show-o2 (Ours) | 1.5B | 32 / 32 | 52.7 | 49.8 | 72.1 | 56.1 | 49.2 | 48.0/51.6 |
| Show-o2 (Ours) | 7B | 16 / 32 | 56.4 | 55.8 | 79.0 | 61.9 | 55.5 | 57.4/60.9 |

Table 5: Evaluation on the GenEval 67 benchmark. Gen. denotes “generation”. # Params. indicates the number of parameters of base LLM. # Data. indicates the number of image-text pairs used for visual generation during training. * means the method uses two distinct models for multimodal understanding and generation, respectively. Obj.: Object. Attri.: Attribute. Our results are obtained using rewritten prompts. + indicates the additional data required by the pretrained diffusion models.

| Type | Method | # Params. | # Data | Single Obj. | Two Obj. | Counting | Colors | Position | Color Attri. | Overall ↑ |
| Gen. Only | SD3-Medium 37 | - | - | 0.99 | 0.94 | 0.72 | 0.89 | 0.33 | 0.60 | 0.74 |
| Unifying via Assembling Tailored Models | SEED-X 40 | 17B | 158M+ | 0.97 | 0.58 | 0.26 | 0.80 | 0.19 | 0.14 | 0.49 |
| | TokenFlow-XL 89 | 14B | 60M | 0.95 | 0.60 | 0.41 | 0.81 | 0.16 | 0.24 | 0.55 |
| | ILLUME 111 | 7B | 15M+ | 0.99 | 0.86 | 0.45 | 0.71 | 0.39 | 0.28 | 0.61 |
| | MetaQuery-XL 83 | 7B | 28M+ | - | - | - | - | - | - | 0.80 |
| Native Unified | Show-o 128 | 1.3B | 2.0B | 0.98 | 0.80 | 0.66 | 0.84 | 0.31 | 0.50 | 0.68 |
| | Emu3 114 | 8B | - | - | - | - | - | - | - | 0.66 |
| | MUSE-VL 129 | 7B | 24M | - | - | - | - | - | - | 0.57 |
| | Transfusion 147 | 7B | 3.5B | - | - | - | - | - | - | 0.63 |
| | D-DiT 63 | 2B | 40M | 0.97 | 0.80 | 0.54 | 0.76 | 0.32 | 0.50 | 0.65 |
| | Janus-Pro 26 | 7B | 144M | 0.99 | 0.89 | 0.59 | 0.90 | 0.79 | 0.66 | 0.80 |
| | BAGEL 32 | 14B | 1600M | 0.98 | 0.95 | 0.84 | 0.95 | 0.78 | 0.77 | 0.88 |
| | Mogao 65 | 7B | - | 1.00 | 0.97 | 0.83 | 0.93 | 0.84 | 0.80 | 0.89 |
| | Show-o2 (Ours) | 1.5B | 66M | 0.99 | 0.86 | 0.55 | 0.86 | 0.46 | 0.63 | 0.73 |
| | Show-o2 (Ours) | 7B | 66M | 1.00 | 0.87 | 0.58 | 0.92 | 0.52 | 0.62 | 0.76 |

Table 6: Evaluation on the DPG-Bench 68 benchmark. Gen. denotes “generation”. # Params. indicates the number of parameters of base LLM. # Data. indicates the number of image-text pairs used for visual generation during training.

| Type | Method | # Params. | # Data | Global | Entity | Attribute | Relation | Other | Overall ↑ |
| Gen. Only | Hunyuan-DiT 64 | 1.5B | - | 84.59 | 80.59 | 88.01 | 74.36 | 86.41 | 78.87 |
| | Playground v2.5 57 | - | - | 83.06 | 82.59 | 81.20 | 84.08 | 83.50 | 75.47 |
| | PixArt-Σ 17 | - | - | 86.89 | 82.89 | 88.94 | 86.59 | 87.68 | 80.54 |
| | DALL-E 3 10 | - | - | 90.97 | 89.61 | 88.39 | 90.58 | 89.83 | 83.50 |
| | SD3-Medium 37 | 2B | - | 87.90 | 91.01 | 88.83 | 80.70 | 88.68 | 84.08 |
| Native Unified | Emu3-DPO 114 | 8B | - | - | - | - | - | - | 81.60 |
| | Janus-Pro 26 | 7B | 144M | 86.90 | 88.90 | 89.40 | 89.32 | 89.48 | 84.19 |
| | Mogao 65 | 7B | - | 82.37 | 90.03 | 88.26 | 93.18 | 85.40 | 84.33 |
| | Show-o2 (Ours) | 1.5B | 66M | 87.53 | 90.38 | 91.34 | 90.30 | 91.21 | 85.02 |
| | Show-o2 (Ours) | 7B | 66M | 89.00 | 91.78 | 89.96 | 91.81 | 91.64 | 86.14 |

Table 7: Overall quantitative comparison of different methods on OneIG-Bench. Gen. denotes “generation”. # Params. indicates the number of parameters of base LLM. # Data. indicates the number of image-text pairs used for visual generation during training.

| Type | Method | # Params. | # Data | Alignment ↑ | Text ↑ | Reasoning ↑ | Style ↑ | Diversity ↑ |
| Gen. Only | SD3.5-Large 37 | 8B | - | 0.809 | 0.629 | 0.294 | 0.353 | 0.225 |
| | Flux.1-dev 54 | 12B | - | 0.786 | 0.523 | 0.253 | 0.368 | 0.238 |
| | SANA-1.5 (PAG) 126 | 4.8B | - | 0.765 | 0.069 | 0.217 | 0.401 | 0.216 |
| | Lumina-Image 2.0 88 | 2.6B | 110M | 0.819 | 0.106 | 0.270 | 0.354 | 0.216 |
| | HiDream-I1-Full 44 | 17B | - | 0.829 | 0.707 | 0.317 | 0.347 | 0.186 |
| Unified Models | Show-o-512 128 | 1.3B | 2B | 0.702 | 0.002 | 0.213 | 0.361 | 0.241 |
| | Janus-Pro 27 | 7B | 144M | 0.553 | 0.001 | 0.139 | 0.276 | 0.365 |
| | BLIP3-o 18 | 8B | 55M | 0.711 | 0.013 | 0.223 | 0.361 | 0.229 |
| | BAGEL 32 | 14B | 1600M | 0.769 | 0.244 | 0.173 | 0.367 | 0.251 |
| | OmniGen2 117 | 7B | 150M | 0.804 | 0.680 | 0.271 | 0.377 | 0.242 |
| | Show-o2 (Ours) | 1.5B | 66M | 0.798 | 0.002 | 0.219 | 0.317 | 0.186 |
| | Show-o2-1024×1024 (Ours) | 1.5B | 66M | 0.798 | 0.125 | 0.274 | 0.351 | 0.186 |
| | Show-o2 (Ours) | 7B | 66M | 0.817 | 0.002 | 0.226 | 0.317 | 0.177 |

4.3 Visual Generation

Image Generation. We compare our model with the state-of-the-art approaches on GenEval 67, DPG-Bench 68, and OneIG 69 benchmarks in Tables 5, 6, and 7. One can observe that our model surpasses most approaches, including TokenFlow-XL, Show-o, Emu3, and Transfusion, on the GenEval benchmark. Compared to Janus-Pro, which was trained on a significantly larger dataset of 144M image-text pairs, our model achieves promising results with only 66M image-text pairs. On DPG-Bench evaluation, our model has demonstrated the best overall score compared to generation-only models such as SD3-Medium and unified models, including Emu3-DPO and Janus-Pro. On OneIG-Bench, our models also achieve competitive performance. We also show qualitative results in Fig. 2 to illustrate that our model can generate high-quality and realistic images.

Video Generation. We compare our model with text-to-video and image-to-video generation models in Tables 8 and 9. One can observe that, with only 2B parameters, our model outperforms models such as Show-1, Emu3, and VILA-U, which have more than 6B parameters. Besides, our model demonstrates competitive performance compared to CogVideoX and Step-Video-T2V. We also provide qualitative results of the text-to-video and image-to-video generation capabilities of our model in the middle of Fig. 2.


Figure 2: Multimodal understanding and generation examples.

One can observe that, given text prompts or an input image, our model can generate consistent video frames with reasonable motions, such as the smiling girl, lapping waves, and floating clouds.

4.4 Mixed-Modality Generation

We demonstrate the mixed-modality generation capabilities of our model using the downstream visual storytelling dataset 70 in Fig. 2. During fine-tuning, given an interleaved image-text sequence, we apply noise to all images in the sequence with a probability of 0.3. Otherwise, we randomly retain a number of the earlier images in the sequence and only apply noise to the later ones. Benefiting from the general interleaved sequence format described in Section 3.1, our model can predict the [BOI] token once it begins to generate an image. Upon detecting the [BOI] token, noise latents are appended to the sequence to gradually generate an image. The generated text tokens and images then serve as context to continue generating the following output. Fig. 2 includes two examples demonstrating our model’s ability to generate coherent interleaved text and images, vividly narrating a story.
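The decoding loop can be sketched as below; `sample_next_token`, `append_visual`, the token-id attributes, and `denoise_fn` (which runs the flow-matching steps with the flow head) are all hypothetical placeholders, not the released API.

```python
import torch

def generate_interleaved(model, seq, num_image_tokens, denoise_fn, max_len=4096):
    """Autoregressively sample text; on [BOI], append noise latents and
    denoise them into an image, which then serves as further context."""
    while seq.size(1) < max_len:
        tok = model.sample_next_token(seq)              # hypothetical helper
        seq = torch.cat([seq, tok], dim=1)
        if tok.item() == model.boi_token_id:            # start of an image
            noise = torch.randn(1, num_image_tokens, model.latent_dim)
            latents = denoise_fn(model, seq, noise)     # flow-matching steps
            seq = model.append_visual(seq, latents)     # hypothetical helper
        elif tok.item() == model.eos_token_id:
            break
    return seq
```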

Table 8: Comparison with text-to-video models on the VBench 71 benchmark. # Params. indicates the number of total parameters for video generation including the base LLM and flow head. QS: Quality Score, SS: Semantic Score, SC: Subject Consistency, BC: Background Consistency, TF: Temporal Flickering, MS: Motion Smoothness, DD: Dynamic Degree, AQ: Aesthetic Quality, IQ: Imaging Quality, OC: Object Class, MO: Multiple Objects, HA: Human Action, C: Color, SR: Spatial Relationship, S: Scene, AS: Appearance style, TS: Temporal Style, OC’: Overall Consistency.

| Models | # Params. | Total | QS | SS | SC | BC | TF | MS | DD | AQ | IQ | OC | MO | HA | C | SR | S | AS | TS | OC' |
| ModelScope 72 | 1.7B | 75.75 | 78.05 | 66.54 | 89.87 | 95.29 | 98.28 | 95.79 | 66.39 | 52.06 | 58.57 | 82.25 | 38.98 | 92.40 | 81.72 | 33.68 | 39.26 | 23.39 | 25.37 | 25.67 |
| LaVie 73 | 3B | 77.08 | 78.78 | 70.31 | 91.41 | 97.47 | 98.30 | 96.38 | 49.72 | 54.94 | 61.90 | 91.82 | 33.32 | 96.80 | 86.39 | 34.09 | 52.69 | 23.56 | 25.93 | 26.41 |
| OpenSoraPlan V1.3 74 | - | 77.23 | 80.14 | 65.62 | 97.79 | 97.24 | 99.20 | 99.05 | 30.28 | 60.42 | 56.21 | 85.56 | 43.58 | 86.80 | 79.30 | 51.61 | 36.73 | 20.03 | 22.47 | 24.47 |
| Show-1 27 | 6B | 78.93 | 80.42 | 72.98 | 95.53 | 98.02 | 99.12 | 98.24 | 44.44 | 57.35 | 58.66 | 93.07 | 45.47 | 95.60 | 86.35 | 53.50 | 47.03 | 23.06 | 25.28 | 27.46 |
| AnimateDiff-V2 75 | - | 80.27 | 82.90 | 69.75 | 95.30 | 97.68 | 98.75 | 97.76 | 40.83 | 67.16 | 70.10 | 90.90 | 36.88 | 92.60 | 87.47 | 34.60 | 50.19 | 22.42 | 26.03 | 27.04 |
| Gen-2 76 | - | 80.58 | 82.47 | 73.03 | 97.61 | 97.61 | 99.56 | 99.58 | 18.89 | 66.96 | 67.42 | 90.92 | 55.47 | 89.20 | 89.49 | 66.91 | 48.91 | 19.34 | 24.12 | 26.17 |
| Pika-1.0 77 | - | 80.69 | 82.92 | 71.77 | 96.94 | 97.36 | 99.74 | 99.50 | 47.50 | 62.04 | 61.87 | 88.72 | 43.08 | 86.20 | 90.57 | 61.03 | 49.83 | 22.26 | 24.22 | 25.94 |
| VideoCrafter-2.0 78 | - | 80.44 | 82.20 | 73.42 | 96.85 | 98.22 | 98.41 | 97.73 | 42.50 | 63.13 | 67.22 | 92.55 | 40.66 | 95.00 | 92.92 | 35.86 | 55.29 | 25.13 | 25.84 | 28.23 |
| CogVideoX 79 | 5B | 81.61 | 82.75 | 77.04 | 96.23 | 96.52 | 98.66 | 96.92 | 70.97 | 61.98 | 62.90 | 85.23 | 62.11 | 99.40 | 82.81 | 66.35 | 53.20 | 24.91 | 25.38 | 27.59 |
| Kling 80 | - | 81.85 | 83.39 | 75.68 | 98.33 | 97.60 | 99.30 | 99.40 | 46.94 | 61.21 | 65.62 | 87.24 | 68.05 | 93.40 | 89.90 | 73.03 | 50.86 | 19.62 | 24.17 | 26.42 |
| Step-Video-T2V 81 | 30B | 81.83 | 84.46 | 71.28 | 98.05 | 97.67 | 99.40 | 99.08 | 53.06 | 61.23 | 70.63 | 80.56 | 50.55 | 94.00 | 88.25 | 71.47 | 24.38 | 23.17 | 26.01 | 27.12 |
| Gen-3 82 | - | 82.32 | 84.11 | 75.17 | 97.10 | 96.62 | 98.61 | 99.23 | 60.14 | 63.34 | 66.82 | 87.81 | 53.64 | 96.40 | 80.90 | 65.09 | 54.57 | 24.31 | 24.71 | 26.69 |
| Emu3 37 | 8B | 80.96 | - | - | 95.32 | 97.69 | - | 98.93 | 79.27 | 59.64 | - | 86.17 | 44.64 | 77.71 | - | 68.73 | 37.11 | 20.92 | - | - |
| VILA-U 38 | 7B | 74.01 | 76.26 | 65.04 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| HaploOmni 83 | 9B | 78.10 | - | - | 96.40 | 97.60 | - | 96.80 | 65.30 | - | - | - | - | - | - | - | 34.60 | - | - | - |
| Show-o2 (Ours) | 2B | 81.34 | 82.10 | 78.31 | 97.28 | 96.78 | 97.68 | 98.25 | 40.83 | 65.15 | 67.06 | 94.81 | 76.01 | 95.20 | 80.89 | 62.61 | 57.67 | 23.29 | 25.27 | 27.00 |

Table 9: Comparison with image-to-video models on the VBench 71 benchmark.

| Models | I2V Subject | I2V Background | Camera Motion | Subject Consistency | Background Consistency | Temporal Flickering | Motion Smoothness | Dynamic Degree | Aesthetic Quality | Imaging Quality |
| DynamiCrafter-1024 84 | 96.71 | 96.05 | 35.44 | 95.69 | 97.38 | 97.63 | 97.38 | 47.40 | 66.46 | 69.34 |
| SEINE-512x320 85 | 94.85 | 94.02 | 23.36 | 94.20 | 97.26 | 96.72 | 96.68 | 34.31 | 58.42 | 70.97 |
| I2VGen-XL 86 | 96.74 | 95.44 | 13.32 | 96.36 | 97.93 | 98.48 | 98.31 | 24.96 | 65.33 | 69.85 |
| Animate-Anything 87 | 98.54 | 96.88 | 12.56 | 98.90 | 98.19 | 98.14 | 98.61 | 2.68 | 67.12 | 72.09 |
| ConsistI2V 88 | 94.69 | 94.57 | 33.60 | 95.27 | 98.28 | 97.56 | 97.38 | 18.62 | 59.00 | 66.92 |
| VideoCrafter-I2V 89 | 90.97 | 90.51 | 33.58 | 97.86 | 98.79 | 98.19 | 98.00 | 22.60 | 60.78 | 71.68 |
| SVD-XT-1.1 90 | 97.51 | 97.62 | - | 95.42 | 96.77 | 99.17 | 98.12 | 43.17 | 60.23 | 70.23 |
| MarDini 35 | 98.78 | 96.46 | - | - | - | - | - | - | - | - |
| Show-o2 (Ours) | 96.94 | 98.83 | 28.41 | 93.83 | 97.45 | - | 97.76 | 25.85 | 61.92 | 69.87 |

4.5 Ablation Studies

Table 10: Impact of spatial (-temporal) fusion.

| | MME | GQA | POPE | FID-5K |
| w/o Fusion | 1164.7 | 56.2 | 82.6 | 21.8 |
| w/ Fusion | 1187.8 | 57.6 | 82.6 | 20.5 |

We show the pilot study results in Table 10, which validate the effect of spatial (-temporal) fusion on multimodal understanding and generation performance. For efficiency, we adopt LLaMA-3.2-1B as the base language model and use only around 1M multimodal understanding samples and the ImageNet-1K generation data 91. Under the same training settings, fusion brings improvements on both multimodal understanding and generation metrics, including MME-p, GQA, and FID-5K. This validates that combining semantic and low-level features in the fusion mechanism benefits both multimodal understanding and generation capabilities to some extent.

Table 11: Effect of CFG guidance and inference steps.

| CFG guidance | Inference steps | GenEval | DPG-Bench |
| 2.5 | 50 | 0.65 | 81.6 |
| 5.0 | 50 | 0.71 | 83.9 |
| 7.5 | 50 | 0.71 | 84.8 |
| 10 | 50 | 0.71 | 85.0 |
| 7.5 | 25 | 0.71 | 84.6 |
| 7.5 | 100 | 0.73 | 84.7 |

We perform ablation studies to examine the effect of classifier-free guidance (CFG) and the number of inference steps using the 1.5B model. As shown in Table 11, increasing the CFG guidance scale and the number of inference steps (within a range) improves the GenEval and DPG-Bench scores. However, the improvement in the GenEval score is marginal once the CFG guidance scale exceeds 5.0.
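For reference, CFG for the flow head combines conditional and unconditional velocity predictions at every inference step; a sketch (attribute names assumed) is:

```python
def cfg_velocity(model, x_t, t, cond, scale):
    """Classifier-free guidance for flow matching: extrapolate from the
    unconditional velocity toward the conditional one by `scale`."""
    v_cond = model.flow_head(x_t, t, cond)               # caption-conditioned
    v_uncond = model.flow_head(x_t, t, model.null_cond)  # dropped caption
    return v_uncond + scale * (v_cond - v_uncond)
```

Training-time caption dropping (probability 0.1, as noted in Section 4.1) is what makes the unconditional branch available.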

Table 12: Effect of training stages.

| Stage-1 | Stage-2 | GenEval | DPG-Bench |
| ✓ | ✗ | 0.63 | 83.28 |
| ✓ | ✓ | 0.73 | 84.70 |

Table 12 provides the effect of training stages on the generation performance on the GenEval and DPG-Bench benchmarks. One can observe that stage-2 training consistently and significantly improves both metrics, which validates the importance of the second stage.

Table 13: Impact of the training recipe on text-only performance. One-stage training denotes full-parameter co-training on image-text pairs and the text-only RefinedWeb 92 data. Note that the curated multimodal understanding data contains text-only instructional data. We perform the evaluation under the same setting using the lm-evaluation-harness tool.

| Models | # Params. | Training Recipe | MMLU | GPQA | GSM8K | HumanEval |
| Qwen2.5 Instruct 2 | 1.5B | - | 60.20 ± 0.39 | 28.12 ± 2.13 | 51.86 ± 1.38 | 35.37 ± 3.74 |
| Show-o2 (Ours) | 1.5B | One-stage training with RefinedWeb | 28.25 ± 0.38 | 25.00 ± 2.05 | 4.55 ± 0.57 | 3.05 ± 1.35 |
| Show-o2 (Ours) | 1.5B | Our two-stage training | 56.75 ± 1.37 | 29.24 ± 2.15 | 49.43 ± 1.38 | 35.54 ± 3.70 |
| Qwen2.5 Instruct 2 | 7B | - | 71.75 ± 0.36 | 32.37 ± 2.21 | 82.49 ± 1.05 | 65.24 ± 3.73 |
| Show-o2 (Ours) | 7B | One-stage training with RefinedWeb | 28.43 ± 0.21 | 26.34 ± 2.08 | 1.52 ± 0.34 | 4.01 ± 1.25 |
| Show-o2 (Ours) | 7B | Our two-stage training | 70.73 ± 0.36 | 31.47 ± 2.22 | 75.28 ± 1.19 | 70.73 ± 3.56 |

Table 13 shows that our models effectively preserve language knowledge and achieve performance comparable to the original Qwen2.5-1.5B and Qwen2.5-7B Instruct models. In contrast, direct one-stage full-parameter-co-training with textual data such as RefinedWeb results in substantial performance degradation, highlighting the necessity of the two-stage training approach when high-quality corpora are unavailable.

Table 14: Impact of image token count on chart, text, and document VQA.

| Models | # Params. | # Image tokens | ChartQA | DocVQA | InfoVQA | TextVQA |
| LLaVA-OV | 7B | 729 | 56.24 | 62.71 | 39.59 | 66.19 |
| Show-o2 | 7B | 729 | 48.00 | 59.34 | 42.31 | 62.92 |
| Show-o2 | 7B | 5 × 729 | 66.92 | 77.26 | 45.80 | 71.54 |

As shown in Table 14, our ablation study reveals that increasing the number of image tokens significantly boosts performance across all tasks, even though the model was trained with a fixed image resolution. Using the AnyRes strategy at inference time consistently improves results, highlighting the benefit of higher token counts for capturing fine-grained details. When compared to the baseline LLaVA-OV-7B, our model achieves comparable results on DocVQA, InfoVQA, and TextVQA validation sets, but underperforms on ChartQA. We attribute this gap to the limited chart-related data available during semantic layer distillation, which constrains the model’s ability to capture chart-specific information. We believe that incorporating more OCR and document-centric data into the distillation process will further strengthen the unified model’s OCR and document understanding capabilities.

5 Limitations and Broader Impacts

We find that our model is not good at rendering text on images. Investigating our generation datasets, we observed that the proportion of images with rendered text is relatively small, which likely leads to poor text rendering. In addition, generated images can lack detail in small objects because of the limited image resolution. To address these limitations, as outlined in the implementation details, we have enhanced the model by training it on higher-resolution data and incorporating image datasets rich in textual information.

Our models possess the ability to generate text and images, which may carry the risk of unintended misuse, such as creating fake information or profiles. Additionally, our large-scale dataset includes content featuring celebrities and copyrighted materials, which could potentially result in intellectual property infringement.

6 Conclusion

This paper proposed native unified multimodal models, i.e., Show-o2, scalable across multimodal understanding and generation as well as image and video modalities, by integrating a 3D causal VAE, autoregressive modeling, and flow matching. A dual-path of spatial (-temporal) fusion mechanism guided the construction of unified visual representations with both high- and low-level features. A two-stage training recipe enabled effective learning of unified capabilities, resulting in a versatile model capable of handling diverse tasks, including multimodal understanding and image/video generation. Extensive experiments demonstrated the model’s state-of-the-art performance across various benchmarks.

Acknowledgments and Disclosure of Funding

We thank Haozhe Liu for his valuable input and discussions throughout this project. We are also grateful to Meng Wei and Weihao Wang for their assistance in preparing and organizing the datasets for image and video generation.

Appendix A Technical Appendices and Supplementary Material

Table 15: Comparative analysis of selected unified multimodal models based on the utilization of visual representations and the type of unified modeling for multimodal understanding and generation. In this context, native und. & gen. refers to the direct decoding of output sequences into texts, images, and videos, as opposed to serving as conditions for decoding using external pre-trained decoders like Stable Diffusion. * indicates that the method uses two distinct models for multimodal understanding and generation, respectively.

| Methods | Paradigm |
| Chameleon 102 | AR |
| Show-o 128 | AR + Diff. |
| Transfusion 147 | AR + Diff. |
| VILA-U 123 | AR |
| Emu3 114 | AR |
| MonoFormer 146 | AR + Diff. |
| Dual-Diffusion 63 | Diff. |
| SynerGen-VL 58 | AR |
| MMAR 134 | AR + MAR |
| MUSE-VL 129 | AR |
| Orthus 53 | AR + Diff. |
| Liquid 118 | AR |
| LlamaFusion 95 | AR + Diff. |
| UGen 99 | AR |
| UniDisc 98 | Diff. |
| UniToken 50 | AR |
| Harmon 122 | AR + MAR |
| DualToken 96 | AR |
| UniTok 77 | AR |
| Selftok 110 | AR |
| Muddit 94 | Diff. |
| MMaDA 135 | Diff. |
| HaploOmni 124 | AR + Diff. |
| TokLIP 68 | AR |
| Show-o2 (Ours) | AR + Diff. |
| Janus-Series 26 79 27 | AR (+Diff.) |
| VARGPT 148 | AR |
| UniFluid 38 | AR + MAR |
| OmniMamba 149 | AR |
| Mogao 65 | AR + Diff. |
| BAGEL 32 | AR + Diff. |
| Fudoki 112 | Diff. |
| UniGen 104 | AR + Diff. |
| NExT-GPT 120 | AR + Diff. |
| CoDI 101 | AR + Diff. |
| DreamLLM 36 | AR + Diff. |
| SEED-X 40 | AR + Diff. |
| MIO 116 | AR + Diff. |
| CoDI-2 100 | AR + Diff. |
| MetaMorph 106 | AR + Diff. |
| ILLUME 111 | AR + Diff. |
| ILLUME+ 46 | AR + Diff. |
| MetaQueries 83 | AR + Diff. |
| Nexus-Gen 141 | AR + Diff. |
| Ming-Lite-Uni 42 | AR + Diff. |
| BLIP3-o 18 | AR + Diff. |
| OpenUni 121 | AR + Diff. |
| UniWorld 67 | AR + Diff. |
| Ming-Omni 5 | AR + Diff. |
| Pisces 132 | AR + Diff. |
| TokenFlow 89 | AR |
| SemHiTok 28 | AR |

A.1 More Qualitative Results


Figure 3: Text-to-video and image-to-video generation examples.

A.2 Text Prompts

We provide the text prompts for image generation used in Fig. 2 below:

“Hyper-detailed image of a mature man with short, graying hair and deep blue eyes. He has a rugged, weathered face with a strong jawline and a slight beard. His expression is thoughtful and introspective. The lighting is dramatic, highlighting the contours of his face. The photo is in 8K resolution, capturing every wrinkle and pore. ”

“A soft, natural portrait photograph captures a young woman with fair skin and long, ash-blonde hair cascading gently over her shoulders, her striking light blue eyes subtly enhanced with natural makeup and a gentle, calm smile playing on her lips. She wears a cozy, cream-colored winter sweater and a delicate woolen scarf adorned with subtle snowflake patterns, positioned slightly off-center, creating a sense of relaxed elegance. Behind her, a softly blurred snowy Moscow street scene unfolds, with traditional architecture and the diffused, golden glow of a winter afternoon contributing to a serene and contemplative atmosphere. At the very bottom of the frame, in simple, elegant lettering, appears the phrase “BE KIND”. ”

“A vibrant, highly detailed close-up of a colorful parrot perched on a branch, featuring intricate feather textures, vivid colors (red, blue, green, yellow), and a tropical rainforest background. The parrot’s eyes are sharp and expressive, with a natural glint of light. The image is photorealistic, ultra HD (8K resolution), with soft natural lighting and a shallow depth of field, creating a blurred bokeh effect in the background. The scene is peaceful and lush, showcasing the beauty of nature. ”

“A dark, moody room with a glowing neon sign on the wall that spells out ’SHOW O2’ in bold, vibrant pink and blue colors. The neon light reflects softly on the polished concrete floor, creating a futuristic and artistic vibe. ”

References

  1. Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. CoRR, abs/2302.13971, 2023.

  2. An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu Xia, Xingzhang Ren, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yu Wan, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, and Zihan Qiu. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.

  3. Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025.

  4. Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024.

  5. Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024.

  6. Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In ICML, 2024.

  7. Enze Xie, Junsong Chen, Yuyang Zhao, Jincheng Yu, Ligeng Zhu, Yujun Lin, Zhekai Zhang, Muyang Li, Junyu Chen, Han Cai, et al. Sana 1.5: Efficient scaling of training-time and inference-time compute in linear diffusion transformer, 2025.

  8. Qi Qin, Le Zhuo, Yi Xin, Ruoyi Du, Zhen Li, Bin Fu, Yiting Lu, Xinyue Li, Dongyang Liu, Xiangyang Zhu, Will Beddow, Erwann Millon, Wenhai Wang Victor Perez, Yu Qiao, Bo Zhang, Xiaohong Liu, Hongsheng Li, Chang Xu, and Peng Gao. Lumina-image 2.0: A unified and efficient image generative framework, 2025.

  9. Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua. Next-gpt: Any-to-any multimodal llm. arXiv preprint arXiv:2309.05519, 2023.

  10. Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818, 2024.

  11. Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation. In ICLR, 2025.

  12. Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, and Omer Levy. Transfusion: Predict the next token and diffuse images with one multi-modal model. In ICLR, 2025.

  13. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763, 2021.

  14. Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, Tianxing Wang, Tianyi Gui, Tingyu Weng, Tong Shen, Wei Lin, Wei Wang, Wei Wang, Wenmeng Zhou, Wente Wang, Wenting Shen, Wenyuan Yu, Xianzhong Shi, Xiaoming Huang, Xin Xu, Yan Kou, Yangyu Lv, Yifei Li, Yijing Liu, Yiming Wang, Yingya Zhang, Yitong Huang, Yong Li, You Wu, Yu Liu, Yulin Pan, Yun Zheng, Yuntao Hong, Yupeng Shi, Yutong Feng, Zeyinzi Jiang, Zhen Han, Zhi-Fan Wu, and Ziyu Liu. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.

  15. Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. NeurIPS, 36, 2024.

  16. Haiwen Diao, Yufeng Cui, Xiaotong Li, Yueze Wang, Huchuan Lu, and Xinlong Wang. Unveiling encoder-free vision-language models. arXiv preprint arXiv:2406.11832, 2024.

  17. Haiwen Diao, Xiaotong Li, Yufeng Cui, Yueze Wang, Haoge Deng, Ting Pan, Wenxuan Wang, Huchuan Lu, and Xinlong Wang. Evev2: Improved baselines for encoder-free vision-language models. arXiv preprint arXiv:2502.06788, 2025.

  18. Lin Chen, Jisong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793, 2023.

  19. Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, Austin Wang, Rob Fergus, Yann LeCun, and Saining Xie. Cambrian-1: A fully open, vision-centric exploration of multimodal llms, 2024.

  20. Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, pages 10684–10695, 2022.

  21. William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, pages 4195–4205, 2023.

  22. Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. All are worth words: A vit backbone for diffusion models. In CVPR, 2023.

  23. Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Zhongdao Wang, James T. Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. PixArt-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. In ICLR. OpenReview.net, 2024.

  24. Jinheng Xie, Yuexiang Li, Yawen Huang, Haozhe Liu, Wentian Zhang, Yefeng Zheng, and Mike Zheng Shou. Boxdiff: Text-to-image synthesis with training-free box-constrained diffusion. In ICCV, pages 7452–7461, 2023.

  25. Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In ICCV, 2023.

  26. Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2023.

  27. David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu, Difei Gao, and Mike Zheng Shou. Show-1: Marrying pixel and latent diffusion models for text-to-video generation. arXiv preprint arXiv:2309.15818, 2023.

  28. Team Seawead, Ceyuan Yang, Zhijie Lin, Yang Zhao, Shanchuan Lin, Zhibei Ma, Haoyuan Guo, Hao Chen, Lu Qi, Sen Wang, et al. Seaweed-7b: Cost-effective training of video generation foundation model. arXiv preprint arXiv:2504.08685, 2025.

  29. Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525, 2024.

  30. Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Rachel Hornung, Hartwig Adam, Hassan Akbari, Yair Alon, Vighnesh Birodkar, et al. Videopoet: A large language model for zero-shot video generation. arXiv preprint arXiv:2312.14125, 2023.

  31. Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In ICML, pages 1691–1703, 2020.

  32. Ziqi Pang, Tianyuan Zhang, Fujun Luan, Yunze Man, Hao Tan, Kai Zhang, William T. Freeman, and Yu-Xiong Wang. Randar: Decoder-only autoregressive visual generation in random orders. arXiv preprint arXiv:2412.01827, 2024.

  33. Haopeng Li, Jinyue Yang, Guoqi Li, and Huan Wang. Autoregressive image generation with randomized parallel decoding. arXiv preprint arXiv:2503.10568, 2025.

  34. Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization. arXiv preprint arXiv:2406.11838, 2024.

  35. Haozhe Liu, Shikun Liu, Zijian Zhou, Mengmeng Xu, Yanping Xie, Xiao Han, Juan C. Pérez, Ding Liu, Kumara Kahatapitiya, Menglin Jia, Jui-Chieh Wu, Sen He, Tao Xiang, Jürgen Schmidhuber, and Juan-Manuel Pérez-Rúa. Mardini: Masked autoregressive diffusion for video generation at scale. arXiv preprint arXiv:2410.20280, 2024.

  36. Lijie Fan, Luming Tang, Siyang Qin, Tianhong Li, Xuan Yang, Siyuan Qiao, Andreas Steiner, Chen Sun, Yuanzhen Li, Tao Zhu, et al. Unified autoregressive visual generation and understanding with continuous tokens. arXiv preprint arXiv:2503.13436, 2025.

  37. Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024.

  38. Yecheng Wu, Zhuoyang Zhang, Junyu Chen, Haotian Tang, Dacheng Li, Yunhao Fang, Ligeng Zhu, Enze Xie, Hongxu Yin, Li Yi, et al. Vila-u: a unified foundation model integrating visual understanding and generation. arXiv preprint arXiv:2409.04429, 2024.

  39. Chuofan Ma, Yi Jiang, Junfeng Wu, Jihan Yang, Xin Yu, Zehuan Yuan, Bingyue Peng, and Xiaojuan Qi. Unitok: A unified tokenizer for visual generation and understanding. arXiv preprint arXiv:2502.20321, 2025.

  40. Yang Jiao, Haibo Qiu, Zequn Jie, Shaoxiang Chen, Jingjing Chen, Lin Ma, and Yu-Gang Jiang. Unitoken: Harmonizing multimodal understanding and generation through unified visual encoding. arXiv preprint arXiv:2504.04423, 2025.

  41. Zisheng Chen, Chunwei Wang, Xiuwei Chen, Hang Xu, Jianhua Han, and Xiaodan Liang. Semhitok: A unified image tokenizer via semantic-guided hierarchical codebook for multimodal understanding and generation. arXiv preprint arXiv:2503.06764, 2025.

  42. Wei Song, Yuran Wang, Zijia Song, Yadong Li, Haoze Sun, Weipeng Chen, Zenan Zhou, Jianhua Xu, Jiaqi Wang, and Kaicheng Yu. Dualtoken: Towards unifying visual understanding and generation with dual visual vocabularies. arXiv preprint arXiv:2503.14324, 2025.

  43. Zineng Tang, Ziyi Yang, Chenguang Zhu, Michael Zeng, and Mohit Bansal. Any-to-any generation via composable diffusion. NeurIPS, 36, 2024.

  44. Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, and Aniruddha Kembhavi. Unified-io 2: Scaling autoregressive multimodal models with vision, language, audio, and action. arXiv preprint arXiv:2312.17172, 2023.

  45. Runpei Dong, Chunrui Han, Yuang Peng, Zekun Qi, Zheng Ge, Jinrong Yang, Liang Zhao, Jianjian Sun, Hongyu Zhou, Haoran Wei, Xiangwen Kong, Xiangyu Zhang, Kaisheng Ma, and Li Yi. DreamLLM: Synergistic multimodal comprehension and creation. In ICLR, 2024.

  46. Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, and Ying Shan. Seed-x: Multimodal models with unified multi-granularity comprehension and generation. arXiv preprint arXiv:2404.14396, 2024.

  47. Shengbang Tong, David Fan, Jiachen Zhu, Yunyang Xiong, Xinlei Chen, Koustuv Sinha, Michael Rabbat, Yann LeCun, Saining Xie, and Zhuang Liu. Metamorph: Multimodal understanding and generation via instruction tuning. arXiv preprint arXiv:2412.14164, 2024.

  48. Xichen Pan, Satya Narayan Shukla, Aashu Singh, Zhuokai Zhao, Shlok Kumar Mishra, Jialiang Wang, Zhiyang Xu, Jiuhai Chen, Kunpeng Li, Felix Juefei-Xu, Ji Hou, and Saining Xie. Transfer between modalities with metaqueries. arXiv preprint arXiv:2504.06256, 2025.

  49. Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, et al. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset. arXiv preprint arXiv:2505.09568, 2025.

  50. Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training, 2023.

  51. Biao Zhang and Rico Sennrich. Root mean square layer normalization. In NeurIPS, 2019.

  52. Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022.

  53. William Peebles and Saining Xie. Scalable diffusion models with transformers. arXiv preprint arXiv:2212.09748, 2022.

  54. Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811, 2025.

  55. Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In CVPR, pages 3558–3568, 2021.

  56. Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim. Coyo-700m: Image-text pair dataset. https://github.com/kakaobrain/coyo-dataset, 2022.

  57. Xiaotong Li, Fan Zhang, Haiwen Diao, Yueze Wang, Xinlong Wang, and Ling-Yu Duan. Densefusion-1m: Merging vision experts for comprehensive multimodal perception. arXiv preprint arXiv:2407.08303, 2024.

  58. Alex Jinpeng Wang, Dongxing Mao, Jiawei Zhang, Weiming Han, Zhuobai Dong, Linjie Li, Yiqi Lin, Zhengyuan Yang, Libo Qin, Fuwei Zhang, et al. Textatlas5m: A large-scale dataset for dense text image generation. arXiv preprint arXiv:2502.07870, 2025.

  59. Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713, 2024.

  60. Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, and Rongrong Ji. MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023.

  61. Drew A. Hudson and Christopher D. Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, pages 6700–6709, 2019.

  62. Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023.

  63. Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? In ECCV, pages 216–233. Springer, 2024.

  64. Xiang Yue, Yuansheng Ni, Tianyu Zheng, Kai Zhang, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. In CVPR, pages 9556–9567, 2024.

  65. Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models? arXiv preprint arXiv:2403.20330, 2024.

  66. Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In ECCV, pages 235–251. Springer, 2016.

  67. Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. In NeurIPS, 2023.

  68. Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment, 2024.

  69. Jingjing Chang, Yixiao Fang, Peng Xing, Shuhan Wu, Wei Cheng, Rui Wang, Xianfang Zeng, Gang Yu, and Hai-Bao Chen. Oneig-bench: Omni-dimensional nuanced evaluation for image generation. arXiv preprint arXiv:2506.07977, 2025.

  70. Ting-Hao K. Huang, Francis Ferraro, Nasrin Mostafazadeh, Ishan Misra, Jacob Devlin, Aishwarya Agrawal, Ross Girshick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, et al. Visual storytelling. In NAACL, 2016.

  71. Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench: Comprehensive benchmark suite for video generative models. In CVPR, 2024.

  72. Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023.

  73. Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, et al. Lavie: High-quality video generation with cascaded latent diffusion models. IJCV, 2024.

  74. Bin Lin, Yunyang Ge, Xinhua Cheng, Zongjian Li, Bin Zhu, Shaodong Wang, Xianyi He, Yang Ye, Shenghai Yuan, Liuhan Chen, et al. Open-sora plan: Open-source large video generation model. arXiv preprint arXiv:2412.00131, 2024.

  75. Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. International Conference on Learning Representations, 2024.

  76. Gen-2. Accessed September 25, 2023 [Online] https://research.runwayml.com/gen2, 2023.

  77. Pika 1.0. Accessed December 28, 2023 [Online] https://www.pika.art/, 2023.

  78. Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models, 2024.

  79. Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024.

  80. Kling. Accessed June 6, 2024 [Online] https://klingai.kuaishou.com/, 2024.

  81. Guoqing Ma, Haoyang Huang, Kun Yan, Liangyu Chen, Nan Duan, Shengming Yin, Changyi Wan, Ranchen Ming, Xiaoniu Song, Xing Chen, Yu Zhou, Deshan Sun, Deyu Zhou, Jian Zhou, Kaijun Tan, Kang An, Mei Chen, Wei Ji, Qiling Wu, Wen Sun, Xin Han, Yanan Wei, Zheng Ge, Aojie Li, Bin Wang, Bizhu Huang, Bo Wang, Brian Li, Changxing Miao, Chen Xu, Chenfei Wu, Chenguang Yu, Dapeng Shi, Dingyuan Hu, Enle Liu, Gang Yu, Ge Yang, Guanzhe Huang, Gulin Yan, Haiyang Feng, Hao Nie, Haonan Jia, Hanpeng Hu, Hanqi Chen, Haolong Yan, Heng Wang, Hongcheng Guo, Huilin Xiong, Huixin Xiong, Jiahao Gong, Jianchang Wu, Jiaoren Wu, Jie Wu, Jie Yang, Jiashuai Liu, Jiashuo Li, Jingyang Zhang, Junjing Guo, Junzhe Lin, Kaixiang Li, Lei Liu, Lei Xia, Liang Zhao, Liguo Tan, Liwen Huang, Liying Shi, Ming Li, Mingliang Li, Muhua Cheng, Na Wang, Qiaohui Chen, Qinglin He, Qiuyan Liang, Quan Sun, Ran Sun, Rui Wang, Shaoliang Pang, Shiliang Yang, Sitong Liu, Siqi Liu, Shuli Gao, Tiancheng Cao, Tianyu Wang, Weipeng Ming, Wenqing He, Xu Zhao, Xuelin Zhang, Xianfang Zeng, Xiaojia Liu, Xuan Yang, Yaqi Dai, Yanbo Yu, Yang Li, Yineng Deng, Yingming Wang, Yilei Wang, Yuanwei Lu, Yu Chen, Yu Luo, Yuchu Luo, Yuhe Yin, Yuheng Feng, Yuxiang Yang, Zecheng Tang, Zekai Zhang, Zidong Yang, Binxing Jiao, Jiansheng Chen, Jing Li, Shuchang Zhou, Xiangyu Zhang, Xinhao Zhang, Yibo Zhu, Heung-Yeung Shum, and Daxin Jiang. Step-video-t2v technical report: The practice, challenges, and future of video foundation model, 2025.

  82. Gen-3. Accessed June 17, 2024 [Online] https://runwayml.com/research/introducing-gen-3-alpha, 2024.

  83. Yicheng Xiao, Lin Song, Rui Yang, Cheng Cheng, Zunnan Xu, Zhaoyang Zhang, Yixiao Ge, Xiu Li, and Ying Shan. Haploomni: Unified single transformer for multimodal video understanding and generation. arXiv preprint arXiv:2506.02975, 2025.

  84. Jinbo Xing, Menghan Xia, Yong Zhang, Haoxin Chen, Xintao Wang, Tien-Tsin Wong, and Ying Shan. Dynamicrafter: Animating open-domain images with video diffusion priors. arXiv preprint arXiv:2310.12190, 2023.

  85. Xinyuan Chen, Yaohui Wang, Lingjun Zhang, Shaobin Zhuang, Xin Ma, Jiashuo Yu, Yali Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Seine: Short-to-long video diffusion model for generative transition and prediction. arXiv preprint arXiv:2310.20700, 2023.

  86. Shiwei Zhang, Jiayu Wang, Yingya Zhang, Kang Zhao, Hangjie Yuan, Zhiwu Qing, Xiang Wang, Deli Zhao, and Jingren Zhou. I2vgen-xl: High-quality image-to-video synthesis via cascaded diffusion models. arXiv preprint arXiv:2311.04145, 2023.

  87. Zuozhuo Dai, Zhenghao Zhang, Yao Yao, Bingxue Qiu, Siyu Zhu, Long Qin, and Weizhi Wang. Fine-grained open domain image animation with motion guidance. arXiv preprint arXiv:2311.12886, 2023.

  88. Weiming Ren, Harry Yang, Ge Zhang, Cong Wei, Xinrun Du, Stephen Huang, and Wenhu Chen. Consisti2v: Enhancing visual consistency for image-to-video generation. arXiv preprint arXiv:2402.04324, 2024.

  89. Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter1: Open diffusion models for high-quality video generation. arXiv preprint arXiv:2310.19512, 2023.

  90. Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.

  91. Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.

  92. Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Hamza Alobeidli, Alessandro Cappelli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The refinedweb dataset for falcon LLM: Outperforming curated corpora with web data only. In NeurIPS, 2023.