¹Meta AI  ²The University of Hong Kong  ³University of Waterloo  *Joint first authors, listed alphabetically by last name

Zhiheng Liu    Weiming Ren    Xiaoke Huang    Shoufa Chen    Tianhong Li    Mengzhao Chen    Yatai Ji    Sen He    Jonas Schult    Belinda Zeng    Tao Xiang    Wenhu Chen    Ping Luo    Luke Zettlemoyer    Yuren Cong

Abstract

Unified multimodal models typically rely on pretrained vision encoders and use separate visual representations for understanding and generation, creating misalignment between the two tasks and preventing fully end-to-end optimization from raw pixels. We introduce Tuna-2, a native unified multimodal model that performs visual understanding and generation directly from pixel embeddings. Tuna-2 drastically simplifies the model architecture by employing simple patch embedding layers to encode visual input, completely discarding modular vision encoder designs such as the VAE or the representation encoder. Experiments show that Tuna-2 achieves state-of-the-art performance on multimodal benchmarks, demonstrating that unified pixel-space modelling can fully compete with latent-space approaches for high-quality image generation. Moreover, while the encoder-based variant converges faster in early pretraining, Tuna-2’s encoder-free design achieves stronger multimodal understanding at scale, particularly on tasks requiring fine-grained visual perception. These results show that pretrained vision encoders are not necessary for multimodal modelling, and end-to-end pixel-space learning offers a scalable path toward stronger visual representations for both generation and perception.

[Project page] https://tuna-ai.org/tuna-2


Figure 1: Evolution of the Tuna-2 architecture and multimodal performance comparison. We simplify Tuna (liu2025tuna) by progressively stripping away its visual encoding components. By removing the VAE, we first derive Tuna-R, a pixel-space UMM that relies solely on a representation encoder. Tuna-2 further streamlines the design by bypassing the representation encoder entirely, utilizing direct patch embedding layers for raw image inputs. Using pixel embeddings, Tuna-2 outperforms both Tuna and Tuna-R across a diverse suite of multimodal benchmarks.

1 Introduction

Visual understanding and generation are two core capabilities in multimodal AI. Recent work has increasingly focused on native unified multimodal models (UMMs) (zhou2024transfusion; deng2025emerging; liu2025tuna), which aim to integrate both capabilities within a single framework. A central challenge in building such models is encoding input images into visual representations that effectively support both understanding and generation. Early approaches (deng2025emerging; chen2025janus) adopted decoupled representations, using representation encoders such as CLIP (radford2021learning) for understanding and reconstruction-oriented encoders such as VQ-VAE (esser2021taming) for generation. To address the representation mismatch introduced by this design, more recent UMMs (xie2025show; liu2025tuna) have moved toward modelling both tasks using unified visual representations through a shared vision encoder.

Despite the significant progress, both decoupled and unified visual representation designs still rely heavily on pretrained vision encoders (wan2025wan; tschannen2025siglip) for visual feature extraction. In parallel, recent research on multimodal understanding and generation has begun to move away from encoder-based modular designs toward simpler monolithic, encoder-free architectures. In multimodal understanding, newer native vision-language models (diao2025pixels) remove the pretrained representation encoder and instead align images and natural language within a unified, end-to-end architecture. In visual generation, pixel-space diffusion models (hoogeboom2023simple; chen2025pixelflow; li2025back) have shown increasing flexibility, stronger scalability, and state-of-the-art performance on a wide range of tasks, suggesting that pretrained VAE encoders may no longer be essential even for high-fidelity image synthesis.

Motivated by these observations, we ask a natural but largely unexplored question: can we move beyond pretrained vision encoders altogether, and build unified multimodal models through end-to-end native modelling directly from raw pixels?

We answer this question affirmatively by introducing Tuna-2, a native unified multimodal model built by progressively simplifying the encoder modules and ultimately removing vision encoders completely. We first introduce Tuna-R, which eliminates the VAE model while keeping a representation encoder in the model architecture. Tuna-R performs multimodal understanding similarly to standard encoder-based LMMs, and supports visual generation through pixel-space flow matching with an $x$-prediction objective. We then propose Tuna-2, which further simplifies the architecture by removing the encoder entirely and using only a single transformer decoder to process image and video tokens. As a result, Tuna-2 enables end-to-end native unified modelling directly from raw pixels, without relying on any pretrained encoder modules.

Since learning unified representations directly in high-dimensional pixel space is substantially more challenging than learning them in a compact latent space, we further introduce a masking-based visual feature learning scheme to stabilize training and encourage the learning of more robust pixel-space representations. Together, these designs enable Tuna-2 to achieve state-of-the-art performance across a diverse set of multimodal understanding and generation benchmarks. More importantly, our controlled comparison reveals a clear design insight: after sufficient visual pretraining, the encoder-free Tuna-2 becomes competitive with the encoder-based Tuna-R on visual generation, while consistently outperforming it on multimodal understanding, especially on benchmarks that require fine-grained visual perception. These findings suggest that removing pretrained vision encoders can be advantageous for learning stronger fine-grained visual representations in end-to-end pretraining. As shown in Figures 1 and 2, this leads to highly competitive performance in both multimodal understanding and generation.

Our main contributions are summarized as follows:

  • We propose Tuna-2, a native unified multimodal model that supports multimodal understanding and generation with encoder-free designs, achieving state-of-the-art performance across a wide range of understanding and generation benchmarks.
  • We conduct controlled comparisons between Tuna-2 and Tuna-R, an encoder-based pixel-space UMM variant, and show that after sufficient multimodal pretraining, Tuna-2’s encoder-free design is competitive on generation and advantageous for understanding, especially on fine-grained, perception-intensive tasks.
  • We conduct comprehensive ablations and analyses on pixel-space UMMs to study their training dynamics and model behaviours, offering useful insights for the development of future native unified multimodal models.


Figure 2: While being completely encoder-free, Tuna-2 is capable of performing high-fidelity text-to-image generation and image editing.

2 Method

In this section, we present Tuna-2, a native unified multimodal model that performs both visual understanding and generation in pixel space. We start by detailing our approach of progressively removing vision encoder components to derive Tuna-2 in Section 2.1. We then describe our masked feature learning scheme in Section 2.2 and our model training pipeline in Section 2.3.

2.1 Towards Encoder-Free Unified Models

As shown in Figure 1, existing UMMs with unified visual representations, such as Tuna (liu2025tuna), typically consist of a vision encoder and an LLM decoder for joint vision-language modeling, followed by modality-specific heads, including a language modelling head for autoregressive text generation and a flow matching head for image generation. In this work, we propose Tuna-2 as an encoder-free UMM formulation by progressively simplifying the vision encoder components in existing architectures. Our design process for this architectural simplification is as follows:

Representation encoder-based architecture. First, we remove the VAE model and employ only a pretrained representation encoder as the vision encoder. As shown in Figure 1, this mirrors the standard paradigm for vision-language modelling: the representation encoder first encodes input images into visual tokens, which are then combined with the text tokens in the LLM decoder for joint vision-language modelling. Originally proposed in LLaVA (liu2023visual), this paradigm has been validated and scaled up by later works such as Qwen3-VL (bai2025qwen3) and InternVL3.5 (wang2025internvl3), and remains the most popular framework for multimodal understanding. We refer to this intermediate design as Tuna-R. Although our ultimate goal is to move beyond encoder-based architectures, we view Tuna-R as an important intermediate step that enables a controlled comparison with Tuna-2.

Encoder-free (non-encoder) architecture. Second, we consider a further simplified architecture that removes the representation encoder entirely, which becomes our main design for Tuna-2. As shown in Figure 1, this design replaces pretrained vision encoders with simple patch embedding layers that convert images into visual tokens, which are then processed jointly with text tokens by the LLM decoder. Similar encoder-free designs have recently been explored in models such as Mono-InternVL (luo2025mono) and NEO (diao2025pixels). By removing the pretrained representation encoder, this design avoids its built-in inductive biases, such as fixed input resolutions and limited access to fine-grained low-level visual details. It also simplifies the model architecture into a single unified transformer. In Section 3, we present a series of in-depth analyses comparing Tuna-2 with Tuna-R, and demonstrate the effectiveness and scalability of Tuna-2.
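To make the encoder-free input pathway concrete, the sketch below shows how a simple patch embedding layer can map raw pixels to visual tokens for a decoder-only transformer. It is a minimal illustration of the general technique rather than Tuna-2’s actual implementation; the patch size and hidden width are placeholder values.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Minimal patch embedding: raw pixels -> visual tokens for the LLM decoder.

    A sketch of the encoder-free design; the patch size and hidden width are
    illustrative placeholders, not Tuna-2's reported hyperparameters.
    """

    def __init__(self, patch_size: int = 16, in_channels: int = 3, hidden_dim: int = 3584):
        super().__init__()
        # A strided convolution is equivalent to a linear projection of
        # non-overlapping image patches.
        self.proj = nn.Conv2d(in_channels, hidden_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (B, 3, H, W) -> tokens: (B, H/p * W/p, hidden_dim)
        x = self.proj(images)                  # (B, D, H/p, W/p)
        return x.flatten(2).transpose(1, 2)    # (B, N, D)


# The resulting visual tokens are concatenated with text token embeddings and
# processed jointly by the decoder-only transformer.
tokens = PatchEmbed()(torch.randn(1, 3, 256, 256))  # shape (1, 256, 3584)
```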

Pixel-space image generation. Our VAE-free design allows us to directly perform multimodal understanding and text generation using the LLM decoder and the language modelling head. However, discarding the VAE also means that we can no longer adopt the designs from existing UMMs and generation-only models that follow the latent diffusion architecture. To effectively perform pixel-space image generation, we adopt the $x$-prediction and $v$-loss paradigm from JiT (li2025back) for pixel-space flow matching. Specifically, given the source image $x$, the sampled noise $\epsilon$ and the timestep $t$, we employ rectified flow and its linear schedule to construct a noisy sample in pixel space:

$$x_t = (1 - t)\, x + t\, \epsilon. \quad (1)$$

Tuna-2 is then formulated to directly predict the clean image $\hat{x}$ from the noisy image $x_t$ in pixel space:

$$\hat{x} = f_\theta(x_t, t, c), \quad (2)$$

where $f_\theta$ is our unified model (vision-language backbone and flow matching head) and $c$ is the conditioning signal (text for text-to-image generation and text+image for image editing). As suggested in JiT, while our model directly predicts $\hat{x}$, we still transform it into the velocity term $\hat{v}$ and regress $v$ as our learning objective:

$$\mathcal{L} = \mathbb{E}_{x,\, \epsilon,\, t}\!\left[\, \big\| \hat{v} - v \big\|_2^2 \,\right], \quad \hat{v} = \frac{x_t - \hat{x}}{t}, \quad (3)$$

where the ground-truth velocity is defined by $v = \epsilon - x$. During inference, we employ the Euler solver and predict the denoised image at $t_{i-1}$ from the noisier image at $t_i$ based on the velocity term $\hat{v}$, such that $x_{t_{i-1}} = x_{t_i} - (t_i - t_{i-1})\, \hat{v}$, where $\hat{v}$ is transformed from our model prediction $\hat{x}$ based on Equation 3.
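A minimal sketch of this pixel-space training objective and the corresponding Euler sampler is given below, following the equations above. It assumes a placeholder `model` that returns a clean-image prediction with the same shape as the input; the timestep sampling and step count are illustrative choices, not Tuna-2’s actual settings.

```python
import torch

def flow_matching_loss(model, x, cond, eps=1e-5):
    """Pixel-space rectified-flow loss with x-prediction and v-regression.

    A sketch of Eqs. (1)-(3); `model` is a placeholder for the unified
    backbone plus flow matching head.
    """
    b = x.shape[0]
    t = torch.rand(b, device=x.device).clamp(min=eps)   # timestep in (0, 1]
    t_ = t.view(b, 1, 1, 1)
    noise = torch.randn_like(x)

    x_t = (1.0 - t_) * x + t_ * noise                    # Eq. (1): linear schedule
    x_hat = model(x_t, t, cond)                          # Eq. (2): x-prediction

    v_hat = (x_t - x_hat) / t_                           # convert x_hat to velocity
    v = noise - x                                        # ground-truth velocity
    return ((v_hat - v) ** 2).mean()                     # Eq. (3): v-loss


@torch.no_grad()
def euler_sample(model, cond, shape, steps=50, device="cuda"):
    """Euler sampler: integrate from pure noise (t=1) back to a clean image (t=0)."""
    x_t = torch.randn(shape, device=device)
    ts = torch.linspace(1.0, 0.0, steps + 1, device=device)
    for i in range(steps):
        t_i, t_prev = ts[i], ts[i + 1]
        t_b = torch.full((shape[0],), t_i.item(), device=device)
        x_hat = model(x_t, t_b, cond)
        v_hat = (x_t - x_hat) / t_i                      # Eq. (3) transform
        x_t = x_t - (t_i - t_prev) * v_hat               # Euler step toward t=0
    return x_t
```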


Figure 3: Illustration of our proposed masking-based feature learning scheme. During training, we use the learnable mask token to regularize multimodal understanding and perform masked prediction for visual generation.

2.2 Learning Robust Visual Representations via Masking

While removing the VAE simplifies our model architecture and enables fully end-to-end unified multimodal training, it also shifts visual modelling from a compact latent space to the much higher-dimensional pixel space. As a result, learning a unified visual representation becomes more challenging: the increased redundancy in pixel-space inputs makes it easier for the model to rely on superficial shortcuts, rather than learning visual cues that are genuinely informative for both understanding and generation. To learn more robust visual representations in pixel space, we introduce a masking-based visual feature learning scheme. As shown in Figure 3, during training, we (optionally) randomly select a subset of image patches according to a masking ratio and replace the masked visual tokens with a learnable mask token before feeding them into the LLM decoder. The same masking operation is applied to both generation and understanding examples, but plays different roles in the two settings:

  • For generation examples, we let the model predict the clean image patches in both the masked and the unmasked regions, such that (1) we create a harder denoising problem for the model to predict clean images from partially observed noisy images; and (2) it encourages the learnable mask token to absorb useful information for image reconstruction conditioned on the visible context.
  • For understanding examples, our model predicts the ground truth text response based on the masked visual input. In this case, masking serves as a regularization mechanism that forces the model to perform multimodal reasoning under partial visual observation, leading to more robust visual representations.

Our masking-based feature learning scheme resembles masked modelling methods in visual understanding and generation, such as MAE (he2022masked) and SigLIP 2 (tschannen2025siglip) for semantic learning and MaskGIT (chang2022maskgit) and DeTok (yang2025latent) for visual generation. Empirically, we find that applying masking leads to enhanced model performance during pretraining stages.
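The sketch below illustrates the masking operation itself: a random subset of visual tokens, selected according to a masking ratio, is replaced by a shared learnable mask token before entering the LLM decoder. The masking ratio and dimensions are illustrative placeholders rather than Tuna-2’s reported settings.

```python
import torch
import torch.nn as nn

def apply_token_masking(visual_tokens: torch.Tensor,
                        mask_token: nn.Parameter,
                        mask_ratio: float = 0.3) -> torch.Tensor:
    """Replace a random subset of visual tokens with a learnable mask token.

    A sketch of the masking scheme described above; the 0.3 ratio is an
    illustrative placeholder. visual_tokens: (B, N, D), mask_token: (D,).
    """
    b, n, _ = visual_tokens.shape
    # Sample a boolean mask per example: True means "replace with mask token".
    mask = torch.rand(b, n, device=visual_tokens.device) < mask_ratio  # (B, N)
    return torch.where(mask.unsqueeze(-1),
                       mask_token.view(1, 1, -1),
                       visual_tokens)


# For generation examples the model still predicts clean patches in both masked
# and unmasked regions; for understanding examples the text loss is computed on
# the response while the visual input is only partially observed.
mask_token = nn.Parameter(torch.zeros(3584))
masked = apply_token_masking(torch.randn(2, 256, 3584), mask_token)
```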

2.3 Training Pipeline

Our encoder-free design enables fully end-to-end training of Tuna-2, without requiring separate stages to train connector layers, which is a common design in encoder-based modular approaches. As described below, our training pipeline consists of two stages, both of which are carried out in a fully end-to-end manner:

Stage 1: full model pretraining. In the first stage, we aim to establish a strong initialization for the flow matching head, and adapt pixel-space visual inputs for unified multimodal understanding and generation. To achieve this, we train the full model jointly on two tasks: image captioning and text-to-image generation.

Stage 2: supervised finetuning (SFT). Next, we perform supervised fine-tuning (SFT) of the full model with a lower learning rate. We use datasets for image editing, image instruction-following, and high-quality image generation. This step refines Tuna-2’s abilities, boosting performance and generalization across various multimodal tasks.

For Tuna-R, which includes a connector layer between the representation encoder and the LLM decoder, we add an extra alignment stage before Stage 1. In this stage, we train only the connector layer for a short period using image captioning and text-to-image generation data. As noted above, Tuna-2 does not require this additional stage because of its encoder-free design.
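As a rough illustration of how both end-to-end stages route mixed data to the two objectives, the sketch below dispatches each batch to either next-token prediction or pixel-space flow matching. The function and field names (`forward_text`, `batch["task"]`, etc.) are hypothetical, not Tuna-2’s actual interfaces, and `flow_matching_loss` refers to the earlier sketch in Section 2.1.

```python
import torch.nn.functional as F

def unified_training_step(model, batch):
    """One hypothetical end-to-end step over a mixed multimodal batch.

    A sketch only: `model` is assumed to expose text logits for understanding
    samples and a clean-image prediction for generation samples.
    """
    if batch["task"] in ("captioning", "instruction_following", "text_only"):
        # Understanding / text-only data: autoregressive next-token prediction
        # with the language modelling head.
        logits = model.forward_text(batch["tokens"], batch.get("images"))
        loss = F.cross_entropy(
            logits.flatten(0, 1), batch["labels"].flatten(), ignore_index=-100
        )
    else:
        # Generation / editing data: pixel-space flow matching with the
        # flow matching head (see the flow_matching_loss sketch above).
        loss = flow_matching_loss(model, batch["target_image"], batch["condition"])
    return loss
```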

3 Experiments

3.1 Experiment Setup

We employ Qwen2.5-7B-Instruct (qwen2024qwen2) as the LLM decoder for Tuna-2. For Stage 1 pretraining, we use 550M in-house image-text pairs, consisting of 70% image captioning data for multimodal understanding and 30% text-to-image generation data. In addition, we include text-only data from Nemotron (bercovich2025llama), which accounts for 20% of the total pretraining data. The full model is trained end-to-end for 300k steps on 64 nodes with the AdamW optimizer (loshchilov2017decoupled) and a learning rate of . For Stage 2 supervised finetuning, we use a curated SFT corpus covering image instruction-following, image editing, and high-quality image generation. Specifically, for image instruction-following, we include 13M conversational examples from the open-source FineVision (wiedmann2025finevision) dataset. For image editing, we use approximately 2M examples from OmniEdit (wei2024omniedit). This stage is trained for 50k steps with AdamW and a learning rate of . For all training stages, we pad the input sequence length to 16k tokens per GPU.

For Tuna-R, we use the same Qwen2.5-7B-Instruct as the LLM decoder. We follow Tuna and adopt SigLIP 2 So400M (tschannen2025siglip) as the representation encoder. For the connector-alignment stage in Tuna-R, we train the model for 3k steps with AdamW and a learning rate of .
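For reference, the schedule described above can be summarized in a configuration sketch. Field names and structure are illustrative; the values follow this section, and the per-stage learning rates are omitted.

```python
# Hypothetical configuration summary of the training setup described above;
# the structure is illustrative, the values follow the text.
TRAINING_CONFIG = {
    "llm_decoder": "Qwen2.5-7B-Instruct",
    "max_sequence_length": 16_384,        # padded tokens per GPU
    "optimizer": "AdamW",
    "stage1_pretraining": {
        "steps": 300_000,
        "nodes": 64,
        "image_text_pairs": 550_000_000,  # in-house data
        "captioning_fraction": 0.70,      # share of the image-text data
        "text_to_image_fraction": 0.30,
        "text_only_fraction": 0.20,       # Nemotron, share of total data
    },
    "stage2_sft": {
        "steps": 50_000,
        "instruction_following": "FineVision (13M examples)",
        "image_editing": "OmniEdit (~2M examples)",
        "also_includes": "high-quality image generation data",
    },
    "tuna_r_connector_alignment": {       # extra stage for Tuna-R only
        "vision_encoder": "SigLIP 2 So400M",
        "steps": 3_000,
    },
}
```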

Table 1: Comparisons between Tuna-2 and baseline models on multimodal understanding benchmarks. Results with model size greater than 13B are grayed. Bold: best results among all UMMs. Underline: second-best among all UMMs.

Benchmark columns span general benchmarks (left group) and pixel-centric benchmarks (right group).

| Model | Size | GQA | RealWorldQA | MMVet | MMMU | MMVP | SEED-Bench2+ | AI2D | ChartQA | OCRBench | V* | CountBench | VisuLogic |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Understanding-only Models (LMMs) | | | | | | | | | | | | | |
| LLaVA-1.5 (liu2023visual) | 7B | 62.0 | 54.8 | 32.9 | 35.7 | - | - | 55.5 | 17.8 | 31.8 | - | - | - |
| Qwen-VL-Chat (bai2023qwen) | 7B | 57.5 | 49.3 | 47.3 | 37.0 | - | - | 57.7 | 49.8 | 48.8 | - | - | - |
| LLaVA-OV (li2024llava) | 7B | - | 69.9 | 51.9 | 48.8 | 77.3 | 62.2 | 81.4 | 80.9 | 62.2 | 72.7 | 76.2 | 24.8 |
| Qwen2.5-VL (li2024llava) | 7B | 60.7 | 69.9 | 61.7 | 58.6 | 78.0 | 70.5 | 82.7 | 83.0 | 83.7 | 71.2 | 74.1 | 20.0 |
| Composite UMMs | | | | | | | | | | | | | |
| TokenFlow-XL (qu2025tokenflow) | 14B | 62.5 | 56.6 | - | 43.2 | - | - | - | - | - | - | - | - |
| BLIP3-o (chen2025blip3) | 4B | - | 60.4 | - | 46.6 | - | - | - | - | - | - | - | - |
| Tar (han2025vision) | 7B | 61.3 | - | - | 39.0 | 74.3 | 46.2 | - | - | - | 41.4 | 64.2 | 24.3 |
| X-Omni (geng2025x) | 7B | 62.8 | 62.6 | - | 47.2 | - | - | 76.8 | 81.5 | 70.4 | - | - | - |
| Native UMMs | | | | | | | | | | | | | |
| BAGEL (deng2025emerging) | 14B | 66.4 | 72.8 | 67.2 | 55.3 | 85.0 | 71.9 | 89.2 | 78.5 | 73.3 | 70.2 | 82.5 | 41.7 |
| Ming-UniVision (huang2025mingunivision) | 16B | 59.4 | 59.1 | 64.2 | 40.3 | 71.0 | 56.8 | 82.8 | 76.7 | 72.4 | 48.2 | 76.8 | 26.7 |
| Harmon (wu2025harmonizing) | 1.5B | 58.9 | 49.8 | - | 38.9 | 61.7 | 41.6 | 57.0 | 29.8 | 11.2 | 41.9 | 67.0 | 25.1 |
| JanusFlow (ma2025janusflow) | 1.3B | 60.3 | 41.2 | 36.2 | 29.3 | 67.7 | 39.8 | 54.2 | 42.4 | 53.2 | 42.9 | 78.6 | 22.0 |
| Emu3 (wang2024emu3) | 8B | 60.3 | 57.4 | 23.5 | 31.6 | 71.0 | 44.6 | 70.0 | 69.4 | 68.7 | 53.4 | 65.2 | 24.7 |
| VILA-U (wu2024vila) | 7B | 60.8 | 46.8 | 26.3 | 31.2 | 62.7 | 31.9 | 49.0 | 11.4 | 23.3 | 38.7 | 55.2 | 25.4 |
| Janus-Pro (chen2025janus) | 7B | 62.0 | 58.0 | 41.1 | 41.0 | 73.3 | 56.3 | 71.3 | 25.8 | 59.0 | 47.6 | 53.2 | 23.8 |
| Show-o2 (xie2025show) | 7B | 63.1 | 64.7 | 39.6 | 48.9 | 76.7 | 61.3 | 78.6 | 52.3 | 32.4 | 44.5 | 63.5 | 26.9 |
| OneCat (li2025onecat) | 9B | 63.1 | 65.2 | 52.2 | 41.9 | 71.3 | 61.6 | 77.8 | 81.2 | 79.0 | 63.4 | 34.2 | 24.9 |
| Tuna (liu2025tuna) | 7B | 63.9 | 66.1 | 42.9 | 49.8 | 70.7 | 52.7 | 79.3 | 85.8 | 74.3 | 52.4 | 73.5 | 22.4 |
| Tuna-R | 7B | 63.5 | 67.9 | 46.7 | 51.1 | 74.7 | 58.4 | 79.4 | 85.6 | 78.3 | 57.6 | 77.8 | 26.2 |
| Tuna-2 | 7B | 65.0 | 67.7 | 51.7 | 50.7 | 77.3 | 61.1 | 79.6 | 85.6 | 79.7 | 59.2 | 81.7 | 28.8 |


Table 2: Image generation results on GenEval and DPG-Bench. “Col. Attr.” means “Color Attribute”. † refers to methods using LLM rewriters in GenEval. Bold: best results among native UMMs. Underline: second-best.