
PaperBanana: Automating Academic Illustration for AI Scientists
Dawei Zhu1 2 *, Rui Meng2, Yale Song2, Xiyu Wei1, Sujian Li1, Tomas Pfister2 and Jinsung Yoon2
1Peking University, 2Google Cloud AI Research
https://dwzhu-pku.github.io/PaperBanana/
Despite rapid advances in autonomous AI scientists powered by language models, generating publication-ready illustrations remains a labor-intensive bottleneck in the research workflow. To lift this burden, we introduce PaperBanana, an agentic framework for the automated generation of publication-ready academic illustrations. Powered by state-of-the-art VLMs and image generation models, PaperBanana orchestrates specialized agents to retrieve references, plan content and style, render images, and iteratively refine via self-critique. To rigorously evaluate our framework, we introduce PaperBananaBench, comprising 292 test cases for methodology diagrams curated from NeurIPS 2025 publications, covering diverse research domains and illustration styles. Comprehensive experiments demonstrate that PaperBanana consistently outperforms leading baselines in faithfulness, conciseness, readability, and aesthetics. We further show that our method effectively extends to the generation of high-quality statistical plots. Collectively, PaperBanana paves the way for the automated generation of publication-ready illustrations.

[Figure 1 panels: Methodology Diagrams (top); Statistical Plots (bottom)]
Figure 1 | Examples of methodology diagrams and statistical plots generated by PaperBanana, which show the potential of automating the generation of academic illustrations.
1. Introduction
Autonomous scientific discovery is a long-standing pursuit of artificial general intelligence (Ghahramani, 2015; Langley, 1987, 2024; Schmidhuber, 2010). With the rapid evolution of Large Language Models (LLMs) (Anthropic, 2025; Comanici et al., 2025; Liu et al., 2024; OpenAI, 2025b; Yang et al., 2025a), autonomous AI Scientists have demonstrated the potential to automate many facets of the research lifecycle, such as literature review, idea generation, and experiment iteration (Gottweis et al., 2025; Lu et al., 2024; Luo et al., 2025). Yet scientific discoveries achieve their full value only through effective communication. Despite their proficiency in textual analysis and code execution, current autonomous AI scientists struggle to visually communicate discoveries, especially for generating illustrations (diagrams and plots) that adhere to the rigorous standards of academic manuscripts.
Among these illustration tasks, generating methodology diagrams represents a significant challenge, demanding both content fidelity and visual aesthetics. Prior endeavors for diagram generation have predominantly adopted the code-based paradigm, leveraging TikZ (Belouadi and Eger, 2024; Belouadi et al., 2025), Python-PPTX (Zheng et al., 2025), or SVG to programmatically synthesize diagrams. While effective for structured content, these methods can encounter expressiveness limitations when attempting to produce the intricate visual elements – such as specialized icons and custom shapes – that are increasingly common in modern AI publications. Conversely, although recent image generation models (Deepmind, 2025; OpenAI, 2025a; Team et al., 2025; Wu et al., 2025a) have demonstrated advanced instruction-following capabilities and high-quality visual outputs, consistently generating academic illustrations that meet scholarly standards remains a difficult task (Zuo et al., 2025). Specialized expertise required for professional illustration tools often constrains researchers’ ability to freely express complex ideas, forcing them to invest substantial manual effort into crafting figures. This creates a significant bottleneck in the effective visual communication of scientific discoveries.
In this paper, we introduce PaperBanana, an agentic framework designed to bridge this gap by automating the production of high-quality academic illustrations. Given a methodology description and diagram caption as input, PaperBanana orchestrates specialized agents powered by state-of-the-art VLMs and image generation models (e.g., Gemini-3-Pro and Nano-Banana-Pro) to retrieve reference examples, devise detailed plans for content and style, render images, and iteratively refine via self-critique. This reference-driven collaborative workflow allows the system to effectively master the logical composition and stylistic norms required for publication-ready illustrations. Beyond methodology diagrams, our framework demonstrates significant versatility by extending to statistical plots, offering a comprehensive solution for scientific visualization.
To rigorously evaluate our framework and address the absence of dedicated benchmarks for automated academic illustration, we introduce PaperBananaBench, a comprehensive benchmark for methodology diagram generation. The benchmark comprises 292 test cases and 292 reference cases curated from NeurIPS 2025 publications, spanning diverse research topics and illustration styles. To assess generation quality, we employ a VLM-as-a-Judge approach for reference-based scoring against human illustrations across four dimensions: faithfulness, conciseness, readability, and aesthetics, with reliability verified through correlation with human judgments.
Comprehensive experiments on our benchmark demonstrate the effectiveness of PaperBanana. Our method consistently outperforms leading baselines across all four evaluation dimensions (faithfulness, conciseness, readability, and aesthetics) as well as the aggregated overall score for diagram generation. We further show that our method seamlessly extends to statistical plots. Collectively, our method paves the way for automating the generation of academic illustrations (examples shown in Figure 1). As a demonstration of its capability, figures marked accordingly in this manuscript were entirely generated using PaperBanana. Additionally, we discuss intriguing settings, including using our framework to enhance existing human-created illustrations and using image generation models for statistical plot generation. To sum up, our contributions are:
• We propose PaperBanana, a fully automated agentic framework that orchestrates specialized agents to generate publication-ready academic illustrations.
• We construct PaperBananaBench to assess the quality of academic illustrations, particularly methodology diagrams.
• Comprehensive experiments show that our workflow significantly outperforms leading baselines, showing promise for automating the generation of academic illustrations.
2. Task Formulation
We formalize the task of automated academic illustration generation as learning a mapping from a source context and a communicative intent to a visual representation. Let $C$ denote the source context containing the essential information, and $I$ denote the communicative intent that specifies the scope and focus of the desired illustration. The goal is to generate an image $V$ that faithfully visualizes $C$ while fulfilling the communicative intent $I$, formulated as:

$$V = \mathcal{F}(C, I).$$

To further guide the mapping function, the input can be optionally augmented by a set of reference examples $E = \{(C_k, I_k, V_k)\}_{k=1}^{K}$. Each example serves as a ground-truth demonstration, defined as a tuple $(C_k, I_k, V_k)$, where $V_k$ is the reference illustration corresponding to the context $C_k$ and communicative intent $I_k$. Integrating this, the unified task formulation becomes:

$$V = \mathcal{F}(C, I, E),$$

where $E$ defaults to $\emptyset$ when no examples are used (i.e., zero-shot generation).
Among various types of academic illustrations, this paper primarily focuses on the automated generation of methodology diagrams, which requires interpreting complex technical concepts and logical flows from textual descriptions into high-fidelity, visually pleasing illustrations. In this setting, the source context $C$ is the textual description of the method (e.g., methodology sections), and the communicative intent $I$ is the figure caption specifying the scope and focus (e.g., “Overview of our framework”).
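To make the formulation concrete, the input/output structure can be sketched as a pair of Python types. This is a minimal sketch: the class and field names below are illustrative, not part of the paper's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class ReferenceExample:
    """A ground-truth demonstration (C_k, I_k, V_k) from the reference set."""
    context: str      # C_k: methodology description of the reference paper
    intent: str       # I_k: caption of the reference figure
    image_path: str   # V_k: path to the reference illustration

@dataclass
class IllustrationTask:
    """Input to the mapping V = F(C, I, E)."""
    context: str   # C: source context (e.g., a methodology section)
    intent: str    # I: communicative intent (e.g., the figure caption)
    references: list[ReferenceExample] = field(default_factory=list)  # E; empty = zero-shot

# Zero-shot generation corresponds to E = ∅:
task = IllustrationTask(context="We propose a two-stage pipeline ...",
                        intent="Overview of our framework.")
```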
3. Methodology
In this section, we present the architecture of PaperBanana, a reference-driven agentic framework for automated academic illustration. As illustrated in Figure 2, PaperBanana orchestrates a collaborative team of five specialized agents—Retriever, Planner, Stylist, Visualizer, and Critic—to transform raw scientific content into publication-quality diagrams and plots. (See Appendix G for prompts)
Retriever Agent. Given the source context $C$ and the communicative intent $I$, the Retriever Agent identifies the most relevant examples from the fixed reference set $\mathcal{R}$ to guide the downstream agents. As defined in Section 2, each example is a triplet $(C_k, I_k, V_k)$. To leverage the reasoning capabilities of VLMs, we adopt a generative retrieval approach in which the VLM performs selection over candidate metadata:

$$E = \mathrm{Retriever}(C, I, \mathcal{R}).$$
Specifically, the VLM is instructed to rank candidates by matching both research domain (e.g., Agent & Reasoning) and diagram type (e.g., pipeline, architecture), with visual structure prioritized over topic similarity. By explicitly reasoning over which reference illustrations' contexts best match the current requirements, the Retriever provides a concrete foundation for both structural logic and visual style.

Figure 2 | [Generated by PaperBanana; the textual description used to reproduce this diagram is presented in Appendix E.] Overview of our PaperBanana framework. Given the source context and communicative intent, we first apply a Linear Planning Phase to retrieve relevant reference examples and synthesize a stylistically optimized description. We then use an Iterative Refinement Loop (consisting of Visualizer and Critic Agents) to transform the description into visual output and conduct multi-round refinements to produce the final academic illustration.
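The Retriever's generative selection can be sketched as a prompt over candidate metadata plus a parser for the VLM's ranked output. The prompt wording, the `candidates` record schema, and the JSON output format are assumptions for illustration; only the ranking criteria (research domain and diagram type, with type prioritized) come from the text.

```python
import json

def build_retrieval_prompt(context, intent, candidates, k=10):
    """Ask the VLM to rank reference candidates by metadata match,
    weighting diagram type (visual structure) over topic similarity."""
    lines = [
        "You are selecting reference diagrams for a new methodology figure.",
        f"Target caption: {intent}",
        f"Target method (truncated): {context[:2000]}",
        "Candidates (id, domain, diagram type):",
    ]
    for c in candidates:
        lines.append(json.dumps({"id": c["id"], "domain": c["domain"], "type": c["type"]}))
    lines.append(f"Return a JSON list of the {k} best candidate ids, "
                 "prioritizing matching diagram type over topic similarity.")
    return "\n".join(lines)

def parse_selection(vlm_output, candidates, k=10):
    """Map the VLM's JSON id list back to candidate records."""
    ids = set(json.loads(vlm_output))
    return [c for c in candidates if c["id"] in ids][:k]
```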
Planner Agent. The Planner Agent serves as the cognitive core of the system. It takes the source context $C$, communicative intent $I$, and retrieved examples $E$ as inputs. By performing in-context learning from the demonstrations in $E$, the Planner translates the unstructured or structured data in $C$ into a comprehensive and detailed textual description $D_0$ of the target illustration:

$$D_0 = \mathrm{Planner}(C, I, E).$$
Stylist Agent. To ensure the output adheres to the aesthetic standards of modern academic manuscripts, the Stylist Agent acts as a design consultant. A primary challenge lies in defining a comprehensive “academic style,” as manual definitions are often incomplete. To address this, the Stylist traverses the entire reference collection to automatically synthesize an Aesthetic Guideline $G$ covering key dimensions such as color palette, shapes and containers, lines and arrows, layout and composition, and typography and icons (see Appendix F for the summarized guideline and implementation details). Armed with this guideline, the Stylist refines each initial description $D_0$ into a stylistically optimized version $D_1$:

$$D_1 = \mathrm{Stylist}(D_0, G).$$
This ensures that the final illustration is not only accurate but also visually professional.
Visualizer Agent. After receiving the stylistically optimized description $D_1$, the Visualizer Agent collaborates with the Critic Agent to render academic illustrations and iteratively refine their quality. The Visualizer Agent leverages an image generation model to transform textual descriptions into visual output. In each iteration $t$, given a description $D_t$, the Visualizer generates:

$$V_t = \mathrm{Visualizer}(D_t),$$

where the initial description $D_1$ is set to the Stylist's output.
Critic Agent. The Critic Agent forms a closed-loop refinement mechanism with the Visualizer by closely examining the generated image and providing a refined description to the Visualizer. Upon receiving the generated image $V_t$ at iteration $t$, the Critic inspects it against the original source context $C$ to identify factual misalignments, visual glitches, or areas for improvement. It then provides targeted feedback and produces a refined description $D_{t+1}$ that addresses the identified issues:

$$D_{t+1} = \mathrm{Critic}(V_t, C, I, D_t).$$

This revised description is then fed back to the Visualizer for regeneration. The Visualizer-Critic loop iterates for $T$ rounds, with the final output being $V_T$. This iterative refinement process ensures that the final illustration meets the high standards required for academic dissemination.
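Once the model calls are abstracted away, the Visualizer-Critic loop reduces to a few lines. Here `visualize` and `critique` are hypothetical stand-ins for the image-generation and VLM calls; the paper does not specify their actual interfaces.

```python
def generate_with_refinement(context, intent, d1, visualize, critique, rounds=3):
    """Visualizer-Critic loop: render D_t into V_t, then let the Critic
    inspect V_t against the source context and intent to emit a refined
    description D_{t+1}. Returns the final image V_T after `rounds` renders."""
    description = d1                 # D_1: the stylistically optimized description
    image = visualize(description)   # V_1
    for _ in range(rounds - 1):
        description = critique(image, context, intent, description)  # D_{t+1}
        image = visualize(description)                               # V_{t+1}
    return image                     # V_T
```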
Extension to Statistical Plots. The framework extends to statistical plots by adjusting the Visualizer and Critic agents. For numerical precision, the Visualizer converts the description into executable Python Matplotlib code, $P_t = \mathrm{Visualizer}(D_t)$, which is executed to render the plot $V_t$. The Critic then evaluates the rendered plot and generates a refined description addressing inaccuracies or imperfections, $D_{t+1} = \mathrm{Critic}(V_t, C, I, D_t)$. The same $T$-round iterative refinement process applies. While we prioritize this code-based approach for accuracy, we also explore direct image generation in Section 6. See Appendix G.2 for adjusted prompts.
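For the statistical-plot variant, the Visualizer's output is a script rather than an image, so the render step becomes executing generated Matplotlib code, along the following lines. This is a minimal sketch; a production system would sandbox the execution and capture errors for the Critic.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for automated rendering
import matplotlib.pyplot as plt

def render_plot_code(code, out_path):
    """Execute Visualizer-generated Matplotlib code (P_t) and rasterize
    the resulting figure (V_t) to `out_path`."""
    namespace = {"plt": plt}       # expose pyplot to the generated script
    exec(code, namespace)          # the script draws on the current figure
    plt.savefig(out_path, dpi=300, bbox_inches="tight")
    plt.close("all")               # avoid leaking figures across iterations
    return out_path
```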
4. Benchmark Construction
The lack of benchmarks hinders rigorous evaluation of automated diagram generation. We address this with PaperBananaBench, a dedicated benchmark curated from NeurIPS 2025 methodology diagrams, capturing the sophisticated aesthetics and diverse logical compositions of modern AI papers. We detail the construction pipeline and evaluation protocol below; dataset statistics are in Figure 3.
4.1. Data Curation
Collection & Parsing. We begin by randomly sampling 2,000 papers from the 5,275 publications at NeurIPS 2025 and retrieving their PDF files. We then use the MinerU toolkit (Niu et al., 2025) to parse these documents, extracting the text of the methodology sections together with all diagrams and their captions.
Filtering. We then apply a filtering stage to ensure data quality. First, we discard papers without methodology diagrams, yielding 1,359 valid candidates. Second, we restrict the aspect ratio to [1.5, 2.5]. Ratios below 1.5 are excluded because methodology diagrams typically require wider landscape layouts for logical flows, while ratios exceeding 2.5 are unsupported by current image generation models; including such outliers would also introduce bias in side-by-side evaluations by revealing the human origin of candidates. This yields 610 valid candidates, each a tuple $(C, V, I)$, where $C$ is the methodology description, $V$ is the methodology diagram, and $I$ is the caption.
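The aspect-ratio rule is straightforward to reproduce; the record schema below is an assumption for illustration.

```python
def keep_by_aspect_ratio(candidates, lo=1.5, hi=2.5):
    """Keep diagram records whose width/height ratio lies in [lo, hi],
    mirroring the benchmark's landscape-layout filtering rule."""
    kept = []
    for rec in candidates:
        ratio = rec["width"] / rec["height"]
        if lo <= ratio <= hi:
            kept.append(rec)
    return kept
```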


Figure 3 | [Generated by PaperBanana] Statistics of the test set of PaperBananaBench (totaling 292 samples). The average length of the source context / figure caption is 3,020.1 / 70.4 words.
Categorization. To facilitate future analysis of generating different types of diagrams, we further categorize the diagrams into four classes based on visual topology and content: Agent & Reasoning, Vision & Perception, Generative & Learning, and Science & Applications (see Appendix C for definitions). Gemini-3-Pro is used to perform the categorization, assigning samples with hybrid elements to their predominant category.
Human Curation. Finally, we conduct a human curation phase to guarantee the integrity and quality of the dataset. Annotators are tasked with verifying and correcting the extracted methodology descriptions and captions, validating the correctness of diagram categorizations, and filtering out diagrams of insufficient visual quality (e.g., overly simplistic, cluttered, or abstract designs). Following this rigorous process, 584 valid samples remain. We randomly partition these into two equal subsets: a test set (292 samples) for evaluation and a reference set (292 samples) to facilitate retrieval-augmented in-context learning.
4.2. Evaluation Protocol
We utilize VLM-as-a-Judge to assess the quality of methodology diagrams and statistical plots. Given the inherent subjectivity in evaluating visual design, we employ a referenced comparison approach where the judge compares the model-generated diagram against the human-drawn diagram to determine which better satisfies each evaluation criterion.
Evaluation Dimensions. Inspired by Quispel et al. (2018), we evaluate diagrams along two perspectives: content and presentation. Detailed rubrics for each dimension are provided in Appendix H.
• Content (Faithfulness & Conciseness): Faithfulness ensures alignment with the source context (methodology description) and communicative intent (caption), while Conciseness requires focusing on core information without visual clutter.
• Presentation (Readability & Aesthetics): Readability demands intelligible layouts, legible text, no excessive crossing lines, etc. Aesthetics evaluates adherence to the stylistic norms of academic manuscripts.
Referenced Scoring. For each dimension, the VLM judge compares the model-generated diagram against the human reference given the context and caption. It determines Model wins, Human wins, or Tie based on relative quality, which are then mapped to scores of 100, 0, and 50, respectively. To aggregate scores into an overall metric, we follow the design principle that information visualization must primarily “show the truth” (Mackinlay, 1986; Quispel et al., 2018; Tufte and Graves-Morris, 1983). We employ a hierarchical aggregation strategy, designating faithfulness and readability as primary dimensions, and conciseness and aesthetics as secondary. If primary dimensions yield a decisive winner (i.e., winning both, or winning one with a tie), this determines the overall winner. In case of a tie (e.g., each wins one, or both tie), we apply the same rule to the secondary dimensions. This hierarchical approach ensures that content fidelity and clarity take precedence over aesthetics and conciseness.
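The win/tie/loss mapping and the hierarchical aggregation rule can be expressed directly in code. This is a sketch of the rule as described; the label names are chosen for illustration.

```python
SCORE = {"model": 100, "tie": 50, "human": 0}  # per-dimension winner -> score

def overall_winner(faith, read, concise, aesth):
    """Hierarchical aggregation: the primary dimensions (faithfulness,
    readability) decide the overall winner when decisive; otherwise the
    secondary dimensions (conciseness, aesthetics) break the tie.
    Each argument is a label: 'model', 'human', or 'tie'."""
    def decide(a, b):
        total = SCORE[a] + SCORE[b]   # 200 or 150 -> model; 50 or 0 -> human
        if total > 100:
            return "model"            # wins both, or wins one with a tie
        if total < 100:
            return "human"
        return "tie"                  # each wins one, or both tie
    primary = decide(faith, read)
    return primary if primary != "tie" else decide(concise, aesth)
```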
5. Experiments
5.1. Baseline Methods and Models
We compare PaperBanana against three baseline settings: (1) Vanilla, directly prompting the image generation model to generate diagrams based on the input context (methodology description and caption); (2) Few-shot, building upon the vanilla baseline by augmenting the prompt with 10 few-shot examples, where each example consists of a triplet (methodology description, caption, diagram) to enable in-context learning for the image generation model; (3) Paper2Any (Liu et al., 2025), an agentic framework that generates diagrams to present the high-level ideas of papers, which is the closest to our setting. For the VLM backbone, we default to Gemini-3-Pro; for the image generation model, we experiment with Nano-Banana-Pro and GPT-Image-1.5. (See Appendix C for more implementation details.)

Table 1 | Main results on PaperBananaBench. The best score in each column is in bold.

| Method | Faithfulness ↑ | Conciseness ↑ | Readability ↑ | Aesthetic ↑ | Overall ↑ |
| --- | --- | --- | --- | --- | --- |
| *Vanilla Settings* | | | | | |
| GPT-Image-1.5 | 4.5 | 37.5 | 30.0 | 37.0 | 11.5 |
| Nano-Banana-Pro | 43.0 | 43.5 | 38.5 | 65.5 | 43.2 |
| Few-shot Nano-Banana-Pro | 41.6 | 49.6 | 37.6 | 60.5 | 41.8 |
| *Agentic Frameworks* | | | | | |
| Paper2Any (w/ Nano-Banana-Pro) | 6.5 | 44.0 | 20.5 | 40.0 | 8.5 |
| *PAPERBANANA (Ours)* | | | | | |
| w/ GPT-Image-1.5 | 16.0 | 65.0 | 33.0 | 56.0 | 19.0 |
| w/ Nano-Banana-Pro | **45.8** | **80.7** | **51.4** | **72.1** | **60.2** |
| Human | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 |
5.2. Evaluation Settings
Evaluating the quality of generated diagrams demands strong visual perception and understanding capabilities, particularly for the Faithfulness dimension, which requires accurately identifying and interpreting subtle modules and connections. Hence, we employ Gemini-3-Pro as our VLM-based Judge. To validate its reliability, we randomly sampled 50 cases (25 from vanilla and 25 from our method) and conducted a two-fold validation process:
Inter-Model Agreement (Consistency). First, we verify that our evaluation protocol is robust and model-agnostic. We evaluated the agreement between our judge (Gemini-3-Pro) and other distinct VLMs (Gemini-3-Flash and GPT-5). Kendall's tau correlations with Gemini-3-Flash across the four dimensions (Faithfulness, Conciseness, Readability, Aesthetic) and their aggregation are 0.51, 0.60, 0.45, 0.56, and 0.55, respectively; correlations with GPT-5 are 0.43, 0.47, 0.44, 0.42, and 0.45, respectively. This confirms the consistency of our protocol across different judge models.
Human Alignment (Validity). Second, we verify that our VLM judge is a valid proxy for human evaluation. We tasked two human annotators to independently perform reference-based scoring on the same 50 samples using the same rubrics, followed by a discussion to reach consensus on conflicting cases. Kendall’s tau correlations between Gemini-3-Pro and human annotations are 0.43, 0.57, 0.45, 0.41, and 0.45, respectively. These strong correlations demonstrate that our VLM-based judge aligns well with human perception. (See Appendix B for more details.)
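For reference, Kendall's tau over paired judge scores can be computed as follows. This is a minimal tau-a implementation; libraries such as SciPy provide tie-corrected variants (tau-b), and the paper does not specify which variant was used.

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall's tau-a over two paired score lists (len >= 2):
    (concordant - discordant) / total pairs. Tied pairs count as neither."""
    assert len(xs) == len(ys) and len(xs) >= 2
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        product = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if product > 0:
            concordant += 1    # pair ordered the same way in both lists
        elif product < 0:
            discordant += 1    # pair ordered oppositely
    n_pairs = len(xs) * (len(xs) - 1) / 2
    return (concordant - discordant) / n_pairs
```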
5.3. Main Results
Table 1 summarizes the performance of our method and the baselines on PaperBananaBench. PaperBanana consistently outperforms the leading baselines across all metrics. We attribute the poor performance of GPT-Image in both the vanilla and agentic settings to its weaker instruction-following and text-rendering capabilities compared to Nano-Banana-Pro, which fail to meet the strict requirements of academic illustration. Similarly, while Paper2Any also supports generating paper figures, it prioritizes the presentation of high-level ideas rather than the faithful depiction of the specific methodological flows necessary for methodology diagrams. This objective mismatch leads to its underperformance in our evaluation setting.

Table 2 | Ablation study on PaperBananaBench. Row ① indicates the default setting of PaperBanana. We systematically ablate each agent component to assess its contribution. The symbol ○ denotes the Random Retriever, which randomly selects 10 examples instead of performing semantic retrieval.

| # | Retriever | Planner | Stylist | Visualizer | Critic | Faithfulness ↑ | Conciseness ↑ | Readability ↑ | Aesthetic ↑ | Overall ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ① | ✓ | ✓ | ✓ | ✓ | 3 iters | 45.8 | 80.7 | 51.4 | 72.1 | 60.2 |
| ② | ✓ | ✓ | ✓ | ✓ | 1 iter | 38.3 | 75.2 | 50.6 | 68.9 | 51.8 |
| ③ | ✓ | ✓ | ✓ | ✓ | - | 30.7 | 79.2 | 47.0 | 72.1 | 45.6 |
| ④ | ✓ | ✓ | - | ✓ | - | 39.2 | 61.7 | 47.9 | 67.4 | 49.2 |
| ⑤ | ○ | ✓ | - | ✓ | - | 37.3 | 62.7 | 51.1 | 65.6 | 48.3 |
| ⑥ | - | ✓ | - | ✓ | - | 41.9 | 58.6 | 43.1 | 62.9 | 44.2 |
In contrast, PaperBanana achieves comprehensive improvements over the Vanilla Nano-Banana-Pro baseline: Faithfulness (43.0 → 45.8), Conciseness (43.5 → 80.7), Readability (38.5 → 51.4), and Aesthetics (65.5 → 72.1), contributing to a 17.0-point gain in the Overall score (43.2 → 60.2). Regarding performance across categories, Agent & Reasoning achieves the highest overall score, followed by Science & Applications and Generative & Learning, while Vision & Perception scores the lowest. We also conducted a blind human evaluation on a subset of 50 cases to compare PaperBanana against vanilla Nano-Banana-Pro (see Appendix B for details); the average win / tie / loss rates from 3 human judges favor PaperBanana. This further validates that our agentic workflow shows promising improvements in automated methodology diagram generation. (See Appendix Figure 7 for case studies.)
Despite the progress, we note that PaperBanana still underperforms the human reference in terms of faithfulness. We have included some failure analysis in Appendix Figure 10 to provide insights into the challenges.
5.4. Ablation Study
To understand the contribution of each agent component, we conduct an ablation study, with results presented in Table 2.
Impact of the Retriever Agent. We compare the semantic retriever with random and no-retriever baselines (rows ④–⑥ in Table 2). Without reference examples as guidance, the no-retriever setting significantly underperforms in Conciseness, Readability, and Aesthetics, as the Planner defaults to verbose, exhaustive descriptions. Moreover, lacking exposure to academic diagram aesthetics, this setting produces visually less refined outputs. Interestingly, the random retriever achieves performance comparable to the semantic approach, suggesting that providing general structural and stylistic patterns is more critical than precise content matching.
Impact of the Stylist and Critic Agents. Comparing rows ③ and ④ shows that the Stylist boosts Conciseness (61.7 → 79.2) and Aesthetics (67.4 → 72.1) but lowers Faithfulness (39.2 → 30.7), as visual polishing sometimes omits technical details. However, the Critic Agent (row ② vs. ③) effectively bridges this gap, substantially recovering Faithfulness (30.7 → 38.3). Additional iterations (row ① vs. ②) further enhance all metrics, ensuring a balance between aesthetics and technical accuracy.

Figure 4 | [Generated by PaperBanana] Vanilla Gemini-3-Pro vs. PaperBanana for statistical plot generation.

Figure 5 | [Generated by PaperBanana] Coding vs. Image Generation for visualizing statistical plots.
5.5. PaperBanana for Statistical Plot Generation
PaperBanana operates by first synthesizing a detailed description of the target illustration, then visualizing it into an image. Unlike methodology diagrams that prioritize aesthetics and logical coherence, statistical plots demand rigorous numerical precision, making standard image generation models unsuitable. To address this, we demonstrate that by adopting executable code for visualization, PaperBanana seamlessly extends to statistical plot generation.
Testset Curation. Following the task formulation in Section 2, we assess PaperBanana’s capability to generate statistical plots from tabular data and brief visual descriptions. Since raw data of statistical plots is rarely available in academic manuscripts, we repurpose ChartMimic (Yang et al., 2025b), a dataset originally constructed for chart-to-code generation. This dataset primarily includes statistical plots from arXiv papers and Matplotlib galleries, paired with human-curated Python code. Leveraging Gemini-3-Pro, we extract the underlying tabular data from the code and synthesize a brief description for each plot. Following rigorous filtering and sampling (see Appendix D), we curate 240 test cases and 240 reference examples, stratified across seven plot categories—bar chart, line chart, tree & pie chart, scatter plot, heatmap, radar chart, and miscellaneous—and two complexity levels (easy and hard). For evaluation, we adhere to the protocol detailed in Section 4, with prompts specifically tailored to statistical plots.
Figure 4 compares PaperBanana with vanilla Gemini-3-Pro on our curated test set. Our method consistently outperforms the baseline in Faithfulness, Conciseness, Readability, and Aesthetics, as well as in the Overall score. Notably, PaperBanana slightly surpasses human performance in Conciseness, Readability, and Aesthetics while remaining competitive in Faithfulness, showcasing its effectiveness for statistical plots.
6. Discussion
6.1. Enhancing Aesthetics of Human-Drawn Diagrams
Given the summarized aesthetic guideline $G$ (Section 3), an intriguing question arises: can these guidelines serve to elevate the aesthetic quality of existing human-drawn diagrams? To explore this, we implement a streamlined pipeline where Gemini-3-Pro first formulates up to 10 actionable suggestions based on the original diagram and $G$, which are then executed by Nano-Banana-Pro to refine the image. We evaluate the results using our reference-based protocol, comparing the refined output against the original human-drawn diagram. Across the 292 test cases, the refined diagrams achieve a favorable win / tie / loss ratio in aesthetics against their original counterparts, showing that the summarized aesthetic guidelines can indeed elevate the aesthetic quality of existing human-authored diagrams. An illustrative example is provided in Figure 6; more examples are provided in Appendix Figure 8.
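The enhancement pipeline is a simple two-step composition, sketched below. Here `suggest` and `edit` are hypothetical stand-ins for the Gemini-3-Pro and Nano-Banana-Pro calls; only the structure (suggestions capped at 10, then applied by the editor) comes from the text.

```python
def enhance_diagram(image, guideline, suggest, edit, max_suggestions=10):
    """Two-step enhancement: a VLM drafts up to `max_suggestions` actionable
    edits from the diagram and the aesthetic guideline G, and an
    image-editing model applies them to produce the refined diagram."""
    suggestions = suggest(image, guideline)[:max_suggestions]  # cap at 10
    return edit(image, suggestions)
```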

Figure 6 | Example of enhancing the aesthetics of human-drawn diagrams.
6.2. Coding vs Image Generation for Visualizing Statistical Plots
For statistical plots, code-based approaches have demonstrated remarkable efficacy, as evidenced by Figure 4 and prior studies (Chen et al., 2025; Goswami et al., 2025; Yang et al., 2024). Given the advanced fidelity and visual appeal of recent image generation models, we compare code-based (Gemini-3-Pro) and image-generation-based (Nano-Banana-Pro) approaches for the Visualizer agent in PaperBanana, as shown in Figure 5. Results reveal distinct trade-offs: image generation excels in presentation (Readability and Aesthetics) but underperforms in content fidelity (Faithfulness and Conciseness). Manual inspection shows that while image models faithfully render sparse plots, they struggle with dense or complex data, exhibiting numerical hallucinations or element repetition (Appendix Figure 9). Thus, a hybrid strategy that uses image generation for sparse visualizations and code for dense plots may offer the best balance.
7. Related Work
7.1. Automated Academic Diagram Generation.
Automated academic diagram generation remains a long-standing challenge (Rodriguez et al., 2023). Prior work primarily adopts code-based generation using TikZ (Belouadi and Eger, 2024; Belouadi et al., 2023, 2025; Zhang et al., 2025) or Python-PPT (Pang et al., 2025; Zheng et al., 2025) for programmatic synthesis. While effective for structured content, these approaches face expressiveness limitations in generating the intricate visual designs prevalent in modern AI publications.
Recent image generation models have achieved remarkable progress in synthesizing high-fidelity, visually sophisticated figures (Deepmind, 2025; OpenAI, 2025a; Tang et al., 2026; Team et al., 2025; Zuo et al., 2025), offering a promising alternative. Concurrent to our work, AutoFigure (Zhu et al., 2026) and AutoFigure-Edit (Lin et al., 2026) transform scientific content into symbolic representations before rendering them as images using GPT-Image. In comparison, our method achieves broader generalizability through adaptive retrieval and academic-style transfer, with greater extensibility by supporting both methodology diagrams and statistical plots in a unified pipeline.
For evaluation benchmarks, quality assessment of auto-generated diagrams remains less explored. Most closely related to PaperBananaBench is SridBench (Chang et al., 2025), which evaluates
automated diagram generation from method sections and captions across computer science and natural science domains. We will report results once it is publicly available.
7.2. Coding-Based Data Visualization
While the inherent complexity of academic diagram generation has deterred pioneering research, visualizing statistical data has garnered extensive attention since the rise of language models. Early endeavors (Dibia and Demiralp, 2019) employed LSTM-based models to convert JSON data into Vega-Lite visualizations, followed by few-shot and zero-shot coding approaches (Dibia, 2023; Galimzyanov et al., 2025; Li et al., 2024; Tian et al., 2024) leveraging large-scale backbones such as ChatGPT (OpenAI, 2022). More recently, agentic frameworks have demonstrated remarkable progress in coding-based data visualization (Chen et al., 2025; Goswami et al., 2025; Seo et al., 2025; Yang et al., 2024), leveraging fundamental mechanisms such as test-time scaling (Snell et al., 2024) and self-reflection (Shinn et al., 2023). While this paper is more focused on automated generation of academic diagrams and plots, these agentic frameworks can be seamlessly integrated into our Visualizer Agent to enhance its capability in translating detailed descriptions of desired plots into robust Python code. Complementary to generation, recent efforts have also explored reversing plots back into their original code (Wu et al., 2025b; Yang et al., 2025b), challenging both the perception and coding capabilities of VLMs.
8. Conclusion
This paper introduces PaperBanana, an agentic framework designed to automate the generation of publication-ready academic illustrations. By orchestrating specialized agents—Retriever, Planner, Stylist, Visualizer, and Critic—our approach transforms scientific content into high-fidelity methodology diagrams and statistical plots. To facilitate rigorous evaluation, we presented PaperBananaBench, a comprehensive benchmark curated from top-tier AI conferences. Extensive experiments demonstrate that PaperBanana significantly outperforms existing baselines in faithfulness, conciseness, readability, and aesthetics, paving the way for AI scientists to autonomously communicate their discoveries with professional-grade visualizations.
9. Limitations and Future Directions
Although PaperBanana achieves promising results, as a pioneering effort it inevitably faces certain limitations. This section discusses these limitations in detail and outlines the future directions we envision.
Towards Editable Academic Illustrations. The most prominent limitation of PaperBanana lies in the raster nature of its output. Unlike vector graphics—which are preferred in academic contexts for their infinite scalability and precise detail preservation—raster images are inherently difficult to edit. While generating outputs at 4K resolution serves as a viable workaround to ensure high visual fidelity, it does not fundamentally resolve the challenge of post-generation modification. To address this, we envision three potential solutions catering to varying levels of editing needs. For minor visual adjustments, leveraging state-of-the-art image editing models, such as Nano-Banana-Pro, serves as the most direct approach. For more structural modifications, a reconstruction pipeline as exemplified by Paper2Any (Liu et al., 2025), Edit Banana (BIT-DataLab, 2025) and AutoFigure-Edit (Lin et al., 2026) can be adopted: employing OCR for text extraction and SAM3 for pattern segmentation, followed by reassembling these elements on presentation slides (e.g., via python-pptx). While this approach currently struggles with complex backgrounds and intricate visual elements, we anticipate that training specialized element extraction models will significantly enhance the robustness of this reconstruction. Finally, a more advanced direction involves developing a GUI Agent capable of autonomously operating professional vector design software (Huang et al., 2026; Sun et al., 2025), such as Adobe Illustrator. This would enable the direct generation of fully editable vector graphics, although it requires the agent to possess exceptional perception, planning, and interaction capabilities.
The Trade-off between Style Standardization and Diversity. The second limitation lies in the trade-off between style standardization and diversity. While our unified style guide ensures rigid compliance with academic standards, it inevitably reduces the stylistic diversity of the output. Future work could explore more dynamic style adaptation mechanisms that allow for a broader range of artistic expressions and personalized aesthetic choices while maintaining professional rigor.
The Challenge of Fine-Grained Faithfulness. While PaperBanana excels in aesthetics, a performance gap in faithfulness compared to human experts remains. As shown in our failure analysis (Figure 10 in the Appendix), the most prevalent errors involve fine-grained connectivity, such as misaligned start/end points or incorrect arrow directions. These subtleties often escape the detection of current critic models, limiting the efficacy of self-correction. We posit that closing this gap primarily hinges on advancing the fine-grained visual perception capabilities of the foundation VLMs.
Advancing Evaluation Paradigms. Following existing practices, our evaluation adopts a reference-based VLM-as-a-Judge setup. Despite its effectiveness, we acknowledge that this evaluation paradigm still faces inherent challenges. First, regarding faithfulness, quantifying structural correctness remains challenging, as detecting subtle errors in connectivity and notation requires high-precision scrutiny. Future protocols could benefit from incorporating fine-grained, structure-based (Liang and You, 2025) or rubric-based (Huang et al., 2026; Li et al., 2025) metrics, which may offer higher accuracy despite their increased computational complexity. Second, for subjective dimensions such as aesthetics, we observe that textual prompting is often insufficient to fully align the VLM with human preferences. We envision that training customized reward models to bridge this alignment gap represents a crucial direction for future research.
Test-Time Scaling for Diverse Preferences. Currently, our framework produces a single output for each query. However, given the inherent stochasticity of generative models and the subjectivity of aesthetic preferences, a single result may not universally satisfy diverse user tastes. A natural extension is to implement test-time scaling by generating a spectrum of candidates with varying styles and compositions. This paradigm shifts the focus from single-shot generation to a generate-and-select workflow, enabling either human users or VLM-based preference models to select the illustration that best aligns with their specific requirements.
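The generate-and-select workflow described above can be sketched in a few lines; `generate` and `score` are hypothetical hooks standing in for the rendering model (with a per-candidate seed or style) and a VLM-based preference model or human rating.

```python
def generate_and_select(generate, score, spec, n: int = 4):
    """Test-time scaling sketch: sample n candidates, keep the best-scoring one.

    generate(spec, i) renders candidate i (varying seed/style per call);
    score(candidate) returns a preference score, higher is better.
    Both callables are assumptions of this sketch.
    """
    candidates = [generate(spec, i) for i in range(n)]
    return max(candidates, key=score)
```

Replacing `max` with a top-k selection would instead surface several stylistically distinct candidates for a human user to choose among.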
Extension to Broader Domains. Beyond academic illustrations, our framework establishes a generalizable paradigm: leveraging retrieval to instruct the model on what to generate (target diagram types) and employing automatic style summarization to teach it how to generate (stylistic norms). By effectively decoupling structural planning from aesthetic rendering, this reference-driven approach bypasses the need for expensive domain-specific fine-tuning. We believe this paradigm holds significant promise for other specialized domains requiring strict adherence to community standards, such as UI/UX design, patent drafting, and industrial schematics.
Acknowledgements
We thank all members of Google Cloud AI Research for their valuable support during the project. We also thank Yuhang and Ali for the thoughtful discussion.
Impact Statement
This paper introduces PaperBanana, a framework designed to automate the generation of academic illustrations. Our goal is to democratize access to high-quality visual communication tools, particularly benefiting researchers who may lack professional design resources. By reducing the manual effort required for diagram creation, we aim to accelerate the scientific workflow. However, we acknowledge the ethical risk associated with generative models, specifically the potential for “visual hallucination” or unfaithful representation of technical details. It is imperative that users of such systems avoid blind reliance and maintain rigorous human oversight to ensure the scientific integrity of published illustrations.
References
Anthropic. Claude sonnet 4: Hybrid reasoning model with superior intelligence for high-volume use cases, and 200k context window. https://www.anthropic.com/claude/sonnet, 2025.
J. Belouadi and S. Eger. Detikzify: Synthesizing graphics programs for scientific figures and sketches with tikz. In Advances in Neural Information Processing Systems, 2024.
J. Belouadi, A. Lauscher, and S. Eger. Automatikz: Text-guided synthesis of scientific vector graphics with tikz. arXiv preprint arXiv:2310.00367, 2023.
J. Belouadi, E. Ilg, M. Keuper, H. Tanaka, M. Utiyama, R. Dabre, S. Eger, and S. Ponzetto. Tikzero: Zero-shot text-guided graphics program synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17793–17806, 2025.
BIT-DataLab. Edit banana, Oct. 2025. URL https://github.com/BIT-DataLab/Edit-Banana.
Y. Chang, Y. Feng, J. Sun, J. Ai, C. Li, S. K. Zhou, and K. Zhang. Sridbench: Benchmark of scientific research illustration drawing of image generation model. arXiv preprint arXiv:2505.22126, 2025.
Z. Chen, J. Chen, S. Ö. Arik, M. Sra, T. Pfister, and J. Yoon. Coda: Agentic systems for collaborative data visualization. arXiv preprint arXiv:2510.03194, 2025.
J. Cohen. Statistical power analysis for the behavioral sciences. Routledge, 2013.
G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.
Google DeepMind. Introducing nano banana pro. https://blog.google/technology/ai/nano-banana-pro/, 2025.
V. Dibia. Lida: A tool for automatic generation of grammar-agnostic visualizations and infographics using large language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pages 113–126, 2023.
V. Dibia and Ç. Demiralp. Data2vis: Automatic generation of data visualizations using sequence-to-sequence recurrent neural networks. IEEE Computer Graphics and Applications, 39(5):33–46, 2019.
T. Galimzyanov, S. Titov, Y. Golubev, and E. Bogomolov. Drawing pandas: A benchmark for llms in generating plotting code. In 2025 IEEE/ACM 22nd International Conference on Mining Software Repositories (MSR), pages 503–507. IEEE, 2025.
Z. Ghahramani. Probabilistic machine learning and artificial intelligence. Nature, 521(7553):452–459, 2015.
K. Goswami, P. Mathur, R. Rossi, and F. Dernoncourt. Plotgen: Multi-agent llm-based scientific data visualization via multimodal feedback. arXiv preprint arXiv:2502.00988, 2025.
J. Gottweis, W.-H. Weng, A. Daryin, T. Tu, A. Palepu, P. Sirkovic, A. Myaskovsky, F. Weissenberger, K. Rong, R. Tanno, et al. Towards an ai co-scientist. arXiv preprint arXiv:2502.18864, 2025.
M. Hollander, D. A. Wolfe, and E. Chicken. Nonparametric statistical methods. John Wiley & Sons, 2013.
S. Huang, Y. Gao, J. Bai, Y. Zhou, Z. Yin, X. Liu, R. Chellappa, C. P. Lau, S. Nag, C. Peng, et al. Scifig: Towards automating scientific figure generation. arXiv preprint arXiv:2601.04390, 2026.
J. Kim, J. J. An, K. E. Jeon, and J. H. Ko. Efficient multi-bit quantization network training via weight bias correction and bit-wise coreset sampling. arXiv preprint arXiv:2510.20673, 2025.
P. Langley. Scientific discovery: Computational explorations of the creative processes. MIT press, 1987.
P. Langley. Integrated systems for computational scientific discovery. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 22598–22606, 2024.
S. Li, X. Chen, Y. Song, Y. Song, and C. Zhang. Prompt4vis: Prompting large language models with example mining and schema filtering for tabular data visualization. arXiv preprint arXiv:2402.07909, 2024.
S. Li, Y. Zhang, J. Wu, Z. Lei, Y. He, R. Wen, C. Liao, C. Jiang, A. Ping, S. Gao, et al. If-vidcap: Can video caption models follow instructions? arXiv preprint arXiv:2510.18726, 2025.
C. Liang and J. You. Diagrameval: Evaluating llm-generated diagrams via graphs. arXiv preprint arXiv:2510.25761, 3, 2025.
Z. Lin, Q. Xie, M. Zhu, S. Li, Q. Sun, E. Gu, Y. Ding, K. Sun, F. Guo, P. Lu, Z. Ning, Y. Weng, and Y. Zhang. Autofigure-edit: Generating editable scientific illustration, 2026. URL https://arxiv.org/abs/2603.06674.
A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024.
Z. Liu, Z. Guo, Z. Su, L. Huang, Y. Yang, Z. Han, Z. Pan, and W. Zhang. Paper2Any: Turn Paper/Text/Topic into Editable Research Figures, Technical Route Diagrams, and Presentation Slides, Oct. 2025. URL https://github.com/OpenDCAI/Paper2Any.
C. Lu, C. Lu, R. T. Lange, J. Foerster, J. Clune, and D. Ha. The ai scientist: Towards fully automated open-ended scientific discovery. arXiv preprint arXiv:2408.06292, 2024.
Z. Luo, Z. Yang, Z. Xu, W. Yang, and X. Du. Llm4sr: A survey on large language models for scientific research. arXiv preprint arXiv:2501.04306, 2025.
J. Mackinlay. Automating the design of graphical presentations of relational information. ACM Transactions on Graphics (TOG), 5(2):110–141, 1986.
J. Niu, Z. Liu, Z. Gu, B. Wang, L. Ouyang, Z. Zhao, T. Chu, T. He, F. Wu, Q. Zhang, Z. Jin, G. Liang, R. Zhang, W. Zhang, Y. Qu, Z. Ren, Y. Sun, Y. Zheng, D. Ma, Z. Tang, B. Niu, Z. Miao, H. Dong, S. Qian, J. Zhang, J. Chen, F. Wang, X. Zhao, L. Wei, W. Li, S. Wang, R. Xu, Y. Cao, L. Chen, Q. Wu, H. Gu, L. Lu, K. Wang, D. Lin, G. Shen, X. Zhou, L. Zhang, Y. Zang, X. Dong, J. Wang, B. Zhang, L. Bai, P. Chu, W. Li, J. Wu, L. Wu, Z. Li, G. Wang, Z. Tu, C. Xu, K. Chen, Y. Qiao, B. Zhou, D. Lin, W. Zhang, and C. He. Mineru2.5: A decoupled vision-language model for efficient high-resolution document parsing, 2025. URL https://arxiv.org/abs/2509.22186.
OpenAI. Introducing chatgpt. 2022. URL https://openai.com/blog/chatgpt.
OpenAI. Gpt-image-1. https://platform.openai.com/docs/models/gpt-image-1, 2025a.
OpenAI. Introducing gpt-5. https://openai.com/index/introducing-gpt-5/, 2025b.
W. Pang, K. Q. Lin, X. Jian, X. He, and P. Torr. Paper2poster: Towards multimodal poster automation from scientific papers. arXiv preprint arXiv:2505.21497, 2025.
A. Quispel, A. Maes, and J. Schilperoord. Aesthetics and clarity in information visualization: The designer’s perspective. In Arts, volume 7, page 72. MDPI, 2018.
J. A. Rodriguez, D. Vazquez, I. Laradji, M. Pedersoli, and P. Rodriguez. Figgen: Text to scientific figure generation. arXiv preprint arXiv:2306.00800, 2023.
J. Schmidhuber. Artificial scientists & artists based on the formal theory of creativity. In 3rd Conference on Artificial General Intelligence (AGI-2010), pages 148–153. Atlantis Press, 2010.
W. Seo, S. Lee, D. Kang, H. An, Z. Yuan, and S. Lee. Automated visualization code synthesis via multi-path reasoning and feedback-driven optimization. arXiv preprint arXiv:2502.11140, 2025.
N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36:8634–8652, 2023.
C. Snell, J. Lee, K. Xu, and A. Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024.
J. Sun, F. Zhang, Y. Feng, C. Li, Z. Li, J. Ai, Y. Chang, Y. Dai, and K. Zhang. From pixels to paths: A multi-agent framework for editable scientific illustration. arXiv preprint arXiv:2510.27452, 2025.
Y. Tang, X. Liu, B. Zhang, T. Lan, Y. Xie, J. Lao, Y. Wang, H. Li, T. Gao, B. Pan, et al. Igenbench: Benchmarking the reliability of text-to-infographic generation. arXiv preprint arXiv:2601.04498, 2026.
K. Team, J. Chen, Y. Ci, X. Du, Z. Feng, K. Gai, S. Guo, F. Han, J. He, K. He, et al. Kling-omni technical report. arXiv preprint arXiv:2512.16776, 2025.
Y. Tian, W. Cui, D. Deng, X. Yi, Y. Yang, H. Zhang, and Y. Wu. Chartgpt: Leveraging llms to generate charts from abstract natural language. IEEE Transactions on Visualization and Computer Graphics, 31(3):1731–1745, 2024.
E. R. Tufte and P. R. Graves-Morris. The visual display of quantitative information, volume 2. Graphics Press, Cheshire, CT, 1983.
C. Wu, J. Li, J. Zhou, J. Lin, K. Gao, K. Yan, S.-m. Yin, S. Bai, X. Xu, Y. Chen, et al. Qwen-image technical report. arXiv preprint arXiv:2508.02324, 2025a.
C. Wu, Z. Liang, Y. Ge, Q. Guo, Z. Lu, J. Wang, Y. Shan, and P. Luo. Plot2code: A comprehensive benchmark for evaluating multi-modal large language models in code generation from scientific plots. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 3006–3028, 2025b.
A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025a.
C. Yang, C. Shi, Y. Liu, B. Shui, J. Wang, M. Jing, L. Xu, X. Zhu, S. Li, Y. Zhang, et al. Chartmimic: Evaluating lmm’s cross-modal reasoning capability via chart-to-code generation. In The Thirteenth International Conference on Learning Representations, 2025b.
Z. Yang, Z. Zhou, S. Wang, X. Cong, X. Han, Y. Yan, Z. Liu, Z. Tan, P. Liu, D. Yu, et al. Matplotagent: Method and evaluation for llm-based agentic scientific data visualization. In Findings of the Association for Computational Linguistics ACL 2024, pages 11789–11804, 2024.
L. Zhang, S. Eger, Y. Cheng, W. ZHAI, J. Belouadi, F. Moafian, and Z. Zhao. Scimage: How good are multimodal large language models at scientific text-to-image generation? In The Thirteenth International Conference on Learning Representations, 2025.
H. Zheng, X. Guan, H. Kong, W. Zhang, J. Zheng, W. Zhou, H. Lin, Y. Lu, X. Han, and L. Sun. Pptagent: Generating and evaluating presentations beyond text-to-slides. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 14413–14429, 2025.
M. Zhu, Z. Lin, Y. Weng, P. Lu, Q. Xie, Y. Wei, S. Liu, Q. Sun, and Y. Zhang. Autofigure: Generating and refining publication-ready scientific illustrations. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=5N3z9JQJKq.
J. Zuo, H. Deng, H. Zhou, J. Zhu, Y. Zhang, Y. Zhang, Y. Yan, K. Huang, W. Chen, Y. Deng, et al. Is nano banana pro a low-level vision all-rounder? a comprehensive evaluation on 14 tasks and 40 datasets. arXiv preprint arXiv:2512.15110, 2025.
A. Dedicated Case Studies
Cases Demonstrating the Effectiveness of PaperBanana. We provide two cases in Figure 7 to demonstrate the capability of PaperBanana to aid the generation of academic illustrations. Given the same source context and caption, the vanilla Nano-Banana-Pro often produces diagrams with outdated color tones and overly verbose content. In contrast, PaperBanana generates results that are more concise and aesthetically pleasing, while remaining faithful to the source context.
[Figure 7 | Two case studies comparing diagrams produced by a human expert, vanilla Nano-Banana-Pro, and PaperBanana. Case 1 is a long-document question-answering pipeline (page navigation, element localization, multi-answerer adjudication); Case 2 is an event-based stereo-matching pipeline (feature extraction, cost volume, cost aggregation, disparity estimation). The vanilla Nano-Banana-Pro output in Case 1 degenerates into hundreds of repetitive “Element Localizer” entries; figure content omitted here.]
Localizer Extracted Test T235 Full Evidence Set E235 (P235), P2350 (P235) L236 Element Localizer Extracted Test T236 Full Evidence Set E236 (P236), P2360 (P236) L237 Element Localizer Extracted Test T237 Full Evidence Set E237 (P237), P2370 (P237) L238 Element Localizer Extracted Test T238 Full Evidence Set E238 (P238), P2380 (P238) L239 Element Localizer Extracted Test T239 Full Evidence Set E239 (P239), P2390 (P239) L300 Element Localizer Extracted Test T300 Full Evidence Set E300 (P300), P3000 (P300) L301 Element Localizer Extracted Test T301 Full Evidence Set E301 (P301), P3010 (P301) L302 Element Localizer Extracted Test T302 Full Evidence Set E302 (P302), P3020 (P302) L303 Element Localizer Extracted Test T303 Full Evidence Set E303 (P303), P3030 (P303) L304 Element Localizer Extracted Test T304 Full Evidence Set E304 (P304), P3040 (P304) L305 Element Localizer Extracted Test T305 Full Evidence Set E305 (P305), P3050 (P305) L306 Element Localizer Extracted Test T306 Full Evidence Set E306 (P306), P3060 (P306) L307 Element Localizer Extracted Test T307 Full Evidence Set E307 (P307), P3070 (P307) L308 Element Localizer Extracted Test T308 Full Evidence Set E308 (P308), P3080 (P308) L309 Element Localizer Extracted Test T309 Full Evidence Set E309 (P309), P3090 (P309) L310 Element Localizer Extracted Test T3090 Full Evidence Set E310 (P310), P3100 (P310) L311 Element Localizer Extracted Test T311 Full Evidence Set E311 (P311), P3110 (P311) L312 Element Localizer Extracted Test T312 Full Evidence Set E312 (P312), P3120 (P312) L313 Element Localizer Extracted Test T313 Full Evidence Set E313 (P313), P3130 (P313) L314 Element Localizer Extracted Test T314 Full Evidence Set E314 (P314), P3140 (P314) L315 Element Localizer Extracted Test T315 Full Evidence Set E315 (P315), P3150 (P315) L316 Element Localizer Extracted Test T316 Full Evidence Set E316 (P316), P3160 (P316) L317 Element Localizer Extracted Test T317 Full Evidence Set E317 (P317), P3170 (P317) L318 
Element Localizer Extracted Test T318 Full Evidence Set E318 (P318), P3180 (P318) L319 Element Localizer Extracted Test T319 Full Evidence Set E319 (P319), P3190 (P319) L320 Element Localizer Extracted Test T320 Full Evidence Set E320 (P320), P3200 (P320) L321 Element Localizer Extracted Test T321 Full Evidence Set E321 (P321), P3210 (P321) L322 Element Localizer Extracted Test T322 Full Evidence Set E322 (P322), P3220 (P322) L323 Element Localizer Extracted Test T323 Full Evidence Set E323 (P323), P3230 (P323) L324 Element Localizer Extracted Test T324 Full Evidence Set E324 (P324), P3240 (P324) L325 Element Localizer Extracted Test T325 Full Evidence Set E325 (P325), P3250 (P325) L326 Element Localizer Extracted Test T326 Full Evidence Set E326 (P326), P3260 (P326) L327 Element Localizer Extracted Test T327 Full Evidence Set E327 (P327), P3270 (P327) L328 Element Localizer Extracted Test T328 Full Evidence Set E328 (P328), P3280 (P328) L329 Element Localizer Extracted Test T329 Full Evidence Set E329 (P329), P3290 (P329) L330 Element Localizer Extracted Test T330 Full Evidence Set E330 (P330), P3300 (P330) L331 Element Localizer Extracted Test T331 Full Evidence Set E331 (P331), P3310 (P331) L332 Element Localizer Extracted Test T332 Full Evidence Set E332 (P332), P3320 (P332) L333 Element Localizer Extracted Test T333 Full Evidence Set E333 (P333), P3330 (P333) L334 Element Localizer Extracted Test T334 Full Evidence Set E334 (P334), P3340 (P334) L335 Element Localizer Extracted Test T335 Full Evidence Set E335 (P335), P3350 (P335) L336 Element Localizer Extracted Test T336 Full Evidence Set E336 (P336), P3360 (P336) L337 Element Localizer Extracted Test T337 Full Evidence Set E337 (P337), P3370 (P337) L338 Element Localizer Extracted Test T338 Full Evidence Set E338 (P338), P3380 (P338) L339 Element Localizer Extracted Test T339 Full Evidence Set E339 (P339), P3390 (P339) L340 Element Localizer Extracted Test T340 Full Evidence Set E340 (P340), P3400 (P340) 
L341 Element Localizer Extracted Test T341 Full Evidence Set E341 (P341), P3410 (P341) L342 Element Localizer Extracted Test T342 Full Evidence Set E342 (P342), P3420 (P342) L343 Element Localizer Extracted Test T343 Full Evidence Set E343 (P343), P3430 (P343) L344 Element Localizer Extracted Test T344 Full Evidence Set E344 (P344), P3440 (P344) L345 Element Localizer Extracted Test T345 Full Evidence Set E345 (P345), P3450 (P345) L346 Element Localizer Extracted Test T346 Full Evidence Set E346 (P346), P3460 (P346) L347 Element Localizer Extracted Test T347 Full Evidence Set E347 (P347), P3470 (P347) L348 Element Localizer Extracted Test T348 Full Evidence Set E348 (P348), P3480 (P348) L349 Element Localizer Extracted Test T349 Full Evidence Set E349 (P349), P3490 (P349) L350 Element Localizer Extracted Test T350 Full Evidence Set E350 (P350), P3500 (P350) L351 Element Localizer Extracted Test T351 Full Evidence Set E351 (P351), P3510 (P351) L352 Element Localizer Extracted Test T352 Full Evidence Set E352 (P352), P3520 (P352) L353 Element Localizer Extracted Test T353 Full Evidence Set E353 (P353), P3530 (P353) L354 Element Localizer Extracted Test T354 Full Evidence Set E354 (P354), P3540 (P354) L355 Element Localizer Extracted Test T355 Full Evidence Set E355 (P355), P3550 (P355) L356 Element Localizer Extracted Test T356 Full Evidence Set E356 (P356), P3560 (P356) L357 Element Localizer Extracted Test T357 Full Evidence Set E357 (P357), P3570 (P357) L358 Element Localizer Extracted Test T358 Full Evidence Set E358 (P358), P3580 (P358) L359 Element Localizer Extracted Test T359 Full Evidence Set E359 (P359), P3590 (P359) L360 Element Localizer Extracted Test T360 Full Evidence Set E360 (P360), P3600 (P360) L361 Element Localizer Extracted Test T361 Full Evidence Set E361 (P361), P3610 (P361) L362 Element Localizer Extracted Test T362 Full Evidence Set E362 (P362), P3620 (P362) L363 Element Localizer Extracted Test T363 Full Evidence Set E363 (P363), P3630 
(P363) L364 Element Localizer Extracted Test T364 Full Evidence Set E364 (P364), P3640 (P364) L365 Element Localizer Extracted Test T365 Full Evidence Set E365 (P365), P3650 (P365) L366 Element Localizer Extracted Test T366 Full Evidence Set E366 (P366), P3660 (P366) L367 Element Localizer Extracted Test T367 Full Evidence Set E367 (P367), P3670 (P367) L368 Element Localizer Extracted Test T368 Full Evidence Set E368 (P368), P3680 (P368) L369 Element Localizer Extracted Test T369 Full Evidence Set E369 (P369), P3690 (P369) L370 Element Localizer Extracted Test T370 Full Evidence Set E370 (P370), P3700 (P370) L371 Element Localizer Extracted Test T371 Full Evidence Set E371 (P371), P3710 (P371) L372 Element Localizer Extracted Test T372 Full Evidence Set E372 (P372), P3720 (P372) L373 Element Localizer Extracted Test T373 Full Evidence Set E373 (P373), P3730 (P373) L374 Element Localizer Extracted Test T374 Full Evidence Set E374 (P374), P3740 (P374) L375 Element Localizer Extracted Test T375 Full Evidence Set E375 (P375), P3750 (P375) L376 Element Localizer Extracted Test T376 Full Evidence Set E376 (P376), P3760 (P376) L377 Element Localizer Extracted Test T377 Full Evidence Set E377 (P377), P3770 (P377) L378 Element Localizer Extracted Test T378 Full Evidence Set E378 (P378), P3780 (P378) L379 Element Localizer Extracted Test T379 Full Evidence Set E379 (P379), P3790 (P379) L380 Element Localizer Extracted Test T380 Full Evidence Set E380 (P380), P3800 (P380) L381 Element Localizer Extracted Test T381 Full Evidence Set E381 (P381), P3810 (P381) L382 Element Localizer Extracted Test T382 Full Evidence Set E382 (P382), P3820 (P382) L383 Element Localizer Extracted Test T383 Full Evidence Set E383 (P383), P3830 (P383) L384 Element Localizer Extracted Test T384 Full Evidence Set E384 (P384), P3840 (P384) L385 Element Localizer Extracted Test T385 Full Evidence Set E385 (P385), P3850 (P385) L386 Element Localizer Extracted Test T386 Full Evidence Set E386 (P386), 
P3860 (P386) L387 Element Localizer Extracted Test T387 Full Evidence Set E387 (P387), P3870 (P387) L388 Element Localizer Extracted Test T388 Full Evidence Set E388 (P388), P3880 (P388) L389 Element Localizer Extracted Test T389 Full Evidence Set E389 (P389), P3890 (P389) L390 Element Localizer Extracted Test T390 Full Evidence Set E390 (P390), P3900 (P390) L391 Element Localizer Extracted Test T391 Full Evidence Set E391 (P391), P3910 (P391) L392 Element Localizer Extracted Test T392 Full Evidence Set E392 (P392), P3920 (P392) L393 Element Localizer Extracted Test T393 Full Evidence Set E393 (P393), P3930 (P393) L394 Element Localizer Extracted Test T394 Full Evidence Set E394 (P394), P3940 (P394) L395 Element Localizer Extracted Test T395 Full Evidence Set E395 (P395), P3950 (P395) L396 Element Localizer Extracted Test T396 Full Evidence Set E396 (P396), P3960 (P396) L397 Element Localizer Extracted Test T397 Full Evidence Set E397 (P397), P3970 (P397) L398 Element Localizer Extracted Test T398 Full Evidence Set E398 (P398), P3980 (P398) L399 Element Localizer Extracted Test T399 Full Evidence Set E399 (P399), P3990 (P399) L400 Element Localizer Extracted Test T400 Full Evidence Set E400 (P400), P4000 (P400) L401 Element Localizer Extracted Test T401 Full Evidence Set E401 (P401), P4010 (P401) L402 Element Localizer Extracted Test T402 Full Evidence Set E402 (P402), P4020 (P402) L403 Element Localizer Extracted Test T403 Full Evidence Set E403 (P403), P4030 (P403) L404 Element Localizer Extracted Test T404 Full Evidence Set E404 (P404), P4040 (P404) L405 Element Localizer Extracted Test T405 Full Evidence Set E405 (P405), P4050 (P405) L406 Element Localizer Extracted Test T406 Full Evidence Set E406 (P406), P4060 (P406) L407 Element Localizer Extracted Test T407 Full Evidence Set E407 (P407), P4070 (P407) L408 Element Localizer Extracted Test T408 Full Evidence Set E408 (P408), P4080 (P408) L409 Element Localizer Extracted Test T409 Full Evidence Set E409 
(P409), P4090 (P409) L500 Element Localizer Extracted Test T500 Full Evidence Set E500 (P500), P5000 (P500) L501 Element Localizer Extracted Test T501 Full Evidence Set E501 (P501), P5010 (P501) L502 Element Localizer Extracted Test T502 Full Evidence Set E502 (P502), P5020 (P502) L503 Element Localizer Extracted Test T503 Full Evidence Set E503 (P503), P5030 (P503) L504 Element Localizer Extracted Test T504 Full Evidence Set E504 (P504), P5040 (P504) L505 Element Localizer Extracted Test T505 Full Evidence Set E505 (P505), P5050 (P505) L506 Element Localizer Extracted Test T506 Full Evidence Set E506 (P506), P5060 (P506) L507 Element Localizer Extracted Test T507 Full Evidence Set E507 (P507), P5070 (P507) L508 Element Localizer Extracted Test T508 Full Evidence Set E508 (P508), P5080 (P508) L509 Element Localizer Extracted Test T509 Full Evidence Set E509 (P509), P5090 (P509) L510 Element Localizer Extracted Test T509 Full Evidence Set E509( P509 ) | Inclusion Criteria: No. of Participants = 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 
223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255 | ||||
Figure 7 | Case study of diagram generation. Given the same source context and caption, the vanilla Nano-Banana-Pro often produces diagrams with outdated color tones and overly verbose content. In contrast, our PaperBanana generates results that are more concise and aesthetically pleasing, while maintaining faithfulness to the source context.
Enhancing the Aesthetics of Human-Drawn Diagrams Figure 8 provides additional cases illustrating an interesting scenario: enhancing the aesthetics of human-drawn diagrams with our auto-summarized style guidelines. The polished diagrams show substantial stylistic improvements in color schemes, typography, graphical elements, etc.

Figure 8 | Additional cases for enhancing the aesthetics of human-drawn diagrams with our auto-summarized style guidelines. The polished diagrams demonstrate significant stylistic improvements in color schemes, typography, graphical elements, etc.
Case study for visualizing statistical plots with code and image generation. Figure 9 compares statistical plots produced via direct image generation and via code. The image generation model produces more visually appealing plots, but incurs more faithfulness errors, such as numerical hallucination and element repetition.
| Plots Visualized via IMG | Plots Visualized via Coding | Case Analysis |

(Figure content omitted. Panel titles: "Word Error Rate (WER) vs. Number of Microphones", "Harvest Yields by Farmer and Farm Animals", "Player Performance Comparison". Recovered case analysis: the left plot contains faithfulness errors; the 'Clinical' data value is 0.4, but the model draws the bar significantly taller than the 0.4 gridline and axis tick.)
Figure 9 | Case study for visualizing statistical plots with code and image generation. It is observed that the image generation model can generate more visually appealing plots, but incurs more faithfulness errors such as numerical hallucination or element repetition. The red bounding boxes are added by the authors to highlight the errors.
Failure Cases of PaperBanana. Figure 10 shows three failure cases of PaperBanana. We observe that the primary failure mode involves connection errors, such as redundant connections and mismatched source-target nodes. Our preliminary analysis reveals that the critic model often fails to identify these connectivity issues, suggesting these errors may originate from the foundation model’s inherent perception limitations. Resolving this challenge likely necessitates advancements in the underlying foundation model.
| Diagrams from PaperBanana | Failure Reasons |
| (diagram omitted) | The attention block incorrectly draws connections from the Noise Predictor (εθ) to three destination boxes, labeling two of them as 'Keys (K')' and one as 'Values (V')'. The text (Eq. 2) explicitly states that diffusion features map only to Keys and Values, while Queries (Qi) come from learnable parameters. |
| (diagram omitted) | It fails to draw generative connections (solid lines) from Actions (a) to Task-Relevant States (s). The method explicitly states (Eq. 3) that state transitions depend on previous actions (h_t = f(..., a_{t-1})). By isolating the actions without driving the state dynamics, the diagram fails to represent the fundamental structure of a control/world model described in the text. |
| (diagram omitted) | The text explicitly states, "We note that the Decoder and Predictor can only access information of t through e" and defines the Decoder's input as the embedding and the goal (D(e, g)). However, the diagram draws a dotted skip connection from the "Tactic t" block directly to the Decoder, replacing the required "Goal" connection. This violates the functional constraints described in the methodology. |
Figure 10 | Failure cases of PaperBanana. The primary failure mode involves connection errors, such as redundant connections and mismatched source-target nodes. Our preliminary analysis reveals that the critic model often fails to identify these connectivity issues, suggesting these errors may originate from the foundation model’s inherent perception limitations. Resolving this challenge likely necessitates advancements in the underlying foundation model.
B. Human Evaluation Setup
To ensure the reliability of our automated metrics and to rigorously benchmark our method, we conducted two distinct human evaluation experiments. Both evaluations employed the same four dimensions defined in Section 4 (Faithfulness, Conciseness, Readability, and Aesthetics) and adhered to the same detailed rubrics used by our VLM judge. We utilized Streamlit to build dedicated annotation interfaces for these tasks.
Validation of VLM-as-a-Judge. The objective of this human evaluation is to assess the alignment between our VLM-based judge (Gemini-3-Pro) and human judgment. We randomly sampled 50 cases (25 from the Vanilla baseline and 25 from PaperBanana) from the test set. For each case, two experienced researchers were presented with the Method Section, Caption, the human-drawn reference diagram, and a model-generated candidate (either from our method or the baseline). They were tasked with conducting a side-by-side comparison on the four evaluation dimensions. For conflicting cases, they engaged in discussion to reach a consensus. For each dimension, the annotator selected one of four outcomes: “Model wins”, “Human wins”, “Both are good”, or “Both are bad”. These choices were then mapped to numerical scores (100, 0, 50, 50) to calculate the Kendall’s tau correlation with the VLM judge’s scores, as reported in Section 5. The annotation interface is shown in Figure 11.
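The outcome-to-score mapping and the correlation step above can be sketched as follows. This is a minimal sketch, not the paper's actual evaluation script: the outcome labels and the pure-Python tau-a helper are our assumptions (in practice, a tie-adjusted tau-b such as `scipy.stats.kendalltau` would typically be used, since the 0/50/100 scores produce many ties).

```python
# Map the four annotation outcomes to the numerical scores described above.
OUTCOME_SCORE = {
    "model_wins": 100,
    "human_wins": 0,
    "both_good": 50,
    "both_bad": 50,
}

def kendall_tau_a(xs, ys):
    """Plain tau-a over all pairs (no tie correction).
    scipy.stats.kendalltau computes the tie-adjusted tau-b variant."""
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical annotations for five cases, paired with VLM judge scores.
human = [OUTCOME_SCORE[o] for o in
         ["model_wins", "human_wins", "both_good", "model_wins", "both_bad"]]
vlm = [90, 20, 55, 80, 45]
tau = kendall_tau_a(human, vlm)
```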
Blind Test for Main Results. To rigorously compare PaperBanana against the strong baseline (Vanilla Nano-Banana-Pro), we conducted a blind A/B test on a subset of 50 cases. Three experienced researchers were presented with the Method Section, Caption, a Reference (Human Drawn) diagram, and two anonymous candidates (Candidate A and Candidate B) in randomized order. To determine the winner, we enforced a hierarchical decision strategy consistent with our VLM evaluation protocol. Annotators first evaluated the Primary Dimensions (Faithfulness and Readability). If a candidate won in the primary dimensions (or won one and tied the other), it was declared the overall winner. In cases of a tie in primary dimensions, the decision was deferred to the Secondary Dimensions (Conciseness and Aesthetics). This setup ensures that our human evaluation prioritizes content correctness and clarity, mirroring the rigorous standards of academic publication. The annotation interface is shown in Figure 12.
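The hierarchical decision strategy can be sketched as below. One detail is our assumption: when the two primary dimensions split (each candidate wins one), we treat the primary stage as a tie and defer to the secondary dimensions, consistent with the deferral rule described above.

```python
def stage_winner(dims):
    """dims maps a dimension name to 'A', 'B', or 'tie';
    returns the stage-level winner, or 'tie' on a draw/split."""
    a = sum(1 for v in dims.values() if v == "A")
    b = sum(1 for v in dims.values() if v == "B")
    if a > b:
        return "A"
    if b > a:
        return "B"
    return "tie"

def overall_winner(primary, secondary):
    """Primary dimensions (Faithfulness, Readability) decide first;
    only a primary tie defers to Conciseness and Aesthetics."""
    p = stage_winner(primary)
    if p != "tie":
        return p
    return stage_winner(secondary)
```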

Figure 11 | Annotation interface for reference-based evaluation.



Figure 12 | Annotation interface for blind human evaluation.
C. Implementation Details
Categorization of Methodology Diagrams. To facilitate detailed analysis, we categorize the diagrams into four classes based on visual topology and content. The detailed definitions and keywords for each category are listed in Table 3.
Additional Experiment Settings. For all experiments, we set the generation temperature to 1. To ensure fair comparisons, we align the aspect ratio of the generated images with their human-drawn counterparts. Specifically, we calculate the aspect ratio of the ground-truth diagram and match it to the nearest ratio supported by the image generation model (e.g., for Nano-Banana-Pro, we round to the closest among 3:2, 16:9, and 21:9).
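Matching the aspect ratio to the nearest supported value is straightforward; a minimal sketch, assuming the three example ratios named above (a real model may support others):

```python
# Supported (width:height) ratios; these mirror the Nano-Banana-Pro
# examples in the text and are not an exhaustive list.
SUPPORTED_RATIOS = {"3:2": 3 / 2, "16:9": 16 / 9, "21:9": 21 / 9}

def nearest_supported_ratio(width, height):
    """Round a ground-truth diagram's aspect ratio to the closest
    ratio supported by the image generation model."""
    target = width / height
    return min(SUPPORTED_RATIOS,
               key=lambda label: abs(SUPPORTED_RATIOS[label] - target))
```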
Generating Diagrams and Plots used in this Paper. All figures in this paper marked with “[Generated by PaperBanana]” are produced entirely by PaperBanana. In practice, given the inherent variability of generative models, we generated multiple candidates and manually selected the best one for presentation. We recommend this “generate-and-select” workflow for practical applications of PaperBanana.
D. Testset Curation for Statistical Plots Generation
This section introduces the testset curation process for statistical plot generation, which evaluates the capability to generate statistical plots from raw data (e.g., tables, CSV files) and high-level visual descriptions (e.g., a bar plot titled “Number of Publications by Year”). Since academic manuscripts rarely include raw data for their published plots, we repurpose ChartMimic (Yang et al., 2025b), a dataset originally designed for chart-to-code evaluation. Specifically, we use the “direct mimic” subset, which contains 2,400 plots sourced mostly from arXiv papers and matplotlib galleries, each paired with human-curated Python code for reproduction. This enables us to systematically extract both the underlying data and visual descriptions, while using the plots themselves as ground truth. The pipeline is as follows:
Table 3 | Categorization of diagrams based on visual topology and content.
1. Agent & Reasoning: LLM agents, multi-agent systems, reasoning, planning, tool use; instruction following, in-context learning, chain-of-thought; code generation, autonomous systems. Keywords: agent, llm, language model, reasoning, planning, prompt.
2. Vision & Perception: Computer vision, 3D reconstruction, rendering, object detection; scene understanding, depth estimation, pose estimation; visual representations and feature learning. Keywords: vision, image, 3d, gaussian, nerf, detection, segmentation, camera.
3. Generative & Learning: Generative models (diffusion, GANs, VAEs, autoencoders); reinforcement learning, policy learning; optimization and training dynamics. Keywords: diffusion, generative, gan, denoising, reinforcement, policy, reward.
4. Science & Applications: AI for Science (biology, chemistry, physics, medicine); graph neural networks, structured data; theoretical analysis, mathematical proofs; domain-specific applications. Keywords: protein, molecule, biology, graph, node, theorem, theory.
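The keyword lists in Table 3 suggest a simple frequency-based assignment heuristic. The sketch below is our illustration only (the paper does not specify the exact assignment procedure, and the substring matching and fallback choice are our assumptions):

```python
CATEGORY_KEYWORDS = {
    "Agent & Reasoning": ["agent", "llm", "language model", "reasoning",
                          "planning", "prompt"],
    "Vision & Perception": ["vision", "image", "3d", "gaussian", "nerf",
                            "detection", "segmentation", "camera"],
    "Generative & Learning": ["diffusion", "generative", "gan", "denoising",
                              "reinforcement", "policy", "reward"],
    "Science & Applications": ["protein", "molecule", "biology", "graph",
                               "node", "theorem", "theory"],
}

def categorize(text):
    """Assign the category whose keywords occur most often in the text.

    Naive substring counting; a production version would tokenize to
    avoid matching inside longer words.
    """
    text = text.lower()
    counts = {cat: sum(text.count(kw) for kw in kws)
              for cat, kws in CATEGORY_KEYWORDS.items()}
    best = max(counts, key=counts.get)
    # Fallback when no keyword matches (arbitrary choice for this sketch).
    return best if counts[best] > 0 else "Science & Applications"
```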
Collection & Filtering. We begin with all 2,400 plots from the “direct mimic” subset. Using Gemini-3-Pro, we extract the raw data from the code into tabular format and generate a high-level description of each plot’s visual intent, while also marking the difficulty of generating each plot (plots with many data points or subplots are marked as difficult; plots with a single subplot and few data points are marked as easy). We also apply two filtering criteria: (1) Reproducible Data: we exclude plots whose data is randomly generated or requires complex computation; (2) Standard Mapping: we exclude plots that use data for geometric construction (e.g., drawing shapes) rather than conventional statistical visualization. As in our methodology diagram curation, we filter out plots with aspect ratios outside [1.0, 2.5] to support future exploration with image generation models. This yields 914 plots.
Categorization. ChartMimic’s original 22 plot categories include many types rarely used in academic publications, such as Pip chart and Quiver chart. Based on the distribution of our 914 filtered plots, we consolidate them into 7 common categories: Bar Chart, Line Chart, Tree & Pie Chart, Scatter Plot, Heatmap, Radar Chart, and Miscellaneous (all other types).
Sampling. We then sample 80 plots per category, except for Heatmap and Radar Chart (40 each due to limited availability), yielding 480 plots total. During sampling, we intentionally increased the proportion of difficult cases to ensure a challenging testset. Each category is then evenly split into reference and test sets.
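The sampling step above can be sketched as stratified sampling with a difficulty bias. Everything below (the data layout and the “take difficult plots first” heuristic) is our assumption of one reasonable implementation, not the paper's actual code:

```python
import random

def sample_testset(plots_by_category, seed=0):
    """Stratified sampling biased toward difficult cases, followed by an
    even reference/test split per category."""
    rng = random.Random(seed)
    reference, test = [], []
    for category, plots in plots_by_category.items():
        # 80 per category, except 40 for the two low-availability types.
        quota = 40 if category in ("Heatmap", "Radar Chart") else 80
        # Take difficult plots first to increase the share of hard cases.
        difficult = [p for p in plots if p["difficulty"] == "difficult"]
        easy = [p for p in plots if p["difficulty"] == "easy"]
        rng.shuffle(difficult)
        rng.shuffle(easy)
        chosen = (difficult + easy)[:quota]
        rng.shuffle(chosen)  # avoid splitting reference/test by difficulty
        half = len(chosen) // 2
        reference.extend(chosen[:half])
        test.extend(chosen[half:])
    return reference, test
```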
E. Textual Description of our Methodology Diagram
Our framework operates by first synthesizing a detailed description of the target diagram, which is then visualized by Nano-Banana-Pro. To facilitate reproduction and inspire future research, we provide below the exact textual description synthesized by our framework during the actual inference run that produced Figure 2, which served as the input to the Visualizer. When using Nano-Banana-Pro, we set the (width:height) aspect ratio to 21:9, the temperature to 1, and the resolution to 2K.
Textual Description of our Methodology Diagram
The figure is a wide, horizontal flowchart-style diagram illustrating the “PaperBanana” framework. The layout flows from left to right on a clean white background, divided into two main colored regions: the “Linear Planning Phase” (left/middle) and the “Iterative Refinement Loop” (right).

**1. Leftmost Section: Inputs**
- Visual Elements: Two icons stacked vertically on the far left.
  - Top: A document icon labeled “Source Context”.
  - Bottom: A target/goal icon labeled “Communicative Intent”.
- Flow: Brackets merge these inputs into a main flow line that enters the first phase.

**2. Middle-Left Region: Linear Planning Phase**
- Container: A light blue rounded rectangle. Label at top: “Linear Planning Phase”.
- Reference Set: A cylinder database icon located at the bottom-left of this region, labeled “Reference Set”.
- Agent 1: Retriever Agent
  - Icon: Robot with a magnifying glass.
  - Label: “Retriever Agent” positioned below the icon.
  - Input: An arrow from the main Inputs.
  - Output: Arrow to a cluster of image thumbnails labeled “Relevant Examples”.
- Agent 2: Planner Agent
  - Icon: Robot with a clipboard or thought bubble.
  - Label: “Planner Agent” positioned below the icon.
  - Input: Receives an arrow from “Relevant Examples”. Crucially, a direct flow arrow (bypassing the Retriever) connects the main Inputs to the Planner, indicating that it uses the source content for planning.
  - Output: Arrow to a text document icon labeled “Initial Description”.
- Agent 3: Stylist Agent
  - Icon: Robot with a palette/paintbrush.
  - Label: “Stylist Agent” positioned below the icon.
  - Input: Receives the “Initial Description”.
  - Output: An arrow exiting the blue region labeled “Optimized Description”.

**3. Middle-Right Region: Iterative Refinement Loop**
- Container: A light orange rounded rectangle. Label at top: “Iterative Refinement Loop”.
- Agent 4: Visualizer Agent
  - Icon: Robot standing next to a split visual representation: a canvas on one side and a code terminal/brackets on the other.
  - Label: “Visualizer Agent” positioned below the icon.
  - Input: Takes the “Optimized Description” (and, in later rounds, the “Refined Description” from the Critic).
  - Output: Arrow to an image preview labeled “Generated Image”.
- Agent 5: Critic Agent
  - Icon: Robot with a checklist/reviewer pen.
  - Label: “Critic Agent” positioned below the icon.
  - Input: Receives the “Generated Image” along the bottom edge, connecting to the Critic.
  - Output: A curved return arrow back to the Visualizer, labeled “Refined Description”.
- Center Element: A circular arrow icon inside the loop indicating the number of refinement rounds.

**4. Rightmost Section: Final Output**
- Visual Element: A polished scientific illustration emerging from the loop.
- Label: “Final Illustration”.

**5. Styling**
- Agents: Cute, consistent robot avatars with distinct accessories.
- Typography: Sans-serif for main text; serif italic (LaTeX style) for all variables.
- Colors: Blue accents for Planning; Orange accents for Refinement. Main flow arrows in solid black; secondary inputs in dashed gray.
F. Auto Summarized Style Guide for Academic Illustrations
F.1. Style Guides for Methodology Diagrams and Statistical Plots
Style Guide for Methodology Diagrams
1. The “NeurIPS Look”
The prevailing aesthetic for 2025 is “Soft Tech & Scientific Pastels.” Gone are the days of harsh primary colors and sharp black boxes. The modern NeurIPS diagram feels approachable yet precise. It utilizes high-value (light) backgrounds to organize complexity, reserving saturation for the most critical active elements. The vibe balances clean modularity (clear separation of parts) with narrative flow (clear left-to-right progression).
2. Detailed Style Options
Color Palettes
Design Philosophy: Use color to group logic, not just to decorate. Avoid fully saturated backgrounds.
Background Fills (The “Zone” Strategy)
Used to encapsulate stages (e.g., “Pre-training phase”) or environments.
- Most papers use: very light, desaturated pastels.
- Aesthetically pleasing options include: Cream/Beige (e.g., `#F5F5DC`) for a warm, academic feel; Pale Blue/Ice (e.g., `#E6F3FF`) for a clean, technical feel; Mint/Sage (e.g., `#E0F2F1`) for a soft, organic feel; Pale Lavender (e.g., `#F3E5F5`) for a distinctive, modern feel.
- Alternative (~20%): white backgrounds with colored dashed borders for a high-contrast, minimalist look (common in theoretical papers).
Functional Element Colors
- For “Active” Modules (Encoders, MLP, Attention): Medium saturation is preferred.
  - Common pairings: Blue/Orange, Green/Purple, or Teal/Pink.
  - Observation: Colors are often used to distinguish status rather than component type. Trainable elements: often warm tones (Red, Orange, Deep Pink); frozen/static elements: often cool tones (Grey, Ice Blue, Cyan).
- For Highlights/Results: High saturation (Primary Red, Bright Gold) is strictly reserved for “Error/Loss,” “Ground Truth,” or the final output.
Shapes & Containers
Design Philosophy: “Softened Geometry.” Sharp corners are for data; rounded corners are for processes.
Core Components
- Process Nodes (The Standard): Rounded rectangles (corner radius 5-10 px). This is the dominant shape for generic layers or steps.
- Tensors & Data:
  - Stacks/Cuboids: Used to imply depth/volume.
  - Flat Squares/Grids: Used for matrices, tokens, or attention maps.
- Cylinders: Exclusively reserved for Databases, Buffers, or Memory.
Grouping & Hierarchy
- The “Macro-Micro” Pattern: A solid, light-colored container represents the global view, with a specific module (e.g., “Attention Block”) connected via lines to a “zoomed-in” detailed breakout box.
- Borders:
  - Solid: For physical components.
  - Dashed: Highly prevalent for indicating “Logical Stages,” “Optional Paths,” or “Scopes.”
Lines & Arrows
Design Philosophy: Line style dictates flow type.
Connector Styles
- Orthogonal / Elbow (Right Angles): Most papers use this for Network Architectures (implies precision, matrices, and tensors).
- Curved / Bezier: Common choices include this for System Logic, Feedback Loops, or High-Level Data Flow (implies narrative and connection).

Line Semantics
- Solid Black/Grey: Standard data flow (forward pass).
- Dashed Lines: Universally recognized as “Auxiliary Flow.” Used for gradient updates, skip connections, or loss calculations.
- Integrated Math: Standard operators (⊕ for Add, ⊗ for Concat/Multiply) are frequently placed directly on the line or at intersections.
Typography & Icons
Design Philosophy: Strict separation between “Labeling” and “Math.”
Typography
- Labels (Module Names): Sans-serif (Arial, Roboto, Helvetica). Style: Bold for headers, Regular for details.
- Variables (Math): Serif (Times New Roman, LaTeX default). Rule: If it is a variable in your equations, it must be serif and italicized in the diagram.
Iconography Options
- For Model State: Trainable: Fire, Lightning. Frozen: Snowflake, Padlock, Stop Sign (greyed out).
- For Operations: Inspection: Magnifying Glass. Processing/Computation: Gear, Monitor.
- For Content: Text/Prompt: Document, Chat Bubble. Image: An actual thumbnail of an image (not just a square).
3. Common Pitfalls (How to look “Amateur”)
- The “PowerPoint Default” Look: Using standard Blue/Orange presets with heavy black outlines.
- Font Mixing: Using Times New Roman for “Encoder” labels (makes the paper look dated to the 1990s).
- Inconsistent Dimension: Mixing flat 2D boxes and 3D isometric cubes without a clear reason (e.g., 2D for logic, 3D for tensors is fine; random mixing is not).
- Primary Backgrounds: Using saturated Yellow or Blue backgrounds for grouping (distracts from the content).
- Ambiguous Arrows: Using the same line style for “Data Flow” and “Gradient Flow.”
4. Domain-Specific Styles
If you are writing an AGENT / LLM Paper:
- Vibe: Illustrative, Narrative, Friendly, Cartoony.
- Key Elements: Use “User Interface” aesthetics. Chat bubbles for prompts, document icons for retrieval.
- Characters: It is common to use cute 2D vector robots, human avatars, or emojis to humanize the agent’s reasoning steps.
If you are writing a COMPUTER VISION / 3D Paper:
- Vibe: Spatial, Dense, Geometric.
- Key Elements: Frustums (camera cones), Ray lines, and Point Clouds.
- Color: Often uses RGB color coding to denote axes or channel correspondence. Use heatmaps (Rainbow/Viridis) to show activation.
If you are writing a THEORETICAL / OPTIMIZATION Paper:
- Vibe: Minimalist, Abstract, “Textbook.”
- Key Elements: Focus on graph nodes (circles) and manifolds (planes/surfaces).
- Color: Restrained; mostly Grayscale/Black/White with one highlight color (e.g., Gold or Blue). Avoid “cartoony” elements.
Style Guide for Statistical Plots
NeurIPS 2025 Statistical Plot Aesthetics Guide
1. The “NeurIPS Look”: A High-Level Overview
The prevailing aesthetic for 2025 is defined by precision, accessibility, and high contrast. The “default” academic look has shifted away from bare-bones styling toward a more graphic, publication-ready presentation.
- Vibe: Professional, clean, and information-dense.
- Backgrounds: There is a heavy bias toward stark white backgrounds for maximum contrast in print and PDF reading, though the “Seaborn-style” light grey background remains an accepted variant.
- Accessibility: A strong emphasis on distinguishing data not just by color, but by texture (patterns) and shape (markers) to support black-and-white printing and colorblind readers.
2. Detailed Style Options
Color Palettes
- Categorical Data:
  - Soft Pastels: Matte, low-saturation colors (salmon, sky blue, mint, lavender) are frequently used to prevent visual fatigue.
  - Muted Earth Tones: “Academic” palettes using olive, beige, slate grey, and navy.
  - High-Contrast Primaries: Used sparingly when categories must be distinct (e.g., deep orange vs. vivid purple).
  - Accessibility Mode: A growing trend involves combining color with geometric patterns (hatches, dots, stripes) to differentiate categories.
- Sequential & Heatmaps:
  - Perceptually Uniform: “Viridis” (blue-to-yellow) and “Magma/Plasma” (purple-to-orange) are the standard.
  - Diverging: “Coolwarm” (blue-to-red) is used for positive/negative value splits.
  - Avoid: The traditional “Jet/Rainbow” scale is almost entirely absent.
Axes & Grids
- Grid Style:
  - Visibility: Grid lines are rarely solid. Common choices include fine dashed or dotted lines in light gray.
  - Placement: Grids are consistently rendered behind data elements (low Z-order).
- Spines (Borders):
  - The “Boxed” Look: A full enclosure (black spines on all 4 sides) is very common.
  - The “Open” Look: Removing the top and right spines for a minimalist appearance.
- Ticks:
  - Style: Ticks are generally subtle, facing inward, or removed entirely in favor of grid alignment.
Layout & Typography
- Typography:
  - Font Family: Exclusively sans-serif (resembling Helvetica, Arial, or DejaVu Sans). Serif fonts are rarely used for labels.
  - Label Rotation: X-axis labels are rotated only when necessary to prevent overlap; otherwise, horizontal orientation is preferred.
- Legends:
  - Internal Placement: Floating the legend inside the plot area (top-left or top-right) to maximize the “data-ink ratio.”
  - Top Horizontal: Placing the legend in a single row above the plot title.
- Annotations:
  - Direct Labeling: Instead of forcing readers to reference a legend, text is often placed directly next to lines or on top of bars.
3. Type-Specific Guidelines
Bar Charts & Histograms
- Borders: Two distinct styles are accepted:
  - High-Definition: Black outlines around colored bars for a “comic-book,” high-contrast look.
  - Borderless: Solid color fills with no outline (often used with light grey backgrounds).
- Grouping: Bars are grouped tightly, with significant whitespace between categorical groups.
- Error Bars: Consistently styled with black, flat caps.
Line Charts
- Markers: A critical observation: lines almost always include geometric markers (circles, squares, diamonds) at data points, rather than being smooth strokes alone.
- Line Styles: Use dashed lines (`--`) for theoretical limits, baselines, or secondary data, and solid lines for primary experimental data.
- Uncertainty: Represented by semi-transparent shaded bands (confidence intervals) rather than simple vertical error bars.
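As a purely illustrative sketch (with made-up data, not from the paper), the line-chart conventions above translate to matplotlib roughly as follows: markers on the primary line, a shaded confidence band, a dashed baseline, a light dashed grid behind the data, and the “open” spine look.

```python
import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(1, 9)
ours = 60 + 4 * x + np.random.default_rng(0).normal(0, 1.5, x.size)
band = 2.5  # hypothetical confidence half-width

fig, ax = plt.subplots(figsize=(4.5, 3))
# Solid line with geometric markers for the primary experimental data.
ax.plot(x, ours, color="#1f77b4", marker="o", label="Ours")
# Semi-transparent band for uncertainty instead of vertical error bars.
ax.fill_between(x, ours - band, ours + band, color="#1f77b4", alpha=0.2)
# Dashed grey line for a baseline / theoretical limit.
ax.axhline(88, color="grey", linestyle="--", linewidth=1, label="Baseline")
# Light dashed grid rendered behind the data; "open" spine look.
ax.grid(True, linestyle="--", color="lightgray", linewidth=0.6)
ax.set_axisbelow(True)
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.set_xlabel("Epoch")
ax.set_ylabel("Accuracy (%)")
ax.legend(loc="upper left", frameon=False)  # legend floated inside the axes
fig.tight_layout()
```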
Tree & Pie/Donut Charts
- Separators: Thick white borders are standard to separate slices or treemap blocks.
- Structure: Donut charts with a thick ring are preferred over traditional Pie charts.
- Emphasis: “Exploding” (detaching) a specific slice is a common technique to highlight a key statistic.
Scatter Plots
- Shape Coding: Use different marker shapes (e.g., circles vs. triangles) to encode a categorical dimension alongside color.
- Fills: Markers are typically solid and fully opaque.
- 3D Plots: Depth is emphasized by drawing “walls” with grids or using drop-lines to the “floor” of the plot.
Heatmaps
- Aspect Ratio: Cells are almost strictly square.
- Annotation: Writing the exact value (in white or black text) inside the cell is highly preferred over relying solely on a color bar.
- Borders: Cells are often borderless (smooth gradient look) or separated by very thin white lines.
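The heatmap conventions above (square cells, a perceptually uniform colormap, direct in-cell annotation) can be sketched in matplotlib as follows; the data and labels are made up for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt
import numpy as np

scores = np.array([[0.82, 0.64],
                   [0.55, 0.91]])

fig, ax = plt.subplots(figsize=(3, 3))
im = ax.imshow(scores, cmap="viridis", vmin=0, vmax=1)  # perceptually uniform
ax.set_aspect("equal")  # square cells
# Direct in-cell annotation; text color chosen to contrast with the cell.
for (i, j), v in np.ndenumerate(scores):
    color = "black" if v > 0.6 else "white"  # viridis: high values are light
    ax.text(j, i, f"{v:.2f}", ha="center", va="center", color=color)
ax.set_xticks([0, 1], labels=["Model A", "Model B"])
ax.set_yticks([0, 1], labels=["Task 1", "Task 2"])
fig.colorbar(im, ax=ax, shrink=0.8)
fig.tight_layout()
```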
Radar Charts
- Fills: The polygon area uses translucent fills (alpha ~0.2) to show grid lines underneath.
- Perimeter: The outer boundary is marked by a solid, darker line.
Miscellaneous
- Dot Plots: Used as a modern alternative to bar charts; often styled as “lollipops” (dots connected to the axis by a thin line).
4. Common Pitfalls (What to Avoid)
- The “Excel Default” Look: Avoid heavy 3D effects on bars, drop shadows, or serif fonts (Times New Roman) on axes.
- The “Rainbow” Map: Avoid the Jet/Rainbow colormap; it is considered outdated and perceptually misleading.
- Ambiguous Lines: A line chart without markers can look ambiguous if data points are sparse; always add markers.
- Over-reliance on Color: Failing to use patterns or shapes to distinguish groups makes the plot inaccessible to colorblind readers.
- Cluttered Grids: Avoid solid black grid lines; they compete with the data. Always use light grey/dashed grids.
F.2. Automated Style Guide Summarization
To distill a comprehensive style guide from top-tier AI conference papers, we employ a hierarchical summarization pipeline. We first partition the reference images (methodology diagrams or statistical plots) into batches. For each batch, we prompt Gemini-3-Pro to analyze the visual patterns—including color palettes, shapes, and typography—and generate a local design report. Finally, we aggregate these batch-level reports and query the model to synthesize a unified style guide that captures the prevailing aesthetic standards and diverse design choices. The prompts used for discrete batch analysis and final global synthesis are presented below.
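The hierarchical (map-reduce style) summarization pipeline described above can be sketched as follows. The two callables stand in for VLM calls (e.g., to Gemini-3-Pro with the prompts below) and are hypothetical placeholders, not a real API:

```python
def summarize_style_guide(images, batch_size, analyze_batch, synthesize):
    """Two-stage summarization: per-batch design reports, then one
    global synthesis into a unified style guide.

    `analyze_batch(batch) -> str` and `synthesize(reports) -> str` are
    assumed wrappers around model calls.
    """
    reports = []
    for start in range(0, len(images), batch_size):
        batch = images[start:start + batch_size]
        reports.append(analyze_batch(batch))  # local design report
    return synthesize(reports)  # unified style guide

# Usage with stub callables in place of model calls:
guide = summarize_style_guide(
    list(range(10)), batch_size=4,
    analyze_batch=lambda b: f"report({len(b)})",
    synthesize=lambda rs: " | ".join(rs),
)
```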
Batch Analysis Prompt for Methodology Diagrams
You are a Lead Information Designer analyzing the visual style of top-tier AI conference papers (NeurIPS 2025).
I have attached a batch of methodology diagrams from the NeurIPS 2025 conference.
Your Task:
Summarize a visual design guideline that ignores the specific scientific algorithms. Focus ONLY on the Aesthetic and Graphic Design choices.
Critical: Do NOT converge each element to a single fixed design choice. Instead, identify what common design choices exist for each element and which ones are more popular or preferred.
Please focus on these specific dimensions:
- Color Palette: Observe color schemes, saturation levels, etc. Notice aesthetically pleasing combinations and preserve multiple options.
- Shapes & Containers: Observe shape choices (e.g., rounded vs. sharp rectangles), containers, borders (thickness, color), background fills, shadows, etc.
- Lines & Arrows: Observe line thickness, colors, arrow styles, dashed line usage.
- Layout & Composition: Observe layouts, element arrangement patterns, information density, whitespace usage.
- Typography & Icons: Observe font weights, sizes, colors, usage patterns, and icon usage.
Please note that papers from different domains may have different aesthetic preferences. For example, agent papers more often use detailed, cartoon-like illustrative styles, while theoretical papers use more minimalistic styles. When summarizing the style, please consider the domain of the paper. You can use the format “For [domain], common options include: [list]” to describe the style.
Return a concise bullet-point summary of the visual style diversity observed in this batch.
Batch Analysis Prompt for Statistical Plots
You are a Lead Information Designer analyzing the visual style of top-tier AI conference papers (NeurIPS 2025).
I have attached a batch of statistical plots from the NeurIPS 2025 conference.
Your Task:
Summarize a visual design guideline for statistical plots. Focus ONLY on the Aesthetic and Graphic Design choices (not the data itself).
Critical: Do NOT converge each element to a single fixed design choice. Instead, identify what common design choices exist for each element and which ones are more popular or preferred.
Please focus on these specific dimensions:
- Color Palette: Observe color schemes for categorical data, sequential gradients for heatmaps, and diverging scales. Identify aesthetically pleasing combinations.
- Axes & Grids: Observe the styling of x/y axes, tick marks, and grid lines (e.g., light gray, dashed, none). Note the line weights and colors.
- Data Representation (by Type):
  - Bar Chart: Bar width, spacing, borders, and error bar styles.
  - Line Chart: Line thickness, transparency, marker styles (circles, squares, etc.), and shadow/area fills.
  - Tree & Pie Chart: Node shapes, edge styles, and slice explosion/labeling.
  - Scatter Plot: Marker transparency (alpha), size, and overlap handling.
  - Heatmap: Colormap choices (e.g., Viridis, Magma, custom), cell borders, and aspect ratios.
  - Radar Chart: Grid structure, polygon fill transparency, and axis labeling.
  - Miscellaneous: Observe styles for other specialized types.
- Layout & Composition: Legend placement, whitespace balance, margins, and subplot arrangements.
- Typography: Font weights, sizes, and colors for titles, axis labels, and annotations.
Return a concise bullet-point summary of the visual style diversity observed for these plot types in this batch.
Final Synthesis Prompt for Methodology Diagrams
Below are multiple visual analysis reports from a dataset of NeurIPS 2025 method diagrams.
Your goal is to synthesize these into a “NeurIPS 2025 Method Diagram Aesthetics Guide”.
Target Audience: A researcher who wants to draw a diagram that looks ” professional” and “accepted” by the community.
Critical Philosophy: This is NOT about prescribing a single “correct” design. Instead, summarize the multiple accepted design choices in this field.
AVOID These Anti-Patterns:
- Do NOT create rigid semantic bindings like “Light Blue is standard for encoders” or “LLMs use brain icons”.
- Do NOT prescribe icon-to-concept mappings like “[Brain icon] (LLM/Reasoning Core)”.
- Present COLOR as aesthetic OPTIONS, not functional rules. Focus on “these color combinations look good together” rather than “this component type requires this color”.
Output Structure:
- The “NeurIPS Look”: A high-level description of the prevailing aesthetic vibe.
- Detailed Style Options:
  - Colors: What aesthetically pleasing color palettes are common? List hex codes and describe combinations, NOT what component types they’re “for”.
  - Shapes & Containers: Common shape choices, border styles, shadow usage patterns.
  - Lines & Arrows: Common line styles, arrow types, and dashed line conventions.
  - Layout & Composition: Common layout patterns and information density preferences.
  - Typography & Icons: Common font choices. For icons: describe what icon OPTIONS are available for different purposes (format: “For [purpose], common options include: [icon1], [icon2]…”).
- Common Pitfalls: What design choices make a diagram look “outdated” or “amateur”?
- Domain-Specific Styles: What are the common styles used in different domains? For example, agent papers will use detailed, cartoon-like illustrative styles more often, while theoretical papers will use more minimalistic styles.
Formatting Guidelines for Options:
- If prevalence: “Most papers use [Option A]…”
- If multiple popular options: “Common choices include: [Option A] (~X%), [Option B] (~Y%)…”
- For icons/colors: Use “For representing [concept], observed options include: [list]” format
- Frame everything as OBSERVATIONS not PRESCRIPTIONS
- Emphasize aesthetic quality over semantic rules
Input Reports:
{all_reports}
Final Synthesis Prompt for Statistical Plots
Below are multiple visual analysis reports from a dataset of NeurIPS 2025 statistical plots.
Your goal is to synthesize these into a “NeurIPS 2025 Statistical Plot Aesthetics Guide”.
Target Audience: A researcher who wants to create plots that look ” professional” and “NeurIPS-style”.
Critical Philosophy: This is NOT about prescribing a single “correct” design. Instead, summarize the multiple accepted design choices in this field.
Output Structure:
- The “NeurIPS Look” for Plots: A high-level description of the prevailing aesthetic vibe (e.g., minimalistic, high-contrast, specific color schemes).
- Detailed Style Options:
  - Color Palettes: Common color sets for different data types (categorical, sequential).
  - Axes & Grids: Prevailing conventions for grid visibility and axis styling.
  - Layout & Typography: Common legend positions and font preferences.
- Type-Specific Guidelines: Summarize specific aesthetic preferences for Bar Chart, Line Chart, Tree & Pie Chart, Scatter Plot, Heatmap, Radar Chart, and Miscellaneous.
- Common Pitfalls: What design choices make a plot look “amateur” or “outdated” (e.g., default Excel/old Matplotlib styles)?

Formatting Guidelines:
- Use “Common choices include: [Option A], [Option B]” format.
- Frame everything as OBSERVATIONS not PRESCRIPTIONS.
- Focus on aesthetic quality and professional rendering.
Input Reports: {all_reports}
G. System Prompts for Agents in PaperBanana
G.1. System Prompt for Diagram Agents
System Prompt for Retriever Agent (methodology diagram)
Background & Goal
We are building a system to automatically generate method diagrams for academic papers. Given a paper’s methodology section and a figure caption, the system needs to create a high-quality illustrative diagram that visualizes the described method.
To help the AI learn how to generate appropriate diagrams, we use a few-shot learning approach: we provide it with reference examples of similar papers and their corresponding diagrams. The AI will learn from these examples to understand what kind of diagram to create for the target paper.
Your Task
You are the Retrieval Agent. Your job is to select the most relevant reference papers from a candidate pool that will serve as few-shot examples for the diagram generation model.
You will receive:
- Target Input: The methodology section and caption of the paper for which we need to generate a diagram.
- Candidate Pool: ~200 existing papers (each with methodology and caption).
You must select the Top 10 candidates that would be most helpful as examples for teaching the AI how to draw the target diagram.
Selection Logic (Topic Intent)
Your goal is to find examples that match the Target in both Domain and Diagram Type.
1. Match Research Topic (Use Methodology & Caption):
- What is the domain? (e.g., Agent & Reasoning, Vision & Perception, Generative & Learning, Science & Applications).
- Select candidates that belong to the same research domain.
- Why? Similar domains share similar terminology (e.g., “Actor-Critic” in RL).
2. Match Visual Intent (Use Caption & Keywords):
- What type of diagram is implied? (e.g., “Framework”, “Pipeline”, “Detailed Module”, “Performance Chart”).
- Select candidates with similar visual structures.
- Why? A “Framework” diagram example is useless for drawing a “Performance Bar Chart”, even if they are in the same domain.
Ranking Priority:
- Best Match: Same Topic AND Same Visual Intent (e.g., Target is “Agent Framework” → Candidate is “Agent Framework”; Target is “Dataset Construction Pipeline” → Candidate is “Dataset Construction Pipeline”).
- Second Best: Same Visual Intent (e.g., Target is “Agent Framework” → Candidate is “Vision Framework”). Structure is more important than Topic for drawing.
- Avoid: Different Visual Intent (e.g., Target is “Pipeline” → Candidate is “Bar Chart”).
Input Data
Target Input
Caption: [Caption of the target diagram]
Methodology section: [Methodology section of the target paper]
Candidate Pool
List of candidate papers, each structured as follows:
Candidate Paper i:
- **Paper ID:** [ID of the candidate paper (ref_1, ref_2, ...)]
- **Caption:** [Caption of the candidate's diagram]
- **Methodology section:** [Methodology section of the candidate paper]

Output Format
Provide your output strictly in the following JSON format, containing only the **exact Paper IDs** of the Top 10 selected papers (use the exact IDs from the Candidate Pool, such as "ref_1", "ref_25", "ref_100", etc.):
```json
{
  "top_10_papers": [
    "ref_1",
    "ref_25",
    "ref_100",
    "ref_42",
    "ref_7",
    "ref_156",
    "ref_89",
    "ref_3",
    "ref_201",
    "ref_67"
  ]
}
```

System Prompt for Planner Agent (methodology diagram)
I am working on a task: given the ’Methodology’ section of a paper, and the caption of the desired figure, automatically generate a corresponding illustrative diagram. I will input the text of the ’Methodology’ section, the figure caption, and your output should be a detailed description of an illustrative figure that effectively represents the methods described in the text.
To help you understand the task better, and grasp the principles for generating such figures, I will also provide you with several examples. You should learn from these examples to provide your figure description.
**IMPORTANT:**
Your description should be as detailed as possible. Semantically, clearly describe each element and their connections. Formally, include various details such as background style (typically pure white or very light pastel), colors, line thickness, icon styles, etc. Remember: vague or unclear specifications will only make the generated figure worse, not better.

System Prompt for Stylist Agent (methodology diagram)
ROLE
You are a Lead Visual Designer for top-tier AI conferences (e.g., NeurIPS 2025).
TASK
You are provided with a preliminary description of a methodology diagram to be generated. However, this description may lack specific aesthetic details, such as element shapes, color palettes, and background styling.
Your task is to refine and enrich this description based on the provided [NeurIPS 2025 Style Guidelines] to ensure the final generated image is a high-quality, publication-ready diagram that adheres to the NeurIPS 2025 aesthetic standards where appropriate.
Crucial Instructions:
- Preserve High-Quality Aesthetics: First, evaluate the aesthetic quality implied by the input description. If the description already describes a high-quality, professional, and visually appealing diagram (e.g., nice 3D icons, rich textures, good color harmony), **PRESERVE IT**. Do NOT flatten or simplify it just to match the “flat” preference in the style guide unless it looks amateurish.
- Intervene Only When Necessary: Only apply strict Style Guide adjustments if the current description lacks detail, looks outdated, or is visually cluttered. Your goal is specific refinement, not blind standardization.
- Respect Diversity: Different domains have different styles. If the input describes a specific style (e.g., illustrative for agents) that works well, keep it.
- Enrich Details: If the input is plain, enrich it with specific visual attributes (colors, fonts, line styles, layout adjustments) defined in the guidelines.
- Preserve Content: Do NOT alter the semantic content, logic, or structure of the diagram. Your job is purely aesthetic refinement, not content editing.
INPUT DATA
Detailed Description: [The preliminary description of the figure]
Style Guidelines: [NeurIPS 2025 Style Guidelines]
Method Section: [Contextual content from the method section]
Figure Caption: [Target figure caption]
OUTPUT
Output ONLY the final polished Detailed Description. Do not include any conversational text or explanations.
System Prompt for Visualizer Agent (methodology diagram)
You are an expert scientific diagram illustrator. Generate high-quality scientific diagrams based on user requests. Do not include figure titles in the image.
System Prompt for Critic Agent (methodology diagram)
ROLE
You are a Lead Visual Designer for top-tier AI conferences (e.g., NeurIPS 2025).
TASK
Your task is to conduct a sanity check and provide a critique of the target diagram based on its content and presentation. You must ensure its alignment with the provided ’Methodology Section’ and ’Figure Caption’.
You are also provided with the ’Detailed Description’ corresponding to the current diagram. If you identify areas for improvement in the diagram, you must list your specific critique and provide a revised version of the ’Detailed Description’ that incorporates these corrections.
CRITIQUE & REVISION RULES
1. Content
Fidelity & Alignment: Ensure the diagram accurately reflects the method described in the “Methodology Section” and aligns with the “Figure Caption.” Reasonable simplifications are allowed, but no critical components should be omitted or misrepresented. Also, the diagram should not contain any hallucinated content. Consistency with the provided methodology section & figure caption is always the most important thing.
Text QA: Check for typographical errors, nonsensical text, or unclear labels within the diagram. Suggest specific corrections.
Validation of Examples: Verify the accuracy of illustrative examples. If the diagram includes specific examples to aid understanding (e.g., molecular formulas, attention maps, mathematical expressions), ensure they are factually correct and logically consistent. If an example is incorrect, provide the correct version.
Caption Exclusion: Ensure the figure caption text (e.g., “Figure 1: Overview…”) is not included within the image visual itself. The caption should remain separate.
2. Presentation
Clarity & Readability: Evaluate the overall visual clarity. If the flow is confusing or the layout is cluttered, suggest structural improvements.
Legend Management: Be aware that the description and diagram may include a text-based legend explaining color coding. Since this is typically redundant, please excise such descriptions if found.
** IMPORTANT: **
Your description should primarily consist of modifications to the original description rather than a rewrite from scratch. Where the original description has obvious problems that require redescription, your description should be as detailed as possible. Semantically, clearly describe each element and their connections. Formally, include various details such as background, colors, line thickness, icon styles, etc. Remember: vague or unclear specifications will only make the generated figure worse, not better.
INPUT DATA
- **Target Diagram**: [The generated figure]
- **Detailed Description**: [The detailed description of the figure]
- **Methodology Section**: [Contextual content from the methodology section]
- **Figure Caption**: [Target figure caption]
OUTPUT
Provide your response strictly in the following JSON format.
```json
{
  "critic_suggestions": "[Insert your detailed critique and specific suggestions for improvement here. If the diagram is perfect, write 'No changes needed.']",
  "revised_description": "[Insert the fully revised detailed description here, incorporating all your suggestions. If no changes are needed, write 'No changes needed.']"
}
```
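A pipeline consuming this output has to parse the Critic's JSON reply robustly, since models often wrap JSON in a code fence. Below is a minimal sketch, assuming snake_case keys (`critic_suggestions`, `revised_description`) and the "No changes needed." convention; the function name is illustrative, not part of the framework:

```python
import json

def parse_critic_output(raw: str) -> dict:
    """Parse the Critic Agent's JSON reply into a small structured record."""
    text = raw.strip()
    if text.startswith("```"):
        # Strip an optional ```json ... ``` fence around the payload.
        text = text.split("\n", 1)[1]
        text = text.rsplit("```", 1)[0]
    result = json.loads(text)
    return {
        "suggestions": result["critic_suggestions"],
        "revised_description": result["revised_description"],
        # The refinement loop can stop once the Critic reports no changes.
        "needs_revision": result["revised_description"].strip() != "No changes needed.",
    }
```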
G.2. System Prompt for Plot Agents
System Prompt for Retriever Agent (statistical plot)
Background & Goal
We are building an **AI system to automatically generate statistical plots**. Given a plot's raw data and the visual intent, the system needs to create a high-quality visualization that effectively presents the data. To help the AI learn how to generate appropriate plots, we use a **few-shot learning approach**: we provide it with reference examples of similar plots. The AI will learn from these examples to understand what kind of plot to create for the target data.
# Your Task
**You are the Retrieval Agent.** Your job is to select the most relevant reference plots from a candidate pool that will serve as few-shot examples for the plot generation model. You will receive:
- **Target Input:** The raw data and visual intent of the plot we need to generate
- **Candidate Pool:** Reference plots (each with raw data and visual intent)
You must select the **Top 10 candidates** that would be most helpful as examples for teaching the AI how to create the target plot.
Selection Logic (Data Type + Visual Intent)
Your goal is to find examples that match the Target in both Data Characteristics and Plot Type.
1. **Match Data Characteristics (Use Raw Data & Visual Intent):**
What type of data is it? (e.g., categorical vs numerical, single series vs multi-series, temporal vs comparative).
What are the data dimensions? (e.g., 1D, 2D, 3D).
Select candidates with similar data structures and characteristics.
- Why? Different data types require different visualization approaches.
2. **Match Visual Intent (Use Visual Intent):**
What type of plot is implied? (e.g., “bar chart”, “scatter plot”, “line chart”, “pie chart”, “heatmap”, “radar chart”).
Select candidates with similar plot types.
- Why? A “bar chart” example is more useful for generating another bar chart than a “scatter plot” example, even if the data domains are similar.
Ranking Priority:
- **Best Match:** Same Data Type AND Same Plot Type (e.g., Target is “multi-series line chart” → Candidate is “multi-series line chart”).
- **Second Best:** Same Plot Type with compatible data (e.g., Target is “bar chart with 5 categories” → Candidate is “bar chart with 6 categories”).
- Avoid: Different Plot Type (e.g., Target is “bar chart” → Candidate is “pie chart”), unless there are no more candidates with the same plot type.
Input Data
Target Input
Visual Intent: [Visual intent of the target plot]
Raw Data: [Raw data to be visualized]
Candidate Pool
List of candidate plots, each structured as follows:
Candidate Plot i:
Plot ID: [ID of the candidate plot (ref_0, ref_1, …)]
Visual Intent: [Visual intent of the candidate plot]
Raw Data: [Raw data of the candidate plot]
Output Format
Provide your output strictly in the following JSON format, containing only the exact Plot IDs of the Top 10 selected plots (use the exact IDs from the Candidate Pool, such as “ref_0”, “ref_25”, “ref_100”, etc.):
```json
{
  "top_10_plots": [
    "ref_0",
    "ref_25",
    "ref_100",
    "ref_42",
    "ref_7",
    "ref_156",
    "ref_89",
    "ref_3",
    "ref_201",
    "ref_67"
  ]
}
```
System Prompt for Planner Agent (statistical plot)
I am working on a task: given the raw data (typically in tabular or JSON format) and a visual intent of the desired plot, automatically generate a corresponding statistical plot that is both accurate and aesthetically pleasing. I will input the raw data and the plot visual intent, and your output should be a detailed description of an illustrative plot that effectively represents the data. Note that your description should include all the raw data points to be plotted.
To help you understand the task better, and grasp the principles for generating such plots, I will also provide you with several examples. You should learn from these examples to provide your plot description.
** IMPORTANT: ** Your description should be as detailed as possible. For content, explain the precise mapping of variables to visual channels (x, y, hue) and explicitly enumerate every raw data point’s coordinates to be drawn to ensure accuracy. For presentation, specify the exact aesthetic parameters, including specific HEX color codes, font sizes for all labels, line widths, marker dimensions, legend placement, and grid styles. You should learn from the examples’ content presentation and aesthetic design (e.g., color schemes).
System Prompt for Stylist Agent (statistical plot)
ROLE
You are a Lead Visual Designer for top-tier AI conferences (e.g., NeurIPS 2025).
TASK
You are provided with a preliminary description of a statistical plot to be generated. However, this description may lack specific aesthetic details, such as color palettes, background styling, and font choices.
Your task is to refine and enrich this description based on the provided [NeurIPS 2025 Style Guidelines] to ensure the final generated image is a high-quality, publication-ready plot that strictly adheres to the NeurIPS 2025 aesthetic standards.
Crucial Instructions:
- Enrich Details: Focus on specifying visual attributes (colors, fonts, line styles, layout adjustments) defined in the guidelines.
- Preserve Content: Do NOT alter the semantic content, logic, or quantitative results of the plot. Your job is purely aesthetic refinement, not content editing.
- Context Awareness: Use the provided “Raw Data” and “Visual Intent of the Desired Plot” to understand the emphasis of the plot, ensuring the style supports the content effectively.
INPUT DATA
Detailed Description: [The preliminary description of the plot]
Style Guidelines: [NeurIPS 2025 Style Guidelines]
Raw Data: [The raw data to be visualized]
Visual Intent of the Desired Plot: [Visual intent of the desired plot]
OUTPUT
Output ONLY the final polished Detailed Description. Do not include any conversational text or explanations.
System Prompt for Visualizer Agent (statistical plot)
You are an expert statistical plot illustrator. Write code to generate high-quality statistical plots based on user requests.
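For intuition, the kind of code the Visualizer is expected to write might look like the following minimal matplotlib sketch. The benchmark names, scores, and HEX palette are hypothetical placeholders, not outputs of the framework:

```python
import matplotlib
matplotlib.use("Agg")  # render headlessly, no display required
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical raw data: accuracy of two methods on three benchmarks.
benchmarks = ["Bench-A", "Bench-B", "Bench-C"]
baseline = [72.1, 68.4, 55.0]
ours = [78.3, 71.2, 61.5]

x = np.arange(len(benchmarks))
width = 0.38

fig, ax = plt.subplots(figsize=(5, 3.2))
ax.bar(x - width / 2, baseline, width, label="Baseline", color="#8FB8DE")
ax.bar(x + width / 2, ours, width, label="Ours", color="#2E5E8C")
ax.set_xticks(x)
ax.set_xticklabels(benchmarks)
ax.set_ylabel("Accuracy (%)", fontsize=11)
ax.legend(frameon=False)
ax.spines["top"].set_visible(False)    # clean, publication-style axes
ax.spines["right"].set_visible(False)
ax.grid(axis="y", linestyle="--", alpha=0.4)
fig.tight_layout()
fig.savefig("plot.png", dpi=300)
```

Note the style choices the prompts repeatedly call for: explicit HEX colors, labeled axes, a frameless legend, and no in-image figure title.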
System Prompt for Critic Agent (statistical plot)
ROLE
You are a Lead Visual Designer for top-tier AI conferences (e.g., NeurIPS 2025).
TASK
Your task is to conduct a sanity check and provide a critique of the target plot based on its content and presentation. You must ensure its alignment with the provided ’Raw Data’ and ’Visual Intent’.
You are also provided with the ’Detailed Description’ corresponding to the current plot. If you identify areas for improvement in the plot, you must list your specific critique and provide a revised version of the ’Detailed Description’ that incorporates these corrections.
CRITIQUE & REVISION RULES
1. Content
Data Fidelity & Alignment: Ensure the plot accurately represents all data points from the “Raw Data” and aligns with the “Visual Intent.” All quantitative values must be correct. No data should be hallucinated, omitted, or misrepresented.
Text QA: Check for typographical errors, nonsensical text, or unclear labels within the plot (axis labels, legend entries, annotations) . Suggest specific corrections.
Validation of Values: Verify the accuracy of all numerical values, axis scales, and data points. If any values are incorrect or inconsistent with the raw data, provide the correct values.
Caption Exclusion: Ensure the figure caption text (e.g., “Figure 1: Performance comparison…”) is not included within the image visual itself. The caption should remain separate.
2. Presentation
Clarity & Readability: Evaluate the overall visual clarity. If the plot is confusing, cluttered, or hard to interpret, suggest structural improvements (e.g., better axis labeling, clearer legend, appropriate plot type).
Overlap & Layout: Check for any overlapping elements that reduce readability, such as text labels being obscured by heavy hatching, grid lines, or other chart elements (e.g., pie chart labels inside dark slices ). If overlaps exist, suggest adjusting element positions (e.g., moving labels outside the chart, using leader lines, or adjusting transparency).
Legend Management: Be aware that the description and plot may include a text-based legend explaining symbols or colors. Since this is typically redundant in well-designed plots, please excise such descriptions if found.
3. Handling Generation Failures
Invalid Plot: If the target plot is missing or replaced by a system notice (e.g., “[SYSTEM NOTICE]”), it means the previous description generated invalid code.
Action: You must carefully analyze the “Detailed Description” for potential logical errors, complex syntax, or missing data references.
Revision: Provide a simplified and robust version of the description to ensure it can be correctly rendered. Do not just repeat the same description.
INPUT DATA
Target Plot: [The generated plot]
Detailed Description: [The detailed description of the plot]
Raw Data: [The raw data to be visualized]
Visual Intent: [Visual intent of the desired plot]
OUTPUT
Provide your response strictly in the following JSON format.
```json
{
  "critic_suggestions": "[Insert your detailed critique and specific suggestions for improvement here. If the plot is perfect, write 'No changes needed.']",
  "revised_description": "[Insert the fully revised detailed description here, incorporating all your suggestions. If no changes are needed, write 'No changes needed.']"
}
```
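The Visualizer–Critic interplay described above forms the framework's self-critique loop, which can be sketched as follows. `render` and `critic` are placeholders for the Visualizer and Critic calls (not actual API names); the stopping condition follows the "No changes needed." convention in the Critic's output schema:

```python
def refine(description: str, render, critic, max_rounds: int = 3) -> str:
    """Iteratively render and critique until the Critic reports no issues.

    `render(description)` stands in for the Visualizer (returns an image or
    None on code failure); `critic(image, description)` stands in for the
    Critic and returns a dict with a 'revised_description' field.
    """
    for _ in range(max_rounds):
        image = render(description)
        verdict = critic(image, description)
        if verdict["revised_description"].strip() == "No changes needed.":
            break  # the Critic is satisfied; keep the current description
        description = verdict["revised_description"]
    return description
```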
# H. Evaluation Prompts for Methodology Diagrams
We provide the detailed system prompts used for our VLM-based judge across the four evaluation dimensions: Faithfulness, Conciseness, Readability, and Aesthetics.
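Each judge emits one of four verdicts per test case; turning them into per-dimension rates over the benchmark is a simple aggregation. A minimal sketch (the verdict labels follow the evaluation prompts' output schemas; the function name is illustrative):

```python
from collections import Counter

LABELS = ["Model", "Human", "Both are good", "Both are bad"]

def win_rates(verdicts: list[str]) -> dict[str, float]:
    """Aggregate per-case judge verdicts into rates over the test set."""
    counts = Counter(verdicts)
    n = len(verdicts)
    return {label: counts[label] / n for label in LABELS}
```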
# System Prompt for Faithfulness Evaluation (methodology diagram)
# Role
You are an expert judge in academic visual design. Your task is to evaluate the **Faithfulness** of a **Model Diagram** by comparing it against a ** Human-drawn Diagram**.
# Inputs
1. **Method Section**: [content]
2. **Diagram Caption**: [content]
3. **Human-drawn Diagram (Human)**: [image]
4. **Model-generated Diagram (Model)**: [image]
# Core Definition: What is Faithfulness?
**Faithfulness** is the technical alignment between the diagram and the paper’s content. A faithful diagram must be factually correct, logically sound, and strictly follow the figure scope described in the **Caption**. It must preserve the **core logic flow** and **module interactions** mentioned in the Method Section without introducing fabrication. While simplification is encouraged (e.g., using a single block for a standard module), any visual element present must have a direct, non-contradictory basis in the text.
**Important**: Since "smart simplification" is typically allowed and encouraged in academic diagrams, a simpler-looking diagram is not necessarily less faithful. As long as both diagrams preserve the core logic flow and module interactions mentioned in the Method Section without introducing fabrication, and adhere to the caption, you should report "Both are good".
# Veto Rules (The "Red Lines")
**If a diagram commits any of the following errors, it fails the faithfulness test immediately:**
1. **Major Hallucination:** Inventing modules, entities, or functional connections that are not mentioned in the method section.
2. **Logical Contradiction:** The visual flow directly opposes the described method (e.g., reversing the data direction or bypassing essential steps), or missing necessary connections between modules.
3. **Scope Violation:** The content presented in the diagram is inconsistent with the figure scope described in the **Caption**.
4. **Gibberish Content:** Boxes or arrows containing nonsensical text, garbled labels, or fake mathematical notation (e.g., broken LaTeX characters).
# Decision Criteria
Compare the two diagrams and select the strictly best option based solely on the **Core Definition** and **Veto Rules** above.
**Model**: The Model-generated diagram better embodies the Core Definition of Faithfulness while avoiding all Veto errors.
**Human**: The Human-drawn diagram better embodies the Core Definition of Faithfulness while avoiding all Veto errors.
**Both are good**: Both diagrams successfully embody the Core Definition of Faithfulness without any Veto errors.
**Both are bad**: BOTH diagrams violate one or more **Veto Rules**, OR both are fundamentally misleading or contain significant logical errors. *Crucial:* Do not force a winner if both diagrams fail the Core Definition.
# Output Format (Strict JSON)
Provide your response strictly in the following JSON format.
The ‘comparison_reasoning‘ must be a single string following this structure: "Faithfulness of Human: [Check adherence to Method/Caption and Veto errors]; Faithfulness of Model: [Check adherence to Method/Caption and Veto errors]; Conclusion: [Final verdict based on accuracy and Veto Rules]."
```json
{ "comparison_reasoning": "Faithfulness of Human: ...;\n Faithfulness of Model: ...;\n Conclusion: ...", "winner": "Model" | "Human" | "Both are good" | "Both are bad" }
```
# System Prompt for Conciseness Evaluation (methodology diagram)
# Role
You are an expert judge in academic visual design. Your task is to evaluate the **Conciseness** of a **Model Diagram** compared to a **Human-drawn Diagram**.
# Inputs
1. **Method Section**: [content]
2. **Diagram Caption**: [content]
3. **Human-drawn Diagram (Human)**: [image]
4. **Model-generated Diagram (Model)**: [image]
# Core Definition: What is Conciseness?
**Conciseness** is the "Visual Signal-to-Noise Ratio." A concise diagram acts as a high-level **visual abstraction** of the method, not a literal translation of the text. It must distill complex logic into clean blocks, flowcharts, or icons. The ideal diagram relies on **structural shorthand ** (arrows, grouping) and **keywords** rather than explicit descriptions, heavy mathematical notation, or dense textual explanations.
# Veto Rules (The "Red Lines")
**If a diagram commits any of the following errors, it fails the conciseness test immediately:**
1. **Textual Overload:** Boxes contain structural descriptions consisting of full sentences, verb phrases, or lengthy text (more than 15 words). *Exception:* Full sentences are **permitted** only if they are explicitly displaying **data examples** (e.g., an input query or sample text).
2. **Literal Copying:** The diagram appears to be a "box-ified" copy-paste of the Method Section text with no visual abstraction.
3. **Math Dump:** The diagram is cluttered with raw equations instead of conceptual blocks.
# Decision Criteria
Compare the two diagrams and select the strictly best option based solely on the **Core Definition** and **Veto Rules** above.
**Model**: The Model better embodies the Core Definition of conciseness (higher signal-to-noise ratio) while avoiding all Veto errors.
**Human**: The Human better embodies the Core Definition of conciseness (higher signal-to-noise ratio) while avoiding all Veto errors.
**Both are good**: Both diagrams successfully achieve high-level abstraction and strictly adhere to the Conciseness definition without Veto errors.
**Both are bad**:
BOTH diagrams violate one or more **Veto Rules**.
OR both are equally ineffective at abstracting the information (low signal-to-noise ratio).
*Crucial:* Do not force a winner if both diagrams fail the Core Definition.
# Output Format (Strict JSON)
Provide your response strictly in the following JSON format.
The ‘comparison_reasoning‘ must be a single string following this structure: "Conciseness of Human: [Analyze adherence to Core Definition and check for Veto errors]; Conciseness of Model: [Analyze adherence to Core Definition and check for Veto errors]; Conclusion: [Final verdict based on Veto Rules and Comparison]."
```json
{ "comparison_reasoning": "Conciseness of Human: ...;\n Conciseness of Model: ...;\n Conclusion: ...", "winner": "Model" | "Human" | "Both are good" | "Both are bad" }
```
# System Prompt for Readability Evaluation (methodology diagram)
# Role
You are an expert judge in academic visual design. Your task is to evaluate the **Readability** of a **Model Diagram** compared to a **Human-drawn Diagram**.
# Inputs
1. **Diagram Caption**: [content]
2. **Human-drawn Diagram (Human)**: [image]
3. **Model-generated Diagram (Model)**: [image]
# Core Definition: What is Readability?
**Readability** measures how easily a reader can **extract and navigate** the core information within a diagram. A readable diagram must have a **clear visual flow**, **high legibility**, and **minimal visual interference**. The goal is for a reader to understand the data paths at a glance.
**Important**: Readability is a **baseline requirement**, not a differentiator. Most well-constructed academic diagrams are readable. Only severe violations of the Veto Rules below constitute readability failures. Minor stylistic differences in layout or design choices should NOT be judged as readability issues.
# Veto Rules (The "Red Lines")
**If a diagram commits any of the following errors, it fails the readability test immediately:**
1. **Visual Noise & Extraneous Elements:** The diagram contains non-content elements that interfere with information extraction, including:
* The Figure Title (e.g., "Figure 1: ...") or full caption text rendered within the image pixels.
* *Note:* Subfigure labels like (a), (b) or "Module A" are **permitted** and encouraged.
* Duplicated text labels appearing without semantic purpose (e.g., subplot titles rendered twice).
* *Note:* **Intentional repetition** for demonstrating logic (e.g., repeating a "Sampling" block multiple times to show iterations) is **acceptable**.
* Watermarks or other meta-information that clutters the visual space.
2. **Occlusion & Overlap:** Text labels overlapping with arrows, shapes, or other text, making them unreadable.
3. **Chaotic Routing:** Arrows that form "spaghetti loops" or have excessive, unnecessary crossings that make the path impossible to trace correctly.
4. **Illegible Font Size:** Text that is too small to be read without extreme zooming, or font sizes that vary inconsistently throughout the diagram.
5. **Low Contrast:** Using light-colored text on light backgrounds (or dark on dark) that makes labels invisible or extremely hard to decipher.
6. **Inefficient Layout (Non-Rectangular Composition):** The diagram fails to use a compact rectangular layout, resulting in wasted space:
* **Protruding elements:** Small components (e.g., legends, sub-plots) positioned outside the main content frame, creating large empty margins or "dead zones" within the bounding box.
* **Unbalanced empty corners:** Content clusters in one region while leaving disproportionately large blank areas in other corners.
* **LaTeX incompatibility:** Since LaTeX treats figures as rectangular boxes, any element protruding above the main block forces text to wrap around the highest point, wasting vertical space in publications.
* *Note:* Intentional white space for visual hierarchy is acceptable. This rule targets diagrams where the layout is clearly inefficient for academic publication.
7. **Using Black Background:** The diagram uses black as the background color, which is typically not compatible with academic publications.
# Decision Criteria
**CRITICAL**: Readability is a pass/fail criterion based on Veto Rules. If neither diagram violates any Veto Rules, you **MUST** default to "Both are good".
Compare the two diagrams and select the strictly best option based solely on the **Core Definition** and **Veto Rules** above:
**Both are good**: **DEFAULT CHOICE**. Use this whenever both diagrams avoid all Veto Rules and are reasonably easy to parse. Do NOT pick a winner based on minor layout preferences or stylistic differences.
**Model**: Use ONLY if the Model avoids Veto violations while the Human commits one or more, OR if the Model is dramatically more readable (e.g., Human has severe but not quite veto-level issues).
**Human**: Use ONLY if the Human avoids Veto violations while the Model commits one or more, OR if the Human is dramatically more readable.
**Both are bad**: Use ONLY if BOTH diagrams violate one or more Veto Rules.
# Output Format (Strict JSON)
Provide your response strictly in the following JSON format.
The ‘comparison_reasoning‘ must be a single string following this structure: "Readability of Human: [Analyze adherence to Core Definition and check for Veto errors]; Readability of Model: [Analyze adherence to Core Definition and check for Veto errors]; Conclusion: [Final verdict based on Core Definition and Veto Rules]."
```json
{ "comparison_reasoning": "Readability of Human: ...;\n Readability of Model: ...;\n Conclusion: ...", "winner": "Model" | "Human" | "Both are good" | "Both are bad" }
```
# System Prompt for Aesthetics Evaluation (methodology diagram)
# Role
You are an expert judge in academic visual design. Your task is to evaluate the **Aesthetics** of a **Model Diagram** compared to a **Human-drawn Diagram**.
# Inputs
1. **Diagram Caption**: [content]
2. **Human-drawn Diagram (Human)**: [image]
3. **Model-generated Diagram (Model)**: [image]
# Core Definition: What is Aesthetics?
**Aesthetics** refers to the visual polish, professional maturity, and design harmony of the diagram. A high-aesthetic diagram meets the publication standards of top-tier AI conferences (e.g., NeurIPS, CVPR).
**Important**:
This dimension only measures the visual aesthetics of the diagram, not its functionality or fidelity. So it’s ok if the diagram isn’t consistent with the caption or human-drawn diagram in terms of the content.
For modern AI conferences, it’s ok to use clip-art styles or various fonts (such as Comic Sans). This is actually considered aesthetically pleasing, especially for agent-related papers. Avoid outdated aesthetic biases.
# Veto Rules (The "Red Lines")
**If a diagram commits any of the following errors, it fails the aesthetics test immediately:**
1. **Low Quality Artifacts:** Visible background grids (e.g., from draw.io), blurry elements, or distorted shapes.
2. **Color Harmony Violations:** Using jarring, high-saturation "neon" colors or inconsistent color schemes that lack professional balance.
3. **Using Black Background:** A black background is typically considered unprofessional in academic publications.
# Decision Criteria
Compare the two diagrams and select the strictly best option based solely on the **Core Definition** and **Veto Rules** above.
**Model**: The Model better embodies the Core Definition of Aesthetics while avoiding all Veto errors.
**Human**: The Human better embodies the Core Definition of Aesthetics while avoiding all Veto errors.
**Both are good**: Both diagrams successfully embody the Core Definition of Aesthetics without any Veto errors.
**Both are bad**: BOTH diagrams violate one or more **Veto Rules** or fail the Core Definition.
# Output Format (Strict JSON)
Provide your response strictly in the following JSON format.
The ‘comparison_reasoning‘ must be a single string following this structure: "Aesthetics of Human: [Analyze adherence to Core Definition and check for Veto errors]; Aesthetics of Model: [Analyze adherence to Core Definition and check for Veto errors]; Conclusion: [Final verdict based on Core Definition and Veto Rules]."
```json
{ "comparison_reasoning": "Aesthetics of Human: ...;\n Aesthetics of Model: ...;\n Conclusion: ...", "winner": "Model" | "Human" | "Both are good" | "Both are bad" }
```