[]Jake Poznanski []Aman Rangapur []Jon Borchardt []Jason Dunkelberger []Regan Huff []Daniel Lin []Christopher Wilhelm []Kyle Lo []Luca Soldaini []Allen Institute for AI, Seattle, USA {jakep|kylel|lucas}@allenai.org indicates core contributors.
![[Uncaptioned image]](https://arxiv.org/html/2502.18443v3/x5.png)
olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models
Abstract
PDF documents have the potential to provide trillions of novel, high-quality tokens for training language models. However, these documents come in a diversity of types with differing formats and visual layouts that pose a challenge when attempting to extract and faithfully represent the underlying content for language model use. Traditional open-source tools often produce lower-quality extractions than vision language models (VLMs), but reliance on the best VLMs can be prohibitively costly (e.g., over 6,200 USD per million PDF pages with GPT-4o). We present olmOCR, an open-source toolkit for converting PDFs into clean, linearized plain text at a fraction of that cost. To aid comparison with existing systems, we also introduce olmOCR-Bench, a curated set of 1,400 PDFs capturing many content types that remain challenging even for the best tools and VLMs, including formulas, tables, tiny fonts, old scans, and more. We find olmOCR outperforms even top VLMs, including GPT-4o, Gemini Flash 2, and Qwen-2.5-VL. We openly release all components of olmOCR: our fine-tuned VLM, training code and data, an efficient inference pipeline that supports vLLM and SGLang backends, and the benchmark olmOCR-Bench.
| Code | allenai/olmocr | Weights & Data | allenai/olmocr | Demo | olmocr.allenai.org |
|---|---|---|---|---|---|
1 Introduction
Access to clean, coherent textual data is a crucial component in the life cycle of modern language models (LMs). During model development, LMs require training on trillions of tokens derived from billions of documents 1 2 3; errors from noisy or low fidelity content extraction and representation can result in training instabilities or even worse downstream performance 4 3 5. During inference, LMs are often prompted with plain text representations of relevant document context to ground user prompts; for example, consider information extraction 6 or AI reading assistance 7 over a user-provided document and cascading downstream errors due to low quality representation of the source document.
While the internet remains a valuable source of textual content for language models, large amounts of content are not readily available through web pages. Electronic documents (e.g., PDF, PS, DjVu formats) and word processing files (e.g., DOC, ODT, RTF) are widely-used formats to store textual content. However, these formats present a unique challenge: unlike modern web standards, they encode content to facilitate rendering on fixed-size physical pages, at the expense of preserving logical text structure. For example, consider the PDF format, which originated as a means to specify how digital documents should be printed onto physical paper. As seen in Figure 2, PDFs store not units of text—headings, paragraphs, or other meaningful prose elements—but single characters alongside their spacing, placement, and any metadata used for visual rendering on a page. As more and more documents became digital, users have relied on this file format to create trillions of documents 8; yet, these documents remain difficult to leverage in LM pipelines because PDFs lack basic structure necessary for coherent prose, such as ground truth reading order.

Figure 1: Performance-to-cost of olmOCR compared to a range of methods for PDF linearization and content extraction. Baselines include open and closed-source specialized tools and general VLMs prompted to perform this task. Performance is calculated on olmOCR-Bench, while Cost is calculated using commercial API pricing or the L40S GPU hourly rate (full details in Appendix B). As olmOCR uses a fine-tuned Qwen 2 VL model (7B), they share the same inference cost; performance differences are a result of fine-tuning on our dataset olmOCR-mix-0225.
Figure 2: Example of how PDFs represent textual content, such as this paper title, as individual glyphs with metadata.
Faithful content extraction and representation of digitized print documents has long been of interest, with early research efforts in the 1950s and the first commercial optical character recognition (OCR) tools debuting in the late 1970s 9. The release of Tesseract in 2006 represented a significant milestone as the first high-quality, open-source OCR toolkit 10. The current landscape of PDF extraction toolkits can be partitioned into pipeline-based systems and end-to-end models. Pipeline-based systems (MinerU, 11; Marker, 12) are comprised of multiple ML components (e.g., section segmentation, table parsing) chained together; some, such as Grobid 13, VILA 14, and PaperMage 15, are tailored to scientific papers. On the other hand, end-to-end models parse a document with a single model. For example, Nougat 16 and GOT-OCR 2.0 17 take images of PDF pages as input and return plain text. Notably, while pipeline-based systems have historically focused on faithful extraction alone, end-to-end systems have also made strides in linearization of this content—prescribing a flattening of the content to adhere to logical reading order—which can be quite challenging for layout-rich documents with many floating elements (e.g., multi-column documents with floating diagrams, headers, footnotes, and more). Recently, rapid advances in proprietary LMs have led to significant improvements in end-to-end text extraction capabilities 18 19. However, this capability comes at a steep price: for example, converting a million pages using GPT-4o can cost over $6,200 USD.1
We introduce olmOCR, a general-purpose content extraction and linearization toolkit that converts PDFs or images of documents into clean plain text suitable for language model development. Our contributions in this work are as follows:
- Data. We create olmOCR-mix-0225, a collection of 260,000 crawled PDF pages paired with their OCR output by GPT-4o, that we use to train our models. These documents represent a diverse set of publicly available PDFs, with a skew towards academic papers, public domain books, legal documents, brochures, and more.
- Benchmark. We develop olmOCR-Bench, a comprehensive benchmark for evaluating document extraction tools. Unlike existing evaluation methods, olmOCR-Bench uses simple, natural binary rules, like software unit tests, that enable direct comparisons across different OCR systems without relying on fuzzy gold reference matching or LLM-as-judge for evaluation. The benchmark covers 1,400 PDF pages with over 7,000 unit-test cases spanning diverse document types.
- Model and Code. We fine-tune Qwen2-VL-7B-Instruct 20 on olmOCR-mix-0225, producing olmOCR-7B-0225-preview. We package our VLM in the olmOCR Python toolkit, written to scale efficiently from one to hundreds of GPUs using the SGLang 21 and vLLM 22 inference engines. olmOCR achieves state-of-the-art performance on our benchmark, even outperforming Qwen-2.5-VL-7B, while remaining more cost-effective than existing alternatives, including commercial APIs; olmOCR can produce high-quality plain text for less than $176 per million PDF pages.
- Downstream Use. We demonstrate real-world impact by applying olmOCR to process the 7.9M original PDFs in peS2o 23, a widely-used corpus of linearized scientific articles used in language model pretraining. We show that training on the newly extracted data, which we call olmOCR-peS2o, can improve language model pretraining, with gains observable even in downstream benchmark performance.
2 Creating and Training on olmOCR mix
We face two challenges in data acquisition necessary for developing a VLM for our task: (1) acquiring a large, diverse set of PDFs and (2) obtaining their linearized plain text as supervision targets.
2.1 Crawling PDFs
| Source | Unique docs | Total pages |
|---|---|---|
| Web crawled PDFs | 96,929 | 240,940 |
| Internet Archive books | 5,896 | 17,701 |
| Total | 102,825 | 258,641 |
Table 1: olmOCR-mix-0225 composition by source. Web crawled PDFs are sampled from a set of over 240 million documents crawled from public websites. Books in the Internet Archive set are in the public domain.
| Document type | Fraction |
|---|---|
| Academic | 55.9% |
| Brochure | 11.2% |
| Legal | 10.2% |
| Books | 6.8% |
| Table | 5.6% |
| Diagram | 4.7% |
| Slideshow | 1.9% |
| Other | 3.7% |
Table 2: olmOCR-mix-0225 PDFs breakdown by document type. Estimated by sampling 707 pages, classified using gpt-4o-2024-11-20. Prompt in Appendix E.3.
We randomly sample PDFs from an internal dataset of 240 million PDFs crawled from public internet sites, as well as PDFs of public domain books sourced from the Internet Archive. While the web-crawled set consists largely of born-digital documents, PDFs from the Internet Archive consist of image scans. We then apply a set of filters: using the Lingua package 24, we identify and filter out non-English documents; further, we remove any document that failed to be parsed by pypdf, contains spam keywords, is a fillable form, or whose text is too short.2 We then sampled (up to) three pages uniformly at random from each PDF. We summarize the data distribution in Tables 1 and 2.
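The filtering pass above can be sketched as follows. This is a minimal illustration, not the actual pipeline: the helper predicates (`is_english`, `parsed_ok`, `is_fillable_form`) stand in for the real checks, which use the Lingua language detector and pypdf, and the spam keyword list and minimum-length threshold are assumptions since the paper only says documents with too little text are dropped.

```python
import random

# Illustrative stand-ins; the real pipeline uses Lingua and pypdf.
SPAM_KEYWORDS = {"click here", "free download"}
MIN_TEXT_CHARS = 200  # assumed threshold for "too short" text

def keep_document(text: str, is_english: bool, parsed_ok: bool,
                  is_fillable_form: bool) -> bool:
    """Return True if a document survives all filters."""
    if not (parsed_ok and is_english) or is_fillable_form:
        return False
    lowered = text.lower()
    if any(kw in lowered for kw in SPAM_KEYWORDS):
        return False
    return len(text.strip()) >= MIN_TEXT_CHARS

def sample_pages(num_pages: int, k: int = 3, seed: int = 0) -> list[int]:
    """Sample up to k page indices uniformly at random, as done per PDF."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_pages), min(k, num_pages)))
```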
2.2 Generating Linearized Plain Text
Obtaining supervision targets for converting PDF to plain text presents a fundamental challenge. First, human annotation is prohibitively expensive and can be error-prone. Second, existing tools that extract content from PDF internals do not work on scanned document images, and their brittle heuristics introduce extraction errors that make them unreliable as ground truth. In this work, we turned to data generation using GPT-4o to reliably convert PDF pages to linearized plain text.3
Yet, GPT-4o does not produce sufficiently high-fidelity plain text on its own; for high-density pages or complex layouts, we found it is prone to omitting content, rewriting or completing content in a manner unfaithful to the original, or captioning images when not instructed to do so. To help guide GPT-4o generations, we experiment with augmenting the visual input (the PDF page raster) with text blocks and position information extracted from the page. As described in Appendix A, we refer to this approach as document-anchoring.
We use the pypdf 25 library to extract a representation of the page’s structure from the PDF’s internal data. We note that this representation is highly noisy: reading order is not preserved and main content is interwoven with boilerplate text and PDF rendering-related artifacts. We sample blocks from this long extraction to add to the prompt until maximum input length is exceeded; we prioritize text blocks and images which are located at the start and end of the document.
Finally, we instruct GPT-4o to respond with structured output to our requests. We report the full JSON schema in Appendix E.1. This forces the model to first extract page metadata, such as language, page orientation, and presence of tables, before generating the text of the page in a natural reading order. This format allows for more efficient processing of output; further, we found it crucial to ensure that GPT-4o does not generate captions of images when no text is present on the page. Overall, we find document-anchoring indeed improves the output quality of GPT-4o according to our benchmark (§3) reported in Table 4.
2.3 Model Training
Fine-tuning
While document-anchoring could be used to prompt any language model, its performance may depend on the model (Table 4), making it best suited as a data generation technique. This leaves open the question of whether a smaller, specialized VLM can be as accurate as optimized prompting of a larger, general-purpose model.
Starting from a Qwen2-VL-7B-Instruct checkpoint, we fine-tune olmOCR-7B-0225-preview on olmOCR-mix-0225. Training is implemented using Hugging Face’s transformers library 26. We use an effective batch size of 4, a learning rate of 1e-6, the AdamW optimizer, and a cosine annealing schedule for 10,000 steps (roughly 1.2 epochs).4 We use a single node with 8 NVIDIA H100 (80GB) GPUs. A single training run took 16 node-hours, with all training experiments totaling 365 node-hours.
During fine-tuning, we slightly alter the document-anchoring prompt, removing some instructions and shrinking the image size so that PDF pages are rendered to a maximum dimension of 1024 pixels on the longest edge. The simplified text prompt is in Appendix E.2. The prompt is capped to 6,000 characters, so a typical prompt uses about 1,000 tokens to encode a page image, 1,800 tokens for the anchor text, for about 3,000 total input tokens. Each training example was truncated to 8,192 tokens to cover cases when the prompt was unusually large. Loss was masked so only the final response tokens participated in the loss calculation.
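The loss masking described above can be sketched in a few lines: label positions for the prompt are set to -100, the index that Hugging Face's cross-entropy loss ignores, so only the final response tokens contribute to the loss. Token IDs here are dummies; this is an illustration of the masking scheme, not the training code.

```python
IGNORE_INDEX = -100  # ignored by Hugging Face's cross-entropy loss
MAX_LEN = 8192       # training examples are truncated to this many tokens

def build_example(prompt_ids: list[int], response_ids: list[int]):
    """Concatenate prompt and response, masking prompt positions in the labels."""
    input_ids = (prompt_ids + response_ids)[:MAX_LEN]
    labels = ([IGNORE_INDEX] * len(prompt_ids) + response_ids)[:MAX_LEN]
    return input_ids, labels
```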
We keep the same structured JSON output that was present in the outputs of olmOCR-mix-0225. More training evaluations are noted in Appendix C.
3 Building olmOCR-Bench
We develop olmOCR-Bench to systematically evaluate PDF linearization and content extraction performance across diverse tools and models. olmOCR-Bench operates by assessing a series of predefined pass-or-fail “unit tests”: given an input PDF, does the plain text output satisfy a specific property or contain a specific element? Each test is designed to be simple, unambiguous, and deterministically machine-verifiable. This avoids reliance on model-based evaluators, which can be biased towards favoring their own generations 27. It also avoids soft-metric comparisons (e.g., edit distance, ROUGE) against reference text, which might fail to reveal fine-grained yet semantically important content extraction errors, as is the case with incorrect math formulas. olmOCR-Bench comprises 1,402 distinct PDF documents derived from diverse source repositories, covered by 7,010 unique test cases. Some test patterns apply to any document type (e.g., presence, absence, reading order) while others are motivated by particularly challenging yet important content extraction targets (e.g., tables, math formulas); see Table 3 for a breakdown.
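As a concrete illustration, a single unit test can be represented as a small record plus a deterministic checker. The field names and the example PDF path below are hypothetical, not the benchmark's actual schema; the checker shown handles the reading-order category described in the next section.

```python
# Hypothetical representation of one olmOCR-Bench unit test.
case = {
    "pdf": "multi_column/example.pdf",   # illustrative path
    "page": 1,
    "type": "order",                      # e.g., present / absent / order / table / math
    "before": "City Council Approves Budget",
    "after": "The council convened on Tuesday",
}

def run_order_test(output: str, case: dict) -> bool:
    """Pass iff both segments occur and 'before' precedes 'after'."""
    i, j = output.find(case["before"]), output.find(case["after"])
    return i != -1 and j != -1 and i < j
```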
3.1 Unit Test Categories
We designed 5 distinct test categories, each assessing a specific aspect of linearization and content extraction performance. We describe the test definitions and scoring methods below:
- Text Presence: Verifies that a text segment (typically spanning 1-3 sentences) is correctly identified within the plain text output. Soft/fuzzy matching is allowed, as well as specifying if the text must be in the first or last characters of the document. Case-sensitive by default.
- Text Absence: Verifies that a text segment is successfully excluded from the plain text output. This category primarily targets peripheral content such as recurring headers, footers, and pagination markers. Soft/fuzzy matching is allowed, as well as specifying if the text must be in the first or last characters of the document. Not case-sensitive by default.
- Natural Reading Order: Verifies the order between two text segments. For instance, on a PDF with multiple news articles on one page, we can test for whether the first sentence of the first article appears after the heading of that article; yet such tests can be designed to not penalize for the order of the articles themselves. Soft matching is allowed, case-sensitive by default.
- Table Accuracy: Checks that the plain text output contains a table with a cell with a given value, and that its neighboring cells have certain properties. For instance, one can validate this page has a table with a cell containing “4.5%” and above that is a cell containing “2.4%”. Both Markdown and HTML based tables are supported, though many cases depend on rowspan and colspan information being preserved, which is possible only in HTML based tables.
- Math Formula Accuracy: Checks that the plain text output contains a given math equation. We render a reference equation using KaTeX in a headless browser and extract all rendered symbols and their (visual) bounding boxes. We then check whether a matching collection of symbols, with the same relative positions, exists anywhere in the final OCR document; for instance, for a fraction, we require the numerator’s symbols to sit above the fraction bar and the denominator’s below it, with symbols in the same left-to-right order. This is similar to the method described by 28, but ours is simpler because each test is Pass/Fail only.
- Baseline: Each PDF document by default also receives a baseline test case, which checks that some plain text output containing alphanumeric characters was actually produced for that page, that such output does not end with a run of repeating n-grams (longer than 30), and that the output does not contain any characters from the Chinese, Japanese, or Emoji Unicode charsets.5
In all cases where text is compared, we perform basic string normalization, such as converting line-break tags to newlines, normalizing all whitespace to single ASCII spaces, removing Markdown bold/italics, normalizing quotes and hyphens to ASCII, and converting all Unicode to NFC form.
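The normalization and a Text Presence check can be sketched as below. This is a simplification of the rules just listed, with assumed details: the Markdown-emphasis regex is crude, and the exact rule set used by the benchmark may differ.

```python
import re
import unicodedata

def normalize(text: str) -> str:
    """Simplified sketch of the string normalization applied before comparison."""
    text = unicodedata.normalize("NFC", text)
    text = re.sub(r"<br\s*/?>", "\n", text)        # line-break tags -> newlines
    text = re.sub(r"\*{1,2}|_{2}", "", text)       # strip bold/italics markers (crude)
    for smart, plain in [("\u201c", '"'), ("\u201d", '"'), ("\u2018", "'"), ("\u2019", "'")]:
        text = text.replace(smart, plain)          # smart quotes -> ASCII
    text = re.sub(r"[\u2010-\u2015]", "-", text)   # unicode hyphens/dashes -> ASCII
    return re.sub(r"\s+", " ", text).strip()       # collapse whitespace

def text_presence(output: str, target: str, case_sensitive: bool = True) -> bool:
    """Pass/fail check that `target` occurs in the normalized output."""
    a, b = normalize(output), normalize(target)
    if not case_sensitive:
        a, b = a.lower(), b.lower()
    return b in a
```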
| | Presence | Absence | Read Order | Table | Formula | Total tests |
|---|---|---|---|---|---|---|
| arXiv Math (AM) | - | - | - | - | 2,927 | 2,927 |
| Old Scans Math (OSM) | - | - | - | - | 458 | 458 |
| Tables (TA) | - | - | - | 1,020 | - | 1,020 |
| Old Scans (OS) | 279 | 70 | 177 | - | - | 526 |
| Headers Footers (HF) | - | 753 | - | - | - | 753 |
| Multi Column (MC) | - | - | 884 | - | - | 884 |
| Long Tiny Text (LTT) | 442 | - | - | - | - | 442 |
| Total Tests | 721 | 823 | 1,061 | 1,020 | 3,385 | 7,010 |
Table 3: Counts of unit test types in olmOCR-Bench.
3.2 Sourcing Documents and Creating Tests
We define 7 distinct document types that we found olmOCR (or its earlier iterations) often struggled to process, and we devised a custom acquisition strategy for each (described below). We removed documents that both contained PII and were not meant for public dissemination; prompt in Appendix F.2.2. We also decontaminate against documents that appear in olmOCR-mix-0225 via URL-level deduplication. To scale creation of test cases over these documents, we combined manual design and review with prompting GPT-4o; further details and prompts are in Appendix F. Sample documents are visualized in Appendix F.3.
- arXiv Math (AR) We downloaded a recent set of papers from the math subset of arXiv, selecting manuscripts with a single TeX source file and a corresponding rendered PDF. To select a candidate expression from a page to use in a test, we (1) ran olmOCR to identify candidate pages with TeX, (2) matched pages back to the original TeX source, and (3) validated the matched TeX’s rendering compatibility with KaTeX. We manually verify the final set of test cases to exclude instances where custom macros produce renderings that deviate from the standard, and to split multi-part equations into smaller test cases.
- Old Scans Math (OSM) We crawl old, public domain math textbooks from the Internet Archive 6, extracting random pages from these documents. We similarly use olmOCR to find candidate pages with formulas, but this time manually annotate each formula on the page to use as test cases.
- Tables (TA) We sampled more documents from the same internal crawled PDF repository used to create olmOCR-mix-0225 and filtered to those which had tables using a simple prompt with Gemini-Flash-2.0. On pages with tables, we prompted Gemini-Flash-2.0 for the relationships between randomly chosen cells. We manually reviewed those tests for accuracy.
- Old Scans (OS) We sampled historical letters and typewritten documents with existing human transcriptions from the Library of Congress 7 digital archives. We then wrote a small script to generate Natural Reading Order cases consisting of sentences that were naturally before or after one another in the original human transcriptions. We manually added test cases to cover some headers/footers which should have been excluded from any OCR version of these documents. All of the test cases then underwent a second pass of human review for accuracy.
- Headers Footers (HF) We sampled documents from the same internally crawled PDF repository as olmOCR-mix-0225. We used DocLayout-YOLO 29 to identify page regions labeled as headers or footers via its abandon category. To extract the text from these header/footer regions, we visually mask out the rest of the document and prompt Gemini-Flash-2.0 for the content. These extracted snippets are added as test cases whose text should be absent from linearized output. We manually reviewed the cases to remove mistakenly filtered text and to set conditions such as limiting the search area to the first or last N characters. For example, if a page number “5” appears at the bottom of a page, we test that the output plain text does not contain “5” in the last 20 characters, but still allow for a “5” that may appear earlier in the text.
- Multi Column (MC) We visually sample documents from our internal crawled PDF repository to find documents with multi-column layouts and multiple articles on one page. We use Claude-Sonnet-3.7 to render those pages to HTML, and from that HTML, we extract text segments before/after one another. We manually review each entry for accuracy. We purposely select simple text blocks from coherent regions of the document, and avoid including any math formulas, superscripts, or subscripts in these tests.
- Long Tiny Text (LTT) We crawled documents from the Internet Archive containing a large amount of dense, small print on a single page. Such documents include pages from a dictionary or pages of references from academic papers. We then generate test cases using Gemini-Flash-2.0 and verify them manually.
3.3 Scoring
We run each PDF page through each of our tools and methods to produce a Markdown or plain text document. As all tests are Pass/Fail, we report the percentage of tests passed, macro-averaged by document type: we compute a percentage-correct score for each test source (plus the default baseline tests), and the final score for each tool is the mean of these per-source percentages. Macro-averaging accounts for the uneven number of cases we were able to find and validate per source, while reflecting our view that each source represents an important capability for an OCR system.
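The macro-averaged scoring rule above amounts to a few lines: per-source pass rates are averaged so that each document type weighs equally, regardless of how many test cases it contains.

```python
def macro_score(results: dict[str, list[bool]]) -> float:
    """Macro-averaged pass rate (in percent) over per-source Pass/Fail results."""
    per_source = [100.0 * sum(tests) / len(tests) for tests in results.values()]
    return sum(per_source) / len(per_source)
```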
4 Evaluating olmOCR
First, we evaluate olmOCR on olmOCR-Bench against a range of linearization tools and VLMs (Section §4.1). We then quantify its usefulness for language modeling by continuing pretraining of an OLMo 2 checkpoint 5 on content extracted and linearized with our toolkit (Section §4.2).
Additional evaluations, studying how faithful olmOCR is to its teacher model (Section §C.1) and a pairwise ELO comparison (Section §C.2), are available in the appendix.
4.1 olmOCR-Bench Results
From Table 4, we see that olmOCR significantly outperforms the best commercial dedicated OCR tool (Mistral OCR), as well as both GPT-4o, its teacher model, and Qwen 2.5 VL, the update to Qwen 2 VL, which was the base model for olmOCR-7B-0225-preview. We note that we developed olmOCR-Bench after training olmOCR-7B-0225-preview, to avoid unfairly iterating on the benchmark before comparing with other methods. Qualitatively, olmOCR produces significantly cleaner plain text than specialized open-source tools (visualized in Appendix G).
| Model | AR | OSM | TA | OS | HF | MC | LTT | Base | Overall |
|---|---|---|---|---|---|---|---|---|---|
| GOT OCR | 52.7 | 52.0 | 0.2 | 22.1 | 93.6 | 42.0 | 29.9 | 94.0 | 48.3 ± 1.1 |
| Marker v1.7.5 | 76.0 | 57.9 | 57.6 | 27.8 | 84.9 | 72.9 | 84.6 | 99.1 | 70.1 ± 1.1 |
| MinerU v1.3.10 | 75.4 | 47.4 | 60.9 | 17.3 | 96.6 | 59.0 | 39.1 | 96.6 | 61.5 ± 1.1 |
| Mistral OCR API | 77.2 | 67.5 | 60.6 | 29.3 | 93.6 | 71.3 | 77.1 | 99.4 | 72.0 ± 1.1 |
| GPT-4o (No Anchor) | 51.5 | 75.5 | 69.1 | 40.9 | 94.2 | 68.9 | 54.1 | 96.7 | 68.9 ± 1.1 |
| GPT-4o (Anchored) | 53.5 | 74.5 | 70.0 | 40.7 | 93.8 | 69.3 | 60.6 | 96.8 | 69.9 ± 1.1 |
| Gemini Flash 2 (No Anchor) | 32.1 | 56.3 | 61.4 | 27.8 | 48.0 | 58.7 | 84.4 | 94.0 | 57.8 ± 1.1 |
| Gemini Flash 2 (Anchored) | 54.5 | 56.1 | 72.1 | 34.2 | 64.7 | 61.5 | 71.5 | 95.6 | 63.8 ± 1.2 |
| Qwen 2 VL (No Anchor) | 19.7 | 31.7 | 24.2 | 17.1 | 88.9 | 8.3 | 6.8 | 55.5 | 31.5 ± 0.9 |
| Qwen 2.5 VL (No Anchor) | 63.1 | 65.7 | 67.3 | 38.6 | 73.6 | 68.3 | 49.1 | 98.3 | 65.5 ± 1.2 |
| Ours (v0.1.75 No Anchor) | 71.5 | 71.4 | 71.4 | 42.8 | 94.1 | 77.7 | 71.0 | 97.8 | 74.7 ± 1.1 |
| Ours (v0.1.75 Anchored) | 74.9 | 71.2 | 71.0 | 42.2 | 94.5 | 78.3 | 73.3 | 98.3 | 75.5 ± 1.0 |
Table 4: Evaluation results on olmOCR-Bench grouped by document types. Best unit test pass rate in each column is bold. 95% CI calculated by bootstrapping with 10k samples.
4.2 Downstream Evaluation
To assess the impact of improved PDF linearization, we experiment using an intermediate checkpoint of OLMo-2-1124-7B and continued pretraining using content extracted from a fixed collection of PDFs but with different linearization tools. This ablation procedure has been used to assess data quality in 30 31 5.
For our baseline, we use PDF-extracted tokens from peS2o 23: 58B tokens from academic papers derived using Grobid 13 from the S2ORC 32 paper collection and further cleaned with heuristics for language modeling. To represent olmOCR, we identify the same documents used in peS2o, acquire their source PDFs from the upstream S2ORC pipeline, and reprocess them with olmOCR. For each of these two versions of peS2o, we train the 7B checkpoint for another 50B tokens. As shown in Table 5, replacing the original peS2o tokens extracted via Grobid + rules with those processed by olmOCR results in a +1.3 percentage point average improvement on widely-reported LM benchmark tasks, including MMLU 33, ARC C, DROP 34, HellaSwag 35, NaturalQuestions 36, and WinoGrande 37.
| peS2o version | Average | MMLU | ARC C | DROP | HSwag | NQ | WinoG |
|---|---|---|---|---|---|---|---|
| peS2o (Soldaini and Lo, 2023) | 53.9 | 61.1 | 75.0 | 42.3 | 57.4 | 29.4 | 58.3 |
| olmOCR-peS2o | 55.2 | 61.1 | 76.4 | 43.7 | 62.6 | 29.1 | 58.0 |
Table 5: Comparison on OLMo 2 5 downstream evaluation tasks of OLMo-2-7B-1124 on 50B of original peS2o tokens vs 50B tokens from the same source PDFs but processed with olmOCR.
4.3 Cost Evaluation
Finally, when considering real-world use, cost efficiency is just as important as performance. We present a summary of inference costs in Table 6. To contextualize the value of olmOCR: at 1,000 tokens per page, processing all of the peS2o PDFs can already cost $10.3M in H100 usage. In comparison, Mistral OCR, a commercial API specializing in this task, is over five times more expensive than olmOCR, making it even more prohibitive to use for language modeling. See Appendix B for details on pricing and cost calculations.
| Model | Hardware | Tokens/sec | Pages/USD | Cost per million pages |
|---|---|---|---|---|
| GPT-4o | API | - | 80 | $12,480 |
| GPT-4o (Batch) | API | - | 160 | $6,240 |
| Marker v1.7.5 (Force OCR) | H100 | 332 | 674 | $1,484 |
| Mistral OCR | API | - | 1,000 | $1,000 |
| MinerU | L40S | 238 | 1,678 | $596 |
| Gemini Flash 2 | API | - | 2,004 | $499 |
| Gemini Flash 2 (Batch) | API | - | 4,008 | $249 |
| olmOCR | L40S | 906 | 5,697 | $176 |
| olmOCR | H100 | 3,050 | 5,632 | $178 |
Table 6: Inference cost comparison against other OCR methods. NVIDIA L40S estimated at $2.69 per hour. We measured a 12% retry rate for olmOCR. Full cost breakdown in Appendix B.
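The GPU-based costs in Table 6 follow a simple throughput calculation, sketched below. The constants here (tokens per page, hourly rate, retry rate) are illustrative assumptions; the table's exact figures also depend on measured per-page token counts and retry handling, so these helpers will not reproduce it exactly.

```python
def pages_per_usd(tokens_per_sec: float, usd_per_hour: float,
                  tokens_per_page: float = 1000.0, retry_rate: float = 0.0) -> float:
    """Pages processed per dollar of GPU time, discounted for retried pages."""
    pages_per_hour = tokens_per_sec * 3600.0 / tokens_per_page
    effective = pages_per_hour / (1.0 + retry_rate)  # retried pages consume extra throughput
    return effective / usd_per_hour

def cost_per_million_pages(pages_per_dollar: float) -> float:
    return 1_000_000.0 / pages_per_dollar
```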
5 Related Work
Tools and Models for Linearizing PDFs to Plain Text.
Many tools exist for this task: some parse the internals of born-digital PDFs, while others run OCR over image rasters of PDF pages. As machine learning matured, models were developed to automate PDF parsing; examples include LayoutLM 38 and VILA 39. Toolkits have been built around these models, such as PaperMage 40, or have been updated to include custom trained models, like Grobid 13. Commercial API providers began integrating VLMs with document processing capabilities: OpenAI introduced GPT-4 Vision 41 in September 2023, and Google launched Gemini 42 in December 2023, with significant enhancements throughout 2024 as multimodal models became more powerful and accessible. Despite these developments, there remained a notable absence of efforts that specifically train a VLM for this task and package the capability into a comprehensive, production-ready software library. Our work addresses this gap; we developed olmOCR alongside concurrent efforts such as Mistral OCR 43 and Qwen VL 44, which we systematically evaluate against.
Benchmarking VLMs on Linearization.
Several benchmarks have been developed for evaluating document understanding and content extraction. Established datasets such as FUNSD 45 focus on form understanding with typewritten content, SROIE 46 concentrates on information extraction from scanned receipts, and RVL-CDIP 47 contains scanned documents. However, these datasets exhibit significant limitations: they are predominantly domain-specific, targeting narrow document categories with constrained formatting variations, while our approach leverages a diverse corpus spanning multiple domains and document types. Additionally, traditional benchmarks often focus on isolated extraction tasks (e.g., exclusively evaluating tables with PubTabNet 48, or mathematical formulas with specialized detection frameworks 49), whereas our benchmark evaluates performance across a comprehensive spectrum of extraction challenges. Further, their evaluation methods are brittle, typically relying on exact string matching against predefined gold-standard tokens, which makes it difficult to compare methods that produce different tokenizations. In contrast, our unit-test-style evaluation framework enables equitable assessment across diverse implementations regardless of their underlying tokenization, providing a more generalizable evaluation paradigm for document understanding systems.
Linearization for Language Modeling.
Significant work has been done on how to curate data for language modeling, with significant efforts on topics like data filtering 1 2 3 and source mixing 50 51. However, relatively little attention has been directed toward understanding the impact of linearization processes on downstream model training. DCLM 3 and RefinedWeb 4 touched on some aspects of this challenge, utilizing tools like Resiliparse 52 and Trafilatura 53, but their approaches were restricted to web based textual content. OpenWebMath 54 showed accurate content extraction is important for specialized domains, like mathematical formulas. For PDF-based content specifically, similar research contributions remain limited, which motivates this work.
6 Conclusion
We introduce olmOCR, an open-source toolkit for converting PDF documents into clean plain text. Our approach combines document-anchoring, a novel prompting technique that leverages the metadata available in born-digital PDFs, with a fine-tuned 7B-parameter vision language model to achieve results competitive with closed commercial solutions at a fraction of the cost. We openly release our training set olmOCR-mix-0225 to enable others to further develop their own VLMs.
To rigorously evaluate the system, we developed olmOCR-Bench, a benchmark of 7,010 test instances across 1,403 PDFs. It includes Pass/Fail unit tests for text presence, reading order, tables, formulas, and baseline functionality. The documents span categories from scientific papers to historical manuscripts, enabling robust assessment across diverse linearization and content extraction challenges.
Our released efficient inference pipeline contains everything needed to start converting anything from single documents to million-page archives of PDFs. We hope olmOCR’s ability to efficiently process millions of documents will help unlock new sources of training data for language models, particularly from high-quality PDF documents that are currently underrepresented in existing datasets that rely solely on crawled web pages.
Acknowledgments
This work would not be possible without the support of our colleagues at Ai2. We thank Byron Bischoff, Aaron Sarnat, Huy Tran, Sam Skjonsberg, Eric Marsh, and Chris Newell for help setting up the live demo; Taira Anderson, Sruthi Sreeram for program management support; Will Smith and Crystal Nam for legal guidance; Michael Schmitz, Caitlin Wittlif and Carissa Schoenick for various indirect support. We also thank Benjamin Charles Germain Lee for helpful feedback and suggestions on evaluation and potential use cases. We are grateful for the extensive feedback provided by Hynek Kydlíček on the inference toolkit.
References
Appendix A Methodology
Approach
Many end-to-end OCR models, such as GOT-OCR 2.0 17 and Nougat 16, rely exclusively on rasterized pages to convert documents to plain text; that is, they process images of the document pages as input to autoregressively decode text tokens. This approach, while offering great compatibility with image-only digitization pipelines, misses the fact that most PDFs are born-digital documents and thus already contain either digitized text or other metadata that would help in correctly linearizing the content.

Figure 3: Example of how document-anchoring works for a typical page. Relevant image locations and text blocks get extracted, concatenated, and inserted into the model prompt. When prompting a VLM for a plain text version of the document, the anchored text is used in conjunction with the rasterized image of a page.
In contrast, the olmOCR pipeline leverages document text and metadata. We call this approach document-anchoring. Figure 3 provides an overview of our method: document-anchoring extracts coordinates of salient elements in each page (e.g., text blocks and images) and injects them alongside raw text extracted from the PDF binary file. Crucially, the anchored text is provided as input to the VLM alongside a rasterized image of the page.
Our approach increases the quality of our content extraction. We apply document-anchoring when prompting GPT-4o to collect silver training samples, when fine-tuning olmOCR-7B-0225-preview, and when performing inference with the toolkit.
Implementation
document-anchoring processes PDF document pages via the pypdf 25 library to extract a representation of the page’s structure from the underlying PDF. All of the text blocks and images in the page are extracted, including position information. Starting with the most relevant text blocks and images 8, these are sampled and added to the prompt of the VLM, up to a defined maximum character limit 9. This extra information is then available to the model when processing the document.
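As a rough illustration, the anchor-assembly step can be sketched as follows. This is not the olmOCR implementation: the block dictionary format, the area-based relevance ranking, and the character budget are simplifying assumptions, and the real pipeline extracts blocks with pypdf.

```python
def build_anchor_text(blocks, max_chars=4000):
    """Greedily add text blocks, with their coordinates, until the
    character budget is exhausted. Relevance is approximated by block
    area here (an assumption, not the olmOCR heuristic)."""
    ranked = sorted(blocks, key=lambda b: b["w"] * b["h"], reverse=True)
    parts, used = [], 0
    for b in ranked:
        line = f"[{b['x']:.0f}x{b['y']:.0f}] {b['text']}"
        if used + len(line) > max_chars:
            continue  # skip blocks that would exceed the budget
        parts.append(line)
        used += len(line)
    return "\n".join(parts)

# Hypothetical blocks as they might come out of a PDF parser.
blocks = [
    {"x": 72, "y": 700, "w": 450, "h": 40, "text": "Title of the Paper"},
    {"x": 72, "y": 520, "w": 450, "h": 160, "text": "Body paragraph..."},
]
anchor = build_anchor_text(blocks, max_chars=200)
```

The resulting anchor string is what gets concatenated into the VLM prompt alongside the page image.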
Overall, we find that prompts constructed with document-anchoring result in significantly fewer hallucinations. Prompting with just the page image was prone to the model completing unfinished sentences, or inventing larger passages of text when the image data was ambiguous. Finally, while document-anchoring helps with quality on born-digital documents, our pipeline maintains high performance on documents that do not have any digital metadata encoded in them. In these cases, the model does not have the benefit of seeing the internal structure of the PDF document, and instead relies on just the rasterized image of the page.
Appendix B Cost Estimates of PDF Extraction Systems
To estimate prices, we use rates provided by RunPod 10 as of February 2025. It prices a single on-demand NVIDIA L40S GPU at 2.69 USD per hour. Using these rates, costs (in USD) were computed as follows:
- GPT-4o: We evaluated GPT-4o in February 2025. We tested 1,288 pages, which resulted in 3,093,315 input tokens and 833,599 output tokens. Priced at $2.50 per million input tokens and $10.00 per million output tokens, with a 50% discount in batch mode, this resulted in a total of $8.03 using batch processing.
- Mistral OCR: As of May 2025, Mistral prices their OCR service at $1 per 1,000 pages, regardless of number of generated tokens.
- MinerU: We ran the toolkit (version 1.3.10) on a single NVIDIA L40S GPU. It processed 1,288 pages in 58 minutes 22 seconds, costing $0.767.
- Marker: We ran marker v1.7.5 using the marker command line with the force_ocr flag on 10,000 pages selected randomly from olmOCR-mix-0225. This took 5 hours, 31 minutes on an H100 node with 1 GPU, resulting in a price of $14.84 for 10,000 pages. 11
- Gemini Flash 2.0: As of February 2025, it is priced at $0.10 per 1 million input tokens and $0.40 per 1 million output tokens. In our testing over the same 1,288 pages used to evaluate GPT-4o, it cost $0.643.
- olmOCR: We tested the launch version of olmOCR on both L40S and H100 GPUs. On the L40S, it processed the 1,288 test pages in 17 minutes, 10 seconds. The effective throughput of the model was 906 output tokens per second, plus a 12% retry rate. Overall, we estimate its cost at $0.229.
Appendix C Evaluation of Trained Models

Figure 4: Validation Loss - Web PDFs
We track validation loss during fine-tuning of olmOCR-7B-0225-preview against a development subset of olmOCR-mix-0225; Figure 4 and Figure 5 show the loss curves for the web PDFs and the Internet Archive books subsets, respectively. LoRA resulted in higher loss values compared to full fine-tuning, which we use for the final model.

Figure 6: Example of the side-by-side evaluation tool used during development. The software used to create these comparisons is released as open-source software as part of olmOCR.
To set hyperparameters and make other decisions during development, we relied on manual side-by-side evaluation as shown in Figure 6. A random selection of 20 to 50 documents was processed using two different methods and displayed side by side along with a render of the document page. We also open-source our evaluation tool to support qualitative inspection of this visually rich data.
C.1 Alignment with Teacher Model
To compare the output of olmOCR-7B-0225-preview to the GPT-4o silver data in olmOCR-mix-0225, we build a document similarity metric that splits a document into words, uses Hirschberg's algorithm to align those words, and counts the proportion that match.
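A minimal stand-in for this metric can be written with the standard library, using difflib's sequence matcher in place of Hirschberg's algorithm (which computes the same optimal alignment in linear space); the normalization by the longer document is our assumption.

```python
import difflib

def alignment_score(doc_a: str, doc_b: str) -> float:
    """Split both documents into words, align them globally, and return
    the fraction of words that match."""
    a, b = doc_a.split(), doc_b.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    total = max(len(a), len(b))
    return matched / total if total else 1.0

alignment_score("the quick brown fox", "the quick brown fox")  # 1.0
alignment_score("the quick brown fox", "the slow brown fox")   # 0.75
```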
We report alignment scores in Table 7. Overall, we find that olmOCR-7B-0225-preview has good alignment with its teacher model, 0.875 on average. To calibrate this result, we also report a GPT-4o self-alignment score of 0.954, obtained by simply calling the model again; the imperfect alignment here is due to resampling differences. In fact, we find that our model mimics the content extraction and linearization of GPT-4o better than its smaller counterpart GPT-4o mini does.
When partitioning scores into low, medium, and high alignment buckets (Table 8), we find that most documents parsed with olmOCR have medium to high alignment with GPT-4o. Increasing temperature unsurprisingly leads to a wider distribution of alignment scores, as shown by the increase in low matches for olmOCR at temperature 0.8.
| Model | Temperature | Alignment |
|---|---|---|
| GPT-4o (self-alignment) | 0.1 | 0.954 |
| GPT-4o mini | 0.1 | 0.833 |
| olmOCR-7B-0225-preview | 0.8 | 0.859 |
| olmOCR-7B-0225-preview | 0.1 | 0.875 |
Table 7: Page-weighted alignment between GPT-4o, GPT-4o mini, and our fine-tuned model. We find that olmOCR-7B-0225-preview is more consistent with respect to its teacher than GPT-4o mini. Note that GPT-4o does not achieve a perfect alignment against itself due to the probabilistic nature of autoregressive decoding.
| Name | Low match | Medium match | High match |
|---|---|---|---|
| GPT-4o (self alignment) | 38 | 218 | 965 |
| GPT-4o mini | 214 | 478 | 529 |
| olmOCR (temp. 0.1) | 158 | 363 | 700 |
| olmOCR (temp. 0.8) | 195 | 390 | 636 |
Table 8: Alignment-bucket counts for olmOCR and other models compared to the olmOCR-mix-0225 dataset. Low match indicates <70% alignment, Medium match is 70-95% alignment, High match is >95% alignment.
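The bucketing in the caption can be written down directly; this small sketch (function and variable names are ours) shows how a list of alignment scores would be partitioned into the three buckets.

```python
from collections import Counter

def bucket(score: float) -> str:
    """Thresholds from Table 8: low < 70%, medium 70-95%, high > 95%."""
    if score < 0.70:
        return "low"
    if score <= 0.95:
        return "medium"
    return "high"

# Hypothetical per-document alignment scores.
counts = Counter(bucket(s) for s in [0.55, 0.875, 0.93, 0.96, 0.99])
```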
C.2 Intrinsic Human Evaluation
Experimental setup
To compare against other common OCR methods, we collected pairwise human judgments of the plain text produced by olmOCR and the three top ML-based PDF linearization tools—Marker, MinerU, and GOT-OCR 2.0—and calculated ELO ratings.
To create our evaluation set, we sampled 2,017 new PDFs from the same distribution used to create olmOCR-mix-0225 and ran each PDF through olmOCR and the linearization tools mentioned above. All other linearization tools were installed from either PyPI or GitHub according to their publicly available instructions as of January 14th, 2025. GOT-OCR 2.0 was configured in ‘format’ mode, but otherwise all comparisons were done against default settings.
We then sampled 2,000 comparison pairs (same PDF, different tool). We asked 11 data researchers and engineers at Ai2 to assess which output was the higher quality representation of the original PDF, focusing on reading order, comprehensiveness of content and representation of structured information. The user interface used is similar to that in Figure 6. Exact participant instructions are listed in Appendix C.3.
Evaluation results

Figure 7: ELO ranking of vs other popular PDF content extraction tools.
We collected a total of 452 judgments where a participant expressed a preference between two models (the remaining 1,548 pairs were either skipped for being too similar, or marked as invalid). On average, this is 75 judgments per pair of tools. We calculate ELO ratings starting from a base of 1500 and report the average of 100 simulations to avoid ordering effects in ELO calculations; for 95% confidence intervals, we use bootstrapping with 5000 resamples.
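A minimal sketch of this Elo procedure: every tool starts at the base rating of 1500, judgments are applied as standard sequential Elo updates, and the final rating is averaged over many random shuffles of the judgment order. The K-factor and the toy judgment data are assumptions; the bootstrap confidence intervals are omitted here.

```python
import random
from collections import defaultdict

def elo_ratings(judgments, k=32, base=1500, n_sims=100, seed=0):
    """judgments: list of (winner, loser) pairs. Returns average ratings
    over n_sims random orderings to wash out ordering effects."""
    rng = random.Random(seed)
    totals = defaultdict(float)
    for _ in range(n_sims):
        ratings = defaultdict(lambda: base)
        order = judgments[:]
        rng.shuffle(order)
        for winner, loser in order:
            expected = 1 / (1 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
            delta = k * (1 - expected)  # symmetric update conserves total rating
            ratings[winner] += delta
            ratings[loser] -= delta
        for tool, r in ratings.items():
            totals[tool] += r
    return {tool: r / n_sims for tool, r in totals.items()}

# Toy judgments, not the actual study data.
ratings = elo_ratings([("olmocr", "marker")] * 10 + [("marker", "mineru")] * 5)
```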
We visualize our results in Figure 7. olmOCR achieves an ELO score over 1800, far exceeding all other PDF linearization tools.
C.3 ELO Evaluation Instructions
In Section C.2, we asked participants to compare the output of various common OCR tools against olmOCR. Participants were given the instructions below and presented with a document page and the output of two random tools. They could then select which output was better, or select ‘Both Good’, ‘Both Bad’, or ‘Invalid PDF’, any of which excluded the comparison from the ELO ranking.
Instructions to participants
Compare the text in the two fields, and select which one better represents the contents of the document.
REMINDER: This is not about “the most faithful OCR”, but “this OCR output seems really useful for training LMs”
- Does the text capture all of the meaningful content in the document in a natural order?
- Are the words correct (no weird incorrect words or split words)?
- Is the whitespace sensible?
- Do the tables/equations look okay?
There is not a strict preference between Markdown and LaTeX; most importantly, you should evaluate the text content, not which method was used to format it.
If you are not sure, or the document is in a language other than English, you can skip that entry, or mark “both good”, “both bad”, or “invalid pdf”.
ELO data
We compute pairwise win/loss statistics between models to estimate relative performance under head-to-head comparisons. As shown in Table 9, olmOCR consistently outperforms other models such as Marker, GOTOCR, and MinerU, with the highest win rate of 71.4% against MinerU.
Table 9: Pairwise Win/Loss Statistics Between Models
| Model Pair | Wins | Win Rate (%) |
|---|---|---|
| olmOCR vs. Marker | 49/31 | 61.3 |
| olmOCR vs. GOTOCR | 41/29 | 58.6 |
| olmOCR vs. MinerU | 55/22 | 71.4 |
| Marker vs. MinerU | 53/26 | 67.1 |
| Marker vs. GOTOCR | 45/26 | 63.4 |
| GOTOCR vs. MinerU | 38/37 | 50.7 |
| Total | 452 |
Appendix D Deploying
D.1 Inference Pipeline
To efficiently convert millions of documents, we developed the olmOCR pipeline using SGLang 21 as the inference engine. The pipeline batches documents into work items of around 500 pages each. Each work item is then queued to run on a worker with access to a GPU for inference. Optionally, workers can coordinate using a shared cloud bucket 12, allowing batch jobs to scale from single nodes to hundreds of nodes without the need for complicated queue management.
We summarize our efforts by comparing the operational costs of olmOCR against other API and local models in Table 6. Overall, we find olmOCR to be significantly more efficient than other pipelines. It is over 32 times cheaper than GPT-4o in batch mode; compared to other purpose-built pipelines and models, olmOCR is over 6 times cheaper than MinerU and a fraction of the cost of Marker.
To balance maintaining high GPU utilization while also ensuring work items are completed quickly, each worker queues up inference for all PDF pages in a work item simultaneously, and then waits until the SGLang server has no more pending requests before proceeding to another work item in the queue.
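The per-work-item concurrency can be sketched with asyncio: all pages in an item are submitted at once, and the worker moves on only after every request resolves. `run_inference` here is a placeholder stand-in for the real HTTP call to the SGLang server.

```python
import asyncio

async def run_inference(page_id: int) -> str:
    # Stands in for an asynchronous request to the inference server.
    await asyncio.sleep(0)
    return f"text for page {page_id}"

async def process_work_item(page_ids):
    # Queue every page simultaneously to keep GPU utilization high,
    # then wait for the whole work item to finish.
    outputs = await asyncio.gather(*(run_inference(p) for p in page_ids))
    return dict(zip(page_ids, outputs))

results = asyncio.run(process_work_item(range(3)))
```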
D.2 Increasing Robustness
We implement several heuristics to improve reliability of without compromising its throughput.
Prompt format
During inference, we use the same abbreviated prompt described in Section 2.3. This keeps the test-time examples looking the same as what the model was trained on. If the additional tokens generated by document-anchoring cause the overall prompt to exceed 8,192 tokens, we regenerate the document-anchoring tokens with exponentially lower character limits until the overall prompt is of acceptable length.
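The fallback loop can be sketched as follows. The starting character limit and the crude 4-characters-per-token estimate in `count_tokens` are illustrative assumptions standing in for the real tokenizer.

```python
def count_tokens(text: str) -> int:
    return len(text) // 4  # rough stand-in for a real tokenizer

def fit_prompt(build_prompt, start_limit=6000, max_tokens=8192):
    """Rebuild the anchored prompt with exponentially smaller character
    limits until it fits within the token budget."""
    limit = start_limit
    while limit > 0:
        prompt = build_prompt(limit)
        if count_tokens(prompt) <= max_tokens:
            return prompt
        limit //= 2  # exponentially lower the anchor character limit
    return build_prompt(0)  # worst case: no anchor text at all

# Toy prompt builder: a fixed page image payload plus `limit` anchor chars.
prompt = fit_prompt(lambda limit: "x" * (30000 + limit))
```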
Retries
Unlike when we created olmOCR-mix-0225, we do not enforce a specific JSON schema during inference on our fine-tuned model. This is for two reasons: first, we find that open source tools designed to force decode a sequence into a particular schema are unreliable, and that enforcing a schema which is even slightly off from what the model expects can cause generations to go out-of-domain or collapse into repetitions. Second, and most importantly, we note that, since the model was extensively fine-tuned on the structured output, it reliably adheres to the required schema without constraints. For the rare cases when JSON parsing fails, we simply retry generating from the same input sequence.
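The retry logic amounts to a small loop: parse the raw generation as JSON, and resample from the same input on failure. `generate` is a placeholder for the real model call, and the retry count is an assumption.

```python
import json

def parse_with_retries(generate, max_retries=3):
    """Call the model up to max_retries times, returning the first output
    that parses as valid JSON, or None if all attempts fail."""
    for _ in range(max_retries):
        raw = generate()
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            continue  # resample from the same input sequence
    return None

# Toy generator: first output is truncated, second is valid.
outputs = iter(['{"natural_text": truncated', '{"natural_text": "Hello"}'])
parsed = parse_with_retries(lambda: next(outputs))
```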
Rotations
The output JSON schema includes fields for is_rotation_valid and rotation_correction. During inference, the pipeline reads these two fields; if is_rotation_valid is set to false, it rotates the page by the amount specified in rotation_correction and reprocesses it.
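In code, the rotation check reduces to a small predicate over the parsed response; this sketch (the function name is ours) returns the clockwise correction to apply, or None when the page is already readable.

```python
def needs_reprocess(response: dict):
    """Return the clockwise rotation (in degrees) to apply before
    reprocessing, or None if the page orientation is already valid."""
    if not response.get("is_rotation_valid", True):
        return response.get("rotation_correction", 0)
    return None

needs_reprocess({"is_rotation_valid": False, "rotation_correction": 90})  # 90
needs_reprocess({"is_rotation_valid": True, "rotation_correction": 0})    # None
```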
Decoding
In developing olmOCR, the most common failure we experienced is outputs degenerating into endless repetitions of the same token, line, or paragraph. This failure is caught automatically when the model's output either exceeds the maximum context length or does not validate against our JSON schema. We find that increasing the generation temperature from 0.1 up to 0.8 reduces the likelihood of repetitions occurring. Further, we modify the pipeline to reprocess failed pages up to N times, falling back to a plain text-based PDF extraction if the pipeline repeatedly fails. This last mitigation is aided by the fact that document-anchoring randomly samples which anchors to include in the prompt; thus, resampling can sometimes help the page process correctly by removing potentially problematic meta tokens.
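A simple early repetition detector of the kind envisioned for future work could flag a generation whose recent output is dominated by a single repeating line, rather than waiting for the context limit. The window size and threshold below are illustrative assumptions, not tuned values.

```python
def looks_degenerate(text: str, window: int = 20, max_share: float = 0.5) -> bool:
    """Flag the generation when one line accounts for more than max_share
    of the last `window` non-empty lines."""
    lines = [l for l in text.splitlines() if l.strip()]
    tail = lines[-window:]
    if len(tail) < window:
        return False  # not enough output yet to judge
    most_common = max(tail.count(l) for l in set(tail))
    return most_common / len(tail) > max_share

looks_degenerate("A normal paragraph.\n" + "Same line again.\n" * 30)  # True
```

Such a check could run periodically on the partial generation and abort the request early, saving both throughput and SGLang memory.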
We note that one limitation of this approach is that, if retries occur often, the total generation throughput could be significantly reduced. Further, letting generations repeat up to the maximum sequence length uses significant memory within SGLang. In future work, we plan to detect repeated generations sooner than at the maximum context length limit, and abort promptly.
Appendix E olmOCR-mix-0225 and olmOCR-7B-0225-preview Prompts
E.1 olmOCR-mix-0225 construction prompt for GPT-4o
The prompt below was used to create the silver dataset, which we refer to as olmOCR-mix-0225 throughout the paper. This dataset consists of structured outputs generated by GPT-4o, using images of PDF pages along with additional layout-aware textual features produced by our document-anchoring pipeline. We use this synthetic data to fine-tune our model.
In this prompt, the placeholder {base_text} is replaced with the structured layout-aware text extracted from the PDF using document-anchoring. The prompt instructs GPT-4o to output the natural reading-order text of the page, while respecting document semantics, suppressing hallucinations, and formatting content like equations and tables appropriately.
Below is the image of one page of a PDF document, as well as some raw textual content that was previously extracted for it that includes position information for each image and block of text (The origin [0x0] of the coordinates is in the lower left corner of the image).
Just return the plain text representation of this document as if you were reading it naturally.
Turn equations into a LaTeX representation, and tables into markdown format. Remove the headers and footers, but keep references and footnotes.
Read any natural handwriting.
This is likely one page out of several in the document, so be sure to preserve any sentences that come from the previous page, or continue onto the next page, exactly as they are.
If there is no text at all that you think you should read, you can output null.
Do not hallucinate.
RAW_TEXT_START
{base_text}
RAW_TEXT_END
JSON Schema used to prompt GPT-4o
"json_schema": {
  "name": "page_response",
  "schema": {
    "type": "object",
    "properties": {
      "primary_language": {
        "type": ["string", "null"],
        "description": "The primary language of the text using two-letter codes or null if there is no text at all that you think you should read.",
      },
      "is_rotation_valid": {
        "type": "boolean",
        "description": "Is this page oriented correctly for reading? Answer only considering the textual content, do not factor in the rotation of any charts, tables, drawings, or figures.",
      },
      "rotation_correction": {
        "type": "integer",
        "description": "Indicates the degree of clockwise rotation needed if the page is not oriented correctly.",
        "enum": [0, 90, 180, 270],
        "default": 0,
      },
      "is_table": {
        "type": "boolean",
        "description": "Indicates if the majority of the page content is in tabular format.",
      },
      "is_diagram": {
        "type": "boolean",
        "description": "Indicates if the majority of the page content is a visual diagram.",
      },
      "natural_text": {
        "type": ["string", "null"],
        "description": "The natural text content extracted from the page.",
      },
    },
    "additionalProperties": False,
    "required": [
      "primary_language",
      "is_rotation_valid",
      "rotation_correction",
      "is_table",
      "is_diagram",
      "natural_text",
    ],
  },
  "strict": True,
},
E.2 olmOCR-7B-0225-preview prompt
The prompt below is used to draw responses from our fine-tuned model during inference. As before, the placeholder {base_text} is replaced with the output of the document-anchoring pipeline, i.e., layout-aware textual features extracted from the PDF page.
Below is the image of one page of a document, as well as some raw textual content that was previously extracted for it.
Just return the plain text representation of this document as if you were reading it naturally.
Do not hallucinate.
RAW_TEXT_START
{base_text}
RAW_TEXT_END
E.3 olmOCR-mix-0225 Classification Prompt
The prompt and structured schema below were used to classify a sample of documents from olmOCR-mix-0225, as reported in Table 2.
This is an image of a document page, please classify it into one of the following categories that best overall summarizes its nature: academic, legal, brochure, slideshow, table, diagram, or other. Also determine the primary language of the document and your confidence in the classification (0-1).
class DocumentCategory(str, Enum):
    ACADEMIC = "academic"
    LEGAL = "legal"
    BROCHURE = "brochure"
    SLIDESHOW = "slideshow"
    TABLE = "table"
    DIAGRAM = "diagram"
    OTHER = "other"

class DocumentClassification(BaseModel):
    category: DocumentCategory
    language: str
    confidence: float
E.4 olmOCR-mix-0225 PII Prompt
We implemented a comprehensive prompt for detecting personally identifiable information (PII) in documents while cleaning olmOCR-mix-0225:
You are a document analyzer that identifies Personally Identifiable Information
(PII) in documents.
Your task is to analyze the provided document image and determine:
1. Whether the document is intended for public release or dissemination
(e.g., research paper, public report, etc.)
2. If the document contains any PII
IDENTIFIERS FOR PII:
The following are considered identifiers that can make information PII:
- Names (full names, first names, last names, nicknames)
- Email addresses
- Phone numbers
PII THAT MUST CO-OCCUR WITH AN IDENTIFIER:
The following types of information should ONLY be marked as PII if they occur
ALONGSIDE an identifier (commonly, a person’s name):
- Addresses (street address, postal code, etc.)
- Biographical Information (date of birth, place of birth, gender, sexual
orientation, race, ethnicity, citizenship/immigration status, religion)
- Location Information (geolocations, specific coordinates)
- Employment Information (job titles, workplace names, employment history)
- Education Information (school names, degrees, transcripts)
- Medical Information (health records, diagnoses, genetic or neural data)
PII THAT OCCURS EVEN WITHOUT AN IDENTIFIER:
The following should ALWAYS be marked as PII even if they do not occur
alongside an identifier:
- Government ID Numbers (Social Security numbers, tax IDs)
- Financial Information (credit card numbers, bank account/routing numbers)
- Biometric Data (fingerprints, retina scans, facial recognition data,
voice signatures)
- Login Information (only if username, password, and
location are present together)
If the document is a form, then only consider fields which are filled out
with specific values as potential PII.
If this page does not itself contain PII, but references documents
(such as curriculum vitae, personal statements) that typically contain PII,
then do not mark it as PII.
Only consider actual occurrences of the PII within the document shown.
Appendix F Further details of olmOCR-Bench
F.1 Data sources
See Table 10 for further details about the PDFs selected for each olmOCR-Bench category, their sources, and the extraction methods used.
Table 10: Document source category breakdown of olmOCR-Bench
| Category | PDFs | Tests | Source | Extraction Method |
|---|---|---|---|---|
| arXiv_math | 522 | 2,927 | arXiv | Dynamic programming alignment |
| old_scans_math | 36 | 458 | Internet Archive | Script-generated + manual rules |
| tables_tests | 188 | 1,020 | Internal repository | gemini-flash-2.0 |
| old_scans | 98 | 526 | Library of Congress | Manual rules |
| headers_footers | 266 | 753 | Internal repository | DocLayout-YOLO + gemini-flash-2.0 |
| multi_column | 231 | 884 | Internal repository | claude-sonnet-3.7 + HTML rendering |
| long_tiny_text | 62 | 442 | Internet Archive | gemini-flash-2.0 |
| Total | 1,403 | 7,010 | Multiple sources | |
F.2 Prompting Strategies and Implementation Details
This section provides comprehensive documentation of the prompting techniques and design strategies used to build olmOCR-Bench. These prompting approaches were critical for generating test cases with LLMs while ensuring consistency across document categories.
F.2.1 Mathematical Expressions
For generating mathematical expression test cases from old scans, we employed direct prompts focused on precision. This concise prompt architecture proved effective in extracting LaTeX representations while minimizing hallucination. The explicit instruction to use standard LaTeX delimiters ($$) ensured consistent formatting across olmOCR-Bench.
Please extract the mathematical equations from the document without
omission. Always output the mathematical equations as Latex escaped
with $$. Do not hallucinate.
F.2.2 Multi-column
For multi-column documents, we utilized a two-stage prompting strategy. The initial analytical stage established structural context:
Analyze this document and provide a detailed assessment of its structure.
Focus on the layout, headings, footers, and any complex formatting.
Please be precise.
This preliminary analysis was incorporated into a subsequent HTML rendering prompt:
Render this document as clean, semantic HTML. Here is the analysis of the
document structure:
{analysis_text}
Requirements:
1. Use appropriate HTML tags for headings, paragraphs, and lists.
2. Use