
πŸ“Š macos-vision vs Tesseract β€” A 50-File OCR-to-Markdown Benchmark

🍎 Apple Vision vs πŸ“œ Tesseract on 50 PDFs, both routed through the same local LLM formatter, both feeding the formatter real layout coordinates. Result: on this corpus, Tesseract has lower CER (19.2% vs 26.5%), Apple Vision has higher LLM-judged structural quality (20% pass vs 12%), and the engines come out closer than any single small sample would suggest.


🎯 What We’re Measuring (And Why This Run Differs From the Last One)

This is the second version of this benchmark. The first version, posted earlier this week with 10 PDFs and a naΓ―ve Tesseract setup, concluded that Apple Vision dominated 8 / 10 to 0 / 10. That conclusion turned out to be premature.

The original setup gave Apple Vision a structural advantage that wasn’t really about OCR β€” it was about what each engine handed to the downstream formatter:

Tesseract is not a flat-line OCR engine. Its TSV (or hOCR / ALTO) output exposes word-level bounding boxes plus a five-level hierarchy: page β†’ block β†’ paragraph β†’ line β†’ word. For an apples-to-apples comparison, the Tesseract side has to use this hierarchy. This benchmark does that: every word’s coordinates flow into the same ParagraphGroup structure that VisionScribe builds for Apple Vision, and the formatter receives identically shaped <ocr_source> input from both engines. The only thing that varies is which engine produced the underlying text and coordinates.
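
As a reference point, here is a minimal sketch of that shared intermediate structure, assuming only the field names described in this post (paragraphId, y, lines[]); the real type lives in the macos-vision source tree:

// Shared paragraph shape both OCR branches are mapped into.
interface ParagraphGroup {
  paragraphId: number;  // stable paragraph ID within the page
  y: number;            // vertical position of the paragraph, normalized to [0, 1]
  lines: string[];      // the paragraph's text, one entry per OCR line
}

Apple Vision fills this from VisionScribe’s block / paragraph output; the Tesseract branch fills it from TSV word rows grouped by (block_num, par_num).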

The dataset was also bumped from 10 to 50 single-page PDFs to dampen the run-to-run variance of the LLM judge.


βš™οΈ The Pipeline (v2)

                         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
        πŸ“„ PDF ─────────►│  🍎 Apple Vision OCR           │──► blocks + paragraph IDs + (x,y) ──┐
                         β”‚  (macos-vision/VisionScribe)   β”‚                                       β”‚
                         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                                       β”‚
                                                                                                  β–Ό
                                                                                β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                                                                                β”‚  πŸ¦™ Ollama (mistral-nemo)     β”‚
                                                                                β”‚  ParagraphGroup β†’ buildUser   │──► πŸ“ Markdown
                                                                                β”‚  same SYSTEM_PROMPT, temp=0   β”‚       β”‚
                                                                                β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜       β”‚
                                                                                                  β–²                    β”‚
                         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                                       β”‚                    β”‚
        πŸ“„ PDF ─────────►│  πŸ“œ Tesseract 5.5.2            │──► TSV: words + bbox + par_num β”€β”€β”€β”€β”€β”€β”€β”˜                    β”‚
                         β”‚  pdftoppm 300 DPI, --psm 1 tsv β”‚                                                            β”‚
                         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                                                            β”‚
                                                                                                                       β–Ό
                                                                                                          πŸ§‘β€βš–οΈ Claude Haiku 4.5 judge
                                                                                                              CER + 1–10 structure score

Both branches converge at the same Ollama call with the same system prompt and the same buildUserContent() wrapping. The only asymmetry remaining is the underlying OCR β€” what each engine β€œsaw” on the page and how it segmented the result. That asymmetry is the thing being measured.
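
In code, the convergence point is a single shared call. Here is a sketch using the signature quoted in the methodology below; chat, SYSTEM_PROMPT, and buildUserContent are imported live from the macos-vision source tree (module paths omitted), and formatWithOllama is a hypothetical wrapper added for illustration, not a function in the repo:

// The only per-branch difference is which ParagraphGroup[] gets passed in.
async function formatWithOllama(
  paragraphs: ParagraphGroup[],  // from Apple Vision OR from Tesseract TSV
  ollamaOpts: { model: string; temperature: number; top_p: number },
): Promise<string> {
  return chat(ollamaOpts, SYSTEM_PROMPT, buildUserContent(paragraphs));
}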


πŸ§ͺ Methodology

Dataset: 50 single-page PDFs from opendataloader-bench: 01030000000001.pdf … 01030000000050.pdf. Academic / archival pages β€” body text, headings, footnotes, figure captions, page numbers.

Ground truth: Hand-curated Markdown shipped with the benchmark, one file per PDF.

Vision OCR: macos-vision@1.3.2 (latest on npm) via VisionScribe.toMarkdown(). The Apple Vision framework returns text blocks with bounding boxes and reading-order metadata; macos-vision groups them into ParagraphGroup objects (paragraphId, y, lines[]).

Tesseract OCR: tesseract 5.5.2 (Leptonica 1.87.0) with --psm 1 tsv on PNGs rasterized at 300 DPI by pdftoppm from Poppler 26.04.0. The TSV output is parsed into the same ParagraphGroup shape: words are grouped by (block_num, par_num), and paragraph y-coordinates are computed as avg(top) / page_height, normalized to [0, 1].

Formatter: Ollama 0.23.3 running mistral-nemo:latest (digest e7e06d107c6c, 7.1 GB Q4_0 GGUF), temperature 0, top_p 1. Both engines call chat(ollamaOpts, SYSTEM_PROMPT, buildUserContent(paragraphs)) β€” same SYSTEM_PROMPT and buildUserContent imported live from the macos-vision repo.

CER: Levenshtein(prediction, ground truth) Γ· len(ground truth), clamped to [0, 1]. Lower = better. (Sketched in code below.)

LLM judge: Claude claude-haiku-4-5-20251001 via @anthropic-ai/sdk. Scores 1–10 on text accuracy + structure + completeness. Pass threshold: β‰₯ 8. Same prompt as eval/metrics.ts. Each prediction judged once (see Caveats; sketch below).

Runtime: Node 24.14.1, tsx for live import of TypeScript modules from the macos-vision source tree, macOS (Darwin 25.4) on Apple Silicon.
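
The CER metric is small enough to sketch in full. This assumes only the clamped-Levenshtein definition from the table above; the harness itself imports computeCER from the repo rather than reimplementing it:

// Edit distance between two strings (insert / delete / substitute, unit cost).
function levenshtein(a: string, b: string): number {
  const m = a.length, n = b.length;
  const d = Array.from({ length: m + 1 }, (_, i) =>
    Array.from({ length: n + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0)),
  );
  for (let i = 1; i <= m; i++) {
    for (let j = 1; j <= n; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      d[i][j] = Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost);
    }
  }
  return d[m][n];
}

// CER = Levenshtein(prediction, ground truth) / len(ground truth), clamped to [0, 1].
function cer(prediction: string, groundTruth: string): number {
  if (groundTruth.length === 0) return prediction.length === 0 ? 0 : 1;
  return Math.min(1, levenshtein(prediction, groundTruth) / groundTruth.length);
}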

The Vision-side numbers come from running VisionScribe.toMarkdown() over the 50 PDFs in a single sweep, not from the macos-vision repo’s previously-cached predictions. Both engines’ predictions were generated fresh in this benchmark run, so any Ollama drift between sessions is absorbed identically on both sides.
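
Scoring then happens once per prediction via the Messages API. A minimal sketch of that judge call follows, with a hypothetical stand-in for the rubric prompt (the real one lives in eval/metrics.ts and is imported by the harness):

import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();  // reads ANTHROPIC_API_KEY from the environment

// Returns the 1–10 score; the pass threshold (>= 8) is applied by the caller.
async function judge(prediction: string, groundTruth: string): Promise<number> {
  const rubric =
    "Score the prediction 1-10 on text accuracy, structure, and completeness. " +
    "Reply with only the integer score.\n\n" +
    `GROUND TRUTH:\n${groundTruth}\n\nPREDICTION:\n${prediction}`;

  const msg = await anthropic.messages.create({
    model: "claude-haiku-4-5-20251001",
    max_tokens: 16,
    messages: [{ role: "user", content: rubric }],
  });

  const block = msg.content[0];
  return block.type === "text" ? parseInt(block.text.trim(), 10) : NaN;
}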


πŸ† Headline Results

Metric                     🍎 macos-vision    πŸ“œ Tesseract    Winner
──────────────────────────────────────────────────────────────────────
Avg CER (lower better)     26.5%              19.2%           πŸ“œ by 7.3 pt
Median CER                 19.3%              13.4%           πŸ“œ by 5.9 pt
Avg LLM score (1–10)       4.90               4.74            🍎 by 0.16
Median LLM score           4                  4               tied
Pass rate (LLM β‰₯ 8)        10 / 50 (20%)      6 / 50 (12%)    🍎
Head-to-head LLM wins      21                 14              🍎
Head-to-head CER wins      13                 24              πŸ“œ
Files with CER > 50%       13                 6               πŸ“œ
Mean latency / file        25.3 s             29.3 s          🍎
Empty outputs              0                  0               tied

This is a very different table from the v1 result. Three things stand out.

Tesseract reads characters more reliably across this corpus. Lower mean CER, lower median CER, more head-to-head wins on CER, and fewer catastrophic OCR failures (CER > 50%). On 24 of the 50 files Tesseract returned closer-to-ground-truth text than Apple Vision did.

Apple Vision still has the structural edge β€” but it’s a small one. Average LLM score is essentially tied (4.90 vs 4.74), but the pass-rate distinction (20% vs 12%) shows that when Apple Vision does work, it’s more likely to produce a document the judge will accept outright. Tesseract bunches around scores of 3–6; Apple Vision is more bimodal β€” it nails it or fails it.

Apple Vision is faster. Native API access is real. The Apple Vision OCR step is essentially zero-overhead from Node, while Tesseract requires raster + OCR + TSV parse before the same Ollama call.


πŸ“ˆ Distributions

LLM score distribution

score:  2   3   4   5   6   7   8   9  10
─────────────────────────────────────────
🍎      6  16   8   0   7   3   1   8   1
πŸ“œ      5  16   7   2  10   4   1   5   0

CER distribution

bucket:   <5%   5–10%  10–20%  20–30%  30–50%  >50%
────────────────────────────────────────────────────
🍎         16     7      3       3       8      13
πŸ“œ         12     9     16       1       6       6

The CER buckets tell the bimodality story: Apple Vision piles up at the extremes (16 files under 5%, 13 files over 50%), while Tesseract clusters in the 5–20% range.


πŸ” Notable Cases

Case 1 β€” Where Apple Vision dominates: prose with a clear title (file 24)

Ground truth is straightforward narrative β€” title β€œAt Home in Exile”, then plain body paragraphs about a trip to WrocΕ‚aw.

🍎 macos-vision (LLM 9, CER 2.1%):

# At Home in Exile

We received the clear impression from grim customs officials and money-changers at the border that we had entered a part of the world still not entirely recovered from post-War economic depression. […]

Anthea's navigation skills took us promptly to the clean and pleasant Tumski Hotel on the Sand Island near the oldest part of Wroclaw. […]

πŸ“œ Tesseract (LLM 3, CER 4.7%):

# At Home in Exile

## We received the clear impression...

- ...from grim customs officials and money-changers at the border…
- Roadside stands sold plaster garden statues, especially gnomes…

## Anthea's navigation skills took us promptly...

- ...to the clean and pleasant Tumski Hotel on the Sand Island…

Both engines read the page cleanly β€” CER is 2.1% for Vision, 4.7% for Tesseract. But the formatter, fed the Tesseract paragraph blocks, decided every paragraph’s opening clause was a heading and the rest of the paragraph was a bulleted continuation. The judge’s verdict: β€œthe prediction fundamentally misrepresents the structure by converting a prose narrative into an artificial heading-and-bullet-point format that doesn’t exist in the original.” This is a layout-inference failure even with bounding boxes present β€” Tesseract’s paragraph grouping put first lines at one y and continuations at another, and the LLM ran with that signal.

Case 2 β€” Where Tesseract dominates: stylized title typography (file 12)

Ground truth is a textbook page about Aladdin pantomime prints, with two figure captions and body prose between them.

🍎 macos-vision (LLM 2, CER 76.4%):

# 96 MACDONALD

## Taking Maddin mile Henderful Jamp?

FIGURE 5.1 Mr. Bologna Jun-r as Kalim Azack in Aladdin,
or FIGURE 5.2 Mr. Grimaldi as Kazrac (the Chinese slave) in
The Wonderful Lamp. Aladdin, or The Wonderful Lamp.

πŸ“œ Tesseract (LLM 6, CER 17.6%):

# Figure 5.1 Mr. Bologna Jun-r as Kalim Azack in Aladdin, or The Wonderful Lamp.

theatrical prints, which are informed by intercul-turation and illustrate the Orientalized look of the tale's theatrical life: […]

# Figure 5.2 Mr. Grimaldi as Kazrac (the Chinese slave) in Aladdin, or The Wonderful Lamp.

This is the failure mode that drives Apple Vision’s bimodal CER. The stylized β€œAladdin, or The Wonderful Lamp” title β€” large fancy display type β€” defeated the Apple Vision recognizer, which read it as β€œTaking Maddin mile Henderful Jamp?”. With a 76% CER on the page, no downstream pipeline can recover. Tesseract’s traditional pattern-matching handled the stylized type fine, kept the figure captions intact, and produced a document the judge gave a 6.

A similar pattern shows up on files 08 (Vision CER 56.8%, Tesseract 38.6%), 13 (74.9% vs 37.9%), 37 (56.9% vs 53.6%), and 42 (51.6% vs 15.7%). When Apple Vision misreads the page at the recognition level, Tesseract’s more uniform performance becomes a meaningful advantage.

Case 3 β€” Where both struggle: a table-of-contents page (file 16)

Ground truth is a flat list of chapter titles with page numbers.

Both engines restructured the list as a hierarchy with headings and bullet points; both lost the page numbers. Apple Vision LLM 4, Tesseract LLM 6. The judge: β€œpage numbers are entirely missing in the prediction” (Vision) and β€œloses page numbers entirely, omits numbered items” (Tesseract). The Tesseract version was judged slightly more readable, but neither passes.

This is a class of document (TOCs, indexes, tabular content) where the formatter’s strong prose bias works against both engines uniformly.


πŸ’‘ Why The Gap Shrank

Three things explain the difference between this 50-file fair-play run and the original 10-file unfair-play run.

Coordinates are necessary, not sufficient. v1 measured β€œVision + coordinates vs Tesseract + no coordinates.” That isn’t an OCR comparison; it’s a comparison of how much spatial information the formatter receives. Giving both engines the same shape of structured input β€” paragraph IDs and y-hints β€” closes that gap directly. Tesseract pass rate went up (0 β†’ 12%), Vision pass rate fell (80 β†’ 20%), and a 1.84Γ— LLM-score gap collapsed to 1.03Γ—.

Vision’s recognition has a brittle long tail. Apple Vision is excellent on body text and clean typography but has visible weaknesses on stylized titles, italics, and unusual layouts (a quarter of the corpus shows CER > 50%, which means the recognizer essentially gave up). Tesseract’s pattern matching is less sensitive to display typography and produces a more uniform error profile. CER-the-metric reflects this.

LLM-judge scores have meaningful variance. Single-run scores on the same prediction can differ by Β±2 points depending on which sentence the judge zeroes in on. A 10-file sample with such variance can swing aggregate pass rate by 30 percentage points. The 50-file aggregates here are much more stable, but the headline numbers should be read as point estimates with a meaningful confidence band, not exact measurements.

The honest read on the data: on this corpus, with this formatter, with this judge, the engines are within noise on overall quality. Apple Vision is faster, slightly more likely to produce a passing document, and produces more β€œperfect” outputs when it works. Tesseract is more consistent, reads characters more reliably, and has fewer catastrophic failures. Either is defensible depending on what you optimize for.


⚠️ Caveats

Each prediction was judged once. Single-run scores from the LLM judge can move by a couple of points on the same prediction, so the pass rates, head-to-head counts, and average scores above are point estimates with a meaningful confidence band, not exact measurements. The corpus is 50 single-page academic / archival pages, and both branches share the same mistral-nemo formatter, whose prose bias penalizes TOC-like and tabular pages for both engines. Read these results as specific to this corpus, this formatter, and this judge, not as a general verdict on either OCR engine.


⏱️ Latency

                 🍎 Apple Vision         πŸ“œ Tesseract
─────────────────────────────────────────────────────────────────────────────────
Mean per file    25.3 s                  29.3 s
Median           26.7 s                  30.4 s
p95              39.7 s                  45.0 s
Cost driver      Ollama format (~95%)    Ollama format (~85%) + rasterize/OCR/parse (~15%)

Apple Vision’s native OCR is essentially free from the caller’s perspective β€” the bottleneck is the same Ollama call on both sides. Tesseract adds a real but small overhead from the pdftoppm raster step (~0.7 s) and TSV parsing. On a per-page basis the gap is about four seconds. At scale (10,000 pages), that’s ~11 hours of difference β€” meaningful but not dramatic. Both pipelines are dominated by the LLM formatter, and any optimization effort is best spent there.


πŸ” Reproducibility

The Vision-side numbers can be reproduced with vanilla macos-vision:

cd macos-vision
npm install
npm run eval:setup
npm run eval -- --limit 50
npm run eval:report

For Tesseract, the sketch is:

brew install tesseract poppler          # tesseract 5.x + pdftoppm
ollama pull mistral-nemo                # same formatter macos-vision uses
export ANTHROPIC_API_KEY=sk-ant-...     # claude-haiku-4-5 as judge

for i in $(seq -w 1 50); do                # 01030000000001 … 01030000000050
  stem="010300000000$i"
  pdftoppm -r 300 -png "bench/pdfs/$stem.pdf" "images/$stem"
  tesseract "images/$stem-1.png" "tsv/$stem" -l eng --psm 1 tsv
  # parse TSV β†’ ParagraphGroup β†’ buildUserContent β†’ ollama chat
done

The TSV-to-ParagraphGroup step is the critical part: words are grouped by (block_num, par_num), lines by line_num, and y-coordinates normalized by page_height to match the [0, 1] shape that buildUserContent expects. After that, the call signature into Ollama is identical to what VisionScribe.toMarkdown() does internally.
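
A sketch of that parsing step, assuming the ParagraphGroup shape sketched earlier and Tesseract's standard 12-column TSV layout (level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf, text); the helper name is illustrative, not from the repo:

import { readFileSync } from "node:fs";

function tsvToParagraphs(tsvPath: string): ParagraphGroup[] {
  const rows = readFileSync(tsvPath, "utf8")
    .split("\n")
    .slice(1)                               // drop the header row
    .map((line) => line.split("\t"))
    .filter((cols) => cols.length === 12);

  // Page height comes from the single level-1 (page) row.
  const pageRow = rows.find((cols) => cols[0] === "1");
  const pageHeight = pageRow ? Number(pageRow[9]) : 1;

  // Group level-5 (word) rows by (block_num, par_num), then by line_num.
  const paragraphs = new Map<string, { tops: number[]; lines: Map<string, string[]> }>();
  for (const cols of rows) {
    if (cols[0] !== "5" || !cols[11].trim()) continue;   // keep non-empty words only
    const parKey = `${cols[2]}:${cols[3]}`;
    const par = paragraphs.get(parKey) ?? { tops: [], lines: new Map() };
    par.tops.push(Number(cols[7]));
    const line = par.lines.get(cols[4]) ?? [];
    line.push(cols[11]);
    par.lines.set(cols[4], line);
    paragraphs.set(parKey, par);
  }

  return [...paragraphs.values()].map((par, i) => ({
    paragraphId: i,
    // y is the average word top, normalized by page height to [0, 1].
    y: par.tops.reduce((a, b) => a + b, 0) / par.tops.length / pageHeight,
    lines: [...par.lines.values()].map((words) => words.join(" ")),
  }));
}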

The full benchmark harness (eval set, ground truth, CER and LLM-judge metrics, and report runner) ships in the eval/ directory of woladi/macos-vision.


πŸ“¦ Appendix β€” Selected Per-File Results

Top 5 Apple Vision wins by LLM delta

File              🍎 LLM / CER    πŸ“œ LLM / CER    Ξ”LLM
───────────────────────────────────────────────────────
01030000000024    9 / 2.1%        3 / 4.7%        +6
01030000000010    6 / 35.9%       3 / 28.1%       +3
01030000000020    9 / 1.6%        6 / 2.4%        +3
01030000000025    6 / 2.5%        3 / 1.8%        +3
01030000000023    9 / 1.4%        7 / 1.2%        +2

Top 5 Tesseract wins by LLM delta

File              🍎 LLM / CER    πŸ“œ LLM / CER    Ξ”LLM
───────────────────────────────────────────────────────
01030000000008    2 / 56.8%       6 / 38.6%       βˆ’4
01030000000012    2 / 76.4%       6 / 17.6%       βˆ’4
01030000000013    3 / 74.9%       7 / 37.9%       βˆ’4
01030000000037    3 / 56.9%       6 / 53.6%       βˆ’3
01030000000042    3 / 51.6%       6 / 15.7%       βˆ’3

Aggregates

                    🍎 macos-vision    πŸ“œ Tesseract
─────────────────────────────────────────────────
mean CER                26.5%               19.2%
median CER              19.3%               13.4%
mean LLM score          4.90                4.74
median LLM score        4                   4
pass rate (β‰₯ 8)         10/50 (20%)         6/50 (12%)
head-to-head CER        13 wins             24 wins  (13 ties)
head-to-head LLM        21 wins             14 wins  (15 ties)
catastrophic (CER>50%)  13 files            6 files
mean latency            25.3 s              29.3 s

Full per-file JSON (50 rows each) is available in the scratch directory used to produce this article:

/private/tmp/ocr-bench/reports/report-vision.json
/private/tmp/ocr-bench/reports/report-tesseract.json

These contain {file, cer, llmScore, llmReason, passed} per file, in the same shape as macos-vision’s eval/reports/.


βœ… Setup notes for anyone reproducing: the Tesseract branch in this benchmark imports chat, ping, SYSTEM_PROMPT, buildUserContent, computeCER, and llmJudge directly from the macos-vision source tree β€” no code is duplicated, so the formatter prompt and the scoring math are byte-identical to what npm run eval uses. No modifications were made to the macos-vision or woladi repos to produce these numbers; the benchmark scripts and intermediate predictions are intentionally kept in a scratch directory outside both repos.