
One OCR engine. Three trust models. A practical guide to choosing how much of your document processing stays on your machine.
Most document AI tools work like this:
📄 Your file → ☁️ Cloud API → 🤖 LLM
Your raw file (image, PDF, scan) travels to a third-party server before any processing happens. You're trusting the provider's infrastructure, their logging policy, and every hop in between.
macos-vision-mcp changes the boundary:
📄 Your file → 👁️ Apple Vision (local) → 📝 Extracted text → [your choice what happens next]
The file never leaves your Mac. What you decide to do with the extracted text is where your three options diverge, and where the privacy trade-offs actually live.
Pipeline 1: fully local (OCR and reasoning on-device)

```
📄 File
 │
 ▼
🔍 macos-vision-mcp
   Apple Vision OCR
   (on-device, Neural Engine)
 │
 ▼
🦙 Local LLM (Ollama)
   mistral-nemo / llama3 / etc.
 │
 ▼
✅ Result: nothing left the machine
```
How it works: macos-vision-mcp extracts text locally via Apple Vision. The extracted text is passed directly to a local Ollama model for formatting, summarising, or answering questions. No network requests are made at any stage.
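In code, the local hop is a single HTTP call to Ollama's documented `/api/generate` endpoint on localhost. A minimal sketch, assuming the OCR text is already in hand; `extractedText` and `formatLocally` are illustrative names, not part of macos-vision-mcp's API:

```typescript
// Send locally extracted OCR text to a local Ollama model.
// Assumes Ollama is running on its default port (11434) and that
// `extractedText` holds the output of the OCR step.
async function formatLocally(extractedText: string): Promise<string> {
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "mistral-nemo",
      prompt: `Reformat this OCR output as clean Markdown:\n\n${extractedText}`,
      stream: false, // wait for the full completion instead of streamed chunks
    }),
  });
  const data = await res.json();
  return data.response; // Ollama returns the completion in `response`
  // No request left localhost at any point.
}
```

Because the endpoint is localhost-only, pulling your network cable mid-run changes nothing, which is a cheap way to verify the claim.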
Privacy guarantee: Absolute, within your local system. No file content, no extracted text, no metadata touches any external server. The only thing that leaves the process boundary is the final response rendered in your terminal or Claude Desktop interface.
Performance reality: The macos-vision vs Tesseract benchmark ran exactly this configuration: Apple Vision OCR feeding mistral-nemo on Ollama as the downstream formatter. Mean latency was 25.3 seconds per page, with ~95% of that time spent in Ollama, not in OCR. Apple Vision's native API access is essentially zero-overhead from Node.js; the bottleneck is the local model, not the extraction step.
When to use this: documents where nothing, not even the extracted text, may leave the machine, and tasks (formatting, cleanup, simple extraction) that a mid-size local model handles well.
The honest trade-off: Local models are weaker than frontier cloud models. mistral-nemo handles formatting and simple extraction well; it will struggle with nuanced reasoning, multi-document synthesis, or anything that requires the kind of world knowledge that large frontier models carry. If your task requires Claude-level reasoning, Pipeline 1 is not the right fit, unless you can run a large enough local model.
Pipeline 2: local OCR, cloud reasoning

```
📄 File
 │
 ▼
🔍 macos-vision-mcp
   Apple Vision OCR
   (on-device, Neural Engine)
 │
 ▼
☁️ Cloud LLM
   (Claude / GPT-4 / Gemini)
 │
 ▼
✅ Result: file stayed local, text went to cloud
```
How it works: OCR runs locally. The extracted plain-text representation of the document is sent to a cloud LLM for reasoning, summarisation, or Q&A. The original file (pixels, layout, fonts, embedded metadata) never leaves the machine.
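To make the boundary concrete, here is a sketch of the cloud hop using the Anthropic Messages API (the endpoint, headers, and response shape are its documented form; the model id is a placeholder, and `extractedText` again stands in for the OCR output). Note what the payload contains: text only, never the file bytes.

```typescript
// Pipeline 2: only the extracted text crosses the network boundary.
async function summariseInCloud(extractedText: string): Promise<string> {
  const res = await fetch("https://api.anthropic.com/v1/messages", {
    method: "POST",
    headers: {
      "x-api-key": process.env.ANTHROPIC_API_KEY!,
      "anthropic-version": "2023-06-01",
      "content-type": "application/json",
    },
    body: JSON.stringify({
      model: "claude-sonnet-4-20250514", // placeholder; any frontier model works
      max_tokens: 1024,
      // The payload is plain text. Any PII in that text arrives verbatim.
      messages: [{ role: "user", content: `Summarise:\n\n${extractedText}` }],
    }),
  });
  const data = await res.json();
  return data.content[0].text;
}
```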
Privacy guarantee: Partial but meaningful. This is a significant improvement over uploading the file directly:

- The provider never sees the original pixels, layout, fonts, or embedded metadata.
- Scanner artifacts, watermarks, and stamps stay on your machine.
- Only a plain-text representation of the page crosses the network.

However, the extracted text still contains all PII in cleartext. Names, account numbers, diagnoses, addresses: whatever Apple Vision read off the page arrives at the cloud provider verbatim. If the document is sensitive, the cloud provider's logging and retention policies matter.
When to use this: documents where the concern is the raw artifact (metadata, layout, embedded images) rather than the text itself, and the task needs frontier-model reasoning.
The honest trade-off: The privacy win here is real but often overstated. If a document contains a Social Security Number, that SSN will appear in the text sent to the cloud just as clearly as it appeared in the original PDF. The file format is protected; the information is not. Use this pipeline when you care about the raw file not leaving your machine, not when you need to prevent the content from reaching a provider.
Pipeline 3: local OCR, local pseudonymisation, cloud reasoning

```
📄 File
 │
 ▼
🔍 macos-vision-mcp
   Apple Vision OCR
   (on-device, Neural Engine)
 │
 ▼
🕵️ pseudonym-mcp
   PII → reversible tokens
   [PERSON:1], [SSN:1], [CREDIT_CARD:1]...
   (on-device, regex + optional Ollama NER)
 │
 ▼
☁️ Cloud LLM
   sees tokens, not real values
 │
 ▼
🔓 pseudonym-mcp
   unmask_text()
   tokens → original values
 │
 ▼
✅ Result with real names restored
```
How it works: OCR runs locally. The extracted text is passed through pseudonym-mcp before any cloud call: structured PII (SSNs, card numbers, IBANs, email addresses, phone numbers) is replaced by deterministic tokens via regex, and names and organisations are masked via a local Ollama NER model if available. The cloud LLM reasons over a pseudonymised document. The response is unmasked locally before being shown to you.
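As a toy illustration of the deterministic-token idea (a sketch, not pseudonym-mcp's actual implementation), here is the regex layer of the mask/unmask round-trip. The `vault` map holding token-to-value pairs never leaves the process:

```typescript
// Toy sketch of deterministic, reversible pseudonymisation.
// The real engine covers more PII classes and adds optional NER.
const patterns: Record<string, RegExp> = {
  SSN: /\b\d{3}-\d{2}-\d{4}\b/g,
  EMAIL: /\b[\w.+-]+@[\w-]+\.[\w.]+\b/g,
};

const vault = new Map<string, string>();  // token -> original value (stays local)
const seen = new Map<string, string>();   // value -> token (deterministic reuse)
const counters: Record<string, number> = { SSN: 0, EMAIL: 0 };

function maskText(text: string): string {
  let out = text;
  for (const [label, re] of Object.entries(patterns)) {
    out = out.replace(re, (value) => {
      if (!seen.has(value)) {
        const token = `[${label}:${++counters[label]}]`;
        seen.set(value, token);
        vault.set(token, value);
      }
      return seen.get(value)!;
    });
  }
  return out; // safe to send to a cloud LLM
}

function unmaskText(text: string): string {
  // Restore originals in the LLM's response, locally.
  return text.replace(/\[[A-Z_]+:\d+\]/g, (token) => vault.get(token) ?? token);
}
```

`maskText("Email j.doe@example.com, SSN 123-45-6789")` yields `"Email [EMAIL:1], SSN [SSN:1]"`. Note that an unstructured name like Kowalski passes through untouched here, which is exactly what the optional NER layer is for.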
This is the pipeline described in the Obsidian Vault guide and the OpenClaw messaging guide.
Privacy guarantee: The strongest available when using a cloud LLM. The cloud provider receives a document where identifiable values have been replaced with opaque tokens. It can reason about structure, obligations, dates, patterns, and relationships, but it does not see the real names or numbers involved.
What each layer protects:
| Layer | What stays local |
|---|---|
| macos-vision-mcp | Raw file: pixels, layout, fonts, metadata, embedded artifacts |
| pseudonym-mcp | PII values: names, SSNs, card numbers, IBAN, PESEL, email, phone |
| Cloud LLM | Receives only: pseudonymised text, structural context, document meaning |
When to use this: sensitive documents (contracts, medical records, financial statements) that need frontier-model reasoning without the provider ever seeing the real names and numbers involved.
|   | Pipeline 1 | Pipeline 2 | Pipeline 3 |
|---|---|---|---|
| Raw file reaches cloud | ✅ Never | ✅ Never | ✅ Never |
| Extracted text reaches cloud | ✅ Never | ❌ Yes (cleartext) | ⚠️ Yes (pseudonymised) |
| PII values reach cloud | ✅ Never | ❌ Yes | ✅ Masked |
| LLM reasoning quality | ⚠️ Local model | ✅ Frontier | ✅ Frontier |
| Latency | ~25 s/page | Fast | Fast + small local overhead |
| External dependencies | Ollama only | Cloud API key | Cloud API key + optional Ollama |
| Best for | Maximum privacy | Metadata protection | PII protection + cloud quality |
Regardless of which pipeline you use, macos-vision-mcp enforces one guarantee that no cloud-upload approach can match: the raw document never leaves your machine.
This matters more than it might seem. A scanned medical report is not just its text. It is a TIFF-embedded image, a page geometry, a document structure, potentially a watermark or stamp, and whatever metadata the scanner attached. Sending that to a cloud OCR API hands over the full artifact. Apple Vision reads it on your Neural Engine and returns a structured text representation. That's the extraction boundary, and it's a hard one.
What you choose to do with the extracted text is a separate decision, with separate trade-offs. The three pipelines above are the principal configurations. Most real workflows fall into one of them, or combine them: local LLM for initial triage, cloud LLM for the final reasoning step, with pseudonymisation in between.
There is a non-obvious interaction between OCR quality and privacy that is worth naming explicitly.
The macos-vision vs Tesseract benchmark showed that Apple Vision has a bimodal error distribution on a 50-PDF academic corpus: excellent on clean body text (16/50 files with CER < 5%), but with a brittle tail on stylized display typography (13/50 files with CER > 50%). When OCR fails catastrophically on a page, names and numbers are misread as gibberish.
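For concreteness, CER is the standard character error rate: the Levenshtein edit distance between the OCR output and a reference transcript, divided by the reference length, so a CER above 50% means more than half the characters need correcting. A minimal implementation:

```typescript
// Character Error Rate (CER): edit distance between a reference transcript
// and the OCR output, divided by the reference length.
function cer(reference: string, ocr: string): number {
  const m = reference.length;
  const n = ocr.length;
  // dp[i][j] = edits needed to turn reference[0..i) into ocr[0..j)
  const dp = Array.from({ length: m + 1 }, () => new Array<number>(n + 1).fill(0));
  for (let i = 0; i <= m; i++) dp[i][0] = i;
  for (let j = 0; j <= n; j++) dp[0][j] = j;
  for (let i = 1; i <= m; i++) {
    for (let j = 1; j <= n; j++) {
      const substitution = reference[i - 1] === ocr[j - 1] ? 0 : 1;
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,               // deletion
        dp[i][j - 1] + 1,               // insertion
        dp[i - 1][j - 1] + substitution // substitution or match
      );
    }
  }
  return dp[m][n] / Math.max(m, 1); // 0.05 => 5% CER
}
```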
This affects Pipeline 3 in a specific way: pseudonym-mcp can only mask values it recognises. If Kowalski becomes Kowaloki in the OCR output, the NER model will not flag it as a person name and it will not be tokenised before reaching the cloud. OCR errors create gaps in the pseudonymisation layer that are invisible to the user.
Practical implication: For Pipeline 3 on documents with unusual fonts, handwriting, or stylized layouts, verify OCR quality before treating the pseudonymisation pass as reliable. The benchmarkβs latency data is also relevant here: at 25.3 s/page mean, it is feasible to include a local review step for high-stakes documents without making the workflow prohibitively slow.
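One cheap, local mitigation, sketched below with hypothetical names and a deliberately paranoid pattern list: re-scan the masked text for anything still shaped like structured PII before it leaves the machine. It cannot catch an OCR-mangled name like Kowaloki, but it refuses to ship a digit run the masker missed.

```typescript
// Last line of defence: refuse to send "masked" text that still contains
// anything shaped like structured PII. This cannot catch an OCR-mangled
// name, but it does catch surviving digit runs and addresses.
const residualPII: RegExp[] = [
  /\b\d{3}-\d{2}-\d{4}\b/,        // SSN-shaped
  /\b(?:\d[ -]?){13,19}\b/,       // card-number-shaped digit run
  /\b[\w.+-]+@[\w-]+\.[\w.]+\b/,  // email-shaped
];

function assertMasked(maskedText: string): void {
  for (const re of residualPII) {
    const hit = maskedText.match(re);
    if (hit) {
      throw new Error(`Residual PII-shaped value found: "${hit[0]}"; aborting cloud call`);
    }
  }
}
```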
```bash
# Add both MCP servers to Claude Code
claude mcp add macos-vision-mcp -- npx -y macos-vision-mcp
claude mcp add pseudonym-mcp -- npx -y pseudonym-mcp --engines hybrid

# For Pipeline 1 / Pipeline 3 NER: pull a local model
ollama pull mistral-nemo   # formatter
ollama pull llama3         # NER for pseudonym-mcp
```
For Claude Desktop, add to `~/Library/Application Support/Claude/claude_desktop_config.json`:
```json
{
  "mcpServers": {
    "macos-vision-mcp": {
      "command": "npx",
      "args": ["-y", "macos-vision-mcp"]
    },
    "pseudonym-mcp": {
      "command": "npx",
      "args": ["-y", "pseudonym-mcp", "--engines", "hybrid"]
    }
  }
}
```