How-To Guide

How to Extract Data from a Pathology PDF

Manual copy-paste garbles the data. OCR tools miss the table structure. Here is why pathology PDFs are hard to parse — and how SmarterBlood solves it automatically.

The Quick Answer

Pathology PDFs are notoriously difficult to extract data from manually. Multi-column table layouts scramble when you copy-paste. Scanned PDFs return no text at all. Generic OCR tools do not understand the structure of a pathology report. SmarterBlood uses four models (Claude Sonnet, Claude Haiku, Gemini Flash, and Google Vision OCR) that each read your PDF independently and then vote on every value. The result is structured, labelled, charted data in your dashboard without any manual work.

~99% accuracy on clean PDFs
Scanned PDFs supported
Four-model consensus
No manual data entry

Why Pathology PDFs Are Hard to Extract

If you have ever tried to copy data from a blood test PDF into a spreadsheet, you know the result: a jumbled mess of numbers and names that bears no resemblance to the original table. There are six structural reasons why.

1. Multi-column table layouts

Pathology results are formatted as tables with marker names in one column, values in another, and reference ranges in a third. Copy-paste from a PDF typically reads the page left-to-right then top-to-bottom, ignoring the table grid, so cells from different columns end up interleaved and the pasted text is almost impossible to parse.

2. Image-based (scanned) PDFs

Many labs, particularly older ones and hospital systems, produce PDFs that are simply images with no underlying text. Copy-paste produces nothing. Standard OCR tools like Tesseract struggle with the multi-column pathology table format.

3. Lab-specific layouts

Every pathology company uses a different report template. 4Cyte, Sullivan Nicolaides, Dorevitch, Laverty, and hospital labs all look different. Tabula (an open-source PDF extraction tool) requires manual column boundary configuration for each layout.

4. Faxed reports

Many GP practices still receive results by fax, and the faxed pages are then scanned and saved as PDFs. Fax compression introduces noise, diagonal lines, and skew that break standard OCR.

5. Multi-date tables

Some labs include results from multiple dates in a single table (e.g. columns for Jan, Mar, Jun). Generic extraction tools read the columns as separate documents, mixing dates and values.

6. Non-standard formatting

Handwritten annotations, stamps, watermarks ("REVISED REPORT"), and multi-page documents with tables split across pages all cause problems for generic OCR and copy-paste workflows.

What Gets Extracted from Your PDF

SmarterBlood extracts eight data fields from every blood test result. All are stored in your account and used to build your trend dashboard.

Data Field | Example | Notes
Marker name | Haemoglobin, TSH, Ferritin, HbA1c | Normalised to a canonical name using a 1,038-alias database. "FERRITIN", "Serum Ferritin", and "Fe STORES" all map to the same marker.
Numeric value | 136, 2.4, 45.0, 5.8% | Extracted with physiological validation. Values outside plausible biological limits are flagged for review and re-extracted.
Unit | g/L, mIU/L, mcg/L, mmol/mol | Detected from the report and stored. Units are normalised across labs and years so trend graphs use consistent units.
Reference range | 115-155, 0.4-4.0, 12-300 | The lab's own reference range is extracted and stored alongside Australian population norms for context.
H/L flags | H (high), L (low), * (abnormal) | Abnormal flags from the original report are preserved. SmarterBlood also independently computes status against Australian reference ranges.
Test date | 15 March 2024, 15/03/2024 | Date format is auto-detected (Australian DD/MM or international MM/DD) using lab-specific disambiguation rules.
Lab name | 4Cyte, Sullivan Nicolaides, Dorevitch | Extracted where present in the header. Used to contextualise lab-specific reference range differences.
Document type | Blood test, urine test, thyroid panel | Classified automatically before extraction. Non-blood-test documents (imaging, referrals) are stored separately without extraction.

How the Extraction Pipeline Works

1. Document classification

Before extraction begins, the document is classified. Is this a blood test? Urine test? Imaging report? Referral letter? Only blood test documents are passed to the extraction pipeline. This prevents wasted processing and keeps your dashboard clean.

2. Four-model parallel extraction

Claude Sonnet (vision), Claude Haiku (vision), Gemini 2.5 Flash (vision), and Google Vision OCR each independently read your PDF. Because the four models run in parallel, the whole step takes roughly as long as the slowest single model, yet it produces a four-way check on every value.
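As a rough illustration of the fan-out, here is a minimal Python sketch. The four `read_with_*` functions are hypothetical stubs that return hard-coded values. They stand in for real vendor API calls and are not SmarterBlood's actual code.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the four readers. Real versions would send the
# PDF to each vendor's API; these just return fixed marker -> value dicts.
def read_with_sonnet(pdf_bytes):
    return {"Ferritin": 45.0, "TSH": 2.4}

def read_with_haiku(pdf_bytes):
    return {"Ferritin": 45.0, "TSH": 2.4}

def read_with_gemini(pdf_bytes):
    return {"Ferritin": 45.0, "TSH": 2.1}

def read_with_vision_ocr(pdf_bytes):
    return {"Ferritin": 45.0, "TSH": 2.4}

EXTRACTORS = {
    "claude_sonnet": read_with_sonnet,
    "claude_haiku": read_with_haiku,
    "gemini_flash": read_with_gemini,
    "google_vision_ocr": read_with_vision_ocr,
}

def extract_in_parallel(pdf_bytes):
    """Run every extractor on the same PDF at once; return results by model."""
    with ThreadPoolExecutor(max_workers=len(EXTRACTORS)) as pool:
        futures = {
            name: pool.submit(reader, pdf_bytes)
            for name, reader in EXTRACTORS.items()
        }
        # .result() blocks until that reader finishes, so total wall time is
        # governed by the slowest reader rather than the sum of all four.
        return {name: future.result() for name, future in futures.items()}
```

Threads suit this kind of work because each reader spends most of its time waiting on a network response.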

3. Majority-vote consensus

For each marker, the four results are compared. If all four agree, the value is accepted at HIGH confidence. A clear majority (three of four) gives MEDIUM confidence. A two-two tie or a four-way disagreement sends the marker to a deeper investigation step in the secondary validation queue.
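The voting rule can be sketched in a few lines of Python. This is an illustration of the rule as described, not the production code. The function name `vote` and the "REVIEW" label are made up for the example, and treating a 2-1-1 split as a review case is an assumption, since only ties and all-disagree cases are spelled out above.

```python
from collections import Counter

def vote(readings):
    """Combine four readings of one marker into (value, confidence).

    readings: a list of four values, one per model.
    """
    if len(readings) != 4:
        raise ValueError("expected exactly four readings")
    value, count = Counter(readings).most_common(1)[0]
    if count == 4:
        return value, "HIGH"      # unanimous
    if count == 3:
        return value, "MEDIUM"    # clear majority
    # Two-two tie, four-way disagreement, or (by assumption) a 2-1-1 split.
    return None, "REVIEW"
```

For example, `vote([136, 136, 136, 138])` returns `(136, "MEDIUM")`, while `vote([136, 136, 138, 138])` returns `(None, "REVIEW")`.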

4. Physiological validation

Each accepted value is checked against known physiological limits. A haemoglobin of 136,000 g/L (a common OCR error where the decimal was missed) is automatically corrected to 136 g/L and flagged in the audit trail.
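A plausibility check of this kind might look like the sketch below. The limits in `PLAUSIBLE_LIMITS` are illustrative placeholders, not clinical reference data. The power-of-ten rescaling is one assumed way to recover from a misplaced decimal, not a description of the actual correction logic.

```python
# Illustrative plausibility limits only; not clinical reference data.
PLAUSIBLE_LIMITS = {
    "Haemoglobin": (20.0, 250.0),  # g/L
    "TSH": (0.0, 500.0),           # mIU/L
}

def validate(marker, value):
    """Return (value, status), where status is 'ok', 'corrected' or 'needs_review'."""
    low, high = PLAUSIBLE_LIMITS[marker]
    if low <= value <= high:
        return value, "ok"
    # Assumed recovery strategy: try shifting the decimal point by one to
    # three places in either direction and accept the first plausible value.
    for factor in (10, 100, 1000):
        for candidate in (value / factor, value * factor):
            if low <= candidate <= high:
                return candidate, "corrected"
    return value, "needs_review"
```

With these placeholder limits, `validate("Haemoglobin", 136000)` returns `(136.0, "corrected")`. In a real pipeline the original value would also be written to the audit trail.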

5. Unit normalisation

Units are detected and normalised. If your 2010 cholesterol was reported in mg/dL and your 2024 result is in mmol/L, both are converted to mmol/L so your trend graph is consistent.
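For the cholesterol example, the conversion is a single division. The sketch below uses the standard factor of 38.67 mg/dL per mmol/L, which applies to cholesterol only; other markers need their own factors. The function name is hypothetical.

```python
# 1 mmol/L of cholesterol is about 38.67 mg/dL. This factor is specific to
# cholesterol and must not be reused for other markers.
CHOLESTEROL_MG_DL_PER_MMOL_L = 38.67

def cholesterol_to_mmol_l(value, unit):
    """Return a cholesterol value in mmol/L, rounded to two decimals."""
    unit = unit.strip().lower()
    if unit == "mmol/l":
        return value
    if unit == "mg/dl":
        return round(value / CHOLESTEROL_MG_DL_PER_MMOL_L, 2)
    raise ValueError(f"unsupported cholesterol unit: {unit!r}")
```

A 2010 result of 200 mg/dL becomes `cholesterol_to_mmol_l(200, "mg/dL")`, which is 5.17 mmol/L, and can then sit on the same axis as a 2024 result of 5.2 mmol/L.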

6. Marker name normalisation

The extracted marker name is matched against a 1,038-alias database. "FERRITIN", "Serum Ferritin", "FERRITIN (SERUM)", and "Iron Stores" all map to a single canonical "Ferritin" entry in your dashboard.
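In its simplest form, alias matching is a dictionary lookup on a cleaned-up key. The sketch below contains four aliases rather than 1,038, and the cleaning rule (lowercase, punctuation stripped) is an assumption made for the example.

```python
import re

# A tiny illustrative slice of an alias table: cleaned alias -> canonical name.
ALIASES = {
    "ferritin": "Ferritin",
    "serum ferritin": "Ferritin",
    "ferritin serum": "Ferritin",
    "iron stores": "Ferritin",
}

def _clean(raw_name):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", raw_name.lower()).split())

def canonical_name(raw_name):
    """Return the canonical marker name, or None if the alias is unknown."""
    return ALIASES.get(_clean(raw_name))
```

Here `canonical_name("FERRITIN (SERUM)")` and `canonical_name("Iron Stores")` both return `"Ferritin"`, and an unrecognised name returns `None` so it can be surfaced rather than silently dropped.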

DIY Alternatives and Their Limits

If you prefer a do-it-yourself approach or need CSV output, here are the main alternatives and their practical constraints.

Tool | Approach | Main Limits | Best For
Manual copy-paste | Select text in PDF viewer, paste into spreadsheet | Garbled column order on all multi-column layouts. Zero output on image PDFs. | Single simple results with 1-2 values (very rare in practice).
Adobe Acrobat Export | Export PDF to Excel via Adobe Acrobat | Table structure often collapses. Image PDFs need OCR upgrade ($). Requires paid subscription. | Simple, single-column PDFs with no tables.
Tabula | Open-source tool to extract PDF tables by drawing bounding boxes | Manual configuration per lab layout. No OCR (image PDFs fail). Outputs raw CSV needing extensive cleanup. | Developers who want a DIY pipeline and can invest hours on configuration.
ChatGPT / Claude (manual paste) | Copy text, paste into AI, ask it to structure the data | Requires manual copy-paste (fails on image PDFs). No persistent storage, no trends, no history. Re-entry every visit. | One-off quick interpretation if you can copy the text cleanly.
Google Vision / AWS Textract | Developer API for OCR extraction | Raw OCR output needs significant post-processing to parse pathology table format. Requires coding skill. No UI. | Developers building their own pipeline who need high-quality OCR as a starting point.
SmarterBlood | Upload PDF; AI extracts, structures, stores, and charts automatically | No CSV export yet (roadmap). Requires an account. | Patients who want their full blood test history structured, charted, and accessible without any technical setup.
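If you go the DIY route, most of the work is the post-processing step. The Python sketch below parses one already-clean text line into fields. It assumes a row layout of name, value, optional flag, unit, then range (for example "Haemoglobin 136 g/L 115-155"). Real reports vary widely, so treat the pattern as a starting point rather than a general parser.

```python
import re

# Assumed row layout: <name> <value> [H|L|*] <unit> <low>-<high>
ROW = re.compile(
    r"^(?P<name>[A-Za-z][A-Za-z0-9 ()/-]*?)\s+"
    r"(?P<value>\d+(?:\.\d+)?)"
    r"(?:\s+(?P<flag>[HL*]))?"
    r"\s+(?P<unit>\S+)"
    r"\s+(?P<low>\d+(?:\.\d+)?)\s*-\s*(?P<high>\d+(?:\.\d+)?)$"
)

def parse_row(line):
    """Parse one result line into a dict, or return None if it does not fit."""
    match = ROW.match(line.strip())
    if match is None:
        return None
    return {
        "name": match.group("name"),
        "value": float(match.group("value")),
        "flag": match.group("flag"),  # None when the row has no flag
        "unit": match.group("unit"),
        "ref_low": float(match.group("low")),
        "ref_high": float(match.group("high")),
    }
```

Lines that return `None` are the ones worth inspecting by hand: they usually mean a scrambled column order or a layout the pattern does not cover.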

Accuracy Caveats

Extraction accuracy is ~99% on clean PDFs, not 100%

On clean digital PDFs from major Australian labs, the four-model consensus approach achieves approximately 99% accuracy. That means roughly 1 in 100 values may need checking. Always review your results in SmarterBlood against your original PDF for high-stakes decisions.

Very degraded faxes may miss some values

A heavily degraded fax with missing sections, severe skew, or faded thermal-paper print may yield only some of its markers. The original report always takes precedence over extracted data.

Marker names vary dramatically between labs

Despite our 1,038-alias database, edge cases exist. If a marker you expect does not appear, it may have been extracted under a slightly different name. Use the search function or contact support.

Dates in unusual formats may be misread

Some hospital systems use formats like "15MAR24" or "2024.03.15" that are outside the standard detection rules. If a result appears at an odd point on your timeline, check the date in the original PDF.
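To show why dates are tricky, here is a small Python sketch that handles the formats mentioned on this page. It is not SmarterBlood's detection logic. The `day_first` default stands in for a lab-specific rule, and reading two-digit years as 20xx is an assumption.

```python
import re
from datetime import date

MONTHS = {
    name: number
    for number, name in enumerate(
        ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
         "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"],
        start=1,
    )
}

def parse_report_date(text, day_first=True):
    """Parse a report date, or return None if the format is not recognised."""
    text = text.strip()

    # Compact hospital style, e.g. 15MAR24 (two-digit year assumed to be 20xx).
    m = re.fullmatch(r"(\d{1,2})([A-Za-z]{3})(\d{2})", text)
    if m and m.group(2).upper() in MONTHS:
        return date(2000 + int(m.group(3)), MONTHS[m.group(2).upper()], int(m.group(1)))

    # Year-first style, e.g. 2024.03.15
    m = re.fullmatch(r"(\d{4})[./-](\d{1,2})[./-](\d{1,2})", text)
    if m:
        return date(int(m.group(1)), int(m.group(2)), int(m.group(3)))

    # Slash style: DD/MM/YYYY or MM/DD/YYYY
    m = re.fullmatch(r"(\d{1,2})/(\d{1,2})/(\d{4})", text)
    if m:
        first, second, year = (int(g) for g in m.groups())
        if first > 12:        # can only be a day
            day, month = first, second
        elif second > 12:     # can only be a day
            day, month = second, first
        else:                 # ambiguous: fall back to the lab's convention
            day, month = (first, second) if day_first else (second, first)
        return date(year, month, day)

    return None
```

The ambiguous branch is where misplaced points on a timeline tend to come from: "03/04/2024" is 3 April under `day_first=True` and 4 March otherwise.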


Upload Your Pathology PDF

SmarterBlood extracts every marker, value, and unit from your blood test PDF automatically. No copy-pasting, no spreadsheets. Free and private.

This page explains how SmarterBlood extracts data from pathology PDFs. Extraction accuracy is high but not perfect — always verify extracted values against your original pathology report for clinical decisions. SmarterBlood does not provide medical advice, diagnosis, or treatment.