How to Extract Data from a Pathology PDF
Manual copy-paste garbles the data. OCR tools miss the table structure. Here is why pathology PDFs are hard to parse — and how SmarterBlood solves it automatically.
The Quick Answer
Pathology PDFs are hard to extract data from by hand. Multi-column table layouts scramble when you copy-paste. Scanned PDFs return no text at all. Generic OCR tools do not understand the structure of a pathology report. SmarterBlood runs four extractors (Claude Sonnet, Claude Haiku, Gemini Flash, and Google Vision OCR) that each read your PDF independently and then vote on every value. The result is structured, labelled, charted data in your dashboard without any manual work.
Why Pathology PDFs Are Hard to Extract
If you have ever tried to copy data from a blood test PDF into a spreadsheet, you know the result: a jumbled mess of numbers and names that bears no resemblance to the original table. There are six structural reasons why.
Multi-column table layouts
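Pathology results are formatted as tables with marker names in one column, values in another, and reference ranges in a third. A PDF does not store that table structure, only text placed at positions on the page. Copy-paste follows the order in which the text was written into the file, which often runs down one column before starting the next, so names, values, and ranges come out separated or interleaved and are very hard to match up again.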
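To make the problem concrete, here is a minimal sketch of how rows can be rebuilt once each word's page position is known. It assumes a PDF library has already supplied the words with x/y coordinates; the `Word` tuple, the `group_into_rows` helper, and the tolerance value are hypothetical names chosen for illustration, not part of any tool mentioned on this page.

```python
from typing import NamedTuple

class Word(NamedTuple):
    text: str
    x: float  # horizontal position of the word's left edge
    y: float  # vertical position of the word on the page

def group_into_rows(words: list[Word], y_tolerance: float = 2.0) -> list[list[str]]:
    """Group words that sit at roughly the same height into one row,
    then order each row from left to right."""
    rows: list[list[Word]] = []
    for word in sorted(words, key=lambda w: w.y):
        # Compare against the first word of the current row (assumed tolerance).
        if rows and abs(rows[-1][0].y - word.y) <= y_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    return [[w.text for w in sorted(row, key=lambda w: w.x)] for row in rows]

# Words in the order a column-by-column text stream might deliver them.
words = [
    Word("Haemoglobin", 50, 100), Word("Ferritin", 50, 115),
    Word("136", 200, 100.5), Word("45", 200, 115.4),
    Word("115-155", 300, 100), Word("12-300", 300, 115),
]

for row in group_into_rows(words):
    print(row)
```

Real reports are harder than this: skewed scans, wrapped marker names, and merged cells all break a fixed tolerance, which is part of why generic tools struggle.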
Image-based (scanned) PDFs
Many labs, particularly older ones and hospital systems, produce PDFs that are simply images with no underlying text. Copy-paste produces nothing. Standard OCR tools like Tesseract struggle with the multi-column pathology table format.
Lab-specific layouts
Every pathology company uses a different report template. 4Cyte, Sullivan Nicolaides, Dorevitch, Laverty, and hospital labs all look different. Tabula (an open-source PDF extraction tool) requires manual column boundary configuration for each layout.
Faxed reports
Many GP practices still receive results by fax, which are then scanned and saved as PDFs. Fax compression introduces noise, diagonal lines, and skew that break standard OCR.
Multi-date tables
Some labs include results from multiple dates in a single table (e.g. columns for Jan, Mar, Jun). Generic extraction tools do not know which column belongs to which date, so values end up attached to the wrong date or merged into a single result.
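As an illustration of what handling a multi-date table involves, the sketch below turns one wide table (one column per date) into one record per marker per date. It assumes the table has already been read correctly into a header row and data rows; the `unpivot` function and the sample values are hypothetical.

```python
def unpivot(header: list[str], rows: list[list[str]]) -> list[dict]:
    """Convert a wide table (marker, then one column per date) into
    one record per marker per date. Empty cells are skipped."""
    dates = header[1:]  # first header cell labels the marker column
    records = []
    for marker, *values in rows:
        for test_date, value in zip(dates, values):
            if value.strip():
                records.append(
                    {"marker": marker, "date": test_date, "value": float(value)}
                )
    return records

header = ["Marker", "2024-01-10", "2024-03-12", "2024-06-18"]
rows = [
    ["Ferritin", "45", "", "62"],   # not measured in March
    ["TSH", "2.4", "2.1", "2.6"],
]

for record in unpivot(header, rows):
    print(record)
```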
Non-standard formatting
Handwritten annotations, stamps, watermarks ("REVISED REPORT"), and multi-page documents with tables split across pages all cause problems for generic OCR and copy-paste workflows.
What Gets Extracted from Your PDF
SmarterBlood extracts eight data fields from every blood test result. All are stored in your account and used to build your trend dashboard.
| Data Field | Example | Notes |
|---|---|---|
| Marker name | Haemoglobin, TSH, Ferritin, HbA1c | Normalised to a canonical name using a 1,038-alias database. "FERRITIN", "Serum Ferritin", and "Fe STORES" all map to the same marker. |
| Numeric value | 136, 2.4, 45.0, 5.8% | Extracted with physiological validation. Values outside plausible biological limits are flagged for review and re-extracted. |
| Unit | g/L, mIU/L, mcg/L, mmol/mol | Detected from the report and stored. Units are normalised across labs and years so trend graphs use consistent units. |
| Reference range | 115-155, 0.4-4.0, 12-300 | The lab's own reference range is extracted and stored alongside Australian population norms for context. |
| H/L flags | H (high), L (low), * (abnormal) | Abnormal flags from the original report are preserved. SmarterBlood also independently computes status against Australian reference ranges. |
| Test date | 15 March 2024, 15/03/2024 | Date format is auto-detected (Australian DD/MM or US-style MM/DD) using lab-specific disambiguation rules. |
| Lab name | 4Cyte, Sullivan Nicolaides, Dorevitch | Extracted where present in the header. Used to contextualise lab-specific reference range differences. |
| Document type | Blood test, urine test, thyroid panel | Classified automatically before extraction. Non-blood-test documents (imaging, referrals) are stored separately without extraction. |
How the Extraction Pipeline Works
Document classification
Before extraction begins, the document is classified. Is this a blood test? Urine test? Imaging report? Referral letter? Only blood test documents are passed to the extraction pipeline. This prevents wasted processing and keeps your dashboard clean.
Four-model parallel extraction
Claude Sonnet (vision), Claude Haiku (vision), Gemini 2.5 Flash (vision), and Google Vision OCR each independently read your PDF. Because the four run in parallel, the step takes roughly as long as the slowest single model rather than four times as long, and it produces a four-way check on every value.
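The timing benefit of running extractors in parallel can be sketched with Python's `asyncio`. This is not SmarterBlood's code: the `extract_with` function is a stand-in that sleeps instead of calling a real API, and the latencies and returned values are made up for illustration.

```python
import asyncio
import time

# Hypothetical per-extractor latencies in seconds; real calls would hit remote APIs.
SIMULATED_LATENCY = {
    "claude-sonnet": 0.20,
    "claude-haiku": 0.10,
    "gemini-flash": 0.15,
    "google-vision-ocr": 0.05,
}

async def extract_with(model: str, pdf_bytes: bytes) -> dict:
    """Stand-in for one extractor reading the PDF (sleeps, returns fake data)."""
    await asyncio.sleep(SIMULATED_LATENCY[model])
    return {"model": model, "markers": {"Haemoglobin": 136.0}}

async def extract_all(pdf_bytes: bytes) -> list[dict]:
    """Start every extractor at once and wait for all results."""
    tasks = [extract_with(model, pdf_bytes) for model in SIMULATED_LATENCY]
    return await asyncio.gather(*tasks)  # results keep the order of `tasks`

def timed_extract(pdf_bytes: bytes) -> tuple[list[dict], float]:
    """Run all extractors and report wall-clock time in seconds."""
    start = time.perf_counter()
    results = asyncio.run(extract_all(pdf_bytes))
    return results, time.perf_counter() - start

if __name__ == "__main__":
    results, elapsed = timed_extract(b"%PDF-")
    print(f"{len(results)} results in {elapsed:.2f}s")
```

With these simulated latencies the total is close to the slowest extractor (0.20 s), not the sum of all four (0.50 s).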
Majority-vote consensus
For each marker, the four results are compared. If all four agree, the value is accepted at HIGH confidence. A clear majority (three of four) gives MEDIUM confidence. A tie, or no agreement at all, sends the value to the secondary validation queue for deeper investigation.
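The voting rule can be sketched in a few lines. The function below is an illustration under stated assumptions rather than the production logic: it expects exactly four readings, treats a missing reading as a non-vote, and uses the hypothetical label `REVIEW` for anything that would go to the secondary validation queue. Treating a 2-1-1 split as `REVIEW` is also an assumption, since only unanimous and three-of-four outcomes are described above.

```python
from collections import Counter
from typing import Optional

def consensus(readings: list[Optional[float]]) -> tuple[Optional[float], str]:
    """Return (value, confidence) for four independent readings of one marker."""
    votes = Counter(r for r in readings if r is not None)
    if not votes:
        return None, "REVIEW"
    value, count = votes.most_common(1)[0]
    if count == 4:
        return value, "HIGH"      # all four extractors agree
    if count == 3:
        return value, "MEDIUM"    # clear majority, one dissenter
    return None, "REVIEW"         # tie, 2-1-1 split, or no agreement

print(consensus([136.0, 136.0, 136.0, 136.0]))
print(consensus([136.0, 136.0, 136.0, 186.0]))
print(consensus([136.0, 136.0, 186.0, 186.0]))
```

Exact equality works here because the readings are already numbers; comparing raw strings such as "136" and "136.0" would need normalising first.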
Physiological validation
Each accepted value is checked against known physiological limits. A haemoglobin of 136,000 g/L (a common OCR error where the decimal was missed) is automatically corrected to 136 g/L and flagged in the audit trail.
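A plausibility check of this kind might look like the sketch below. The limits are illustrative placeholders, not clinical values and not SmarterBlood's actual thresholds, and the power-of-ten correction rule is an assumed strategy: a value is corrected only when exactly one scaled candidate falls inside the limits, otherwise it is flagged.

```python
# Illustrative plausibility limits only; NOT clinical reference data.
PLAUSIBLE_LIMITS = {
    "Haemoglobin": (20.0, 250.0),  # g/L
}

def validate(marker: str, value: float) -> tuple[float, str]:
    """Return (value, status) where status is 'ok', 'corrected', or 'flagged'."""
    low, high = PLAUSIBLE_LIMITS[marker]
    if low <= value <= high:
        return value, "ok"
    # Try dividing by 10, 100, 1000, 10000 to undo a misplaced separator.
    candidates = [
        value / 10**k for k in range(1, 5) if low <= value / 10**k <= high
    ]
    if len(candidates) == 1:
        return candidates[0], "corrected"
    return value, "flagged"  # ambiguous or unrecoverable: leave for review

print(validate("Haemoglobin", 136.0))
print(validate("Haemoglobin", 136000.0))
print(validate("Haemoglobin", 2000.0))
```

In this sketch 2000 stays flagged because both 200 and 20 would be plausible, so no single correction can be chosen safely.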
Unit normalisation
Units are detected and normalised. If your 2010 cholesterol was reported in mg/dL and your 2024 result is in mmol/L, both are converted to mmol/L so your trend graph is consistent.
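As a worked example of the cholesterol case, the sketch below converts mg/dL to mmol/L by dividing by 38.67, the standard factor for cholesterol. The function name and table layout are hypothetical; the factor depends on the molecule, so every marker needs its own entry.

```python
# mg/dL per 1 mmol/L. The factor is specific to each molecule;
# 38.67 is the standard factor for cholesterol.
MG_DL_PER_MMOL_L = {
    "Total cholesterol": 38.67,
}

def to_mmol_per_l(marker: str, value: float, unit: str) -> float:
    """Convert a result to mmol/L, rounded to two decimal places."""
    if unit == "mmol/L":
        return value
    if unit == "mg/dL":
        return round(value / MG_DL_PER_MMOL_L[marker], 2)
    raise ValueError(f"Unsupported unit for {marker}: {unit}")

print(to_mmol_per_l("Total cholesterol", 200, "mg/dL"))  # 5.17
print(to_mmol_per_l("Total cholesterol", 5.2, "mmol/L"))  # 5.2
```

Raising an error on an unknown unit, rather than guessing, keeps a mislabelled value from silently landing on the trend graph.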
Marker name normalisation
The extracted marker name is matched against a 1,038-alias database. "FERRITIN", "Serum Ferritin", "FERRITIN (SERUM)", and "Iron Stores" all map to a single canonical "Ferritin" entry in your dashboard.
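Alias matching can be sketched as a lookup on a cleaned-up key. The four-entry table below stands in for the real alias database, whose contents are not published here; the cleaning rule (lower-case, collapse punctuation to spaces) and the function names are assumptions for illustration.

```python
import re
from typing import Optional

# Tiny stand-in for the alias database; keys are already cleaned.
ALIASES = {
    "ferritin": "Ferritin",
    "serum ferritin": "Ferritin",
    "ferritin serum": "Ferritin",
    "iron stores": "Ferritin",
}

def clean_key(raw_name: str) -> str:
    """Lower-case the name and replace runs of punctuation/whitespace with one space."""
    return re.sub(r"[^a-z0-9]+", " ", raw_name.lower()).strip()

def canonical_name(raw_name: str) -> Optional[str]:
    """Return the canonical marker name, or None if the alias is unknown."""
    return ALIASES.get(clean_key(raw_name))

for name in ["FERRITIN", "Serum Ferritin", "FERRITIN (SERUM)", "Iron Stores", "Ferritn"]:
    print(name, "->", canonical_name(name))
```

The misspelt "Ferritn" returns `None` in this sketch; an exact-match table cannot recover OCR typos on its own.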
DIY Alternatives and Their Limits
If you prefer a do-it-yourself approach or need CSV output, here are the main alternatives and their practical constraints.
| Tool | Approach | Main Limits | Best For |
|---|---|---|---|
| Manual copy-paste | Select text in PDF viewer, paste into spreadsheet | Garbled column order on all multi-column layouts. Zero output on image PDFs. | Single simple results with 1-2 values (very rare in practice). |
| Adobe Acrobat Export | Export PDF to Excel via Adobe Acrobat | Table structure often collapses. Image PDFs need OCR upgrade ($). Requires paid subscription. | Simple, single-column PDFs with no tables. |
| Tabula | Open-source tool to extract PDF tables by drawing bounding boxes | Manual configuration per lab layout. No OCR (image PDFs fail). Outputs raw CSV needing extensive cleanup. | Developers who want a DIY pipeline and can invest hours on configuration. |
| ChatGPT / Claude (manual paste) | Copy text, paste into AI, ask it to structure the data | Requires manual copy-paste (fails on image PDFs). No persistent storage, no trends, no history. Re-entry every visit. | One-off quick interpretation if you can copy the text cleanly. |
| Google Vision / AWS Textract | Developer API for OCR extraction | Raw OCR output needs significant post-processing to parse pathology table format. Requires coding skill. No UI. | Developers building their own pipeline who need high-quality OCR as a starting point. |
| SmarterBlood | Upload PDF; AI extracts, structures, stores, and charts automatically | No CSV export yet (roadmap). Requires an account. | Patients who want their full blood test history structured, charted, and accessible without any technical setup. |
Accuracy Caveats
Extraction accuracy is ~99% on clean PDFs, not 100%
On clean digital PDFs from major Australian labs, the four-model consensus approach achieves approximately 99% accuracy. That means roughly 1 in 100 values may need checking. Always review your results in SmarterBlood against your original PDF for high-stakes decisions.
Very degraded faxes may miss some values
A heavily degraded fax with missing sections, severe skew, or thermofax fading may extract only some markers. The original report always takes precedence over extracted data.
Marker names vary dramatically between labs
Despite our 1,038-alias database, edge cases exist. If a marker you expect does not appear, it may have been extracted under a slightly different name. Use the search function or contact support.
Dates in unusual formats may be misread
Some hospital systems use formats like "15MAR24" or "2024.03.15" that are outside the standard detection rules. If a result appears at an odd point on your timeline, check the date in the original PDF.
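If you are checking dates yourself, a small parser that tries several formats in order shows how the unusual cases can be handled. This is a sketch, not SmarterBlood's detection rules: the format list is an assumption, slash dates are read as Australian DD/MM, and month-name parsing with `%b` and `%B` depends on an English locale.

```python
from datetime import date, datetime
from typing import Optional

# Formats tried in order. Slash dates are assumed to be DD/MM/YYYY.
FORMATS = (
    "%d/%m/%Y",   # 15/03/2024
    "%d %B %Y",   # 15 March 2024
    "%d%b%y",     # 15MAR24
    "%Y.%m.%d",   # 2024.03.15
)

def parse_report_date(text: str) -> Optional[date]:
    """Return the first successful parse, or None if no format matches."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None

for raw in ["15/03/2024", "15 March 2024", "15MAR24", "2024.03.15", "03/15/2024"]:
    print(raw, "->", parse_report_date(raw))
```

The last input, a US-style date, returns `None` because 15 is not a valid month under the DD/MM assumption. An ambiguous date such as 03/04/2024 would parse without error as 3 April, which is exactly the kind of result worth checking against the original PDF.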
Upload Your Pathology PDF
SmarterBlood extracts every marker, value, and unit from your blood test PDF automatically. No copy-pasting, no spreadsheets. Free and private.
This page explains how SmarterBlood extracts data from pathology PDFs. Extraction accuracy is high but not perfect — always verify extracted values against your original pathology report for clinical decisions. SmarterBlood does not provide medical advice, diagnosis, or treatment.
