Why does copy-paste from a pathology PDF produce garbled text?

Pathology PDFs use multi-column table layouts that PDF readers were not designed to copy in reading order. When you copy-paste, the text is extracted column by column rather than row by row, mixing up marker names, values, and reference ranges. Scanned or image-based PDFs produce no text at all when copy-pasted.

How accurate is the AI extraction?

SmarterBlood uses four independent AI models (Claude Sonnet, Claude Haiku, Gemini Flash, and Google Vision OCR) and compares their results via majority vote. On clean digital PDFs from major Australian labs, accuracy is approximately 99%. On scanned or faxed documents, accuracy is typically 90-97% depending on scan quality.

Can I export my blood test data to a spreadsheet or CSV?

Currently SmarterBlood does not offer direct CSV export. Your data is displayed as interactive charts and can be exported as a formatted PDF report. CSV export is on our development roadmap.

Does it work on scanned PDFs and image-based PDFs?

Yes. SmarterBlood handles text-based PDFs, image-based PDFs (where the content is a scanned image), and standalone image files (JPEG, PNG). The Google Vision OCR component specifically handles image-based content. Accuracy is highest on clean digital PDFs and somewhat lower on degraded faxes or very old scans.

What data gets extracted from my pathology PDF?

SmarterBlood extracts: marker name, numeric value, unit (e.g. mmol/L, g/L, U/L), lab reference range, result flag (H/L/abnormal), test date, and lab name where present. The patient name and date of birth are used for identity validation and stored securely.

Can I use ChatGPT or another AI to extract data from my blood test PDF?

You can paste results into ChatGPT manually, but not every version can read PDFs directly, and it will not structure the data into a persistent, charted, trended database. SmarterBlood is purpose-built for this task: it stores your history, normalises units across labs and years, and provides Australian reference ranges in context.

Is my pathology PDF data kept private?

Yes. Your uploaded files and extracted data are stored encrypted and associated only with your account. SmarterBlood does not sell health data, share results with third parties, or use your data to train AI models. You can delete all your data from account settings at any time.

How-To Guide

How to Extract Data from a Pathology PDF

Manual copy-paste garbles the data. OCR tools miss the table structure. Here is why pathology PDFs are hard to parse — and how SmarterBlood solves it automatically.

This page explains how automated data extraction from pathology PDFs works. It is a technical guide, not medical advice. Always verify extracted data against your original report for clinical decisions.

The Quick Answer

Pathology PDFs are notoriously difficult to extract data from manually. Multi-column table layouts scramble when you copy-paste. Scanned PDFs return no text at all. Generic OCR tools do not understand the structure of a pathology report. SmarterBlood uses four independent AI models — Claude Sonnet, Claude Haiku, Gemini Flash, and Google Vision OCR — that each independently read your PDF and then vote on every value. The result is structured, labelled, charted data in your dashboard without any manual work.

~99% accuracy on clean PDFs

Scanned PDFs supported

Four-model consensus

No manual data entry

Why Pathology PDFs Are Hard to Extract

If you have ever tried to copy data from a blood test PDF into a spreadsheet, you know the result: a jumbled mess of numbers and names that bears no resemblance to the original table. There are six structural reasons why.

Multi-column table layouts

Pathology results are formatted as tables with marker names in one column, values in another, and reference ranges in a third. Copy-paste follows the order in which the text is stored in the PDF, which is often column by column rather than row by row, so marker names, values, and ranges come out separated from each other and are almost impossible to re-pair by hand.
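To illustrate the reading-order problem, here is a minimal sketch. It assumes you already have each word with its page coordinates; the tuples are made-up sample data and `group_into_rows` is a hypothetical helper, not part of any PDF library. Grouping words by vertical position before sorting them horizontally recovers the rows.

```python
# Hypothetical sample: (text, x, y) for each word, listed in the order a PDF
# might store them (column by column). Coordinates are made up.
words = [
    ("Haemoglobin", 50, 100), ("Ferritin", 50, 120),
    ("136", 200, 100), ("45", 200, 120),
    ("115-155", 300, 101), ("12-300", 300, 119),
]

def group_into_rows(words, tolerance=5):
    """Group words whose y is within `tolerance` of a row's first word,
    then sort each row left to right by x."""
    rows = []  # list of (anchor_y, [(x, text), ...])
    for text, x, y in sorted(words, key=lambda w: w[2]):
        if rows and abs(rows[-1][0] - y) <= tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))
    return [[text for _, text in sorted(cells)] for _, cells in rows]

# What naive copy-paste would give: words in stored order.
stored_order = " ".join(text for text, _, _ in words)
table = group_into_rows(words)
```

Here `stored_order` interleaves the columns, while `table` holds one list per result row. Real reports need more care (wrapped marker names, merged cells), so treat this as a starting point only.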

Image-based (scanned) PDFs

Many labs, particularly older ones and hospital systems, produce PDFs that are simply images with no underlying text. Copy-paste produces nothing. Standard OCR tools like Tesseract struggle with the multi-column pathology table format.

Lab-specific layouts

Every pathology company uses a different report template. Reports from 4Cyte, Sullivan Nicolaides, Dorevitch, Laverty, and hospital labs all look different. Tabula (an open-source PDF extraction tool) requires manual column boundary configuration for each layout.

Faxed reports

Many GP practices still receive results by fax, which are then scanned and saved as PDFs. Faxing introduces noise, diagonal lines, and skew that break standard OCR.

Multi-date tables

Some labs include results from multiple dates in a single table (e.g. columns for Jan, Mar, Jun). Generic extraction tools do not tie each column of values to its date, so dates and values get mixed.

Non-standard formatting

Handwritten annotations, stamps, watermarks ("REVISED REPORT"), and multi-page documents with tables split across pages all cause problems for generic OCR and copy-paste workflows.

What Gets Extracted from Your PDF

SmarterBlood extracts eight data fields from every blood test result. All are stored in your account and used to build your trend dashboard.

Data Field	Example	Notes
Marker name	Haemoglobin, TSH, Ferritin, HbA1c	Normalised to a canonical name using a 1,038-alias database. "FERRITIN", "Serum Ferritin", and "Fe STORES" all map to the same marker.
Numeric value	136, 2.4, 45.0, 5.8%	Extracted with physiological validation. Values outside plausible biological limits are flagged for review and re-extracted.
Unit	g/L, mIU/L, mcg/L, mmol/mol	Detected from the report and stored. Units are normalised across labs and years so trend graphs use consistent units.
Reference range	115-155, 0.4-4.0, 12-300	The lab's own reference range is extracted and stored alongside Australian population norms for context.
H/L flags	H (high), L (low), * (abnormal)	Abnormal flags from the original report are preserved. SmarterBlood also independently computes status against Australian reference ranges.
Test date	15 March 2024, 15/03/2024	Date format is auto-detected (Australian DD/MM or US-style MM/DD) using lab-specific disambiguation rules.
Lab name	4Cyte, Sullivan Nicolaides, Dorevitch	Extracted where present in the header. Used to contextualise lab-specific reference range differences.
Document type	Blood test, urine test, thyroid panel	Classified automatically before extraction. Non-blood-test documents (imaging, referrals) are stored separately without extraction.

How the Extraction Pipeline Works

Document classification

Before extraction begins, the document is classified. Is this a blood test? Urine test? Imaging report? Referral letter? Only blood test documents are passed to the extraction pipeline. This prevents wasted processing and keeps your dashboard clean.

Four-model parallel extraction

Claude Sonnet (vision), Claude Haiku (vision), Gemini 2.5 Flash (vision), and Google Vision OCR each independently read your PDF. Because the four run in parallel, the total wait is roughly that of the slowest single model rather than the sum of all four, and every value gets a four-way check.
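The parallel step can be sketched with Python's standard thread pool. The extractors below are stubs that sleep and return a fixed value; their names, delays, and return shape are assumptions for illustration, not SmarterBlood's actual API calls.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def make_stub_extractor(name, delay_s, value):
    """Return a stand-in for a model call; a real one would call an API."""
    def extract(pdf_bytes):
        time.sleep(delay_s)  # simulate network and inference latency
        return name, {"Haemoglobin": value}
    return extract

# Hypothetical stand-ins; names and delays are illustrative only.
extractors = [
    make_stub_extractor("model_a", 0.05, 136),
    make_stub_extractor("model_b", 0.10, 136),
    make_stub_extractor("model_c", 0.20, 136),
    make_stub_extractor("ocr", 0.15, 138),
]

def extract_in_parallel(pdf_bytes):
    """Run every extractor on the same document at once and collect results."""
    with ThreadPoolExecutor(max_workers=len(extractors)) as pool:
        futures = [pool.submit(fn, pdf_bytes) for fn in extractors]
        return dict(f.result() for f in futures)

start = time.perf_counter()
results = extract_in_parallel(b"%PDF-stub")
elapsed = time.perf_counter() - start
```

With these stub delays the call should return in about the time of the slowest stub (0.2 s) rather than the 0.5 s sum, which is the point of running the models side by side.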

Majority-vote consensus

For each marker, the four results are compared. If all four agree, the value is accepted at HIGH confidence. A clear majority (three of four) gives MEDIUM confidence. A tie, or no agreement at all, sends the value to the secondary validation queue for deeper investigation.
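A minimal sketch of this voting rule follows. The function name and the `NEEDS_REVIEW` label are hypothetical, and treating a 2-1-1 split the same as a tie is an assumption, since the text above only defines the four-of-four and three-of-four cases.

```python
from collections import Counter

def consensus(values):
    """Vote across four model outputs for one marker.

    Returns (value, confidence); value is None when there is no clear majority.
    """
    top_value, top_count = Counter(values).most_common(1)[0]
    if top_count == 4:
        return top_value, "HIGH"
    if top_count == 3:
        return top_value, "MEDIUM"
    # 2-2 tie, 2-1-1 split, or all different (assumed to need review)
    return None, "NEEDS_REVIEW"

unanimous = consensus([136, 136, 136, 136])
majority = consensus([136, 136, 136, 138])
tie = consensus([136, 136, 138, 138])
```

A production version would also need to decide how to count a model that returned nothing for a marker; this sketch assumes all four always return a value.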

Physiological validation

Each accepted value is checked against known physiological limits. A haemoglobin of 136,000 g/L (a common OCR error where the decimal was missed) is automatically corrected to 136 g/L and flagged in the audit trail.
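One way such a check could work is sketched below. The limits are illustrative placeholders, not clinical reference ranges and not SmarterBlood's actual table, and the power-of-ten repair is an assumed strategy for the lost-decimal case described above.

```python
# Illustrative plausibility limits only (g/L); not clinical guidance.
PLAUSIBLE_LIMITS = {
    "Haemoglobin": (20.0, 250.0),
}

def validate(marker, value):
    """Return (value, note). If the value is implausibly high, try dividing
    by powers of ten to undo a lost decimal point."""
    low, high = PLAUSIBLE_LIMITS[marker]
    if low <= value <= high:
        return value, "ok"
    for divisor in (10, 100, 1000):
        candidate = value / divisor
        if low <= candidate <= high:
            return candidate, f"corrected: divided by {divisor}"
    return None, "flagged for review"

fixed = validate("Haemoglobin", 136000)
```

The note string stands in for the audit trail entry. When more than one divisor would give a plausible value the sketch takes the first, which is a reason to keep the original reading alongside any correction.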

Unit normalisation

Units are detected and normalised. If your 2010 cholesterol was reported in mg/dL and your 2024 result is in mmol/L, both are converted to mmol/L so your trend graph is consistent.
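As a worked example of the cholesterol case, the sketch below uses the standard conversion for total cholesterol (mg/dL divided by 38.67 gives mmol/L). The function and table names are hypothetical, and each analyte needs its own factor because the factor depends on molar mass.

```python
# mmol/L = mg/dL / 38.67 for total cholesterol (molar mass about 386.65 g/mol).
MG_DL_PER_MMOL_L = {"Total Cholesterol": 38.67}

def to_mmol_per_l(marker, value, unit):
    """Convert a result to mmol/L, leaving mmol/L values untouched."""
    if unit == "mmol/L":
        return value
    if unit == "mg/dL":
        return round(value / MG_DL_PER_MMOL_L[marker], 2)
    raise ValueError(f"unsupported unit: {unit}")

old_result = to_mmol_per_l("Total Cholesterol", 200, "mg/dL")   # 5.17
new_result = to_mmol_per_l("Total Cholesterol", 5.4, "mmol/L")  # 5.4
```

Both results can then be plotted on the same mmol/L axis.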

Marker name normalisation

The extracted marker name is matched against a 1,038-alias database. "FERRITIN", "Serum Ferritin", "FERRITIN (SERUM)", and "Iron Stores" all map to a single canonical "Ferritin" entry in your dashboard.
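Alias matching of this kind can be sketched as a normalise-then-look-up step. The four-entry table below is a toy stand-in for the 1,038-alias database, and the cleaning rules (lowercase, strip punctuation, collapse spaces) are assumptions.

```python
import re

# A tiny illustrative alias table; keys are already in cleaned form.
ALIASES = {
    "ferritin": "Ferritin",
    "serum ferritin": "Ferritin",
    "ferritin serum": "Ferritin",
    "iron stores": "Ferritin",
}

def canonical_name(raw):
    """Lowercase, replace punctuation with spaces, collapse whitespace, look up."""
    key = re.sub(r"[^a-z0-9 ]+", " ", raw.lower())
    key = " ".join(key.split())
    return ALIASES.get(key)  # None means the marker is not in the table

names = [canonical_name(n) for n in
         ["FERRITIN", "Serum Ferritin", "FERRITIN (SERUM)", "Iron Stores"]]
```

All four spellings resolve to "Ferritin". A name that is missing from the table returns `None`, which is the edge case where a marker may not appear where you expect it.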

DIY Alternatives and Their Limits

If you prefer a do-it-yourself approach or need CSV output, here are the main alternatives and their practical constraints.

Tool	Approach	Main Limits	Best For
Manual copy-paste	Select text in PDF viewer, paste into spreadsheet	Garbled column order on all multi-column layouts. Zero output on image PDFs.	Single simple results with 1-2 values (very rare in practice).
Adobe Acrobat Export	Export PDF to Excel via Adobe Acrobat	Table structure often collapses. Image PDFs need OCR upgrade ($). Requires paid subscription.	Simple, single-column PDFs with no tables.
Tabula	Open-source tool to extract PDF tables by drawing bounding boxes	Manual configuration per lab layout. No OCR (image PDFs fail). Outputs raw CSV needing extensive cleanup.	Developers who want a DIY pipeline and can invest hours on configuration.
ChatGPT / Claude (manual paste)	Copy text, paste into AI, ask it to structure the data	Requires manual copy-paste (fails on image PDFs). No persistent storage, no trends, no history. Re-entry every visit.	One-off quick interpretation if you can copy the text cleanly.
Google Vision / AWS Textract	Developer API for OCR extraction	Raw OCR output needs significant post-processing to parse pathology table format. Requires coding skill. No UI.	Developers building their own pipeline who need high-quality OCR as a starting point.
SmarterBlood	Upload PDF; AI extracts, structures, stores, and charts automatically	No CSV export yet (roadmap). Requires an account.	Patients who want their full blood test history structured, charted, and accessible without any technical setup.

For quantified-self enthusiasts: If you want raw CSV output to build your own analysis, the current best approach is Google Vision API or AWS Textract for OCR, followed by a custom parser for the specific lab layouts you use. SmarterBlood CSV export is on the roadmap — sign up to be notified when it launches.
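For the custom-parser step, a rough starting point might look like the sketch below. It assumes the OCR text has already been arranged one result per line in the shape "name value unit low-high [H|L]"; real lab layouts vary, so the pattern and sample text are illustrative only.

```python
import csv
import io
import re

# Assumed line shape: name, value, unit, low-high range, optional H/L flag.
ROW = re.compile(
    r"^(?P<marker>[A-Za-z][A-Za-z0-9 ]*?)\s+"
    r"(?P<value>\d+(?:\.\d+)?)\s+"
    r"(?P<unit>\S+)\s+"
    r"(?P<low>\d+(?:\.\d+)?)\s*-\s*(?P<high>\d+(?:\.\d+)?)"
    r"(?:\s+(?P<flag>[HL]))?\s*$"
)

# Made-up sample text standing in for cleaned OCR output.
ocr_text = """Haemoglobin 136 g/L 115-155
TSH 5.2 mIU/L 0.4-4.0 H
Collected 15/03/2024"""

def rows_to_csv(text):
    """Write every line that looks like a result row to CSV; skip the rest."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["marker", "value", "unit", "low", "high", "flag"])
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            writer.writerow([m["marker"], m["value"], m["unit"],
                             m["low"], m["high"], m["flag"] or ""])
    return out.getvalue()

csv_text = rows_to_csv(ocr_text)
```

Lines that do not fit the pattern, such as the collection date, are dropped silently here; in practice you would log them so that missed results are visible.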

Accuracy Caveats

Extraction accuracy is ~99% on clean PDFs, not 100%

On clean digital PDFs from major Australian labs, the four-model consensus approach achieves approximately 99% accuracy. That means roughly 1 in 100 values may need checking. Always review your results in SmarterBlood against your original PDF for high-stakes decisions.
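To put the 99% figure in per-report terms, the short calculation below assumes errors are independent between values, which is a simplification, and uses a hypothetical report of 40 values.

```python
def chance_of_any_error(n_values, per_value_accuracy=0.99):
    """Probability that at least one of n values is wrong,
    assuming each value is right independently of the others."""
    return 1 - per_value_accuracy ** n_values

expected_errors = 40 * (1 - 0.99)       # about 0.4 values per report
any_error = chance_of_any_error(40)     # about 0.33
```

Under those assumptions roughly one report in three would contain at least one value worth checking, which is why a review against the original PDF matters for high-stakes decisions.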

Very degraded faxes may miss some values

A heavily degraded fax with missing sections, severe skew, or thermal-paper fading may yield only some markers. The original report always takes precedence over extracted data.

Marker names vary dramatically between labs

Despite our 1,038-alias database, edge cases exist. If a marker you expect does not appear, it may have been extracted under a slightly different name. Use the search function or contact support.

Dates in unusual formats may be misread

Some hospital systems use formats like "15MAR24" or "2024.03.15" that are outside the standard detection rules. If a result appears at an odd point on your timeline, check the date in the original PDF.
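If you are checking dates yourself, a fallback chain of formats handles the examples above. This is a sketch using Python's standard `datetime.strptime`; the format list and its ordering (DD/MM tried before MM/DD, on the assumption of an Australian report) are choices made for illustration, and month-name parsing assumes an English locale.

```python
from datetime import date, datetime

# Formats tried in order; DD/MM comes before MM/DD by assumption.
FORMATS = ["%d/%m/%Y", "%d %B %Y", "%d%b%y", "%Y.%m.%d", "%m/%d/%Y"]

def parse_report_date(text):
    """Return a date for the first format that fits, or None if none do."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None

parsed = [parse_report_date(s) for s in
          ["15/03/2024", "15 March 2024", "15MAR24", "2024.03.15"]]
```

All four strings parse to 15 March 2024. An ambiguous string such as "03/04/2024" is read as 3 April under this ordering, so a wrong guess shows up as a result at an odd point on the timeline.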