May 26, 2026 · 6 min read · PDF

How to Extract Text from a PDF

Hitendra Patel

Founder, keptlocal · Senior Technical Lead, Healthcare IT

Extracting text from a PDF is one of those tasks that sounds simple but runs into unexpected complications. Here is what actually works — and why scanned PDFs need a different approach.

The two types of PDF (and why it matters)

Before choosing an extraction method, it helps to know what kind of PDF you are dealing with:

Text-based PDFs contain actual text data embedded in the file. If you can open the PDF and select text with your cursor, that text is extractable. This includes most PDFs created by word processors, design tools, and modern office software.

Image-based (scanned) PDFs are essentially a sequence of page images with no embedded text. Scanning a paper document produces this type. When you open a scanned PDF, you cannot select text — because there is no text to select, only pixels.

For text-based PDFs, the methods below work directly. For scanned PDFs, you need OCR first.
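If you would rather check programmatically than by dragging a cursor, one rough heuristic is to run a text extractor first and see how much comes back per page. The sketch below assumes you already have one string per page from whatever extractor you use. The function names and the 20-character threshold are illustrative assumptions, not part of any tool mentioned here.

```python
def classify_pages(page_texts, min_chars=20):
    """Label each page 'text' or 'scanned' by how much text was extracted.

    page_texts: one string per page (None or '' when nothing was extracted).
    min_chars:  assumed threshold; pages with fewer non-whitespace characters
                are treated as image-only. Tune it for your documents.
    """
    labels = []
    for text in page_texts:
        # Strip all whitespace so blank lines and stray spaces do not count.
        visible = "".join((text or "").split())
        labels.append("text" if len(visible) >= min_chars else "scanned")
    return labels


def needs_ocr(page_texts, min_chars=20):
    """Return True if at least one page looks image-only."""
    return "scanned" in classify_pages(page_texts, min_chars)


pages = ["Quarterly report\nRevenue grew 4% year over year.", "", "  \n "]
print(classify_pages(pages))  # ['text', 'scanned', 'scanned']
```

A page that is mostly a photo with a short caption would be misclassified by this check, so treat the result as a hint rather than a verdict.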

Option 1: Select and copy in your PDF reader

The simplest method for short extractions:

Open the PDF in Adobe Reader, Chrome, Preview, or any PDF viewer.
Click and drag to select the text you want.
Press Ctrl+C (Cmd+C on a Mac) to copy, then paste wherever you need it.

Limitations: this works well for a paragraph or two. For entire documents or multiple pages, it is tedious. Multi-column layouts and tables often paste in the wrong order. And some PDFs have text selection disabled (a PDF permissions flag).

Option 2: Browser-based extraction (no upload)

The PDF to Text tool on keptlocal extracts all text from a PDF page by page, entirely in your browser using pdf.js. Your file is never uploaded.

Drop your PDF onto the tool.
Click Extract text. The tool processes each page and shows the extracted content.
Copy it to your clipboard or download a .txt file.

This works well for text-based PDFs of any length. The output is organised by page, so long documents remain navigable. For sensitive documents — legal filings, medical records, financial statements — the browser-based approach means no data leaves your device.

Option 3: Copy all text with Ctrl+A

In some PDF viewers, pressing Ctrl+A (Cmd+A on a Mac) selects all the text in the document, which you can then copy in one step. Chrome's built-in PDF viewer supports this. The result can be messy for complex layouts but is fast for simple documents.

Option 4: Save as text from Acrobat

Adobe Acrobat (including the free Reader, not just Pro) can export a PDF as plain text:

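File → Export To → Text (Plain). Menu names vary between versions; in the free Reader the command may appear as File → Save as Text.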
Choose a location and save.

Acrobat's text extraction generally handles multi-column layouts and tables better than browser-based tools, though the results still depend on the PDF's structure.

Why multi-column and table layouts extract poorly

A PDF is not required to store text in reading order. Text is stored as a series of positioned drawing instructions, in effect "draw this character at coordinate (x, y)." To extract text, a tool has to reconstruct reading order from those positions. This works well for single-column documents but breaks down for complex layouts. I once extracted a two-column academic paper and got alternating lines from each column: positionally correct, but unreadable as prose. The breakdown is most visible with:

Multi-column layouts — the columns may extract as alternating lines rather than column by column.
Tables — cells may extract across rather than down, losing the table structure.
Rotated text — sidebars and captions at 90° may appear at random positions in the extracted text.
Text in text boxes — floating text boxes may appear before or after the main body regardless of visual position.

This is not a bug in any particular tool. It is a fundamental limitation of the PDF format for text extraction purposes.
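To make the reading-order problem concrete, here is a minimal sketch of a two-column page. It assumes each text run has already been reduced to an (x, y, text) triple with y measured from the top of the page, and that the columns are equal in width. Both are simplifications, and the names Fragment, naive_order, and column_order are made up for illustration. Real extractors have to infer column boundaries from the gaps between text runs.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    x: float   # left edge of the text run
    y: float   # distance from the top of the page (larger means lower)
    text: str


def naive_order(fragments):
    """Sort top-to-bottom, then left-to-right. This interleaves columns."""
    return [f.text for f in sorted(fragments, key=lambda f: (f.y, f.x))]


def column_order(fragments, page_width, columns=2):
    """Assign each fragment to a column by x, then read each column downward.

    Assumes equal-width columns, which real layouts do not guarantee.
    """
    col_width = page_width / columns

    def key(f):
        col = min(int(f.x // col_width), columns - 1)
        return (col, f.y, f.x)

    return [f.text for f in sorted(fragments, key=key)]


page = [
    Fragment(50, 100, "Left line 1"),
    Fragment(320, 100, "Right line 1"),
    Fragment(50, 120, "Left line 2"),
    Fragment(320, 120, "Right line 2"),
]

print(naive_order(page))
# ['Left line 1', 'Right line 1', 'Left line 2', 'Right line 2']
print(column_order(page, page_width=600))
# ['Left line 1', 'Left line 2', 'Right line 1', 'Right line 2']
```

The naive sort reproduces the alternating-lines result described above. The column-aware sort fixes it only because the column count and page width were supplied by hand, which is exactly the information a general-purpose tool does not have.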

Scanned PDFs: you need OCR

If your PDF is a scan with no text layer, none of the above will work. You need optical character recognition (OCR) to convert the image into extractable text.

Adobe Acrobat Pro is the most reliable OCR option (Tools → Enhance Scans → Recognize Text); it produces a searchable PDF with an embedded text layer. Google Drive is a free alternative: upload the PDF, right-click and open with Google Docs, and Google's OCR runs automatically. That does mean uploading to Google's servers, which matters for sensitive documents. Microsoft OneNote handles individual pages: insert a PDF page as an image, right-click, and choose "Copy Text from Picture." For command-line users, Tesseract is free, open-source, and handles most languages reasonably well. It reads images rather than PDFs, so scanned pages have to be converted to images first. It is slower than Acrobat but gets the job done for single documents.
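For the Tesseract route, a local pipeline might look like the sketch below. It assumes two command-line tools are installed and on your PATH: pdftoppm (part of Poppler, used here to render pages to PNG, since Tesseract takes images as input) and tesseract itself. The helper names, the 300 DPI setting, and the working-directory layout are illustrative choices, not a recommended configuration.

```python
import subprocess
from pathlib import Path


def rasterize_command(pdf_path, image_prefix, dpi=300):
    """Build the pdftoppm command that renders each PDF page to a PNG."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(image_prefix)]


def ocr_command(image_path, lang="eng"):
    """Build the Tesseract command that prints recognised text to stdout."""
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def ocr_pdf(pdf_path, work_dir, lang="eng"):
    """Render the PDF to images, OCR each one, and return one string per page."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(rasterize_command(pdf_path, work_dir / "page"), check=True)

    pages = []
    # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
    for image in sorted(work_dir.glob("page-*.png")):
        result = subprocess.run(
            ocr_command(image, lang), check=True, capture_output=True, text=True
        )
        pages.append(result.stdout)
    return pages
```

Calling ocr_pdf("scan.pdf", "ocr_tmp") would leave the page images in ocr_tmp and return the recognised text. Everything runs on your own machine, so the file is never uploaded.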

keptlocal plans to add browser-based OCR as an optional feature. It will not be as fast as server-side solutions, but it will keep the file local.

The privacy consideration

Most online PDF-to-text converters upload your document to a server for processing. For PDFs containing sensitive information, such as legal documents, medical records, financial reports, or personal correspondence, that upload is a real privacy concern.

Browser-based extraction (keptlocal), local PDF readers, and Acrobat all keep the file on your device. If the content is sensitive, avoid upload-based tools.

Extract text from any PDF with keptlocal's PDF to Text tool — no upload, instant results. Also useful: PDF to JPG to convert scanned pages to images for processing elsewhere.
