How to Extract Text from a PDF
Extracting text from a PDF is one of those tasks that sounds simple but runs into unexpected complications. Here is what actually works — and why scanned PDFs need a different approach.
The two types of PDF (and why it matters)
Before choosing an extraction method, it helps to know what kind of PDF you are dealing with:
Text-based PDFs contain actual text data embedded in the file. If you can open the PDF and select text with your cursor, that text is extractable. This category includes most PDFs created by word processors, design tools, and modern office software.
Image-based (scanned) PDFs are essentially a sequence of page images with no embedded text. Scanning a paper document produces this type. When you open a scanned PDF, you cannot select text — because there is no text to select, only pixels.
For text-based PDFs, the methods below work directly. For scanned PDFs, you need OCR first.
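The distinction above can also be checked programmatically. The following is a minimal sketch, not a definitive test: it assumes the third-party pypdf library is installed, and the function names, the 20-character minimum, and the 10% cut-off are all illustrative choices rather than anything the tools in this article use.

```python
def looks_scanned(page_texts, min_chars=20, min_text_ratio=0.1):
    """Guess whether a PDF is image-based from its per-page extracted text.

    min_chars and min_text_ratio are hypothetical thresholds; tune them
    for your own documents.
    """
    if not page_texts:
        return True
    pages_with_text = sum(
        1 for text in page_texts if len((text or "").strip()) >= min_chars
    )
    return pages_with_text / len(page_texts) < min_text_ratio


def extract_page_texts(path):
    """Return one string per page using pypdf (pip install pypdf)."""
    from pypdf import PdfReader  # imported lazily so looks_scanned works without it

    reader = PdfReader(path)
    return [page.extract_text() or "" for page in reader.pages]
```

A call such as `looks_scanned(extract_page_texts("report.pdf"))` (the file name is a placeholder) returning True suggests the file needs OCR. A scan that already carries an OCR text layer will be reported as text-based, which is usually what you want.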
Option 1: Select and copy in your PDF reader
The simplest method for short extractions:
- Open the PDF in Adobe Reader, Chrome, Preview, or any PDF viewer.
- Click and drag to select the text you want.
- Press Ctrl+C/Cmd+C to copy, then paste wherever you need it.
Limitations: this works well for a paragraph or two, but it is tedious for entire documents or multiple pages. Multi-column layouts and tables often paste in the wrong order, and some PDFs have text selection or copying disabled through a PDF permissions flag.
Option 2: Browser-based extraction (no upload)
The PDF to Text tool on keptlocal extracts all text from a PDF page by page, entirely in your browser using pdf.js. Your file is never uploaded.
- Drop your PDF onto the tool.
- Click Extract text. The tool processes each page and shows the extracted content.
- Copy it to your clipboard or download a .txt file.
This works well for text-based PDFs of any length. The output is organised by page, so long documents remain navigable. For sensitive documents — legal filings, medical records, financial statements — the browser-based approach means no data leaves your device.
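For readers who prefer a script, the same page-by-page idea can be sketched locally in Python. This is not how the keptlocal tool works (that runs pdf.js in the browser); it is an analogous sketch that assumes the third-party pypdf library, and the function names and the page-header format are made up for illustration.

```python
from pathlib import Path


def format_by_page(page_texts):
    """Join per-page text under simple page headers so long output stays navigable."""
    sections = []
    for number, text in enumerate(page_texts, start=1):
        sections.append(f"--- Page {number} ---\n{text.strip()}")
    return "\n\n".join(sections) + "\n"


def pdf_to_txt(pdf_path, txt_path):
    """Extract every page with pypdf (pip install pypdf) and write one .txt file."""
    from pypdf import PdfReader  # imported lazily; only needed for real PDFs

    reader = PdfReader(pdf_path)
    texts = [page.extract_text() or "" for page in reader.pages]
    Path(txt_path).write_text(format_by_page(texts), encoding="utf-8")
```

Like the browser-based approach, this keeps the file on your own machine: nothing is sent over the network.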
Option 3: Copy all text with Ctrl+A
In some PDF viewers, pressing Ctrl+A (Cmd+A on macOS) selects all text in the current document view, which you can then copy in one step. Chrome's built-in PDF viewer supports this. The result can be messy for complex layouts but is fast for simple documents.
Option 4: Save as text from Acrobat
Adobe Acrobat (including the free Reader, not just Pro) can export a PDF as plain text:
- File → Export To → Text (Plain). The exact menu path varies between Acrobat versions.
- Choose a location and save.
Acrobat's text extraction generally handles multi-column layouts and tables better than browser-based tools, though the results still depend on the PDF's structure.
Why multi-column layouts extract poorly
A PDF is not required to store text in reading order. Text is stored as a series of positioned drawing instructions, along the lines of "draw this character at coordinate (x, y)." When extracting text, a tool has to reconstruct reading order from those positions, which works well for single-column documents but breaks down for:
- Multi-column layouts — the columns may extract as alternating lines rather than column by column.
- Tables — cells may extract across rather than down, losing the table structure.
- Rotated text — sidebars and captions at 90° may appear at random positions in the extracted text.
- Text in text boxes — floating text boxes may appear before or after the main body regardless of visual position.
This is not a bug in any particular tool. It follows from how the PDF format represents text: unless the file carries optional structure tags, reading order has to be inferred from positions on the page.
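A toy example makes the multi-column problem concrete. The sketch below is a simplification under stated assumptions: the `Fragment` type, the function names, and the fixed column boundary are hypothetical, and `y` is measured from the top of the page for readability (real PDF coordinates start at the bottom-left). Real extractors use more elaborate layout analysis.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    x: float   # horizontal position
    y: float   # distance from the top of the page (simplified)
    text: str


def naive_order(fragments):
    """Sort top-to-bottom, then left-to-right: interleaves columns line by line."""
    return [f.text for f in sorted(fragments, key=lambda f: (f.y, f.x))]


def column_order(fragments, column_boundary):
    """Read the whole left column first, then the right column."""
    left = sorted((f for f in fragments if f.x < column_boundary), key=lambda f: f.y)
    right = sorted((f for f in fragments if f.x >= column_boundary), key=lambda f: f.y)
    return [f.text for f in left + right]


# Two columns, two lines each.
page = [
    Fragment(50, 100, "L1"), Fragment(320, 100, "R1"),
    Fragment(50, 120, "L2"), Fragment(320, 120, "R2"),
]

print(naive_order(page))        # ['L1', 'R1', 'L2', 'R2']
print(column_order(page, 300))  # ['L1', 'L2', 'R1', 'R2']
```

The naive ordering reproduces the "alternating lines" failure described above. The column-aware version only works because the boundary at x = 300 was supplied by hand; a general tool has to guess it, which is where extraction goes wrong.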
Scanned PDFs: you need OCR
If your PDF is a scan with no text layer, none of the above will work. You need optical character recognition (OCR) to convert the image into extractable text.
Options for OCR:
- Adobe Acrobat Pro — the most reliable option. Tools → Enhance Scans → Recognize Text. Produces a searchable PDF with an embedded text layer.
- Google Drive — upload the PDF to Google Drive, right-click and open with Google Docs. Google's OCR runs automatically and produces a document with the extracted text. Free, but you are uploading to Google's servers.
- Microsoft OneNote — insert a PDF page as an image, then right-click and choose "Copy Text from Picture." Works for individual pages.
- Tesseract — open-source OCR engine, command-line, free. Handles most languages but requires installation.
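As a rough illustration of the Tesseract route, the sketch below calls the `tesseract` command-line program from Python. It assumes Tesseract is installed and on your PATH, and that you have already converted the scanned page to an image, since Tesseract reads image files rather than PDFs. The function names are illustrative.

```python
import shutil
import subprocess


def tesseract_command(image_path, lang="eng"):
    """Build the CLI call; the output base 'stdout' prints text instead of writing a file."""
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def ocr_image(image_path, lang="eng"):
    """Run Tesseract on one page image and return the recognised text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract is not installed or not on PATH")
    result = subprocess.run(
        tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if tesseract exits with an error
    )
    return result.stdout
```

Calling `ocr_image("page-1.png")` (a placeholder file name) returns the recognised text for that page. Other languages need the matching Tesseract language data installed, for example `lang="deu"` for German.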
keptlocal plans to add browser-based OCR as an optional feature — watch this space.
The privacy consideration
Online PDF-to-text converters upload your document to process it. For PDFs containing sensitive information — legal documents, medical records, financial reports, personal correspondence — that upload is a real privacy concern.
Browser-based extraction (keptlocal), local PDF readers, and Acrobat all keep the file on your device. If the content is sensitive, avoid upload-based tools.
Extract text from any PDF with keptlocal's PDF to Text tool — no upload, instant results. Also useful: PDF to JPG to convert scanned pages to images for processing elsewhere.