keptlocal

How to Extract Text from a PDF

Hiten Mahalwar
Founder, keptlocal · Technical Lead, Healthcare IT

Extracting text from a PDF is one of those tasks that sounds simple but runs into unexpected complications. Here is what actually works — and why scanned PDFs need a different approach.

The two types of PDF (and why it matters)

Before choosing an extraction method, it helps to know what kind of PDF you are dealing with:

Text-based PDFs contain actual text data embedded in the file. If you can select text with your cursor after opening the PDF, the text is extractable. This includes most PDFs created by word processors, design tools, and modern office software.

Image-based (scanned) PDFs are essentially a sequence of page images with no embedded text. Scanning a paper document produces this type. When you open a scanned PDF, you cannot select text — because there is no text to select, only pixels.

For text-based PDFs, the methods below work directly. For scanned PDFs, you need OCR first.
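One rough way to tell the two types apart programmatically is to look at how much text an extractor returns for each page. The sketch below assumes you already have one string per page from some extractor (for example pypdf's `page.extract_text()`). The names `classify_pages` and `needs_ocr` and the 20-character threshold are illustrative assumptions, not a standard.

```python
def classify_pages(page_texts, min_chars=20):
    """Label each page 'text' or 'scanned' from its extracted character count.

    min_chars is an arbitrary, hypothetical threshold: pages that yield fewer
    non-whitespace characters are assumed to be images that need OCR.
    """
    labels = []
    for text in page_texts:
        visible = "".join(text.split())  # drop all whitespace
        labels.append("text" if len(visible) >= min_chars else "scanned")
    return labels


def needs_ocr(page_texts, min_chars=20):
    """Return True if at least one page looks like a scan."""
    return "scanned" in classify_pages(page_texts, min_chars)
```

A page that is mostly a photograph with a short caption could be mislabelled by a heuristic like this, so treat the result as a hint rather than a verdict.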

Option 1: Select and copy in your PDF reader

The simplest method for short extractions:

  1. Open the PDF in Adobe Reader, Chrome, Preview, or any PDF viewer.
  2. Click and drag to select the text you want.
  3. Ctrl+C / Cmd+C to copy, then paste wherever you need it.

Limitations: this works well for a paragraph or two. For entire documents or multiple pages, it is tedious. Multi-column layouts and tables often paste in the wrong order. And some PDFs have text selection disabled (a PDF permissions flag).
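The permissions flag mentioned above is stored in the `/P` entry of an encrypted PDF's encryption dictionary. It is a signed 32-bit integer in which bit 5 (counting from 1) controls copying and extracting text. A minimal sketch of testing that bit, assuming you have already read the `/P` value with a PDF library; `can_copy_text` is a hypothetical helper name:

```python
# Bit 5 in the PDF specification's 1-based numbering is 1 << 4 here.
COPY_BIT = 1 << 4


def can_copy_text(p_value: int) -> bool:
    """Return True if the /P permissions value allows copying text.

    Python's & treats negative integers as two's complement, so the
    signed value from the file can be tested directly.
    """
    return bool(p_value & COPY_BIT)


print(can_copy_text(-3904))  # False: copy bit is clear
print(can_copy_text(-3888))  # True: same value with the copy bit set
```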

Option 2: Browser-based extraction (no upload)

The PDF to Text tool on keptlocal extracts all text from a PDF page by page, entirely in your browser using pdf.js. Your file is never uploaded.

  1. Drop your PDF onto the tool.
  2. Click Extract text. The tool processes each page and shows the extracted content.
  3. Copy it to your clipboard or download a .txt file.

This works well for text-based PDFs of any length. The output is organised by page, so long documents remain navigable. For sensitive documents — legal filings, medical records, financial statements — the browser-based approach means no data leaves your device.
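To illustrate what page-organised output can look like, here is a small sketch that joins per-page strings into a single .txt file with a header before each page. It is not keptlocal's actual output format; the `--- Page N ---` header and the function names are assumptions made for the example.

```python
def format_pages(page_texts):
    """Join per-page text into one string with a header before each page."""
    sections = []
    for number, text in enumerate(page_texts, start=1):
        sections.append(f"--- Page {number} ---\n{text.strip()}")
    return "\n\n".join(sections) + "\n"


def save_text(page_texts, path):
    """Write the formatted pages to a UTF-8 text file."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(format_pages(page_texts))
```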

Option 3: Copy all text with Ctrl+A

In some PDF viewers, pressing Ctrl+A (Cmd+A on macOS) selects all text in the document, which you can then copy in one step. Chrome's built-in PDF viewer supports this. The result can be messy for complex layouts but is fast for simple documents.

Option 4: Save as text from Acrobat

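Adobe Acrobat can export a PDF as plain text, and the free Reader offers a similar save-as-text option, though the menu names differ between versions:

  1. In Acrobat, choose File → Export To → Text (Plain). In the free Reader, look for File → Save as Other → Text, or Save as Text.
  2. Choose a location and save.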

Acrobat's text extraction generally handles multi-column layouts and tables better than browser-based tools, though the results still depend on the PDF's structure.

Why multi-column layouts extract poorly

PDF does not guarantee that text is stored in reading order. Text is stored as a series of positioned drawing instructions, in effect "draw these glyphs at coordinate (x, y)." When extracting text, a tool has to reconstruct reading order from those positions, which works well for single-column documents but breaks down for:

  • Multi-column layouts — the columns may extract as alternating lines rather than column by column.
  • Tables — cells may extract across rather than down, losing the table structure.
  • Rotated text — sidebars and captions at 90° may appear at random positions in the extracted text.
  • Text in text boxes — floating text boxes may appear before or after the main body regardless of visual position.

This is not a bug in any particular tool. It is a limitation of how the PDF format represents text: unless the file is tagged with logical structure, the reading order is simply not recorded.
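To make the reordering problem concrete, the sketch below models a two-column page as a handful of positioned fragments and compares two ways of ordering them. The `Fragment` class, the hand-picked `split_x` gutter position, and the choice to measure `y` from the top of the page are all assumptions for illustration. Real extractors use more elaborate layout analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """One positioned piece of text, as an extractor might see it."""
    text: str
    x: float  # left edge of the fragment
    y: float  # distance from the top of the page (assumed for simplicity)


def naive_order(fragments):
    """Read top-to-bottom, then left-to-right, ignoring columns."""
    return [f.text for f in sorted(fragments, key=lambda f: (f.y, f.x))]


def column_order(fragments, split_x):
    """Read the whole left column first, then the right column.

    split_x is a hypothetical, hand-picked x position of the gutter.
    """
    left = sorted((f for f in fragments if f.x < split_x), key=lambda f: f.y)
    right = sorted((f for f in fragments if f.x >= split_x), key=lambda f: f.y)
    return [f.text for f in left + right]


# Two columns (A and B), two lines each, stored in arbitrary order.
page = [
    Fragment("B2", 320, 120),
    Fragment("A1", 50, 100),
    Fragment("B1", 320, 100),
    Fragment("A2", 50, 120),
]

print(naive_order(page))        # ['A1', 'B1', 'A2', 'B2']
print(column_order(page, 300))  # ['A1', 'A2', 'B1', 'B2']
```

The naive ordering interleaves the columns line by line, which is the "alternating lines" failure described above. The column-aware version only works here because the gutter position is known in advance, and inferring it automatically is the hard part.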

Scanned PDFs: you need OCR

If your PDF is a scan with no text layer, none of the above will work. You need optical character recognition (OCR) to convert the image into extractable text.

Options for OCR:

  • Adobe Acrobat Pro: a dependable commercial option. Tools → Enhance Scans → Recognize Text. Produces a searchable PDF with an embedded text layer.
  • Google Drive: upload the PDF to Google Drive, then right-click it and open it with Google Docs. Google's OCR runs automatically and produces a document containing the extracted text. Free, but you are uploading to Google's servers.
  • Microsoft OneNote: insert a PDF page as an image, then right-click it and choose "Copy Text from Picture." Works for individual pages.
  • Tesseract: a free, open-source, command-line OCR engine. It supports many languages but requires installation, and it reads image files rather than PDFs, so scanned pages must be converted to images first.
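For the Tesseract route, a minimal sketch of calling the command-line tool from Python is shown below. It assumes Tesseract is installed and on your PATH, and that the scanned page has already been saved as an image. The helper names `build_tesseract_command` and `ocr_image` and the file name `page-1.png` are hypothetical.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, lang="eng"):
    """Build the argument list for one Tesseract run.

    Passing "stdout" as the output base makes Tesseract print the
    recognised text instead of writing it to a file.
    """
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_image(image_path, lang="eng"):
    """Run Tesseract on one image and return the recognised text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract is not installed or not on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Tesseract fails
    )
    return result.stdout
```

Calling `ocr_image("page-1.png")` would return the recognised text for that page. Everything runs locally, so the scan never leaves your device.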

keptlocal plans to add browser-based OCR as an optional feature — watch this space.

The privacy consideration

Most online PDF-to-text converters upload your document to a server to process it. For PDFs containing sensitive information (legal documents, medical records, financial reports, personal correspondence), that upload is a real privacy concern.

Browser-based extraction (keptlocal), local PDF readers, and Acrobat all keep the file on your device. If the content is sensitive, avoid upload-based tools.

Extract text from any PDF with keptlocal's PDF to Text tool — no upload, instant results. Also useful: PDF to JPG to convert scanned pages to images for processing elsewhere.