PDF to JSON / CSV / Excel

Turn any PDF into structured data — tables, text, and metadata — instantly in your browser.

100% private — processing runs locally in your browser.

Complete Guide to Extracting Data from PDFs

PDFs lock data in a fixed layout, making it painful to reuse content in spreadsheets or code. This tool reverses that by parsing the PDF structure directly in your browser.

Whether you need a single table as a CSV or an entire multi-page report as structured JSON, the extractor preserves layout coordinates so you know exactly where every piece of text lived on the page.

Because everything runs client-side, your sensitive documents — bank statements, legal contracts, internal reports — never touch a remote server.

How Client-Side PDF Parsing Works

This tool is built on pdfjs-dist, the npm build of PDF.js, Mozilla's PDF engine. PDF.js is written in JavaScript and does its parsing in a web worker, so the page stays responsive while it runs. When you click Parse PDF, the engine reads the PDF's internal content streams page by page and extracts each text run along with its font information and position.

The result is a flat list of text items per page. Each item carries a six-number affine transform matrix that encodes its position and scale, plus separate width and height values. PDF coordinates place the origin at the bottom-left of the page, so this tool converts each item into a normalised bounding box (x, y, width, height in PDF points, origin top-left) that is easy to work with in any downstream system.
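As a rough illustration of that conversion, the TypeScript sketch below flips the y axis for one text item. The input fields mirror what pdfjs-dist exposes on a text item (`str`, `transform`, `width`, `height`), but `toTopLeftBBox` and the `BBox` shape are hypothetical names. The sketch assumes unrotated text and treats the baseline as the bottom edge of the box; it is not this tool's actual code.

```typescript
// The subset of a pdf.js text item that this sketch reads.
interface TextItemLike {
  str: string;
  transform: number[]; // [a, b, c, d, e, f]; e and f are the x/y translation
  width: number;
  height: number;
}

interface BBox { x: number; y: number; width: number; height: number; }

// Hypothetical helper: converts a bottom-left-origin item into a
// top-left-origin bounding box. Assumes unrotated text.
function toTopLeftBBox(item: TextItemLike, pageHeight: number): BBox {
  const [, , , , x = 0, baselineY = 0] = item.transform;
  return {
    x,
    // baselineY is measured up from the bottom edge, so flip it and
    // move up by the item height to reach the top edge of the box.
    y: pageHeight - baselineY - item.height,
    width: item.width,
    height: item.height,
  };
}

// A 12 pt run on a US Letter page (792 points tall).
const box = toTopLeftBBox(
  { str: "Total", transform: [12, 0, 0, 12, 72, 700], width: 30, height: 12 },
  792,
);
// box is { x: 72, y: 80, width: 30, height: 12 }
```

Rotated or skewed text would need the full matrix (the a, b, c, d terms), which this sketch ignores.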

How Table Detection Works

PDFs have no native concept of a 'table' — they are just positioned text runs. The table detector reconstructs grid structure from geometry alone.

Items are sorted by Y position and grouped into horizontal bands (rows). A band is considered a table row if it contains at least two items separated by a horizontal gap larger than the column tolerance. Adjacent table bands are merged into a single table region. Within each band, items are bucketed into columns by their X coordinate.

The first fully-populated row is treated as a header row. Detected column count, headers, and bounding boxes are included in the JSON output and used to build CSV and Excel exports.

• Works best on digital PDFs with clear columnar alignment.
• Conservative by design: it only claims a table when there is strong geometric evidence.
• Falls back gracefully: if no tables are detected, text blocks are still exported as a flat sheet in the Excel workbook.
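The heuristic described in this section can be sketched in a few small TypeScript functions. Everything here is illustrative: the `Item` shape, the function names, and the tolerance values (3 and 20 points) are assumptions, and "fully populated" is approximated as "has as many cells as there are columns". The tool's real detector may differ in detail.

```typescript
// Simplified text item: left edge x, top edge y, and width, all in points.
interface Item { text: string; x: number; y: number; width: number; }

// Sort by y, then start a new band whenever an item sits more than
// rowTolerance below the first item of the current band.
function groupIntoBands(items: Item[], rowTolerance: number): Item[][] {
  const sorted = [...items].sort((a, b) => a.y - b.y);
  const bands: Item[][] = [];
  let current: Item[] = [];
  let bandY = 0;
  for (const item of sorted) {
    if (current.length > 0 && item.y - bandY <= rowTolerance) {
      current.push(item);
    } else {
      current = [item];
      bandY = item.y;
      bands.push(current);
    }
  }
  for (const band of bands) band.sort((a, b) => a.x - b.x);
  return bands;
}

// A band counts as a table row if two neighbouring items are separated
// by a horizontal gap wider than columnTolerance.
function isTableRow(band: Item[], columnTolerance: number): boolean {
  let prev: Item | undefined;
  for (const item of band) {
    if (prev !== undefined && item.x - (prev.x + prev.width) > columnTolerance) return true;
    prev = item;
  }
  return false;
}

// Merge runs of adjacent table rows into table regions.
function detectTables(items: Item[], rowTolerance: number, columnTolerance: number): Item[][][] {
  const tables: Item[][][] = [];
  let region: Item[][] = [];
  for (const band of groupIntoBands(items, rowTolerance)) {
    if (isTableRow(band, columnTolerance)) {
      region.push(band);
    } else if (region.length > 0) {
      tables.push(region);
      region = [];
    }
  }
  if (region.length > 0) tables.push(region);
  return tables;
}

// Bucket left edges into columns; edges within columnTolerance share a column.
function columnAnchors(region: Item[][], columnTolerance: number): number[] {
  const xs: number[] = [];
  for (const band of region) for (const item of band) xs.push(item.x);
  xs.sort((a, b) => a - b);
  const anchors: number[] = [];
  let last = -Infinity;
  for (const x of xs) {
    if (x - last > columnTolerance) anchors.push(x);
    last = x;
  }
  return anchors;
}

// Header: the first row with as many cells as there are columns.
function headerRow(region: Item[][], columnCount: number): string[] | undefined {
  return region.find((band) => band.length === columnCount)?.map((item) => item.text);
}

const sampleItems: Item[] = [
  { text: "Report title", x: 72, y: 50, width: 100 },
  { text: "Item", x: 72, y: 100, width: 30 },
  { text: "Qty", x: 200, y: 100, width: 20 },
  { text: "Price", x: 300, y: 100, width: 30 },
  { text: "Apple", x: 72, y: 115, width: 35 },
  { text: "3", x: 200, y: 115, width: 6 },
  { text: "1.20", x: 300, y: 115, width: 24 },
  { text: "Notes follow", x: 72, y: 150, width: 80 },
];
const tables = detectTables(sampleItems, 3, 20);
```

With these sample items, the title and the trailing note are single-item bands, so only the two middle bands form a table region. This is the conservative behaviour described above: a lone line of text is never promoted to a table.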

Export Formats Explained

JSON is the most complete format. It includes every text block with bounding boxes, all table data, and document metadata. Use it to feed PDF content into APIs, LLM pipelines, or databases without writing a parser yourself.
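For readers consuming the JSON from code, the shape might be typed roughly as follows. The names `metadata`, `pages`, `textBlocks`, `tables`, and `plainText` come from the FAQ on this page; the exact key names inside them (`pageCount`, `fileSize`, `text`, `bbox`, `rows`) are guesses for illustration, so check them against a real export before relying on them. `tableRowsAsText` is a hypothetical helper.

```typescript
interface BBox { x: number; y: number; width: number; height: number; }
interface TextBlock { text: string; bbox: BBox; }
interface TableCell { text: string; bbox: BBox; }
interface Table { rows: TableCell[][]; }
interface Page { textBlocks: TextBlock[]; tables: Table[]; plainText: string; }
// Key names below are assumed, not confirmed by the tool's documentation.
interface Metadata { title: string; author: string; pageCount: number; fileSize: number; }
interface ExtractedDocument { metadata: Metadata; pages: Page[]; }

// Hypothetical helper: flattens every table row in the document to plain strings.
function tableRowsAsText(doc: ExtractedDocument): string[][] {
  const out: string[][] = [];
  for (const page of doc.pages) {
    for (const table of page.tables) {
      for (const row of table.rows) out.push(row.map((c) => c.text));
    }
  }
  return out;
}

// Small hand-written sample in the assumed shape.
const cell = (text: string, x: number, y: number): TableCell =>
  ({ text, bbox: { x, y, width: 40, height: 12 } });

const sample: ExtractedDocument = {
  metadata: { title: "Invoice", author: "", pageCount: 1, fileSize: 1024 },
  pages: [{
    textBlocks: [{ text: "Invoice", bbox: { x: 72, y: 40, width: 60, height: 14 } }],
    tables: [{ rows: [
      [cell("Item", 72, 100), cell("Qty", 200, 100)],
      [cell("Apple", 72, 115), cell("3", 200, 115)],
    ] }],
    plainText: "Invoice\nItem Qty\nApple 3",
  }],
};

const rowsAsText = tableRowsAsText(sample);
// rowsAsText is [["Item", "Qty"], ["Apple", "3"]]
```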

CSV is generated per table using papaparse. Merged cells (colSpan) are duplicated across columns so every row has the same number of fields, making the output safe to import into Excel, Google Sheets, or pandas.
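The merged-cell handling can be pictured with a short sketch. The `Cell` shape and function names are assumptions, and the hand-rolled quoting below only stands in for papaparse (whose `Papa.unparse` would normally do the serialising) so that the example runs without dependencies.

```typescript
// Assumed cell shape: colSpan is omitted for ordinary cells.
interface Cell { text: string; colSpan?: number; }

// Repeat a merged cell's text once per column it spans, so every
// row ends up with the same number of fields.
function expandMergedCells(row: Cell[]): string[] {
  const fields: string[] = [];
  for (const cell of row) {
    const span = Math.max(1, cell.colSpan ?? 1);
    for (let i = 0; i < span; i++) fields.push(cell.text);
  }
  return fields;
}

// Minimal RFC 4180-style quoting: wrap fields containing a quote,
// comma, or line break, and double any embedded quotes.
function toCsvField(value: string): string {
  return /[",\r\n]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

function tableToCsv(rows: Cell[][]): string {
  return rows
    .map((row) => expandMergedCells(row).map(toCsvField).join(","))
    .join("\r\n");
}

const csv = tableToCsv([
  [{ text: "Region", colSpan: 2 }, { text: "Total" }],
  [{ text: "North" }, { text: "Q1, Q2" }, { text: "10" }],
]);
// csv is: Region,Region,Total\r\nNorth,"Q1, Q2",10
```

Duplicating the merged header keeps the file rectangular, which is what spreadsheet importers and pandas expect, at the cost of a repeated label.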

Excel (.xlsx) is generated by SheetJS with one sheet per detected table, one sheet per page of text blocks, and a Metadata sheet. Column widths are auto-fitted (capped at 60 characters). Download it and open directly in Microsoft Excel or LibreOffice Calc.
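One plausible way to compute the auto-fitted widths is to take the longest value in each column and clamp it at the cap. The function name and the use of raw string length as the width measure are assumptions; the 60-character cap is the figure quoted above.

```typescript
const MAX_WIDTH_CHARS = 60;

// Width of each column = longest cell in that column, capped at maxChars.
function autoFitWidths(rows: string[][], maxChars: number = MAX_WIDTH_CHARS): number[] {
  const widths: number[] = [];
  for (const row of rows) {
    row.forEach((value, col) => {
      const current = widths[col] ?? 0;
      widths[col] = Math.min(maxChars, Math.max(current, value.length));
    });
  }
  return widths;
}

const widths = autoFitWidths([
  ["Name", "Description"],
  ["A", "x".repeat(80)],
]);
// widths is [4, 60]

// With SheetJS, widths like these are typically applied through the
// worksheet's "!cols" property, e.g.:
//   worksheet["!cols"] = widths.map((wch) => ({ wch }));
```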

Scanned PDFs and OCR

If your PDF was created by scanning a physical document, it contains images of text rather than actual text characters. pdfjs-dist cannot extract text from images, so the parser will return empty or near-empty text blocks.

The tool detects this automatically: if the average number of extracted characters per page falls below 20, it shows OCR guidance. To process a scanned PDF, first run it through an OCR tool to create a searchable PDF, then re-upload it here.
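The check itself is simple arithmetic. A minimal sketch, assuming the per-page text is already available as strings; the function name, the whitespace trimming, and the treatment of an empty document are illustrative choices rather than documented behaviour:

```typescript
const MIN_AVG_CHARS_PER_PAGE = 20;

// Returns true when the average extracted text per page is so small
// that the PDF is probably a scan without a text layer.
function looksScanned(pageTexts: string[], threshold: number = MIN_AVG_CHARS_PER_PAGE): boolean {
  if (pageTexts.length === 0) return false; // nothing to judge
  const total = pageTexts.reduce((sum, text) => sum + text.trim().length, 0);
  return total / pageTexts.length < threshold;
}

// Two pages with 2 characters in total: average 1, well below 20.
const scanned = looksScanned(["", "12"]);
// scanned is true
```

Averaging across pages means a digital report with one blank or image-only page is not flagged.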

OCR is intentionally handled as a separate workflow so this parser stays fast and lightweight for the majority of digital PDFs.

Performance and Large Files

The parser processes one page at a time and releases each page's memory with `pdfPage.cleanup()` immediately after extraction. It yields to the browser every 5 pages via `requestAnimationFrame`, keeping the UI responsive even on 200-page documents.
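The loop described above might look something like this. To keep the sketch runnable outside a browser, the page loader, the extractor, and the yield function are all passed in; `processPages` and `PageLike` are hypothetical names. In the browser, `loadPage` would typically wrap the pdfjs-dist document's `getPage(n)` (pages are numbered from 1) and the yield function would wrap `requestAnimationFrame`.

```typescript
// Anything with a cleanup() method, such as a pdf.js page proxy.
interface PageLike { cleanup(): void; }

async function processPages<P extends PageLike, T>(
  pageCount: number,
  loadPage: (pageNumber: number) => Promise<P>,
  extract: (page: P) => Promise<T>,
  yieldToBrowser: () => Promise<void>,
  yieldEvery: number = 5,
): Promise<T[]> {
  const results: T[] = [];
  for (let n = 1; n <= pageCount; n++) {
    const page = await loadPage(n);
    try {
      results.push(await extract(page));
    } finally {
      // Release the page's resources even if extraction throws.
      page.cleanup();
    }
    // Hand control back to the browser after every yieldEvery pages.
    if (n % yieldEvery === 0) await yieldToBrowser();
  }
  return results;
}

// In a browser, a suitable yield function could be:
//   const nextFrame = () =>
//     new Promise<void>((resolve) => requestAnimationFrame(() => resolve()));
```

Only one page is held at a time, so peak memory depends on the largest page rather than on the page count.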

A typical 50-page digital PDF parses in 2–5 seconds on a mid-range laptop. Scanned PDFs with large embedded images take longer because pdfjs must decode each image stream even though no text is extracted.

Files are read into an `ArrayBuffer` in a single `FileReader` pass and wrapped in a `Uint8Array`. A `.slice()` copy is handed to pdfjs-dist, allowing the garbage collector to reclaim the original buffer as soon as parsing starts.

How It Works

Step 1

Upload a PDF (drag & drop or browse) — up to 500 MB, PDF only.

Step 2

Click Parse PDF. The tool extracts all text blocks, detects tables using a columnar grid heuristic, and reads document metadata — entirely client-side.

Step 3

Download the full JSON, individual table CSVs, or a multi-sheet Excel workbook with one click.

Why This Tool

• 100% private — your file never leaves your device.
• Works offline after the first page load.
• Extracts tables, text blocks, and document metadata in one click.
• Exports to JSON, CSV per table, or a full multi-sheet Excel workbook.
• Handles PDFs up to 500 MB with streaming page-by-page processing.
• Preserves layout bounding-box coordinates for every text run.

Use Cases

• Pulling financial tables out of bank statements or annual reports.
• Converting government-form PDFs into importable spreadsheet data.
• Extracting product data from supplier catalogues into JSON for APIs.
• Archiving research papers as structured, searchable JSON.
• Feeding PDF content into LLM pipelines without a backend.

Frequently Asked Questions

Common questions about the PDF to JSON tool — how it works, privacy, file limits, and more.

Is my PDF uploaded to a server?

No. All processing happens in your browser using pdfjs-dist (Mozilla's JavaScript PDF engine). Your file is never transmitted anywhere, not even to ShellPDFs servers.

What does the JSON output look like?

The root object has a metadata field (title, author, page count, file size) and a pages array. Each page contains textBlocks (text string + bounding box), tables (rows of cells with text + bbox), and a plainText convenience string.

How does table detection work?

The table detector uses a columnar grid heuristic: it groups text items by y-band (rows) and x-gap (columns). It works best on digital PDFs with clear column alignment. If your PDF is image-based (scanned), run it through an OCR tool first to create a searchable PDF.

Why was no text extracted from my PDF?

If fewer than 20 characters per page are extracted on average, the file likely contains scanned images rather than selectable text. Process it with an OCR tool (e.g. Adobe Acrobat, iLovePDF) to make the text selectable, then re-run the parser.

What is the maximum file size?

Up to 500 MB. The parser processes one page at a time and yields to the browser every 5 pages to keep the UI responsive. Very large documents may take up to a minute depending on your device.

What does the Excel export contain?

The Excel workbook has one sheet per detected table (labelled 'Page N – Table M'), one sheet per page of plain text, and a Metadata sheet summarising the document. Column widths are auto-fitted.
