PDF to JSON / CSV / Excel
Turn any PDF into structured data — tables, text, and metadata — instantly in your browser.
Complete Guide to Extracting Data from PDFs
PDFs lock data in a fixed layout, making it painful to reuse content in spreadsheets or code. This tool reverses that by parsing the PDF structure directly in your browser.
Whether you need a single table as a CSV or an entire multi-page report as structured JSON, the extractor preserves layout coordinates so you know exactly where every piece of text lived on the page.
Because everything runs client-side, your sensitive documents — bank statements, legal contracts, internal reports — never touch a remote server.
How Client-Side PDF Parsing Works
pdfjs-dist (the npm build of Mozilla's PDF.js engine) is written mainly in JavaScript and does its parsing in a web worker, so modern browsers can run it quickly without blocking the page. When you click Parse PDF, the engine reads the PDF's internal content streams page by page, extracting each text run along with its font name and position.
The result is a flat list of text items per page, each with an affine transform matrix that encodes its position and scale, plus separate width and height values. PDF coordinates natively start at the bottom-left of the page, so this tool flips the y-axis and converts each item into a normalised bounding box (x, y, width, height in PDF points, origin top-left) that is easy to work with in any downstream system.
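The conversion can be sketched as follows. This is a minimal illustration, not the tool's actual code: `TextItemLike`, `BBox`, and `toTopLeftBBox` are hypothetical names, the text is assumed to be unrotated, and the baseline is treated as the bottom edge of the box.

```typescript
// Hypothetical types for illustration; they mirror the transform/width/height
// fields of a pdf.js text item but are not the library's own declarations.
interface TextItemLike {
  str: string;
  // [a, b, c, d, e, f]: e and f are the x/y translation in PDF points.
  transform: [number, number, number, number, number, number];
  width: number;
  height: number;
}

interface BBox {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Convert a bottom-left-origin text item into a top-left-origin bounding box.
// Assumes unrotated text and uses the baseline as the bottom of the box.
function toTopLeftBBox(item: TextItemLike, pageHeight: number): BBox {
  const e = item.transform[4];
  const f = item.transform[5];
  return {
    x: e,
    y: pageHeight - f - item.height, // flip the y-axis
    width: item.width,
    height: item.height,
  };
}

// A 12pt run whose baseline sits 700pt above the bottom of a US Letter page (792pt tall).
const box = toTopLeftBBox(
  { str: "Total", transform: [12, 0, 0, 12, 72, 700], width: 30, height: 12 },
  792
);
console.log(box); // { x: 72, y: 80, width: 30, height: 12 }
```

Rotated or skewed text would need the full matrix (the a, b, c, d terms), which this sketch deliberately ignores.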
How Table Detection Works
PDFs have no native concept of a 'table' — they are just positioned text runs. The table detector reconstructs grid structure from geometry alone.
Items are sorted by Y position into horizontal bands (rows). A band is considered a table row if it contains at least two items separated by a horizontal gap larger than the column tolerance. Adjacent table bands are merged into a single table region. Within each band, items are bucketed into columns by their X coordinate.
The first fully-populated row is treated as a header row. Detected column count, headers, and bounding boxes are included in the JSON output and used to build CSV and Excel exports.
- Works best on digital PDFs with clear columnar alignment.
- Conservative by design — it only claims a table when there is strong geometric evidence.
- Falls back gracefully: if no tables are detected, text blocks are still exported as a flat sheet in the Excel workbook.
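The banding, merging, and column-bucketing steps described above can be sketched roughly as below. The names (`Item`, `groupRows`, `isTableRow`, `detectTables`, `columnStarts`) and the two tolerance parameters are assumptions made for illustration; the tool's real heuristic may differ in its details.

```typescript
// Hypothetical item shape: top-left x/y plus width, all in PDF points.
interface Item {
  text: string;
  x: number;
  y: number;
  width: number;
}

// Sort by Y and group items whose Y lies within rowTol of the band's first item.
function groupRows(items: Item[], rowTol: number): Item[][] {
  const sorted = items.slice().sort((a, b) => a.y - b.y);
  const rows: Item[][] = [];
  let bandY = Number.NEGATIVE_INFINITY;
  for (const it of sorted) {
    if (rows.length > 0 && it.y - bandY <= rowTol) {
      rows[rows.length - 1].push(it);
    } else {
      rows.push([it]);
      bandY = it.y;
    }
  }
  for (const row of rows) row.sort((a, b) => a.x - b.x);
  return rows;
}

// A band counts as a table row if two neighbouring items are separated
// by a horizontal gap larger than the column tolerance.
function isTableRow(row: Item[], colTol: number): boolean {
  for (let i = 1; i < row.length; i++) {
    const gap = row[i].x - (row[i - 1].x + row[i - 1].width);
    if (gap > colTol) return true;
  }
  return false;
}

// Merge runs of adjacent table bands into table regions.
function detectTables(items: Item[], rowTol: number, colTol: number): Item[][][] {
  const tables: Item[][][] = [];
  let current: Item[][] = [];
  for (const row of groupRows(items, rowTol)) {
    if (isTableRow(row, colTol)) {
      current.push(row);
    } else if (current.length > 0) {
      tables.push(current);
      current = [];
    }
  }
  if (current.length > 0) tables.push(current);
  return tables;
}

// Bucket X coordinates into columns: a new column starts when an X value
// lies more than colTol to the right of the previous column start.
function columnStarts(table: Item[][], colTol: number): number[] {
  const xs: number[] = [];
  for (const row of table) for (const it of row) xs.push(it.x);
  xs.sort((a, b) => a - b);
  const starts: number[] = [];
  for (const x of xs) {
    if (starts.length === 0 || x - starts[starts.length - 1] > colTol) starts.push(x);
  }
  return starts;
}

const items: Item[] = [
  { text: "Statement", x: 72, y: 50, width: 80 },
  { text: "Date", x: 72, y: 100, width: 30 },
  { text: "Amount", x: 200, y: 100, width: 40 },
  { text: "01/02", x: 72, y: 115, width: 30 },
  { text: "9.50", x: 200, y: 115, width: 25 },
  { text: "Thank you for banking with us.", x: 72, y: 150, width: 160 },
];
const tables = detectTables(items, 2, 10);
console.log(tables.length, columnStarts(tables[0], 10)); // 1 [ 72, 200 ]
```

The single-item title and footer bands are rejected, which is the conservative behaviour the list above describes.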
Export Formats Explained
JSON is the most complete format. It includes every text block with bounding boxes, all table data, and document metadata. Use it to feed PDF content into APIs, LLM pipelines, or databases without writing a parser yourself.
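As a rough illustration of consuming the JSON downstream, the sketch below flattens each page into plain text in reading order. The output schema is not documented here, so every interface and field name (`ExtractedDocument`, `pages`, `blocks`, `bbox`, and so on) is an assumption; adjust them to match the file you actually download.

```typescript
// Hypothetical output shape: all field names are assumptions for illustration.
interface BBox { x: number; y: number; width: number; height: number; }
interface TextBlock { text: string; bbox: BBox; }
interface ExtractedPage { pageNumber: number; blocks: TextBlock[]; }
interface ExtractedDocument { metadata: Record<string, string>; pages: ExtractedPage[]; }

// Join each page's text blocks top-to-bottom, then left-to-right.
function pageTexts(doc: ExtractedDocument): string[] {
  return doc.pages.map(page =>
    page.blocks
      .slice()
      .sort((a, b) => a.bbox.y - b.bbox.y || a.bbox.x - b.bbox.x)
      .map(block => block.text)
      .join("\n")
  );
}

// A tiny stand-in for a downloaded JSON file.
const raw = JSON.stringify({
  metadata: { title: "Invoice" },
  pages: [{
    pageNumber: 1,
    blocks: [
      { text: "Total: 9.50", bbox: { x: 72, y: 120, width: 60, height: 12 } },
      { text: "Invoice 42", bbox: { x: 72, y: 80, width: 60, height: 12 } },
    ],
  }],
});

const doc = JSON.parse(raw) as ExtractedDocument;
console.log(pageTexts(doc)[0]); // "Invoice 42" on the first line, "Total: 9.50" on the second
```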
CSV is generated per table using papaparse. Merged cells (colSpan) are duplicated across columns so every row has the same number of fields, making the output safe to import into Excel, Google Sheets, or pandas.
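The colSpan duplication can be sketched like this. `Cell`, `expandRow`, `csvField`, and `toCsv` are hypothetical names; in the tool the serialisation is done by papaparse, and the small escaper here is only a stand-in to keep the example free of dependencies.

```typescript
// Hypothetical cell shape: colSpan defaults to 1 when absent.
interface Cell {
  text: string;
  colSpan?: number;
}

// Repeat a merged cell's text once per spanned column so every row
// ends up with the same number of fields.
function expandRow(cells: Cell[]): string[] {
  const out: string[] = [];
  for (const cell of cells) {
    const span = Math.max(1, cell.colSpan ?? 1);
    for (let i = 0; i < span; i++) out.push(cell.text);
  }
  return out;
}

// Minimal CSV quoting: wrap fields containing quotes, commas, or newlines.
function csvField(value: string): string {
  return /[",\r\n]/.test(value) ? '"' + value.replace(/"/g, '""') + '"' : value;
}

function toCsv(rows: Cell[][]): string {
  return rows.map(row => expandRow(row).map(csvField).join(",")).join("\r\n");
}

const csv = toCsv([
  [{ text: "Q1 Totals", colSpan: 2 }, { text: "Notes" }],
  [{ text: "1,200" }, { text: "950" }, { text: "ok" }],
]);
console.log(csv);
// Q1 Totals,Q1 Totals,Notes
// "1,200",950,ok
```

Duplicating rather than blanking the spanned columns keeps the header meaningful for every column after import.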
Excel (.xlsx) is generated by SheetJS with one sheet per detected table, one sheet per page of text blocks, and a Metadata sheet. Column widths are auto-fitted (capped at 60 characters). Download it and open directly in Microsoft Excel or LibreOffice Calc.
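One plausible way to compute the capped column widths is shown below. `fitColumnWidths` is a hypothetical helper, and measuring width as the longest cell's character count is an assumption. The `{ wch }` objects match the shape SheetJS reads from a worksheet's `!cols` property.

```typescript
// Compute one width per column from the longest cell, capped at maxChars.
function fitColumnWidths(rows: string[][], maxChars = 60): { wch: number }[] {
  const widths: number[] = [];
  for (const row of rows) {
    row.forEach((cell, col) => {
      widths[col] = Math.max(widths[col] ?? 0, cell.length);
    });
  }
  return widths.map(w => ({ wch: Math.min(w, maxChars) }));
}

const cols = fitColumnWidths([
  ["Date", "Description"],
  ["2024-01-01", "x".repeat(80)],
]);
console.log(cols); // [ { wch: 10 }, { wch: 60 } ]

// With SheetJS, the result would be assigned to the worksheet before writing:
//   worksheet["!cols"] = cols;
```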
Scanned PDFs and OCR
If your PDF was created by scanning a physical document, it contains images of text rather than actual text characters. pdfjs-dist cannot extract text from images, so the parser will return empty or near-empty text blocks.
The tool detects this automatically: if the average number of characters per page falls below 20, it shows OCR guidance. To process a scanned PDF, first run it through an OCR tool to create a searchable PDF, then re-upload it here.
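The check amounts to a simple average, sketched below. `looksScanned` is a hypothetical name, and whether whitespace counts towards the total is an assumption (this version ignores it).

```typescript
// Return true when the average non-whitespace character count per page
// falls below the threshold (20 by default, as described above).
function looksScanned(pageTexts: string[], minAvgChars = 20): boolean {
  if (pageTexts.length === 0) return false; // nothing to judge
  const total = pageTexts.reduce(
    (sum, text) => sum + text.replace(/\s+/g, "").length,
    0
  );
  return total / pageTexts.length < minAvgChars;
}

console.log(looksScanned(["", "3", ""])); // true
console.log(looksScanned(["A".repeat(500), "B".repeat(300)])); // false
```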
OCR is intentionally handled as a separate workflow so this parser stays fast and lightweight for the majority of digital PDFs.
Performance and Large Files
The parser processes one page at a time and releases each page's memory with `pdfPage.cleanup()` immediately after extraction. It yields to the browser every 5 pages via `requestAnimationFrame`, keeping the UI responsive even on 200-page documents.
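The page loop might look something like the sketch below. `processPages` and its parameters are hypothetical; the per-page callback stands in for the extraction and `pdfPage.cleanup()` call, and the yield function is injected so the loop itself has no browser dependency.

```typescript
// Process pages strictly one at a time, pausing every `yieldEvery` pages
// so the browser can paint and handle input.
async function processPages(
  pageCount: number,
  processPage: (pageNumber: number) => Promise<void>,
  yieldToBrowser: () => Promise<void>,
  yieldEvery = 5
): Promise<void> {
  for (let n = 1; n <= pageCount; n++) {
    // In the real tool this is where text is extracted and the page is released.
    await processPage(n);
    // No need to yield after the final page.
    if (n % yieldEvery === 0 && n < pageCount) {
      await yieldToBrowser();
    }
  }
}

// In a browser, the yield function could wrap requestAnimationFrame:
//   const nextFrame = () =>
//     new Promise<void>(resolve => requestAnimationFrame(() => resolve()));
//   await processPages(pdf.numPages, extractPage, nextFrame);
// (`pdf` and `extractPage` are placeholders for the caller's own objects.)
```

Awaiting each page before starting the next keeps at most one page's data alive at a time, which is what makes the per-page cleanup effective.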
A typical 50-page digital PDF parses in 2–5 seconds on a mid-range laptop. Scanned PDFs with large embedded images take longer because pdfjs must decode each image stream even though no text is extracted.
Files are read into a `Uint8Array` in a single `FileReader` pass, and a `.slice()` copy is handed to pdfjs-dist, allowing the garbage collector to reclaim the original buffer as soon as parsing starts.
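The key property of `.slice()` on a typed array is that it allocates a new underlying buffer, so the copy and the original are independent. A small sketch, with `copyForParser` as a hypothetical helper name and the hand-off to the parser omitted:

```typescript
// Return an independent copy of the bytes, backed by its own ArrayBuffer.
function copyForParser(original: Uint8Array): Uint8Array {
  return original.slice();
}

const original = new Uint8Array([0x25, 0x50, 0x44, 0x46]); // "%PDF"
const copy = copyForParser(original);

copy[0] = 0; // mutating the copy leaves the original untouched
console.log(original[0] === 0x25); // true
console.log(copy.buffer !== original.buffer); // true
```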
How It Works
Step 1: Upload a PDF (drag & drop or browse), up to 500 MB, PDF only.
Step 2: Click Parse PDF. The tool extracts all text blocks, detects tables using a columnar grid heuristic, and reads document metadata, entirely client-side.
Step 3: Download the full JSON, individual table CSVs, or a multi-sheet Excel workbook with one click.
Why This Tool
- 100% private: your file never leaves your device.
- Works offline after the first page load.
- Extracts tables, text blocks, and document metadata in one click.
- Exports to JSON, CSV per table, or a full multi-sheet Excel workbook.
- Handles PDFs up to 500 MB with streaming page-by-page processing.
- Preserves layout bounding-box coordinates for every text run.
Use Cases
- Pulling financial tables out of bank statements or annual reports.
- Converting government-form PDFs into importable spreadsheet data.
- Extracting product data from supplier catalogues into JSON for APIs.
- Archiving research papers as structured, searchable JSON.
- Feeding PDF content into LLM pipelines without a backend.
Privacy, file deletion, and support
Browser-based tools never upload your file. Server-assisted tools run in isolated workers with short-lived storage and deletion rules documented in our public policies.
