Best PDF text extractor in Python

Dheeraj Bhat
Jan 29, 2023
PyMuPDF

Extracting the complete text from a PDF for further use requires a tool that can convert the PDF into text efficiently, and finding an open-source tool that does this well is harder still. PyMuPDF is one such tool.
Some advantages of using the PyMuPDF extractor are:

  • Supports different formats such as PDF, XPS, OpenXPS, CBZ, EPUB, and FictionBook 2
  • Converts documents to formats such as HTML, SVG, PDF, and CBZ
  • Search for text within PDFs
  • Text and Image extraction
  • OCR support (additional Tesseract installation required)
  • PDF editing, annotation and manipulation
  • Encryption and decryption of documents
  • Access to document meta information, such as font name, font properties, and position information

Installation

PyMuPDF can be installed from the PyPI repository using the following command:

pip install PyMuPDF

Usage and Examples

1. Extracting text from a PDF:

import fitz

file = 'file.pdf'
doc = fitz.open(file)
for idx, page in enumerate(doc):
    # get_text('text') returns the page content as plain text
    text = page.get_text('text')
    print('Page no:', idx)
    print(text)
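
If you want the whole document as one string rather than printed page by page, the loop above can be wrapped in a small helper. This is only a sketch: the function names `pages_to_text` and `extract_pdf_text` are my own, not part of PyMuPDF, and the second one assumes the document object can be used as a context manager.

```python
def pages_to_text(pages, separator="\n\n"):
    """Join the plain text of every page into a single string."""
    return separator.join(page.get_text("text") for page in pages)


def extract_pdf_text(path):
    """Open a PDF with PyMuPDF and return all of its text (hypothetical helper)."""
    # Imported here so pages_to_text stays usable without PyMuPDF installed.
    import fitz

    # Assumption: the with-block closes the document when it ends.
    with fitz.open(path) as doc:
        return pages_to_text(doc)
```

Calling `extract_pdf_text('file.pdf')` should then give you the text of every page, separated by blank lines.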

2. Accessing metadata:

import fitz

file = 'file.pdf'
doc = fitz.open(file)
print('Metadata:', doc.metadata)
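
The metadata comes back as a dictionary, and fields the PDF does not set may be empty. A small filter makes the output easier to read. The helper name `non_empty_metadata` is hypothetical, and the sketch only assumes that `doc.metadata` behaves like a plain dict.

```python
def non_empty_metadata(metadata):
    """Return only the metadata entries that actually carry a value."""
    # `metadata or {}` guards against a missing metadata dictionary.
    return {key: value for key, value in (metadata or {}).items() if value}
```

For example, `print(non_empty_metadata(doc.metadata))` would skip fields that are empty strings or None.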

3. Accessing a specific page in a document:

import fitz

file = 'file.pdf'
doc = fitz.open(file)
page_no = 2  # page indices start at 0, so this is the third page
page = doc[page_no]
print('Text:', page.get_text('text'))
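
Because pages are indexed from zero, it is easy to be off by one when working from the page numbers shown in a PDF viewer. A minimal sketch of a 1-based lookup with a clear error message follows; `get_page` is a name I made up, and it relies only on `len(doc)` and `doc[index]`.

```python
def get_page(doc, page_number):
    """Return the page for a 1-based page number."""
    page_count = len(doc)
    if not 1 <= page_number <= page_count:
        raise IndexError(
            f"page {page_number} is out of range (document has {page_count} pages)"
        )
    # Convert the human-friendly number to the zero-based index.
    return doc[page_number - 1]
```

With this, `get_page(doc, 3)` returns the same page as `doc[2]`.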

4. Extracting text in different formats:

import fitz

file = 'file.pdf'
doc = fitz.open(file)
page_no = 1
page = doc[page_no]

# extracting text in plain format
print('Text:', page.get_text('text'))
# extracting text in html format
print('Text:', page.get_text('html'))
# extracting text in xml format
print('Text:', page.get_text('xml'))
# extracting text in json format
print('Text:', page.get_text('json'))
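
The JSON output is a string, so it has to be parsed before you can use the font and position details in it. The sketch below assumes the layout I have seen for this format: a top-level "blocks" list, each text block holding "lines", and each line holding "spans" with "text", "font", and "size" keys. The helper name `list_spans` is hypothetical.

```python
import json


def list_spans(page_json):
    """Return (text, font, size) for every text span in a page's JSON output."""
    data = json.loads(page_json)
    spans = []
    for block in data.get("blocks", []):
        # Image blocks are assumed to have no "lines" entry, so they are skipped.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                spans.append((span.get("text", ""), span.get("font"), span.get("size")))
    return spans
```

You would call it as `list_spans(page.get_text('json'))` to see which font and size each piece of text uses.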

5. Searching for a string in the pages of a PDF:

import fitz

file = 'file.pdf'
doc = fitz.open(file)
search_string = 'pymupdf'

for idx, page in enumerate(doc):
    # search_for returns the rectangles where the string occurs on the page
    result = page.search_for(search_string)
    print('Result:', result)
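
Most pages will usually return an empty list, so it can be handy to keep only the pages that contain a match. A minimal sketch, with `find_pages_with` being a name of my own choosing:

```python
def find_pages_with(pages, needle):
    """Map zero-based page index -> list of hit rectangles, for pages with matches."""
    hits = {}
    for idx, page in enumerate(pages):
        rects = page.search_for(needle)
        if rects:  # skip pages where nothing was found
            hits[idx] = rects
    return hits
```

`find_pages_with(doc, 'pymupdf')` then gives you the page indices to look at, along with the location of each hit.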

These examples should help you get started with the PyMuPDF extractor. This open-source framework has a plethora of functionality that can be used for a variety of purposes. Check out its GitHub repository for more details!

Performance

PyMuPDF is the fastest of the available open-source frameworks. There are other libraries, such as PyPDF2, pikepdf, pdfminer.six, and pdfplumber, to name a few, but this benchmark clearly shows that PyMuPDF outperforms all of them in speed (as of writing this blog). So, if you are looking for an open-source library for text extraction, PyMuPDF is the go-to library for the task!
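
If you would like to check the speed on your own documents, a rough timing harness is enough to compare libraries. This is a sketch, not a rigorous benchmark: `time_extraction` and `pymupdf_extract` are hypothetical names, and taking the best of a few runs is simply my choice for reducing noise.

```python
import time


def time_extraction(extract, path, repeats=3):
    """Run extract(path) several times and return the best wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        extract(path)
        timings.append(time.perf_counter() - start)
    return min(timings)


def pymupdf_extract(path):
    """Extract plain text from every page with PyMuPDF."""
    import fitz  # imported lazily so the timing helper works without PyMuPDF

    with fitz.open(path) as doc:
        return [page.get_text("text") for page in doc]
```

Running `time_extraction(pymupdf_extract, 'file.pdf')` and then the same call with an equivalent function for another library gives a like-for-like comparison on your own files.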
