Best PDF text extractor in Python

Jan 29, 2023

Extracting the complete text of a PDF for further processing requires a tool that can convert the document to text efficiently, and finding an open-source tool that does this well is harder still. PyMuPDF is one such tool.
Some advantages of using PyMuPDF are:

Supports different formats such as PDF, XPS, OpenXPS, CBZ, EPUB, and FictionBook 2
Convert documents to formats such as HTML, SVG, PDF, and CBZ
Search for text within PDFs
Text and Image extraction
OCR support (additional Tesseract installation required)
PDF editing, annotation and manipulation
Encryption and decryption of documents
Access to document meta information — like font name, font properties, position information etc.

Installation

PyMuPDF can be installed from the PyPI repository using the following command:

pip install PyMuPDF
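
If you want to confirm the install before running the examples, a small check like the one below may help. It is only a sketch: `pymupdf_available` is a hypothetical helper name, and it relies on the fact that the examples in this post import the library under the module name `fitz`.

```python
import importlib.util


def pymupdf_available() -> bool:
    """Return True if the `fitz` module used in this post can be located."""
    # find_spec returns None when no top-level module of that name is installed
    return importlib.util.find_spec("fitz") is not None


if __name__ == "__main__":
    print("PyMuPDF importable:", pymupdf_available())
```

If this prints False, re-run the pip command inside the same environment that your script uses.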

Usage and Examples

1. Extracting text from PDF:

import fitz

file = 'file.pdf'
doc = fitz.open(file)
for idx, page in enumerate(doc):
    text = page.get_text('text')
    print('Page no:', idx)
    print(text)
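
For further use you will usually want the text in one string rather than printed page by page. The sketch below is one possible way to do that; `join_pages`, `pdf_to_text` and the separator format are hypothetical names chosen for illustration, not part of PyMuPDF.

```python
def join_pages(doc, separator="\n--- page {n} ---\n"):
    """Concatenate the plain text of every page, labelling pages from 1."""
    parts = []
    for idx, page in enumerate(doc):
        parts.append(separator.format(n=idx + 1))  # enumerate is zero-based
        parts.append(page.get_text("text"))
    return "".join(parts)


def pdf_to_text(path):
    """Open `path` with PyMuPDF and return its full text (hypothetical helper)."""
    import fitz  # imported lazily so join_pages can be used without PyMuPDF

    # the context manager closes the document when the block ends
    with fitz.open(path) as doc:
        return join_pages(doc)
```

Calling `pdf_to_text('file.pdf')` should give you the whole document as a single string that you can write to a file or pass to the next step of your pipeline.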

2. Accessing metadata:

import fitz

file = 'file.pdf'
doc = fitz.open(file)
print('Metadata:', doc.metadata)
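
`doc.metadata` is a dictionary, and many PDFs leave several of its fields empty. A minimal sketch for showing only the populated fields follows; `non_empty_metadata` and `print_metadata` are hypothetical helper names.

```python
def non_empty_metadata(metadata):
    """Return only the metadata entries that actually carry a value."""
    # `or {}` guards against a document that reports no metadata at all
    return {key: value for key, value in (metadata or {}).items() if value}


def print_metadata(path):
    """Print the populated metadata fields of a PDF (hypothetical helper)."""
    import fitz  # imported lazily so the filter above works without PyMuPDF

    with fitz.open(path) as doc:
        for key, value in non_empty_metadata(doc.metadata).items():
            print(f"{key}: {value}")
```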

3. Accessing a specific page in a document:

import fitz

file = 'file.pdf'
doc = fitz.open(file)
page_no = 2
page = doc[page_no]
print('Text:', page.get_text('text'))
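
Keep in mind that page indices are zero-based, so `page_no = 2` above refers to the third page. If the index may come from user input, a small guard such as the sketch below can give a clearer error; `get_page_text` is a hypothetical helper, and it assumes the document supports `len()` and indexing as in the example above.

```python
def get_page_text(doc, page_no):
    """Return the plain text of zero-based page `page_no`, or raise IndexError."""
    page_count = len(doc)
    if not 0 <= page_no < page_count:
        raise IndexError(
            f"page {page_no} requested, but the document has {page_count} page(s)"
        )
    return doc[page_no].get_text("text")
```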

4. Extracting text in different formats:

import fitz

file = 'file.pdf'
doc = fitz.open(file)
page_no = 1
page = doc[page_no]

# extracting text in plain format
print('Text:', page.get_text('text'))
# extracting text in html format
print('HTML:', page.get_text('html'))
# extracting text in xml format
print('XML:', page.get_text('xml'))
# extracting text in json format
print('JSON:', page.get_text('json'))
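
The structured formats are where font and position information come from. As a rough sketch, the snippet below uses the `'dict'` option of `get_text`, which returns a Python dictionary of blocks, lines and spans, to collect the fonts used on a page. `fonts_on_page` is a hypothetical helper, and rounding the size to one decimal is just a choice made here for readability.

```python
def fonts_on_page(page):
    """Return the set of (font name, size) pairs used by text spans on a page."""
    fonts = set()
    for block in page.get_text("dict")["blocks"]:
        # image blocks carry no "lines" key, so fall back to an empty list
        for line in block.get("lines", []):
            for span in line["spans"]:
                fonts.add((span["font"], round(span["size"], 1)))
    return fonts
```

With a real document you could call `fonts_on_page(doc[0])` to see which fonts appear on the first page.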

5. Searching for a string in each page of a PDF:

import fitz

file = 'file.pdf'
doc = fitz.open(file)
search_string = 'pymupdf'

for idx, page in enumerate(doc):
    result = page.search_for(search_string)
    print('Result:', result)
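
`search_for` returns a list of rectangles marking where the string was found on the page, so an empty list means no match. One possible way to gather the hits for the whole document together with their page index is sketched below; `find_in_document` is a hypothetical helper name.

```python
def find_in_document(doc, needle):
    """Return (zero-based page index, rectangle) pairs for every hit of `needle`."""
    hits = []
    for idx, page in enumerate(doc):
        for rect in page.search_for(needle):
            hits.append((idx, rect))
    return hits
```

For example, `find_in_document(doc, search_string)` with the `doc` opened above gives you every location at once, which is handy if you later want to highlight or crop the matches.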

These examples should help you get started with PyMuPDF. This open-source library offers many more features that can be used for a variety of purposes. Check out the GitHub repository [1] for more details!

Performance

There are other open-source libraries for this task, such as PyPDF2, pikepdf, pdfminer.six, and pdfplumber, to name a few, but this benchmark shows PyMuPDF clearly ahead of them in speed and performance (as of writing this blog). So, if you are looking for an open-source library for text extraction, PyMuPDF is the go-to library for the task!
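
Benchmarks depend a lot on the documents involved, so it can be worth timing extraction on your own files. The sketch below is a minimal timing harness, not the benchmark mentioned above; `time_call` and `extract_with_pymupdf` are hypothetical helpers, and taking the best of three runs is just an assumption made here to reduce noise.

```python
import time


def time_call(func, *args, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best


def extract_with_pymupdf(path):
    """Extract the full plain text of a PDF with PyMuPDF (hypothetical helper)."""
    import fitz  # imported lazily so time_call can be used on its own

    with fitz.open(path) as doc:
        return "".join(page.get_text("text") for page in doc)
```

You could then run `time_call(extract_with_pymupdf, 'file.pdf')` and compare the number against an equivalent function written for another library.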

References:
[1] https://github.com/pymupdf/PyMuPDF

Written by Dheeraj Bhat
