A Python (3.7+) module that wraps the pdftoppm and pdftocairo command line tools to convert a PDF into a list of PIL Image objects.
pip install pdf2image
Windows users will have to build or download poppler for Windows. I recommend @oschwartz10612's version, which is the most up-to-date. You will then have to add the bin/ folder to PATH or pass poppler_path = r"C:\path\to\poppler-xx\bin" as an argument to convert_from_path.
Mac users will have to install poppler.
Installing using Brew:
brew install poppler
Most distros ship with pdftoppm and pdftocairo. If they are not installed, refer to your package manager to install poppler-utils.
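Whichever platform you are on, it can help to confirm that the two tools are reachable before converting anything. The following is a minimal sketch using only the standard library; the helper name `missing_poppler_tools` is made up for this illustration and is not part of pdf2image.

```python
import shutil

# The two poppler command line tools that pdf2image wraps.
POPPLER_TOOLS = ("pdftoppm", "pdftocairo")


def missing_poppler_tools(tools=POPPLER_TOOLS):
    """Return the names from `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_poppler_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("pdftoppm and pdftocairo are both available.")
```

If a tool is reported missing even though poppler is installed, the bin/ folder is probably not on PATH.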
Installing using conda:
conda install -c conda-forge poppler
pip install pdf2image
from pdf2image import convert_from_path, convert_from_bytes
from pdf2image.exceptions import (
    PDFInfoNotInstalledError,
    PDFPageCountError,
    PDFSyntaxError,
)
Then simply do:
images = convert_from_path('/home/belval/example.pdf')
OR
images = convert_from_bytes(open('/home/belval/example.pdf', 'rb').read())
OR better yet
import tempfile
with tempfile.TemporaryDirectory() as path:
    images_from_path = convert_from_path('/home/belval/example.pdf', output_folder=path)
    # Do something here
images will be a list of PIL Image objects, one for each page of the PDF document.
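Since the result is a plain list, a common next step is writing each page to disk. Here is a small sketch of that; `save_pages` and the file naming scheme are assumptions made for this example, and the only thing it relies on is the `save(path, format)` method that PIL Image objects provide.

```python
from pathlib import Path


def save_pages(images, out_dir, stem="page"):
    """Save each page image as a numbered PNG and return the written paths.

    `images` is expected to be the list returned by convert_from_path or
    convert_from_bytes; any object with a PIL-style save(path, format)
    method works.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, image in enumerate(images, start=1):
        # Zero-padded page numbers keep the files sorted in page order.
        path = out_dir / f"{stem}_{number:03d}.png"
        image.save(path, "PNG")
        paths.append(path)
    return paths
```

For example, `save_pages(convert_from_path('/home/belval/example.pdf'), 'out')` should leave one PNG per page in the out folder.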
Here are the definitions:
convert_from_path(pdf_path, dpi=200, output_folder=None, first_page=None, last_page=None, fmt='ppm', jpegopt=None, thread_count=1, userpw=None, use_cropbox=False, strict=False, transparent=False, single_file=False, output_file=str(uuid.uuid4()), poppler_path=None, grayscale=False, size=None, paths_only=False, use_pdftocairo=False, timeout=600, hide_attributes=False)
convert_from_bytes(pdf_file, dpi=200, output_folder=None, first_page=None, last_page=None, fmt='ppm', jpegopt=None, thread_count=1, userpw=None, use_cropbox=False, strict=False, transparent=False, single_file=False, output_file=str(uuid.uuid4()), poppler_path=None, grayscale=False, size=None, paths_only=False, use_pdftocairo=False, timeout=600, hide_attributes=False)
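One way to use first_page and last_page is to convert a large document a few pages at a time, so only one batch of images is held in memory. The sketch below is an illustration, not something pdf2image ships: `page_batches` and `convert_in_batches` are hypothetical helper names, and it assumes the dictionary returned by pdfinfo_from_path has an integer "Pages" entry.

```python
def page_batches(page_count, batch_size):
    """Yield (first_page, last_page) pairs covering pages 1..page_count.

    Both ends are inclusive and 1-based, matching first_page and last_page.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for first in range(1, page_count + 1, batch_size):
        yield first, min(first + batch_size - 1, page_count)


def convert_in_batches(pdf_path, batch_size=10, **kwargs):
    """Yield lists of page images, converting batch_size pages at a time."""
    # Imported here so page_batches stays usable without pdf2image installed.
    from pdf2image import convert_from_path, pdfinfo_from_path

    # Assumption: the pdfinfo output exposes the page count under "Pages".
    page_count = pdfinfo_from_path(pdf_path)["Pages"]
    for first, last in page_batches(page_count, batch_size):
        yield convert_from_path(pdf_path, first_page=first, last_page=last, **kwargs)
```

Extra keyword arguments such as dpi or fmt are passed straight through to convert_from_path. Note that this sketch does not forward poppler_path to pdfinfo_from_path, so it assumes poppler is on PATH.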
hide_attributes parameter (Thank you @StaticRocket)
timeout parameter which raises PDFPopplerTimeoutError after the given number of seconds.
use_pdftocairo parameter which forces pdf2image to use pdftocairo. Should improve performance.
Fixed a bug where using pdf2image with multiple threads (but not multiple processes) would cause an exception
jpegopt parameter allows for tuning of the output JPEG when using fmt="jpeg" (-jpegopt in pdftoppm CLI) (Thank you @abieler)
pdfinfo_from_path and pdfinfo_from_bytes which expose the output of the pdfinfo CLI
paths_only parameter will return image paths instead of Image objects, to prevent OOM when converting a big PDF
size parameter allows you to define the shape of the resulting images (-scale-to in pdftoppm CLI)
  size=400 will fit the image to a 400x400 box, preserving aspect ratio
  size=(400, None) will make the image 400 pixels wide, preserving aspect ratio
  size=(500, 500) will resize the image to 500x500 pixels, not preserving aspect ratio
grayscale parameter allows you to convert images to grayscale (-gray in pdftoppm CLI)
single_file parameter allows you to convert the first PDF page only, without adding digits at the end of the output_file
poppler_path parameter allows you to specify poppler's installation path
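To make the three size forms concrete, the sketch below estimates the output dimensions each one would give for a page of known pixel size. It only mirrors the rules as described here: `expected_dimensions` is a hypothetical helper, not pdf2image code, and the rounding is an assumption that may differ from what poppler actually produces by a pixel.

```python
def expected_dimensions(page_width, page_height, size):
    """Estimate output (width, height) in pixels for a given `size` value.

    Illustrative only: this follows the three rules described in the text
    and does not call pdf2image or poppler.
    """
    if isinstance(size, int):
        # size=400: fit inside a 400x400 box, preserving aspect ratio.
        scale = size / max(page_width, page_height)
        return round(page_width * scale), round(page_height * scale)

    width, height = size
    if width is None:
        raise ValueError("only the three forms described in the text are handled")
    if height is None:
        # size=(400, None): fix the width, derive the height from the ratio.
        return width, round(page_height * width / page_width)
    # size=(500, 500): both dimensions forced, aspect ratio not preserved.
    return width, height


if __name__ == "__main__":
    # A US Letter page rendered at 200 dpi is 1700x2200 pixels.
    for size in (400, (400, None), (500, 500)):
        print(size, "->", expected_dimensions(1700, 2200, size))
```

The tuple form is the one to reach for when downstream code needs a fixed width, while the integer form is a convenient way to cap both dimensions for thumbnails.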
Run python tests.py from a clone of the project to get timings.