A Python (3.7+) module that wraps the pdftoppm and pdftocairo command line tools to convert a PDF into a list of PIL Image objects.
pip install pdf2image
Windows users will have to build or download poppler for Windows. I recommend @oschwartz10612's version, which is the most up-to-date. You will then have to add the bin/ folder to PATH or pass poppler_path = r"C:\path\to\poppler-xx\bin" as an argument to convert_from_path.
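The choice between relying on PATH and passing poppler_path can be made at runtime. The following is a minimal sketch, not part of pdf2image: the helper name poppler_kwargs and its poppler_bin argument are hypothetical, and it assumes that finding pdftoppm on PATH means poppler is usable.

```python
import shutil
from pathlib import Path


def poppler_kwargs(poppler_bin=None):
    """Build the extra keyword arguments needed to locate poppler.

    poppler_bin is a hypothetical argument of this sketch: the bin/ folder
    of a poppler download, used only when pdftoppm is not already on PATH.
    """
    if shutil.which("pdftoppm") is not None:
        # poppler is on PATH, so no extra argument is needed.
        return {}
    if poppler_bin is not None and Path(poppler_bin).is_dir():
        # Fall back to an explicit folder, to be passed on as poppler_path.
        return {"poppler_path": str(poppler_bin)}
    raise FileNotFoundError(
        "pdftoppm is not on PATH and no valid poppler bin/ folder was given"
    )
```

The result could then be unpacked into the call, for example convert_from_path(pdf, **poppler_kwargs(r"C:\path\to\poppler-xx\bin")).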
Mac users will have to install poppler.
Installing using Brew:
brew install poppler
Most distros ship with pdftoppm and pdftocairo. If they are not installed, use your package manager to install poppler-utils.
Using conda:
conda install -c conda-forge poppler
pip install pdf2image

from pdf2image import convert_from_path, convert_from_bytes
from pdf2image.exceptions import (
    PDFInfoNotInstalledError,
    PDFPageCountError,
    PDFSyntaxError,
)
Then simply do:
images = convert_from_path('/home/belval/example.pdf')
OR
images = convert_from_bytes(open('/home/belval/example.pdf', 'rb').read())
OR better yet
import tempfile
with tempfile.TemporaryDirectory() as path:
    images_from_path = convert_from_path('/home/belval/example.pdf', output_folder=path)
    # Do something here
images will be a list of PIL Image objects, one for each page of the PDF document.
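Since the result is an ordinary list of PIL images, a common next step is writing each page to disk. Here is a small sketch of that; save_pages, stem and ext are hypothetical names chosen for illustration, and the only thing assumed about each item is that it has PIL's save() method.

```python
from pathlib import Path


def save_pages(images, out_dir, stem="page", ext="png"):
    """Save each page image to out_dir and return the written paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    # Number files from 1 so they line up with the PDF's page numbers.
    for number, image in enumerate(images, start=1):
        target = out_dir / f"{stem}_{number:03d}.{ext}"
        image.save(target)  # PIL infers the file format from the extension
        paths.append(target)
    return paths
```

With the list from above, save_pages(images, "pages") would write pages/page_001.png, pages/page_002.png, and so on.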
Here are the definitions:
convert_from_path(pdf_path, dpi=200, output_folder=None, first_page=None, last_page=None, fmt='ppm', jpegopt=None, thread_count=1, userpw=None, use_cropbox=False, strict=False, transparent=False, single_file=False, output_file=str(uuid.uuid4()), poppler_path=None, grayscale=False, size=None, paths_only=False, use_pdftocairo=False, timeout=600, hide_attributes=False)
convert_from_bytes(pdf_file, dpi=200, output_folder=None, first_page=None, last_page=None, fmt='ppm', jpegopt=None, thread_count=1, userpw=None, use_cropbox=False, strict=False, transparent=False, single_file=False, output_file=str(uuid.uuid4()), poppler_path=None, grayscale=False, size=None, paths_only=False, use_pdftocairo=False, timeout=600, hide_attributes=False)
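As an illustration of how a few of these parameters combine with the exceptions imported earlier, the sketch below converts a page range with strict=True and handles failures. The wrapper name convert_strict is hypothetical, and the comments describing when each exception is raised reflect my understanding of the library rather than anything stated in this README.

```python
def convert_strict(pdf_path, first_page=None, last_page=None, dpi=200):
    """Convert a page range, returning [] when the PDF cannot be processed."""
    # Imported lazily so this helper can be defined without pdf2image present.
    from pdf2image import convert_from_path
    from pdf2image.exceptions import (
        PDFInfoNotInstalledError,
        PDFPageCountError,
        PDFSyntaxError,
    )

    try:
        return convert_from_path(
            pdf_path,
            dpi=dpi,
            first_page=first_page,  # 1-based, as in the pdftoppm CLI
            last_page=last_page,    # inclusive
            strict=True,            # surface syntax errors as PDFSyntaxError
        )
    except PDFInfoNotInstalledError as error:
        # Assumed to mean that poppler's pdfinfo could not be run at all.
        raise RuntimeError("poppler is not installed or not on PATH") from error
    except (PDFPageCountError, PDFSyntaxError) as error:
        # Assumed to mean the file itself is unreadable or malformed.
        print(f"Could not convert {pdf_path}: {error}")
        return []
```

Returning an empty list for a bad file is a design choice of this sketch; re-raising may suit callers that need to tell "no pages" apart from "broken PDF".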
hide_attributes parameter (Thank you @StaticRocket)
timeout parameter which raises PDFPopplerTimeoutError after the given number of seconds.
use_pdftocairo parameter which forces pdf2image to use pdftocairo. Should improve performance.
Fixed: using pdf2image with multiple threads (but not multiple processes) would cause an exception
jpegopt parameter allows for tuning of the output JPEG when using fmt="jpeg" (-jpegopt in pdftoppm CLI) (Thank you @abieler)
pdfinfo_from_path and pdfinfo_from_bytes, which expose the output of the pdfinfo CLI
paths_only parameter will return image paths instead of Image objects, to prevent OOM when converting a big PDF
size parameter allows you to define the shape of the resulting images (-scale-to in pdftoppm CLI)
    size=400 will fit the image to a 400x400 box, preserving aspect ratio
    size=(400, None) will make the image 400 pixels wide, preserving aspect ratio
    size=(500, 500) will resize the image to 500x500 pixels, not preserving aspect ratio
grayscale parameter allows you to convert images to grayscale (-gray in pdftoppm CLI)
single_file parameter allows you to convert the first PDF page only, without adding digits at the end of the output_file
poppler_path parameter, for pointing pdf2image at poppler's bin/ folder

Run python tests.py to get timings.
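Several of the parameters listed above are often used together when a PDF is large. The sketch below combines paths_only, size, fmt and timeout; the function name render_thumbnails and its default values are hypothetical, and it assumes that out_dir already exists.

```python
def render_thumbnails(pdf_path, out_dir, width=400, timeout=60):
    """Write JPEG thumbnails to out_dir and return their file paths."""
    # Imported lazily so this helper can be defined without pdf2image present.
    from pdf2image import convert_from_path
    from pdf2image.exceptions import PDFPopplerTimeoutError

    try:
        return convert_from_path(
            pdf_path,
            output_folder=out_dir,  # assumed to exist already
            fmt="jpeg",
            size=(width, None),     # fixed width, aspect ratio preserved
            paths_only=True,        # return file paths instead of Image objects
            timeout=timeout,        # seconds before PDFPopplerTimeoutError
        )
    except PDFPopplerTimeoutError:
        print(f"Gave up on {pdf_path} after {timeout} seconds")
        return []
```

Because only paths are returned, memory use stays low regardless of page count; the images can be opened one at a time later if needed.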