Project: pdf2docx

Open-source Python library for converting PDF to DOCX.

Project Details

Parse and re-create page layout
- page margin
- section and column (1 or 2 columns only)
- page header and footer [TODO]
Parse and re-create paragraph
- OCR text [TODO]
- text in horizontal or vertical direction: left to right, or bottom to top
- font style, e.g. font name, size, weight, italic and color
- text format, e.g. highlight, underline, strike-through
- list style [TODO]
- external hyperlink
- paragraph horizontal alignment (left/right/center/justify) and vertical spacing
Parse and re-create image
- in-line image
- image in Gray/RGB/CMYK mode
- transparent image
- floating image, i.e. picture behind text
Parse and re-create table
- border style, e.g. width, color
- shading style, i.e. background color
- merged cells
- cell with vertical text direction
- table with partly hidden borders
- nested tables
Parsing pages with multi-processing
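
A minimal sketch of how a conversion using the features above might be driven from Python. It relies on the `Converter` class with its `convert()` and `close()` methods and the `start`, `end` and `multi_processing` keyword arguments as documented for pdf2docx; the helper names `default_docx_path` and `convert_pdf` are hypothetical and exist only for this illustration.

```python
from pathlib import Path


def default_docx_path(pdf_path):
    """Return the output path next to the PDF, with a .docx suffix."""
    return Path(pdf_path).with_suffix(".docx")


def convert_pdf(pdf_path, docx_path=None, start=0, end=None, multi_processing=False):
    """Convert pages [start, end) of a PDF to DOCX and return the output path.

    Hypothetical wrapper: page indexes are zero-based, and end=None is
    assumed to mean "through the last page".
    """
    # Imported lazily so the path helper above works without pdf2docx installed.
    from pdf2docx import Converter

    out = Path(docx_path) if docx_path else default_docx_path(pdf_path)
    cv = Converter(str(pdf_path))
    try:
        cv.convert(str(out), start=start, end=end, multi_processing=multi_processing)
    finally:
        # Always release the underlying PDF handle, even if parsing fails.
        cv.close()
    return out


if __name__ == "__main__":
    # The guard matters when multi_processing=True, because worker
    # processes may re-import this module on some platforms.
    print(convert_pdf("sample.pdf", multi_processing=True))
```

Here `sample.pdf` is a placeholder file name. Multi-processing is likely to pay off only for documents with many pages, since starting worker processes has its own cost.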

It can also be used as a tool to extract table contents, since both the content and the format/style of each table are parsed.
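
Table extraction could look roughly like the sketch below. It uses `Converter.extract_tables()`, and assumes that each returned table is a list of rows and each row a list of cell texts, where a cell may be `None` (for example, inside a merged region). The helpers `table_to_csv` and `extract_tables_as_csv` are hypothetical names introduced for this example.

```python
import csv
import io


def table_to_csv(table):
    """Serialize one table (list of rows of cell texts) to CSV text.

    None cells are written as empty fields; surrounding whitespace is trimmed.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in table:
        writer.writerow("" if cell is None else str(cell).strip() for cell in row)
    return buf.getvalue()


def extract_tables_as_csv(pdf_path, start=0, end=None):
    """Return one CSV string per table found in pages [start, end)."""
    # Imported lazily so table_to_csv can be used without pdf2docx installed.
    from pdf2docx import Converter

    cv = Converter(str(pdf_path))
    try:
        tables = cv.extract_tables(start=start, end=end)
    finally:
        cv.close()
    return [table_to_csv(t) for t in tables]


if __name__ == "__main__":
    # "sample.pdf" is a placeholder; only the first page is scanned here.
    for text in extract_tables_as_csv("sample.pdf", start=0, end=1):
        print(text)
```

Only cell text is kept in this sketch; border and shading information would need the DOCX output instead.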