BankToSheet Blog - How to Convert Specific PDF Pages to Excel with Python

PDF documents are commonly used for storing and sharing a wide variety of content, from invoices and reports to scientific papers. However, extracting and manipulating data from a PDF can often be challenging, especially when it comes to specific pages. This is where Python comes in. Python provides an array of libraries and tools to help automate the extraction and conversion of data from PDFs into more usable formats such as Excel. In this article, we'll walk through how to convert specific pages of a PDF document to Excel using Python.

Why Convert PDF Pages to Excel?

While PDFs are excellent for document presentation and preservation, they are not always the best format for extracting data. Many professionals need to work with tabular data stored within PDFs. Manually copying data from a PDF into an Excel spreadsheet is time-consuming and error-prone. Using Python to automate this process can save you significant time and effort. You can also handle large files and extract specific information without manually sifting through the document.

Prerequisites

Before we start converting specific PDF pages to Excel using Python, we need to install a few libraries. The primary libraries we'll use are:

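1. PyPDF2: This library helps us extract specific pages from the PDF document.
2. Tabula-py: A Python wrapper for the Tabula library (tabula-java), which extracts tables from PDF files and returns them as pandas DataFrames. Because it runs Java code under the hood, it needs a Java runtime installed on your machine.
3. Pandas: This library allows us to easily manipulate the data and save it in Excel format.

We'll also install openpyxl, which pandas uses to write .xlsx files.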

Installing Necessary Libraries

First, you'll need to install the required Python libraries if you haven't already. You can install them using pip:

pip install PyPDF2 tabula-py pandas openpyxl
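Before moving on, it can help to confirm that everything is in place. The following is a minimal sketch that uses only the standard library; the function name check_environment is our own, and the Java check assumes tabula-py will look for a java executable on your PATH.

```python
import importlib.util
import shutil

# Import names can differ from pip package names: tabula-py is imported as "tabula".
REQUIRED_MODULES = ["PyPDF2", "tabula", "pandas", "openpyxl"]


def check_environment(modules=REQUIRED_MODULES):
    """Return a dict mapping each requirement to True (found) or False (missing)."""
    status = {name: importlib.util.find_spec(name) is not None for name in modules}
    # Assumption: tabula-py launches the "java" executable found on PATH.
    status["java"] = shutil.which("java") is not None
    return status


for name, found in check_environment().items():
    print(f"{name}: {'ok' if found else 'MISSING'}")
```

Anything reported as MISSING should be installed before running the later steps.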

Extracting Specific Pages from the PDF

We will begin by extracting specific pages from the PDF using PyPDF2. This allows us to select the pages we want to work with, without needing to extract the entire document. Here’s how to do it:

Step 1: Import the PyPDF2 library.

    import PyPDF2

Step 2: Create a PdfReader object from the PDF file. Passing the path lets PyPDF2 load the file itself, so the pages remain readable in the later steps. (A reader built inside a with open(...) block stops working once that block closes the file.)

    reader = PyPDF2.PdfReader('input.pdf')

Step 3: Select the pages you want to extract. Pages in PyPDF2 are zero-indexed, so the first page is page 0, the second page is page 1, and so on.

    writer = PyPDF2.PdfWriter()
    writer.add_page(reader.pages[0])  # Extract the first page

Step 4: Save the extracted pages into a new PDF.

    with open('extracted_pages.pdf', 'wb') as output:
        writer.write(output)
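The steps above copy a single page. To pull out several pages at once, they can be wrapped in a small helper. This is a sketch rather than a definitive implementation: the function names to_zero_based and extract_pages are our own, and the helper takes page numbers as a PDF viewer shows them (starting at 1) and converts them to PyPDF2's zero-based indices.

```python
def to_zero_based(page_numbers, total_pages):
    """Convert 1-based page numbers (as shown in a PDF viewer) to 0-based indices."""
    indices = []
    for number in page_numbers:
        if not 1 <= number <= total_pages:
            raise ValueError(f"page {number} is outside 1..{total_pages}")
        indices.append(number - 1)
    return indices


def extract_pages(src_path, dst_path, page_numbers):
    """Copy the given 1-based pages from src_path into a new PDF at dst_path."""
    import PyPDF2  # imported here so to_zero_based() can be used on its own

    reader = PyPDF2.PdfReader(src_path)
    writer = PyPDF2.PdfWriter()
    for index in to_zero_based(page_numbers, len(reader.pages)):
        writer.add_page(reader.pages[index])
    with open(dst_path, "wb") as output:
        writer.write(output)
```

For example, extract_pages('input.pdf', 'extracted_pages.pdf', [2, 5, 6]) would write a three-page PDF containing pages 2, 5, and 6 of the original, in that order.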

Extracting Tables from the PDF

Now that we’ve extracted the specific pages we need, the next step is to extract any tabular data present on those pages. This can be done using the tabula-py library, which simplifies the process of extracting tables from PDFs.

Step 1: Import the tabula and pandas libraries.

    import tabula
    import pandas as pd

Step 2: Use the read_pdf function from the tabula library to extract tables from the specified pages.

    tables = tabula.read_pdf('extracted_pages.pdf', pages='1', multiple_tables=True)

The read_pdf function reads the PDF file, and the pages='1' argument tells tabula to look only at the first page. Unlike PyPDF2, tabula numbers pages from 1, and the number refers to a page of extracted_pages.pdf, not of the original document. You can pass a different page number, a range such as '1-3', or 'all'. The multiple_tables=True argument returns every table found on the page as a separate DataFrame in a list.
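Page numbering is an easy place to slip up, because the extracted file renumbers its pages from 1. The sketch below shows the mapping and a thin wrapper around read_pdf. The names extracted_page_numbers and read_tables are our own, and the mapping assumes the pages were written to the new PDF in the order listed.

```python
def extracted_page_numbers(original_pages):
    """Map each original 1-based page number to its 1-based position in the extracted PDF.

    Assumes the pages were added to the new PDF in the order given.
    """
    return {original: position for position, original in enumerate(original_pages, start=1)}


def read_tables(pdf_path, pages):
    """Return a list of DataFrames, one per table found on the given pages."""
    import tabula  # needs tabula-py and a Java runtime

    # pages may be an int, a list of ints, or a string such as "1-3" or "all".
    return tabula.read_pdf(pdf_path, pages=pages, multiple_tables=True)
```

If pages 2, 5, and 6 were extracted, extracted_page_numbers([2, 5, 6]) shows that original page 5 is now page 2 of extracted_pages.pdf. Since read_pdf accepts page numbers itself, calling read_tables('input.pdf', [2, 5, 6]) on the original file is another option when a separate extracted PDF is not needed.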

Converting the Extracted Data to Excel

Once we’ve extracted the tables from the specified pages, the next step is to convert them into an Excel file. To do this, we’ll use pandas. The DataFrame objects returned by tabula are already in a format that can easily be saved to Excel using pandas.

Step 1: Pick the table you want from the list. Each item returned by tabula is already a pandas DataFrame, so no conversion is needed.

    df = tables[0]

Step 2: Save the DataFrame to an Excel file.

    df.to_excel('output.xlsx', index=False)

The to_excel method saves the DataFrame to an Excel file (pandas uses openpyxl to write .xlsx output). The index=False argument ensures that the index column is not included in the Excel file.
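Taking the first table works when the page holds exactly one, but the list can be empty (nothing detected) or can start with a small fragment such as a header row. The sketch below guards against both. The names largest_table and save_largest_table are our own, and choosing the table with the most cells is only a heuristic that assumes the table you want is the biggest one on the page.

```python
def largest_table(tables):
    """Return the table with the most cells, or raise if nothing was extracted."""
    if not tables:
        raise ValueError("no tables were extracted; check the page number and PDF type")
    # DataFrame.shape is (rows, columns), so rows * columns is the cell count.
    return max(tables, key=lambda table: table.shape[0] * table.shape[1])


def save_largest_table(tables, xlsx_path):
    """Write the largest extracted table to xlsx_path without the index column."""
    largest_table(tables).to_excel(xlsx_path, index=False)
```

With the list returned by read_pdf, save_largest_table(tables, 'output.xlsx') either writes one sheet or fails with a clear message instead of an IndexError.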

Handling Multiple Tables

If the PDF page contains more than one table, the tables list will contain multiple DataFrames. You can iterate through the list and save each table to a separate sheet in the Excel file:

Step 1: Iterate through the tables and save each one to a different sheet in the Excel file.

    with pd.ExcelWriter('output.xlsx') as writer:
        for i, table in enumerate(tables):
            table.to_excel(writer, sheet_name=f'Table_{i+1}', index=False)
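Names like Table_1 are always valid, but if you label sheets with something more descriptive, keep in mind that Excel limits sheet names to 31 characters and rejects the characters [ ] : * ? / and \. The following sketch cleans names before writing. The helpers safe_sheet_name and save_named_tables are our own, and named_tables is assumed to be a list of (name, DataFrame) pairs.

```python
import re

INVALID_SHEET_CHARS = re.compile(r"[\[\]:*?/\\]")
MAX_SHEET_NAME_LENGTH = 31  # Excel's limit for worksheet names


def safe_sheet_name(name, used):
    """Return a sheet name Excel accepts that is not already recorded in `used`."""
    cleaned = INVALID_SHEET_CHARS.sub("_", name).strip() or "Sheet"
    candidate = cleaned[:MAX_SHEET_NAME_LENGTH]
    counter = 2
    while candidate.lower() in used:  # Excel compares sheet names case-insensitively
        suffix = f"_{counter}"
        candidate = cleaned[: MAX_SHEET_NAME_LENGTH - len(suffix)] + suffix
        counter += 1
    used.add(candidate.lower())
    return candidate


def save_named_tables(named_tables, xlsx_path):
    """Write each (name, DataFrame) pair to its own sheet in one workbook."""
    import pandas as pd  # writing .xlsx also needs openpyxl

    if not named_tables:
        raise ValueError("nothing to write: a workbook needs at least one sheet")
    used = set()
    with pd.ExcelWriter(xlsx_path) as writer:
        for name, table in named_tables:
            table.to_excel(writer, sheet_name=safe_sheet_name(name, used), index=False)
```

For instance, a label such as 'Page 3/Table 1' would be written as the sheet 'Page 3_Table 1'.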

Handling PDFs with Complex Layouts

Sometimes, PDFs contain complex layouts where tables are not neatly defined. In such cases, the tabula-py library might not extract the data perfectly. You can try different approaches to improve accuracy:

1. Adjusting the area for extraction: Use the area parameter to restrict extraction to the part of the page where the table is located. It takes the top, left, bottom, and right edges of the region, measured in points.
2. Widening the pages parameter: If tables span multiple pages, pass a range such as pages='2-4' (or pages='all') instead of a single page.
3. Using OCR libraries: Scanned PDFs contain page images rather than text, so tabula-py has nothing to read. Use an OCR (Optical Character Recognition) tool such as Tesseract to recover the text first, then convert it to a structured table format.
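As an illustration of the first approach, the sketch below builds an area value from measurements in inches and passes it to read_pdf. The helper names area_from_inches and read_region are our own. The sketch assumes you measured the region from the top-left corner of the page, and it relies on PDF coordinates being expressed in points (72 per inch).

```python
POINTS_PER_INCH = 72  # PDF coordinates are measured in points


def area_from_inches(top, left, bottom, right):
    """Convert a box measured in inches from the page's top-left corner to points,
    in the (top, left, bottom, right) order that tabula-py's area option uses."""
    if bottom <= top or right <= left:
        raise ValueError("bottom must exceed top and right must exceed left")
    return [edge * POINTS_PER_INCH for edge in (top, left, bottom, right)]


def read_region(pdf_path, page, area, ruled=True):
    """Extract tables from one region of a page.

    ruled=True uses lattice mode (tables drawn with lines);
    ruled=False uses stream mode (columns separated by whitespace).
    """
    import tabula  # needs tabula-py and a Java runtime

    return tabula.read_pdf(
        pdf_path,
        pages=page,
        area=area,
        lattice=ruled,
        stream=not ruled,
        multiple_tables=True,
    )
```

A table that starts one inch from the top and half an inch from the left of the page, and ends five inches down and eight inches across, would be read with read_region('extracted_pages.pdf', 1, area_from_inches(1, 0.5, 5, 8)). If the result still looks wrong, switching between ruled=True and ruled=False is often the next thing to try.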

Advanced Techniques

For more advanced users, Python offers additional tools and libraries to refine the extraction process further. You can combine libraries like pdfminer, PyMuPDF, or Camelot to work with PDFs more precisely. These libraries can extract text, images, and other elements to help you create more structured Excel files from complex PDFs.

1. Using PyMuPDF: This library allows for detailed extraction of text, tables, and other elements from a PDF. It provides a high degree of customization in how you extract content.
2. Using Camelot: Camelot is another Python library for extracting tables from text-based PDFs. It offers two parsing modes (lattice for tables drawn with ruling lines, stream for tables whose columns are separated by whitespace) and reports accuracy metrics for each extracted table, which helps when tuning extraction for complex tables.
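To give a feel for the Camelot workflow, here is a minimal sketch that tries lattice mode first and falls back to stream mode, keeping only tables whose reported accuracy clears a threshold. The function names and the 90.0 threshold are our own choices rather than Camelot defaults, and Camelot has to be installed separately (it is not part of the pip command used earlier).

```python
def accurate_enough(parsing_reports, min_accuracy=90.0):
    """Return the positions of tables whose reported accuracy meets the threshold."""
    return [
        i
        for i, report in enumerate(parsing_reports)
        if report.get("accuracy", 0.0) >= min_accuracy
    ]


def extract_with_camelot(pdf_path, pages="1", min_accuracy=90.0):
    """Try lattice mode, then stream mode; return the accepted tables as DataFrames."""
    import camelot  # separate install, not covered by the earlier pip command

    for flavor in ("lattice", "stream"):
        tables = camelot.read_pdf(pdf_path, pages=pages, flavor=flavor)
        # Each table carries a parsing_report dict that includes an "accuracy" score.
        reports = [table.parsing_report for table in tables]
        keep = accurate_enough(reports, min_accuracy)
        if keep:
            return [tables[i].df for i in keep]
    return []
```

The DataFrames returned by extract_with_camelot can be written to Excel with to_excel in the same way as the tabula results.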