BankToSheet Blog - Merge PDF files then convert specific pages to Excel with Python

Merging several PDF files and converting selected pages to Excel is a common step in data extraction and analysis. This guide walks through the process in Python, using PyPDF2 to merge and split the PDFs, pdfplumber to read tables from the pages, and pandas to write the Excel file.

1. Prerequisites

Before we begin, ensure you have Python installed on your computer. You’ll also need the following Python libraries:

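PyPDF2 - for merging PDF files and extracting pages. Development of this library has since moved to the pypdf package, but the scripts in this guide use the PyPDF2 names.
pdfplumber - for reading tables from PDF pages (used in step 4).
pandas - for building the table and writing it to Excel.
openpyxl - the engine pandas uses to write .xlsx files.

Install them using pip:

pip install PyPDF2 pdfplumber pandas openpyxl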

2. Merging PDF Files

To merge multiple PDF files into one, you can use the following Python script:

from PyPDF2 import PdfMerger

# Input files, in the order they should appear in the merged document
pdfs = ["file1.pdf", "file2.pdf"]

merger = PdfMerger()
for pdf in pdfs:
    merger.append(pdf)

merger.write("merged.pdf")
merger.close()

This script takes a list of PDF files, merges them, and saves the result as merged.pdf.
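If your PDFs sit in a folder rather than in a hand-written list, you can build the list first. Here is a minimal sketch of that idea. The helper names collect_pdfs and merge_pdfs are made up for this example, and merging in alphabetical order by file name is an assumption you may want to change.

```python
from pathlib import Path


def collect_pdfs(folder):
    """Return the PDF files in `folder`, sorted by file name (case-insensitive)."""
    return sorted(
        (p for p in Path(folder).iterdir()
         if p.is_file() and p.suffix.lower() == ".pdf"),
        key=lambda p: p.name.lower(),
    )


def merge_pdfs(paths, output_path):
    """Merge `paths` in order into `output_path`; return the number of files merged."""
    paths = list(paths)
    if not paths:
        raise ValueError("no PDF files to merge")

    # Imported here so collect_pdfs can be used without PyPDF2 installed.
    from PyPDF2 import PdfMerger

    merger = PdfMerger()
    try:
        for path in paths:
            merger.append(str(path))
        merger.write(str(output_path))
    finally:
        merger.close()  # release the input files even if a merge step fails
    return len(paths)
```

A call such as merge_pdfs(collect_pdfs("statements"), "merged.pdf") would then replace the hard-coded list, where "statements" is a placeholder folder name. Write the output outside the input folder, otherwise a second run would pick up merged.pdf as one of its inputs.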

3. Extract Specific Pages

Once the PDF files are merged, you can extract specific pages for conversion. Here’s an example script:

from PyPDF2 import PdfReader, PdfWriter

reader = PdfReader("merged.pdf")
writer = PdfWriter()

pages_to_extract = [0, 2, 4]  # Page indices to extract (0-based)
for page in pages_to_extract:
    writer.add_page(reader.pages[page])

with open("extracted.pdf", "wb") as output_pdf:
    writer.write(output_pdf)

This script extracts pages 1, 3, and 5 (indices 0, 2, 4) and saves them as extracted.pdf.
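Because PDF viewers show 1-based page numbers while the script uses 0-based indices, it is easy to be off by one. The sketch below converts a human-style page list such as "1,3,5-7" into indices and rejects pages that do not exist. The function name parse_page_spec and the spec format are assumptions made for this example, not part of PyPDF2.

```python
def parse_page_spec(spec, page_count):
    """Convert a 1-based page spec such as "1,3,5-7" to sorted 0-based indices."""
    indices = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end < start or end > page_count:
            raise ValueError(f"page range {part!r} is outside 1-{page_count}")
        # Page n in the viewer is index n - 1 in reader.pages
        indices.update(range(start - 1, end))
    return sorted(indices)
```

In the extraction script, pages_to_extract = parse_page_spec("1,3,5", len(reader.pages)) yields [0, 2, 4]. Note that the result is de-duplicated and sorted by page number, so this sketch cannot be used to reorder pages.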

4. Convert to Excel

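After extracting the pages you need, convert their content to Excel. If the pages contain tabular data, you can use the pdfplumber library to extract it: page.extract_table() returns the largest table on a page as a list of rows, or None when it finds no table. Here’s how: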

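import pdfplumber
import pandas as pd

all_data = []
with pdfplumber.open("extracted.pdf") as pdf:
    for page in pdf.pages:
        table = page.extract_table()  # None if no table is found on the page
        if table:
            all_data.extend(table)

# Without this check, all_data[0] below fails with an IndexError
if not all_data:
    raise SystemExit("No tables found in extracted.pdf")

# Create a DataFrame (first row as the header) and save to Excel
df = pd.DataFrame(all_data[1:], columns=all_data[0])
df.to_excel("output.xlsx", index=False)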

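This script reads the extracted PDF, collects the rows of the table found on each page, and writes them to output.xlsx. The first row collected is used as the column headers, so the script assumes that every page shares the same column layout.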
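Many multi-page tables repeat the header row at the top of every page, and those repeats would end up as data rows in the spreadsheet. The sketch below is one way to handle that, assuming the repeated header is identical on every page. The function name combine_tables is made up for this example. It works on plain lists, so you can try it without a PDF.

```python
def combine_tables(tables):
    """Merge per-page tables into (header, rows), skipping repeated header rows."""
    header = None
    rows = []
    for table in tables:
        for raw_row in table or []:  # a page without a table contributes None
            # pdfplumber returns None for empty cells; use "" instead
            row = [(cell or "").strip() for cell in raw_row]
            if not any(row):
                continue  # skip rows that are entirely empty
            if header is None:
                header = row  # first non-empty row becomes the header
            elif row != header:
                rows.append(row)
    return header, rows
```

To use it, collect the tables with [page.extract_table() for page in pdf.pages], pass that list to combine_tables, and build the DataFrame with pd.DataFrame(rows, columns=header) once you have checked that header is not None.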

5. Verify the Results

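Open output.xlsx and check that the column headers, the number of rows, and the values match the source pages. Pay particular attention to cells that were merged or split during extraction, then make any final adjustments in Excel.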
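One thing to look for is numbers stored as text: pdfplumber returns every cell as a string, so amounts will not sum in Excel until they are converted. The sketch below shows one possible conversion. The function name to_number is made up for this example, and it assumes "." is the decimal mark, "," is the thousands separator, and parentheses mark negative amounts.

```python
import re


def to_number(cell):
    """Return `cell` as a float if it looks like a number, else return it unchanged."""
    if not isinstance(cell, str):
        return cell
    text = cell.strip()
    # Accounting style: "(1,234.56)" means -1234.56
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1]
    # Drop currency symbols, thousands separators and spaces
    text = re.sub(r"[$€£,\s]", "", text)
    if not re.fullmatch(r"[+-]?\d+(\.\d+)?", text):
        return cell  # dates, descriptions, etc. are left alone
    value = float(text)
    return -value if negative else value
```

You could apply it before saving with df["Amount"] = df["Amount"].map(to_number), where "Amount" is a placeholder column name. Apply it only to columns that hold amounts, since values such as account numbers consist of digits too and should stay as text.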

Conclusion

By merging PDF files, extracting specific pages, and converting them to Excel with Python, you can automate a workflow that is tedious to do by hand. The same three scripts adapt to other documents, from data extraction to detailed reporting, by changing the file list, the page indices, and the output file name.