Extracting and Cleaning Text from PDFs Using Python

Kushal Shah
6 min read · Jun 26, 2024

In today’s digital age, extracting and analysing text from documents is crucial for various applications, including Data Analysis, Content Management, and Information Retrieval. This blog will walk you through the process of extracting text from PDF documents using Python and then cleaning the extracted text to remove unwanted characters and non-English content. We have done this for the specific case of a commentary on the Bhagavad Gita, but the same process can be used for any PDF with minor variations.

We’ll use several powerful libraries for this task:

  • pytesseract for Optical Character Recognition (OCR).
  • pdf2image to convert PDF pages to images.
  • PyMuPDF for handling PDF documents.
  • langdetect to detect the language of the text.

Step 1: Setting Up the Environment

First, ensure you have the necessary libraries installed. You can install them using pip:

pip install pytesseract pdf2image PyMuPDF langdetect
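
Note that pytesseract and pdf2image are wrappers: they also need the Tesseract OCR engine (including its Hindi language data, since we OCR both English and Hindi) and the Poppler utilities installed on your system. On Windows, download and install them and point the code at their paths (as we do below); on Debian/Ubuntu, for example, something like this should work:

sudo apt-get install tesseract-ocr tesseract-ocr-hin poppler-utils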

Step 2: Extracting Text from PDF

We’ll create a function to extract text from a specified range of pages in a PDF file. The function converts each page to an image and then uses pytesseract to extract text from the images.

Here’s the code for the extraction function:

import pytesseract
from pdf2image import convert_from_path
import fitz  # PyMuPDF

# Function to extract text from a range of pages in a PDF file
def extract_text_from_pdf(pdf_file_path, start_page, end_page):
    extracted_text = ""

    # Paths to the Tesseract OCR executable and Poppler utilities
    # (adjust these to match your own installation)
    pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
    pop_path = r'A:\Apps\Release-24.02.0-0\poppler-24.02.0\Library\bin'

    # Open the PDF with PyMuPDF (this also validates that the file is readable)
    with fitz.open(pdf_file_path) as pdf:
        # Convert the specified pages to images (using pdf2image)
        images = convert_from_path(pdf_file_path, poppler_path=pop_path,
                                   first_page=start_page, last_page=end_page)

        # Extract text from each image using Tesseract OCR
        for img in images:
            # 'eng+hin' requires both the English and Hindi traineddata files
            text = pytesseract.image_to_string(img, lang='eng+hin')
            # Append text from each page
            extracted_text += text + "\n"

    return extracted_text

This function takes the PDF file path, the start page, and the end page as inputs. It uses Tesseract OCR to convert images to text, supporting both English and Hindi languages.
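
For example, to sanity-check the function on a few pages (the file name and page numbers here are placeholders):

sample_text = extract_text_from_pdf("Book.pdf", 1, 3)
print(sample_text[:500])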

Step 3: Detecting and Removing Non-English Text

Next, we need to clean the extracted text by identifying and removing non-English lines. We use the langdetect library to detect the language of each line and remove those that are not in English.

Here’s the code to detect and remove non-English lines:

from langdetect import detect, LangDetectException
import re

# Function to get a list of non-English lines from the text
def get_garbage_values(text):
    text = text.replace("|", '')

    garbage_lines = []

    # Detect the language of each line
    for line in text.split('\n'):
        if not line.strip():
            continue
        try:
            lang = detect(line)
        except LangDetectException:
            lang = 'unknown'

        # If the language of the line is not detected to be English,
        # add it to the list of garbage_lines for further checking
        if lang != 'en':
            garbage_lines.append(line)

    return [item for item in garbage_lines if item.strip()]

# Function to remove non-English text from the extracted text
def remove_hindi(garbage_list, text):
    def is_hindi_string(s):
        # Match characters from the Devanagari Unicode block
        hindi_pattern = re.compile(r'[\u0900-\u097F]+')
        s = ' '.join(s.split())
        return bool(hindi_pattern.search(s))

    def check_for_alphanumeric(s):
        # Strip out letters, digits, and common punctuation; if anything is
        # left over, the line contains unexpected characters and is flagged
        pattern = r'[A-Za-z(){}\[\]0-9:;,.?"\'/\-\u00E9\u2018\u2019\u2013\u2014*]'
        temp = re.sub(pattern, '', s)
        return len(temp.strip()) != 0

    # List of lines to keep (if needed) [manual check]
    keep = []

    for item in garbage_list:
        if item in keep:
            continue
        # Print each garbage line before deleting it from the extracted
        # text, so the removals can be checked manually
        if is_hindi_string(item) or check_for_alphanumeric(item):
            print(item)
            text = text.replace(item, '')
    return text

# Function to clean paragraph text
def clean_text(text):
    # Remove special characters, symbols, and digits using regular expressions
    cleaned_text = re.sub(r'[^\w\s.,]', '', text)
    return re.sub(r'\d+', '', cleaned_text).strip()  # Remove digits as well
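
As a quick illustration, clean_text on a made-up string:

print(clean_text("Hello, World! 123"))  # -> 'Hello, World'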

Step 4: Integrating Everything

Finally, we integrate everything to extract text from a PDF, clean it, and save the cleaned text to a file.

# Specify path to PDF file
pdf_file_path = "Book.pdf"

# Page range to extract
start = 25
end = 30

# Extract text from the specified pages of the PDF
extracted_text = extract_text_from_pdf(pdf_file_path, start, end)

# Check the extracted text to find non-English lines
garbage_list = get_garbage_values(extracted_text)

# Remove Hindi text and other unwanted characters from the extracted text
cleaned_text = remove_hindi(garbage_list, extracted_text)

# Specify output file path
output_file_path = "output.txt"

# Write cleaned text to output file
with open(output_file_path, 'w', encoding='utf-8') as f:
    f.write(cleaned_text)

So far, we have walked through the process of extracting text from PDF documents and cleaning it to remove non-English content. We used powerful Python libraries like pytesseract, pdf2image, fitz, and langdetect to accomplish this task. By following these steps, you can automate the extraction and cleaning of text from PDFs, making it easier to analyse and work with large volumes of document data.

Next, we explain how the text is separated into paragraphs and how chapter and verse numbers are identified. This is important for organising the contents of the PDF neatly into an SQL table for applications like Information Retrieval.

Step 5: Separating Text into Paragraphs

The text is read line by line and processed to accumulate paragraph text until an empty line is encountered, indicating the end of a paragraph. The accumulated text is then cleaned and added to the paragraphs list.

# Read the cleaned text line by line; the appended empty string ensures the
# final paragraph is flushed even if the text does not end with a blank line
lines = cleaned_text.split('\n') + ['']

# Split text into paragraphs
paragraphs = []
current_chapter = ''
current_verse = None
current_paragraph_number = 0
paragraph_text = ""
for line in lines:
    line = line.strip()
    if "Chapter" in line:
        current_chapter = extract_chapter(line)
        # Convert chapter number to digits
        current_chapter = convert_to_digits(current_chapter)
    elif 'Verse' in line:
        current_verse = line.strip().split(']')[0].split('[')[-1]
    elif line:
        paragraph_text += line + ' '
    else:
        if paragraph_text:
            cleaned_paragraph_text = clean_text(paragraph_text)
            if cleaned_paragraph_text:  # Check that the cleaned paragraph is not empty
                paragraphs.append({
                    'Chapter Number': current_chapter,
                    'Verse Number': current_verse,
                    'Paragraph Number': current_paragraph_number,
                    'Cleaned Paragraph Text': cleaned_paragraph_text
                })
                current_paragraph_number += 1
            paragraph_text = ""
  • Accumulate Paragraphs: The variable paragraph_text accumulates lines of text until an empty line is encountered.
  • Empty Line Detection: When an empty line is detected, the accumulated text (paragraph_text) is cleaned, and if not empty, it is added to the paragraphs list with the current chapter, verse, and paragraph number, as illustrated below.
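
After this loop runs, each entry in paragraphs is a small dictionary. An illustrative (hypothetical) entry might look like:

{
    'Chapter Number': '2',
    'Verse Number': '2.47',
    'Paragraph Number': 0,
    'Cleaned Paragraph Text': 'The essence of this verse is ...'
}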

Step 6: Identifying Chapter Number

The chapter number is identified by checking if a line contains the word “Chapter.” A regular expression is used to extract the chapter number from such lines. If a match is found, the chapter number is converted from word form to digit form.

def extract_chapter(text):
    # Define the pattern to match the format "*** Chapter"
    pattern = r'^([^ C]+) Chapter'

    # Use a regular expression to find the match
    match = re.match(pattern, text)

    # If a match is found, extract the word before "Chapter"
    if match:
        return match.group(1).strip()
    else:
        return ''
  • Pattern Matching: The regular expression r'^([^ C]+) Chapter' is used to match lines that start with a word followed by “Chapter.”
  • Extract Chapter: The matched word (the chapter number) is extracted and returned, as in the examples below.
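
For instance, with illustrative inputs:

print(extract_chapter("First Chapter"))  # -> 'First'
print(extract_chapter("Introduction"))   # -> '' (no match)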

The function convert_to_digits maps word forms of numbers to their digit equivalents (e.g., “first” to “1”).

current_chapter = convert_to_digits(current_chapter)

# Function to convert word numbers to digits
def convert_to_digits(number_text):
    word_to_digit_map = {
        'first': '1',
        'second': '2',
        'third': '3',
        'fourth': '4',
        'fifth': '5',
        'sixth': '6',
        'seventh': '7',
        'eighth': '8',
        'ninth': '9',
        'tenth': '10',
        # Add more mappings as needed
    }
    return word_to_digit_map.get(number_text.lower(), number_text)
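
For example:

print(convert_to_digits('First'))     # -> '1'
print(convert_to_digits('eleventh'))  # -> 'eleventh' (unmapped words are returned unchanged)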

Step 7: Identifying Verse Number

The verse number is identified by checking if a line contains the word “Verse.” The line is then split using split(']')[0].split('[')[-1] to extract the verse number enclosed in square brackets (included in the code block in Step 5).

elif 'Verse' in line:
    current_verse = line.strip().split(']')[0].split('[')[-1]
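
For instance, for an illustrative line like "Verse [2.47]":

line = "Verse [2.47]"
print(line.strip().split(']')[0].split('[')[-1])  # -> '2.47'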

Conclusion

The provided script effectively demonstrates how to process and clean text data extracted from a document. By carefully structuring the process, the script ensures that the information is parsed, cleaned, and formatted correctly for further analysis or storage. Key aspects of this process include:

Reading and Processing Text

The text is read from a file and processed line by line to identify chapters and verses.

Extracting Chapter and Verse Numbers

Regular expressions and string manipulation techniques are used to extract chapter and verse numbers accurately.

Cleaning and Structuring Data

The text is cleaned to remove unwanted characters and formatted into structured paragraphs, each tagged with relevant metadata (chapter number, verse number, paragraph number).

Writing Data to CSV

The cleaned and structured data is written to a CSV file, making it easy to handle and analyse using various tools and methods.
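
A minimal sketch of this step, assuming the paragraphs list built in Step 5 (the output file name is a placeholder):

import csv

# Write the structured paragraphs to a CSV file
fieldnames = ['Chapter Number', 'Verse Number', 'Paragraph Number', 'Cleaned Paragraph Text']
with open('paragraphs.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(paragraphs)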

This approach to text extraction and cleaning is not only efficient but also scalable, making it applicable to a wide range of documents with similar structures. By using Python’s powerful text processing libraries and writing clear, concise functions, we can automate the extraction of structured data from unstructured text, significantly reducing manual effort and improving accuracy. This script can be further extended or modified to accommodate different text formats and structures, demonstrating its flexibility and robustness.

Feel free to customise and expand the code to suit your specific needs. Happy coding!

Work done and blog written by my interns, Anay Kamal and Aakanksha Priya, former students of VIT Bhopal.

Kushal Shah

Now faculty at Sitare University. Studied at IIT Madras, and earlier faculty at IIT Delhi. Join my online LLM course : https://www.bekushal.com/llmcourse