Digitizing Historical Documents: Simple OCR with Tesseract

Vanessa Pacheco
3 min read · Jun 27, 2024


In this article, we’ll walk through a simple script to extract words from a PDF file. This process is called OCR, which stands for Optical Character Recognition.

I believe historians can greatly benefit from Python for accessing content in PDFs, because it can transform image-based PDFs into searchable text and improve the text extracted from already-searchable PDFs.

Any historian will appreciate how much this helps when doing a first exploratory analysis of a text, looking for specific concepts, or simply quoting. In this way, Python can be a valuable tool for historians.

For this article, I chose documents that are likely to be accessible to historians researching in digital repositories.

This is the first article in a three-part series. Here, let’s focus on a simple case; from there, we will be able to progress to more advanced parts (Part II and Part III coming soon).

For these purposes, I selected a PDF with good-quality text: it has searchable characters and yields good text extraction. The example used here is “Diario Político” by José Victorino Lastarria[1].

The goal is to fix the misspellings that appear when text is extracted directly from the original PDF. Below is an example of these errors alongside a sample of the final results:

(1) inaccurate OCR; (2) more accurate OCR with Tesseract

Access the notebook here: GitHub repository.

# Import pdf2image to convert pdf pages to images
# Import scikit-image modules to improve image quality
# Import matplotlib to visualize the data
# Import pytesseract to run Tesseract and unidecode to normalize text

import os

import matplotlib.pyplot as plt
import numpy as np
import pytesseract
from pdf2image import convert_from_path
from PIL import Image
from skimage import color
from skimage.filters import threshold_otsu
from unidecode import unidecode

# Indicate the path to the pdf file
PATH = "~diario_politico_lastarria.pdf"

# Convert PDF pages to PIL images
images = convert_from_path(PATH)

print(f'This PDF has {len(images)} pages in total.')

# Filter pages without text
# Here: pages 0-11 and 169 to the end
# In general you will want to exclude initial and final pages
filtered_pages = list(range(12)) + list(range(169, len(images)))

print(f'Excluded pages: {filtered_pages}.')

# Remove those pages (iterate over a copy so removal is safe)
for n, img in enumerate(images.copy()):
    if n in filtered_pages:
        images.remove(img)

print(f'There are {len(images)} pages with text.')
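As a sanity check on the exclusion logic, here is a toy sketch where integers stand in for page images; the cutoffs 12 and 169 from the document are scaled down to 3 and 17 for a 20-page example:

```python
# Toy stand-in for the list of page images: 20 integers
pages = list(range(20))

# Exclude the first 3 "pages" and everything from index 17 onward
filtered_pages = list(range(3)) + list(range(17, len(pages)))

# Keep only the pages whose index is not in the exclusion list
kept = [p for n, p in enumerate(pages) if n not in filtered_pages]

print(len(kept))  # 14 pages remain: indices 3 through 16
```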

# Create a list to hold the preprocessed images
images_list = []

# Preprocess the images
for n, img in enumerate(images):

    # First transform the image to a numpy array
    img_array = np.array(img)

    # Roughly crop the page closer to the text area
    img_cropped = img_array[200:3100, :]

    # Transform the image to grayscale
    gray_image = color.rgb2gray(img_cropped)

    # Create a binary image with Otsu's threshold
    thresh = threshold_otsu(gray_image)
    binary = gray_image > thresh

    # Add the preprocessed image to the list
    images_list.append(binary)

    # Visualize the first page
    if n == 0:
        preview = Image.fromarray((binary * 255).astype('uint8'))
        plt.imshow(preview, cmap='gray')
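To build intuition for the Otsu step, here is a minimal, self-contained sketch on a synthetic “page” whose values are invented for illustration: a dark band (standing in for a line of text) on a light background. Otsu’s method picks a threshold between the two intensity clusters, so the binarization cleanly separates “ink” from background:

```python
import numpy as np
from skimage.filters import threshold_otsu

# Synthetic grayscale "page": light background (0.9) with one dark band (0.1)
page = np.full((100, 100), 0.9)
page[40:60, 10:90] = 0.1

# Otsu picks a threshold separating the two intensity clusters
thresh = threshold_otsu(page)
binary = page > thresh  # True = background, False = "ink"

print(0.1 < thresh < 0.9)       # True
print(round(binary.mean(), 2))  # 0.84 -> 84% of pixels are background
```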

# Set TESSDATA_PREFIX environment variable
os.environ['TESSDATA_PREFIX'] = '~/tesseract-ocr/4.00/tessdata'


# Performing OCR with Tesseract

# Define a function to extract text and normalize it
def extract_text(image, n):

    # Custom config: OEM 3 = default engine, PSM 6 = assume a single
    # uniform block of text
    custom_config = r'--oem 3 --psm 6'

    # Convert the numpy array back to a PIL image
    pil_image = Image.fromarray((image * 255).astype('uint8'))

    # Save the image
    pil_image.save(os.path.join('/content/drive/MyDrive/OCR_project/images_from_pdf', f'image_{n}.png'))

    # Perform OCR with the Spanish language model
    text = pytesseract.image_to_string(pil_image, lang='spa', config=custom_config)

    # Normalize the text to remove accents
    normalized_text = unidecode(text)

    return normalized_text
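If unidecode is not installed, the accent removal can be sketched with the standard library’s unicodedata module instead; this is a simplified alternative, not the approach used above:

```python
import unicodedata

def strip_accents(text):
    # NFD decomposition splits "é" into "e" + a combining accent mark,
    # then we drop the combining marks
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("Diario Político de José Victorino Lastarria"))
# → Diario Politico de Jose Victorino Lastarria
```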

# Use the pdf's name to create a txt file with the same name
new_file_name = os.path.splitext(os.path.basename(PATH))[0] + ".txt"
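A portable way to check this renaming step is with `os.path`, tried here on two hypothetical paths (both invented for illustration):

```python
import os

# Two hypothetical input paths
paths = ["diario_politico_lastarria.pdf", "/content/drive/MyDrive/report.pdf"]

# Keep only the base name, swap the extension for .txt
txt_names = [os.path.splitext(os.path.basename(p))[0] + ".txt" for p in paths]

print(txt_names)  # ['diario_politico_lastarria.txt', 'report.txt']
```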

# Extract text from each enhanced image and save to a file
with open(new_file_name, 'a', encoding='utf-8') as f:
    for n, img in enumerate(images_list):
        text = extract_text(img, n)
        print(text, file=f)

print(f"OCR processing complete. Results saved to {new_file_name}.")


Vanessa Pacheco

I'm a graphic designer and brand strategist, and now I'm excited to explore engineering-generated illustrations.