Digitizing Historical Documents: Simple OCR with Tesseract

Vanessa Pacheco
Jun 27, 2024


In this article, we’ll discuss how to use a short piece of code to extract text from a PDF file. This process is called OCR, which stands for Optical Character Recognition.

I believe historians can benefit greatly from Python when accessing content in PDFs, because it can transform image-based PDFs into searchable text and improve the text obtained from PDFs that are already searchable.

Historians will recognize how much this helps with a first exploratory analysis of a text, whether searching for specific concepts or simply quoting passages. In this way, Python can be a valuable tool for historians.

For this article, I chose documents that are likely to be accessible to historians researching in digital repositories.

This is the first article in a series, with two more parts to follow. Here we focus on a simple case, which we will build on in the more advanced parts (Part II and Part III, coming soon).

For this purpose, I selected a PDF of the first type, one with good-quality text: it already has searchable characters and yields good text extraction. The example used here is “Diario Politico” by Jose Victorino Lastarria[1].

The goal is to reduce the misspellings that appear when the text is read directly from the original PDF. Below is an example of these errors and a sample of the final results:

[Figure: (1) inaccurate text read directly from the PDF; (2) more accurate OCR with Tesseract]

The full notebook is available in the GitHub repository.

# Import module to convert pdf to image
# Import scikit-image modules to improve image quality
# Import matplotlib to visualize your data
# Import numpy, PIL, pytesseract and unidecode for the later steps

import os

import matplotlib.pyplot as plt
import numpy as np
import pytesseract
from pdf2image import convert_from_path
from PIL import Image
from skimage import color
from skimage.filters import threshold_otsu
from unidecode import unidecode

# Indicate the path to the pdf file
PATH = "~diario_politico_lastarria.pdf"

# Convert PDF pages to PIL images
images = convert_from_path(PATH)

print(f'This pdf has {len(images)} pages in total.' )

# Filter pages without text
# Here: pages 0-11 and 169 to the end (0-based indices)
# In general you will want to exclude initial and final pages
filtered_pages = list(range(12)) + list(range(169, len(images)))

print(f'Excluded pages: {filtered_pages}.')

# Remove those pages by index
images = [img for n, img in enumerate(images) if n not in filtered_pages]

print(f'There are {len(images)} pages with text.')
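When the text occupies one contiguous run of pages, as it does here, the same filter can be written as a slice. The following is a minimal sketch under that assumption; `select_text_pages` and its parameter names are hypothetical and not part of the notebook, and the list of integers only stands in for the list of page images.

```python
def select_text_pages(pages, first_text_page, end_text_page):
    """Keep pages[first_text_page:end_text_page] (end index is exclusive).

    Indices are 0-based, matching the list returned by convert_from_path.
    """
    if not 0 <= first_text_page <= end_text_page <= len(pages):
        raise ValueError("page range is outside the document")
    return pages[first_text_page:end_text_page]


# Stand-in for a list of page images: 200 dummy pages labelled by index
dummy_pages = list(range(200))

# Same selection as in the article: drop pages 0-11 and 169 onwards
text_pages = select_text_pages(dummy_pages, 12, 169)
print(len(text_pages), text_pages[0], text_pages[-1])
```

A slice avoids building the list of excluded indices, but it only works when the pages to keep are consecutive.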

# Create a list to hold the preprocessed images
images_list = []

# Preprocess the images
for n, img in enumerate(images):

    # First transform the image to a numpy array
    img_array = np.array(img)

    # Roughly crop the page closer to the text area
    # (these row limits depend on the resolution of the page images)
    img_cropped = img_array[200:3100, :]

    # Transform the image to grayscale
    gray_image = color.rgb2gray(img_cropped)

    # Create a binary image with Otsu's threshold
    thresh = threshold_otsu(gray_image)
    binary = gray_image > thresh

    # Add the preprocessed image to the list
    images_list.append(binary)

    # Visualize the first page
    if n == 0:
        preview = Image.fromarray((binary * 255).astype('uint8'))
        plt.imshow(preview, cmap='gray')
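`threshold_otsu` picks the gray level that best separates dark ink from light paper. To make that step less of a black box, here is a simplified pure-Python sketch of Otsu's method. It is an illustration only, not scikit-image's implementation: it assumes integer pixel values from 0 to 255 (the grayscale image in the loop above holds floats between 0 and 1), and `otsu_threshold` is a hypothetical name.

```python
def otsu_threshold(pixels):
    """Return the level t (0-255) that maximizes the between-class variance.

    Pixels <= t form the dark class, pixels > t form the light class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in the dark class so far
    sum0 = 0    # sum of gray levels in the dark class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue   # dark class still empty
        if w1 == 0:
            break      # light class empty, nothing left to separate
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


# Toy "page": 100 dark pixels (ink) and 100 light pixels (paper)
pixels = [20] * 50 + [30] * 50 + [200] * 50 + [220] * 50
t = otsu_threshold(pixels)
binary = [p > t for p in pixels]
print(t, sum(binary))
```

The comparison `p > t` plays the same role as `gray_image > thresh` in the loop above: light pixels become `True` and dark pixels become `False`.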

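# Set TESSDATA_PREFIX environment variable
# (expanduser resolves "~", which is not expanded inside environment variables)
os.environ['TESSDATA_PREFIX'] = os.path.expanduser('~/tesseract-ocr/4.00/tessdata')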
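The OCR call in this article uses `lang='spa'`, so Tesseract needs a `spa.traineddata` file in the tessdata folder. A quick check before running OCR could look like the sketch below. `has_language_data` is a hypothetical helper, and the temporary folder with an empty placeholder file only stands in for a real tessdata directory.

```python
import os
import tempfile


def has_language_data(tessdata_dir, lang="spa"):
    """Return True if `<lang>.traineddata` exists in `tessdata_dir`."""
    return os.path.isfile(os.path.join(tessdata_dir, f"{lang}.traineddata"))


# Demonstration on a temporary folder containing an empty placeholder file
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "spa.traineddata"), "w").close()
    found_spa = has_language_data(tmp, "spa")
    found_deu = has_language_data(tmp, "deu")

print(found_spa, found_deu)
```

In the notebook, the folder to check would be the value stored in `os.environ['TESSDATA_PREFIX']`.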


# Performing OCR with Tesseract

# Define a function to extract text and normalize it
def extract_text(image, n):

    # Define custom config for tesseract:
    # --oem 3 uses the default OCR engine, --psm 6 assumes a single uniform block of text
    custom_config = r'--oem 3 --psm 6'

    # Convert numpy array back to PIL image
    pil_image = Image.fromarray((image * 255).astype('uint8'))

    # Save the image
    output_dir = '/content/drive/MyDrive/OCR_project/images_from_pdf'
    pil_image.save(os.path.join(output_dir, f'image_{n}.png'))

    # Perform OCR
    text = pytesseract.image_to_string(pil_image, lang='spa', config=custom_config)

    # Normalize text to remove accents
    # (unidecode's `errors` argument only accepts 'ignore', 'strict', 'replace' or 'preserve')
    normalized_text = unidecode(text)

    return normalized_text
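To see what the accent-removal step does, here is a small sketch that uses only the standard library instead of `unidecode`. It is a swapped-in technique, not the function used above: it only drops diacritics from Latin letters, whereas `unidecode` also transliterates other scripts. `strip_accents` is a hypothetical name and the sample words are chosen just for illustration.

```python
import unicodedata


def strip_accents(text):
    """Remove diacritics by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


sample_words = "política, constitución, año"
plain_words = strip_accents(sample_words)
print(plain_words)
```

Removing accents makes keyword searches more forgiving of OCR mistakes, at the cost of no longer distinguishing pairs such as "año" and "ano".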

# Use the PDF's name to create a txt file with the same name
new_file_name = os.path.splitext(os.path.basename(PATH))[0] + ".txt"

# Extract text from each enhanced image and save to a file
# (mode 'a' appends, so delete the file before re-running to avoid duplicated text)
with open(new_file_name, 'a', encoding='utf-8') as f:
    for n, img in enumerate(images_list):
        text = extract_text(img, n)
        print(text, file=f)

print(f"OCR processing complete. Results saved to {new_file_name}")
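Once the text is saved, the exploratory search mentioned at the start of the article becomes a few lines of code. The sketch below shows a simple keyword-in-context search. `keyword_in_context` is a hypothetical helper, and the sample sentences are invented for illustration rather than taken from the book.

```python
import re


def keyword_in_context(text, keyword, width=40):
    """Return a snippet of `text` around each case-insensitive match of `keyword`."""
    snippets = []
    for match in re.finditer(re.escape(keyword), text, flags=re.IGNORECASE):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        # Collapse line breaks so each snippet prints on one line
        snippets.append(" ".join(text[start:end].split()))
    return snippets


# Invented sample text, standing in for the contents of the saved txt file
sample = "El congreso discutio la constitucion.\nLa Constitucion fue reformada."
hits = keyword_in_context(sample, "constitucion", width=15)
for snippet in hits:
    print(snippet)
```

With the real output, `sample` would be replaced by the text read from the file named in `new_file_name`.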

Written by Vanessa Pacheco

I'm a graphic designer and brand strategist, and now I'm excited to explore engineering-generated illustrations.
