Documents that combine text, images, tables, code and other elements in complex layouts are often saved digitally in image format. Analyzing such document images and extracting useful information from them is performed with the help of machine learning. This supervised task is termed Document Image Analysis (DIA). The popular DIA tasks in practical use include:
- Document Image Classification
- Layout Detection
- Table Detection
- Scene Text Detection
- Character Recognition
Task-specific applications such as OCR (Optical Character Recognition) have been in real-world use for decades. However, a library that provides all DIA tasks in one place has become an important need for the document analysis community, including historical researchers and social science analysts. For instance, a screenshot image of an old newspaper page may contain research-relevant content in the form of tables, charts, text and photographs. An OCR reader can extract the text but cannot read the other information. Moreover, an OCR reader may fail to recognize the text layouts and mix text from different layouts in its output. Separate methods are then required to extract information from tables, charts and so on.
The evolution of deep learning-based convolutional neural networks has begun to address the need for an integrated Document Image Analysis system. However, practical implementation of recent successful deep learning models faces some challenges. High-level DIA parameters are not always explicitly processed by deep learning frameworks, which makes customization of pre-trained models difficult. Popular models are trained on a particular set of annotated document images, but documents follow no common template or format and are limited only by human creativity. A custom implementation of a popular model therefore requires collecting task-specific annotated document images, preprocessing them according to the model requirements, and fine-tuning the model with those images. Since the deep learning part and the DIA part are usually developed separately, such customized fine-tuning becomes difficult, tedious, and time-consuming.
To this end, Zejiang Shen of the Allen Institute for AI, Ruochen Zhang of Brown University, Melissa Dell and Jacob Carlson of Harvard University, Benjamin Charles Germain Lee of the University of Washington, and Weining Li of the University of Waterloo have introduced LayoutParser, a Python library for Document Image Analysis. The library ships a Model Zoo, a large collection of pre-trained deep learning models with an off-the-shelf implementation strategy, and a unified architecture for adapting any DIA model. Apart from the pre-trained models, LayoutParser provides tools for customization and fine-tuning as needed. Further, data preparation tools for tasks such as document image annotation and data preprocessing are readily available in the library. The library aims at distributing quality models and pipelines with reproducibility, reusability and extensibility through a continuously improving community platform.
LayoutParser supports the following DIA usage patterns:
- It receives document images as input, offers off-the-shelf tools for any DIA task, performs the tasks in order and yields the output.
- It receives unannotated document images. It provides tools for efficient annotation of layouts and other parts of a document image.
- It supports efficient custom training for user-specific tasks. Once trained, the model can be employed for inference.
- It offers tools for visualization and storage of data, models, weights and checkpoints.
- It provides community sharing, distribution, and documentation.
To store a layout in memory and retrieve it later, LayoutParser offers unified data structures. The three key components of the LayoutParser data structure are Coordinate, TextBlock, and Layout, and the library defines a set of operations that work uniformly on them.
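As a minimal sketch of how these three components fit together (the coordinates, text and score below are made-up values for illustration):

import layoutparser as lp

# Coordinate: a rectangular region given by its top-left and bottom-right corners
rect = lp.Rectangle(x_1=50, y_1=50, x_2=200, y_2=100)

# TextBlock: couples a coordinate with the text it contains and metadata
block = lp.TextBlock(rect, text="Sample Title", type="Title", score=0.98)

# Layout: a list-like container of blocks
layout = lp.Layout([block])

# the library-defined operations work uniformly on these structures
padded = block.pad(left=5, right=5, top=5, bottom=5)  # grow the region
titles = [b for b in layout if b.type == "Title"]     # filter by label
print(layout.get_texts())                             # ['Sample Title']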
We discuss the code implementation and two practical applications of the library in the following sections.
Layout Detection in a Document Image
Install the LayoutParser library and its dependencies; Detectron2 is installed directly from its GitHub repository.
%%bash
pip install -U layoutparser
# install detectron2
pip install 'git+https://github.com/facebookresearch/detectron2.git@v0.4#egg=detectron2'
# install OCR module
pip install layoutparser[ocr]
Import the libraries and modules.
import layoutparser as lp
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline
import cv2
Deploy a pre-trained Detectron2 model configured for layout parsing.
model = lp.Detectron2LayoutModel(
    'lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config',          # config from the LayoutParser model zoo
    extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.8],  # keep detections with confidence above 0.8
    label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"}  # PubLayNet class ids to names
)
Now the model is ready for inference. Download the source files from the official repository to obtain a sample image to run inference on.
!git clone https://github.com/Layout-Parser/layout-parser.git
Output:
Change directory to read the example data.
%cd /content/layout-parser/examples/data/
!ls -p
Output:
Read ‘paper-image.jpg’ and display it.
img = cv2.imread("/content/layout-parser/examples/data/paper-image.jpg")
# convert BGR image into RGB format
image = img[..., ::-1]
# display image
plt.figure(figsize=(12, 16))
plt.imshow(image)
plt.xticks([])
plt.yticks([])
plt.show()
Output:
Predict the layouts in the above image using the pre-trained model.
layout = model.detect(image)
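The result is a Layout of TextBlock objects, so the unified data structures and operations described earlier apply to it directly. As a minimal sketch of inspecting the predictions:

# keep only the blocks predicted as body text
text_blocks = lp.Layout([b for b in layout if b.type == "Text"])

# print each block's label, bounding box and confidence score
for block in text_blocks:
    print(block.type, block.coordinates, block.score)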
Display the image with predicted layouts over it.
lp.draw_box(image, layout, box_width=3)
Output:
This Colab Notebook contains the above example code implementations.
OCR from Table Document Image
Install LayoutParser and its dependencies. In addition, install an OCR engine; here, we use the Tesseract OCR engine to recognize text and its location.
%%bash
pip install -U layoutparser
pip install layoutparser[ocr]
sudo apt install tesseract-ocr
sudo apt install libtesseract-dev
Import the necessary libraries and modules.
import layoutparser as lp
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline
import cv2
Load the Tesseract OCR agent, which wraps the engine installed above.
model = lp.TesseractAgent()
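With no arguments, the agent defaults to English. If the document is in another language, a Tesseract language code can be passed instead; a variant sketch, assuming the corresponding Tesseract language pack (here French) is installed:

# recognize French text instead of the default English
model = lp.TesseractAgent(languages='fra')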
Prepare the example data. Download the source files from the official repository and change directory to the example images path.
!git clone https://github.com/Layout-Parser/layout-parser.git
%cd /content/layout-parser/examples/data/
!ls -p
Read the image and display it to have an idea of how it looks.
image = cv2.imread('example-table.jpeg')
# display image
plt.figure(figsize=(12, 16))
plt.imshow(image)
plt.xticks([])
plt.yticks([])
plt.show()
Output:
Detect text with the OCR engine. Collect the recognized text along with its bounding box details for plotting and post-processing.
res = model.detect(image, return_response=True)
# collect text and its bounding boxes at the word level
# (lp.TesseractFeatureType(4) corresponds to word-level features)
ocr = model.gather_data(res, lp.TesseractFeatureType(4))
Plot the original image along with bounding boxes on recognized texts.
lp.draw_text(image, ocr, font_size=12, with_box_on_text=True, text_box_width=1)
Output:
In the output, the recognized texts are redrawn with the engine-specified font and size, and the boxes align with the original text locations. Thus the system has recognized the texts and their positions precisely. Further, we can post-process these texts in a column-wise or row-wise manner as per need.
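As a minimal sketch of such row-wise post-processing using the word coordinates gathered above (the 10-pixel row tolerance is an assumed value to be tuned per image):

# group the recognized words into rows by their vertical centres
rows = {}
for word in ocr:
    y_center = (word.coordinates[1] + word.coordinates[3]) / 2
    key = round(y_center / 10)          # words within ~10 px share a row
    rows.setdefault(key, []).append(word)

# print each row left to right, top to bottom
for key in sorted(rows):
    line = sorted(rows[key], key=lambda w: w.coordinates[0])
    print(" ".join(w.text for w in line))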
This Colab Notebook contains the above example code implementations.
Wrapping Up
In this article, we have discussed the open-source LayoutParser library, its architecture and its capabilities. Further, we walked through two practical Document Image Analysis use cases with hands-on Python code. With more models being added in the near future, LayoutParser is set to earn a prominent place in Document Image Analysis.