Visual Document Question Answering with LayoutLM

This application performs question answering on images of documents using LayoutLM and OCR (PaddleOCR or Tesseract). Upload an image, ask a question, and get an answer!

Description:

This code implements a document question-answering system built on the LayoutLM architecture. Either of two OCR engines, PaddleOCR or Tesseract, extracts the text and its bounding boxes from the input image. The pre-trained impira/layoutlm-document-qa model and its tokenizer encode the question together with the OCR-extracted words and their spatial coordinates. The model predicts the answer span in the document by producing start and end logits, and the tokenizer decodes the selected span back into text. A Gradio interface handles user interaction: image upload, question input, OCR engine selection, and display of the answer, its confidence score, and the OCR results. The system runs on CPU and normalizes bounding boxes so the model receives consistent input.
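Two of the steps described above, bounding box normalization and answer-span selection from start and end logits, can be sketched in plain Python. This is a minimal illustration and not the project's actual code: the function names `normalize_box` and `best_span` are hypothetical, the 0-1000 coordinate grid follows the usual LayoutLM convention, and the confidence score is assumed to be the product of the softmax probabilities of the chosen start and end positions. A real pipeline would also restrict the search to document tokens and exclude the question tokens.

```python
import math
from typing import List, Sequence, Tuple


def normalize_box(box: Sequence[float], width: int, height: int) -> List[int]:
    """Scale a pixel-space (x0, y0, x1, y1) box to the 0-1000 integer grid."""
    x0, y0, x1, y1 = box

    def scale(value: float, size: int) -> int:
        # Clamp so OCR boxes that spill past the image edge stay in range.
        return max(0, min(1000, int(1000 * value / size)))

    return [scale(x0, width), scale(y0, height), scale(x1, width), scale(y1, height)]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def best_span(
    start_logits: Sequence[float],
    end_logits: Sequence[float],
    max_answer_len: int = 30,
) -> Tuple[int, int, float]:
    """Return (start, end, score) for the best span with start <= end.

    The score is an assumed confidence: P(start) * P(end).
    """
    start_probs = softmax(start_logits)
    end_probs = softmax(end_logits)
    best = (0, 0, 0.0)
    for i, p_start in enumerate(start_probs):
        # Only consider ends at or after the start, within the length limit.
        for j in range(i, min(i + max_answer_len, len(end_probs))):
            score = p_start * end_probs[j]
            if score > best[2]:
                best = (i, j, score)
    return best


# Hypothetical values, for illustration only.
box = normalize_box((50, 100, 250, 140), width=500, height=1000)
start, end, score = best_span([0.1, 5.0, 0.2, 0.1], [0.1, 0.2, 6.0, 0.3])
```

In this sketch the token indices `start` and `end` would then be passed to the tokenizer to decode the answer text, and `score` would be shown as the confidence value.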

Author: Renee Vera

Demo

Code:

Contact

LinkedIn Email

Licensed under CC BY-NC-SA 4.0
Last updated on Aug 25, 2023 00:00 UTC