Description:
This code implements a business document classification system built on DiT (Document Image Transformer), a Vision Transformer-based model fine-tuned to recognize document types. The pipeline uses Hugging Face’s AutoFeatureExtractor for image preprocessing and a pre-trained AutoModelForImageClassification to identify document categories (e.g., emails, forms). Each input image is converted to a tensor and passed through the transformer model to produce logits, and the predicted class index is mapped to a human-readable label. Inference runs on the GPU to reduce latency. A Gradio interface lets users upload document images and receive JSON-formatted predictions interactively, and built-in examples demonstrate classification across common business document types.
Author: Renee Vera
Input
Upload a document image (JPG/PNG) via the Gradio interface or use provided examples.
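A minimal sketch of how an upload interface like this might be wired up in Gradio. The `build_demo` helper, its `predict_fn` argument, and the `example_paths` list are hypothetical names used for illustration and are not taken from the original code; `gr.Interface`, `gr.Image`, and `gr.JSON` are standard Gradio components.

```python
def build_demo(predict_fn, example_paths=None):
    """Wrap a prediction function in a Gradio upload interface.

    predict_fn: hypothetical callable that takes a PIL image and returns a dict.
    example_paths: hypothetical list of sample image paths offered as examples.
    """
    # Imported lazily so the helper can be defined without Gradio installed.
    import gradio as gr

    return gr.Interface(
        fn=predict_fn,
        inputs=gr.Image(type="pil"),  # uploaded JPG/PNG is passed in as a PIL image
        outputs=gr.JSON(),            # the returned dict is rendered as JSON
        examples=example_paths,       # None means no example gallery is shown
    )
```

Under these assumptions, `build_demo(my_predict).launch()` would start the local web app, where `my_predict` stands in for whatever function performs the classification.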
Processing
The image is preprocessed with DiT-specific feature extraction, the transformer model runs GPU-accelerated inference, and the resulting class scores are mapped to predefined document categories.
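The three processing steps could look roughly like the sketch below. It assumes the publicly available `microsoft/dit-base-finetuned-rvlcdip` checkpoint, since the text does not name the one actually used, and the function names `load_classifier` and `classify` are hypothetical. The fallback to CPU when no GPU is present is also an assumption. Recent `transformers` releases recommend `AutoImageProcessor` over `AutoFeatureExtractor`; the latter is kept here to match the description.

```python
# Assumption: a public DiT checkpoint fine-tuned for document classification.
# The original description does not say which checkpoint it loads.
ASSUMED_CHECKPOINT = "microsoft/dit-base-finetuned-rvlcdip"


def load_classifier(checkpoint=ASSUMED_CHECKPOINT):
    """Load the feature extractor and model, placing the model on GPU if present."""
    import torch
    from transformers import AutoFeatureExtractor, AutoModelForImageClassification

    device = "cuda" if torch.cuda.is_available() else "cpu"
    extractor = AutoFeatureExtractor.from_pretrained(checkpoint)
    model = AutoModelForImageClassification.from_pretrained(checkpoint)
    model.to(device).eval()  # inference mode: disables dropout
    return extractor, model, device


def classify(image, extractor, model, device):
    """Return the predicted document label for a PIL image."""
    import torch

    # Step 1: preprocess (resize, normalize) into a batched tensor on the device.
    inputs = extractor(images=image.convert("RGB"), return_tensors="pt").to(device)
    # Step 2: forward pass without gradient tracking.
    with torch.no_grad():
        logits = model(**inputs).logits
    # Step 3: map the highest-scoring class index to its label.
    class_idx = int(logits.argmax(dim=-1).item())
    return model.config.id2label[class_idx]
```

Loading the model once and reusing it across calls avoids paying the checkpoint load cost on every uploaded image.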
Output
JSON result showing the predicted document type (e.g., “email”, “form”).
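As an illustration of how raw model scores might become the JSON result, the sketch below uses only the standard library. The function name `format_prediction`, the JSON keys `label` and `confidence`, and the two-entry label map are assumptions made for this example; the description only states that the predicted document type is returned, so the confidence field is an illustrative extra.

```python
import json
import math


def format_prediction(logits, id2label):
    """Turn a list of raw logits into a JSON string naming the top class."""
    # Softmax, subtracting the maximum first for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Index of the most probable class.
    best = max(range(len(probs)), key=probs.__getitem__)
    return json.dumps({"label": id2label[best], "confidence": round(probs[best], 4)})


# Illustrative two-class mapping only; a real model defines its own categories.
ID2LABEL = {0: "email", 1: "form"}
print(format_prediction([2.0, 0.5], ID2LABEL))
```

Because softmax preserves the ordering of the logits, the chosen label is the same one a plain argmax over the logits would give; the softmax step only adds a probability to report alongside it.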
