TINA — TypeDeterminator Identifier & Nomenclature Assistant
A visual document classification system that identifies Energy Performance Certificates (EPCs) and their country of origin.
Overview
TINA is a lightweight computer vision module that classifies scanned PDF documents, detects whether a document is an EPC, and identifies its country of origin (Germany, France, or Austria). Unlike traditional OCR-based approaches, it relies entirely on visual structure recognition, which makes it robust to noisy scans and multilingual content.
Approach
The module processes PDFs as images, extracting layout-based features using a pretrained CNN. It then matches each document to known EPC templates through visual similarity and country-specific validation. The output is a structured JSON summary used for automated routing within backend systems.
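The template-matching step can be sketched as follows. This is a minimal illustration under stated assumptions, not the module's actual code: the feature vectors are assumed to have already been produced by the pretrained CNN, and `TEMPLATES`, `classify`, the `0.8` threshold, and the JSON field names are all hypothetical.

```python
import json
import math

# Hypothetical template embeddings, one per supported country.
# In a real pipeline these would be CNN features of reference EPC pages.
TEMPLATES = {
    "DE": [0.9, 0.1, 0.0],
    "FR": [0.1, 0.9, 0.1],
    "AT": [0.0, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(features, threshold=0.8):
    """Match a document's features to the closest template.

    The threshold is an assumed value; below it the document is
    treated as "not an EPC" and no country is reported.
    """
    scores = {c: cosine_similarity(features, t) for c, t in TEMPLATES.items()}
    best = max(scores, key=scores.get)
    is_epc = scores[best] >= threshold
    return {
        "is_epc": is_epc,
        "country": best if is_epc else None,
        "similarity": round(scores[best], 3),
    }


if __name__ == "__main__":
    # Emit the structured JSON summary for one (made-up) feature vector.
    print(json.dumps(classify([0.85, 0.15, 0.05])))
```

In this sketch the similarity score only proposes a candidate country; the country-specific validation mentioned above would run afterwards as a separate check.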
Key Highlights
- Visual classification that replaces fragile OCR/text-based pipelines.
- Lightweight CNN feature extraction for scalable, high-throughput inference.
- Country-specific validation through handcrafted visual cues.
- Cross-platform deployment with API integration.
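One way to organize country-specific validation is a small registry of per-country check functions. The sketch below only illustrates that pattern and is not the module's actual implementation: the cue names (`band_width_ratio`, `stacked_bar_count`), their thresholds, and the helpers `register` and `validate` are invented placeholders, and the cue values are assumed to have been measured from the page image beforehand.

```python
from typing import Callable, Dict

# A validator receives measured visual cues and returns True when they
# are consistent with that country's certificate layout.
Cues = Dict[str, float]
Validator = Callable[[Cues], bool]

VALIDATORS: Dict[str, Validator] = {}


def register(country: str) -> Callable[[Validator], Validator]:
    """Decorator that adds a validator to the registry under a country code."""
    def decorator(func: Validator) -> Validator:
        VALIDATORS[country] = func
        return func
    return decorator


@register("DE")
def validate_de(cues: Cues) -> bool:
    # Placeholder cue and threshold, for illustration only.
    return cues.get("band_width_ratio", 0.0) > 0.7


@register("FR")
def validate_fr(cues: Cues) -> bool:
    # Placeholder cue and threshold, for illustration only.
    return cues.get("stacked_bar_count", 0) >= 5


def validate(country: str, cues: Cues) -> bool:
    """Return True only if a validator exists for the country and accepts the cues."""
    validator = VALIDATORS.get(country)
    return validator is not None and validator(cues)
```

With this layout, supporting a new certificate format means registering one more function; the rest of the pipeline stays unchanged.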
My Role
- Designed and implemented the full visual classification pipeline.
- Integrated pretrained CNNs for feature extraction and similarity scoring.
- Developed modular country-level detectors using OpenCV and scikit-learn.
- Engineered the API integration and optimized the pipeline for production deployment.
Impact
- Improved classification accuracy and robustness over the OCR/text-based pipeline it replaced.
- Reduced false positives in noisy and multilingual EPC documents.
- Delivered a modular, production-ready component adaptable to new formats.
Learnings
- Applied computer vision for document layout analysis.
- Balanced model performance with deployment efficiency.
- Strengthened practical skills in Python, TensorFlow, and OpenCV.