A template-independent approach for information extraction in real estate documents

Ital-IA 2023 · Nicola Landro, Gabriele Destro, Stefano Taverni, Ignazio Gallo

Corporations handle large volumes of unstructured data every day, such as PDFs and websites. Recent advances in deep learning help extract insight from this unstructured information. New models leverage the Transformer architecture to perform natural language understanding tasks on such data, either by jointly using the raw image and its text content or by working directly on the image without OCR. We propose an extraction pipeline that employs question-answering models to extract information from unstructured data, allowing fast and efficient information retrieval from different sources. We show an application of this technique to a specific set of documents and explain how the infrastructure can scale to different types of records. Our solution can handle large document corpora robustly, helping corporations make full use of their data.
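
The idea of question-answering-based extraction can be illustrated with a minimal sketch. This is not the authors' implementation: the names `extract_fields`, `Extraction`, and `keyword_stub_model`, the example questions, and the `min_score` threshold are all hypothetical. The QA model is treated as any callable that maps a question and a context to an answer and a confidence score, and a toy keyword lookup stands in for a real Transformer model so the example runs without external dependencies.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

# Assumption: a QA model is any callable (question, context) -> (answer, score).
QAModel = Callable[[str, str], Tuple[str, float]]


@dataclass
class Extraction:
    field: str
    answer: Optional[str]  # None when the model is not confident enough
    score: float


def extract_fields(
    document_text: str,
    questions: Dict[str, str],
    qa_model: QAModel,
    min_score: float = 0.5,  # hypothetical confidence threshold
) -> Dict[str, Extraction]:
    """Ask one question per target field and keep only confident answers."""
    results: Dict[str, Extraction] = {}
    for field, question in questions.items():
        answer, score = qa_model(question, document_text)
        answer = answer.strip()
        if score < min_score or not answer:
            results[field] = Extraction(field, None, score)
        else:
            results[field] = Extraction(field, answer, score)
    return results


def keyword_stub_model(question: str, context: str) -> Tuple[str, float]:
    """Toy stand-in for a real QA model: looks up 'Label: value' lines.

    The last word of the question is used as the label to search for.
    """
    keyword = question.rstrip("?").split()[-1].lower()
    for line in context.splitlines():
        label, sep, value = line.partition(":")
        if sep and label.strip().lower() == keyword:
            return value.strip(), 1.0
    return "", 0.0


if __name__ == "__main__":
    document = "Address: Via Roma 1, Varese\nPrice: 250000 EUR\n"
    questions = {
        "address": "What is the address?",
        "price": "What is the price?",
        "surface": "What is the surface?",
    }
    for item in extract_fields(document, questions, keyword_stub_model).values():
        print(item)
```

Because the fields are defined by questions rather than by positions in a layout, the same `questions` dictionary can in principle be reused across documents with different templates; supporting a new document type would mainly mean writing a new set of questions.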
