A template-independent approach for information extraction in real estate documents

Businesses handle large volumes of unstructured data every day, such as PDFs and web pages. Recent advances in deep learning make it possible to extract insight from this unstructured information. New models build on the Transformer architecture to perform natural language understanding tasks on such data, either jointly using the document image and its text content or working directly on the image without OCR. We propose an extraction pipeline that uses question-answering models to retrieve information from unstructured data quickly and efficiently across different sources. We show an application of this technique to a specific set of documents and describe how the infrastructure can scale to other types of records. Our solution robustly handles large document corpora, helping corporations make full use of their data.
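The general idea of template-independent extraction through question answering can be illustrated with a minimal sketch. This is not the authors' implementation: the names `extract_fields`, `Extraction`, `toy_qa_model`, the field-to-question mapping, and the `min_score` threshold are all assumptions made for illustration, and the toy model stands in for a real Transformer-based QA model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

# Hypothetical interface: a QA model maps (question, document text)
# to (answer text, confidence score in [0, 1]).
QAModel = Callable[[str, str], Tuple[str, float]]


@dataclass
class Extraction:
    field: str
    value: Optional[str]  # None when no sufficiently confident answer was found
    score: float


def extract_fields(
    text: str,
    questions: Dict[str, str],
    qa_model: QAModel,
    min_score: float = 0.5,  # assumed threshold, not taken from the paper
) -> Dict[str, Extraction]:
    """Ask one question per target field and keep only confident answers."""
    results: Dict[str, Extraction] = {}
    for field, question in questions.items():
        answer, score = qa_model(question, text)
        value = answer if score >= min_score and answer.strip() else None
        results[field] = Extraction(field, value, score)
    return results


def toy_qa_model(question: str, context: str) -> Tuple[str, float]:
    """Stand-in for a real QA model: a keyword lookup over labelled lines."""
    labels = {"price": "Price", "address": "Address"}
    for keyword, label in labels.items():
        if keyword in question.lower():
            for line in context.splitlines():
                if line.startswith(label + ":"):
                    return line.split(":", 1)[1].strip(), 0.9
    return "", 0.0


if __name__ == "__main__":
    # Illustrative document text, invented for this example.
    doc = "Address: 12 Example Street\nPrice: 250000 EUR"
    questions = {
        "address": "What is the address of the property?",
        "price": "What is the price of the property?",
        "owner": "Who is the owner?",
    }
    for item in extract_fields(doc, questions, toy_qa_model).values():
        print(item)
```

Because each field is defined by a natural-language question rather than a position in a layout, supporting a new document type in this sketch only means supplying a different `questions` mapping.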
