ChemNLP: A Natural Language Processing based Library for Materials Chemistry Text Data

17 Sep 2022  ·  Kamal Choudhary, Mathew L. Kelley

Natural language processing (NLP) has immense potential to aid materials design processes. While there have been several advancements in this field, a complete and integrated framework with well-curated datasets and tools to apply NLP is still needed. In this work, we present the ChemNLP library and an accompanying web-app that can be used to analyze important materials chemistry information. We use the publicly available arXiv dataset, which has been collected over 34 years and contains ~1.8 million articles. First, we analyze the article publication trend, categorizations, and common phrases in the arXiv dataset. Then, we develop a user-friendly, interactive web-app to retrieve articles for a given chemical compound. Furthermore, we demonstrate the effectiveness of the proposed framework in accelerating the identification of superconducting materials. We determine the overlap between density functional theory and text-based databases for superconductors. Finally, we perform machine-learning-based clustering and classification tasks to quickly categorize scholarly articles given article title information, with an accuracy of up to 81.2%. ChemNLP is available at the websites: and
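
The idea of retrieving articles for a given chemical compound can be illustrated with a minimal sketch. This is not ChemNLP's API or the authors' method: the function name `find_articles`, the record layout, and the article titles are all hypothetical, and the matching rule (the formula must appear as a whole token in the title) is an assumption chosen for simplicity.

```python
import re

# Toy records standing in for arXiv metadata.
# The ids and titles are invented for illustration only.
ARTICLES = [
    {"id": "a1", "title": "Superconductivity in MgB2 thin films"},
    {"id": "a2", "title": "Electronic structure of MgB2C2 from first principles"},
    {"id": "a3", "title": "Phonon dispersion in graphene"},
    {"id": "a4", "title": "Pressure effects on MgB2 and related borides"},
]


def find_articles(formula, articles):
    """Return ids of articles whose title contains `formula` as a whole token.

    Hypothetical helper: the lookarounds reject matches that are embedded in
    a longer alphanumeric string, so "MgB2" does not match "MgB2C2".
    """
    pattern = re.compile(
        r"(?<![A-Za-z0-9])" + re.escape(formula) + r"(?![A-Za-z0-9])"
    )
    return [a["id"] for a in articles if pattern.search(a["title"])]


matches = find_articles("MgB2", ARTICLES)
print(matches)
```

Matching on whole tokens keeps a query for one compound from returning articles about a different compound whose formula merely contains it. A real system would likely also need to normalize formula variants, which this sketch does not attempt.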
