WordsWorth Scores for Attacking CNNs and LSTMs for Text Classification

1 Jan 2021 · Nimrah Shakeel

Black-box attacks on traditional deep learning models target important words in a piece of text in order to change the model's prediction. We present a simple yet novel approach to calculating word importance scores, based on model evaluations on single words. These scores, which we call WordsWorth scores, need to be calculated only once over the training vocabulary. They can be used to speed up any attack method that requires word importance, with negligible loss of attack performance. We run experiments with word-level CNNs and LSTMs trained on a number of sentiment analysis and text classification datasets, using these scores for leave-one-out and greedy substitution attacks. Our results show the effectiveness of our method in attacking these models, with success rates comparable to the original baselines. We argue that global importance scores act as a very good proxy for word importance in a local context because words are a highly informative form of data. This aligns with the manner in which humans interpret language, with individual words having well-defined meanings and powerful connotations. We further show that these scores can be used as a debugging tool to interpret a trained model by highlighting relevant words for each class. Additionally, we demonstrate the effect of overtraining on word importance, compare the robustness of CNNs and LSTMs, and explain the transferability of adversarial examples across a CNN and an LSTM using these scores.
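The core idea above, scoring each vocabulary word exactly once by evaluating the model on that word alone, can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: `model_predict` is a stand-in for a trained classifier's probability output (here a toy sentiment lexicon rather than a real CNN or LSTM), and the exact scoring rule used in the paper may differ.

```python
import math

def wordsworth_scores(vocab, model_predict):
    """Score each word with one model evaluation on that word alone.

    Returns {word: [p(class_0), ..., p(class_{k-1})]}: a word's importance
    for a class is the probability the model assigns to that class when the
    word is the entire input. Computed once over the vocabulary, the scores
    can then be reused by any attack that needs a word-importance ranking.
    """
    return {word: model_predict([word]) for word in vocab}

# Toy stand-in classifier: two classes (negative, positive) driven by a
# tiny sentiment lexicon. A real use would call a trained CNN/LSTM here.
LEXICON = {"terrible": -2.0, "bad": -1.0, "good": 1.0, "excellent": 2.0}

def toy_predict(tokens):
    score = sum(LEXICON.get(t, 0.0) for t in tokens)
    p_pos = 1.0 / (1.0 + math.exp(-score))  # logistic over a lexicon sum
    return [1.0 - p_pos, p_pos]

scores = wordsworth_scores(
    ["terrible", "bad", "okay", "good", "excellent"], toy_predict
)

# Rank words by their positive-class score: in a leave-one-out or greedy
# substitution attack on a positive review, the top-ranked words would be
# the first candidates to remove or replace.
ranked = sorted(scores, key=lambda w: scores[w][1], reverse=True)
```

Because the scores are precomputed once over the vocabulary, an attack no longer needs a fresh model query per word per input, which is the source of the claimed speed-up.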
