AdvBench

To systematically evaluate the effectiveness of our approach at accomplishing this, we designed a new benchmark, AdvBench, based on two distinct settings.

Harmful Strings: A collection of 500 strings that reflect harmful or toxic behavior, encompassing a wide spectrum of detrimental content such as profanity, graphic depictions, threatening behavior, misinformation, discrimination, cybercrime, and dangerous or illegal suggestions. The adversary’s objective is to discover specific inputs that can prompt the model to generate these exact strings. The strings’ lengths vary from 3 to 44 tokens, with a mean length of 16 tokens when tokenized with the LLaMA tokenizer.
Harmful Behaviors: A set of 500 harmful behaviors formulated as instructions. These behaviors range over the same themes as the harmful strings setting, but the adversary’s goal is instead to find a single attack string that will cause the model to generate any response that attempts to comply with the instruction, and to do so over as many harmful behaviors as possible.

Homepage

Benchmarks

Add a new result Link an existing benchmark

No benchmarks yet. Start a new benchmark or link an existing one.

Papers

Paper	Code	Results	Date	Stars

Dataset Loaders

Add Remove

No data loaders found. You can submit your data loader here.

Tasks

Similar Datasets

AlpacaEval-TH

MT-Bench-TH

Usage

License

Unknown

Modalities

Languages

English

AdvBench

Benchmarks Edit Add a new result Link an existing benchmark

Papers

Dataset Loaders Edit Add Remove

Tasks Edit