PS4: a Next-Generation Dataset for Protein Single Sequence Secondary Structure Prediction

bioRxiv Preprint 2023  ·  Omar Peracha

Protein secondary structure prediction is a subproblem of protein folding. A lightweight algorithm capable of accurately predicting secondary structure from only the protein residue sequence could therefore provide a useful input for tertiary structure prediction, reducing the reliance on multiple sequence alignments (MSAs) typically seen in today's best-performing models. This in turn could lead to protein folding algorithms that perform better on orphan proteins and that are much more accessible for both research and industry adoption, because they need fewer computational resources to run. Unfortunately, existing datasets for secondary structure prediction are small, which creates a bottleneck in the rate of progress of automatic secondary structure prediction. Furthermore, protein chains in these datasets are often not identified, hampering the ability of researchers to use external domain knowledge when developing new algorithms. We present PS4, a dataset of 18,731 non-redundant protein chains and their respective Q8 secondary structure labels. Each chain is identified by its PDB code, and the dataset is also non-redundant against other secondary structure datasets commonly seen in the literature. We perform ablation studies by training secondary structure prediction algorithms on the PS4 training set, and obtain state-of-the-art Q8 and Q3 accuracy on the CB513 test set in a zero-shot setting, without further fine-tuning. Furthermore, we provide a software toolkit for the community to run our evaluation algorithms, train models from scratch and add new samples to the dataset. All code and data required to reproduce our results and make new inferences are available at https://github.com/omarperacha/ps4-dataset


Datasets


Introduced in the Paper:

PS4
Task                                    Dataset  Model     Metric Name  Metric Value  Global Rank
Protein Secondary Structure Prediction  CB513    PS4-Conv  Q8           0.756         #2
Protein Secondary Structure Prediction  CB513    PS4-Conv  Q3           0.863         #2
Protein Secondary Structure Prediction  CB513    PS4-Mega  Q8           0.763         #1
Protein Secondary Structure Prediction  CB513    PS4-Mega  Q3           0.868         #1
Protein Secondary Structure Prediction  PS4      PS4-Mega  Q8           0.782         #1
Protein Secondary Structure Prediction  PS4      PS4-Conv  Q8           0.779         #2
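
For readers unfamiliar with the Q8 and Q3 metrics reported above, the following is a minimal sketch of how such scores are commonly computed: Q8 as the fraction of residues whose predicted 8-state label matches the reference, and Q3 as the same fraction after collapsing the eight states into three. The 8-to-3 mapping used here is one widely used convention and is an assumption, as are the function names and the use of `C` for coil; this section does not state which reduction or aggregation the paper's toolkit uses.

```python
# Assumed 8-state -> 3-state reduction (a common convention, not confirmed
# by this page as the one used for the reported Q3 scores).
Q8_TO_Q3 = {
    "H": "H", "G": "H", "I": "H",  # helix types -> helix
    "E": "E", "B": "E",            # strand and bridge -> strand
    "T": "C", "S": "C", "C": "C",  # turn, bend, coil -> coil
}


def per_residue_accuracy(pred: str, true: str) -> float:
    """Fraction of positions where the predicted label equals the reference."""
    if len(pred) != len(true):
        raise ValueError("prediction and reference must have the same length")
    if not true:
        raise ValueError("sequences must not be empty")
    return sum(p == t for p, t in zip(pred, true)) / len(true)


def to_q3(labels: str) -> str:
    """Collapse an 8-state label string to 3 states using Q8_TO_Q3."""
    return "".join(Q8_TO_Q3[label] for label in labels)


def q3_accuracy(pred: str, true: str) -> float:
    """Per-residue accuracy after reducing both strings to 3 states."""
    return per_residue_accuracy(to_q3(pred), to_q3(true))


if __name__ == "__main__":
    # Hypothetical labels for a 10-residue chain, for illustration only.
    reference = "HHHGGEEBTC"
    predicted = "HHHHHEEETS"
    print(per_residue_accuracy(predicted, reference))  # Q8: 0.6
    print(q3_accuracy(predicted, reference))           # Q3: 1.0
```

The toy example shows why Q3 is never lower than Q8 for the same prediction: confusions within a group (for example `G` predicted as `H`) count as errors under Q8 but as matches once both labels collapse to the same 3-state class.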

Methods