CIRPIN: Learning Circular Permutation-Invariant Representations to Uncover Putative Protein Homologs
Aiden R. Kolodziej, S. Mazdak Abulnaga, Sergey Ovchinnikov. Machine Learning for Computational Biology (MLCB).
Protein structure-based homology detection has been revolutionized by deep learning methods that can rapidly search massive databases. However, current structural search tools often miss proteins related by topological rearrangements, particularly circular permutation (CP), where proteins share identical global folds but differ in the positioning of their termini. We introduce a circular permutation-invariant graph neural network (CIRPIN) that addresses this limitation through a novel data augmentation strategy using synthetic circular permutations (synCPs). We demonstrate that CIRPIN learns representations of proteins that are invariant to circular permutation, enabling it to identify similar proteins within the Structural Classification of Proteins - extended (SCOPe) and AlphaFold Cluster Representatives (AFDB-ClustR) databases. Leveraging the speed of CIRPIN and the accuracy of traditional structural alignment tools, we search these databases and uncover thousands of novel protein pairs related by circular permutation. Notably, we discover that PDZ domains exist naturally in four circularly permuted forms. These results highlight CIRPIN as a powerful tool to investigate the emergence of circular permutations in nature.
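
To make the idea of a synthetic circular permutation concrete, the following is a minimal sketch, not the authors' implementation. It assumes a synCP can be modeled by choosing a cut position and re-indexing the residues so that the chain starts at the cut, with the original termini treated as joined. The function name `synthetic_circular_permutation`, the `cut` parameter, and the toy coordinates are hypothetical and introduced only for illustration. The sketch does not check whether the original termini are close enough in space to be linked.

```python
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def synthetic_circular_permutation(
    sequence: str, coords: Sequence[Coord], cut: int
) -> Tuple[str, List[Coord]]:
    """Rotate residue order so the chain starts at index `cut`.

    Assumption: one coordinate per residue (for example, the C-alpha atom).
    Coordinates are not moved in space; only the residue ordering changes,
    so the global fold is identical while the termini sit elsewhere.
    """
    n = len(sequence)
    if n != len(coords):
        raise ValueError("sequence and coords must have the same length")
    if not 0 < cut < n:
        raise ValueError("cut must satisfy 0 < cut < len(sequence)")

    # Residues cut..n-1 become the new N-terminal segment,
    # residues 0..cut-1 become the new C-terminal segment.
    permuted_sequence = sequence[cut:] + sequence[:cut]
    permuted_coords = list(coords[cut:]) + list(coords[:cut])
    return permuted_sequence, permuted_coords


# Toy example: eight residues placed along a line (hypothetical data).
example_sequence = "ACDEFGHK"
example_coords = [(float(i), 0.0, 0.0) for i in range(len(example_sequence))]

new_sequence, new_coords = synthetic_circular_permutation(
    example_sequence, example_coords, cut=3
)
print(new_sequence)  # EFGHKACD
```

Under this assumption, pairing each training structure with one or more such rotated copies gives a model examples that share a fold but differ in where the termini fall, which is the kind of variation the abstract describes as the target of invariance.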
