Innov’SAR Core is the main tool of the Innov’SAR platform, developed by PEACCEL in the R language for the statistical modeling of protein sequence-activity relationships.
The application is coded with the shiny package.
Our proprietary Sequence-Activity Relationship methodology, called Innov’SAR Core, identifies high-fitness mutants from smart mutant libraries. It relies on the physico-chemical properties of the amino acids, digital signal processing and regression techniques.
The novelty of Innov’SAR Core is its use of the Fast Fourier Transform (FFT): the sequences of a library of protein or enzyme variants with known activities are encoded numerically and transformed into a set of protein spectra.
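As a rough illustration of the encoding step described above, the sketch below maps each residue to a physico-chemical value and computes the amplitude spectrum of the resulting signal. It is written in Python for brevity rather than in R. It uses the Kyte-Doolittle hydropathy scale as a stand-in, since the property index actually used by the tool is not stated here, and it replaces the FFT with a direct discrete Fourier transform, which yields the same values more slowly. The sequences and the helper names `encode` and `spectrum` are hypothetical.

```python
import cmath

# Assumption: Kyte-Doolittle hydropathy is used purely as an example scale.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode(sequence, scale=HYDROPATHY):
    """Turn a protein sequence into a numerical signal, one value per residue."""
    return [scale[residue] for residue in sequence]


def spectrum(signal):
    """Amplitude spectrum of a real signal (frequencies 0 .. n // 2).

    Direct O(n^2) discrete Fourier transform; an FFT computes the same
    coefficients faster.
    """
    n = len(signal)
    amplitudes = []
    for k in range(n // 2 + 1):
        coefficient = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        amplitudes.append(abs(coefficient))
    return amplitudes


# Hypothetical toy sequences: a parent and a single-substitution variant (A5D).
wild_type = "MKVLAAGIVG"
variant = "MKVLDAGIVG"

wt_spec = spectrum(encode(wild_type))
var_spec = spectrum(encode(variant))
```

In a full workflow, one spectrum per variant would form a row of the feature matrix passed to a regression model together with the measured activities; that step is not shown here.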
In summary, the procedure requires only an initial dataset containing the primary sequences of enzyme variants and their respective biological properties. It differs from other ML approaches in the following ways:
Thanks to the Fourier transform, non-linear effects within the protein sequence are captured;
The FFT makes it possible to introduce new mutations, including at positions not previously explored;
A single round is enough to identify high-performing mutants, without the excessively large datasets customary in other ML or deep learning approaches;
No alignment-based amino acid descriptors are needed, nor protein sequences of equal length;
Large computational resources and long computation times are not required.
Applying an FFT to a digitally encoded protein sequence is not the same as simply encoding it in another way. This mathematical treatment takes into account the order of the residues in the sequence and all the interactions between positions within it, and thus makes it easier to identify epistatic phenomena.
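The order sensitivity mentioned above can be checked on a toy case. The sketch below (Python, standard library only) compares two hypothetical sequences with identical amino acid composition, differing only by a swap of two neighbouring residues. A composition-only descriptor cannot tell them apart, whereas their amplitude spectra differ. The Kyte-Doolittle hydropathy values are an assumed stand-in for the physico-chemical index, and `amplitude_spectrum` is an illustrative helper, not part of the tool.

```python
import cmath

# Assumption: subset of the Kyte-Doolittle hydropathy scale, for illustration only.
HYDROPATHY = {"M": 1.9, "K": -3.9, "V": 4.2, "L": 3.8, "A": 1.8, "G": -0.4, "I": 4.5}


def amplitude_spectrum(sequence):
    """Encode a sequence with HYDROPATHY and return |DFT| for k = 0 .. n // 2."""
    signal = [HYDROPATHY[residue] for residue in sequence]
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(signal)))
        for k in range(n // 2 + 1)
    ]


original = "MKVLAAGIVG"
swapped = "MVKLAAGIVG"  # same residues, K and V exchanged

# Composition-only view: both sequences contain exactly the same residues.
same_composition = sorted(original) == sorted(swapped)

spec_original = amplitude_spectrum(original)
spec_swapped = amplitude_spectrum(swapped)

# The zero-frequency term is the sum of the signal, so it is unchanged by the swap;
# the higher-frequency terms depend on residue order and do change.
print(same_composition)
print(round(spec_original[0], 3), round(spec_swapped[0], 3))
print(round(spec_original[5], 3), round(spec_swapped[5], 3))
```

Note that the amplitude spectrum alone does not distinguish every reordering (a fully reversed sequence, for instance, has the same amplitudes), so this only illustrates the general point that spectral features carry positional information.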
The Innov’SAR Core approach is interpolative and extrapolative, and it predicts outside the box, a combination not found in other state-of-the-art (SOTA) machine learning (ML) or deep learning (DL) approaches. A comparison of Innov’SAR Core with 14 other methods shows that it outperforms these SOTA ML and DL methods in terms of hit rate (81%).
Figure S2: https://chemistry-europe.onlinelibrary.wiley.com/doi/10.1002/cbic.202000612
Relationship between the hit rate, normalized to the log10 of the number of functionally characterized mutants used for training, and the size of the search region explored: comparison of 15 studies. Turquoise square: the CNN model proposed by Xu et al. (2020), assuming the maximum non-normalized hit rate of 1 because the paper does not provide enough data to compute the exact hit rate. Purple diamond: the attention-based neural network model proposed by Wu et al. (2020), using the hit rate reported in that paper.
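One plausible reading of the normalization used in this figure is the raw hit rate divided by log10 of the training set size. The short Python sketch below follows that reading; the exact formula should be checked against the cited paper, and the function name and the numbers in the example are made up for illustration.

```python
import math


def normalized_hit_rate(hits, tested, n_train):
    """Hit rate divided by log10 of the number of training mutants.

    Assumption: this is one interpretation of "hit rate normalized to the
    log10 number of functionally characterized mutants used for training".
    """
    if tested <= 0:
        raise ValueError("tested must be positive")
    if n_train <= 1:
        raise ValueError("n_train must be greater than 1 so that log10 is positive")
    return (hits / tested) / math.log10(n_train)


# Made-up example: 8 improved variants out of 10 tested, model trained on 1000 mutants.
example = normalized_hit_rate(hits=8, tested=10, n_train=1000)
```

Under this definition, two methods with the same raw hit rate are ranked by how little training data they needed, which is the comparison the figure is meant to convey.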