CLEAN - A Novel Machine Learning Model for Enzyme Function Prediction
"Uncovering the Science behind Life with AI" series - Part 2
This is the second post in the series “Uncovering the science behind life with AI,” in which we present applications of artificial intelligence in the life sciences. This time, we look at a novel machine learning algorithm named CLEAN (contrastive learning–enabled enzyme annotation), which was developed to address the problem of enzyme function prediction. After decades of research, protein function prediction is still one of the biggest open challenges in biology. Computational tools that could reliably predict protein function would therefore be useful in many areas, such as the study of disease mechanisms, protein design, and drug development.
Enzymes are proteins (if you want to know about proteins in general, check our previous blog post about generating protein sequences) that specifically accelerate chemical reactions. They are critical to the functioning of biological cells and organisms, as they support and regulate biological processes such as metabolism, synthesis of new molecules, DNA replication, etc. We also use isolated enzymes in a wide range of applications, from food processing (e.g., cheese making) to cleaning (e.g., enzymes in detergents) to diagnostics (e.g., to determine blood glucose levels).
Without enzymes, some chemical reactions would take place only under very harsh conditions (e.g., at very high temperature, very high pressure, or very acidic or very basic pH) or would be very slow. Enzymes allow these reactions to occur under mild conditions such as those found in the human body (temperature around 37°C and near-neutral pH). Another important property of enzymes is that they are usually very specific, i.e., they accelerate only a single chemical reaction. For example, the enzymes used to determine blood glucose levels react only with glucose and do not react with any other molecules, not even with another similar sugar.
Enzymes can be classified based on which chemical reaction they catalyze (i.e., accelerate). The most well-known numerical classification scheme for enzymes based on their catalytic function is the Enzyme Commission number (EC number). Every EC number is associated with an enzyme-catalyzed chemical reaction. All enzymes that catalyze the same reaction are given the same EC number.
Structure and function of proteins
As with other proteins, the function of enzymes is mainly determined by their 3D structure, i.e., the arrangement of the atoms of the enzyme in space. And the 3D structure of proteins is determined by the amino acid sequence, i.e., the order in which the amino acids (building blocks of proteins) occur in the protein. Therefore, protein structure prediction from amino acid sequences is an important scientific task that has been intensively studied for many decades. Protein structure prediction can contribute significantly to improving our understanding of the relationship between protein structure and protein function. It is also important for practical applications, such as in the development of new proteins with certain desirable properties – for example, if we want to create an enzyme that would be able to efficiently degrade plastics or could be used as a drug to treat a disease.
Determining a protein structure is not an easy task. So far, we have been able to experimentally elucidate the 3D structures of only a small fraction of all known proteins. Thus, it is not surprising that many computational tools have been developed with the aim of predicting 3D structures of proteins from amino acid sequences. Traditionally, those models made their predictions based on physicochemical properties of proteins and similarities to other proteins with known 3D structures. However, these tools have been limited to groups of proteins that we know relatively well. More recently, deep learning has revolutionized protein structure prediction with models such as AlphaFold, which can predict the 3D structure of a wide range of amino acid sequences, often with high accuracy.
Prediction of protein function
Predicting protein function is also a difficult task. As with protein structure, the experimental determination of protein function can be very tedious and expensive. It is therefore not surprising that numerous computational tools have also been developed for protein function prediction. Similar to structure prediction tools, these models predict protein function based on amino acid sequence similarities, structural similarities, evolutionary relationships between proteins, etc.
To date, many machine learning models have also been developed for protein function prediction. Among them are models developed for predicting the catalytic functions of enzymes. They treat this task as a multilabel classification problem with the goal of assigning EC numbers to enzymes. However, different EC groups of enzymes are very unequally represented: as is often the case in biology, there are some classes we know a lot about and others with almost no representatives and very limited information available. The predictive power of those models suffers from such imbalances in the datasets they are trained on. Protein databases are still full of entries with unknown or wrongly predicted functions.
CLEAN (contrastive learning–enabled enzyme annotation)
Yu et al. recently published a paper in the scientific journal Science presenting their machine learning algorithm named CLEAN (contrastive learning–enabled enzyme annotation). CLEAN assigns EC numbers to enzymes with better accuracy, reliability, and sensitivity than the current state-of-the-art tools.
The model was developed as a feedforward neural network that receives protein embeddings as inputs. The embeddings are numerical representations of amino acid sequences that contain the important features and information about the function of the enzyme. The protein embeddings were obtained using the ESM-1b language model. The output layer of the neural network then produces a refined, function-aware embedding of the input protein.
As the name suggests, CLEAN uses a contrastive learning framework. Contrastive learning is a deep learning technique that teaches a model which data points are similar (the positives) and which are different (the negatives), so that the model learns general features of the data. It is often used in a self-supervised, task-independent way, even without labels; in CLEAN, the EC numbers of the training enzymes define the positives and negatives. During training of CLEAN, each reference amino acid sequence (anchor) in the training dataset was sampled together with an amino acid sequence belonging to an enzyme with the same EC number (positive sample) and a sequence of an enzyme with a different EC number (negative sample).
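To make the anchor/positive/negative idea more concrete, here is a minimal sketch of how one such triplet could be drawn from a labeled dataset. The function name `sample_triplet`, the sequence IDs, and the uniform random choice of the negative are our own assumptions for illustration; the sampling strategy actually used by the authors may differ.

```python
import random
from collections import defaultdict


def sample_triplet(ec_by_sequence, rng):
    """Return (anchor, positive, negative) sequence IDs for one training step.

    ec_by_sequence maps a sequence ID to its EC number (hypothetical toy format).
    """
    # Group sequence IDs by their EC number.
    by_ec = defaultdict(list)
    for seq_id, ec in ec_by_sequence.items():
        by_ec[ec].append(seq_id)

    # An anchor needs at least one other sequence with the same EC number.
    anchor_ecs = sorted(ec for ec, ids in by_ec.items() if len(ids) >= 2)
    anchor_ec = rng.choice(anchor_ecs)
    anchor, positive = rng.sample(by_ec[anchor_ec], 2)

    # Assumption: the negative is drawn uniformly from any other EC number.
    other_ecs = sorted(ec for ec in by_ec if ec != anchor_ec)
    negative = rng.choice(by_ec[rng.choice(other_ecs)])
    return anchor, positive, negative


# Toy labels: the sequence IDs are made up, the EC numbers serve only as labels.
labels = {
    "seqA": "1.1.3.4",
    "seqB": "1.1.3.4",
    "seqC": "3.2.1.1",
    "seqD": "3.2.1.1",
    "seqE": "2.7.1.1",
}
anchor, positive, negative = sample_triplet(labels, random.Random(0))
print(anchor, positive, negative)
```

Whatever the random seed, the anchor and the positive share an EC number, and the negative carries a different one.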
The training goal was for CLEAN to learn an embedding space of enzymes in which the Euclidean distance reflects the functional similarities between enzymes. Amino acid sequences of enzymes with the same EC number (i.e., same function) should have a small Euclidean distance, while sequences of enzymes with different EC numbers should have a large distance. The learning objective was formulated as a contrastive loss function that minimizes the distance between the reference amino acid sequence (anchor) and the positive sample, while maximizing the distance between the anchor and the negative sample.
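One common way to write an objective of this kind is a triplet margin loss. The sketch below is only an illustration of the idea on tiny 2D vectors; the function name and the margin value of 1.0 are assumptions on our side, not settings taken from the paper.

```python
import math


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Loss is zero once the negative is at least `margin` farther from the
    anchor than the positive is; otherwise it grows with the violation."""
    d_pos = math.dist(anchor, positive)  # Euclidean distance anchor-positive
    d_neg = math.dist(anchor, negative)  # Euclidean distance anchor-negative
    return max(0.0, d_pos - d_neg + margin)


anchor = [0.0, 0.0]
positive = [0.3, 0.4]       # distance 0.5 from the anchor
far_negative = [3.0, 4.0]   # distance 5.0 from the anchor
near_negative = [0.6, 0.8]  # distance 1.0 from the anchor

easy = triplet_margin_loss(anchor, positive, far_negative)   # no penalty
hard = triplet_margin_loss(anchor, positive, near_negative)  # about 0.5
print(easy, hard)
```

A negative that is already far away contributes nothing to the loss, while a negative that sits close to the anchor is penalized, which pushes the model to move it away during training.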
For the prediction of the EC number, the numerical representations of each EC number were first obtained by averaging the learned embeddings of all enzymes in the training set belonging to each EC number. Then, the embedding of the query amino acid sequence (i.e., the input protein) was compared with the representation of each EC number to obtain the pairwise Euclidean distance between the query sequence and each EC number representation. Finally, the input protein was assigned EC numbers that were significantly close to the query sequence.
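The first two steps can be sketched in a few lines of Python. The toy 2D embeddings and the helper names `ec_centroids` and `rank_ec_numbers` are hypothetical, and the sketch only ranks EC numbers by distance; the rule the authors use to decide which distances count as significantly close is not reproduced here.

```python
import math
from collections import defaultdict


def ec_centroids(embeddings, labels):
    """Average the embeddings of all training enzymes that share an EC number."""
    grouped = defaultdict(list)
    for vector, ec in zip(embeddings, labels):
        grouped[ec].append(vector)
    return {
        ec: [sum(column) / len(vectors) for column in zip(*vectors)]
        for ec, vectors in grouped.items()
    }


def rank_ec_numbers(query, centroids):
    """Return (distance, EC number) pairs, closest EC representation first."""
    return sorted((math.dist(query, center), ec) for ec, center in centroids.items())


# Toy "learned" embeddings (2D only for readability).
train_embeddings = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [12.0, 10.0]]
train_labels = ["1.1.3.4", "1.1.3.4", "3.2.1.1", "3.2.1.1"]

centroids = ec_centroids(train_embeddings, train_labels)
ranking = rank_ec_numbers([1.0, 1.0], centroids)
print(ranking[0][1])  # EC number whose representation is closest to the query
```

In this toy case the query lies next to the averaged embedding of the first EC number, so that EC number comes out on top of the ranking.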
In silico validation
CLEAN was trained and evaluated on data from UniProt, a widely used protein sequence and annotation knowledgebase. After training, the predictive performance of CLEAN was compared with six state-of-the-art EC number annotation tools on a dataset not included in the development of any of the models. CLEAN performed best on several multilabel accuracy metrics. It was also more accurate and precise than previously developed machine learning-based models for predicting functions for newly discovered proteins, especially those with unknown enzyme functions. To assess the reliability of the prediction results, the authors applied a two-component Gaussian mixture model to the distribution of Euclidean distances between enzyme sequence embeddings and EC number embeddings.
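To give an intuition for the last step, here is a small sketch that fits a two-component Gaussian mixture to a handful of made-up distances with plain expectation-maximization. The distances, the function names, and the use of the lower-mean component's posterior probability as a confidence score are assumptions for illustration only; they are not the authors' implementation.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_two_gaussians(values, iterations=50, min_var=1e-6):
    """Fit a two-component 1D Gaussian mixture with plain EM."""
    # Start one component at the smallest value and one at the largest.
    means = [min(values), max(values)]
    overall_mean = sum(values) / len(values)
    overall_var = sum((v - overall_mean) ** 2 for v in values) / len(values)
    variances = [max(overall_var, min_var)] * 2
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of each component for each value.
        resp = []
        for v in values:
            dens = [w * normal_pdf(v, m, s) for w, m, s in zip(weights, means, variances)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean, and variance of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp) or 1e-300
            weights[k] = nk / len(values)
            means[k] = sum(r[k] * v for r, v in zip(resp, values)) / nk
            var_k = sum(r[k] * (v - means[k]) ** 2 for r, v in zip(resp, values)) / nk
            variances[k] = max(var_k, min_var)
    return weights, means, variances


def prob_close(distance, weights, means, variances):
    """Posterior probability that a distance belongs to the lower-mean component."""
    low = 0 if means[0] <= means[1] else 1
    dens = [w * normal_pdf(distance, m, s) for w, m, s in zip(weights, means, variances)]
    total = sum(dens)
    return dens[low] / total if total > 0 else 0.0


# Made-up distances: one group of small values and one group of large values.
distances = [0.9, 1.0, 1.1, 1.05, 0.95, 3.8, 3.9, 4.0, 4.1, 4.2]
weights, means, variances = fit_two_gaussians(distances)
print(prob_close(1.2, weights, means, variances))  # close to 1
print(prob_close(3.9, weights, means, variances))  # close to 0
```

The mixture separates the small distances from the large ones, so a new distance can be scored by how strongly it belongs to the component with the smaller mean.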
To experimentally test the predictions of CLEAN, the researchers selected three enzymes from the class of halogenases. Halogenases are enzymes that “incorporate” halogen atoms (i.e., fluorine (F), chlorine (Cl), bromine (Br), iodine (I) or astatine (At) atoms) into substrate molecules. Halogenases are generally poorly studied, and only a limited number of amino acid sequences of halogenases are available in protein databases, making prediction of halogenase function a difficult task. The three selected halogenases were either labelled as uncharacterized or hypothetical proteins or had conflicting annotations in the literature.
CLEAN predicted new EC numbers for these three halogenases, suggesting that they may have different potential functions than previously thought. The researchers characterized the function of these three halogenases in a series of experiments. The experiments confirmed that CLEAN’s predictions were more accurate than the previous annotations.
Angelika Vižintin received her PhD in Biosciences at the University of Ljubljana, Slovenia. In her PhD thesis, she studied the effects of short high-voltage electric pulses on animal cell lines and their potential application, for example, for electrochemotherapy, a local cancer treatment that combines membrane-permeabilizing electric pulses and chemotherapeutic drugs. She is currently taking courses in machine learning, mathematical foundations of artificial intelligence, computer science, etc. in the Artificial Intelligence programme at Johannes Kepler University in Linz, Austria, with the goal of learning how to use artificial intelligence tools in biological/biomedical research. She also produces science-related programs for the Slovenian Radio Študent, one of the oldest and strongest independent radio stations in Europe. Among the topics she reports on most frequently are the biology of the female body and women’s diseases, biotechnology, genetically modified organisms, molecular biology, and artificial intelligence in the life sciences. In her free time, she enjoys listening to music and traveling.
 T. Yu, H. Cui, J. C. Li, Y. Luo, G. Jiang, and H. Zhao, “Enzyme function prediction using contrastive learning,” Science, vol. 379, no. 6639, pp. 1358–1363, Mar. 2023.
 J. Jumper et al., “Highly accurate protein structure prediction with AlphaFold,” Nature, vol. 596, p. 583, 2021, doi: 10.1038/s41586-021-03819-2.
 S. W. Englander and L. Mayne, “The case for defined protein folding pathways,” Proc Natl Acad Sci U S A, vol. 114, no. 31, pp. 8253–8258, Aug. 2017.
 A. Rives et al., “Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences,” Proc Natl Acad Sci U S A, vol. 118, no. 15, p. e2016239118, Apr. 2021.
 P. Khosla et al., “Supervised Contrastive Learning,” Adv Neural Inf Process Syst, vol. 33, pp. 18661–18673, 2020.
 A. Bateman et al., “UniProt: the universal protein knowledgebase in 2021,” Nucleic Acids Res, vol. 49, no. D1, pp. D480–D489, Jan. 2021, doi: 10.1093/NAR/GKAA1100.