Life’s OrigAImi: demystifying AlphaFold’s protein folding

A deep neural network developed by Google’s DeepMind has made a huge leap towards solving the protein folding problem, one of the biggest challenges in biology. The results of the algorithm, called AlphaFold, were revealed on 30 November during the Critical Assessment of protein Structure Prediction (CASP) conference. To explain what AlphaFold does and what makes it unique, we first need to look at the protein folding problem and at why scientists are so eager to solve it.

Building a protein

First, what do proteins do? Proteins are the building blocks of life. They are large, complex molecules that are involved in every biological process in a living organism. The instructions for building proteins are encoded in DNA. How exactly this works is outside the scope of this article, but if you are interested, I have listed some nice articles at the end that explain how DNA is used to build proteins. The important point here is that proteins are made up of amino acids, and that the sequence of amino acids in a protein is stored in DNA.


When a new protein is formed, it is unfolded: just a long, linear chain of amino acids. After folding, the protein has a 3D structure and can perform its function in the cell. What the protein does depends on its 3D structure, and that structure in turn depends on the amino acid sequence encoded in its DNA. However, knowing the DNA sequence of a protein is not enough to determine its eventual 3D shape and function. DNA only gives us the amino acid sequence, not how the protein will fold. Trying out all the possible ways a protein could fold, one by one, would take longer than the age of the universe, yet proteins somehow manage to fold themselves within milliseconds. Predicting the 3D structure of a protein from its DNA and amino acid sequence is known as the ‘protein folding problem’.
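To get a feeling for why trying every fold is hopeless, here is a back-of-the-envelope calculation in Python. All the numbers are illustrative assumptions of mine, not measured values: a chain of 100 amino acids, 3 possible conformations per amino acid, and 10^13 conformations tried per second.

```python
# Back-of-the-envelope estimate. Every constant below is an illustrative
# assumption, chosen only to show the order of magnitude.
CONFORMATIONS_PER_RESIDUE = 3      # assumed number of shapes per amino acid
SAMPLES_PER_SECOND = 1e13          # assumed conformations tried per second
SECONDS_PER_YEAR = 3.15e7          # roughly 365 days
AGE_OF_UNIVERSE_YEARS = 1.38e10    # roughly 13.8 billion years


def years_to_enumerate(n_residues: int) -> float:
    """Years needed to try every conformation of a chain, one after another."""
    total_conformations = CONFORMATIONS_PER_RESIDUE ** n_residues
    return total_conformations / SAMPLES_PER_SECOND / SECONDS_PER_YEAR


years = years_to_enumerate(100)
ratio = years / AGE_OF_UNIVERSE_YEARS
print(f"{years:.1e} years, or {ratio:.1e} times the age of the universe")
```

Even with these generous assumptions the number of possibilities grows exponentially with the length of the chain, so brute force is out of the question for any realistic protein.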

Importance of protein folding

You might wonder: ‘Why do we care?’ If proteins manage to fold themselves just fine, why bother modelling them? The reason is twofold: to help treat diseases caused by misfolded proteins, and to aid in drug discovery. A mutation in DNA can lead to the wrong amino acid being placed in the protein, which can cause the protein to be built or folded incorrectly. Some well-known diseases that involve misfolded proteins are Alzheimer’s disease, sickle cell disease, and cystic fibrosis. Modelling protein folding could help identify which part of the protein is causing the folding malfunction.

The second reason is that the shape of a protein dictates which molecules it can bind to. If we can model the 3D structure of a protein, we can design a molecule that binds to it, and so develop drugs based on the structure of the protein. This could be much faster and cheaper than the current method of drug discovery, which often consists of trying many compounds and hoping that one of them has an effect that can be used as medicine.


Now that we have covered the background, we can return to the real topic of this article: AlphaFold. As mentioned in the introduction, AlphaFold aims to solve the protein folding problem. Like all neural networks, AlphaFold needs a lot of data to train on. The data used by AlphaFold is genomic data, that is, data obtained by sequencing DNA. The amount of genomic data has increased considerably in the past few years due to the rapid decrease in the cost of genetic sequencing. AlphaFold uses this vast amount of data to train its deep neural network and predict properties of proteins.

AlphaFold predicts two properties of a protein: the distances between pairs of amino acids, and the angles of the chemical bonds that connect those amino acids. For the distances, a neural network trained on the genomic dataset predicts, for every pair of amino acids, how probable each possible distance between them is. A second neural network uses the probabilities produced by the first network to assess how close a proposed protein structure is to the correct structure.
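To make the idea of scoring a structure with predicted distance probabilities more concrete, here is a minimal sketch in Python. It is not AlphaFold’s code: AlphaFold uses a second neural network for this step, whereas the sketch simply adds up negative log-probabilities. The distance bins, the function names and the toy numbers are all assumptions made for illustration.

```python
import math

# Hypothetical distance bins in angstroms: [0, 4), [4, 8), [8, 12), [12, inf)
BIN_EDGES = [4.0, 8.0, 12.0]


def bin_index(distance):
    """Return the index of the distance bin that `distance` falls into."""
    for i, edge in enumerate(BIN_EDGES):
        if distance < edge:
            return i
    return len(BIN_EDGES)


def structure_score(coords, predicted):
    """Negative log-likelihood of a structure's pairwise distances.

    coords: one (x, y, z) position per amino acid.
    predicted: maps a pair (i, j) to a list of probabilities, one per bin.
    Lower scores mean the structure agrees better with the predictions.
    """
    score = 0.0
    for (i, j), probs in predicted.items():
        distance = math.dist(coords[i], coords[j])
        # The small constant avoids log(0) for bins with zero probability.
        score -= math.log(probs[bin_index(distance)] + 1e-9)
    return score


# Toy predictions for a chain of three amino acids (made-up numbers).
predicted = {
    (0, 1): [0.80, 0.10, 0.05, 0.05],
    (1, 2): [0.80, 0.10, 0.05, 0.05],
    (0, 2): [0.10, 0.70, 0.10, 0.10],
}
compact = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
stretched = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (20.0, 0.0, 0.0)]

print(structure_score(compact, predicted))
print(structure_score(stretched, predicted))
```

In this toy example the compact structure places every pair in its most probable distance bin, so it gets a lower (better) score than the stretched one.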

Improving the score

AlphaFold also uses two further methods that work together to improve the accuracy of its predictions. The first method is a generative neural network that replaces pieces of the proposed protein structure with other protein fragments, creating a new candidate structure. If this candidate has a better score than the original, it becomes the new proposed protein structure.
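The ‘replace a piece and keep it if the score improves’ loop can be sketched in a few lines of Python. This is only an illustration of the idea: where AlphaFold uses a generative neural network to come up with fragments, the sketch picks them at random from a small hypothetical library, and the structure is reduced to a plain list of angles. The names fragment_search and toy_score are my own, and a lower score is assumed to be better.

```python
import random


def fragment_search(structure, fragments, score, steps=200, seed=0):
    """Repeatedly swap in a fragment and keep the result if the score improves.

    structure: list of numbers (here: one angle per amino acid).
    fragments: library of shorter lists that can replace part of the structure.
    score: function that maps a structure to a number, lower is better.
    """
    rng = random.Random(seed)
    best = list(structure)
    best_score = score(best)
    for _ in range(steps):
        fragment = rng.choice(fragments)
        # Pick a start position so that the fragment fits inside the chain.
        start = rng.randrange(len(best) - len(fragment) + 1)
        candidate = best[:start] + list(fragment) + best[start + len(fragment):]
        candidate_score = score(candidate)
        if candidate_score < best_score:
            best, best_score = candidate, candidate_score
    return best, best_score


# Toy setup: the 'correct' structure has an angle of -60 degrees everywhere.
target = [-60.0] * 6


def toy_score(angles):
    return sum((a - t) ** 2 for a, t in zip(angles, target))


start_structure = [0.0] * 6
library = [[-60.0, -60.0, -60.0], [120.0, 120.0, 120.0]]
final_structure, final_score = fragment_search(start_structure, library, toy_score)
print(final_structure, final_score)
```

Because a candidate is only accepted when it scores better, the proposed structure can never get worse from one step to the next.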

The second method uses gradient descent to improve the protein structure. This is a common machine learning technique that makes small, stepwise improvements to gradually ‘walk’ towards the optimal solution. AlphaFold applies it to the entire protein chain at once rather than to separate pieces of the protein, which simplifies the prediction process.
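For readers who have not seen gradient descent before, here is a minimal sketch in Python. It is not how AlphaFold implements it: the loss function is a toy stand-in for the structure score, the slope is estimated numerically, and the step size and number of steps are arbitrary choices of mine.

```python
def gradient_descent(params, loss, learning_rate=0.1, steps=100, eps=1e-6):
    """Minimise `loss` by repeatedly taking a small step downhill.

    The gradient is estimated with finite differences, so `loss` can be
    any function that maps a list of numbers to a single number.
    """
    params = list(params)
    for _ in range(steps):
        base = loss(params)
        gradient = []
        for k in range(len(params)):
            bumped = list(params)
            bumped[k] += eps
            # Slope of the loss along parameter k.
            gradient.append((loss(bumped) - base) / eps)
        # Move every parameter a little in the direction that lowers the loss.
        params = [p - learning_rate * g for p, g in zip(params, gradient)]
    return params


# Toy loss: squared distance to a made-up 'ideal' pair of angles.
ideal = [1.0, -2.0]


def toy_loss(angles):
    return sum((a - t) ** 2 for a, t in zip(angles, ideal))


result = gradient_descent([0.0, 0.0], toy_loss)
print(result)
```

Each step only changes the parameters slightly, but after enough steps they end up very close to the values that minimise the loss.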

High accuracy

AlphaFold easily beat the competition and took first place at CASP. This competition was founded in 1994 to catalyse research in the field of protein folding, and it is now the gold standard for assessing structure prediction techniques. Some of the predictions AlphaFold made were close to the quality of experimental results, something that other submissions to CASP were not able to achieve. It is possible that the AlphaFold team got lucky and that the CASP challenge only presented problems that happened to be well suited to AlphaFold. But even then, the fact that AlphaFold beat the competition by such a large margin is remarkable.

AlphaFold is a big step for the field of protein biology, but it did not solve the protein folding problem. While researching AlphaFold I found quite a lot of blog posts claiming that AlphaFold had ‘solved protein folding’, but this is not true! Much is still unknown in the field of protein folding, and more progress is needed before we can claim that the problem is solved. Even though AlphaFold is more accurate than anything before it, its accuracy is still too low to predict novel protein structures and to be useful in drug discovery.

Also, the CASP competition only uses standalone proteins that have little interaction with other proteins. Most proteins are part of a big web of interacting proteins and depend on each other for chemical stability and functioning. AlphaFold’s predictions were accurate for simple, standalone proteins but not for interacting proteins. As most proteins belong to the latter category, AlphaFold will need to improve its accuracy on complex proteins to be correct on new data outside of the CASP competition.

Furthermore, what AlphaFold solves is a protein structure prediction problem. It helps in understanding the protein folding problem, but it does not solve it. Even if AlphaFold can accurately predict what a protein looks like given the DNA sequence, we still do not understand how protein folding occurs in nature or why misfolding occurs.

AlphaFold is the first big step in understanding the protein folding problem, as it shows that there is some underlying structure in genomic data that AlphaFold is able to find. This has given researchers hope that the protein folding problem can be solved within their lifetime, something that seemed impossible just twenty years ago. But AlphaFold alone is not enough. To solve protein folding we need to understand its mechanisms, and that is something AlphaFold is not able to do yet.

Nice articles: 

DNA seen through the eyes of a coder (or, If you are a hammer, everything looks like a nail)

DNA and RNA Basics: Replication, Transcription, and Translation 


AlphaFold GitHub: