An innovative approach that harnesses the power of artificial intelligence is driving advances in our understanding of cellular mechanisms and could open new directions in disease diagnostics and treatment. Scientists have developed a sophisticated computational method that can predict the location of almost every protein within a human cell with exceptional precision. The model, trained on a joint understanding of protein and cell behavior, opens the door to faster, more efficient identification of pathological conditions and the development of new therapeutic strategies.
The question of where a protein is located within a cell is not merely academic; it has profound implications for cellular function and, consequently, for health. Incorrect protein localization, meaning its placement in an inappropriate cellular compartment, can be a trigger or a significant factor in the development of a range of serious diseases. For example, in Alzheimer's disease, the accumulation of certain proteins in the wrong places in brain cells leads to neurodegeneration. Similarly, in cystic fibrosis, a defective protein fails to reach its correct location on the cell membrane, causing a disruption of ion transport. In the context of cancer, abnormal protein distribution can promote uncontrolled cell growth and division or allow cancer cells to evade the immune response.
Given that a single human cell contains approximately 70,000 different proteins and their variants, manually identifying the location of each one presents an enormous challenge. Traditional experimental methods typically allow for the testing of only a small number of proteins at a time, making the process extremely expensive, time-consuming, and labor-intensive. Each experiment requires careful preparation, specific reagents, and sophisticated equipment, and the results often provide only a fraction of the picture of the complex intracellular organization.
A new generation of computational techniques and the Human Protein Atlas
To accelerate and simplify this complex task, new generations of computational techniques are being developed. They rely on machine learning models trained on extensive datasets containing information on thousands of proteins and their locations, measured across different cell lines. One of the largest and most significant such resources is the Human Protein Atlas. This comprehensive catalog contains data on the subcellular behavior of more than 13,000 proteins in over 40 different cell lines. Despite its impressive size, the Human Protein Atlas has so far covered only about 0.25 percent of all possible combinations of proteins and cell lines within its database. This clearly indicates the vastness of the unexplored space and the need for more advanced tools that can efficiently map the remaining part of the protein universe.
Faced with this challenge, researchers from prestigious institutions such as MIT, Harvard University, and the Broad Institute (a joint institute of MIT and Harvard) have developed a new computational approach. Their method allows for the efficient exploration of the remaining, still unmapped space of intracellular protein localization. The key advantage of this new approach is its ability to predict the location of any protein in any human cell line, even in cases where neither the specific protein nor the particular cell line has been previously experimentally tested. This represents a significant step forward compared to existing methods.
Precision at the single-cell level
The technique they have developed goes a step further than many existing artificial intelligence-based methods because it localizes the protein at the single-cell level, instead of providing an average estimate for all cells of a certain type. This capability of single-cell level localization is of utmost importance. For example, it allows for the precise determination of a protein's position in a specific cancer cell after therapy application, which can provide crucial insights into treatment efficacy and resistance mechanisms. Understanding heterogeneity within a cell population, even within the same cell line or tissue, is key to developing personalized medical approaches.
The research team combined a protein language model with a special type of computer vision model to capture rich and detailed information about the protein and the cell. The protein language model analyzes the amino acid sequence that makes up the protein, extracting information about its structure and properties that determine its affinity for specific cellular compartments. The computer vision model, known as an image inpainting model, analyzes images of the cell stained with specific markers to gather information about the state of that cell – its type, individual characteristics, and the possible presence of stress or pathological changes. The final result the user receives is an image of the cell with a highlighted area indicating the predicted location of the protein. Since protein localization is often an indicator of a protein's functional status, this technique can help researchers and clinicians diagnose diseases more effectively, identify target molecules for new drugs, and enable biologists to better understand how complex biological processes relate to protein distribution within the cell.
Yitong Tseo, a PhD student in MIT's Computational and Systems Biology program and one of the lead authors of the paper published on this topic in the journal Nature Methods on May 15, 2025, points out: "You could run these protein localization experiments on a computer without ever needing to go into the lab, hopefully saving months of effort. While you would still need to verify the prediction, this technique could act as an initial screen of what to test experimentally."
Alongside Tseo, Xinyi Zhang, a PhD student in the Department of Electrical Engineering and Computer Science (EECS) and the Eric and Wendy Schmidt Center at the Broad Institute, is a lead author of the paper. The authors also include Yunhao Bai of the Broad Institute, with senior authors Fei Chen, an assistant professor at Harvard and a member of the Broad Institute, and Caroline Uhler, the Andrew and Erna Viterbi Professor of Engineering in EECS and the Institute for Data, Systems, and Society (IDSS) at MIT, who is also director of the Eric and Wendy Schmidt Center and a researcher in MIT's Laboratory for Information and Decision Systems (LIDS).
Collaboration of advanced models: Introducing PUPS
Many existing models for predicting protein behavior are limited in that they can only make predictions based on data about proteins and cells on which they were trained or are unable to accurately determine the location of proteins within a single cell. To overcome these limitations, researchers created a two-part method for predicting the subcellular location of previously unseen proteins, called PUPS (Prediction of Unseen Proteins' Subcellular localization).
The first part of PUPS uses a protein sequence model. This model is designed to capture the properties of the protein that determine its localization, as well as its three-dimensional structure, based on the chain of amino acids that forms it. The amino acid sequence is the primary information that dictates how a protein will fold and what functions it will perform, including signals for its routing within the cell.
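The paper's actual protein language model is a trained neural network, but the basic idea of converting an amino acid chain into a fixed-length numeric representation can be sketched in a few lines. The random residue table and mean pooling below are illustrative stand-ins for learned, context-dependent embeddings; all names and dimensions are assumptions, not the PUPS implementation:

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def embed_sequence(seq: str, dim: int = 8, seed: int = 0) -> np.ndarray:
    """Toy stand-in for a protein language model: map each residue to a
    fixed random vector and mean-pool over the chain. A real model would
    learn embeddings that reflect structure and localization signals."""
    rng = np.random.default_rng(seed)
    table = {aa: rng.normal(size=dim) for aa in AMINO_ACIDS}
    vectors = [table[aa] for aa in seq if aa in table]
    return np.mean(vectors, axis=0)

# Any amino acid string yields one fixed-length vector for the whole protein:
protein_vec = embed_sequence("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
print(protein_vec.shape)  # (8,)
```

The key property this sketch shares with the real system is that a protein never seen during training still maps to a usable representation, because the input is just its sequence.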
The second part of the system includes an image inpainting model. This is a sophisticated computer vision model originally designed to fill in missing parts of an image. In this context, the model analyzes three differently stained images of the cell to gather key information about its state. These images typically show the nucleus (with a marker like DAPI), microtubules (important components of the cytoskeleton), and the endoplasmic reticulum (a key organelle for protein synthesis and transport). By analyzing these markers, the model gains insight into the cell type, its individual morphological features, and detects whether the cell is under some form of stress, which can affect protein distribution.
PUPS then merges the representations, or digital descriptions, created from each of these two models – the protein sequence model and the cell image model. By combining this information, the system predicts where the protein is located within a specific, individual cell. To visualize this prediction, an image decoder is used which generates an output image. This image clearly marks the area where PUPS predicts the protein under investigation is located.
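As a rough, self-contained sketch of this fusion-and-decode step (random placeholder weights, a crude per-channel image summary, and hypothetical function names — not the actual PUPS architecture):

```python
import numpy as np

def predict_localization(protein_vec, cell_images, seed=0):
    """Toy sketch of the fusion step: combine a protein representation with
    features from three stained cell images (nucleus, microtubules, ER) and
    decode a per-pixel localization probability map. Weights are random
    placeholders; the real system uses trained networks."""
    rng = np.random.default_rng(seed)
    h, w = cell_images.shape[1:]                        # cell_images: (3, H, W)
    img_feat = cell_images.reshape(3, -1).mean(axis=1)  # crude image summary
    joint = np.concatenate([protein_vec, img_feat])     # fused representation
    decoder = rng.normal(size=(joint.size, h * w)) * 0.1
    logits = joint @ decoder                            # "image decoder"
    probs = 1 / (1 + np.exp(-logits))                   # per-pixel probability
    return probs.reshape(h, w)

mask = predict_localization(np.zeros(8), np.zeros((3, 16, 16)))
print(mask.shape)  # (16, 16)
```

The design point the sketch preserves is that the output lives in image space: the prediction is a map over the individual cell, not a single class label averaged over a population.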
"Different cells within a single cell line exhibit different characteristics, and our model is able to understand that nuance," Tseo explains. This ability to distinguish individual cellular variations is crucial for precise analysis.
The user of the PUPS system needs to input the amino acid sequence that forms the protein of interest and three images of cellular markers – one for the nucleus, one for microtubules, and one for the endoplasmic reticulum. After inputting this data, PUPS performs the rest of the analysis and generates a localization prediction.
Deeper understanding through an innovative learning process
During the PUPS model training process, researchers applied several innovative techniques to teach it to effectively combine information from both constituent models. The goal was to enable PUPS to make an informed guess about the protein's location, even if it had never "seen" that specific protein or cell line before.
One of these techniques involves assigning a secondary task to the model during training: explicitly naming the localization compartment, such as the cell nucleus, mitochondria, or Golgi apparatus. This task is performed concurrently with the primary task of image inpainting (predicting where the protein is located in the image). This additional step has been shown to help the model learn more effectively and develop a better general understanding of possible cellular compartments and the signals that guide proteins to them. An analogy might be a teacher asking students not only to draw all parts of a flower but also to write their names. This additional requirement for naming enhances learning and understanding.
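The two objectives described above can be sketched as a combined loss: the main inpainting term plus an auxiliary compartment-naming term. The compartment list, loss forms, and the 0.3 weighting below are illustrative assumptions, not published hyperparameters:

```python
import numpy as np

COMPARTMENTS = ["nucleus", "mitochondria", "golgi"]  # illustrative subset

def multitask_loss(pred_mask, true_mask, compartment_logits,
                   true_compartment, aux_weight=0.3):
    """Sketch of the dual objective: a reconstruction loss on the predicted
    localization mask (here mean squared error) plus a softmax cross-entropy
    loss for explicitly naming the compartment."""
    inpaint_loss = np.mean((pred_mask - true_mask) ** 2)
    logits = compartment_logits - compartment_logits.max()  # numeric stability
    probs = np.exp(logits) / np.exp(logits).sum()
    target = COMPARTMENTS.index(true_compartment)
    class_loss = -np.log(probs[target] + 1e-12)
    return inpaint_loss + aux_weight * class_loss

loss = multitask_loss(np.zeros((4, 4)), np.zeros((4, 4)),
                      np.array([2.0, 0.1, 0.1]), "nucleus")
```

The auxiliary term gives the model a direct training signal about compartment identity, which is the "write the flower part's name, don't just draw it" idea from the analogy above.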
Furthermore, the fact that PUPS is simultaneously trained on data about proteins and cell lines helps it develop a deeper understanding of where proteins are typically localized in a cell image. The system learns to recognize subtle patterns and correlations between protein characteristics (derived from its sequence) and visual features of the cell (derived from marker images).
Impressively, PUPS can even independently understand how different parts of a protein sequence separately contribute to its overall localization. This means the model can identify specific amino acid motifs or domains within the protein that act as "zip codes," directing the protein to its destination in the cell.
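The "zip code" idea has concrete biological examples: the SV40 large T antigen carries the classic nuclear localization signal PKKKRKV, and the C-terminal KDEL motif marks proteins for retention in the endoplasmic reticulum. A deliberately simplified motif scan illustrates the concept (the regex patterns are rough simplifications for illustration; PUPS learns such signals from data rather than matching hand-written rules):

```python
import re

# Illustrative, simplified targeting motifs: a K/R-rich nuclear localization
# signal (NLS) pattern and the C-terminal KDEL ER-retention signal.
MOTIFS = {
    "nucleus (NLS-like)": re.compile(r"K[KR].[KR]K"),
    "ER retention (KDEL)": re.compile(r"KDEL$"),
}

def scan_zip_codes(seq: str) -> list:
    """Report which illustrative targeting motifs appear in a sequence.
    Real localization signals are far more varied and context-dependent."""
    return [name for name, pattern in MOTIFS.items() if pattern.search(seq)]

# The SV40 NLS "PKKKRKV" matches the simplified NLS-like pattern:
print(scan_zip_codes("MAPKKKRKVGG"))  # ['nucleus (NLS-like)']
```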
"Most other methods usually require you to first have a marker for the protein, so you've already seen it in your training data. Our approach is unique in that it can generalize simultaneously across proteins and cell lines," Zhang emphasizes. This ability to generalize to unseen cases is a key advantage of PUPS.
Because PUPS can generalize to proteins it did not encounter during training, it is able to capture changes in localization caused by unique protein mutations not included in the Human Protein Atlas. This is particularly important for studying genetic diseases where mutations can alter protein behavior, including its intracellular distribution.
Researchers confirmed PUPS's ability to predict the subcellular location of new proteins in previously unseen cell lines by conducting laboratory experiments and comparing the results. Compared with an existing baseline artificial intelligence method, PUPS showed lower prediction error, on average, across the tested proteins. These validation results support the robustness and accuracy of the new model.
Future directions and potential applications
Looking ahead, the research team plans to further improve PUPS. One of the goals is to enable the model to understand protein-protein interactions, i.e., how proteins interact with each other and how these interactions can affect their joint localization. They are also working on enabling PUPS to predict the localization of multiple proteins simultaneously within a single cell, thus providing a more complex picture of cellular organization.
The longer-term vision includes training PUPS to make predictions not only on cultured cells in laboratory conditions but also on samples of live human tissue. Such an advancement would have enormous significance for clinical diagnostics and therapy development, allowing for the analysis of protein localization in the real biological context of a patient. Understanding how proteins behave in the complex environment of tissues, with different cell types and intercellular interactions, would open new perspectives for personalized medicine. This pioneering work at the intersection of artificial intelligence, cell biology, and medicine promises to transform our approach to researching, diagnosing, and treating diseases, placing the power of predictive analytics at the service of human health.
The research was funded by the Eric and Wendy Schmidt Center at the Broad Institute, the National Institutes of Health (NIH), the National Science Foundation (NSF), the Burroughs Wellcome Fund, the Searle Scholars Program, the Harvard Stem Cell Institute, the Merkin Institute, the Office of Naval Research, and the U.S. Department of Energy.
Source: Massachusetts Institute of Technology
Creation time: 16 May, 2025