AI Teams

Drug Discovery

A US-based biotech company aims to improve the quality of life of patients with degenerative diseases by treating those diseases and extending patients’ healthspan. Its research focuses on discovering therapeutic proteins that help cells live longer and regenerate.


The research scientists needed to identify proteins with rejuvenating properties to treat sarcopenia, a form of myopathy in which individuals lose the ability to regenerate their skeletal muscles. Research and experimentation were not yielding the number of proteins needed for the next stage of discovery. Each of the 20,000 proteins had 20,000 features that needed to be analyzed to identify a true positive. The traditional approach to assessing each protein during the search phase would cost roughly $2,000 per protein, or about $40 million if all 20,000 proteins were evaluated through wet lab experiments. The client’s goal was to reduce the cost and time of analyzing proteins during the research phase.

Many organizations face the same challenges the client did:
1. Unlabeled data
A common problem in many business applications is unlabeled data that is difficult or costly to assess. In this case, assessing each protein would cost $2,000.
2. Data within a square matrix
With limited data, it is difficult to properly model the complex interactions between all the variables. A square matrix means there are as many independent variables as there are equations to model their behavior; here, 20,000 features for 20,000 proteins. Machine learning algorithms work best when there are far more equations than variables describing them. The more data available to a machine learning model, the better.
3. Outliers in your data
Outliers are extreme data points that fall outside the range of the rest of the data. They are important to consider when transforming and cleaning data to avoid issues with statistical procedures, and handling them is not trivial in ML tasks. In this case, some proteins were out of range compared to other data points, so complex feature engineering was required to feed the values into neural networks.
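As a rough illustration of what identifying out-of-range values can look like, here is a minimal Python sketch that flags outliers with the common interquartile range (IQR) rule. The rule, the function name `flag_outliers`, the multiplier `k`, and the sample values are all assumptions chosen for illustration; the case study does not say which method was used on the client’s data.

```python
import statistics

def flag_outliers(values, k=1.5):
    """Return indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [i for i, v in enumerate(values) if v < low or v > high]

# Hypothetical measurements of one feature across eight proteins.
feature_values = [1.0, 1.1, 0.9, 1.2, 1.0, 0.8, 1.1, 9.5]
outlier_indices = flag_outliers(feature_values)
print(outlier_indices)  # the last value (index 7) is flagged
```

The sketch only flags the suspicious values so they can be reviewed; it does not overwrite or remove them.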


The client was striving to lower the cost and time of running experiments
A substantial amount of data needed to be analyzed quickly


Within 3 weeks of meeting, Fusemachines was able to onboard 1 PhD data scientist and 3 engineers to work with the client. The data scientist and engineers were in direct communication with the client’s internal team and project leaders throughout the integration process. They worked to help solve protein prediction as well as extract information from research documents using natural language processing (NLP) to help corroborate the predictions.
As a second step, the client wanted to use NLP to comb through scientific journals and refine the protein search based on the specific disease being analyzed and the different predicted proteins. This would help speed up the process by confirming that the findings aligned with contemporary scientific literature.
To successfully predict proteins, Fuse engineers pre-processed the data to make sure there weren’t any inherent biases. To tackle the issue of unlabeled data, engineers used techniques to produce pseudo-labels, which help the AI model make better predictions. In short, this solution involved taking the “unknown” proteins, making predictions on them, then taking the proteins that were predicted positive and feeding them into the model again for training, but this time as labeled examples.
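The pseudo-labeling loop described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the model Fusemachines built: the nearest-centroid classifier, the `threshold` of 0.8, the number of rounds, and the two-feature sample points are all hypothetical stand-ins.

```python
import math

def centroid(points):
    """Average each coordinate across a list of points."""
    return [sum(c) / len(points) for c in zip(*points)]

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def fit(positives, negatives):
    """The 'model' is just the two class centroids."""
    return centroid(positives), centroid(negatives)

def positive_score(model, x):
    """Score between 0 and 1; higher means closer to the positive centroid."""
    pos_c, neg_c = model
    d_pos, d_neg = distance(x, pos_c), distance(x, neg_c)
    return d_neg / (d_pos + d_neg)

def pseudo_label(positives, negatives, unlabeled, threshold=0.8, rounds=3):
    positives, unlabeled = list(positives), list(unlabeled)
    model = fit(positives, negatives)
    for _ in range(rounds):
        # Predict on the unknown examples and keep the confident positives.
        confident = [x for x in unlabeled if positive_score(model, x) >= threshold]
        if not confident:
            break
        # Feed them back in as labeled examples and retrain.
        positives.extend(confident)
        unlabeled = [x for x in unlabeled if x not in confident]
        model = fit(positives, negatives)
    return model, positives, unlabeled

# Hypothetical two-feature examples standing in for proteins.
known_pos = [(1.0, 1.0), (1.2, 0.9)]
known_neg = [(-1.0, -1.0), (-0.8, -1.1)]
unknown = [(0.9, 1.1), (1.1, 1.3), (-0.9, -0.9), (0.1, 0.0), (0.6, 0.5)]

model, grown_pos, still_unknown = pseudo_label(known_pos, known_neg, unknown)
print(len(grown_pos), "positives after pseudo-labeling;", len(still_unknown), "still unlabeled")
```

Only predictions above the confidence threshold are promoted to labels, which limits how many wrong pseudo-labels get fed back into training.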

Leveraging Data

1. Collect data
Organizations looking to leverage machine learning and AI should establish a data strategy to retain or acquire the proper amount and type of data to facilitate accelerated research outcomes. Fusemachines assessed the client’s data capabilities and developed a strategy to obtain the data needed for the project. Fusemachines staff developed all the necessary data infrastructure and worked with the client to identify the best toolset for the project and future needs.
2. Assess data
Feature selection is critical for the development of accurate machine learning algorithms and requires properly labeled data. Square matrices presented inherent challenges in feature selection. Fusemachines’ approach used a convolutional autoencoder that compressed the 20,000 features into a latent space with fewer dimensions. This forced the model to remove any redundant features in the data, saving the hundreds to thousands of hours it would take to analyze each of the 400 million feature values (20,000 proteins × 20,000 features) in the data set.
3. Pay attention to nuances
It’s important to scope out the nuances when dealing with a project like this. In this case, small variations in a protein’s attributes can turn it into an entirely different protein. Therefore, imputing the values of outliers just so the data can be normalized and fed into a neural network would end up producing faulty results. It’s crucial to understand what the data means, how the algorithms function, and identify any outliers to make sure there are no issues with statistical analyses.
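The compression step described under “2. Assess data” can be illustrated with a much smaller model. The project used a convolutional autoencoder; the sketch below swaps in a plain linear autoencoder written in pure Python, only to show the idea of squeezing redundant features into a smaller latent space. The six synthetic features, the latent size of 2, the learning rate, and every function name are assumptions made for this example.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def mse(x, x_hat):
    """Squared reconstruction error, averaged over samples."""
    total = sum((p - q) ** 2 for r, s in zip(x, x_hat) for p, q in zip(r, s))
    return total / len(x)

def train_linear_autoencoder(x, latent_dim, epochs=500, lr=0.01, seed=0):
    rng = random.Random(seed)
    n, d = len(x), len(x[0])
    enc = [[rng.uniform(-0.1, 0.1) for _ in range(latent_dim)] for _ in range(d)]
    dec = [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(latent_dim)]
    history = []
    for _ in range(epochs):
        z = matmul(x, enc)      # latent codes, n x latent_dim
        x_hat = matmul(z, dec)  # reconstruction, n x d
        history.append(mse(x, x_hat))
        err = [[p - q for p, q in zip(r, s)] for r, s in zip(x_hat, x)]
        # Gradients of the mean squared error (both use the pre-update weights).
        grad_dec = matmul(transpose(z), err)
        grad_enc = matmul(transpose(x), matmul(err, transpose(dec)))
        scale = 2.0 * lr / n
        dec = [[w - scale * g for w, g in zip(wr, gr)] for wr, gr in zip(dec, grad_dec)]
        enc = [[w - scale * g for w, g in zip(wr, gr)] for wr, gr in zip(enc, grad_enc)]
    return enc, dec, history

def make_redundant_data(n=40, seed=1):
    """Six features that are all combinations of two hidden factors."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        a, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
        rows.append([a, b, a + b, a - b, 2 * a, 2 * b])
    return rows

data = make_redundant_data()
encoder, decoder, losses = train_linear_autoencoder(data, latent_dim=2)
codes = matmul(data, encoder)
print(len(data[0]), "features ->", len(codes[0]), "latent dimensions")
print("reconstruction error: %.4f -> %.4f" % (losses[0], losses[-1]))
```

Because the six synthetic features carry only two independent signals, a two-dimensional latent space is enough to reconstruct them, which is the sense in which an autoencoder discards redundant features.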
Alternative Applications
  • Identification of alternative therapies
  • Medical imaging, such as detecting COVID-19 from chest CT scans with convolutional neural networks (CNNs)
  • Cytology and histology analyses
  • Detecting cancerous tumors or identifying specific attributes


From the 20,000 proteins, each with 20,000 unique features, the AI/ML system that the Fuse team developed in 4 months produced the ranked list the client requested: the top 100 proteins predicted to have rejuvenating properties. Of the 10 proteins chosen for experiments, 8 were found to be true positives, far exceeding the client’s expectation of 1. By developing a system that uses machine learning and natural language processing, Fusemachines helped the client save time and money by shortening the research phase.
Project duration 6 months
80% Identification rate