Drug Discovery | Fusemachines

Problem

The research scientists needed to identify proteins with rejuvenating properties for the treatment of sarcopenia, a form of myopathy in which individuals lose the ability to regenerate their skeletal muscles. Research and experimentation were not yielding the number of proteins needed for the next stage of discovery. Each of the 20,000 proteins had 20,000 features that needed to be analyzed to identify a true positive. The traditional approach to assessing each protein during the search phase would cost roughly $2,000 per protein, or about $40 million if all the proteins were evaluated through wet lab experiments. The client’s goal was to reduce the cost and time of analyzing proteins during the research phase.

Many organizations face the same problems that challenged the client:

1. Unlabeled data

A common problem in many business applications is unlabeled data, which can be difficult or costly to assess. In this case, assessing each protein would cost about $2,000.

2. Data within a square matrix

A square data matrix has as many independent variables (features) as it has equations (samples) to help model their behavior; here, 20,000 proteins each had 20,000 features. With so little data relative to the number of variables, it is difficult to properly model the complex interactions between them. Machine learning algorithms work best when there are far more equations than variables. In general, the more data available to a machine learning model, the better.

3. Outliers in your data

Outliers are extreme data points that fall far outside the range of the rest of the data. They are important to consider when transforming and cleaning data, both to avoid issues with statistical procedures and because handling them is non-trivial in ML tasks. In this case, some proteins were out of range compared to the other data points, and complex feature engineering was required to feed their values into neural networks.

Challenge

The client was striving to lower the cost and time of running experiments

A substantial amount of data needed to be analyzed quickly

Solution

Within 3 weeks of the first meeting, Fusemachines onboarded 1 PhD data scientist and 3 engineers to work with the client. During the integration process, the data scientist and engineers were in direct communication with the client’s internal team and project leaders. They worked on protein prediction and on extracting information from research documents using natural language processing (NLP) to help corroborate the predictions.

As a second step, the client wanted to use NLP to comb through scientific journals and refine the protein search based on the specific disease being analyzed and the predicted proteins. This would help speed up the process by checking that the findings align with contemporary scientific literature.
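One simple way to picture this literature check is to count how often each predicted protein is mentioned in the same abstract as the disease. The sketch below is only an illustration, not the NLP pipeline used in the project: the function name, the protein names (PROT_A, PROT_B), and the abstracts are all hypothetical placeholders.

```python
import re
from collections import Counter

def cooccurrence_counts(abstracts, disease_term, protein_names):
    """Count abstracts that mention both the disease and each protein name."""
    disease = re.compile(r"\b" + re.escape(disease_term) + r"\b", re.IGNORECASE)
    patterns = {
        name: re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
        for name in protein_names
    }
    counts = Counter()
    for text in abstracts:
        if not disease.search(text):
            continue  # skip abstracts that never mention the disease
        for name, pattern in patterns.items():
            if pattern.search(text):
                counts[name] += 1
    return counts

# Placeholder abstracts, invented purely to exercise the function.
abstracts = [
    "Placeholder text: PROT_A is mentioned alongside sarcopenia.",
    "Placeholder text: sarcopenia is mentioned with PROT_A and PROT_B.",
    "Placeholder text: PROT_B is mentioned with an unrelated condition.",
]
counts = cooccurrence_counts(abstracts, "sarcopenia", ["PROT_A", "PROT_B"])
print(counts.most_common())
```

Proteins that rank highly in the model and also co-occur with the disease in the literature would be natural candidates to review first. A production system would need synonym handling and entity recognition rather than exact string matching.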

To successfully predict proteins, Fuse engineers pre-processed the data to make sure there weren’t any inherent biases. To tackle the issue of unlabeled data, engineers used techniques to produce pseudo-labels, which help the AI model make better predictions. In short, this solution involved taking the “unknown” proteins, making predictions on them, then taking the proteins that were predicted positive and feeding them into the model again for training, but this time as labeled examples.
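The pseudo-labeling loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the model used in the project: a nearest-centroid rule stands in for the real classifier, and the function names, the `margin` confidence threshold, and the two-dimensional toy data are all hypothetical.

```python
import math

def centroid(rows):
    """Component-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pseudo_label(positives, negatives, unlabeled, margin=1.0, max_rounds=5):
    """Self-training: promote confident positive predictions to labeled examples."""
    positives, unlabeled = list(positives), list(unlabeled)
    for _ in range(max_rounds):
        pos_c, neg_c = centroid(positives), centroid(negatives)
        # "Confident positive" (assumed rule): closer to the positive
        # centroid than to the negative one by at least `margin`.
        confident = [x for x in unlabeled
                     if distance(x, neg_c) - distance(x, pos_c) >= margin]
        if not confident:
            break  # nothing new to learn from, stop early
        positives.extend(confident)  # retrain on these as labeled data
        unlabeled = [x for x in unlabeled if x not in confident]
    return positives, unlabeled

# Toy data, invented for illustration only.
known_pos = [[5.0, 5.0], [6.0, 5.5]]
known_neg = [[0.0, 0.0], [1.0, 0.5]]
unknown = [[5.5, 6.0], [0.5, 0.2], [4.0, 4.5], [3.0, 2.8]]
labeled_pos, still_unknown = pseudo_label(known_pos, known_neg, unknown)
print(len(labeled_pos), still_unknown)
```

Only predictions that clear the confidence threshold are promoted, because promoting uncertain predictions would feed the model’s own mistakes back into training.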

Leveraging Data

1. Collect data

Organizations looking to leverage machine learning and AI should establish a data strategy to retain or acquire the proper amount and type of data to accelerate research outcomes. Fusemachines assessed the client’s data capabilities and developed a strategy to obtain the data needed for the project. Fusemachines staff developed all the necessary data infrastructure and worked with the client to identify the best toolset for the project and for future needs.

2. Assess data

Feature selection is critical for the development of accurate machine learning algorithms and requires properly labeled data. Square matrices presented inherent challenges in feature selection. Fusemachines’ approach used a convolutional autoencoder that compressed the 20,000 features into a latent space with fewer dimensions. This forced the model to discard redundant information in the data, saving the hundreds to thousands of hours it would have taken to analyze each of the 400 million feature values (20,000 proteins × 20,000 features) in the data set.
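To make the compression idea concrete, here is a deliberately tiny sketch. It is not the convolutional autoencoder used in the project: it swaps in a dense linear autoencoder written in plain Python, and the function names, hyperparameters, and 4-feature toy data are assumptions chosen for illustration. The toy data has 4 features, 2 of which are redundant, and the model learns to squeeze each sample into 2 latent values.

```python
import random

def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def mean_reconstruction_error(enc, dec, data):
    """Average squared error between each sample and its reconstruction."""
    total = 0.0
    for x in data:
        x_hat = matvec(dec, matvec(enc, x))
        total += sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return total / len(data)

def train_linear_autoencoder(data, latent_dim, epochs=300, lr=0.01, seed=0):
    """Fit encoder/decoder weights with plain stochastic gradient descent."""
    rng = random.Random(seed)
    n = len(data[0])
    enc = [[rng.uniform(-0.1, 0.1) for _ in range(n)] for _ in range(latent_dim)]
    dec = [[rng.uniform(-0.1, 0.1) for _ in range(latent_dim)] for _ in range(n)]
    history = [mean_reconstruction_error(enc, dec, data)]  # error before training
    for _ in range(epochs):
        for x in data:
            z = matvec(enc, x)      # compressed (latent) representation
            x_hat = matvec(dec, z)  # reconstruction from the latent code
            # Error signal; the constant factor 2 is folded into the learning rate.
            err = [p - t for p, t in zip(x_hat, x)]
            # Backpropagate through the decoder to the latent code.
            grad_z = [sum(dec[i][j] * err[i] for i in range(n))
                      for j in range(latent_dim)]
            for i in range(n):
                for j in range(latent_dim):
                    dec[i][j] -= lr * err[i] * z[j]
            for j in range(latent_dim):
                for i in range(n):
                    enc[j][i] -= lr * grad_z[j] * x[i]
        history.append(mean_reconstruction_error(enc, dec, data))
    return enc, dec, history

# Toy data: columns 3 and 4 are a copy and a multiple of columns 1 and 2.
base = [(0.1, 0.9), (0.8, 0.2), (0.5, 0.5), (0.9, 0.7), (0.3, 0.1), (0.6, 0.4)]
data = [[a, b, a, 2 * b] for a, b in base]

enc, dec, history = train_linear_autoencoder(data, latent_dim=2)
latent = matvec(enc, data[0])  # 2 numbers standing in for 4 features
print(f"error before: {history[0]:.4f}, after: {history[-1]:.4f}")
```

Because the bottleneck is narrower than the input, the model can only reconstruct the data well by keeping the information that is not redundant. A convolutional autoencoder applies the same principle with convolutional layers and nonlinearities.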

3. Pay attention to nuances

It’s important to account for the nuances when dealing with a project like this. In this case, small variations in a protein’s attributes can make it an entirely different protein. Therefore, imputing the values of outliers just so the data can be normalized and fed into a neural network would end up producing faulty results. It’s crucial to understand what the data means, how the algorithms function, and where the outliers are, to make sure there are no issues with statistical analyses.
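One way to respect this nuance is to flag outliers for review instead of overwriting them. The sketch below is an assumption about how that could look, not the project’s feature engineering: it uses the modified z-score based on the median and the median absolute deviation (MAD), with the conventional 0.6745 scaling constant and 3.5 cutoff, and the function name and sample values are hypothetical.

```python
from statistics import median

def flag_outliers(values, threshold=3.5):
    """Return indices whose modified z-score exceeds the threshold.

    The original values are left untouched; nothing is imputed.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread around the median, so no score can be computed
    return [i for i, v in enumerate(values)
            if abs(0.6745 * (v - med) / mad) > threshold]

# Hypothetical measurements for one feature across six proteins.
feature_values = [1.0, 1.2, 0.9, 1.1, 1.0, 9.5]
print(flag_outliers(feature_values))
```

The flagged indices can then be handled deliberately, for example with a separate indicator feature or a domain expert’s review, rather than being silently replaced by a typical value.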

Alternative Applications

Identification of alternative therapies
Medical imaging in detecting COVID from chest CT scans (CNN)
Cytology and histology analyses
Detecting cancerous tumors or identifying specific attributes

Results

From the 20,000 proteins, each with 20,000 features, the AI/ML system that the Fuse team developed in 4 months produced the ranked list of the top 100 proteins with rejuvenating properties that the client had requested. Of the 10 proteins chosen for experiments, 8 were found to be true positives, far exceeding the client’s expectation of 1. By building a system that combines machine learning and natural language processing, Fusemachines helped the client save time and money and shorten the research phase.

Project duration 6 months

80% Identification rate
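The headline figures can be reproduced with simple arithmetic. The sketch below uses only numbers stated in this case study; the function name is hypothetical, and applying the $2,000 per-protein figure to the 10 validation experiments is an assumption, since the case study does not state what those experiments cost.

```python
def summarize(n_proteins, cost_per_assay, n_tested, n_true_positive):
    """Compare exhaustive wet-lab screening with testing a model-ranked shortlist."""
    return {
        "exhaustive_cost": n_proteins * cost_per_assay,
        # Assumption: shortlist experiments cost the same per protein.
        "shortlist_cost": n_tested * cost_per_assay,
        "hit_rate": n_true_positive / n_tested,
    }

summary = summarize(n_proteins=20_000, cost_per_assay=2_000,
                    n_tested=10, n_true_positive=8)
print(summary)
```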
