Experiments
This page summarizes our various attempts at applying machine learning.
Natural Language Processing
A comprehensive Chrome extension that uses several NLP techniques, such as NER, Q&A, and medical translation.
Below, we break it down piece by piece.
Our own medical-specific pretrained model: BioElectra
An ELECTRA model pretrained on PubMed abstracts and PMC text.
Named Entity Recognition
NER task: extract metadata or structured data from a long piece of text. The cases we show are actually pipelines with two steps:
- NER task: finds the region of the target (the underlined phrases within sentences).
- Classification: determines what the underlined text is.
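The two-step pipeline above can be sketched as follows. This is a minimal illustration only, not the production model: `TOY_LEXICON`, `find_spans`, `classify_span` and `extract_entities` are hypothetical names, and a simple word lookup stands in for the trained span detector and classifier.

```python
import re
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets into the text

# Hypothetical toy lexicon standing in for the two trained models.
TOY_LEXICON: Dict[str, str] = {
    "bicalutamide": "Drug",
    "palbociclib": "Drug",
    "egfr": "Gene",
}


def find_spans(text: str) -> List[Span]:
    """Step 1 (NER): locate the regions of interest in the text.

    A real system would use a token-tagging model here; this stub
    simply keeps every word that appears in the toy lexicon.
    """
    return [
        match.span()
        for match in re.finditer(r"[A-Za-z0-9]+", text)
        if match.group().lower() in TOY_LEXICON
    ]


def classify_span(text: str, span: Span) -> str:
    """Step 2 (classification): decide what the located phrase is."""
    start, end = span
    return TOY_LEXICON[text[start:end].lower()]


def extract_entities(text: str) -> List[Tuple[str, str]]:
    """Run both steps and return (phrase, label) pairs in text order."""
    return [
        (text[start:end], classify_span(text, (start, end)))
        for start, end in find_spans(text)
    ]


print(extract_entities("The report mentions Bicalutamide, Palbociclib and EGFR."))
# [('Bicalutamide', 'Drug'), ('Palbociclib', 'Drug'), ('EGFR', 'Gene')]
```

Keeping the two steps as separate functions means either one can be replaced by a stronger model without touching the other.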
NER Example
The following example tries to find drug names within text. For now, other targets such as Gene, Mutation, and Disease also work with a certain accuracy.
from gc_lab.drug_norm import DrugNorm
dn = DrugNorm.from_db()
print(dn.find_drug("The drugs like Bicalutamide, famitinib and Palbociclib."))
This will output:
["Bicalutamide", "Famitinib", "Palbociclib"]
These names match the knowledge base records from Genomicare. We also have an API ready for this.
Cloze as an inference tool
A cloze (fill-in-the-blank) task can serve as an inference tool: we mask part of a sentence and let the model predict the missing text.
Question & Answering
When a text is too long and you just want a quick answer from it, you can type in a question and the model will underline an answer in the text. For an example, see the SAPERE AUDE demo video above.
Text Generation
- Text generation: treat most NLP tasks as text generation problems (GPT and GPT-2 based)
Machine Translation
Currently we use a commercial API from a big tech company for translation, but high-quality translation in a designated domain can be achieved by fine-tuning a translation model, given enough labeled data.
Graph Learning
Machine learning is good at exploiting discrete data, where the features of a node can be learned from records of interactions between nodes. For example, we do not define any features for genes or drugs, but instead learn a mapping from their combinations.
- Here is an example model built from an open dataset: openbiolink
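To make the idea of learning node features purely from interactions concrete, here is a minimal sketch. It is not the openbiolink model: it assumes a simple logistic matrix factorization as a stand-in technique, and the node names (`gene_A`, `drug_X`, ...), the toy interaction set, and the functions `train_embeddings` and `score` are all hypothetical.

```python
import math
import random
from typing import Dict, List, Sequence, Set, Tuple

Embeddings = Dict[str, List[float]]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def train_embeddings(
    genes: Sequence[str],
    drugs: Sequence[str],
    positives: Set[Tuple[str, str]],
    dim: int = 4,
    epochs: int = 1000,
    lr: float = 0.1,
    seed: int = 0,
) -> Tuple[Embeddings, Embeddings]:
    """Learn one vector per node from interaction records only.

    No hand-made node features are used: vectors start as small random
    numbers and are fitted so that sigmoid(gene . drug) is high for
    observed pairs and low for every other pair.
    """
    rng = random.Random(seed)
    gene_vec = {g: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for g in genes}
    drug_vec = {d: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for d in drugs}
    # Every gene-drug pair is a training example; unobserved pairs get label 0.
    pairs = [(g, d, 1.0 if (g, d) in positives else 0.0) for g in genes for d in drugs]
    for _ in range(epochs):
        rng.shuffle(pairs)
        for g, d, label in pairs:
            u, v = gene_vec[g], drug_vec[d]
            error = sigmoid(sum(a * b for a, b in zip(u, v))) - label
            for k in range(dim):
                u_k, v_k = u[k], v[k]  # use the old values for both updates
                u[k] -= lr * error * v_k
                v[k] -= lr * error * u_k
    return gene_vec, drug_vec


def score(gene_vec: Embeddings, drug_vec: Embeddings, gene: str, drug: str) -> float:
    """Predicted interaction probability for a gene-drug pair."""
    return sigmoid(sum(a * b for a, b in zip(gene_vec[gene], drug_vec[drug])))


# Hypothetical toy graph, for illustration only.
genes = ["gene_A", "gene_B", "gene_C"]
drugs = ["drug_X", "drug_Y", "drug_Z"]
positives = {("gene_A", "drug_X"), ("gene_B", "drug_Y"), ("gene_C", "drug_Z")}

gene_vec, drug_vec = train_embeddings(genes, drugs, positives)
for g in genes:
    print(g, {d: round(score(gene_vec, drug_vec, g, d), 2) for d in drugs})
```

On a real dataset one would sample negative pairs instead of enumerating all of them, and would hold out some interactions to evaluate the learned mapping.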
Computer Vision
PD-L1 - Immunohistochemical Analysis
From PD-L1 IHC slides to … anything
Image Data Privacy Protection
Redacting patient information using OCR and keywords
For now, we use PaddleOCR for the OCR layer of the pipeline, then redact the polygon regions whose text matches specific, configurable keyword rules.
If necessary, and given enough labeled data, we can fine-tune the OCR model.
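The keyword-matching step of the redaction pipeline could look roughly like the sketch below. It assumes the OCR layer has already been run and its output converted into `(polygon, text)` pairs; that input shape, the sample lines, and the names `select_regions_to_redact` and `bounding_box` are assumptions for illustration, not the actual pipeline code or the PaddleOCR API.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Polygon = List[Point]
# Assumed OCR output shape: one (polygon, recognized_text) pair per text line.
OcrLine = Tuple[Polygon, str]


def select_regions_to_redact(
    ocr_lines: Sequence[OcrLine], keywords: Sequence[str]
) -> List[Polygon]:
    """Return the polygons whose text contains any keyword (case-insensitive)."""
    lowered = [keyword.lower() for keyword in keywords]
    return [
        polygon
        for polygon, text in ocr_lines
        if any(keyword in text.lower() for keyword in lowered)
    ]


def bounding_box(polygon: Polygon) -> Tuple[float, float, float, float]:
    """Axis-aligned (left, top, right, bottom) box that covers the polygon."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


# Made-up OCR lines, for illustration only.
sample_lines: List[OcrLine] = [
    ([(10, 10), (200, 10), (200, 30), (10, 30)], "Patient Name: EXAMPLE"),
    ([(10, 40), (200, 40), (200, 60), (10, 60)], "Scan date: EXAMPLE"),
    ([(10, 70), (200, 70), (200, 90), (10, 90)], "Hospital ID: EXAMPLE"),
]
rules = ["patient name", "hospital id"]  # configurable keyword rules

for polygon in select_regions_to_redact(sample_lines, rules):
    print(bounding_box(polygon))  # region to paint over in the image
# (10, 10, 200, 30)
# (10, 70, 200, 90)
```

Painting over the selected regions is left to whichever image library the pipeline uses; the sketch only covers the rule-matching decision.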
MRI Image
- Brain tumor type classification and feature visualization using MRI images (T1-enhanced open dataset)
- Brain tumor region prediction using MRI images (T1-enhanced open dataset)
Organic Compound Property Prediction
We can perform multi-target prediction on a single organic compound, mostly chemicals with a molecular mass below 1000.
- Predict mutagenicity (one type of toxicity) from a SMILES string, with predictions on CKB drugs