SMILES and Bioassay Search Models
ISEF Category: Computational Biology and Bioinformatics
Ready to Turn This Idea Into a Real Project?
This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.
For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →
Subcategory: Other · Difficulty: Advanced · Setup: University Lab · Time: Full Year
The Hook
A drug molecule can look like random letters to you, but to a machine, those letters can become a searchable pattern. That means you could ask for compounds with a specific biological effect, then test whether the model finds the right ones. This is the same idea behind modern search systems that rank results by meaning rather than exact keywords, but applied to chemistry and biology. If you like coding and real science data, this topic gives you both.
What Is It?
This project asks whether a computer can learn to match two very different kinds of information about the same compound: its chemical structure and its bioassay description. SMILES (Simplified Molecular Input Line Entry System) is a text code that describes a molecule by listing its atoms and bonds in a single line. For example, ethanol is written as CCO. Bioassay text describes what researchers observed, such as whether a compound reduced oxidative stress or changed cell survival.
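To see what "a text code" means in practice, here is a minimal sketch that splits a SMILES string into tokens using only Python's standard library. The regular expression and the function name `tokenize_smiles` are illustrative assumptions, not the tokenizer of RDKit or of any specific model, and the pattern covers only common organic atoms and symbols. It checks the string character by character; it does not tell you whether the molecule is chemically valid.

```python
import re

# Hypothetical token pattern: bracket atoms, two-letter halogens,
# two-digit ring closures, single-letter atoms, bond/branch symbols, digits.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|%\d{2}|[BCNOSPFIbcnosp]|[()=#\-+\\/.:~@*$]|\d)"
)


def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into tokens, or raise if any character is unrecognized."""
    tokens = SMILES_TOKEN.findall(smiles)
    # If the tokens do not rebuild the input, some character was skipped.
    if "".join(tokens) != smiles:
        raise ValueError(f"Unrecognized characters in SMILES: {smiles!r}")
    return tokens


if __name__ == "__main__":
    for example in ["CCO", "ClCCBr", "[Na+].[Cl-]"]:
        print(example, tokenize_smiles(example))
```

Running a check like this over your whole dataset before training is one way to find strings that would otherwise confuse a tokenizer later.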
Think of it like pairing a face with a voice. The face is the SMILES string, and the voice is the assay description. A cross-modal contrastive model learns by pulling matched pairs closer together and pushing mismatched pairs apart. If it works well, you can type a plain-English query and rank compounds that may fit that biological effect.
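One common way to express "pull matched pairs together, push mismatched pairs apart" is an InfoNCE-style contrastive loss. The sketch below is an assumption about how such a loss could look, not the method of any particular paper or library. The function names, the toy two-dimensional vectors, and the temperature of 0.1 are all hypothetical, and it scores only the molecule-to-text direction. A real project would compute this on batches of encoder outputs in a framework such as PyTorch.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(mol_vecs, text_vecs, temperature: float = 0.1) -> float:
    """Average loss for picking the matched text (same index) for each molecule.

    Assumes mol_vecs[i] and text_vecs[i] describe the same compound.
    """
    n = len(mol_vecs)
    total = 0.0
    for i in range(n):
        # Similarity of molecule i to every text in the batch, scaled by temperature.
        logits = [cosine(mol_vecs[i], t) / temperature for t in text_vecs]
        # Numerically stable log-sum-exp over all candidates.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        # Cross-entropy with the matched pair as the correct "class".
        total += log_denominator - logits[i]
    return total / n


if __name__ == "__main__":
    molecules = [[1.0, 0.0], [0.0, 1.0]]
    matched_texts = [[1.0, 0.0], [0.0, 1.0]]
    swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
    print("matched loss:", info_nce(molecules, matched_texts))
    print("swapped loss:", info_nce(molecules, swapped_texts))
```

The loss is small when each molecule sits closest to its own text and large when the pairs are swapped, which is the signal training uses to adjust both encoders.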
Why This Is a Good Topic
This is a strong science fair topic because you can test a clear question, measure model performance, and compare different design choices. It connects to real drug discovery, where researchers need faster ways to search huge chemical databases. You can learn data cleaning, natural-language processing, representation learning, and evaluation, all with public data and code. A strong project here is not just about training a model, but about proving when it works, when it fails, and why.
Research Questions
- How does the choice of text encoder affect retrieval accuracy for matching SMILES strings to PubChem BioAssay descriptions?
- What is the effect of using assay titles only versus assay titles plus full descriptions on cross-modal alignment?
- Does filtering PubChem BioAssay records by assay type improve the model's ability to rank relevant compounds?
- To what extent does adding molecular fingerprints improve retrieval compared with SMILES-only embeddings?
- Which similarity threshold best separates true matched compound-assay pairs from random pairs?
- How does the model perform on queries about specific biological effects compared with queries about general assay categories?
Basic Materials
- Laptop or desktop with at least 16 GB RAM.
- Python 3.10 or later.
- Public PubChem BioAssay data exported as CSV or TSV.
- SMILES and assay text dataset with matched compound-assay pairs.
- Notebook software such as JupyterLab or Google Colab.
- Basic spreadsheet software for inspecting samples.
- Internet access for downloading public data and documentation.
Advanced Materials
- Access to a university or lab workstation with a GPU.
- Python environment with PyTorch or TensorFlow.
- PubChem BioAssay bulk data or API access.
- RDKit for molecular parsing and fingerprint generation.
- Hugging Face Transformers for text encoders.
- Vector database or approximate nearest neighbor (ANN) library such as FAISS for retrieval tests.
- Python packages for statistics and plotting, such as SciPy, pandas, and seaborn.
Software & Tools
- Python: Runs data cleaning, model training, and evaluation scripts.
- RDKit: Parses SMILES strings and generates molecular features.
- PyTorch: Trains the contrastive model and handles embeddings.
- Hugging Face Transformers: Provides pretrained text encoders for bioassay descriptions.
- PubChem: Supplies public compound and assay records for building the dataset.
Experiment Steps
- Define the exact retrieval task, such as matching an assay description to the correct compound or ranking compounds for a text query.
- Build a clean paired dataset, then decide how you will split training, validation, and test examples to avoid data leakage.
- Choose two encoders, one for SMILES and one for text, and decide how you will turn each output into a shared embedding space.
- Plan a baseline, such as random ranking or keyword search, so you can prove the model adds value.
- Set up evaluation metrics that fit search problems, such as top-k accuracy, mean reciprocal rank, and enrichment factor (how many times more true matches appear in the top-ranked results than random ranking would produce).
- Design an error analysis plan that checks which assay types, compound classes, or query styles cause the most mistakes.
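Two of the ranking metrics named in the steps above can be sketched in a few lines. This is a minimal illustration that assumes each query has exactly one correct compound. The function names and the compound IDs in the example are made up for demonstration.

```python
def reciprocal_rank(ranked_ids: list[str], correct_id: str) -> float:
    """Return 1/position of the correct item, or 0.0 if it is not in the ranking."""
    for position, candidate in enumerate(ranked_ids, start=1):
        if candidate == correct_id:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(rankings: list[list[str]], answers: list[str]) -> float:
    """Average the reciprocal rank over all queries."""
    scores = [reciprocal_rank(r, a) for r, a in zip(rankings, answers)]
    return sum(scores) / len(scores)


def top_k_accuracy(rankings: list[list[str]], answers: list[str], k: int) -> float:
    """Fraction of queries whose correct item appears in the first k results."""
    hits = sum(1 for r, a in zip(rankings, answers) if a in r[:k])
    return hits / len(answers)


if __name__ == "__main__":
    # Hypothetical model output: one ranked compound list per text query.
    rankings = [
        ["c1", "c2", "c3"],
        ["c2", "c3", "c1"],
        ["c3", "c1", "c2"],
    ]
    answers = ["c1", "c1", "c2"]
    print("MRR:", mean_reciprocal_rank(rankings, answers))
    print("top-1:", top_k_accuracy(rankings, answers, k=1))
    print("top-3:", top_k_accuracy(rankings, answers, k=3))
```

Running the same functions on a random ranking or a keyword-search ranking gives you the baseline numbers to compare against.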
Common Pitfalls
- Mixing near-duplicate assay records across train and test sets, which makes the model look better than it really is.
- Treating noisy assay text as if every sentence means the same kind of biological effect.
- Using raw SMILES strings without checking whether invalid or rare tokens break the tokenizer.
- Evaluating only on overall accuracy and missing the ranking quality that matters for search.
- Ignoring class imbalance, which can let the model learn common assay language instead of true chemical-biology alignment.
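The first pitfall, near-duplicate assay records leaking across splits, can be reduced by assigning whole groups of records to one split instead of splitting row by row. The sketch below is one possible approach under stated assumptions: it treats assays whose titles match after lowercasing and whitespace cleanup as the same group, and the field names `smiles` and `assay_title` are hypothetical. Real near-duplicates may need a stronger grouping key, such as an assay family or a fuzzy text match.

```python
import hashlib


def normalize_title(title: str) -> str:
    """Lowercase and collapse whitespace so trivially different titles share a key."""
    return " ".join(title.lower().split())


def assign_split(group_key: str, test_fraction: float = 0.2) -> str:
    """Deterministically map a group key to 'train' or 'test' using a hash."""
    digest = hashlib.sha256(group_key.encode("utf-8")).hexdigest()
    # Turn the first 8 hex digits into a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "test" if bucket < test_fraction else "train"


def split_pairs(pairs, key_fn, test_fraction: float = 0.2):
    """Send every pair with the same group key to the same split."""
    train, test = [], []
    for pair in pairs:
        if assign_split(key_fn(pair), test_fraction) == "test":
            test.append(pair)
        else:
            train.append(pair)
    return train, test


if __name__ == "__main__":
    # Made-up records for illustration only.
    pairs = [
        {"smiles": "CCO", "assay_title": "Cell Viability  Assay"},
        {"smiles": "CCN", "assay_title": "cell viability assay"},
        {"smiles": "CCC", "assay_title": "Oxidative stress reporter assay"},
    ]
    train, test = split_pairs(pairs, key_fn=lambda p: normalize_title(p["assay_title"]))
    print(len(train), "train pairs,", len(test), "test pairs")
```

Because the split depends only on the group key, rerunning the script or adding new records never moves an existing assay from one split to the other.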
What Makes This Competitive
A classroom-level version of this project only shows that a model can run. A stronger version tests whether the model generalizes across assay families, query styles, and compound classes. You can also compare several encoders, add hard negative samples, and report ranking metrics instead of plain accuracy. That kind of careful analysis shows you understand both the biology and the machine learning.
Project Variations
- Train the same alignment model on ChEMBL assay text instead of PubChem BioAssay to compare how the source database affects retrieval performance.
- Replace full assay descriptions with short curator-written labels to test how much text detail the model needs.
- Use molecular fingerprints instead of SMILES strings as the compound input and compare retrieval performance.
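For the fingerprint variation, a simple comparison point is Tanimoto similarity, a standard way to compare two binary fingerprints. The sketch below assumes each fingerprint is stored as a set of "on" bit positions. The bit sets in the example are made up for illustration, and in a real project they would come from a cheminformatics tool such as RDKit.

```python
def tanimoto(bits_a: set[int], bits_b: set[int]) -> float:
    """Shared on-bits divided by total distinct on-bits (0.0 if both are empty)."""
    if not bits_a and not bits_b:
        return 0.0
    shared = len(bits_a & bits_b)
    return shared / (len(bits_a) + len(bits_b) - shared)


if __name__ == "__main__":
    # Hypothetical fingerprints: indices of bits that are set to 1.
    query = {1, 2, 3}
    candidates = {"compound_a": {2, 3, 4}, "compound_b": {7, 8}}
    ranked = sorted(candidates, key=lambda name: tanimoto(query, candidates[name]), reverse=True)
    print(ranked)
```

Ranking compounds this way needs no training, so it can also serve as a structure-only baseline next to the learned model.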
Learn More
- PubChem BioAssay: Search PubChem for assay records, compound tables, and downloadable bulk data.
- RDKit documentation: Learn molecule parsing, fingerprints, and basic cheminformatics tools from the official project docs.
- MIT OpenCourseWare, Introduction to Computational Biology: Find lecture materials on sequence models, data analysis, and biological data workflows.
- NIH PubMed: Search for review articles on contrastive learning in drug discovery and multimodal biomedical models.
- Bioinformatics: Search the journal site and PubMed for review articles on chemical language models and drug repurposing.
- FAISS documentation: Learn fast similarity search for embedding-based retrieval from the official library docs.
