Graph Neural Nets for Dye Absorption Prediction

ISEF Category: Chemistry

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Subcategory: Computational Chemistry · Difficulty: Advanced · Setup: University Lab · Time: Full Year

The Hook

A dye that looks great in one setup can fail in another because its light absorption shifts with molecular structure. That makes dye design a bit like tuning a radio: you need the right molecular signal at the right wavelength. Your project can predict that signal before anyone makes the molecule, which saves time, money, and lab effort.

What Is It?

This project studies how a dye molecule absorbs light, then predicts that absorption with machine learning. The absorption maximum, often written as λmax (lambda max), is the wavelength at which the dye absorbs the most light. For indoor-light photovoltaics, that matters because the dye's absorption needs to overlap the emission spectrum of the light source.

A graph neural network treats a molecule like a connected map. Atoms are the points, bonds are the links, and the model learns patterns from the whole structure. You can think of it like teaching a system to read a molecule the way a person reads a subway map: not by memorizing one station at a time, but by understanding how the whole route fits together.
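To make the "atoms are points, bonds are links" idea concrete, here is a minimal sketch in plain Python. It is not how RDKit or PyTorch Geometric store molecules; the hand-written butadiene graph, the scalar starting feature, and the function names `build_adjacency` and `message_pass` are all assumptions made for illustration. The second function shows one heavily simplified round of neighbor aggregation, the basic move that real graph neural network layers build on with learned weights.

```python
from collections import defaultdict

# Hypothetical hand-written example: heavy atoms of 1,3-butadiene (C=C-C=C).
atoms = ["C", "C", "C", "C"]                      # node labels, indexed 0..3
bonds = [(0, 1, 2.0), (1, 2, 1.0), (2, 3, 2.0)]   # (atom_i, atom_j, bond order)


def build_adjacency(bonds):
    """Return {atom index: [(neighbor index, bond order), ...]} for an undirected graph."""
    adjacency = defaultdict(list)
    for i, j, order in bonds:
        adjacency[i].append((j, order))
        adjacency[j].append((i, order))
    return dict(adjacency)


def message_pass(features, adjacency):
    """One simplified round: each atom adds the bond-order-weighted sum
    of its neighbors' features to its own feature (no learned weights)."""
    updated = []
    for idx, own in enumerate(features):
        neighbor_sum = sum(order * features[j] for j, order in adjacency.get(idx, []))
        updated.append(own + neighbor_sum)
    return updated


adjacency = build_adjacency(bonds)
features = [1.0 for _ in atoms]   # assumption: every carbon starts with the same feature
after_one_round = message_pass(features, adjacency)
print(after_one_round)            # [3.0, 4.0, 4.0, 3.0]
```

After one round the two inner carbons already differ from the two terminal ones, even though all four started identical. Stacking more rounds lets information travel farther along the conjugated chain, which is why this kind of model can pick up on structure beyond a single atom.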

Your project can compare the model’s predictions with TDDFT, or time-dependent density functional theory, a quantum chemistry method used to estimate excited-state properties. That gives you two ways to estimate absorption, one based on learned patterns and one based on physics.
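One practical detail when lining up the two estimates: quantum chemistry codes usually report excitation energies in electronvolts, while absorption maxima are usually tabulated in nanometers. The sketch below converts between the two using the relation wavelength (nm) = hc / energy (eV), with hc ≈ 1239.84 eV·nm. The function names and the example numbers are hypothetical, not values from any dataset.

```python
HC_EV_NM = 1239.841984  # Planck constant times speed of light, in eV*nm


def ev_to_nm(energy_ev):
    """Convert a photon or excitation energy in eV to a wavelength in nm."""
    if energy_ev <= 0:
        raise ValueError("energy must be positive")
    return HC_EV_NM / energy_ev


def nm_to_ev(wavelength_nm):
    """Convert a wavelength in nm to a photon energy in eV."""
    if wavelength_nm <= 0:
        raise ValueError("wavelength must be positive")
    return HC_EV_NM / wavelength_nm


# Hypothetical numbers, for illustration only.
tddft_energy_ev = 2.50                # assumed TDDFT excitation energy
measured_lambda_max_nm = 520.0        # assumed measured absorption maximum

tddft_lambda_nm = ev_to_nm(tddft_energy_ev)
error_nm = tddft_lambda_nm - measured_lambda_max_nm
error_ev = tddft_energy_ev - nm_to_ev(measured_lambda_max_nm)

print(f"TDDFT lambda max: {tddft_lambda_nm:.1f} nm")
print(f"Error: {error_nm:+.1f} nm ({error_ev:+.3f} eV)")
```

Because the conversion is nonlinear, a fixed error in eV corresponds to a larger error in nm at longer wavelengths. It is worth deciding early which unit you will report errors in and keeping it consistent across the model and the TDDFT benchmark.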

Why This Is a Good Topic

This is a strong science fair topic because you can test it with real data, clear error metrics, and meaningful structure-property links. The real-world problem is dye design for indoor-light photovoltaics, where better absorption can improve device performance. You can learn data cleaning, molecular representation, model training, validation, and comparison against a physics-based method.

Research Questions

How does the size of the training set affect graph-neural-net error in predicting dye absorption maxima?
What is the effect of molecular scaffold class on prediction error for absorption maxima?
Does adding computed descriptors improve graph-neural-net performance over structure alone?
To what extent does the model agree with TDDFT across dyes with different conjugation lengths?
Which molecular features most strongly influence predicted absorption maxima in indoor-light dye candidates?
What is the effect of dataset curation rules on model bias toward certain dye families?

Basic Materials

Laptop or desktop computer with internet access.
Google Colab account.
Spreadsheet software for tracking compounds and labels.
Curated dye dataset from the Computational Materials Repository or a similar public source.
PubChem for checking structures and identifiers.
Python with free notebook access in Colab.
Open-source molecular toolkit such as RDKit.
Basic graphing tool for error plots and parity plots.

Advanced Materials

Workstation or cloud compute access with a GPU.
Larger curated dye dataset with train, validation, and test splits.
TDDFT output files or calculated spectra for external validation.
Molecular descriptor generation pipeline.
RDKit for structure handling and featurization.
PyTorch or TensorFlow for model training.
Open Babel for format conversion.
Version control system for tracking code changes.

Software & Tools

Google Colab: Runs notebooks in the browser and lets you train models without local setup.
Python: Handles data cleaning, model training, and error analysis.
RDKit: Converts molecular structures into graph and descriptor features.
PyTorch Geometric: Builds graph neural network models for molecular prediction.
ImageJ: Not needed for this project. There is no image analysis step, so focus on numerical model outputs.

Experiment Steps

Define the exact prediction task, the dye class, and the absorption label you will model.
Curate a clean dataset and decide which molecules to exclude before you train anything.
Choose one molecular representation, then compare it with a second representation as a baseline.
Split the data in a way that prevents similar molecules from leaking into both training and test sets.
Train the graph neural network and compare its predictions against a physics-based TDDFT benchmark.
Evaluate error patterns by scaffold, conjugation, and dye family, then decide what the model still misses.
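The leakage-safe split in the steps above can be sketched as a group-aware split: every molecule that shares a scaffold label lands on the same side. This is a minimal illustration, assuming each record already carries a scaffold label (for example, one computed beforehand with a toolkit such as RDKit). The record fields, scaffold names, wavelengths, and the `group_split` helper are all hypothetical.

```python
import random
from collections import defaultdict


def group_split(records, group_key, test_fraction=0.2, seed=0):
    """Split records so that no group appears in both train and test.

    Whole groups are assigned to the test set, in shuffled order,
    until it holds at least test_fraction of the records.
    """
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record)

    group_names = sorted(groups)               # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(group_names)

    target = test_fraction * len(records)
    train, test = [], []
    for name in group_names:
        if len(test) < target:
            test.extend(groups[name])
        else:
            train.extend(groups[name])
    return train, test


# Hypothetical records: scaffold labels and wavelengths are made up.
records = [
    {"id": "dye01", "scaffold": "A", "lambda_max_nm": 412.0},
    {"id": "dye02", "scaffold": "A", "lambda_max_nm": 418.0},
    {"id": "dye03", "scaffold": "A", "lambda_max_nm": 425.0},
    {"id": "dye04", "scaffold": "B", "lambda_max_nm": 510.0},
    {"id": "dye05", "scaffold": "B", "lambda_max_nm": 498.0},
    {"id": "dye06", "scaffold": "C", "lambda_max_nm": 455.0},
    {"id": "dye07", "scaffold": "C", "lambda_max_nm": 461.0},
    {"id": "dye08", "scaffold": "D", "lambda_max_nm": 540.0},
    {"id": "dye09", "scaffold": "D", "lambda_max_nm": 533.0},
    {"id": "dye10", "scaffold": "D", "lambda_max_nm": 547.0},
]

train, test = group_split(records, "scaffold", test_fraction=0.2, seed=0)
train_scaffolds = {r["scaffold"] for r in train}
test_scaffolds = {r["scaffold"] for r in test}
print("train scaffolds:", sorted(train_scaffolds))
print("test scaffolds:", sorted(test_scaffolds))
```

Because whole groups move together, the test set can end up somewhat larger than the requested fraction. Comparing results from this split with a plain random split is one way to show how much of a model's apparent accuracy comes from seeing near-identical molecules during training.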

Common Pitfalls

Mixing multiple absorption datasets with different measurement conditions, which makes the target label inconsistent.
Letting near-duplicate molecules appear in both training and test sets, which inflates performance.
Comparing the model to TDDFT values from a different solvent or method, which turns the benchmark into noise.
Using only average error and ignoring which dye families the model handles poorly.
Treating the molecule as a plain list of atoms and bonds without checking whether the graph input preserves the chemistry you need, such as aromaticity, formal charge, and conjugation.
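To avoid the average-error pitfall in the list above, it helps to break the error down by dye family before drawing conclusions. The sketch below computes mean absolute error (MAE) and root-mean-square error (RMSE) per family in plain Python. The family names and wavelengths are invented for illustration, and `error_by_group` is a hypothetical helper, not part of any library.

```python
from collections import defaultdict
from math import sqrt


def error_by_group(rows):
    """rows: iterable of (family, true_nm, predicted_nm).

    Returns {family: {"n": count, "mae": ..., "rmse": ...}} in nm.
    """
    errors = defaultdict(list)
    for family, true_nm, predicted_nm in rows:
        errors[family].append(predicted_nm - true_nm)

    summary = {}
    for family, errs in errors.items():
        n = len(errs)
        summary[family] = {
            "n": n,
            "mae": sum(abs(e) for e in errs) / n,
            "rmse": sqrt(sum(e * e for e in errs) / n),
        }
    return summary


# Hypothetical predictions: (family, measured lambda max, predicted lambda max).
rows = [
    ("coumarin", 400.0, 405.0),
    ("coumarin", 420.0, 415.0),
    ("porphyrin", 430.0, 460.0),
    ("porphyrin", 440.0, 420.0),
]

summary = error_by_group(rows)
overall_mae = sum(abs(p - t) for _, t, p in rows) / len(rows)

print(f"overall MAE: {overall_mae:.1f} nm")
for family, stats in sorted(summary.items()):
    print(f"{family}: n={stats['n']}, MAE={stats['mae']:.1f} nm, RMSE={stats['rmse']:.1f} nm")
```

In this made-up example the overall MAE of 15 nm hides a fivefold gap between the two families (5 nm versus 25 nm). That gap is the kind of pattern a single summary number cannot show.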

What Makes This Competitive

A stronger project does more than report one model score. You can compare several splits, test whether the model fails on new scaffold families, and measure whether physics-based descriptors help or hurt. You can also ask a deeper question, like whether the model learns real structure-property trends or only memorizes common dye patterns. Careful error analysis and a clean benchmark make the project much stronger than a simple training run.

Project Variations

Use visible-light dyes from solar cell literature instead of indoor-light candidates, then compare whether the same model transfers.
Predict absorption maxima with a simpler molecular fingerprint model first, then test whether the graph neural net really adds value.
Group dyes by scaffold family and study which families produce the largest prediction errors.

Learn More

PubChem: Search compound records, structures, and linked property data for dye candidates.
NCBI PubMed: Search review articles on dye-sensitized solar cells, molecular absorption, and machine learning in chemistry.
MIT OpenCourseWare: Look for free materials in quantum chemistry, computational chemistry, and machine learning methods.
NOAA/NASA data portals: Use these if you want background on the solar spectrum as a point of comparison for indoor-light conditions.
Journal of Chemical Information and Modeling: Search articles on molecular machine learning, graph networks, and property prediction.

Chemistry Category Guide

How to Do Real Chemistry Research at Home: A High School Student’s Guide to Free Tools, Affordable Kits, and Public Databases →
