Plant Mix Prediction for Phytoremediation

ISEF Category: Environmental Engineering

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Subcategory: Bioremediation · Difficulty: Advanced · Setup: Home Setup · Time: Full Year

The Hook

Some plants can pull pollutants out of soil, but not every species works the same way. Think of it like assembling a sports team, where each player handles a different job. Your project asks which plant mix is likely to clean up a given contaminated site most effectively. That turns a messy environmental problem into a prediction problem you can test with public data.

What Is It?

Phytoremediation means using plants to help clean contaminated soil or water. Some plants take up metals. Some hold soil in place. Some help microbes in the ground break down pollutants. Your job is to predict which plant species, or mix of species, should work best for a given site.

A good analogy is a toolbox. A hammer, a screwdriver, and pliers all fix different problems. In the same way, one plant might handle lead well, while another might do better in salty soil or in a site with mixed contamination. Your model tries to match the right plants to the right site conditions.

This project sits at the intersection of ecology, data science, and cleanup planning. You use public soil data, site records, and vegetation survey data to build a model that ranks plant mixes by likely performance. You are not proving the plants clean a site in real life. You are building a prediction system that helps identify promising combinations.

Why This Is a Good Topic

This is a strong science fair topic because you can test a real environmental question with public data and clear metrics. You can compare species, contamination types, and site conditions, then measure how well your model predicts plant mixes. The real-world connection is obvious, since contaminated land needs cheaper, greener cleanup options. You can also learn data cleaning, feature selection, model evaluation, and how to think about messy environmental data.

Research Questions

How does soil contaminant type affect the predicted best plant species mix for phytoremediation?
What is the effect of adding vegetation survey data on model accuracy for Superfund site plant selection?
Does a model trained on one region predict phytoremediation plant mixes better than a model trained on mixed regions?
To what extent do soil pH, land use history, and contaminant concentration improve species-mix predictions?
Which machine-learning model gives the best ranking of plant mixes for sites with heavy-metal contamination?
How does excluding rare plant species change prediction quality and recommendation stability?
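
The last question above only works if you pin down what "rare" means. Here is a minimal sketch, assuming your survey data can be reduced to (site ID, species name) pairs. The function name `drop_rare_species`, the `min_sites` threshold, and the example records are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def drop_rare_species(records, min_sites=2):
    """Keep only records whose species was seen at min_sites or more distinct sites."""
    sites_per_species = defaultdict(set)
    for site_id, species in records:
        sites_per_species[species].add(site_id)
    keep = {sp for sp, sites in sites_per_species.items() if len(sites) >= min_sites}
    return [(site_id, sp) for site_id, sp in records if sp in keep]


# Invented survey records, NOT real data.
records = [
    ("site_A", "Helianthus annuus"),
    ("site_B", "Helianthus annuus"),
    ("site_A", "Pteris vittata"),      # seen at only one site, so it counts as rare here
    ("site_C", "Populus deltoides"),
    ("site_B", "Populus deltoides"),
]
filtered = drop_rare_species(records, min_sites=2)
```

Rerunning your model at several thresholds would show how stable the recommendations stay as rare species drop out.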

Basic Materials

Laptop or desktop computer with internet access.
Spreadsheet software such as Google Sheets or Excel.
Python installed through Anaconda or a similar free distribution.
Jupyter Notebook for data cleaning and analysis.
Public site data from EPA Superfund records and vegetation survey datasets.
Note-taking document for tracking variables, assumptions, and model decisions.
External storage or cloud folder for versioned project files.

Advanced Materials

Access to a university workstation or a powerful personal computer for larger datasets.
Python with pandas, scikit-learn, numpy, matplotlib, and seaborn.
Jupyter Notebook or JupyterLab for reproducible analysis.
GIS software such as QGIS for mapping site features.
ImageJ if you need to process plot or survey images.
Access to PubMed or Web of Science for literature review on phytoremediation species performance.
Optional R or Python packages for spatial analysis and cross-validation.

Software & Tools

Python: Cleans the data, trains machine-learning models, and scores prediction accuracy.
Jupyter Notebook: Keeps code, notes, and results in one place for repeatable analysis.
Google Sheets: Helps you inspect raw site records and track missing values.
QGIS: Lets you explore whether geography or site clustering affects your predictions.
scikit-learn: Provides the main machine-learning models and evaluation tools.

Experiment Steps

Define the prediction target, such as the best single species or the best species mix for each site.
Gather public soil contamination records, vegetation survey data, and any site-level metadata whose relevance you can justify from the literature.
Clean the dataset, align site IDs, and decide how you will handle missing values, rare species, and inconsistent names.
Build a feature set that turns site conditions into usable model inputs, then choose a baseline model before trying more complex ones.
Split the data in a way that tests real generalization, then compare models with the same scoring metric.
Interpret the results to see which site features and plant traits most strongly influence your recommendations.
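
The baseline and splitting steps above can be sketched with scikit-learn. This is a rough illustration under stated assumptions, not a finished pipeline: the data is synthetic, the three feature columns and the two-class label are hypothetical stand-ins for your own cleaned table, and the random forest is just one possible model to compare against the baseline.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT real site records): 40 sites, 5 rows per site.
n_sites, rows_per_site = 40, 5
site_ids = np.repeat(np.arange(n_sites), rows_per_site)

# Hypothetical features, e.g. soil pH, log lead concentration, organic matter.
X = rng.normal(size=(n_sites * rows_per_site, 3))
# Hypothetical label: which of two plant mixes performed better.
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# GroupKFold keeps every row from a site in the same fold,
# so each test fold contains only sites the model never saw.
cv = GroupKFold(n_splits=5)

models = {
    "baseline": DummyClassifier(strategy="most_frequent"),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Score every model with the same folds and the same metric.
scores = {}
for name, model in models.items():
    fold_scores = cross_val_score(
        model, X, y, groups=site_ids, cv=cv, scoring="balanced_accuracy"
    )
    scores[name] = fold_scores.mean()

print(scores)
```

If a complex model cannot beat the most-frequent-class baseline on held-out sites, that is a result worth reporting, not hiding.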

Common Pitfalls

Using plant names that are recorded in different formats across datasets, which splits one species across several categories and distorts both the training labels and the counts.
Mixing site records from different years without checking whether the contamination profile changed, which makes the training data inconsistent.
Predicting a plant mix without first defining what counts as success (for example, a measured drop in contaminant concentration), which leaves the model with no well-defined target to learn.
Ignoring class imbalance, which lets common species dominate the model and hides rare but useful phytoremediation plants.
Testing on random rows instead of held-out sites, which lets rows from the same site land in both training and test sets and makes the model look better than it really is on new locations.
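
The first pitfall is often fixable with a small cleanup function. Below is a minimal sketch, assuming the names are scientific names in roughly "Genus species" form. The function name `normalize_species_name` and the example variants are hypothetical, and real datasets will need extra handling for common names, hybrids, and synonyms (for instance by checking against the USDA PLANTS Database).

```python
import re


def normalize_species_name(raw):
    """Reduce a recorded plant name to a lowercase 'genus species' key."""
    name = raw.strip().lower()
    name = re.sub(r"\(.*?\)", " ", name)   # drop parenthetical notes such as common names
    name = re.sub(r"[^a-z\s]", " ", name)  # replace punctuation, digits, underscores with spaces
    parts = name.split()
    return " ".join(parts[:2])             # keep genus and species, drop authority or variety


# Invented examples of how one species might appear across datasets.
variants = [
    "Helianthus annuus L.",
    "  HELIANTHUS  ANNUUS (sunflower) ",
    "helianthus_annuus",
]
unique_keys = {normalize_species_name(v) for v in variants}
```

After normalizing, it is worth listing the unique keys by eye, since an automatic rule like this can also merge names that should stay separate, such as two varieties of one species.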

What Makes This Competitive

A stronger project goes beyond simple prediction accuracy. You can test whether your model generalizes across regions, contaminant types, or site sizes, then compare multiple model families with the same validation plan. You can also add interpretability, so your results explain why a plant mix ranks well, not just what ranks first. That kind of analysis shows judgment, not just coding.
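
One way to go beyond plain accuracy is to score the whole ranking your model produces, not just its top pick. The sketch below shows two common ranking metrics, top-k hit rate and mean reciprocal rank, in plain Python. It assumes each site has one known best mix, which is a simplification. The mix labels and rankings are invented for illustration.

```python
def top_k_hit_rate(ranked_lists, true_best, k=3):
    """Fraction of sites whose true best mix appears in the model's top k."""
    hits = sum(1 for ranked, best in zip(ranked_lists, true_best) if best in ranked[:k])
    return hits / len(true_best)


def mean_reciprocal_rank(ranked_lists, true_best):
    """Average of 1/rank of the true best mix (counts 0 when it is missing)."""
    total = 0.0
    for ranked, best in zip(ranked_lists, true_best):
        if best in ranked:
            total += 1.0 / (ranked.index(best) + 1)
    return total / len(true_best)


# Invented model rankings for three held-out sites, best guess first.
ranked_lists = [
    ["mix_A", "mix_B", "mix_C"],
    ["mix_B", "mix_C", "mix_A"],
    ["mix_C", "mix_A", "mix_B"],
]
true_best = ["mix_A", "mix_C", "mix_B"]

hit_at_1 = top_k_hit_rate(ranked_lists, true_best, k=1)
hit_at_2 = top_k_hit_rate(ranked_lists, true_best, k=2)
mrr = mean_reciprocal_rank(ranked_lists, true_best)
```

In this toy example the true best mix is ranked first, second, and third across the three sites, so the model is right at the top only once but is never far off. A single accuracy number would hide that difference.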

Project Variations

Focus on heavy-metal sites only, then compare how well the model predicts mixes for lead, arsenic, and cadmium contamination.
Predict the best native plant mix instead of any plant mix, then test whether native-only recommendations lose accuracy.
Add soil chemistry variables such as pH and organic matter, then measure whether they improve ranking performance over contamination data alone.
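
The soil chemistry variation is a feature ablation: train the same model on two feature sets and compare scores under the same validation plan. Here is a minimal sketch with scikit-learn. Everything in it is an assumption for illustration: the data is synthetic, the column names are hypothetical, and the label is deliberately built to depend on pH, so the improvement shown is baked in by construction and says nothing about real sites.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(1)

# Synthetic stand-in data (NOT real site records): 30 sites, 4 rows per site.
n_sites, rows_per_site = 30, 4
groups = np.repeat(np.arange(n_sites), rows_per_site)
n_rows = n_sites * rows_per_site

lead = rng.normal(size=n_rows)      # hypothetical scaled lead concentration
ph = rng.normal(size=n_rows)        # hypothetical scaled soil pH
organic = rng.normal(size=n_rows)   # hypothetical scaled organic matter

# Hypothetical label, constructed to depend mostly on pH.
y = (ph + 0.3 * lead > 0).astype(int)

feature_sets = {
    "contamination_only": np.column_stack([lead]),
    "with_soil_chemistry": np.column_stack([lead, ph, organic]),
}

# Same model, same site-level folds, same default metric (accuracy) for both sets.
cv = GroupKFold(n_splits=5)
ablation = {}
for name, X in feature_sets.items():
    fold_scores = cross_val_score(LogisticRegression(), X, y, groups=groups, cv=cv)
    ablation[name] = fold_scores.mean()

print(ablation)
```

With real data the gap could be small, zero, or even negative, and reporting that honestly is part of the experiment.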

Learn More

EPA Superfund site data: Search the EPA Superfund program pages and site profiles for contamination and cleanup records.
USGS Water and Soil Data: Search USGS databases for soil chemistry, land use, and environmental sampling resources.
USDA PLANTS Database: Look up plant traits, native ranges, and species names for vegetation matching.
NOAA Climate Data: Search NOAA climate normals and regional summaries to add environmental context to site conditions.
PubMed: Search for review articles on phytoremediation, plant uptake, and species performance in contaminated soils.
MIT OpenCourseWare: Search for free courses in machine learning, data analysis, and environmental engineering foundations.

