Predicting Soybean Yield With Weather and Satellite Data

ISEF Category: Plant Sciences

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Subcategory: Agriculture and Agronomy  ·  Difficulty: Advanced  ·  Setup: Home Setup  ·  Time: Full Year

The Hook

Soybean yields can swing hard from one county to the next, even in the same state. That makes them a perfect target for machine learning. You can use satellite images, weather data, and public crop records to predict harvest outcomes. Then you can ask which weeks of the season matter most.

What Is It?

This project builds a model that tries to predict soybean yield from data you can get online. Think of it like teaching a computer to read the crop season the way a coach reads game film. Satellite data can show how green the crop looks, weather data can show heat and rain stress, and harvest records give the final score.

NDVI stands for normalized difference vegetation index. It is a number between -1 and 1 that estimates how green and active plants are from satellite images. High NDVI often means healthy crop growth, while lower values can signal stress, though not always. USDA NASS (the National Agricultural Statistics Service) publishes official crop yield data, and NOAA provides weather records like rainfall and temperature. When you combine them, you can test which time windows in the growing season best predict yield.
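To make the index concrete: NDVI is computed from near-infrared (NIR) and red reflectance as (NIR - Red) / (NIR + Red). For Sentinel-2, those are usually bands B8 and B4. Here is a minimal sketch in Python. The reflectance values are made up for illustration, and the function name is just a convenient choice, not part of any library.

```python
import numpy as np

def ndvi(nir, red):
    """Return NDVI = (NIR - Red) / (NIR + Red), with NaN where NIR + Red is zero."""
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    total = nir + red
    out = np.full(total.shape, np.nan)
    # Divide only where the denominator is nonzero; other cells stay NaN
    np.divide(nir - red, total, out=out, where=total != 0)
    return out

# Hypothetical reflectance values (not real measurements)
nir = np.array([0.45, 0.30, 0.0])
red = np.array([0.05, 0.20, 0.0])
values = ndvi(nir, red)
print(values)
```

The first pixel (strong NIR, weak red) comes out near 0.8, which is what dense green vegetation tends to look like. The last pixel has no signal at all, so it is left as missing rather than forced to zero.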

The feature-importance part matters too. It tells you which inputs the model leans on most. Maybe early-season rain matters more than late-season heat. Maybe a certain stretch of summer NDVI is the strongest signal. That gives your project a real research question, not just a prediction score.
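One way to measure what the model "leans on" is permutation importance: shuffle one feature at a time on held-out data and see how much the score drops. The sketch below uses scikit-learn with synthetic data, so the column names and the yield formula are assumptions made up for the demo. By construction, the fake yield depends mostly on July NDVI, so that feature should rank first.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 300

# Synthetic stand-in for a county-year feature table (not real data)
X = pd.DataFrame({
    "june_rain_mm": rng.uniform(50, 150, n),
    "july_ndvi": rng.uniform(0.3, 0.9, n),
    "august_tmax_c": rng.uniform(26, 36, n),
})
# Made-up relationship: yield driven mainly by July NDVI, plus noise
y = (30 + 40 * X["july_ndvi"] + 0.02 * X["june_rain_mm"]
     - 0.2 * X["august_tmax_c"] + rng.normal(0, 1, n))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Shuffle each column on the held-out set and record the mean drop in R^2
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)
ranked = sorted(
    zip(X.columns, result.importances_mean),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

With real data the ranking will be noisier, and correlated features can share or trade importance, so treat the ranking as a clue to investigate rather than a final answer.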

Why This Is a Good Topic

This is a strong science fair topic because it has a clear output, a public data source, and lots of room for original analysis. You are not inventing the data; you are deciding how to join it, clean it, and test it. That means you can focus on real research skills like feature engineering, model comparison, and error analysis. It also connects to a real problem: farmers and agronomists want better yield forecasts under climate stress.

Research Questions

  • How does adding Sentinel-2 NDVI to weather data change soybean yield prediction accuracy?
  • What is the effect of using different phenological windows on county-level yield prediction?
  • Does a random forest model predict soybean yield better than linear regression for the same data set?
  • To what extent do NOAA temperature and precipitation variables improve predictions when NDVI is already included?
  • Which growth period contributes most to model performance in years with drought stress?
  • What is the effect of training the model on one region and testing it on another region?
  • Which combination of satellite, weather, and USDA NASS features gives the lowest prediction error?
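Several of the questions above come down to the same experiment: hold the data fixed, then swap the feature set or the algorithm and compare error. Here is a rough sketch of that comparison grid using scikit-learn. The table is synthetic, and the column names and yield formula are assumptions invented for the demo. It also uses plain shuffled K-fold only to keep the example short; for a real project you would want splits grouped by year or region.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(42)
n = 400

# Synthetic county-year table (not real data)
data = pd.DataFrame({
    "july_rain_mm": rng.uniform(20, 180, n),
    "august_tmax_c": rng.uniform(26, 36, n),
    "july_ndvi": rng.uniform(0.3, 0.9, n),
})
data["yield_bu_ac"] = (
    20 + 35 * data["july_ndvi"] + 0.05 * data["july_rain_mm"]
    - 0.8 * (data["august_tmax_c"] - 30) + rng.normal(0, 2, n)
)

feature_sets = {
    "weather_only": ["july_rain_mm", "august_tmax_c"],
    "weather_plus_ndvi": ["july_rain_mm", "august_tmax_c", "july_ndvi"],
}
models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

cv = KFold(n_splits=5, shuffle=True, random_state=0)
results = {}
for fs_name, cols in feature_sets.items():
    for model_name, model in models.items():
        scores = cross_val_score(
            model, data[cols], data["yield_bu_ac"],
            cv=cv, scoring="neg_root_mean_squared_error",
        )
        # scikit-learn returns negative RMSE, so flip the sign
        results[(fs_name, model_name)] = -scores.mean()

for key, rmse in sorted(results.items(), key=lambda item: item[1]):
    print(key, round(rmse, 2))
```

Because the fake yield was built to depend on NDVI, adding NDVI lowers the error here. Whether that holds for real counties is exactly what your project would test.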

Basic Materials

  • Laptop or desktop computer with enough memory to handle tabular data and satellite-derived features.
  • Internet access for downloading public data sets.
  • Spreadsheet software for organizing county, year, and feature tables.
  • Python installed with pandas, scikit-learn, matplotlib, and seaborn.
  • NOAA weather data access from a public archive or API.
  • USDA NASS county yield data.
  • Sentinel-2 vegetation index data from a public platform or processed data set.
  • External storage or cloud drive for large files.
  • Notebook for tracking variable definitions, data sources, and model versions.

Advanced Materials

  • Access to a higher-memory workstation or university server for larger feature tables.
  • Google Earth Engine or equivalent remote-sensing platform for deriving NDVI time series.
  • Python with xarray, geopandas, rasterio, scikit-learn, and shap.
  • Gridded weather products from NOAA or reanalysis data for spatial aggregation.
  • County boundary shapefiles for geographic joins.
  • Version control system for code and analysis history.
  • Access to a statistics package for model diagnostics and cross-validation checks.
  • Published crop calendar or phenology references for soybean growth-stage timing.

Software & Tools

  • Python: Runs data cleaning, feature engineering, model training, and evaluation.
  • Google Earth Engine: Extracts Sentinel-2 vegetation indices across counties and dates.
  • QGIS: Checks county boundaries, maps data coverage, and inspects spatial joins.
  • ImageJ: Not needed for this topic, but useful if you compare satellite-derived image outputs by hand.
  • SHAP (the shap Python package): Explains which weather and NDVI features drive model predictions.

Experiment Steps

  1. Define the prediction target, the geographic unit, and the yield years you will include.
  2. Decide how you will align satellite, weather, and USDA NASS data to the same county-year records.
  3. Select a small set of candidate feature windows across the soybean growing season, then justify why those windows matter.
  4. Build a baseline model first, then compare it with models that add NDVI and weather features.
  5. Plan cross-validation that tests generalization across years, counties, or regions instead of only memorizing the training set.
  6. Choose an interpretation method that turns feature importance into a seasonal story, not just a score.
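Steps 2 and 5 are where many projects quietly go wrong, so here is one way they might look in code. The sketch joins three tables on county and year with pandas, then uses scikit-learn's GroupKFold so that no year appears in both training and test folds. Everything in the tables is synthetic: the FIPS-style county codes, the column names, and the yield formula are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(1)

# Example county codes kept as strings so leading zeros are never lost
counties = ["19001", "19003", "19005", "19007"]
years = range(2017, 2023)
keys = pd.MultiIndex.from_product(
    [counties, years], names=["county_fips", "year"]
).to_frame(index=False)
n = len(keys)

# Three synthetic source tables that share the same keys (not real data)
rain = rng.uniform(300, 700, n)
peak = rng.uniform(0.5, 0.9, n)
weather = keys.assign(season_rain_mm=rain)
ndvi = keys.assign(peak_ndvi=peak)
yields = keys.assign(
    yield_bu_ac=15 + 50 * peak + 0.01 * rain + rng.normal(0, 2, n)
)

# validate="one_to_one" raises an error if a county-year is duplicated
join_keys = ["county_fips", "year"]
table = (
    yields
    .merge(weather, on=join_keys, how="inner", validate="one_to_one")
    .merge(ndvi, on=join_keys, how="inner", validate="one_to_one")
)
print("rows kept after join:", len(table), "of", len(yields))

X = table[["season_rain_mm", "peak_ndvi"]]
y = table["yield_bu_ac"]
groups = table["year"]

# Each fold holds out whole years, so the model is scored on unseen seasons
cv = GroupKFold(n_splits=3)
scores = cross_val_score(
    LinearRegression(), X, y, groups=groups, cv=cv,
    scoring="neg_root_mean_squared_error",
)
rmse_by_fold = -scores
print("RMSE by fold:", rmse_by_fold.round(2))
```

Grouping by year is just one choice. Passing county or region codes as `groups` instead would test a different kind of generalization.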

Common Pitfalls

  • Mixing county-level yield data with mismatched satellite dates, which creates fake patterns from bad alignment.
  • Using raw NDVI values without checking cloud cover or missing scenes, which can push the model toward noisy inputs.
  • Training and testing on the same counties and years, which makes the accuracy look better than real-world performance.
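  • Ignoring scale and unit differences between inputs (millimeters of rain, degrees of temperature, unitless NDVI), which distorts coefficient comparisons in linear models and can hide whether the model depends on heat, rain, or vegetation signals.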
  • Treating feature importance as proof of causation, which can lead to overclaiming what the model actually learned.
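The first two pitfalls can be partly guarded against in code. The sketch below takes a handful of NDVI observations for one county, drops the cloudy ones, and averages what is left inside fixed calendar windows. If a window has too few clear observations, it is marked as missing instead of trusted. The observations, the `cloud_pct` column, the thresholds, and the window dates are all hypothetical choices for the demo.

```python
import numpy as np
import pandas as pd

# Hypothetical NDVI observations for one county in one season (not real data)
obs = pd.DataFrame({
    "date": pd.to_datetime([
        "2021-06-05", "2021-06-15", "2021-06-25",
        "2021-07-05", "2021-07-15", "2021-07-25",
        "2021-08-04", "2021-08-14",
    ]),
    "ndvi": [0.42, 0.15, 0.55, 0.70, 0.78, 0.30, 0.82, 0.80],
    "cloud_pct": [5, 80, 10, 0, 15, 90, 60, 70],
})

MAX_CLOUD_PCT = 20   # assumed cutoff for a usable observation
MIN_CLEAR_OBS = 2    # assumed minimum clear observations per window

# Assumed calendar windows; a real project would tie these to growth stages
windows = {
    "june": (pd.Timestamp("2021-06-01"), pd.Timestamp("2021-06-30")),
    "july": (pd.Timestamp("2021-07-01"), pd.Timestamp("2021-07-31")),
    "august": (pd.Timestamp("2021-08-01"), pd.Timestamp("2021-08-31")),
}

clear = obs[obs["cloud_pct"] <= MAX_CLOUD_PCT]

features = {}
for name, (start, end) in windows.items():
    in_window = clear[clear["date"].between(start, end)]
    if len(in_window) >= MIN_CLEAR_OBS:
        features[f"ndvi_{name}"] = in_window["ndvi"].mean()
    else:
        # Too few clear looks at the crop: flag as missing, do not guess
        features[f"ndvi_{name}"] = np.nan

print(features)
```

In this toy example, June and July each keep two clear observations, while August has none and comes out as missing. Counting how many county-years end up missing in each window is a useful data-quality number to report.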

What Makes This Competitive

A strong version of this project goes beyond one prediction model. You can compare several algorithms, test them across different regions, and check whether the same feature windows matter in drought years versus normal years. The best entries also explain error, not just accuracy. If you can tie model behavior to soybean growth stages and climate stress, your project starts to look like real agricultural data science.

Project Variations

  • Swap soybean for corn or wheat to see whether the same weather windows matter across crops.
  • Use only NOAA weather data first, then add Sentinel-2 NDVI to measure how much remote sensing improves the model.
  • Train the model on one Midwest region and test it on another to study geographic transferability.
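The transferability variation can be sketched as a simple two-number comparison: error on held-out data from the training region versus error on a different region. In the toy example below, the two "regions" are synthetic and are deliberately given different NDVI-to-yield relationships, so the transfer error is larger by design. The function name and all numbers are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)

def make_region(n, intercept, slope):
    """Build a synthetic region where yield is a noisy linear function of NDVI."""
    ndvi = rng.uniform(0.3, 0.9, n)
    yield_bu_ac = intercept + slope * ndvi + rng.normal(0, 1.5, n)
    return ndvi.reshape(-1, 1), yield_bu_ac

# Two made-up regions with different underlying relationships (not real data)
X_a, y_a = make_region(200, intercept=20, slope=40)
X_b, y_b = make_region(200, intercept=28, slope=30)

# Train only on region A, keeping part of A aside for a fair in-region score
X_a_train, X_a_test, y_a_train, y_a_test = train_test_split(
    X_a, y_a, test_size=0.3, random_state=0
)
model = LinearRegression().fit(X_a_train, y_a_train)

def rmse(y_true, y_pred):
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

rmse_in_region = rmse(y_a_test, model.predict(X_a_test))
rmse_transfer = rmse(y_b, model.predict(X_b))

print(f"RMSE on held-out region A: {rmse_in_region:.2f}")
print(f"RMSE on region B:          {rmse_transfer:.2f}")
```

The gap between the two numbers is the interesting result. With real data, a small gap suggests the model learned something general, while a large gap points to regional differences worth explaining.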

Learn More

  • USDA NASS Quick Stats: Find county-level crop yield data and production records in the USDA database.
  • NOAA National Centers for Environmental Information: Find weather and climate data, including temperature and precipitation records.
  • NASA Earthdata: Explore satellite data access tools and remote sensing background through NASA's data portal.
  • Sentinel-2 User Guide from ESA: Learn what the Sentinel-2 bands mean and how vegetation indices are built.
  • MIT OpenCourseWare, Introduction to Machine Learning: Review model types, validation, and error metrics before you code.
  • PubMed: Search review articles on crop yield prediction, remote sensing, and climate stress in agriculture.
