Human vs GPT Bias Reasoning

ISEF Category: Behavioral and Social Sciences

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Subcategory: Cognitive Psychology · Difficulty: Intermediate · Setup: Home Setup · Time: 1 to 2 Months

The Hook

A chatbot can sound smarter than a person, but that does not mean it reasons the same way. On classic bias questions, humans often miss the statistical answer because the first answer feels right. GPT-class models can miss too, but not always in the same pattern. Your project asks where those two kinds of intuition line up, and where they split apart.

What Is It?

This project compares two kinds of answers to the same short problems. In cognitive psychology, System 1 means fast, automatic thinking, and System 2 means slower, careful thinking. The classic items here test cognitive reflection, which is the skill of stopping a fast wrong answer and checking it, and base-rate neglect, which happens when people ignore how common something really is.

Think of it like two players solving the same puzzle. One player grabs the first answer that feels smooth, while the other slows down and checks the numbers. A human may use gut feeling, then revise it. A GPT-class model does not have a human gut, but it still produces answers from learned patterns. That makes it a strong test case for asking whether model errors look like human biases, or whether they come from a different process.
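To see why base rates matter, it can help to work one problem by hand. The sketch below applies Bayes' rule to a made-up screening question. The function name and all three numbers are illustrative assumptions, not values from any published item.

```python
def posterior_probability(base_rate, hit_rate, false_alarm_rate):
    """Return P(condition | positive result) using Bayes' rule."""
    true_positives = base_rate * hit_rate
    false_positives = (1 - base_rate) * false_alarm_rate
    return true_positives / (true_positives + false_positives)

# Illustrative numbers only: 1% of people have the condition, the test
# catches 90% of real cases, and 9% of healthy people also test positive.
p = posterior_probability(base_rate=0.01, hit_rate=0.90, false_alarm_rate=0.09)
print(f"Chance of having the condition after a positive test: {p:.3f}")
```

With these numbers the answer is about 0.092, far below the 0.90 that many people guess. An answer near the hit rate is the kind of base-rate-neglect response your scoring key should flag.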

Why This Is a Good Topic

This is a strong science fair topic because you can test it with clear right and wrong answers, then compare patterns across humans and an LLM. It connects to real problems in medicine, finance, hiring, and everyday decision-making, where people and tools both face misleading cues. You can learn survey design, coding, scoring, and basic statistics without needing a university lab.

Research Questions

How does prompt format change GPT-class accuracy on cognitive-reflection items?
What is the effect of adding step-by-step instructions on base-rate-neglect errors?
Does the model choose the statistically correct answer more often than students on the same items?
To what extent do confidence ratings predict correct answers for humans and for the model?
Which item features, such as wording length or numerical detail, most often trigger divergence between human and model answers?
How does item order affect accuracy across repeated human and model trials?

Basic Materials

Laptop or desktop computer with internet access.
Free survey or form software for collecting human responses.
Spreadsheet software for scoring and organizing data.
A published set of cognitive-reflection and base-rate-neglect items.
Basic consent and debrief sheets for student participants.
Notebook or document for coding answer patterns.
GPT-class chatbot interface for running the comparison prompts.

Advanced Materials

GPT-class API access for repeated model trials under fixed settings.
Python with pandas, scipy, and statsmodels for analysis.
R with tidyverse and lme4 for mixed-effects models.
Qualtrics, REDCap, or a school-approved survey platform for cleaner response capture.
A response-coding sheet for confidence ratings, explanations, and error types.
Optional text-analysis notebook for comparing human and model explanations.

Software & Tools

Google Forms: Collects human responses and exports them for scoring.
Google Sheets: Organizes trial data and helps you spot response patterns.
Python: Scores answers, compares groups, and runs basic statistical tests.
JASP: Runs t tests, chi-square tests, and regression with a simple interface.
RStudio: Fits item-level models when you want to study how specific questions behave.
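As one example of a basic statistical test in Python, the sketch below compares human and model accuracy on a single item with a two-sided Fisher exact test, written with the standard library only. The function name and the counts are hypothetical. If you have scipy installed, scipy.stats.fisher_exact covers the same job.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are groups (humans, model); columns are correct and incorrect counts.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x correct answers in row 1.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    observed = prob(a)
    # Add up every table that is no more likely than the observed one.
    # The small tolerance guards against floating-point ties.
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)
    )

# Hypothetical counts: 12 of 30 humans correct, 21 of 30 model runs correct.
p_value = fisher_exact_two_sided(12, 18, 21, 9)
print(f"p = {p_value:.4f}")
```

This test assumes every trial is independent. Repeated runs of one model are not the same as separate participants, so treat the result as a first look and not a final verdict.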

Experiment Steps

Define the exact item set you will use, and separate cognitive-reflection items from base-rate-neglect items.
Decide how you will present the same items to humans and to the model so the comparison stays fair.
Build a scoring key that labels the statistically correct answer, the intuitive distractor, and any ambiguous response.
Plan controls for wording, item order, and prompt style so you can test one factor at a time.
Choose the statistics and graphs you will use before you collect data, including accuracy rates and item-level disagreement.
Set up a way to compare not just correctness, but also confidence, explanation style, and error type.
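The scoring key from these steps can be sketched as a small lookup table. The example assumes answers have already been cleaned to plain numbers in cents, and it uses the widely cited bat-and-ball item (correct answer 5 cents, intuitive answer 10 cents). The item ID, dictionary layout, and function name are hypothetical, so swap in the items from your own published set.

```python
from collections import Counter

# Hypothetical key: one entry per item, answers stored as strings in cents.
SCORING_KEY = {
    "bat_ball": {"correct": "5", "intuitive": "10"},
}

def classify_response(item_id, response):
    """Label a response as 'correct', 'intuitive', or 'other'."""
    key = SCORING_KEY[item_id]
    cleaned = response.strip().lower()
    if cleaned == key["correct"]:
        return "correct"
    if cleaned == key["intuitive"]:
        return "intuitive"
    # Anything else needs a human look before you count it as an error type.
    return "other"

responses = ["5", "10", " 10 ", "1"]
tally = Counter(classify_response("bat_ball", r) for r in responses)
print(tally)
```

Keeping "intuitive" separate from "other" is the design choice that matters here. It lets you ask whether the model makes the same wrong answer humans make, or a different one.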

Common Pitfalls

Changing the wording of a question between the human survey and the model prompt, which creates a format effect instead of a real comparison.
Mixing items with different answer types, which makes scoring messy and hides where the model actually diverges.
Letting the model see clues from earlier answers, which can inflate later performance and blur item-level results.
Comparing one small human group to many model runs without matching trial counts, which treats repeated runs of one model as if they were independent participants and makes the groups uneven.
Treating a fluent explanation as a correct answer, which overstates how well the model handled the underlying logic.
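Two of these pitfalls, mismatched wording and leaked context, can be reduced by generating every trial from one shared template. The sketch below builds one standalone prompt per item and shuffles item order with a fixed seed so you can reproduce each run. The item IDs, placeholder wording, template text, and function name are all hypothetical, and the code does not call any model. You would send each prompt in a fresh conversation yourself.

```python
import random

# Hypothetical placeholders; paste in the exact wording from your item set.
ITEMS = {
    "crt_1": "Placeholder wording for cognitive-reflection item 1.",
    "crt_2": "Placeholder wording for cognitive-reflection item 2.",
    "brn_1": "Placeholder wording for base-rate item 1.",
}

# One shared template, so humans and the model read identical instructions.
TEMPLATE = "Answer with a single number and nothing else.\n\n{question}"

def build_trials(items, n_runs, seed=0):
    """Return one standalone prompt per item per run, in shuffled order."""
    rng = random.Random(seed)  # fixed seed makes the orders reproducible
    trials = []
    for run in range(1, n_runs + 1):
        order = list(items)
        rng.shuffle(order)
        for position, item_id in enumerate(order, start=1):
            trials.append({
                "run": run,
                "position": position,
                "item_id": item_id,
                "prompt": TEMPLATE.format(question=items[item_id]),
            })
    return trials

trials = build_trials(ITEMS, n_runs=2, seed=42)
for trial in trials:
    print(trial["run"], trial["position"], trial["item_id"])
```

Recording the position of each item also gives you the data you need if you later want to check for order effects.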

What Makes This Competitive

A stronger version of this project would go past simple accuracy counts. You can model item-level differences, test whether prompt changes shift the error pattern, and compare confidence calibration for humans and the model. If you add a cleaner control set, a better statistical test, or a novel way to classify wrong answers, the project becomes much more persuasive. The key is to show not just who was right, but how and why each system missed.
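One common way to summarize confidence calibration is the Brier score, which is the average squared gap between stated confidence and what actually happened. The sketch below assumes confidence is recorded on a 0 to 1 scale and correctness as 0 or 1. The function name and sample data are hypothetical.

```python
def brier_score(confidences, outcomes):
    """Mean squared difference between confidence (0 to 1) and outcome (0 or 1).

    Lower is better. Always answering with 0.5 confidence scores 0.25.
    """
    if len(confidences) != len(outcomes):
        raise ValueError("confidences and outcomes must have the same length")
    if not confidences:
        raise ValueError("need at least one trial")
    squared_gaps = [(c - o) ** 2 for c, o in zip(confidences, outcomes)]
    return sum(squared_gaps) / len(squared_gaps)

# Hypothetical data: four trials for each group on the same items.
human_score = brier_score([0.9, 0.8, 0.6, 0.7], [1, 0, 1, 0])
model_score = brier_score([0.99, 0.95, 0.9, 0.97], [1, 1, 0, 1])
print(f"Human Brier score: {human_score:.3f}")
print(f"Model Brier score: {model_score:.3f}")
```

The Brier score blends calibration with overall accuracy, so it works best next to a plot of confidence against proportion correct and not as a replacement for one.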

Project Variations

Compare GPT-class models with different system prompts, then test whether stricter instructions reduce base-rate neglect.
Use published teen or adult data from classic studies, then compare the model's error profile against the published human group that is closest to your own participants.
Swap in medical or legal decision vignettes, then measure whether the same bias pattern appears in a higher-stakes setting.

Learn More

PubMed: Search review articles and original studies on cognitive reflection, heuristics, and base-rate neglect.
Google Scholar: Find classic papers and newer follow-up studies on human reasoning bias and model behavior.
Journal of Experimental Psychology: General: Read abstracts and articles on reasoning, judgment, and decision making through the journal site or your school library.
Thinking & Reasoning: Browse studies on reasoning errors, intuitive judgment, and problem solving through the journal site or Scholar.
MIT OpenCourseWare: Use psychology and statistics course materials to review experimental design and basic analysis.
NIH PubMed tutorials: Learn how to search for peer-reviewed sources and filter for review articles.


