Peer Grading Systems for LLM Review Detection

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Subcategory: Online Learning · Difficulty: Advanced · Setup: University Lab · Time: Full Year

The Hook

One fake review can tilt a whole grade pool. That is a big problem when hundreds of students grade each other online, and some reviews may come from AI instead of real effort. Your project asks a sharp question: can a system spot suspicious peer feedback and stop it from skewing the final score?

What Is It?

This project studies a peer-grading system that does not trust every review equally. Instead, it looks for clues that a review may be low quality, copied, or AI-generated. Then it gives that review less weight when it averages scores. Think of it like a teacher who notices one student always gives vague feedback and quietly counts that input less.
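The down-weighting idea can be sketched as a weighted average. This is a minimal illustration, not a prescribed method: the function name `weighted_grade`, the 0-to-1 anomaly scale, and the `floor` parameter are all assumptions made for the example.

```python
def weighted_grade(scores, anomaly_scores, floor=0.1):
    """Average peer scores, giving suspicious reviews less weight.

    scores: rubric scores, one per reviewer.
    anomaly_scores: values in [0, 1], where 1 means most suspicious
        (an assumed scale, not one defined by any dataset).
    floor: smallest weight any review can receive (hypothetical knob).
    """
    if len(scores) != len(anomaly_scores):
        raise ValueError("scores and anomaly_scores must have the same length")
    if not scores:
        raise ValueError("need at least one review")
    # A review with anomaly score a gets weight 1 - a, but never below floor.
    weights = [max(floor, 1.0 - a) for a in anomaly_scores]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


# Three reviews; the third looks suspicious and gave an inflated score.
scores = [8.0, 7.0, 10.0]
anomaly = [0.0, 0.0, 0.9]
plain = sum(scores) / len(scores)
adjusted = weighted_grade(scores, anomaly)
print(round(plain, 2), round(adjusted, 2))
```

Keeping a nonzero floor means a flagged review is counted less rather than thrown out, which limits the damage when the detector is wrong.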

The system checks two main things. Style tells you how the review sounds, such as sentence length, word variety, and repeated phrases. Semantics tells you what the review means, such as whether it matches the assignment and uses enough specific detail. When both signals look off, the system treats the review as suspicious and adjusts the reputation score of that reviewer. The goal is to make peer grading fairer, even when some feedback is noisy or artificial.
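The two signals can be sketched with the standard library alone. The feature names below are illustrative assumptions, and the word-overlap (Jaccard) score is only a crude stand-in for semantic similarity; a real project would more likely use embeddings or a library such as spaCy.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def style_features(review):
    """Return rough style signals: sentence length, word variety, repetition."""
    sentences = [s for s in re.split(r"[.!?]+", review) if s.strip()]
    tokens = tokenize(review)
    bigrams = list(zip(tokens, tokens[1:]))
    # Count every extra occurrence of a word pair beyond its first appearance.
    repeated = sum(c - 1 for c in Counter(bigrams).values() if c > 1)
    return {
        "avg_sentence_len": len(tokens) / len(sentences) if sentences else 0.0,
        "type_token_ratio": len(set(tokens)) / len(tokens) if tokens else 0.0,
        "repeated_bigram_rate": repeated / len(bigrams) if bigrams else 0.0,
    }


def semantic_overlap(review, assignment_text):
    """Jaccard word overlap between a review and the assignment text."""
    a, b = set(tokenize(review)), set(tokenize(assignment_text))
    return len(a & b) / len(a | b) if a | b else 0.0


print(style_features("Good job. Good job."))
print(semantic_overlap("the essay argues clearly", "essay about climate"))
```

A review with low word variety, heavy repetition, and little overlap with the assignment would be a candidate for a high anomaly score, though the thresholds have to come from your own labeled data.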

Why This Is a Good Topic

This is a strong science fair topic because you can test it with real text data, clear metrics, and measurable results. You can compare normal peer grading against an anomaly-aware system and see whether it improves score accuracy. The project connects to online education, academic integrity, and automated text analysis, all of which matter in schools and MOOCs. You can learn data cleaning, feature extraction, model evaluation, and basic machine learning without needing a wet lab.

Research Questions

How does anomaly-based reweighting affect the accuracy of peer-grade aggregation?
What is the effect of adding stylistic features on detecting low-quality peer reviews?
Does combining semantic similarity and style features improve review anomaly detection?
To what extent do false positives change the fairness of the final grade distribution?
Which reviewer-history features best predict unreliable peer feedback?
How does the system perform when trained on one MOOC dataset and tested on another?

Basic Materials

Laptop or desktop computer with at least 8 GB RAM.
Publicly released peer-review dataset from a MOOC or academic writing benchmark.
Python installed locally or in Google Colab.
Jupyter Notebook for analysis and documentation.
Text editor such as VS Code.
Spreadsheet software for tracking labels, scores, and metrics.
GitHub account for version control and project backup.

Advanced Materials

University workstation or high-performance laptop with more memory for larger text models.
Access to a GPU through a university cluster or cloud research account.
Labeled MOOC peer-review dataset with grader IDs, rubric scores, and submission text.
Natural language processing libraries such as spaCy, scikit-learn, PyTorch, or Hugging Face Transformers.
Annotation tool for manual review labeling.
Statistical analysis software for significance testing and error analysis.
Secure data storage for any restricted educational records.

Software & Tools

Python: Runs text cleaning, feature extraction, anomaly detection, and evaluation scripts.
Jupyter Notebook: Helps you document experiments and inspect model outputs step by step.
Google Colab: Gives you a free notebook workspace if your laptop is slow.
scikit-learn: Supports classification, clustering, metrics, and cross-validation.
spaCy: Extracts language features such as token patterns, sentence structure, and named entities.

Experiment Steps

Define what counts as suspicious peer feedback, such as vague text, generic praise, or mismatches with the rubric.
Choose one aggregation baseline, then decide how your anomaly score will change each review's weight.
Build a feature set that captures both writing style and meaning, so your detector has two ways to spot problems.
Plan a labeling strategy for a small validation set, so you can check whether the detector matches human judgment.
Design evaluation metrics that measure both grading accuracy and review detection quality, not just one or the other.
Compare multiple model versions, then test which features help most across different courses or assignment types.
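The two kinds of metrics named in the steps above can be sketched as follows. This assumes you have reference grades to compare against (for example, instructor grades) and human labels marking which reviews are suspicious; both are assumptions about your dataset, and the function names are made up for illustration.

```python
def mean_absolute_error(predicted, reference):
    """Average gap between aggregated peer grades and reference grades."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


def precision_recall(flagged, truly_suspicious):
    """Detection quality from two parallel lists of booleans."""
    pairs = list(zip(flagged, truly_suspicious))
    tp = sum(1 for f, t in pairs if f and t)        # flagged and suspicious
    fp = sum(1 for f, t in pairs if f and not t)    # flagged but fine
    fn = sum(1 for f, t in pairs if not f and t)    # missed suspicious review
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Grading accuracy: compare each aggregation method against reference grades.
print(mean_absolute_error([8.0, 6.0], [7.0, 8.0]))

# Detection quality: compare detector flags against human labels.
print(precision_recall([True, True, False, False], [True, False, True, False]))
```

Reporting both numbers side by side for the baseline and the anomaly-aware system shows whether better detection actually translates into more accurate grades.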

Common Pitfalls

Treating all unusual reviews as AI-generated, which can confuse poor writing with actual manipulation.
Using only style features, which can miss semantically weak reviews that still sound polished.
Training and testing on the same course, which makes the system look better than it really is.
Ignoring reviewer bias history, which leaves the reputation score blind to repeated low-quality feedback.
Measuring only detection accuracy, which can hide the real problem of whether final grades become fairer.
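One way to avoid the same-course pitfall above is to hold out an entire course for testing. The sketch below assumes each review is a dictionary with a `course_id` field; that field name and the record layout are hypothetical.

```python
def leave_one_course_out(records, course_key="course_id"):
    """Yield (held_out_course, train, test) so no course is in both sets."""
    courses = sorted({r[course_key] for r in records})
    for held_out in courses:
        train = [r for r in records if r[course_key] != held_out]
        test = [r for r in records if r[course_key] == held_out]
        yield held_out, train, test


# Toy records standing in for real review data.
records = [
    {"course_id": "bio101", "review": "Clear thesis, weak evidence."},
    {"course_id": "bio101", "review": "Good job."},
    {"course_id": "cs50", "review": "The loop on line 3 never ends."},
]

for course, train, test in leave_one_course_out(records):
    print(course, len(train), len(test))
```

If performance drops sharply on the held-out course, that is a finding worth reporting, because it shows how much the detector depends on course-specific wording.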

What Makes This Competitive

A stronger project would test more than one way of spotting suspicious reviews and compare how each method changes final grading. You could add fairness checks, error analysis by reviewer type, or cross-course validation so your model is not tied to one dataset. You would also stand out if you explain when the system fails, not just when it works. That kind of careful analysis looks much more like real research.

Project Variations

Test whether the same detection system works on discussion posts instead of peer reviews.
Compare classic machine-learning features against transformer-based text embeddings for spotting suspicious feedback.
Measure whether reviewer reputation scores become more reliable when the system tracks score consistency across multiple assignments.
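For the score-consistency variation, one possible reputation measure is how closely a reviewer's scores track the consensus over several assignments. The formula below (1 divided by 1 plus the average gap) is an assumption chosen for simplicity, not an established standard, and the function name is hypothetical.

```python
from statistics import mean


def consistency_reputation(reviewer_scores, consensus_scores):
    """Map a reviewer's average gap from consensus to a value in (0, 1].

    Both arguments are dicts keyed by assignment ID. Returns None when the
    reviewer shares no assignments with the consensus data.
    """
    gaps = [
        abs(reviewer_scores[a] - consensus_scores[a])
        for a in reviewer_scores
        if a in consensus_scores
    ]
    if not gaps:
        return None
    # A gap of zero gives 1.0; larger gaps push the value toward 0.
    return 1.0 / (1.0 + mean(gaps))


reviewer = {"hw1": 8, "hw2": 6}
consensus = {"hw1": 8, "hw2": 8}
print(consistency_reputation(reviewer, consensus))
```

Note that a reviewer who disagrees with the consensus is not always wrong, so this measure is best checked against instructor grades where they exist.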

Learn More

PubMed: Search for review articles on automated feedback quality, educational data mining, and text anomaly detection.
arXiv: Search for recent preprints on peer grading, LLM-generated text detection, and learning analytics.
MIT OpenCourseWare: Use computer science and machine learning courses to review text classification and evaluation basics.
NOAA National Centers for Environmental Information: A model for working with large public datasets and careful metadata handling.
Kaggle Datasets: Explore public text classification datasets and notebook examples for feature engineering and evaluation.

