Evidence-Linked AI Code Review Bots

ISEF Category: Systems Software

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Subcategory: Other  ·  Difficulty: Advanced  ·  Setup: University Lab  ·  Time: Full Year

The Hook

Most AI code reviewers sound confident, but many cannot prove why they flagged a bug. That makes them hard to trust. You can build a reviewer that points to the exact line and rule behind each comment. That turns a vague warning into a traceable claim.

What Is It?

This project asks a simple question: can a code review tool explain itself well enough that a human can check its work? Your system would look at source code, run static analysis, and then use a small language model to turn the findings into clear review comments. Static analysis means rule-based checks that inspect code without running it. Think of it like a spell-checker for code, except it can also point to logic mistakes and style problems.

The key idea is evidence linkage. Instead of saying, "This might be wrong," the tool should say, "This line violates this rule, and here is the exact reason." That matters because developers need feedback they can trust. A review comment that names the line, the rule, and the observed pattern is easier to verify than a generic AI warning.
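One way to picture evidence linkage is as a data structure that a comment must fill in before it counts as traceable. The sketch below is only an illustration under stated assumptions: the class names, field names, and the `is_grounded` check are hypothetical and are not taken from any existing tool.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """One static-analysis finding a comment can cite (hypothetical schema)."""
    file: str
    line: int      # 1-indexed line number in the reviewed file
    rule_id: str   # identifier of the rule that fired
    pattern: str   # short description of the observed pattern


@dataclass(frozen=True)
class ReviewComment:
    """A review comment plus the evidence it claims to rest on."""
    text: str
    evidence: tuple[Evidence, ...] = ()


def is_grounded(comment: ReviewComment,
                source_lines: list[str],
                known_rules: set[str]) -> bool:
    """Return True only if every citation names a real line and a known rule."""
    if not comment.evidence:
        return False  # no citation at all is treated as ungrounded
    for ev in comment.evidence:
        if not 1 <= ev.line <= len(source_lines):
            return False  # cited line does not exist in the file
        if ev.rule_id not in known_rules:
            return False  # cited rule is not one the analyzer reported
    return True
```

A comment such as "This might be wrong" with an empty `evidence` tuple fails this check, while a comment that cites an existing line and a rule the analyzer actually reported passes. The check says nothing about whether the comment is correct, only whether a human can trace it.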

The benchmark names associated with this topic refer to research datasets used for bug-finding and code-change tasks. Defects4J is a well-known collection of real Java bugs. ManySStuBs4J collects small, single-statement bug fixes. You do not need to recreate those exact datasets to study the idea. You can use them as inspiration for how to test whether evidence-linked comments are more accurate and useful than plain model output.

Why This Is a Good Topic

This is a strong science fair topic because you can measure real system behavior, not just build a demo. You can test accuracy, traceability, comment usefulness, and false alarm rate across code samples with known issues. The project connects to software reliability, developer productivity, and safe AI tools. A student can learn how static analyzers work, how language models can be constrained, and how to evaluate software with real metrics.
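Metrics such as false alarm rate can be computed from simple sets of sample identifiers. The following is a minimal sketch, assuming each code sample has an ID, a known ground-truth label, and a yes/no flag from the reviewer. The function name and the choice of FP / (FP + TN) as the false alarm rate are assumptions for illustration; your project may define the metric differently.

```python
def review_metrics(flagged: set[str],
                   buggy: set[str],
                   all_items: set[str]) -> dict[str, float]:
    """Score a reviewer from sets of sample IDs.

    flagged:   samples the reviewer commented on as defective
    buggy:     samples that truly contain a defect (ground truth)
    all_items: every sample in the evaluation set
    """
    tp = len(flagged & buggy)              # real bugs that were flagged
    fp = len(flagged - buggy)              # clean samples flagged anyway
    fn = len(buggy - flagged)              # real bugs that were missed
    tn = len(all_items - flagged - buggy)  # clean samples left alone

    def ratio(num: int, den: int) -> float:
        return num / den if den else 0.0   # avoid division by zero

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "false_alarm_rate": ratio(fp, fp + tn),
    }
```

Running the same function on the plain-model reviewer and the evidence-linked reviewer, over the same samples, gives directly comparable numbers.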

Research Questions

  • How does adding exact line citations affect the accuracy of automated code review comments?
  • What is the effect of static analysis evidence on the false positive rate of review suggestions?
  • Does a small language model produce more useful comments when it receives rule-based findings instead of raw code alone?
  • To what extent do evidence-linked comments help a human reviewer identify the real bug faster?
  • Which combination of static analysis rules and model prompts gives the clearest review explanation?
  • How does performance change when the system reviews bug-fix snippets versus larger code diffs?

Basic Materials

  • Laptop or desktop computer with a modern web browser.
  • Python installed locally.
  • Git for version control.
  • Open-source code repositories with known bugs or bug-fix commits.
  • Text editor or IDE such as VS Code.
  • Spreadsheet software for tracking results.
  • Access to public benchmark datasets such as Defects4J or ManySStuBs4J, along with their documentation.
  • Open-source static analysis tool such as PMD, SpotBugs, ESLint, or a similar language-specific checker.

Advanced Materials

  • University lab machine or high-performance workstation for running multiple model and analysis experiments.
  • Local inference access for a small language model such as Llama or a similar open model.
  • Docker for reproducible software environments.
  • JUnit or another automated test harness for Java projects.
  • Code property graph or program analysis framework such as Joern, if the chosen language supports it.
  • Annotation tool for human evaluation of review comment quality.
  • Database or structured log store for experiment outputs.

Software & Tools

  • Python: Organizes experiment scripts, parses static analysis output, and scores review comments.
  • GitHub: Hosts code samples and tracks diffs, issues, and pull requests.
  • VS Code: Lets you inspect code, run analysis, and label examples in one place.
  • Jupyter Notebook: Helps you compare metrics, plot results, and document your analysis.
  • ImageJ: Not needed for this topic, so leave it out and choose a text or code analysis workflow instead.
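As a rough idea of what "parses static analysis output" can look like in Python, the sketch below flattens a JSON report into one record per finding. It assumes the report is shaped like the output of ESLint's `json` formatter (a list of per-file objects, each with a `filePath` and a `messages` list). Other analyzers such as PMD or SpotBugs use different layouts and would need their own parser. The function name and the output field names are hypothetical.

```python
import json


def parse_findings(report_json: str) -> list[dict]:
    """Flatten an ESLint-style JSON report into one dict per finding."""
    findings = []
    for file_entry in json.loads(report_json):
        for msg in file_entry.get("messages", []):
            findings.append({
                "file": file_entry["filePath"],
                "line": msg.get("line"),        # may be None for file-level messages
                "rule_id": msg.get("ruleId"),   # may be None for parse errors
                "message": msg.get("message", ""),
            })
    return findings
```

Each record carries the file, line, and rule that a later step can attach to a review comment as its evidence.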

Experiment Steps

  1. Define the review task you want the system to perform, such as bug finding, style checking, or both.
  2. Choose one programming language and one source of ground-truth examples so your evaluation stays fair.
  3. Build a baseline that gives plain model comments without evidence, then compare it with an evidence-linked version.
  4. Decide how you will score each comment, including correctness, traceability, and usefulness to a human reviewer.
  5. Plan a small set of controls that rule out easy wins, such as duplicated bugs, trivial formatting issues, or leaked labels.
  6. Compare results across different bug types and explain where evidence helps most and where it fails.
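The comparison in the last step can be sketched as a small grouping routine. This example assumes you have already labeled each comment by hand as correct or not and recorded which system produced it and which bug type it addressed. The function name and the `system`, `bug_type`, and `correct` field names are illustrative assumptions, not a required format.

```python
from collections import defaultdict


def accuracy_by_group(rows: list[dict]) -> dict[tuple[str, str], float]:
    """Fraction of correct comments for each (system, bug_type) pair.

    Each row is a dict with keys 'system', 'bug_type', and 'correct' (bool).
    """
    totals = defaultdict(lambda: [0, 0])  # key -> [correct count, total count]
    for row in rows:
        key = (row["system"], row["bug_type"])
        totals[key][0] += int(row["correct"])
        totals[key][1] += 1
    return {key: correct / total for key, (correct, total) in totals.items()}
```

Comparing the baseline and evidence-linked entries for the same bug type shows where evidence helps and where it does not, which a single overall score would hide.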

Common Pitfalls

  • Using a benchmark file as both training data and test data, which makes the results look better than they are.
  • Measuring only whether the model sounds convincing, which misses whether the comment is actually correct.
  • Mixing different languages or code styles in one run, which adds noise and hides the effect of evidence linkage.
  • Ignoring line-level grounding, which lets the system make vague claims that no one can verify.
  • Treating static analysis warnings as ground truth, which can overcount harmless style issues as real defects.
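The first pitfall, overlap between training and test data, can be checked mechanically before any experiment runs. The sketch below is one simple approach under stated assumptions: it treats two snippets as the same if they match after collapsing whitespace, which catches exact and reformatted duplicates but not renamed variables. The function names are hypothetical.

```python
import hashlib


def fingerprint(code: str) -> str:
    """Hash a snippet after collapsing all runs of whitespace to one space."""
    normalized = " ".join(code.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_leaks(train: list[str], test: list[str]) -> list[int]:
    """Return indices of test snippets that also appear in the training set."""
    seen = {fingerprint(snippet) for snippet in train}
    return [i for i, snippet in enumerate(test) if fingerprint(snippet) in seen]
```

Any index returned by `find_leaks` marks a test sample to remove or move before scoring, so duplicated bugs do not inflate the results.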

What Makes This Competitive

A stronger project does more than compare two tools. It separates correctness, explanation quality, and human trust into different metrics. It also tests harder cases, like multiple bugs in one file or comments that need both static rules and model reasoning. If you can show when evidence helps, when it fails, and why, your project starts to look like real systems research.

Project Variations

  • Use Python or JavaScript code instead of Java to see whether evidence-linked comments work better in a different ecosystem.
  • Compare static analysis plus a small language model against static analysis alone to isolate the model’s contribution.
  • Test human ratings of comment trust and usefulness, not just bug detection accuracy.

Learn More

  • NIH PubMed: Search for review articles on explainable AI, software fault detection, and human trust in automated decision systems.
  • NASA Open Source Software Catalog: Browse examples of public software projects and development practices.
  • MIT OpenCourseWare, Software Construction: Review lecture notes on testing, debugging, and program analysis.
  • Defects4J project page: Read the dataset documentation and bug benchmark setup for real Java defects.
  • ManySStuBs4J paper: Find the original paper through a scholarly search engine or the journal page to learn how single-statement bugs are organized.
  • arXiv: Search for papers on code review automation, static analysis, and LLM-based software engineering.
