LLM Repo License Conflict Detection

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Subcategory: Other · Difficulty: Advanced · Setup: University Lab · Time: Full Year

The Hook

One copied file can create a legal mess for an entire app. That risk grows when a large language model writes code from patterns it has seen before, because nothing in the output says where those patterns came from. Your project can build a detector that spots likely license conflicts before they spread through a repo. That turns a fuzzy legal problem into a measurable computer science problem.

What Is It?

This project asks a simple question with a tricky answer: where did a piece of code likely come from, and what license rules might follow it? A software license is a set of rules that tells people how they can use, share, and change code. Some licenses are permissive, while copyleft licenses such as the GPL can require that distributed works built on the code stay open under the same terms.

Think of your detector like a fingerprint match system for code. Instead of matching a person, you match patterns in a file or snippet against known open-source code and license cues. The goal is to predict when a repo may contain borrowed code with a license that conflicts with the rest of the project. In an LLM-written repo, that means you are not only asking, "Does this look similar to something online?" You are also asking, "What legal baggage might ride along with that similarity?"
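The fingerprint idea above can be sketched in a few lines of Python. This is a minimal illustration, assuming token k-gram "shingles" compared with Jaccard similarity; it is one common way to fingerprint code, not the only one. The function names, the toy corpus, and the license labels are all hypothetical.

```python
import re

def tokenize(code: str) -> list[str]:
    """Split code into identifiers, numbers, and single punctuation marks."""
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code)

def fingerprint(code: str, k: int = 4) -> set[tuple[str, ...]]:
    """Return the set of k-token shingles; very short inputs yield one shingle."""
    tokens = tokenize(code)
    if len(tokens) < k:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def jaccard(a: set, b: set) -> float:
    """Shared shingles divided by all shingles; 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def best_match(snippet: str, corpus: dict, k: int = 4):
    """corpus maps a file name to (license_label, code); it must not be empty.

    Returns (similarity, file_name, license_label) for the closest entry.
    """
    query = fingerprint(snippet, k)
    scored = [
        (jaccard(query, fingerprint(code, k)), name, license_label)
        for name, (license_label, code) in corpus.items()
    ]
    return max(scored)

# Toy corpus with made-up files and labels, for illustration only.
corpus = {
    "a.py": ("GPL-3.0", "def add(a, b):\n    return a + b\n"),
    "b.py": ("MIT", "print('hello world')"),
}
print(best_match("def add(a,b): return a+b", corpus))
```

Because the comparison works on tokens rather than raw characters, whitespace changes do not affect the score. Renamed variables still do, which is one of the weaknesses your experiments can measure.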

Why This Is a Good Topic

This is a strong science fair topic because you can measure it in clear ways, like precision, recall, and false positive rate. You can test whether fuzzy fingerprinting works better than simple text matching, and whether some kinds of generated code are harder to classify than others. It also connects to a real problem in software development, open-source compliance, and AI-generated code. You can learn code search, classification, evaluation design, and error analysis without needing a wet lab.
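To make those metrics concrete, here is a small sketch that computes precision, recall, and false positive rate from binary labels, where 1 means "this snippet carries a license conflict". The function names and the toy labels are assumptions for illustration, not results from any real benchmark.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, true negatives, false negatives."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth and pred:
            tp += 1
        elif not truth and pred:
            fp += 1
        elif truth and not pred:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def safe_div(numerator, denominator):
    """Return 0.0 instead of raising when a denominator is zero."""
    return numerator / denominator if denominator else 0.0

def detector_metrics(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "precision": safe_div(tp, tp + fp),            # flagged items that were real conflicts
        "recall": safe_div(tp, tp + fn),               # real conflicts that were flagged
        "false_positive_rate": safe_div(fp, fp + tn),  # clean items flagged by mistake
    }

# Toy labels: 3 real conflicts, 5 clean snippets.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
print(detector_metrics(y_true, y_pred))
```

In this toy case the detector finds 2 of 3 conflicts (recall 2/3), 2 of its 3 flags are correct (precision 2/3), and it wrongly flags 1 of 5 clean snippets (false positive rate 0.2).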

Research Questions

How does fuzzy code fingerprinting compare with exact string matching for identifying likely upstream licenses?
What is the effect of snippet length on license prediction accuracy?
Does adding file-level context improve GPL contamination detection in AI-generated repositories?
To what extent do formatting changes, variable renaming, and comment removal reduce fingerprint match quality?
Which license families are most often confused in a benchmark of public AI-generated repos?
How does a model trained on one set of repositories perform on a different set of repositories?
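For the question about formatting changes, variable renaming, and comment removal, one possible starting point is a normalizer that erases exactly those differences before fingerprinting. The sketch below assumes the snippets are valid Python and uses the standard library `tokenize` module; the `normalize` function and the two sample snippets are hypothetical. Incomplete or invalid snippets will raise an error, so a real pipeline would need a fallback.

```python
import io
import keyword
import tokenize

# Token types that carry layout or comments rather than program content.
SKIP = {
    tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
    tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER,
}

def normalize(source: str) -> list[str]:
    """Return a token list with comments and layout dropped and every
    non-keyword name replaced by the placeholder 'ID'."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in SKIP:
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            out.append("ID")
        else:
            out.append(tok.string)
    return out

original = (
    "def total(xs):\n"
    "    # sum the list\n"
    "    s = 0\n"
    "    for x in xs:\n"
    "        s += x\n"
    "    return s\n"
)
renamed = (
    "def acc(values):\n"
    "    result = 0\n"
    "    for v in values:\n"
    "        result += v\n"
    "    return result\n"
)
print(normalize(original) == normalize(renamed))
```

This normalization is aggressive: it also collapses function and built-in names, so unrelated snippets with the same shape will look identical. Comparing match quality with and without it is a direct way to test the research question.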

Basic Materials

Laptop or desktop computer with a modern CPU.
Git client for collecting and managing repositories.
Python 3 environment.
Text editor or code editor.
Spreadsheet software for tracking samples and labels.
External storage or cloud drive for organizing benchmark files.
Basic notebook for label rules and error notes.

Advanced Materials

University lab workstation or high-memory desktop.
Large local code corpus or access to a curated open-source repository dataset.
Code similarity indexing tools for large-scale search.
Python libraries for machine learning, data cleaning, and evaluation.
GPU access if you test embedding-based or transformer-based classifiers.
Database or search backend for storing snippet fingerprints and matches.
Version control system for reproducible experiments.

Software & Tools

Python: Runs the parsing, fingerprinting, classification, and evaluation pipeline.
GitHub search: Helps you sample public repositories and inspect code history.
Git: Tracks changes to your benchmark, labels, and experiment versions.
Pandas: Organizes snippet metadata, labels, and prediction results.
scikit-learn: Supports baseline classifiers, metrics, and cross-validation.
Jupyter Notebook: Lets you inspect matches, errors, and confusion patterns interactively.

Experiment Steps

Define the unit you will analyze, such as a file, function, or snippet, and decide how you will label license origin.
Build a benchmark from public repos by selecting samples with known or strongly documented licenses.
Design at least two matching methods, one simple baseline and one fuzzy fingerprinting approach.
Set rules for how you will score a match, including what counts as a likely upstream license and what counts as contamination risk.
Plan a testing split that keeps related repos and near-duplicate snippets together on one side, so nothing closely related appears in both the training and test sets.
Compare errors by license family, snippet size, and code modification style, then turn those patterns into your main findings.
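The splitting step above can be sketched with a deterministic, group-aware assignment. This is a minimal version assuming each sample is a dict with a hypothetical `origin` field that identifies its upstream project or near-duplicate cluster; every sample from the same origin lands on the same side. scikit-learn's `GroupShuffleSplit` and `GroupKFold` provide the same guarantee if you prefer a library.

```python
import hashlib

def assign_split(group_id: str, test_fraction: float = 0.2) -> str:
    """Map a group id to 'train' or 'test' using a stable hash,
    so the assignment never changes between runs."""
    digest = hashlib.sha256(group_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return "test" if bucket < test_fraction else "train"

def split_by_group(samples, group_key="origin", test_fraction=0.2):
    """Split samples so that no group appears in both train and test."""
    train, test = [], []
    for sample in samples:
        side = assign_split(sample[group_key], test_fraction)
        (test if side == "test" else train).append(sample)
    return train, test

# Toy benchmark: 20 made-up origins with 3 snippets each.
samples = [
    {"origin": f"repo-{i}", "snippet_id": j}
    for i in range(20)
    for j in range(3)
]
train, test = split_by_group(samples)
shared = {s["origin"] for s in train} & {s["origin"] for s in test}
print(len(train), len(test), shared)
```

The test fraction is only approximate when there are few groups, because whole groups move together. The quality of the split depends on how well `origin` captures forks and copies, which is part of your label design.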

Common Pitfalls

Using whole repositories as labels when only some files actually carry the same license, which muddies the ground truth.
Mixing training and test snippets from near-duplicate repos, which makes the detector look better than it really is.
Treating every similarity match as a license conflict, which inflates false alarms.
Ignoring license text and repository metadata, which can make the model miss obvious cues.
Evaluating only overall accuracy, which lets the many easy negative cases hide GPL-related errors, even though those errors matter most.
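One way to avoid treating every similarity match as a conflict is to separate "how similar is it?" from "do the licenses clash?". The sketch below is a deliberately simplified rule, assuming a similarity threshold of 0.8 and two hand-written license groups. The groups, labels, and threshold are illustrative assumptions, not a statement of how these licenses actually interact; a real project would need a carefully sourced compatibility table.

```python
from dataclasses import dataclass

# Illustrative groupings only; real compatibility rules are more detailed.
COPYLEFT = {"GPL-2.0", "GPL-3.0", "AGPL-3.0"}
PERMISSIVE = {"MIT", "BSD-3-Clause", "Apache-2.0"}

@dataclass
class Match:
    path: str               # file in the repo under test
    upstream_license: str   # license label of the closest known source
    similarity: float       # fingerprint similarity in [0, 1]

def risk_label(match: Match, project_license: str, threshold: float = 0.8) -> str:
    """Turn one match into a coarse risk label."""
    if match.similarity < threshold:
        return "no_match"            # too weak to count as borrowed code
    if match.upstream_license == project_license:
        return "same_license"        # similar, but no license mismatch
    if match.upstream_license in COPYLEFT and project_license in PERMISSIVE:
        return "possible_conflict"   # the case this project is hunting for
    return "needs_review"            # mismatch the toy rule cannot judge

matches = [
    Match("utils/sort.py", "GPL-3.0", 0.93),
    Match("utils/io.py", "GPL-3.0", 0.41),
    Match("core/app.py", "MIT", 0.97),
]
for m in matches:
    print(m.path, risk_label(m, project_license="MIT"))
```

Only the first match is labeled `possible_conflict`. The second is too dissimilar to count, and the third matches code under the project's own license.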

What Makes This Competitive

A competitive version of this project would go beyond a simple similarity checker. You would need a clean benchmark, a careful label scheme, and metrics that separate mild similarity from real license risk. Strong work would compare multiple approaches, explain failure cases, and show where the detector helps most, such as near-duplicate code, paraphrased functions, or mixed-license repos. If you add a novel error analysis or a better way to estimate contamination risk, your project gets much stronger.
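A simple way to start explaining failure cases is to tally which license families get mistaken for which, and to report recall per family next to overall accuracy. The sketch below uses only the standard library; the function names and the toy labels are assumptions for illustration, not real results.

```python
from collections import Counter

def confusion_pairs(true_labels, predicted_labels) -> Counter:
    """Count each (true, predicted) pair where the prediction was wrong."""
    return Counter(
        (t, p) for t, p in zip(true_labels, predicted_labels) if t != p
    )

def recall_by_family(true_labels, predicted_labels) -> dict:
    """Fraction of each true family that was predicted correctly."""
    total = Counter(true_labels)
    correct = Counter(
        t for t, p in zip(true_labels, predicted_labels) if t == p
    )
    return {family: correct[family] / total[family] for family in total}

# Toy predictions over three license families.
true_labels = ["GPL", "GPL", "GPL", "MIT", "MIT", "MIT", "BSD", "BSD"]
predicted = ["GPL", "MIT", "GPL", "MIT", "BSD", "BSD", "BSD", "MIT"]

print(confusion_pairs(true_labels, predicted).most_common(1))
print(recall_by_family(true_labels, predicted))
```

In this toy data overall accuracy is 0.5, but the breakdown shows the errors are uneven: MIT snippets predicted as BSD are the most common mistake, and MIT recall (1/3) is well below GPL recall (2/3).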

Project Variations

Test whether the detector works better on Python repos than on JavaScript repos.
Compare snippet-level detection with file-level detection for mixed-license projects.
Add repository metadata, then see whether license prediction improves over code-only matching.

Learn More

GNU Project licenses page: Read plain-language summaries of major open-source licenses and their obligations on the GNU website.
OSI license list: Review approved open-source licenses and compare their core terms on the Open Source Initiative website.
GitHub Docs, code search: Learn how public code search and repository metadata can help you build a benchmark.
PubMed: Search for review articles on software license compliance, code cloning, and software reuse when you want background reading. Its coverage is mainly biomedical, so expect relevant results to be sparse.
arXiv: Search for recent papers on code similarity detection, provenance, and license classification.
MIT OpenCourseWare, software construction or information systems courses: Use lecture notes on software reuse, repositories, and software engineering methods.

