Robot Arm Command Refusal With Vision Models

ISEF Category: Robotics and Intelligent Machines

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Subcategory: Cognitive Systems · Difficulty: Advanced · Setup: University Lab · Time: Full Year

The Hook

A robot can sound smart and still do something silly, like try to place a cup inside an apple. That happens because language models can produce plausible-sounding words without checking them against physics. Your project asks a sharper question: can a robot spot impossible commands before it acts? That is a real test of machine common sense.

What Is It?

This project studies common-sense grounding, which means matching words to the real world. Your robot arm gets a command, looks at the scene through a camera, and decides whether the action makes physical sense. If the command is impossible, it should refuse and explain why.

Think of it like a careful friend who does not just hear your sentence, but also looks at the table, the cup, and the apple before acting. A vision-language model, or VLM, combines image understanding with text understanding. You can test whether that model can catch obvious physics mistakes, like impossible containment, size mismatch, or missing support.

The science fair angle is not just whether the robot moves. The real question is whether it reasons about the scene well enough to stop bad actions. That makes this a mix of robotics, computer vision, and machine reasoning.
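One way to make the "look, decide, refuse and explain" loop testable is to ask the model for a structured verdict and parse it in code. The sketch below is only an illustration: the prompt wording, the JSON fields `refuse` and `reason`, and the helper names `build_prompt` and `parse_decision` are assumptions, not part of any particular VLM's API. Sending the prompt and image to a model is left out, because that call depends on the model you choose.

```python
import json
from dataclasses import dataclass


@dataclass
class Decision:
    refuse: bool
    reason: str


def build_prompt(command: str) -> str:
    """Wrap a command in instructions asking for a JSON verdict (hypothetical wording)."""
    return (
        "You control a robot arm and can see the attached image of the table.\n"
        f"Command: {command}\n"
        "Decide whether the command is physically possible in this scene.\n"
        'Reply with JSON only: {"refuse": true or false, "reason": "<one sentence>"}'
    )


def parse_decision(reply: str) -> Decision:
    """Pull the JSON verdict out of a model reply.

    Replies that cannot be parsed are treated as refusals so the arm never
    moves on an unclear answer (an assumed safety default).
    """
    fallback = Decision(refuse=True, reason="unparseable reply")
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        return fallback
    try:
        data = json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return fallback
    if not isinstance(data, dict) or not isinstance(data.get("refuse"), bool):
        return fallback
    return Decision(refuse=data["refuse"], reason=str(data.get("reason", "")))
```

If you use a fallback like this, log unparseable replies separately when you score, so that formatting failures do not get counted as good refusals.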

Why This Is a Good Topic

This is a strong science fair topic because you can measure it clearly. You can build a labeled set of commands, score whether the system refuses impossible ones, and compare different prompts, models, or scene setups. It connects to robot safety, home automation, and assistive robots that need to avoid harmful mistakes. You can learn how to design benchmarks, read model outputs, and analyze failure cases like a researcher.

Research Questions

How does the robot's refusal accuracy change when the command involves size mismatch versus containment mismatch?
What is the effect of adding a short physics reminder to the prompt on refusal accuracy?
Does a quantized model refuse impossible commands as often as the non-quantized version?
To what extent does the robot explain refusals with scene-specific reasons instead of generic text?
Which object pairs trigger the most false accepts in the benchmark set?
How does lighting or camera angle change the model's ability to detect implausible commands?
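Several of these questions compare two conditions on the same prompts, for example a quantized model against the full-size one, or a prompt with and without a physics reminder. One possible analysis for that paired setup is an exact McNemar test, sketched below using only the standard library. The function name `mcnemar_exact` and the input format (one True/False correctness value per prompt, in the same order for both conditions) are assumptions made for this example, and other tests may suit your design better.

```python
from math import comb


def mcnemar_exact(correct_a: list[bool], correct_b: list[bool]) -> float:
    """Two-sided exact McNemar p-value for paired per-prompt correctness."""
    if len(correct_a) != len(correct_b):
        raise ValueError("both conditions must be scored on the same prompts")
    # Only discordant pairs (one condition right, the other wrong) carry information.
    only_a = sum(a and not b for a, b in zip(correct_a, correct_b))
    only_b = sum(b and not a for a, b in zip(correct_a, correct_b))
    n = only_a + only_b
    if n == 0:
        return 1.0  # the two conditions never disagreed
    k = min(only_a, only_b)
    # Under the null hypothesis, discordant pairs split 50/50: Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

A small p-value suggests the two conditions really differ on this prompt set. With only a handful of disagreeing prompts the test has little power, so report the raw counts alongside it.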

Basic Materials

Raspberry Pi with camera support.
Pi Camera or USB webcam.
Small robot arm kit with repeatable motion control.
Laptop or desktop computer for model testing.
Stable table or work surface.
Household objects with clear shapes, such as a cup, apple, spoon, box, and ball.
Printed prompt cards or a digital prompt list.
Notebook or spreadsheet for logging responses.
Tape or markers for fixed object placement.
Smartphone or basic camera for documenting trials.

Advanced Materials

Robot arm with open software control and repeatable pose planning.
Raspberry Pi or edge computer for on-device inference tests.
Camera with adjustable mount and fixed focal distance.
Calibrated reference objects with known dimensions.
External monitor or serial logging setup for debugging model output.
Dataset storage with version control for prompt and image labels.
GPU workstation for comparing full-size and quantized VLMs.
Image annotation tool for tagging object positions and scene states.

Software & Tools

Python: Runs the benchmark, logs model outputs, and computes refusal accuracy.
ImageJ: Helps measure object size in images and compare scene constraints.
Label Studio: Organizes and labels your prompt set and image categories.
Jupyter Notebook: Lets you explore error patterns and make plots from your results.
GitHub Desktop: Keeps your prompt set, code, and results organized with version history.

Experiment Steps

Define the kinds of impossible command you will test, such as containment, support, or size mismatch. Start with one so your first benchmark stays focused, then add the others.
Build a prompt set that mixes impossible and possible commands, then label each one with the scene fact that should trigger refusal.
Choose the model setup you will test first, including a baseline version and at least one comparison condition.
Plan how the robot will turn a visual scene into a decision, then decide what counts as a correct refusal and a correct acceptance.
Design a scoring sheet for refusal accuracy, explanation quality, and false acceptance rate so you can compare versions fairly.
Set aside a separate test group of prompts, so you can check whether the system generalizes beyond the examples you created.
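The scoring sheet from the steps above can be turned into a few lines of code. This is a minimal sketch under stated assumptions: "refusal accuracy" is taken to mean the fraction of impossible commands that were refused, the `Trial` fields are hypothetical names, and explanation quality is not scored here because it needs human judgment.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    prompt_id: str
    impossible: bool  # ground-truth label from your benchmark
    refused: bool     # what the system actually did


def score(trials: list[Trial]) -> dict[str, float]:
    """Compute refusal and acceptance rates from labeled trials."""
    impossible = [t for t in trials if t.impossible]
    possible = [t for t in trials if not t.impossible]
    if not impossible or not possible:
        raise ValueError("need at least one impossible and one possible trial")
    correct_refusals = sum(t.refused for t in impossible)
    false_accepts = len(impossible) - correct_refusals
    over_refusals = sum(t.refused for t in possible)
    correct_accepts = len(possible) - over_refusals
    return {
        "refusal_accuracy": correct_refusals / len(impossible),
        "false_acceptance_rate": false_accepts / len(impossible),
        "over_refusal_rate": over_refusals / len(possible),
        "overall_accuracy": (correct_refusals + correct_accepts) / len(trials),
    }
```

Reporting the over-refusal rate next to refusal accuracy matters, because a system that refuses everything would score perfectly on refusals alone.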

Common Pitfalls

Mixing impossible commands with vague commands, which makes it hard to tell whether the robot failed reasoning or just failed wording.
Letting the camera angle change between trials, which can make the same scene look different to the model.
Using only one type of impossibility, which makes the benchmark too narrow and weakens the claim.
Scoring explanations only by whether they sound fluent, which can hide wrong or generic reasons.
Testing on the same prompts used during prompt tuning, which inflates accuracy and hides real failure cases.
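To guard against the last pitfall, you can split the prompt set once, before any prompt tuning, and keep the test portion untouched. The sketch below shows one way to do that. It assumes each prompt is a dictionary with a `category` key (for example "containment" or "size"), and the function name and the 30 percent default are illustrative choices.

```python
import random
from collections import defaultdict


def split_prompts(prompts, test_fraction=0.3, seed=0):
    """Split prompts into tuning and held-out test sets, stratified by category."""
    by_category = defaultdict(list)
    for p in prompts:
        by_category[p["category"]].append(p)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    tune, test = [], []
    for category in sorted(by_category):
        group = by_category[category][:]
        rng.shuffle(group)
        # Hold out at least one prompt per category when the category has more than one.
        n_test = max(1, round(len(group) * test_fraction)) if len(group) > 1 else 0
        test.extend(group[:n_test])
        tune.extend(group[n_test:])
    return tune, test
```

Saving the two lists to separate files under version control makes it easy to show judges that the test prompts were never used while tuning.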

What Makes This Competitive

A strong version of this project goes past a simple yes-or-no score. You can compare failure types, such as wrong acceptance, over-refusal, and vague explanation. You can also test whether the model stays accurate across new objects, new camera angles, and new prompt styles. If you build a clean benchmark and analyze where the robot breaks, your project starts to look like real research, not just a demo.
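The failure types named above can be tallied automatically as a first pass. In this sketch, a refusal counts as a "vague explanation" when its reason mentions none of the objects in the scene. That keyword check is a crude stand-in assumed for illustration: it cannot tell whether a reason is actually correct, so human rating of explanations is still needed. The function and label names are hypothetical.

```python
from collections import Counter


def classify_trial(impossible: bool, refused: bool, reason: str,
                   scene_objects: list[str]) -> str:
    """Label one trial as correct or as one of three failure types."""
    if impossible and not refused:
        return "wrong_acceptance"
    if not impossible and refused:
        return "over_refusal"
    if impossible and refused:
        text = reason.lower()
        mentions_scene = any(obj.lower() in text for obj in scene_objects)
        return "correct" if mentions_scene else "vague_explanation"
    return "correct"  # possible command, correctly accepted


def tally(trials: list[dict]) -> Counter:
    """Count outcome labels over trials given as keyword dictionaries."""
    return Counter(classify_trial(**t) for t in trials)
```

Running `tally` separately for each impossibility type or camera angle gives a table of where the system breaks, which is usually more informative than a single accuracy number.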

Project Variations

Test whether the robot refuses commands about unsupported objects, like placing a heavy item on a thin edge.
Compare how well two different VLMs explain impossible containment commands across the same scene set.
Measure whether adding depth cues or object size labels improves the refusal rate on physically impossible prompts.
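For the size-label variation, one simple approach is to append measured dimensions to the command text so that the labeled and unlabeled conditions differ only in that one line. The sketch below assumes you have measured each object's width, depth, and height in centimeters. The label format and the function names are illustrative choices.

```python
def size_label(name: str, dims_cm: tuple[float, float, float]) -> str:
    """Format one object's measured width, depth, and height."""
    w, d, h = dims_cm
    return f"{name}: {w:g} x {d:g} x {h:g} cm"


def add_size_labels(command: str,
                    objects: dict[str, tuple[float, float, float]]) -> str:
    """Return the command followed by a line of size labels, sorted by object name."""
    labels = "; ".join(size_label(n, dims) for n, dims in sorted(objects.items()))
    return f"{command}\nMeasured object sizes: {labels}"
```

Sorting the objects by name keeps the label order identical across trials, so the order of objects does not become an accidental variable.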

Learn More

MIT OpenCourseWare: Search for robotics, computer vision, and machine learning courses that explain perception, planning, and model evaluation.
NIH PubMed: Search review articles on embodied AI, vision-language models, and robot safety.
NASA Open Science Data Repository: Explore open datasets and examples of structured scientific benchmarking.
NOAA Education Resource Collection: Find material on observation, data quality, and measurement error.
arXiv: Search preprints on vision-language models, common-sense reasoning, and robot command following.
