Vision-Language Robot Navigation with LLaVA

ISEF Category: Robotics and Intelligent Machines

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →

Subcategory: Machine Learning · Difficulty: Advanced · Setup: University Lab · Time: Full Year

The Hook

A robot can hear “go to the room with the most red things” and still end up in the wrong place. The tricky part is not moving the robot. It is linking words, images, and maps in a way that makes sense. That gap is where your project lives. You can test whether a vision-language model really helps a robot navigate better than a simpler vision baseline.

What Is It?

This project asks a robot to follow human-style instructions that are fuzzy, visual, or relative. Instead of a clean command like “turn left,” the robot has to handle phrases like “go to the room with the most red things” or “find the chair near the window.” A vision-language model, like LLaVA, tries to connect the text instruction with what the camera sees. ROS 2 Nav2 then turns that decision into robot motion.
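
As a rough illustration of the hand-off from "which room?" to "where should the robot drive?", here is a minimal sketch. The room names, coordinates, and the `room_to_goal` helper are made up for this example, and the returned dictionary only mirrors the kind of information a Nav2 goal pose carries (a frame, a position, and a quaternion heading). It is not the actual ROS 2 message type.

```python
import math

# Hypothetical room centers in the map frame: (x in m, y in m, heading in radians).
# Replace these with points measured on your own foam-board apartment.
ROOM_GOALS = {
    "kitchen": (1.0, 0.5, 0.0),
    "bedroom": (2.5, 1.5, math.pi / 2),
}


def room_to_goal(room_name):
    """Turn a room label chosen by the model into a planar goal pose."""
    x, y, yaw = ROOM_GOALS[room_name]
    # A rotation by `yaw` about the vertical axis corresponds to the
    # quaternion (0, 0, sin(yaw / 2), cos(yaw / 2)).
    return {
        "frame_id": "map",
        "position": {"x": x, "y": y, "z": 0.0},
        "orientation": {
            "x": 0.0,
            "y": 0.0,
            "z": math.sin(yaw / 2),
            "w": math.cos(yaw / 2),
        },
    }


goal = room_to_goal("bedroom")
print(goal)
```

Keeping this lookup identical for both agents means any difference in results comes from the room choice, not from where each agent was told to drive.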

Think of it like giving directions to a friend who has never seen your house. A simple map tells them where walls and doors are, but it does not tell them which room counts as “the red room.” Your robot needs both perception and language understanding. A CLIP-only baseline gives you a simpler comparison point: CLIP can score how well a piece of text matches an image, but it does not produce written answers or explanations about the scene the way LLaVA does.
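
One way a CLIP-only baseline could pick a room is to compare the instruction's embedding against one image embedding per room and choose the closest match. The sketch below shows only that ranking step, using tiny made-up 3-number vectors in place of real CLIP embeddings. The `pick_room` helper and the vectors are assumptions for illustration, not the output of any real model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pick_room(text_embedding, room_embeddings):
    """Return the room whose image embedding is most similar to the text."""
    scores = {
        room: cosine_similarity(text_embedding, emb)
        for room, emb in room_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy stand-ins for embeddings (real ones have hundreds of dimensions).
instruction = [0.9, 0.1, 0.0]
rooms = {
    "kitchen": [0.2, 0.9, 0.1],
    "bedroom": [0.8, 0.2, 0.1],
    "bathroom": [0.0, 0.1, 0.9],
}

best_room, scores = pick_room(instruction, rooms)
print(best_room, scores)
```

Logging the full score dictionary, not just the winner, lets you see later how close the baseline came when it chose the wrong room.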

Why This Is a Good Topic

This makes a strong science fair topic because you can measure clear outcomes, like success rate, wrong-room rate, and navigation time. You can also compare two approaches, which makes your results stronger than a single demo. The real-world link is easy to explain, since home robots, assistive robots, and warehouse robots all need to understand messy human instructions. You can learn how to design a fair test, build a baseline, and analyze model failure modes.
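
To make those outcomes concrete, here is a minimal sketch of how you might compute them from a trial log. The record fields (`agent`, `outcome`, `time_s`) and the outcome labels are assumptions for this example, and the numbers are placeholders, not real results.

```python
from statistics import mean

# Placeholder trial records. In a real project, load these from your log files.
trials = [
    {"agent": "llava", "outcome": "success", "time_s": 42.0},
    {"agent": "llava", "outcome": "wrong_room", "time_s": 55.0},
    {"agent": "llava", "outcome": "success", "time_s": 39.0},
    {"agent": "clip", "outcome": "success", "time_s": 48.0},
    {"agent": "clip", "outcome": "nav_failure", "time_s": 90.0},
    {"agent": "clip", "outcome": "wrong_room", "time_s": 61.0},
]


def summarize(trials, agent):
    """Compute success rate, wrong-room rate, and mean time for one agent."""
    runs = [t for t in trials if t["agent"] == agent]
    if not runs:
        raise ValueError(f"no trials logged for agent {agent!r}")
    successes = [t for t in runs if t["outcome"] == "success"]
    wrong = [t for t in runs if t["outcome"] == "wrong_room"]
    return {
        "n": len(runs),
        "success_rate": len(successes) / len(runs),
        "wrong_room_rate": len(wrong) / len(runs),
        # Time is averaged over successful runs only, so failed runs
        # that time out do not distort the navigation-time number.
        "mean_success_time_s": (
            mean(t["time_s"] for t in successes) if successes else None
        ),
    }


for agent in ("llava", "clip"):
    print(agent, summarize(trials, agent))
```

Whether to average time over all runs or only successful ones is a design choice. Pick one before testing and apply it to both agents.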

Research Questions

How does a vision-language agent’s success rate change across instructions with color words, object words, and spatial words?
How does the navigation success of a LLaVA-based agent compare with that of a CLIP-only baseline in a foam-board apartment?
Does the robot perform better when the target room contains one dominant visual cue instead of several competing cues?
To what extent do instruction length and wording complexity change the robot’s path efficiency?
Which failure type happens most often: wrong room choice, map localization error, or planning failure?
How does changing the number of distractor objects affect the agent’s instruction-following accuracy?
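
For the path-efficiency question, one possible definition is the ratio of a reference shortest-path length to the distance the robot actually drove, counted as zero for failed runs. This is similar in spirit to success-weighted path length scores used in navigation research. The sketch below assumes you log the robot's (x, y) position over time and measure the reference length yourself. The function names are hypothetical.

```python
import math


def path_length(points):
    """Total length of a path given as a list of (x, y) points in meters."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))


def path_efficiency(reference_length, driven_points, success):
    """Score one run between 0 and 1; 1 means the robot drove the reference path."""
    if not success:
        return 0.0
    driven = path_length(driven_points)
    # max() keeps the score at or below 1 even if logging noise makes
    # the driven path look shorter than the reference.
    return reference_length / max(driven, reference_length)


# Example: the reference route is 5 m, but the robot took an L-shaped 7 m route.
driven = [(0.0, 0.0), (3.0, 0.0), (3.0, 4.0)]
print(path_efficiency(5.0, driven, success=True))
```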

Basic Materials

TurtleBot-clone or similar mobile robot platform
Onboard camera and depth sensor, if available
Laptop or desktop computer for model inference and logging
Wi-Fi router or direct network link for robot communication
Foam board, cardboard, and tape for building a small apartment map
Colored paper, toy furniture, and printable signs for room cues
Measuring tape for consistent map layout
Printed instruction set for testing
Stopwatch or timestamped log files
Digital camera or phone for documenting runs

Advanced Materials

Mobile robot platform with ROS 2 support and wheel odometry
RGB camera and depth camera, or RGB-D sensor
GPU workstation for local model inference
AprilTags or fiducial markers for localization checks
Laser distance meter for accurate map geometry
External microphone, if you test spoken instructions later
Data logging drive or network storage for run traces
Calibration board for camera alignment checks
Benchmark object set with standardized colors and shapes
Spare batteries and charging station for repeated trials

Software & Tools

ROS 2: Runs robot communication, sensing, and navigation modules.

Experiment Steps

Define the exact instruction types you want the robot to handle, such as color-based, object-based, and spatial instructions.
Build a small apartment map with repeated room layouts and clear distractors so you can test perception, not just luck.
Choose one baseline and one vision-language agent, then keep the navigation stack and map conditions as similar as possible.
Design success metrics before testing, including task completion, wrong-target rate, travel distance, and time to completion.
Plan a trial schedule that balances instruction types, room configurations, and repeated runs so one condition does not dominate.
Decide how you will score errors, log failures, and compare the two models with the same analysis rules.
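
The balanced trial schedule described in these steps can be sketched as follows. The agent names, instruction types, layout labels, and the `build_schedule` helper are placeholders for this example. The idea is simply that every combination appears the same number of times and the run order is shuffled with a fixed seed so you can reproduce it.

```python
import itertools
import random


def build_schedule(agents, instruction_types, layouts, repeats, seed=0):
    """Return a shuffled list of (agent, instruction_type, layout) trials."""
    conditions = list(itertools.product(agents, instruction_types, layouts))
    # Repeat every condition equally so no condition dominates the data.
    schedule = [cond for cond in conditions for _ in range(repeats)]
    # A seeded shuffle spreads battery drain and lighting drift across
    # conditions while keeping the order reproducible.
    random.Random(seed).shuffle(schedule)
    return schedule


schedule = build_schedule(
    agents=["llava", "clip"],
    instruction_types=["color", "object", "spatial"],
    layouts=["layout_a", "layout_b"],
    repeats=5,
)

print(len(schedule), "trials planned")
for trial in schedule[:3]:
    print(trial)
```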

Common Pitfalls

Letting the foam-board apartment change between trials, which breaks comparison across runs.
Using instructions so vague that no single room is clearly the right answer, which makes failures impossible to score.
Mixing navigation bugs with language understanding bugs, which hides the real cause of errors.
Changing lighting or camera angle between sessions, which shifts visual recognition performance.
Running too few trials per instruction type, which makes random luck look like a real effect.
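
To see why a small number of trials is risky, you can put a confidence interval around each success rate. The sketch below uses the Wilson score interval, which is one common choice for proportions. The trial counts are invented for illustration.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a success proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Same observed success rate (80%), very different amounts of evidence.
low_5, high_5 = wilson_interval(4, 5)
low_50, high_50 = wilson_interval(40, 50)

print(f"4/5 successes:   {low_5:.2f} to {high_5:.2f}")
print(f"40/50 successes: {low_50:.2f} to {high_50:.2f}")
```

If the intervals for two agents overlap heavily, the data probably cannot support a claim that one agent is better, which is a signal to run more trials.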

What Makes This Competitive

A stronger project goes beyond a simple success-rate chart. You can separate perception errors from planning errors, then test whether one kind of instruction breaks the system more than another. You can also compare not just LLaVA and CLIP, but different prompt styles, map layouts, or distractor levels. Careful logging, clean controls, and a clear error analysis will make the work feel much more like real robotics research.
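
Separating perception errors from planning errors is easier if every run is labeled by the same fixed rule. Here is a minimal sketch of such a rule. The field names, the label names, and the 0.25 m localization threshold are assumptions for this example, so choose values that fit your own robot and map.

```python
from collections import Counter


def classify_run(chosen_room, target_room, reached_goal,
                 localization_error_m, max_loc_error_m=0.25):
    """Assign exactly one label to a run, checking causes in a fixed order."""
    if chosen_room != target_room:
        # The model picked the wrong room, regardless of how well it drove.
        return "perception_error"
    if localization_error_m > max_loc_error_m:
        # Right room chosen, but the robot lost track of where it was.
        return "localization_error"
    if not reached_goal:
        # Right room and good localization, yet the robot never arrived.
        return "planning_failure"
    return "success"


# Placeholder runs: (chosen, target, reached_goal, localization_error_m)
runs = [
    ("bedroom", "bedroom", True, 0.05),
    ("kitchen", "bedroom", True, 0.04),
    ("bedroom", "bedroom", False, 0.60),
    ("bedroom", "bedroom", False, 0.10),
]

tally = Counter(classify_run(*run) for run in runs)
print(tally)
```

Because the checks run in a fixed order, a run with several problems always gets the same label, which keeps the comparison between agents fair.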

Project Variations

Test whether the robot handles material-based instructions better than color-based instructions, such as “room with the wooden chair” versus “room with the red chair.”
Compare free-form language with templated commands to see how much natural language hurts or helps navigation.
Change the apartment complexity by adding more distractor objects, then measure how scene clutter changes success rate.

Learn More

ROS 2 Documentation: Read the official tutorials and package docs on the ROS website to understand robot communication and navigation.

Robotics and Intelligent Machines Category Guide

How to Do Real Robotics and Intelligent Machines Research at Home: A High School Student’s Guide to Free Tools, Affordable Kits, and Public Databases →

To discover more projects, visit the MehtA+ Science Fair Project Discovery Hub →