Vision-Language Robot Navigation with LLaVA
ISEF Category: Robotics and Intelligent Machines
Ready to Turn This Idea Into a Real Project?
This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.
For next steps tailored to your interests, skill level, and timeline, work one-on-one with a MehtA+ mentor. Learn more about MehtA+ Science & Engineering Research Mentorship →
Subcategory: Machine Learning · Difficulty: Advanced · Setup: University Lab · Time: Full Year
The Hook
A robot can hear “go to the room with the most red things” and still end up in the wrong place. The tricky part is not moving the robot; it is linking words, images, and maps in a way that makes sense. That gap is where your project lives. You can test whether a vision-language model really helps a robot navigate better than a simpler vision baseline.
What Is It?
This project asks a robot to follow human-style instructions that are fuzzy, visual, or relative. Instead of a clean command like “turn left,” the robot has to handle phrases like “go to the room with the most red things” or “find the chair near the window.” A vision-language model, like LLaVA, tries to connect the text instruction with what the camera sees. ROS 2 Nav2 then turns that decision into robot motion.
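The hand-off from "the model picked a room" to "the robot drives there" can be sketched as a lookup from room label to goal pose. This is a minimal sketch, not a Nav2 client: the room names, coordinates, and the `room_to_goal` helper are hypothetical, and the output is a plain dictionary that you would still have to copy into whatever goal message your navigation stack expects. The only fixed math is the conversion from a flat heading angle to a quaternion.

```python
import math

# Hypothetical room centers in the map frame: (x meters, y meters, yaw radians).
# Measure these on your own foam-board layout; the numbers are placeholders.
ROOM_GOALS = {
    "kitchen": (1.0, 0.5, 0.0),
    "bedroom": (2.5, 0.5, math.pi / 2),
    "study": (1.0, 2.0, math.pi),
}


def yaw_to_quaternion(yaw):
    """Convert a planar heading (rotation about z) into an (x, y, z, w) quaternion."""
    return (0.0, 0.0, math.sin(yaw / 2.0), math.cos(yaw / 2.0))


def room_to_goal(room_name, frame_id="map"):
    """Turn a room label chosen by the model into a plain-dict goal pose."""
    if room_name not in ROOM_GOALS:
        raise ValueError(f"unknown room: {room_name!r}")
    x, y, yaw = ROOM_GOALS[room_name]
    qx, qy, qz, qw = yaw_to_quaternion(yaw)
    return {
        "frame_id": frame_id,
        "position": {"x": x, "y": y, "z": 0.0},
        "orientation": {"x": qx, "y": qy, "z": qz, "w": qw},
    }


goal = room_to_goal("bedroom")
print(goal)
```

Keeping this lookup identical for both agents means any difference in results comes from which room each model picks, not from how the goal is sent.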
Think of it like giving directions to a friend who has never seen your house. A simple map tells them where walls and doors are, but it does not tell them which room counts as “the red room.” Your robot needs both perception and language understanding. A CLIP-only baseline gives you a simpler comparison point, since CLIP can match text to images but does not reason as deeply about the scene.
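A CLIP-style baseline boils down to scoring each room image against the instruction and picking the best match. The sketch below shows only that scoring step, using cosine similarity on toy three-number vectors. The vectors, room names, and the `pick_room` helper are made up for illustration; in a real run the embeddings would come from the CLIP text and image encoders.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def pick_room(text_embedding, room_embeddings):
    """Return the best-matching room and the full score table."""
    scores = {
        room: cosine_similarity(text_embedding, emb)
        for room, emb in room_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy stand-ins for real embeddings (placeholders, not model output).
text_vec = [0.9, 0.1, 0.0]
rooms = {
    "kitchen": [0.2, 0.9, 0.1],
    "bedroom": [0.8, 0.2, 0.1],
    "study": [0.1, 0.1, 0.9],
}

best_room, scores = pick_room(text_vec, rooms)
print(best_room, scores)
```

Logging the full score table, not just the winner, lets you see later whether a wrong choice was a near tie or a confident mistake.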
Why This Is a Good Topic
This makes a strong science fair topic because you can measure clear outcomes, like success rate, wrong-room rate, and navigation time. You can also compare two approaches, which makes your results stronger than a single demo. The real-world link is easy to explain, since home robots, assistive robots, and warehouse robots all need to understand messy human instructions. You can learn how to design a fair test, build a baseline, and analyze model failure modes.
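The three outcomes named above can be computed from a simple list of trial records. This is a minimal sketch under assumed field names (`agent`, `target`, `reached`, `seconds`); the sample records are invented to show the calculation and are not results.

```python
from statistics import mean

# Invented example records; "reached" is None when the robot never arrived.
trials = [
    {"agent": "llava", "target": "bedroom", "reached": "bedroom", "seconds": 41.2},
    {"agent": "llava", "target": "kitchen", "reached": "study", "seconds": 55.0},
    {"agent": "llava", "target": "study", "reached": None, "seconds": 90.0},
    {"agent": "clip", "target": "bedroom", "reached": "bedroom", "seconds": 38.5},
]


def summarize(trials, agent):
    """Success rate, wrong-room rate, and mean time on successful runs for one agent."""
    runs = [t for t in trials if t["agent"] == agent]
    if not runs:
        raise ValueError(f"no trials logged for agent {agent!r}")
    successes = [t for t in runs if t["reached"] == t["target"]]
    wrong_room = [
        t for t in runs
        if t["reached"] is not None and t["reached"] != t["target"]
    ]
    return {
        "n": len(runs),
        "success_rate": len(successes) / len(runs),
        "wrong_room_rate": len(wrong_room) / len(runs),
        "mean_seconds_on_success": (
            mean(t["seconds"] for t in successes) if successes else None
        ),
    }


llava_summary = summarize(trials, "llava")
print(llava_summary)
```

Averaging time only over successful runs is a design choice: a timeout would otherwise drag the mean toward whatever cutoff you picked.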
Research Questions
- How does a vision-language agent’s success rate change across instructions with color words, object words, and spatial words?
- How much does navigation success differ between the LLaVA agent and a CLIP-only baseline in a foam-board apartment?
- Does the robot perform better when the target room contains one dominant visual cue instead of several competing cues?
- To what extent do instruction length and wording complexity change the robot’s path efficiency?
- Which failure type happens most often: wrong room choice, map localization error, or planning failure?
- How does changing the number of distractor objects affect the agent’s instruction-following accuracy?
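Path efficiency in the questions above needs a concrete definition before you collect data. One possible definition, sketched here, is the ratio of a reference path length to the distance the robot actually traveled, capped at 1. The logged points and the reference length are placeholders; you would take the points from your odometry log and the reference from a measured or planned shortest route.

```python
import math


def path_length(points):
    """Total length of a path given as a list of (x, y) positions in meters."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def path_efficiency(points, reference_length):
    """Reference length divided by traveled length, never above 1.0."""
    if reference_length <= 0:
        raise ValueError("reference_length must be positive")
    actual = path_length(points)
    return reference_length / max(actual, reference_length)


# Placeholder log: the robot drove 3 m east, then 4 m north.
logged = [(0.0, 0.0), (3.0, 0.0), (3.0, 4.0)]
efficiency = path_efficiency(logged, reference_length=5.0)
print(path_length(logged), efficiency)
```

Use the same reference route for both agents on a given instruction, so that a lower score always means a longer detour.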
Basic Materials
- TurtleBot-clone or similar mobile robot platform
- Onboard camera and depth sensor, if available
- Laptop or desktop computer for model inference and logging
- Wi-Fi router or direct network link for robot communication
- Foam board, cardboard, and tape for building a small apartment map
- Colored paper, toy furniture, and printable signs for room cues
- Measuring tape for consistent map layout
- Printed instruction set for testing
- Stopwatch or timestamped log files
- Digital camera or phone for documenting runs
Advanced Materials
- Mobile robot platform with ROS 2 support and wheel odometry
- RGB camera and depth camera, or RGB-D sensor
- GPU workstation for local model inference
- AprilTags or fiducial markers for localization checks
- Laser distance meter for accurate map geometry
- External microphone, if you test spoken instructions later
- Data logging drive or network storage for run traces
- Calibration board for camera alignment checks
- Benchmark object set with standardized colors and shapes
- Spare batteries and charging station for repeated trials
Software & Tools
- ROS 2: Runs robot communication, sensing, and navigation modules.
Experiment Steps
- Define the exact instruction types you want the robot to handle, such as color-based, object-based, and spatial instructions.
- Build a small apartment map with repeated room layouts and clear distractors so you can test perception, not just luck.
- Choose one baseline and one vision-language agent, then keep the navigation stack and map conditions as similar as possible.
- Design success metrics before testing, including task completion, wrong-target rate, travel distance, and time to completion.
- Plan a trial schedule that balances instruction types, room configurations, and repeated runs so one condition does not dominate.
- Decide how you will score errors, log failures, and compare the two models with the same analysis rules.
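The balanced trial schedule from the steps above can be generated before any testing starts. This sketch crosses every agent with every instruction type and layout, repeats each combination the same number of times, and shuffles the order with a fixed seed so the plan is reproducible. The condition names and the `build_schedule` helper are illustrative assumptions.

```python
import itertools
import random
from collections import Counter


def build_schedule(agents, instruction_types, layouts, repeats, seed=0):
    """Return a shuffled list of trials with every condition repeated equally."""
    conditions = list(itertools.product(agents, instruction_types, layouts))
    ordered = [c for c in conditions for _ in range(repeats)]
    random.Random(seed).shuffle(ordered)  # fixed seed keeps the plan reproducible
    return [
        {"trial": i + 1, "agent": a, "instruction_type": t, "layout": l}
        for i, (a, t, l) in enumerate(ordered)
    ]


schedule = build_schedule(
    agents=["llava", "clip"],
    instruction_types=["color", "object", "spatial"],
    layouts=["layout_A", "layout_B"],
    repeats=5,
)

print(len(schedule))
print(Counter(row["agent"] for row in schedule))
```

Shuffling matters because battery level and lighting drift over a session; a random order keeps that drift from lining up with one agent or one instruction type.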
Common Pitfalls
- Letting the foam-board apartment change between trials, which breaks comparison across runs.
- Using instructions that have no single correct target room, which makes failures impossible to score or interpret.
- Mixing navigation bugs with language understanding bugs, which hides the real cause of errors.
- Changing lighting or camera angle between sessions, which shifts visual recognition performance.
- Running too few trials per instruction type, which makes random luck look like a real effect.
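To see how much the trial count matters, you can put an interval around each success rate. The sketch below uses the Wilson score interval, which is one common choice for proportions from small samples; it is an illustration, not the only valid method. The counts in the example are invented.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a proportion; z=1.96 gives roughly 95% coverage."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p + z2 / (2.0 * trials)) / denom
    half_width = (
        z * math.sqrt(p * (1.0 - p) / trials + z2 / (4.0 * trials * trials)) / denom
    )
    return center - half_width, center + half_width


# Same 70% success rate, two different sample sizes (invented counts).
small = wilson_interval(7, 10)
large = wilson_interval(70, 100)
print(small, large)
```

Both runs report the same 70% rate, but the interval from 10 trials is far wider than the one from 100, which is why a small gap between two agents may not mean anything at low trial counts.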
What Makes This Competitive
A stronger project goes beyond a simple success-rate chart. You can separate perception errors from planning errors, then test whether one kind of instruction breaks the system more than another. You can also compare not just LLaVA and CLIP, but different prompt styles, map layouts, or distractor levels. Careful logging, clean controls, and a clear error analysis will make the work feel much more like real robotics research.
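Separating perception errors from planning errors is easier if every trial log records both the room the model chose and the room the robot reached. The sketch below applies one possible rule set in a fixed order. The field names, the 0.3 m tolerance, and the sample records are assumptions; the localization error would need an independent check, such as the fiducial markers listed in the materials.

```python
from collections import Counter


def classify(trial, localization_tolerance_m=0.3):
    """Assign one outcome label per trial, checking the model's choice first."""
    if trial["chosen"] != trial["target"]:
        return "wrong_room_choice"  # the model picked the wrong room
    if trial["reached"] == trial["target"]:
        return "success"
    if trial["localization_error_m"] > localization_tolerance_m:
        return "localization_error"  # right choice, but the robot was lost
    return "planning_failure"  # right choice, good pose, still did not arrive


# Invented example records.
logs = [
    {"target": "bedroom", "chosen": "bedroom", "reached": "bedroom", "localization_error_m": 0.05},
    {"target": "kitchen", "chosen": "study", "reached": "study", "localization_error_m": 0.04},
    {"target": "study", "chosen": "study", "reached": None, "localization_error_m": 0.80},
    {"target": "study", "chosen": "study", "reached": None, "localization_error_m": 0.10},
]

outcome_counts = Counter(classify(t) for t in logs)
print(outcome_counts)
```

Writing the rules down before testing, and applying them identically to both agents, keeps the error analysis from being adjusted after the results are in.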
Project Variations
- Test whether the robot handles material-based instructions better than color-based instructions, such as “room with the wooden chair” versus “room with the red chair.”
- Compare free-form language with templated commands to see how much natural language hurts or helps navigation.
- Change the apartment complexity by adding more distractor objects, then measure how scene clutter changes success rate.
Learn More
- ROS 2 Documentation: Read the official tutorials and package docs on the ROS website to understand robot communication and navigation.
Robotics and Intelligent Machines Category Guide
How to Do Real Robotics and Intelligent Machines Research at Home: A High School Student’s Guide to Free Tools, Affordable Kits, and Public Databases →
To discover more projects, visit the MehtA+ Science Fair Hub →
