Robot Object Naming With Gaze And CLIP

ISEF Category: Robotics and Intelligent Machines

Ready to Turn This Idea Into a Real Project?

This guide was put together with the help of AI research tools to give you a solid starting point. But a competitive science fair project lives in the details: refining your research question, fine-tuning your variables, analyzing your data, and presenting your findings like a seasoned scientist.

Subcategory: Cognitive Systems  ·  Difficulty: Advanced  ·  Setup: University Lab  ·  Time: Full Year

The Hook

A robot can watch a room full of objects and still not know what a mug is. That gap is the whole project. You can test whether active asking, not just passive captions, helps a robot learn faster. If your model learns new words from human-guided questions, you are studying how machines build language from interaction.

What Is It?

This project asks a simple question with a hard answer: can a robot learn object names better when it asks about them? In passive caption-training, a model sees image-text pairs, then tries to match them later. In active learning, the robot gets to point at an item and ask for its name. That changes the data it gets, because the robot can target confusing objects instead of waiting for random examples.

Think of it like a student with flashcards. One student studies a shuffled deck. Another student circles the cards they keep missing and asks for help on those first. The second student often learns faster because the practice is focused. Your robot version can use gaze or camera attention as the pointing signal, then compare how fast it grows its vocabulary against a passive baseline.
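To make the flashcard idea concrete, here is a minimal sketch of one possible asking rule. It assumes your model already gives you a similarity score between each object image and each candidate name. The function names, the object ids, and the scores below are all hypothetical placeholders, not output from a real model. The active rule asks about the object whose top two name guesses are closest together, and the random rule stands in for the shuffled deck.

```python
import math
import random


def softmax(scores, temperature=1.0):
    """Turn raw similarity scores into probabilities that sum to 1."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def margin(probs):
    """Gap between the two highest probabilities; a small gap means the model is unsure."""
    top_two = sorted(probs, reverse=True)[:2]
    return top_two[0] - top_two[1]


def pick_object_to_ask(scores_by_object, strategy="margin", rng=None):
    """Return the id of the object the robot should ask about next."""
    object_ids = sorted(scores_by_object)
    if strategy == "random":  # shuffled-deck baseline
        rng = rng or random.Random()
        return rng.choice(object_ids)
    # Active rule: ask about the object with the smallest top-two margin.
    return min(object_ids, key=lambda oid: margin(softmax(scores_by_object[oid])))


# Hypothetical image-to-name similarity scores (made up for illustration).
scores = {
    "obj_a": [0.90, 0.10, 0.05],  # clearly matches the first name
    "obj_b": [0.41, 0.40, 0.10],  # two names are almost tied
    "obj_c": [0.70, 0.30, 0.20],
}
print(pick_object_to_ask(scores))  # prints obj_b
```

Margin is only one way to define "confusing." Entropy or lowest top confidence are other common choices, and comparing them could be part of your experiment.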

Why This Is a Good Topic

This is a strong science fair topic because you can measure learning speed, accuracy, and sample efficiency. The question connects to real problems in home robots, assistive devices, and interactive AI. You can also build a clear control group, which makes the project testable instead of vague. A student can learn about multimodal learning, transfer learning, and experimental design without needing a wet lab.

Research Questions

  • How does active object asking affect the number of household object names a robot learns compared with passive caption-training?
  • What is the effect of gaze-guided sampling on naming accuracy for visually similar objects?
  • Does fine-tuning CLIP with question-and-answer episodes improve zero-shot recognition of new household objects?
  • To what extent does the diversity of human responses change the robot’s vocabulary growth rate?
  • Which training strategy gives the best tradeoff between labeled examples and final object-name accuracy?
  • How does object category, such as containers, tools, or electronics, affect learning speed in the robot?
  • What is the effect of using confidence-based question selection versus random question selection?

Basic Materials

  • Laptop or workstation with a modern GPU or access to a school server.
  • Webcam or RGB camera for object images.
  • Small tabletop robot, or a camera on a pan-tilt mount.
  • Printed photos of household objects, or real household objects with clear labels.
  • Notebook for logging each interaction episode.
  • Spreadsheet software for tracking vocabulary growth and accuracy.
  • Basic room lighting setup with fixed placement.

Advanced Materials

  • Robot arm or mobile robot with camera and gaze estimation support.
  • Depth camera or stereo camera for better object segmentation.
  • GPU workstation for fine-tuning vision-language models.
  • Pretrained CLIP model and training framework.
  • Object detection model for separating cluttered scenes.
  • Motion capture or eye-tracking system for precise pointing and attention data.
  • Structured dataset of household objects with repeated views and labels.

Software & Tools

  • Python: Runs data collection scripts, model training, and evaluation code.
  • PyTorch: Fine-tunes the vision-language model and compares training strategies.
  • OpenCV: Captures images, tracks objects, and checks image quality.
  • ImageJ: Helps inspect image regions and compare sample snapshots when needed.
  • Roboflow: Organizes labeled object images and exports datasets for training.

Experiment Steps

  1. Define the exact object vocabulary you want the robot to learn, and group items into easy and hard categories.
  2. Choose one active-learning rule for asking questions, then define a passive baseline for comparison.
  3. Plan how you will record each learning episode, including the image, the object name, and the robot’s confidence.
  4. Build a scoring method for vocabulary growth, such as accuracy, new-word count, or learning speed over episodes.
  5. Design controls that separate true learning from simple memorization of object views or room layout.
  6. Decide how you will test generalization on new lighting, new angles, or new instances of the same object.
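Steps 3 and 4 above can be sketched in a few lines of Python. This is one possible layout, not a required one. The `Episode` fields, the file paths, and the definition of a "learned" word (the robot guessed it correctly at least once) are assumptions you should adapt to your own plan.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class Episode:
    episode: int         # running episode number
    image_path: str      # where the captured image was saved
    true_name: str       # name supplied by the human
    predicted_name: str  # robot's guess before hearing the answer
    confidence: float    # robot's confidence in its guess, 0 to 1


def write_log(episodes, handle):
    """Write one CSV row per episode to an open text file or buffer."""
    writer = csv.DictWriter(handle, fieldnames=[f.name for f in fields(Episode)])
    writer.writeheader()
    for ep in episodes:
        writer.writerow(asdict(ep))


def vocabulary_growth(episodes):
    """Cumulative count of distinct names guessed correctly at least once."""
    known = set()
    curve = []
    for ep in sorted(episodes, key=lambda e: e.episode):
        if ep.predicted_name == ep.true_name:
            known.add(ep.true_name)
        curve.append(len(known))
    return curve


def episodes_to_reach(curve, target_words):
    """Number of episodes needed to reach target_words, or None if never reached."""
    for count, words in enumerate(curve, start=1):
        if words >= target_words:
            return count
    return None


# Hypothetical episodes, made up for illustration only.
log = [
    Episode(1, "img/0001.jpg", "mug", "cup", 0.42),
    Episode(2, "img/0002.jpg", "mug", "mug", 0.77),
    Episode(3, "img/0003.jpg", "spoon", "fork", 0.35),
    Episode(4, "img/0004.jpg", "spoon", "spoon", 0.61),
]
buffer = io.StringIO()  # swap for open("episodes.csv", "w", newline="") in practice
write_log(log, buffer)
curve = vocabulary_growth(log)
print(curve)                        # prints [0, 1, 1, 2]
print(episodes_to_reach(curve, 2))  # prints 4
```

Logging the same fields for the active and passive conditions lets you compute both growth curves with identical code, which keeps the comparison fair.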

Common Pitfalls

  • Training and testing on the same object photos, which inflates accuracy without real learning.
  • Letting room lighting change across sessions, which shifts color and hurts visual matching.
  • Using too few object categories, which makes vocabulary growth look bigger than it really is.
  • Mixing up object names for similar items, which creates noisy labels and weak comparisons.
  • Skipping a passive baseline, which leaves you unable to tell whether active asking helped at all.
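The first pitfall is easy to hit by accident when several photos show the same physical object. One way to guard against it is to split by object instance instead of by photo, so every photo of a held-out object stays out of training. The sketch below assumes each photo is stored as a tuple of image path, instance id, and name, and that instance ids are unique across names. The file names and ids are hypothetical.

```python
import random
from collections import defaultdict


def split_by_instance(photos, seed=0):
    """Hold out one whole object instance per name for testing.

    photos: list of (image_path, instance_id, name) tuples.
    Names with only one instance stay entirely in the training set.
    """
    instances_by_name = defaultdict(set)
    for _path, instance_id, name in photos:
        instances_by_name[name].add(instance_id)

    rng = random.Random(seed)  # fixed seed so the split is repeatable
    held_out = set()
    for name in sorted(instances_by_name):
        ids = sorted(instances_by_name[name])
        if len(ids) >= 2:
            held_out.add(rng.choice(ids))

    train = [p for p in photos if p[1] not in held_out]
    test = [p for p in photos if p[1] in held_out]
    return train, test


# Hypothetical photo list: two mugs and two spoons, several views each.
photos = [
    ("mug_blue_01.jpg", "mug_blue", "mug"),
    ("mug_blue_02.jpg", "mug_blue", "mug"),
    ("mug_red_01.jpg", "mug_red", "mug"),
    ("mug_red_02.jpg", "mug_red", "mug"),
    ("spoon_metal_01.jpg", "spoon_metal", "spoon"),
    ("spoon_wood_01.jpg", "spoon_wood", "spoon"),
]
train, test = split_by_instance(photos)
train_ids = {p[1] for p in train}
test_ids = {p[1] for p in test}
print(train_ids & test_ids)  # prints set(): no instance appears on both sides
```

Because every tested name still has at least one instance in training, the test set measures whether the robot recognizes a new mug, not whether it remembers a specific photo.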

What Makes This Competitive

A stronger version of this project does more than compare two models. You can test several question-selection strategies, then measure not just accuracy, but learning efficiency and generalization. You can also separate object similarity, label ambiguity, and lighting changes to see what actually helps the robot learn. That kind of careful analysis makes the project much more than a demo.
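Learning efficiency needs a number you can compare across strategies. Two simple candidates are the normalized area under the accuracy-versus-labels curve and the number of labels needed to reach a target accuracy. The sketch below shows both. The accuracy values are made-up placeholders that only demonstrate the arithmetic; they are not results, and the function names are hypothetical.

```python
def area_under_curve(labels_used, accuracy):
    """Trapezoid-rule area under accuracy vs. labels, divided by the label range.

    The result lies between 0 and 1 when accuracy does; higher means the
    strategy was accurate earlier, with fewer labels.
    """
    if len(labels_used) != len(accuracy) or len(labels_used) < 2:
        raise ValueError("need two or more matching points")
    area = 0.0
    for i in range(1, len(labels_used)):
        width = labels_used[i] - labels_used[i - 1]
        area += width * (accuracy[i] + accuracy[i - 1]) / 2
    return area / (labels_used[-1] - labels_used[0])


def labels_to_reach(labels_used, accuracy, target):
    """Smallest label count at which accuracy meets the target, or None."""
    for count, acc in zip(labels_used, accuracy):
        if acc >= target:
            return count
    return None


# Placeholder numbers for illustration only (not measured data).
labels = [0, 10, 20, 30, 40]
active = [0.20, 0.50, 0.70, 0.80, 0.85]
passive = [0.20, 0.35, 0.50, 0.65, 0.75]

print(round(area_under_curve(labels, active), 5))   # prints 0.63125
print(round(area_under_curve(labels, passive), 5))  # prints 0.49375
print(labels_to_reach(labels, active, 0.70))        # prints 20
print(labels_to_reach(labels, passive, 0.70))       # prints 40
```

Reporting these metrics over several repeated runs, with a spread such as a standard deviation, makes the comparison between strategies much more convincing than a single pair of curves.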

Project Variations

  • Test whether the robot learns faster from close-up object crops than from full-room images.
  • Compare human answers given as names, short descriptions, or category labels.
  • Measure whether the robot learns household tools better than food containers or toys.

Learn More

  • MIT OpenCourseWare: Search for courses on computer vision, machine learning, and human-robot interaction to build your background.
  • Stanford Online Materials: Search lecture notes on deep learning, vision-language models, and active learning from free course pages.
  • PubMed: Search review articles on interactive learning and multimodal perception in assistive robotics.
  • arXiv: Search recent preprints on CLIP fine-tuning, active learning, and robot language grounding.
  • IEEE Xplore Abstracts: Use abstract searches to find papers on object naming, robot learning, and embodied AI.