Seeing with Humans: Gaze-Assisted Neural Image Captioning

Human gaze data is used in a variety of computer-vision applications, because gaze reflects how humans process visual scenes. Recently, researchers from the Max Planck Institute for Informatics in Saarbrücken investigated whether gaze data can also benefit scene-centric tasks such as image captioning. They presented a new perspective on gaze-assisted image captioning with neural networks: gaze data from human viewers is used to help caption images automatically, depending on where the gaze is focused.

Using the SALICON dataset, they compared the localization capability of state-of-the-art object and scene recognition models with human gaze. They then proposed a novel image-captioning model that integrates gaze information into a long short-term memory (LSTM) architecture, a class of recurrent neural networks, with an attention mechanism. In this model, human gaze is represented as a set of fixations, the specific locations where viewers focused the most, and the model allocates its machine attention to both fixated and non-fixated regions.
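To make the idea of attending to both fixated and non-fixated regions more concrete, here is a minimal sketch in Python with NumPy. It is not the researchers' actual formulation. The function names, the per-region fixation values in [0, 1], and the two weights `w_fix` and `w_nonfix` are all assumptions made for illustration. In a real captioning model the relevance scores would come from the LSTM state at each decoding step, and the weights would be learned rather than fixed.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def gaze_assisted_attention(scores, fixation, w_fix=1.0, w_nonfix=0.5):
    """Combine machine relevance scores with human fixation density.

    scores   : (R,) relevance score per image region (assumed to come
               from the captioning model at the current decoding step)
    fixation : (R,) fixation density per region, in [0, 1]
    w_fix, w_nonfix : hypothetical positive weights for fixated and
               non-fixated regions (illustrative, not from the paper)
    """
    scores = np.asarray(scores, dtype=float)
    fixation = np.asarray(fixation, dtype=float)
    # Per-region gate: fixated regions get w_fix, non-fixated regions
    # get w_nonfix, so non-fixated regions are never fully ignored.
    gate = w_fix * fixation + w_nonfix * (1.0 - fixation)
    # Adding log(gate) makes the attention proportional to gate * exp(score).
    return softmax(scores + np.log(gate + 1e-8))


def attend(features, scores, fixation):
    """Return the attention-weighted context vector and the weights."""
    alpha = gaze_assisted_attention(scores, fixation)
    context = alpha @ np.asarray(features, dtype=float)
    return context, alpha


# Toy example: four regions with equal machine scores, one fixated region.
features = np.eye(4)
scores = np.zeros(4)
fixation = np.array([1.0, 0.0, 0.0, 0.0])
context, alpha = attend(features, scores, fixation)
```

In this toy setup the fixated region receives roughly 0.4 of the attention and each non-fixated region roughly 0.2, so gaze shifts the focus without suppressing the rest of the scene.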

Human gaze fixations can help object-recognition models distinguish important objects from unimportant ones, but the model also needs to attend to objects that do not attract viewers' gaze. Human gaze information can therefore support holistic image understanding and captioning tasks.
