Visage Technologies
Face Tracking & Analysis


A computer reads a story

Jul 1, 2016

Sequencing is a task for children that aims to improve their understanding of the temporal order of events: given a set of images (sometimes with captions), the child sorts them into a coherent story. Researchers from Virginia Tech and TTI-Chicago have proposed a machine-learning analogue of this task: given a jumbled set of aligned image-caption pairs that belong to a story, the computer must sort them so that they form a coherent story.

Image sequencing
They used stories from the Sequential Image Narrative Dataset, in which each story consists of 5 aligned image-caption pairs, and trained machine-learning models to sort jumbled input stories. Having proposed this task of visual story sequencing, they implemented two approaches to solve it: the first looks at individual story elements to predict each element's position, while the second looks at pairs of story elements to predict their relative order.
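To make the pairwise approach concrete, here is a minimal sketch (not the authors' implementation) of how pairwise relative-order predictions can be turned into a full ordering: with only 5 elements per story, one can exhaustively score every permutation against the model's pairwise preferences. The `pairwise_score` values below are hypothetical stand-ins for a trained model's outputs.

```python
from itertools import permutations

# Hypothetical pairwise scores: pairwise_score[(i, j)] is a model's
# confidence that story element i should appear before element j.
pairwise_score = {
    (0, 1): 0.9, (1, 0): 0.1,
    (0, 2): 0.8, (2, 0): 0.2,
    (1, 2): 0.7, (2, 1): 0.3,
}

def best_order(n, score):
    """Exhaustively search all orderings of n elements and return the
    one whose pairwise precedences earn the highest total score.
    Feasible here because a story has only 5 elements (5! = 120)."""
    def total(order):
        return sum(score[(a, b)]
                   for idx, a in enumerate(order)
                   for b in order[idx + 1:])
    return max(permutations(range(n)), key=total)

print(best_order(3, pairwise_score))  # -> (0, 1, 2)
```

Exhaustive search is only practical because stories are short; for longer sequences one would need an approximate ordering method.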

These two approaches were also combined in a voting scheme that outperformed each individual method. To represent story elements, they used text-based features from the captions together with image-based features, and showed that the two provide complementary improvements. The best-performing model was a voting scheme that combined pairwise predictions from Skip-Thought (a model that encodes a sentence so as to predict the sentences around it), convolutional neural networks (feed-forward neural networks whose connectivity pattern is inspired by the organization of the animal visual cortex), and multilayer perceptrons (feed-forward networks of multiple fully connected layers that map inputs to outputs), along with neural position embeddings; it predicted the ordering of sentences to within an average distance error of 0.8 (out of 5) positions.
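One natural reading of the distance-error figure is the average absolute difference between each element's predicted position and its true position; the sketch below illustrates that computation (the exact metric definition used in the paper may differ, so treat this as an assumption).

```python
def mean_position_distance(predicted, gold):
    """Average absolute difference between each element's position in
    the predicted ordering and its position in the true ordering."""
    pos_pred = {elem: i for i, elem in enumerate(predicted)}
    pos_gold = {elem: i for i, elem in enumerate(gold)}
    return sum(abs(pos_pred[e] - pos_gold[e]) for e in gold) / len(gold)

# Example: one swap of adjacent elements in a 5-element story.
print(mean_position_distance([0, 1, 3, 2, 4], [0, 1, 2, 3, 4]))  # -> 0.4
```

Under this reading, an error of 0.8 means each element lands, on average, less than one position away from its true place in the story.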


