The hippogriff gently nudges Harry: learning action from movie datasets

Similar human actions in different movies

Human action recognition is one of the key areas in computer vision because of its applications in video surveillance, video retrieval and human-machine interaction. The main problem in all of these applications is getting the computer to recognize both low-level and high-level activities. Computer vision and machine learning techniques have improved over the years, bringing more accurate eye, face and head tracking and motion capture, which can be seen in animated movies and the Hollywood industry too. At the same time, the amount of recorded video of human activity has grown as well, and these recordings can be useful for marketing research (for example, how consumers move through a store and where they stay the longest), for business surveillance, or as datasets that let biologists, sociologists and psychologists observe human motion and overall action.

Human activities can be categorized into four levels: gestures, actions, interactions, and group activities. A typical system consists of models for feature extraction, action learning and classification, and action recognition and segmentation. Basically, there are three steps: detection, tracking, and recognition. For example, to recognize the activity of shaking hands, arms and hands are detected first, then a spatial and temporal description is built from the tracking results, and finally that description is compared with the patterns in the training data to determine the action type.

There are lots of publicly available datasets for human action studies, such as the KTH dataset (walking, jogging, running, boxing, hand waving and hand clapping), the Weizmann dataset (running, walking, skipping, jumping-jack, jumping-forward-on-two-legs, jumping-in-place-on-two-legs, galloping sideways, waving-two-hands, waving-one-hand and bending), the IXMAS dataset (checking watch, crossing arms, scratching head, sitting down, getting up, turning around, walking, waving, punching, kicking, pointing, picking, overhead throwing and bottom-up throwing), the HOHA (Hollywood human action) datasets, and many others. The HOHA1 dataset consists of video samples from 32 movies covering eight actions: answering phone, getting out of a car, hand shaking, hugging, kissing, sitting down, sitting up, and standing up. The HOHA2 dataset contains video samples from 69 movies covering 12 actions (answering phone, getting out of a car, hand shaking, hugging, kissing, sitting down, sitting up, standing up, driving car, eating, fighting, and running) and 10 classes of scenes.
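To make the detection, tracking and recognition steps a bit more concrete, here is a minimal Python sketch of the last two. It assumes detection and tracking have already produced a track of hand positions, one (x, y) point per frame. The descriptor, the function names `describe` and `recognize`, and the toy tracks are all made up for illustration; real systems use far richer features than this.

```python
import math

def describe(track):
    """Build a tiny spatio-temporal descriptor from a track.

    track: list of (x, y) hand positions, one per frame (at least 3 points).
    Returns (mean dx, mean dy, mean speed, share of vertical reversals).
    """
    steps = [(x2 - x1, y2 - y1) for (x1, y1), (x2, y2) in zip(track, track[1:])]
    n = len(steps)
    mean_dx = sum(dx for dx, _ in steps) / n
    mean_dy = sum(dy for _, dy in steps) / n
    mean_speed = sum(math.hypot(dx, dy) for dx, dy in steps) / n
    # A reversal is a change of sign in vertical motion between two steps,
    # which is what an up-and-down handshake produces.
    reversals = sum(1 for (_, a), (_, b) in zip(steps, steps[1:]) if a * b < 0)
    return (mean_dx, mean_dy, mean_speed, reversals / n)

def recognize(track, training_set):
    """Return the label of the training track whose descriptor is closest."""
    query = describe(track)
    label, _ = min(
        ((lbl, math.dist(query, describe(example))) for lbl, example in training_set),
        key=lambda pair: pair[1],
    )
    return label

# Toy training patterns (invented): an up-and-down motion and a steady reach.
training_set = [
    ("hand shaking", [(0, 0), (0, 2), (0, 0), (0, 2), (0, 0)]),
    ("reaching", [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]),
]

unknown_track = [(5, 5), (5, 7), (5, 5.5), (5, 7), (5, 5)]
print(recognize(unknown_track, training_set))
```

The nearest-pattern comparison is the simplest possible stand-in for "compare the description with existing patterns"; the studies mentioned in this post train proper classifiers instead.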

Researchers have reported a best result of 100% accuracy on the Weizmann dataset and a best accuracy of 97.6% on the KTH dataset. Surprisingly, machine learning performed poorly on the movie datasets, achieving only 56.8% and 58.3% on HOHA1 and HOHA2 respectively. Even though movies nowadays tend to be as realistic as possible, without exaggerated actions, movements and emotions, it seems that a realistic movie set is still too artificial for a computer to learn realistic movement from it. However, recent work tries to improve on this by using every resource a movie offers. For example, movie scripts provide text descriptions of the scenes, characters and human actions, so temporal alignment can be used to align the speech sections and to transfer their time information to the scene descriptions.
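The script alignment idea can be sketched in a few lines of Python. The sketch assumes the timing comes from subtitles (an assumption, since the text above only says that speech sections are aligned), matches script dialogue to subtitle lines with a simple word-overlap score, and then gives each scene description the time gap between its neighbouring matched lines. The function names, the threshold and the sample lines are invented for illustration, and a greedy match like this is much cruder than what a real system would use.

```python
import re

def words(text):
    """Lower-cased set of words in a line of text."""
    return set(re.findall(r"[a-z']+", text.lower()))

def overlap(a, b):
    """Jaccard word overlap between two lines (0.0 to 1.0)."""
    wa, wb = words(a), words(b)
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def align(script, subtitles, threshold=0.5):
    """Transfer subtitle times to scene descriptions.

    script: ordered list of (kind, text), kind is "speech" or "scene".
    subtitles: ordered list of (start_sec, end_sec, text).
    Returns a list of (scene_text, start_sec, end_sec).
    """
    # Step 1: match each speech line to its best subtitle, moving forward only.
    times = [None] * len(script)
    next_sub = 0
    for i, (kind, text) in enumerate(script):
        if kind != "speech":
            continue
        best = max(range(next_sub, len(subtitles)),
                   key=lambda j: overlap(text, subtitles[j][2]),
                   default=None)
        if best is not None and overlap(text, subtitles[best][2]) >= threshold:
            times[i] = subtitles[best][:2]
            next_sub = best + 1
    # Step 2: a scene description lies between the nearest matched speech lines.
    aligned = []
    for i, (kind, text) in enumerate(script):
        if kind != "scene":
            continue
        before = next((times[j] for j in range(i - 1, -1, -1) if times[j]), None)
        after = next((times[j] for j in range(i + 1, len(script)) if times[j]), None)
        if before and after:
            aligned.append((text, before[1], after[0]))
    return aligned

# Invented example lines, not taken from any real script.
script = [
    ("speech", "Are you sure about this?"),
    ("scene", "Anna stands up and shakes his hand."),
    ("speech", "I have never been more sure."),
]
subtitles = [
    (12.0, 14.5, "Are you sure about this?"),
    (19.0, 21.0, "I've never been more sure."),
]
print(align(script, subtitles))
```

In this toy case the handshake description ends up with the interval between the two matched subtitle lines, which is the kind of rough label that can then be used to cut a training clip for the "hand shaking" action.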


MPII-MD description example



New research focuses on assistive technologies as well. For example, the Max Planck Institute’s Computer Vision and Multimodal Computing center proposed a new MPII Movie Description dataset (MPII-MD), which features movie snippets aligned to scripts and to descriptive video service (DVS). DVS is a linguistic description, conceived as an assistive technology, that allows visually impaired people to follow a movie through scene and people descriptions that better characterize the environment. The Max Planck researchers benchmarked computer vision algorithms for recognizing different scenes, human activities and actions, and various objects, and they achieved much better results than previous studies. The MPII-MD dataset contains a parallel corpus of over 68,000 sentences and video snippets from 94 HD movies.

Max Planck researchers' approach: training the visual classifiers, concatenating the scores of selected robust classifiers and using them as the input for the LSTM neural network


This dataset includes audio descriptions (AD), which make movies accessible to millions of blind and visually impaired people all over the world by providing an audio narrative of the most important aspects of the visual information, namely actions, gestures, scenes, and character appearance. AD is prepared by trained describers and read by professional narrators, and more and more movies are being audio-described every day. However, it’s a difficult task: describing a 2-hour movie may take up to 60 person-hours. As a result, only a small subset of movies and TV programs is available to the blind, so if you’re up for a challenge, automating this would be a noble task. Generating video descriptions requires combining knowledge from computational linguistics and computer vision.

Because of this and similar challenges, automatic description of visual content has received lots of attention in research and in online communities over the last couple of years. Along with the MPII-MD dataset, there is the Montreal Video Annotation Dataset (M-VAD), another useful large-scale dataset. Many of the proposed methods for image captioning rely on pre-trained object classifier convolutional neural networks (artificial neural networks in which individual neurons are tiled so that they respond to overlapping regions of the visual field) and long short-term memory (LSTM) recurrent networks (artificial neural networks built from LSTM blocks: smart network units that can remember a value for an arbitrary length of time and that determine when an input is significant enough to remember and when to forget the value).
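To show what "remembering a value for an arbitrary length of time" means in practice, here is a minimal sketch of a single LSTM unit in plain Python, using the common formulation with forget, input and output gates. It works on scalars instead of vectors, and the weights are hand-picked for the demonstration, not learned, so treat it as an illustration of the mechanism and not as the network used in any of the studies above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One time step of a single-unit LSTM with scalar input x.

    params maps a gate name to (input weight, recurrent weight, bias).
    Returns the new hidden state h and the new cell state c.
    """
    def pre(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))       # how much of the old cell value to keep
    i = sigmoid(pre("input"))        # how much of the new candidate to write
    o = sigmoid(pre("output"))       # how much of the cell value to expose
    g = math.tanh(pre("candidate"))  # the new candidate value
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c

def run(sequence, params):
    """Feed a sequence through the unit and return the final cell state."""
    h, c = 0.0, 0.0
    for x in sequence:
        h, c = lstm_step(x, h, c, params)
    return c

# Hand-picked weights (assumptions): the input gate opens only for large
# inputs, and the forget gate stays almost fully open.
remembering = {
    "forget": (0.0, 0.0, 10.0),
    "input": (10.0, 0.0, -5.0),
    "output": (0.0, 0.0, 10.0),
    "candidate": (5.0, 0.0, 0.0),
}
# Same unit, but with the forget gate almost closed.
forgetting = dict(remembering, forget=(0.0, 0.0, -10.0))

signal = [1.0, 0.0, 0.0, 0.0, 0.0]  # one significant input, then silence
print(run(signal, remembering))  # cell state stays close to 1
print(run(signal, forgetting))   # cell state drops close to 0
```

With the forget gate open, the value written at the first step survives the four quiet steps almost unchanged; with the gate closed, it is gone after one step. In a trained network these gate weights are learned from data, which is how the unit decides what is worth remembering.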


Audio description examples for Harry Potter and the Prisoner of Azkaban


Despite the recent advances in the video description domain, performance on MPII-MD and M-VAD remains relatively low. The factors that contribute to higher performance include the presence of frequent words, sentence length and simplicity, and the presence of “visual” words such as “nod”, “walk”, “sit” and “smile”. A high bias in the data towards humans as subjects and towards verbs similar to “look” has been observed as well. Future work has to focus on dealing with less frequent words and handling less visual descriptions. Possible strategies include drawing on external text corpora, using other modalities (audio, dialogue), and looking across multiple sentences, which would allow better understanding and description of the story of the movie.