No more drama! Multimodal face-recognition techniques

Today's facial recognition algorithms still struggle with extreme facial expressions, especially the dramatic ones that appear in multimedia applications, social networks and digital entertainment. Dramatic poses, varying illumination and a wide range of expressions all make it harder for computer algorithms to track and recognize faces. One niche full of extreme facial gestures is sports – victory celebrations and the strained expressions seen during demanding games such as tennis present a challenge for modern machine learning and computer vision algorithms.

Social media images pose two main problems: first, these photographs are usually taken in real-life conditions, with varying lighting and little control over the scene; and second, the sheer number of images forms a huge database on which many algorithms tend to perform slowly. Researchers publishing with the IEEE have recently tackled these issues and proposed a deep learning framework that jointly learns a face representation from multimodal information. The proposed structure consists of convolutional neural networks and a stacked auto-encoder (one of the main types of deep networks, built from a stack of auto-encoders, which learns a compressed, distributed encoding/representation of a given set of data).
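To make the idea of a stacked auto-encoder more concrete, here is a minimal sketch in PyTorch. The layer sizes, activations and training step below are illustrative assumptions, not the configuration used in the paper; the point is only that the encoder squeezes a high-dimensional feature vector into a compact code, and the decoder is trained to reconstruct the input from that code.

```python
# Minimal stacked auto-encoder sketch (layer sizes are hypothetical,
# not taken from the paper).
import torch
import torch.nn as nn

class StackedAutoEncoder(nn.Module):
    def __init__(self, in_dim=2048, hidden_dims=(1024, 512, 256)):
        super().__init__()
        # Encoder: progressively compresses the high-dimensional feature vector.
        enc_layers, d = [], in_dim
        for h in hidden_dims:
            enc_layers += [nn.Linear(d, h), nn.Sigmoid()]
            d = h
        self.encoder = nn.Sequential(*enc_layers)
        # Decoder: mirrors the encoder so the network can be trained to
        # reconstruct its own input (the usual auto-encoder objective).
        dec_layers = []
        for h in reversed((in_dim,) + hidden_dims[:-1]):
            dec_layers += [nn.Linear(d, h), nn.Sigmoid()]
            d = h
        self.decoder = nn.Sequential(*dec_layers)

    def forward(self, x):
        code = self.encoder(x)       # compact representation
        recon = self.decoder(code)   # reconstruction of the input
        return code, recon

# One training step: minimise reconstruction error on a toy batch.
model = StackedAutoEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
features = torch.rand(32, 2048)      # stand-in for concatenated CNN features
opt.zero_grad()
code, recon = model(features)
loss = nn.functional.mse_loss(recon, features)
loss.backward()
opt.step()
```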

The next step is to extract these features and concatenate them into a high-dimensional feature vector, whose dimensionality the stacked auto-encoder then compresses. The networks were trained to learn these facial features on CASIA-WebFace. This dataset, developed at the Center for Biometrics and Security Research, is a large-scale collection of 494,414 images of 10,575 subjects. The researchers used a subset of 9,000 images to train the network to recognize these features, and the rest was used for testing – to see whether the networks had successfully learned how to recognize faces. For those outside the field of computer vision, this is comparable to learning from a subset of class material and then being tested by the teacher on a different set with similar content. The Labeled Faces in the Wild database was used for verification, and the model achieved a 98.43% verification rate.
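The fusion and verification stage can be pictured with a short sketch as well. The feature dimensions, the random "encoder" standing in for a trained stacked auto-encoder, and the similarity threshold below are all hypothetical; they only illustrate the general recipe of concatenating per-modality features, compressing them into a signature, and comparing two signatures with cosine similarity (a common choice in face verification, not necessarily the exact metric used by the researchers).

```python
# Sketch of the fusion / verification stage with hypothetical dimensions.
import numpy as np

def build_signature(cnn_features, encoder):
    """Concatenate per-modality CNN features and compress them with the
    (already trained) stacked auto-encoder's encoder."""
    high_dim = np.concatenate(cnn_features)   # e.g. holistic + patch features
    return encoder(high_dim)                  # compact face signature

def verify(sig_a, sig_b, threshold=0.5):
    """Decide whether two signatures belong to the same person
    using cosine similarity."""
    cos = np.dot(sig_a, sig_b) / (np.linalg.norm(sig_a) * np.linalg.norm(sig_b))
    return cos >= threshold

# Toy usage with a random projection standing in for the trained encoder.
rng = np.random.default_rng(0)
proj = rng.standard_normal((2048, 256))
encoder = lambda x: x @ proj
sig1 = build_signature([rng.random(1024), rng.random(1024)], encoder)
sig2 = build_signature([rng.random(1024), rng.random(1024)], encoder)
print(verify(sig1, sig2))
```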

Multimodal deep face representation consists of two steps:
1. multimodal feature extraction using convolutional neural networks
2. feature-level fusion of these features using stacked auto-encoders

Most face-recognition algorithms extract only a single representation from the available face image, although recent research has also focused on extracting holistic-level features. The researchers first extract multimodal features from the holistic face image and use a 3D model to render a frontal view of the face. Together with features from a number of image patches, the stacked auto-encoders mentioned above compress this information into a compact face signature.
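As a rough picture of what "a number of image patches" means in practice, the sketch below splits an aligned face crop into a regular grid of patches. The grid layout and patch sizes are assumptions for illustration; the paper's actual sampling scheme may differ.

```python
# Grid-based patch sampling from an aligned face image (illustrative only).
import numpy as np

def sample_patches(face, grid=(3, 3)):
    """Split an aligned face image (H x W x C) into a grid of patches."""
    h, w = face.shape[:2]
    ph, pw = h // grid[0], w // grid[1]
    patches = []
    for i in range(grid[0]):
        for j in range(grid[1]):
            patches.append(face[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw])
    return patches

face = np.zeros((96, 96, 3), dtype=np.uint8)   # stand-in for an aligned face crop
patches = sample_patches(face)
print(len(patches), patches[0].shape)          # 9 patches of 32 x 32 x 3
```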

Patch sampling (2D & 3D interactions)

This combination of techniques yielded highly accurate results and can be used in future image recognition, detection and tracking systems, especially those operating in non-ideal, real-world conditions. Research on social media images and datasets is still taking its first steps, but computer vision and machine learning studies like this one will give it a huge boost.
