What dress goes with these shoes? Ask a computer!

Cornell clothing combination framework

Today’s world of users active on social media produces millions of photos every day, and a subset of these includes a large number of clothing and accessories. Computer vision researchers decided to use machine learning algorithms to answer the old question: “What outfit goes well with this pair of shoes?”. The required concept is learning visual similarity and the notion of compatibility across categories. Cornell university researchers’ have devised a new framework to learn a feature transformation from images of items into a latent space that expresses compatibility. The compatibility was modeled on the co-occurrence in large-scale user behavior data, focused on co-purchased data from Amazon.com.

This novel learning framework allows learning a feature transformation from the images of the items to a latent space, called the style space. That way images of items from different categories, which are matched together, are close in the style space, and those that are not matched are far apart in it. The challenge in retrieving bundles of compatible objects lies in the fact that they come from visually distinct categories: for example, white shirt and black pants can be similar in the style space even though they come from visually distinct categories.

The illustration of the proposed style-categorization framework

The illustration of the proposed style-categorization framework


The proposed framework consists of four parts. The first one includes the input data: item images, category labels and links between categories that describe co-occurrences. Second, to learn the style across different categories (cross-category fit), training examples are sampled from the input data to chose co-occurring heterogeneous dyads, i.e. two elements of a pair belonging to different high-level categories that frequently co-occur. The third part includes the use of Siamese Convolutional Neural Network architecture, where training examples were pairs of items that are either compatible or incompatible. With Siamese CNNs feature transformations from the image space are learned for the latent style space. In the end, structured bundles of compatible items are created by querying the learned latent space and retrieving the nearest neighbors from each category to the query item.

The evaluation of the learning framework was made on a large-scale dataset from Amazon.com, collected by McAuley et al. Cornell researchers have observed in their experiments that the learned style space expresses extensive semantic information about visual clothing style, and that the proposed framework outperforms most of the modern approaches. Two key concepts of the proposed sampling approach are the mentioned heterogeneous dyads and co-occurrences. Heterogeneous dyads are pairs where the two elements come from different categories, and we’re trying to do a cross-category fit. On the other hand, co-occurrence refers to elements occurring together. Items from the same category are generally visually very similar to each other and items from different ones are not. Convolutional neural networks tend to map visually similar items close in the output feature space, but to discourage this and to learn about the cross-category fit, the researchers have enforced all close training pairs to be heterogeneous dyads, and this helped pulling together items from different categories that are visually dissimilar but match in style. On the other hand, negative or distant pairs can include both: two items within the same category or from two different categories, to help separate visually similar items from the same category, which have different style.

Outfit generation included sampling of meaningful elements, for example ‘dress’ and ‘shirt’ did not get joined together, since they both cover the upper-body parts. One challenge with large-scale datasets, like the Amazon dataset used, is label noise, for example, there are shirts that are falsely labeled as shoes. To address this challenge the researchers introduced a robust nearest neighbor lookup method which enabled them to robustly sample outfits and ignore images with false labels. To continue, the stylistic insights the network learned were visualized by clustering the space for each category, and then for each category pair, they retrieved the closest clusters in the style space (outfits that should be worn together), and the farthest ones (that should not be worn together).

To further improve the framework, an online user study was conducted to try to understand how users think about style and compatibility, and compare this proposed learning framework against baselines. The questions like Given this pair of shoes, which of the two presented shirts fits better helped with predictions, and the additional questions regarding the reasons why they chose the specific combination produced interesting results. For example, users tend to choose the option that is functional (like long socks for boots); users sometimes choose the stylistically similar item which isn’t stylistically compatible; and lastly, users sometimes pick the item they like more, even though it isn’t the best match. Therefore, even the most advanced computer vision techniques and machine learning algorithms can’t quite predict human taste and subjective reason, which are often either unconscious or motivated by prior experience.