From Skynet to robots on Mars: computer vision overview


Think about it: Mars is currently the only known planet populated entirely by robots! And after decades of research in machine learning, these robots are able to “see” the environment that surrounds them, move around freely while avoiding obstacles, and gather data with their sensors in order to find adequate samples.

Curiosity using computer vision to move around the Martian surface



Computer vision is an interdisciplinary field of computer science that acquires and analyzes data from the real world in order to produce numerical or symbolic information. In other words, computers try to process high-dimensional data the way people use their vision to perceive images, faces and similar inputs. A vast part of the discipline therefore consists of models for completing these tasks, usually with the help of advances in machine learning, which in turn draws on mathematics (especially geometry), physics, statistics, cognitive science and neuroscience. Since a computer’s ability to see or perceive is an attempt to simulate the human brain, many techniques, algorithms and models have drawn their inspiration from major findings in neuroscience.


Let there be light!


Computer vision has many sub-disciplines, such as face/head detection and tracking, object recognition and pose estimation, video tracking, scene reconstruction, learning, motion capture and estimation, image restoration, and many others. Artificial intelligence can be seen as its mother discipline, since it brings together machine learning and computer vision in order to acquire a deep understanding of the environment. Sometimes philosophical issues arise as well, since one can ask whether a computer can really see or realize what’s going on. For example, Searle’s famous thought experiment, the Chinese Room, describes a man who does not know Chinese but has complete instructions such as “when you get this Chinese symbol, reply with that Chinese symbol”. Would we say that this man knows Chinese? Certainly not. Even though it may seem so to an outside observer, most people would deny that he understands the language. A similar issue is observed in AI: could we say that a computer understands anything if it’s only following instructions? Do the Curiosity rover or the New Horizons spacecraft really see their environment, or are they just following man-made instructions? Is Deep Thought really having deep thoughts when it’s playing chess? This is the problem of strong AI: many researchers argue that a complete, human-like understanding of the world around us may be impossible for computers, since there’s always a missing link of not really understanding what’s going on.


Advances in neuroscience, especially in neurobiology, provide significant data for computer vision, since most of its models and methods rely on the study of human vision. Image sensors, which detect electromagnetic radiation, are built on technology that stems from our understanding of quantum physics. The main objective is to track and study light, and a complete understanding of light is impossible without modern physics, where light and particles are usually in the focus of research. From Einstein’s theory of relativity we know that there is a maximal velocity, and that’s the speed of light. Einstein’s famous equation E = mc^2 says that energy equals mass times the speed of light squared, i.e. mass is physically equivalent to energy. That’s how stars produce energy: they fuse hydrogen in their cores into helium, and a certain part of their mass becomes energy. Quantum physics, on the other hand, tells a story about the subatomic level, where particles don’t behave as predictably as it may seem. Its major discoveries tell us that we can only describe the states of particles using probabilities and statistics, and that the world is not as exact as we thought it was. That’s the reason Einstein had objections to quantum mechanics until his death: he believed that physics should be governed by exact laws, and that a complete description of the world should not need probabilities. Computer vision nowadays uses image sensors that are designed using quantum physics, and the way light interacts with various surfaces is explained by it as well. Nota bene, Einstein received the 1921 Nobel Prize for his explanation of the photoelectric effect, which describes how light interacts with certain surfaces, i.e. how metals emit electrons when you shine light on them.

Signal processing

Signal processing example


Neuroscience and computer vision also meet in signal processing, which deals with the theory and applications of transferring information between different physical and abstract systems. Mathematical and statistical methods are used to formalize, represent and analyze inputs and outputs. Signal processing is especially important in audio, speech, image and video processing.
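
To make the idea of analyzing a signal with simple mathematics a bit more concrete, here is a minimal sketch of one of the most basic signal processing operations, a moving-average filter that smooths out noise. The function name, the window size and the sample values are made up for illustration; real audio or image pipelines use far more sophisticated filters.

```python
def moving_average(signal, window=3):
    """Smooth a 1D signal by averaging each sample with its neighbours."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    smoothed = []
    for i in range(len(signal)):
        # Clip the window at the borders so the output keeps the input length.
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        neighbourhood = signal[lo:hi]
        smoothed.append(sum(neighbourhood) / len(neighbourhood))
    return smoothed


# A flat (made-up) signal with one noisy spike in the middle.
noisy = [1.0, 1.0, 1.0, 7.0, 1.0, 1.0, 1.0]
clean = moving_average(noisy, window=3)
print(clean)
```

The spike is spread over its neighbours and loses most of its height, which is exactly the trade-off of this filter: less noise, but also blurrier detail.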


August 29, 1997: Skynet has become self-aware


A signal propagating down an axon to the cell body and dendrites of the next cell


Artificial neural networks attempt to simulate the human nervous system and brain functions, drawing on knowledge from physics, biology, and neuroscience. They are learning models inspired by biological, especially human, neural networks. Their main purpose is to estimate functions that perform a certain task over a large number of possible inputs. They are designed as a system of interconnected “neurons” that can communicate with each other. A neural pathway in humans is a series of interconnected neurons. Each neuron has an axon and dendrites: the axon is a long projection that conducts electrical impulses away from the cell body, while dendrites are tree-like branches that receive electrochemical stimulation from other neural cells and carry it to the cell body. Artificial neural networks try to simulate this kind of interaction and message exchange. For example, if a neural network were trying to recognize numbers and letters in images (similar to the task of solving a CAPTCHA), a set of input neurons could be activated by different pixels. A function would then weigh which of these activations are relevant, and the results would be passed on to other neurons, which would try to match the input to known letters or numerals. The process finishes when an output neuron is activated and a match is produced for the end user.
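
The letter-recognition example can be sketched in a few lines. This is only a toy under loud assumptions: the 3x3 “images”, the two known symbols and the hand-set weights are all invented here, there is a single layer, and nothing is trained. Still, it shows input pixels being weighted, summed and squashed until one output neuron “wins”.

```python
import math

# Hypothetical 3x3 "images", flattened row by row (1 = dark pixel).
LETTER_I = [0, 1, 0,
            0, 1, 0,
            0, 1, 0]
LETTER_O = [1, 1, 1,
            1, 0, 1,
            1, 1, 1]


def template_weights(template):
    """Hand-set weights (not learned): +1 where the template has ink, -1 elsewhere."""
    return [1.0 if pixel else -1.0 for pixel in template]


# One output neuron per symbol the toy network knows about.
OUTPUT_NEURONS = {
    "I": template_weights(LETTER_I),
    "O": template_weights(LETTER_O),
}


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def activations(pixels):
    """Weighted sum of the input pixels, squashed into (0, 1), per output neuron."""
    return {
        label: sigmoid(sum(w * x for w, x in zip(weights, pixels)))
        for label, weights in OUTPUT_NEURONS.items()
    }


def recognize(pixels):
    """Return the label of the most strongly activated output neuron."""
    scores = activations(pixels)
    return max(scores, key=scores.get)


# An "I" with its top pixel missing is still closest to the "I" template.
damaged_i = [0, 0, 0,
             0, 1, 0,
             0, 1, 0]
print(recognize(LETTER_I), recognize(LETTER_O), recognize(damaged_i))
```

In a real network the weights would be learned from examples instead of being written by hand, and there would usually be hidden layers between the pixels and the output neurons.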


In machine learning there are two main types of learning. Supervised learning deals with labeled data. For example, in a dataset of various images, each image would be annotated and described. Unsupervised learning, on the other hand, deals with unlabeled data, and the computer has to find its own way to differentiate between data subsets, clusters or similar images. While learning, neural networks and similar machine-learning algorithms use training sets and test sets. The computer is “trained” on a subset of the dataset, and the rest is then used to check whether it has learned well, the same way that exams test whether a student has really learned a concept by changing some parameters. Just as a teacher or a professor uses different numbers in an equation, a computer can be fed new images or new data, and it has to derive correct conclusions, approximations or estimations on the basis of what it learned from the annotated data. In other words, it has to infer a function from the labeled examples and apply it to new ones. This covers classification as well as regression, a statistical method for estimating the relationships among different variables. Unsupervised learning instead tries to find hidden structure in unlabeled data, and it’s mostly used for clustering and for estimating statistical distributions. There are no labels or feedback signals against which the computer could compare its output. So, supervised learning is similar to learning with a teacher who corrects your outputs, while unsupervised learning is similar to self-study, without continuous feedback.
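
The “train on one part, take the exam on the rest” idea can be sketched as follows. This is a hedged toy, not a recommended method: the classifier is a simple nearest-centroid rule chosen for brevity, and the labels, feature values and the every-fourth-example split are all invented for the illustration. It covers only the supervised case.

```python
def centroid(points):
    """Mean of a list of equal-length feature vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def train(examples):
    """Learn one centroid per label from (features, label) pairs."""
    by_label = {}
    for features, label in examples:
        by_label.setdefault(label, []).append(features)
    return {label: centroid(points) for label, points in by_label.items()}


def predict(model, features):
    """Assign the label whose centroid is closest (squared Euclidean distance)."""
    def dist2(center):
        return sum((a - b) ** 2 for a, b in zip(features, center))
    return min(model, key=lambda label: dist2(model[label]))


# Hypothetical labeled data: two made-up features per image.
dataset = [
    ([1.0, 1.2], "cat"), ([0.8, 1.0], "cat"), ([1.2, 0.9], "cat"), ([0.9, 1.1], "cat"),
    ([3.0, 3.1], "dog"), ([3.2, 2.9], "dog"), ([2.8, 3.0], "dog"), ([3.1, 3.3], "dog"),
]

# Hold out every fourth example as the "exam"; train on the rest.
test_set = dataset[3::4]
train_set = [ex for i, ex in enumerate(dataset) if i % 4 != 3]

model = train(train_set)
correct = sum(predict(model, x) == y for x, y in test_set)
accuracy = correct / len(test_set)
print(f"accuracy on unseen examples: {accuracy:.2f}")
```

The important part is that the two held-out examples never influence the centroids, so the accuracy says something about new data and not just about memorized answers.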


How to see like a computer

Face tracking example

In computer graphics one produces image data from three-dimensional models, while computer vision tends to produce three-dimensional models from image data. Other fields of computer science are related to computer vision as well, especially image processing and analysis. There is also a big overlap with virtual/augmented reality and animation, since face detection, recognition and tracking techniques are often used in animated motion pictures and in the movie industry in general. Motion capture and head/face tracking give input to animators and special effects teams, so that they can produce realistic movement and facial expressions. Face tracking and recognition are often used in social media as well, for recognizing and tagging specific people in images or video. Nowadays, a big area of tracking and recognition from multimedia data deals with subtle variations in emotion, with gender and ethnicity recognition, and with ways to capture microexpressions and macromovements even more accurately. Computer vision techniques are used in image restoration too, and are of great use in the humanities, since one can trace specific brush movements and techniques usually invisible to the human eye in order to tell whether a certain painting is a forgery.


So, the main method starts with image acquisition using the mentioned image sensors, radars, cameras and similar devices. The acquired images are then pre-processed, so that the collected data has as little noise as possible and is as accurate as possible. Next, the images are analyzed and segmented to see which points, parts or subsets are relevant for the task in question. For example, if a computer is trained to find faces, it will focus on the candidates for facial recognition, and not on the background or other objects. The final step is processing and decision-making, where the final analyses are made and an output is given, based on whether the computer found a match or not.
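
The four stages (acquisition, pre-processing, segmentation, decision) can be sketched on a tiny grayscale image. Everything here is an assumption made for the illustration: the pixel values, the cross-shaped median filter, the brightness threshold and the minimal area are invented, and a real face detector would use a trained model instead of a fixed threshold.

```python
from statistics import median


def acquire():
    """Stand-in for a sensor: a tiny made-up grayscale image (0-255)."""
    return [
        [10, 12, 11, 10, 13],
        [11, 200, 210, 12, 10],
        [10, 205, 220, 11, 255],  # the lone 255 is a noisy pixel
        [12, 11, 10, 13, 10],
    ]


def denoise(image):
    """Pre-processing: median over each pixel and its up/down/left/right neighbours."""
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(rows):
        row = []
        for c in range(cols):
            values = [image[r][c]]
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    values.append(image[rr][cc])
            row.append(median(values))
        out.append(row)
    return out


def segment(image, threshold=128):
    """Segmentation: mark pixels brighter than the threshold as candidates."""
    return [[1 if value > threshold else 0 for value in row] for row in image]


def decide(mask, min_area=3):
    """Decision: report a match if enough candidate pixels survive."""
    area = sum(sum(row) for row in mask)
    return area >= min_area, area


raw = acquire()
mask = segment(denoise(raw))
found, area = decide(mask)
print(found, area)
```

Without the pre-processing step the noisy pixel on the right would be counted as part of the object; with it, only the 2x2 bright block is left for the decision stage.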


Old techniques for New Horizons

Curiosity moving around the Martian surface: a selfie

Machine vision is a very important subfield of computer vision, which overlaps with robotics. Computer vision techniques are used to guide robots and to track their movement. Robots are equipped with image sensors, cameras and other sensors, so that they can capture data from the environment and perform the desired tasks. Robotic movement relies on edge detection and object recognition, so that a robot can move through the environment without bumping into obstacles. Machine-learning algorithms help the robot to recognize the desired patterns and to collect the requested data. For example, a spacecraft has to be trained to collect the requested material, and not unwanted or useless samples.
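
Edge detection itself is easy to try out. The sketch below uses the classic Sobel operator, which is one common choice and not necessarily what any particular rover runs. The scene values and the threshold are invented, and image borders are simply left unmarked.

```python
# Sobel kernels for horizontal and vertical intensity changes.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def filter3x3(image, kernel, r, c):
    """Weighted sum of the 3x3 neighbourhood centred on (r, c)."""
    return sum(
        kernel[i][j] * image[r + i - 1][c + j - 1]
        for i in range(3)
        for j in range(3)
    )


def sobel_edges(image, threshold):
    """Return a 0/1 map of interior pixels whose gradient magnitude exceeds threshold."""
    rows, cols = len(image), len(image[0])
    edges = [[0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = filter3x3(image, SOBEL_X, r, c)
            gy = filter3x3(image, SOBEL_Y, r, c)
            if (gx * gx + gy * gy) ** 0.5 > threshold:
                edges[r][c] = 1
    return edges


# Hypothetical scene: flat ground (0) on the left, a rock (9) on the right.
scene = [
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
]
edges = sobel_edges(scene, threshold=10)
for row in edges:
    print(row)
```

The marked pixels form a vertical band where the ground meets the rock, which is the kind of boundary a navigation routine could treat as an obstacle.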


For example, Mars Science Laboratory is a space probe mission that was launched by NASA in 2011, and it successfully landed Curiosity in Gale crater on Mars in 2012. Curiosity’s goal is to study climate and geology and to take pictures of its surroundings, which include the world’s most valuable selfie. Another famous example is Rosetta, launched by the European Space Agency, which performed a detailed study of a comet together with its lander module Philae. The mission achieved the first successful comet landing, and a fly-by of Mars as well. Computer vision techniques are crucial for these missions, both to obtain data from images and to move efficiently through rugged and difficult areas. The most recent example is an interplanetary space probe called New Horizons, which was launched by NASA in 2006 and reached Pluto in July 2015. Researchers from Johns Hopkins University and the Southwest Research Institute collaborated with NASA. Computer vision has advanced considerably since 2006, but the probe carries various devices and modules that perform quite well and are able to give us detailed data. For example, New Horizons has the Long-Range Reconnaissance Imager, which is designed for high resolution and responsivity at visible wavelengths. There is also Alice, an ultraviolet imaging spectrometer, which resolves 1024 wavelength bands in the ultraviolet in order to determine Pluto’s atmospheric composition. New Horizons combines computer vision techniques with physical models to characterize the geology and morphology of Pluto and its moon Charon, to map the surface compositions of Pluto, Charon and possibly other Kuiper belt objects, and to recognize new objects.


Let it go!


Nowadays, computer vision is an important part of modern medicine, since we use computational models to detect diseases or areas that would otherwise stay invisible. Robotic surgery has its advantages in reducing unwanted human factors such as imprecision or exhaustion, so sophisticated techniques such as laparoscopy are nowadays often performed with robotic assistance. Other fields of application include the detection of tumors and similar malignant changes, but the most important application is in neuroscience itself: learning more about brain structure by using various X-ray, ultrasound and similar images to produce three-dimensional, and often enlarged, models of otherwise inaccessible parts of the human body.


Face tracking and animation combined: an application example

Unfortunately, many applications nowadays are actually military ones: tracking enemy soldiers, missiles and weapons, and using computer vision techniques for accurate aiming and missile guidance. This is the only area where we don’t want to see advances in computer vision. Well, except for robots, if they become self-aware and realize how awful we are. Similar applications include various autonomous vehicles and drones, which have also been used for military purposes. Now they are being re-branded for delivering your packages, as Amazon is trying to do, but people are generally not that into drones finding their way into their backyards.


Entertainment has advanced thanks to computer vision as well, since animated motion pictures are getting more and more realistic every day. Animators and producers use real-life face, head and motion tracking to deliver the most realistic characters, based on human or animal movement and expressions. Similarly, special effects departments use these models to create fantasy creatures based on humans or non-human animals, and IT firms create virtual reality gadgets, video games and augmented reality products. Sometimes these are oriented towards making the clients’ lives easier, such as virtual try-ons of, say, make-up or glasses; sometimes they display additional information while you observe the world (Terminator and Predator style!); and sometimes they are simply here to entertain and create virtual environments.


The future


The newest advances include knowledge graphs, knowledge bases and datasets for robots. For example, Robo Brain is an online library of information that computer vision scientists can use to give their robots an understanding of the world they see around them. This could be the beginning of Skynet, so we’re still careful at the moment.


Google's DeepDream


However, the most important trend nowadays is so-called deep learning, which focuses on modeling high-level abstractions using complex structures, most often by attempting to build better representations and to learn them from large-scale unlabeled data. Deep learning has advanced computer vision as well as speech and image recognition, and so-called convolutional neural networks (where the individual neurons are tiled in such a way that they respond to overlapping regions of the visual field, inspired by real-life biological systems) seem to perform best. One of the most amazing recent examples is Google’s DeepDream, which uses a convolutional neural network to find and enhance patterns in images, creating dreamlike, hallucinogenic pictures by deliberate over-processing.
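
Both ideas, overlapping receptive fields and “over-processing” an input until a pattern is exaggerated, can be illustrated with a one-dimensional toy. This is emphatically not Google’s implementation: the real DeepDream works on two-dimensional images with a deep, trained network, while the kernel, the signal values and the step size below are invented for the sketch. What it does share is the principle of changing the input, not the network, by gradient ascent.

```python
def conv1d(signal, kernel):
    """Valid cross-correlation: each output neuron sees one overlapping window."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def relu(values):
    return [max(0.0, v) for v in values]


def total_activation(signal, kernel):
    """Objective to amplify: summed ReLU response of the whole feature map."""
    return sum(relu(conv1d(signal, kernel)))


def dream_step(signal, kernel, rate=0.1):
    """One gradient-ascent step on the input, keeping the kernel fixed."""
    k = len(kernel)
    responses = conv1d(signal, kernel)
    grad = [0.0] * len(signal)
    for i, response in enumerate(responses):
        if response > 0:  # ReLU passes gradient only where it is active
            for j in range(k):
                grad[i + j] += kernel[j]
    return [x + rate * g for x, g in zip(signal, grad)]


kernel = [-1.0, 0.0, 1.0]  # hypothetical "rising edge" detector
signal = [0.0, 0.1, 0.3, 0.2, 0.6, 0.5]

before = total_activation(signal, kernel)
dreamed = signal
for _ in range(5):
    dreamed = dream_step(dreamed, kernel)
after = total_activation(dreamed, kernel)
print(before, after)
```

After a few steps the signal rises more steeply than it originally did, because the input keeps being nudged towards whatever the filter likes to see. Repeating the same trick with many filters in a deep network is what produces the dreamlike pictures.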


Nowadays, there are various applications such as Knoxwell, which enable millions of users to do, for example, armchair archeology and analyze faces and patterns on Ancient Greek pottery. Examples like this show that the future of computer vision lies not only in advancing machine learning algorithms, but also in getting people to use these ingenious advances for research and improvement, as well as entertainment. Let’s see what will happen next!



