Written on the face: how computer face recognition works

Algorithms (technologies)

From a computer's point of view, identifying a person from a photo involves two very different tasks: first, finding the face in the picture (if there is one), and second, extracting from the image the features that distinguish that person from everyone else in the database.

1. Find

Attempts to teach computers to find faces in photographs date back to the early 1970s. Many approaches were tried, but the most important breakthrough came much later, when Paul Viola and Michael Jones created their method of cascaded boosting, a chain of weak classifiers, in 2001. Although cleverer algorithms exist today, it is safe to say that the good old Viola-Jones method is what runs in both your cell phone and your camera. The reason is its remarkable speed and reliability: even back in 2001, an average computer using this method could process 15 frames per second. Today the algorithm's efficiency meets all reasonable requirements. The main thing you need to know about this method is that it is surprisingly simple; you would hardly believe how simple.

  1. Step 1. Remove the color and convert the image into a brightness matrix.
  2. Step 2. Place one of the square masks on it (they are called Haar features) and slide it across the entire image, varying its position and size.
  3. Step 3. Add up the brightness values from the matrix cells under the white part of the mask and subtract the values under the black part. If in at least one position the difference between the white and black areas exceeds a certain threshold, keep this image region for further work. If not, discard it: there is no face here.
  4. Step 4. Repeat from Step 2 with a new mask, but only in the region that passed the first test.
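The white-minus-black test in Steps 2 and 3 can be sketched in a few lines. A minimal illustration in pure Python; the two-band mask geometry and the toy "image" are made up for this example, and a real detector uses trained masks, thresholds and integral images rather than direct summation:

```python
# A minimal sketch of evaluating one Haar-like feature on a brightness matrix.
# Illustrative only: real Viola-Jones detectors use integral images and
# thresholds learned from training data.

def haar_two_band(img, top, left, h, w):
    """Two-band Haar feature: sum of the upper band minus sum of the lower band."""
    upper = sum(img[r][c] for r in range(top, top + h // 2)
                          for c in range(left, left + w))
    lower = sum(img[r][c] for r in range(top + h // 2, top + h)
                          for c in range(left, left + w))
    return upper - lower

# Toy 4x4 "image": dark rows on top (like an eye region), bright rows below.
img = [[10, 10, 10, 10],
       [10, 10, 10, 10],
       [200, 200, 200, 200],
       [200, 200, 200, 200]]

response = haar_two_band(img, 0, 0, 4, 4)
print(response)  # -1520: strongly negative because the top band is darker
```

A detector would compare the magnitude of this response to a learned threshold and keep or discard the region accordingly.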

Why does it work? Look at feature [1]: in almost all photographs, the eye region is slightly darker than the region just below it. Look at feature [2]: the light area in the middle corresponds to the bridge of the nose between the darker eyes. At first glance these black-and-white masks look nothing like faces, but for all their primitiveness they have high generalizing power.

Why is it so fast? The description above omits one important point. To subtract the brightness of one part of the image from another, we would need to sum the brightness of every pixel, and there can be a great many of them. So before the masks are applied, the matrix is converted into an integral representation: the values in the brightness matrix are pre-summed so that the integral brightness of any rectangle can be obtained by adding just four numbers.
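The integral representation described above can be sketched directly. Pure Python; the 3x3 toy image is made up for the example:

```python
# Sketch: build an integral image so that any rectangle's brightness sum
# costs only four table lookups, as described in the text.

def integral_image(img):
    h, w = len(img), len(img[0])
    # Pad with a zero row and column so the lookup formula needs no special cases.
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        for c in range(w):
            ii[r + 1][c + 1] = (img[r][c] + ii[r][c + 1]
                                + ii[r + 1][c] - ii[r][c])
    return ii

def rect_sum(ii, top, left, h, w):
    # Exactly four lookups: D - B - C + A.
    return (ii[top + h][left + w] - ii[top][left + w]
            - ii[top + h][left] + ii[top][left])

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii = integral_image(img)
print(rect_sum(ii, 0, 0, 2, 2))  # 1 + 2 + 4 + 5 = 12
```

However large the rectangle, the cost of summing it stays constant, which is what makes sliding thousands of masks over an image affordable.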

How is the cascade assembled? Although each masking stage gives a very large error (the actual accuracy is not much above 50%), the strength of the algorithm lies in the cascaded organization of the process. This makes it possible to quickly discard regions where there is definitely no face and spend effort only on regions that may yield a result. This principle of assembling weak classifiers into a sequence is called boosting (covered in more detail in the October issue of PM). The general principle: even large errors, multiplied by one another, become small.
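The arithmetic behind "large errors, multiplied by one another, become small" can be shown in a few lines. The 50% per-stage rate comes from the text; the ten-stage count is a made-up example, not a figure from the Viola-Jones paper:

```python
# Sketch of why cascading works: each stage wrongly passes some fraction of
# non-face regions, and those fractions multiply across the cascade.

stages_fp = [0.5] * 10       # ten weak stages, each passing 50% of non-faces
overall_fp = 1.0
for fp in stages_fp:
    overall_fp *= fp

print(overall_fp)  # 0.5 ** 10, i.e. under 0.1% of non-faces survive the cascade
```

A region must pass every stage to be accepted, so even mediocre stages combine into a very selective filter, while most regions are rejected after only a stage or two of cheap computation.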

2. Simplify

Finding the features of a face that allow identification of its owner means reducing reality to a formula. This is simplification, and a very radical one. Even a miniature 64 x 64 pixel photo admits an enormous number of pixel combinations: (2^8)^(64 x 64) = 2^32768 of them. At the same time, numbering each of the 7.6 billion people on Earth would take only 33 bits. In going from the first number to the second, all the extraneous noise must be thrown out while the most important individual features are kept. Statisticians familiar with such tasks have developed many data-simplification tools, for example principal component analysis (PCA), which laid the foundation of face identification. Recently, however, convolutional neural networks have left the old methods far behind. Their structure is quite peculiar, but in essence they too are a simplification method: their task is to reduce a specific image to a set of features.
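The gap between those two numbers can be checked with a couple of lines (the 7.6 billion figure is taken from the text above):

```python
import math

# The space of 64x64 8-bit grayscale images versus the bits needed
# to give every person on Earth a unique number.
image_bits = 8 * 64 * 64                  # each image is one of 2**32768 values
people_bits = math.ceil(math.log2(7.6e9))  # unique labels for 7.6 billion people

print(image_bits, people_bits)  # 32768 vs 33
```

The whole identification pipeline is, in effect, a controlled collapse from the first number to something on the scale of the second.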

Step 1. We place a fixed-size mask on the image (properly called a convolution kernel) and multiply the brightness of each pixel by the corresponding value in the mask. We take the average over all pixels in the "window" and write it into one cell of the next level.

Step 2. We shift the mask by a fixed step, multiply again, and again write the average into the feature map.

Step 3. Having walked the whole image with one mask, we repeat with another and obtain a new feature map.

Step 4. We shrink our maps: we take a few neighboring pixels (for example, a 2x2 or 3x3 square) and pass only the single maximum value to the next level. We do the same for the maps obtained with all the other masks.

Steps 5, 6. For the purposes of mathematical hygiene, we replace all negative values with zeros. Then we repeat from Step 2 as many times as we want layers in the neural network.

Steps 7, 8. From the last feature map we assemble not a convolutional but a fully connected neural network: we turn all the cells of the last level into neurons, each of which, with a certain weight, influences the neurons of the next layer. Last step. In networks trained to classify objects (to tell cats from dogs in photos, and so on), the output layer is a list of probabilities for each possible answer. In the case of faces, instead of a specific answer we get a short set of the most important facial features. In Google's FaceNet, for example, these are 128 abstract numeric parameters.
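Steps 1 through 6 can be sketched with NumPy. The 3x3 vertical-edge kernel and the toy image are made up for illustration (real networks learn their kernels from data), and this sketch sums over the window rather than averaging, which differs only by a constant factor:

```python
import numpy as np

def conv2d(img, kernel, stride=1):
    """Steps 1-2: slide the kernel over the image, writing one value per window."""
    kh, kw = kernel.shape
    oh = (img.shape[0] - kh) // stride + 1
    ow = (img.shape[1] - kw) // stride + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = img[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = (patch * kernel).sum()
    return out

def relu(x):
    """Step 5: replace all negative values with zeros."""
    return np.maximum(x, 0)

def maxpool(x, k=2):
    """Step 4: keep only the maximum of each k x k block."""
    h, w = x.shape[0] // k, x.shape[1] // k
    return x[:h * k, :w * k].reshape(h, k, w, k).max(axis=(1, 3))

img = np.arange(36, dtype=float).reshape(6, 6)       # toy 6x6 "image"
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)             # vertical-edge mask
fmap = maxpool(relu(conv2d(img, kernel)))

print(fmap.shape)  # (2, 2): a 6x6 image collapsed to a 2x2 feature map
```

Stacking such layers, each with many kernels, gradually turns a picture into the short feature vector that Steps 7 and 8 then feed into the fully connected part.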

3. Identify

The final stage, identification itself, is the simplest, even trivial, step. It comes down to assessing how similar the resulting list of features is to those already in the database. In mathematical jargon, it means finding the distance from the given vector to the nearest region of known faces in feature space. The same approach solves another problem: finding people who look alike.

Why does it work? A convolutional neural network is tailored to extract the most characteristic features from an image, automatically and at several levels of abstraction. While the first levels typically respond to simple patterns such as hatching, gradients and sharp edges, the complexity of the features grows with every level. The masks the network applies at the higher levels often resemble human faces or fragments of them. Moreover, unlike PCA, neural networks combine features in a non-linear (and unexpected) way.

Where do the masks come from? Unlike the masks used in the Viola-Jones algorithm, neural networks manage without human help and discover their masks during training. This requires a large training sample containing pictures of a wide variety of faces against a wide variety of backgrounds. As for the resulting set of features that the network outputs, it is shaped by training on triplets. A triplet is a set of three images in which the first two are photographs of the same person and the third is a photograph of someone else. The network learns to find features that bring the first two images as close together as possible while pushing the third away.
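The triplet objective can be written down in a few lines. A minimal sketch in pure Python; the two-number embeddings and the margin value are made up, and a real system such as FaceNet minimizes this quantity with gradient descent over the whole network:

```python
def sq_dist(a, b):
    """Squared distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero when the positive is closer than the negative by at least `margin`;
    positive (and worth reducing) otherwise."""
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)

# Two embeddings of one person and one of a stranger (made-up numbers):
a, p, n = [0.1, 0.2], [0.12, 0.18], [0.9, 0.8]
print(triplet_loss(a, p, n))  # 0.0: this triple is already well separated
```

During training the network adjusts its masks so that this loss, summed over millions of triplets, falls toward zero, which is exactly the "pull the first two together, push the third away" behavior described above.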

Whose neural network is better? Face identification has long since moved from academia into big business, and here, as in any business, vendors strive to prove that their algorithms are the best, although they do not always publish open test results. For example, according to the MegaFace benchmark, the deepVo V3 algorithm from the Russian company Vokord currently shows the best accuracy, at 92%. Google's FaceNet v8 scores only 70% on the same benchmark, and Facebook's DeepFace, with a claimed accuracy of 97%, did not take part at all. Such figures should be interpreted with caution, but it is already clear that the best algorithms have nearly reached human accuracy in face recognition.

Living makeup (art)

In the winter of 2016, at the 58th annual Grammy Awards, Lady Gaga performed a tribute to David Bowie, who had died shortly before. During the performance, live lava spread across her face, leaving on her forehead and cheek a sign every Bowie fan would recognize: an orange lightning bolt. The moving makeup was created by video projection: a computer tracked the singer's movements in real time and projected the picture onto her face, allowing for its shape and position. It is easy to find video online in which the projection is visibly still imperfect, lagging slightly behind sudden movements.

Nobumichi Asai has been developing Omote, his face video-mapping technology, since 2014, and since 2015 he has been actively demonstrating it around the world, collecting a respectable list of awards along the way. He founded the company WOW Inc., became an Intel partner, gaining a good incentive for development, and his collaboration with Ishikawa Watanabe of the University of Tokyo made it possible to speed up the projection. Still, the essential work happens on a computer, and many developers use similar solutions to overlay masks on faces, be it an Imperial soldier's helmet or David Bowie makeup.

Alexander Khanin, Founder and CEO of VisionLabs

“Such a system does not need a powerful computer; masks can be applied even on mobile devices. The system can run directly on a smartphone, without sending data to the cloud or to a server.”

“This task is called face point tracking. There are many similar solutions in the public domain, but professional projects are fast and photorealistic,” says Alexander Khanin, head of VisionLabs. “The hardest part is determining the positions of the points while accounting for facial expressions and the individual shape of the face, or in extreme conditions: strong head turns, poor lighting, high exposure.” To teach the system to find the points, a neural network is trained, at first by hand, with photographs painstakingly labeled one by one. “The input is a picture, the output a labeled set of points,” Alexander explains. “Then the detector starts: the face is found, a three-dimensional model of it is built, and the mask is superimposed on the model. Markers are applied to every frame of the stream in real time.”

That is how Nobumichi Asai's invention works. Beforehand, the Japanese engineer scans his models' heads, obtaining accurate three-dimensional prototypes and preparing the video sequence with the shape of each face in mind. The task is also eased by small reflective markers glued onto the performer before going on stage. Five infrared cameras track their movements, feeding the tracking data to a computer. Then everything happens as VisionLabs described: the face is detected, a three-dimensional model is built, and Ishikawa Watanabe's projector comes into play.

Watanabe introduced the DynaFlash device in 2015: a high-speed projector that can track and compensate for the movement of the surface on which the picture is displayed. The screen can be tilted, yet the image is not distorted and is projected at up to a thousand 8-bit frames per second, with a delay of no more than three milliseconds, imperceptible to the eye. For Asai such a projector was a godsend, and live makeup truly began to work in real time. In the video recorded in 2017 for the duet Inori, popular in Japan, no lag is visible at all. The dancers' faces turn into living skulls, then into weeping masks. It looks fresh and draws attention, but the technology is already quickly becoming fashionable. Soon a butterfly perched on a weather presenter's cheek, or performers changing their appearance with every number on stage, will surely be the most ordinary thing.

Face hacking (activism)

Mechanics teaches that every action produces a reaction, and the rapid development of systems for surveillance and identification is no exception. Today, neural networks make it possible to match a random blurry photo from the street against pictures uploaded to social network accounts and establish a passer-by's identity within seconds. At the same time, artists, activists and machine vision specialists are creating tools that can give people back their privacy, the personal space that is shrinking at such dizzying speed.

Identification can be thwarted at different stages of the algorithms' operation. As a rule, the attacks target the first steps of the recognition process: the detection of figures and faces in the image. Just as military camouflage deceives our eyesight by hiding an object and breaking up its geometric proportions and silhouette, people try to confuse machine vision with contrasting colored patches that distort the parameters important to it: face contours, the eyes, the mouth, and so on. Fortunately, computer vision is not yet as perfect as ours, which leaves considerable freedom in the choice of colors and shapes for such “camouflage”.

Pink and purple, yellow and blue tones dominate the HyperFace clothing line, the first samples of which were presented by designer Adam Harvey and the studio Hyphen Labs in January 2017. Its pixel patterns offer machine vision an ideal, from its point of view, picture of a human face, which the computer locks onto like a false target. A few months later, Moscow programmer Grigory Bakunov and his colleagues even developed a special application that generates makeup patterns that interfere with identification systems. And although the authors, on reflection, decided not to release the program publicly, the same Adam Harvey offers several ready-made options.

A person in a mask or wearing strange makeup may be invisible to computer systems, but other people will certainly notice him. There are, however, ways of doing the opposite. From a neural network's point of view, an image contains no imagery in our usual sense; to it, a picture is a set of numbers and coefficients, so completely different objects can end up looking almost identical. Knowing these nuances of AI, one can mount a subtler attack and alter the image only slightly, so that the changes are nearly invisible to a person while machine vision is completely deceived. In November 2017, researchers showed how small changes to the coloring of a turtle or a baseball make Google's InceptionV3 confidently see a rifle or a cup of espresso instead. And Mahmoud Sharif and his colleagues at Carnegie Mellon University designed a mottled pattern for eyeglass frames: it barely affects how other people perceive the face, but the Face++ identification system confidently mistakes the wearer for the person “for whom” the pattern was designed.
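The idea of a "slight" change that flips a classifier can be illustrated on a toy model. Everything here is invented for illustration: the "classifier" is just a made-up weight vector, and the perturbation follows the sign-of-gradient recipe rather than any real attack on a deployed system:

```python
# Toy sketch of an adversarial perturbation: nudge each input value a tiny
# step against the gradient of the score, and the label flips even though
# the input barely changes.

def predict(w, x):
    """A linear 'classifier': positive score means 'face', negative means not."""
    return sum(wi * xi for wi, xi in zip(w, x))

w = [0.5, -1.0, 0.8]      # made-up model weights
x = [1.0, 1.0, 1.0]       # original input; score = 0.3, i.e. 'face'
eps = 0.2                 # perturbation budget: at most 0.2 per value

# For a linear model the gradient of the score w.r.t. x is simply w,
# so step each coordinate by eps against the sign of that gradient.
x_adv = [xi - eps * (1 if wi > 0 else -1) for xi, wi in zip(x, w)]

print(predict(w, x) > 0, predict(w, x_adv) > 0)  # True, False: label flipped
```

Real networks are non-linear, but the same logic, small steps aligned with the model's own gradients, is what turns a turtle into a "rifle" while leaving it a turtle to human eyes.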

The article “It is written on the face” was published in the journal Popular Mechanics (No. 12, December 2017).

