Computer vision has become a prominent part of our day-to-day lives, pushing developers to continually innovate and create useful applications. In the late 1960s, artificial intelligence pioneers hoped to mimic the human visual system, giving birth to the field of computer vision. From the 1970s onward, we began to see the development of computer vision algorithms that are still in use today.
The history of computer vision can be broadly divided into two distinct periods: shallow reasoning (roughly before 2010) and complex reasoning (from 2010 onward).
Shallow reasoning mainly deals with extracting features manually from an image or video. Creating feature extractors, such as edge detectors and ridge detectors, among many others, requires domain knowledge, and building them can be a cumbersome process. These hand-crafted features can only take us so far: they are very difficult to generalize to many use cases, which is what gives rise to the name ‘shallow reasoning’.
Solutions designed before 2010 mainly leveraged shallow reasoning and could solve some basic tasks, such as digit recognition and early face detection. Digit recognition has been used for reading bank cheques or postal codes on mail, while face detection has been used to auto-focus on faces in cameras.
Since the advent of deep learning, many computer vision tasks have moved from shallow reasoning to deep learning. The main advantage is that feature extractors no longer have to be created manually; they can be learned automatically by the model. As a result, the focus has shifted to higher layers of abstraction.
Since 2010, computer vision solutions have shifted to leverage complex reasoning methods. Current applications include suggesting possible diagnoses from medical images, detecting obstacles for self-driving cars, reverse-image search, and even creating 3-D models from 2-D images. This is only a snapshot of what complex reasoning methods can do. To learn more about complex reasoning, check out part two of our blog post! LINK TO PART TWO
Let’s dig a little deeper into the earlier methods of computer vision.
Concepts of shallow reasoning are often referred to as classical computer vision techniques. The most popular techniques include local binary patterns (LBP), histograms of oriented gradients (HOG), and the scale-invariant feature transform (SIFT), among many others.
In Python, an image can be stored as a NumPy array of pixel intensities. These intensities carry both the texture and the noise of the image.
Texture: local variations in intensity due to the structural heterogeneity of the scene
Noise: uncertainty in the image introduced by the sensing instrument
When analyzing an image, we focus on its texture, since noise varies from camera to camera. The texture is analyzed to decipher the content of an image: its shapes, colors, and patterns. Feature extractors are used to identify aspects of the image texture, as sketched below.
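As a minimal sketch, here is how an image might be loaded as a NumPy array (the file name is a placeholder, and the use of Pillow is just one option; OpenCV or imageio would work equally well):

```python
import numpy as np
from PIL import Image

# "photo.jpg" is a placeholder path; any image file will do.
img = np.asarray(Image.open("photo.jpg").convert("L"))  # 2-D grayscale array

print(img.shape, img.dtype)   # e.g. (height, width) uint8
print(img.min(), img.max())   # pixel intensities, carrying both texture and noise
```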
Image texture analysis broadly follows two paradigms: structural and statistical. Structural approaches use feature extractors such as local binary patterns, the Fourier transform, and wavelets, while statistical approaches use co-occurrence matrices and oriented histograms. Let’s take a look at them!
The Local Binary Pattern feature vector, in its simplest form, is created in the following manner:
Divide the examined window into cells (e.g. 16x16 pixels for each cell).
For each pixel in a cell, compare the pixel to each of its 8 neighbors (on its left-top, left-middle, left-bottom, right-top, etc.). Follow the pixels along a circle, i.e. clockwise or counter-clockwise.
Where the center pixel's value is greater than the neighbor's value, write "0". Otherwise, write "1". This gives an 8-digit binary number (which is usually converted to decimal for convenience).
Compute the histogram, over the cell, of the frequency of each "number" occurring (i.e., each combination of which pixels are smaller and which are greater than the center). This histogram can be seen as a 256-dimensional feature vector.
Optionally, normalize the histogram.
Concatenate (normalized) histograms of all cells. This gives a feature vector for the entire window.
The feature vector can now be processed using a machine learning algorithm to classify images.
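As a rough illustration of the steps above, here is a minimal NumPy sketch (the 16x16 cell size and the neighbour ordering are the ones described above; in practice a library implementation such as scikit-image's `local_binary_pattern` would normally be used):

```python
import numpy as np

def lbp_feature_vector(gray, cell_size=16):
    """Minimal LBP sketch; `gray` is a 2-D uint8 grayscale image."""
    h, w = gray.shape
    # The 8 neighbours, visited in a fixed clockwise order around the centre pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    centre = gray[1:-1, 1:-1]
    codes = np.zeros_like(centre, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = gray[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        # Write a 1 where the neighbour is >= the centre pixel, 0 otherwise.
        codes |= (neighbour >= centre).astype(np.uint8) << bit
    # Histogram the 8-bit codes per cell, normalize, and concatenate.
    feats = []
    for y in range(0, codes.shape[0] - cell_size + 1, cell_size):
        for x in range(0, codes.shape[1] - cell_size + 1, cell_size):
            hist, _ = np.histogram(codes[y:y + cell_size, x:x + cell_size],
                                   bins=256, range=(0, 256))
            feats.append(hist / hist.sum())
    return np.concatenate(feats)
```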
The Fourier transform decomposes an image into sine and cosine components, converting it from the spatial domain into the Fourier, or frequency, domain. In the Fourier domain, each point represents a frequency contained in the input image. Once the frequency content has been manipulated, the transform is inverted to return to the spatial domain.
Let’s look at the image below.
From the image, we have extracted three patches. In the yellow patch, we see a sharp boundary between the sky and mountains. The red patch contains a smooth gradient of colour, and the blue patch contains a solid grey colour without much pattern or change. Each patch therefore carries a different band of frequencies, and these bands can be used to find similar features in the rest of the image: we take the discrete Fourier transform of the image, keep only the frequency band of interest, and compute the inverse. This gives us the image with frequencies only in the selected band, which can then be used in machine learning algorithms.
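A hedged sketch of that band-pass idea using NumPy's FFT is shown below; the band limits of 10 and 60 (in pixels from the centre of the spectrum) are purely illustrative:

```python
import numpy as np

def bandpass(gray, low=10, high=60):
    """Keep only frequencies between `low` and `high`; `gray` is a 2-D array."""
    F = np.fft.fftshift(np.fft.fft2(gray))            # spatial -> frequency domain
    rows, cols = gray.shape
    y, x = np.ogrid[:rows, :cols]
    radius = np.sqrt((y - rows // 2) ** 2 + (x - cols // 2) ** 2)
    mask = (radius > low) & (radius < high)            # pass-band chosen for illustration
    return np.fft.ifft2(np.fft.ifftshift(F * mask)).real   # back to the spatial domain
```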
In situations where the Fourier transform fails to provide a reasonable analysis, wavelets can be used. While the Fourier transform describes which frequencies are present in a signal, it gives no insight into where, in time or space, they occur. The wavelet transform offers resolution in both the time and frequency domains, allowing for an easier analysis of dynamic signals that change over time. The infinite sine and cosine waves used in the Fourier transform provide no insight into changes over time; a wavelet, by contrast, acts as a ‘mini-wave’, a snapshot of a single burst tied to a specific time.
In computer vision, wavelets can be used for a plethora of image-processing methods due to their better spatial-domain localization, which is essential for image analysis. Wavelets can remove noise from an image by transforming the image into the wavelet domain, modifying the wavelet coefficients that carry the noise, and computing the inverse wavelet transform (see the sketch after this paragraph). Wavelets can also be used to evaluate and process color images without distorting them, and have been used for image super-resolution, enhancement, compression, analysis, and classification. For example, the Gabor wavelet, as shown below, is one of the most important kernels used in fingerprint and retina analysis.
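As a rough illustration of that denoising recipe, the sketch below uses the PyWavelets package; the Daubechies-2 wavelet, the decomposition level, and the threshold heuristic are all arbitrary choices for the sake of the example:

```python
import numpy as np
import pywt

def wavelet_denoise(noisy, wavelet="db2", level=2):
    """Decompose, soft-threshold the detail coefficients, and reconstruct."""
    coeffs = pywt.wavedec2(noisy, wavelet=wavelet, level=level)
    # Threshold derived from the finest detail band -- an illustrative heuristic.
    thresh = 0.1 * np.max(np.abs(coeffs[-1][0]))
    denoised = [coeffs[0]] + [
        tuple(pywt.threshold(c, thresh, mode="soft") for c in detail)
        for detail in coeffs[1:]
    ]
    return pywt.waverec2(denoised, wavelet=wavelet)
```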
A co-occurrence matrix or co-occurrence distribution (also referred to as a ‘gray-level co-occurrence matrix’, or GLCM) is a matrix defined over an image as the distribution of co-occurring pixel values (grayscale values, or colors) at a given offset. It is used as an approach to texture analysis with various applications, especially in medical image analysis. The texture of an image is characterized by identifying how often pairs of pixels with specific values and in a specified spatial relationship occur in the image, building a GLCM from those counts, and then extracting statistical measures from this matrix. These measures include the contrast, correlation, energy, and homogeneity of the textures found.
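A brief sketch with scikit-image follows; the one-pixel horizontal offset is just one possible choice, and the function is named `graycomatrix` in recent versions of the library (older releases spell it `greycomatrix`):

```python
from skimage.feature import graycomatrix, graycoprops

def glcm_stats(gray):
    """Count grey-level co-occurrences at a 1-pixel horizontal offset; `gray` is uint8."""
    glcm = graycomatrix(gray, distances=[1], angles=[0],
                        levels=256, symmetric=True, normed=True)
    # Extract the statistical measures mentioned above.
    return {prop: graycoprops(glcm, prop)[0, 0]
            for prop in ("contrast", "correlation", "energy", "homogeneity")}
```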
The histogram of oriented gradients (HOG) is a feature descriptor used in computer vision and image processing for the purpose of object detection. The technique counts occurrences of gradient orientation in localized portions of an image. This method is similar to that of edge orientation histograms, scale-invariant feature transform descriptors, and shape contexts, but differs in that it is computed on a dense grid of uniformly spaced cells and uses overlapping local contrast normalization for improved accuracy.
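With scikit-image, a HOG descriptor can be computed in a single call; the cell, block, and orientation-bin sizes below are common defaults rather than anything prescribed here, and `gray` is a hypothetical 2-D grayscale input:

```python
from skimage.feature import hog

features = hog(gray,
               orientations=9,            # gradient-orientation bins per histogram
               pixels_per_cell=(8, 8),    # dense grid of uniformly spaced cells
               cells_per_block=(2, 2),    # overlapping blocks for local contrast normalization
               block_norm="L2-Hys")
```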