In our previous blog post, we began our journey into the history of computer vision. The field has evolved from the shallow, manual processing techniques of the past to today's more complex reasoning methods. Check out our previous blog post to learn more about the shallow processing methods of the 1970s to 2010s. LINK TO PREVIOUS BLOG
Since 2010, the computer vision field has been constantly evolving, and its image and video processing techniques have been seamlessly integrated into our daily lives. Can you spot a computer vision application that you use every day?
The advanced concepts and algorithms of the current age of computer vision are collectively known as ‘complex reasoning’ concepts. Complex reasoning can be categorized into four core tasks:
Image Classification – detecting diseases and making diagnoses from medical images
Image Detection – autonomous ‘self-driving’ cars detecting obstacles to navigate around
Image Segmentation – segmenting patients’ 3D medical scans for diagnosis and analysis, a task that previously required a high level of expertise
Image Retrieval – searching a database for images similar to a query image, as used in large search engines such as Google
In recent times, complex reasoning methods have increasingly been built on deep learning. From a deep learning perspective, a computer vision task can be divided into three levels:
Low-level processing
Intermediate-level processing
High-level processing
Additionally, all of these tasks need a database to store their information. Any computer vision architecture consists of a database of information and task-solvers built around it.
Low-level processing of a computer vision task involves image/video acquisition and pre-processing for the higher levels of analysis. There are several modes of image capture, each catering to different needs:
RGB cameras – capture light in the red, green and blue ranges
RGBD cameras – capture depth alongside the RGB information
LIDAR cameras – project lasers and measure only distance
X-ray machines – capture the response of tissues to X-ray waves
Sonograms – form an image from acoustic information
Binocular cameras – produce two images captured from separated viewpoints
Telescopes – image very distant objects at different wavelengths
Microscopes – image very small objects
We acquire data from any of the above instruments and store it in a format that our code can understand. In Python, this is generally a NumPy array.
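For instance, here is a minimal sketch of that step using Pillow and NumPy; the file name house.png is a placeholder for whatever our capture device saved to disk.

import numpy as np
from PIL import Image

# Load an image from disk; "house.png" is a placeholder file name.
img = Image.open("house.png")

# Convert it into a NumPy array that the rest of the pipeline can use.
pixels = np.asarray(img)

print(pixels.shape)  # (height, width) for greyscale, (height, width, 3) for RGB
print(pixels.dtype)  # typically uint8, i.e. raw values from 0 to 255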
Consider this black and white image of a house.
We acquire this image from our mode of capture and store the value at each pixel. This value ranges from 0 to 1, where 0 represents black and 1 represents white.
A coloured RGB image is divided into three layers – one for each colour: red, green and blue. Each layer holds a value for every pixel in the image, in the range 0 to 255.
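Here is a short sketch of both representations, again assuming the placeholder house.png. Dividing the raw 0–255 greyscale intensities by 255 gives the 0-to-1 range described above.

import numpy as np
from PIL import Image

# Greyscale: one value per pixel, scaled so 0 is black and 1 is white.
grey = np.asarray(Image.open("house.png").convert("L"), dtype=np.float64) / 255.0
print(grey.min(), grey.max())  # values now lie in [0, 1]

# Colour: three layers, one per channel, each holding values from 0 to 255.
rgb = np.asarray(Image.open("house.png").convert("RGB"))
red, green, blue = rgb[..., 0], rgb[..., 1], rgb[..., 2]
print(red.shape, green.shape, blue.shape)  # one value per pixel in each layer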
Intermediate-level processing of a computer vision task involves extracting features from the data. There are many techniques for feature extraction, and many of them have existed since the age of shallow reasoning in computer vision. However, with the advancement of deep learning, methods of higher accuracy have been developed and put to good use.
From the black and white image of the house in the previous section, we can apply a feature extractor for edge detection. The output presents the edges between the objects in the image. Edge detectors were among the first image processing techniques to be developed and are considered shallow reasoning tools.
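As a minimal sketch, the Canny detector from OpenCV performs this kind of classic edge extraction; the file name and both thresholds are illustrative and would need tuning for a real image.

import cv2

# Read the image directly as greyscale; "house.png" is a placeholder.
grey = cv2.imread("house.png", cv2.IMREAD_GRAYSCALE)

# The two thresholds control which intensity gradients count as edges.
edges = cv2.Canny(grey, 100, 200)

# Edge pixels are 255, everything else is 0.
cv2.imwrite("house_edges.png", edges)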
We can use image segmentation tools to label different parts of the image. The output identifies and labels segments of the image, such as the roof, grass, sky and tree. This process requires a more advanced form of processing and is considered a complex reasoning technique.
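As a rough illustration of the grouping step, here is a sketch using scikit-image's SLIC superpixel algorithm. Note that SLIC is unsupervised – it only clusters nearby pixels of similar colour – so attaching semantic labels such as ‘roof’ or ‘sky’ to the resulting segments would still require a trained deep model on top. The file name and parameters are placeholders.

from skimage import io
from skimage.segmentation import slic, mark_boundaries

# Load the colour image; keep only the RGB channels in case of an alpha layer.
image = io.imread("house.png")[..., :3]

# Group pixels into regions of similar colour and position.
# n_segments is a rough target for how many regions to produce.
segments = slic(image, n_segments=50, compactness=10)

# Overlay the region boundaries on the original image for inspection.
overlay = mark_boundaries(image, segments)
io.imsave("house_segments.png", (overlay * 255).astype("uint8"))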
High-level processing of a computer vision task is the highest level of abstraction. It allows us to understand the image and extrapolate information from it. Looking at our example of the black and white house: having identified the edges and segments of the image, higher-level processing lets us deduce and reason about the aspects that make it a house, and so forth.
Now that we understand the basic structure of a complex reasoning computer vision task, let’s put it into action.
The task is to detect and track every person appearing in a surveillance camera’s video feed. The program should be able to perform:
Object Classification – identify and classify a person
Object Localization – locate the person in space
Object Re-Identification – re-identify a previously classified person
Object Tracking – track the movement of a classified person
We can divide our program into the three levels of processing described above.
Low-Level Processing – acquiring the surveillance footage and storing the data in a NumPy array on disk
Intermediate-Level Processing – object classification and object localization
High-Level Processing – object re-identification and tracking
Tools and algorithms to use:
Object Detection Algorithm – YOLOv3
Object Re-Identification Algorithm – Deep SORT
Tracking Algorithm – Kalman filter
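To make the tracking step concrete, here is a minimal, self-contained sketch of a constant-velocity Kalman filter in NumPy. This is not the exact filter inside Deep SORT, which tracks full bounding boxes and handles many objects at once, but it shows the predict/update cycle that keeps a track alive between detections; the time step and noise values are illustrative assumptions.

import numpy as np

class SimpleKalmanTracker:
    """Constant-velocity Kalman filter over the state [x, y, vx, vy]."""

    def __init__(self, dt=1.0):  # dt = assumed time between video frames
        self.F = np.array([[1, 0, dt, 0],   # transition: position += velocity * dt
                           [0, 1, 0, dt],
                           [0, 0, 1,  0],
                           [0, 0, 0,  1]], dtype=float)
        self.H = np.array([[1, 0, 0, 0],    # we only observe the (x, y) position
                           [0, 1, 0, 0]], dtype=float)
        self.Q = np.eye(4) * 0.01           # process noise (illustrative)
        self.R = np.eye(2)                  # measurement noise (illustrative)
        self.x = np.zeros(4)                # state estimate
        self.P = np.eye(4)                  # state covariance

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]                   # predicted (x, y)

    def update(self, z):
        y = z - self.H @ self.x                     # innovation
        S = self.H @ self.P @ self.H.T + self.R     # innovation covariance
        K = self.P @ self.H.T @ np.linalg.inv(S)    # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P

tracker = SimpleKalmanTracker()
tracker.predict()                         # guess where the person moved
tracker.update(np.array([120.0, 80.0]))   # correct with a (made-up) detection
print(tracker.x[:2])                      # filtered position estimate

Each frame, the filter first predicts where the person should be and then corrects that guess with the matched YOLOv3 detection; when detections drop out for a few frames, the prediction alone keeps the track going. In the command below, Deep SORT handles this bookkeeping for us.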
Using the following command, we can run object tracking on the video:
python object_tracker.py --video ./data/video/test.mp4 --output ./data/video/results.avi --weights ./weights/yolov3-tiny.tf --tiny