Highlights: Prior to YOLO, the majority of approaches for object detection tried to adapt classifiers for the purpose of detection. In YOLO, object detection is framed as a regression problem to spatially separated bounding boxes and associated class probabilities. In this post, we will learn about the YOLO object detection system and how to implement such a system in TensorFlow 2.0.
About YOLO: Our unified architecture is extremely fast. Our base YOLO model processes images in real-time at 45 frames per second. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second … — You Only Look Once: Unified, Real-Time Object Detection, 2015
YOLO is a state-of-the-art object detection system (network) developed by Joseph Redmon. Its biggest advantage over other popular architectures is speed. The models in the YOLO family are really fast, much faster than R-CNN and others, which means that we can achieve real-time object detection.
At the time of its first publication (2016), compared to systems like R-CNN and DPM, YOLO achieved state-of-the-art mAP (mean Average Precision). On the other hand, YOLO struggles to accurately localize objects. However, it learns a general representation of the objects. Newer versions bring improvements in both speed and accuracy.
Alternatives (at the time of publishing): Other approaches mainly used a sliding window over the entire image and ran a classifier on those regions (DPM, deformable part models). R-CNN, in contrast, used a region proposal method: it first generated potential bounding boxes, then ran a classifier on those boxes, and finally applied post-processing to remove duplicate detections and refine the bounding boxes.
YOLO reframes object detection as a single regression problem, going directly from image pixels to bounding box coordinates and class probabilities. Hence, a single convolutional network predicts multiple bounding boxes and the class probabilities for those boxes.
Since YOLO takes only one look at the image, a sliding window is not the right approach. Instead, the entire image is split into an S × S grid. Each cell is then responsible for predicting a few different things.
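The grid idea can be sketched in a few lines. This is a minimal illustration, not YOLO's actual implementation: the grid size `S` and image size below are assumed example values, and the helper `responsible_cell` is hypothetical.

```python
# Sketch of the grid idea: the image is divided into an S x S grid, and
# the cell containing an object's center is responsible for that object.
S = 13                      # grid size (assumed example; YOLO variants use e.g. 7 or 13)
img_w, img_h = 416, 416     # input image size (assumed example)

def responsible_cell(box_center_x, box_center_y):
    """Return (row, col) of the grid cell that contains the box center."""
    col = int(box_center_x / img_w * S)
    row = int(box_center_y / img_h * S)
    return row, col

print(responsible_cell(208, 100))  # -> (3, 6)
```

An object centered at pixel (208, 100) falls into grid cell (3, 6), so only that cell's predictions are matched to the object during training.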
First, each cell is responsible for predicting some number of bounding boxes. Each cell also predicts a confidence value for each bounding box; in other words, the probability that the box contains an object. If there is no object in a grid cell, it is important that its confidence value is very low.
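In the YOLO paper, the confidence target during training is Pr(object) × IoU between the predicted box and the ground truth. A plain-Python IoU sketch (boxes given as corner coordinates, an assumed convention here) could look like this:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Two partially overlapping boxes: intersection 25, union 175.
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # -> 0.142857...
```

A perfect prediction on a cell that contains an object should therefore have confidence close to 1, while empty cells should predict confidence near 0.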
When we visualize all of these predictions, we get a map of all the objects and a bunch of boxes ranked by their confidence values.
Second, each cell is responsible for predicting class probabilities. These are conditional probabilities: a grid cell that predicts "car" is not saying that there is a car, it is just saying that if there is an object, then that object is a car.
Let us describe in more detail what the output looks like.
In YOLO, anchor boxes are used to predict bounding boxes. The main idea is to predefine two different shapes, called anchor boxes or anchor box shapes, so that we can associate two predictions with the two anchor boxes. In general, we might use even more anchor boxes (five or even more). The anchors were calculated on the COCO dataset using k-means clustering.
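A hedged sketch of how anchors come into play at prediction time, in the style of YOLOv2/v3 decoding: the network outputs a raw size offset per anchor, and the box size is the anchor size scaled by the exponential of that offset. The anchor values below are illustrative, not the actual COCO-clustered anchors.

```python
import math

# Two illustrative anchor shapes as (width, height) in grid-cell units.
anchors = [(1.19, 1.99), (4.41, 3.42)]

def decode_size(t_w, t_h, anchor):
    """Box width/height = anchor size scaled by exp of the raw network output."""
    a_w, a_h = anchor
    return a_w * math.exp(t_w), a_h * math.exp(t_h)

# Raw outputs of zero reproduce the anchor shape itself.
print(decode_size(0.0, 0.0, anchors[0]))  # -> (1.19, 1.99)
```

This is why good anchors matter: when the predefined shapes already resemble typical objects in the dataset, the network only has to learn small corrections rather than box sizes from scratch.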