Revolutionary Object Detection Algorithm from Facebook AI

A 2020 approach to state of the art object detection

A detailed breakdown of “Detection Transformer”, or DETR, an unorthodox object detection algorithm from Facebook AI, released in May 2020.

Let’s jump right in.

Object detection is one of the most heavily researched fields in Artificial Intelligence, with major universities and companies publishing papers every year, and the research has pivoted to new approaches and techniques to push accuracy even further.

In the initial stage of object detection research, researchers focused on improving the image features coming out of the backbone. The idea was to create a backbone architecture that produces the most suitable features for the downstream box classification and prediction.

Furthermore, the introduction of FPNs, ResNet modules, Inception modules, etc. brought the idea of effectively routing features so they can be refined further in later layers, which increased the accuracy of object detection models again. Afterwards, researchers started focusing on the efficiency side of the task: one-stage detectors such as YOLO and SSD, along with lightweight backbones such as MobileNet and SqueezeNet, made real-time detection practical.

Now, after years of research, object detection has turned into a beautiful mess of many moving parts. Complex object detectors of today offer high accuracy, real-time inference, small-object prediction in low-resolution images, and much more that was unheard of less than a decade ago.

Still, object detection lacks the simplicity of the classification task in terms of training, testing, and a unified architecture that generalizes well with limited parameters.

In DETR, the object detection problem is modeled as a direct set prediction problem.

The approach doesn't require hand-crafted components like non-maximum suppression or anchor generation that explicitly encode our prior knowledge about the task. This makes the detection pipeline a simple, end-to-end unified architecture.

The two novel components of the new framework, called DEtection TRansformer or DETR, are a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture.[1]

Given a fixed small set of learned object queries, DETR reasons about the relations of the objects and the global image context to directly output the final set of predictions in parallel.[1]

Given an image, the model must predict an unordered set (or list) of all the objects present, each represented by its class, along with a tight bounding box surrounding each one. The transformer acts as a reasoning agent between the image features and the predictions.[2]

The paper ‘Attention Is All You Need’ introduced a novel architecture called the Transformer. As the title indicates, it relies on the attention mechanism.

Like the LSTM, the Transformer is an architecture for transforming one sequence into another with the help of two parts (an Encoder and a Decoder), but it differs from previously existing sequence-to-sequence models in that it does not use any recurrent networks (GRU, LSTM, etc.).[3]

Fig 1 : Architecture of Transformer From ‘Attention Is All You Need’ by Vaswani et al.

The Encoder is on the left and the Decoder is on the right. Both are composed of modules that can be stacked on top of each other multiple times, which is indicated by Nx in the figure. The modules consist mainly of Multi-Head Attention and Feed Forward layers.[3]

The DETR pipeline

It contains three main components:

1. CNN backbone to extract a compact feature representation

2. An encoder-decoder transformer

3. A simple feed forward network (FFN) that makes the final detection prediction.

The authors have used ResNet-50 as the backbone of DETR. Ideally, any backbone can be used, depending upon the complexity of the task at hand.

The backbone provides a lower-resolution activation map of the image (in the paper, C = 2048 channels at H/32 × W/32) containing refined features.
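To make this concrete, here is a minimal sketch (not the authors' exact code) of extracting such an activation map with a torchvision ResNet-50, dropping its pooling and classification head:

```python
import torch
import torchvision

# Sketch: a ResNet-50 trunk without its avgpool/fc head acts as the backbone.
backbone = torchvision.models.resnet50(pretrained=True)
backbone = torch.nn.Sequential(*list(backbone.children())[:-2])

image = torch.rand(1, 3, 800, 1056)   # one dummy RGB image (batch, 3, H0, W0)
features = backbone(image)            # (1, 2048, H0/32, W0/32) activation map
print(features.shape)                 # torch.Size([1, 2048, 25, 33])
```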

Before we move on to the details of the transformer encoder and decoder, I recommend going through this great explanation of transformers.[3]

Fig 4 : Transformer Architecture in DETR [1]

As you can see, it is very similar to the original transformer block, with minor differences adapted to this task.

First, a 1×1 convolution reduces the channel dimension of the high-level activation map from C to a smaller dimension d, creating a new d×H×W feature map. The encoder expects a sequence as input, so the spatial dimensions are collapsed into one, resulting in a d×HW feature map.
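A minimal sketch of that projection and flattening step (dimensions follow the paper: C = 2048, d = 256):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(2048, 256, kernel_size=1)   # 1x1 conv: C -> d

features = torch.rand(1, 2048, 25, 33)       # (batch, C, H, W) from the backbone
h = conv(features)                           # (batch, d, H, W)
seq = h.flatten(2).permute(2, 0, 1)          # (H*W, batch, d): one token per cell
print(seq.shape)                             # torch.Size([825, 1, 256])
```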

Each encoder layer has a standard architecture and consists of a multi-head self-attention module and a feed forward network (FFN).

Since the transformer architecture is permutation-invariant, they supplement it with fixed positional encodings that are added to the input of each attention layer.
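The paper's demo code uses a simplified learned variant of these encodings; here is a sketch of that idea (the grid size of 50 and d = 256 follow the demo, while the full model uses fixed sinusoidal encodings):

```python
import torch
import torch.nn as nn

d = 256
row_embed = nn.Parameter(torch.rand(50, d // 2))   # learned y-position features
col_embed = nn.Parameter(torch.rand(50, d // 2))   # learned x-position features

H, W = 25, 33
pos = torch.cat([
    col_embed[:W].unsqueeze(0).repeat(H, 1, 1),    # (H, W, d/2): x positions
    row_embed[:H].unsqueeze(1).repeat(1, W, 1),    # (H, W, d/2): y positions
], dim=-1).flatten(0, 1).unsqueeze(1)              # (H*W, 1, d), added to tokens
```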

The decoder follows the standard architecture of the transformer, transforming N embeddings of size d using multi-headed self- and encoder-decoder attention mechanisms.

The difference from the original transformer is that the DETR model decodes the N objects in parallel at each decoder layer.

These N input embeddings are learnt positional encodings that the authors refer to as object queries, and similarly to the encoder, they are added to the input of each attention layer.

The N object queries are transformed into output embeddings by the decoder. They are then independently decoded into box coordinates and class labels by a feed forward network (FFN), resulting in N final predictions.

The decoder receives queries (initially set to zero), output positional encoding (object queries), and encoder memory, and produces the final set of predicted class labels and bounding boxes through multiple multi-head self-attention and decoder-encoder attention. The first self-attention layer in the first decoder layer can be skipped.[1]
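A minimal sketch of this encoder-decoder with learned object queries, using PyTorch's stock nn.Transformer as the paper's demo code does (N = 100 follows the paper; the image tokens here are random placeholders):

```python
import torch
import torch.nn as nn

d, N = 256, 100
transformer = nn.Transformer(d_model=d, nhead=8,
                             num_encoder_layers=6, num_decoder_layers=6)
query_pos = nn.Parameter(torch.rand(N, d))     # N learned object queries

seq = torch.rand(825, 1, d)                    # (H*W, batch, d) image tokens
hs = transformer(seq, query_pos.unsqueeze(1))  # (N, batch, d) output embeddings
```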

The FFN is a 3-layer perceptron with ReLU activation function and hidden dimension d, followed by a linear projection layer. FFN layers are effectively multi-layer 1×1 convolutions, which have Md input and output channels.[1]

The FFN predicts the normalized center coordinates, height and width of the box w.r.t. the input image, and the linear layer predicts the class label using a softmax function.
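A minimal sketch of these two heads (91 COCO classes plus the ∅ slot; the variable names are mine):

```python
import torch
import torch.nn as nn

d, N, num_classes = 256, 100, 91
class_head = nn.Linear(d, num_classes + 1)   # +1 for the "no object" class
box_head = nn.Sequential(                    # 3-layer MLP with hidden dim d
    nn.Linear(d, d), nn.ReLU(),
    nn.Linear(d, d), nn.ReLU(),
    nn.Linear(d, 4),
)

hs = torch.rand(N, 1, d)                     # decoder output embeddings
logits = class_head(hs)                      # (N, 1, num_classes + 1)
boxes = box_head(hs).sigmoid()               # (N, 1, 4): normalized cx, cy, w, h
```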

Since we predict a fixed-size set of N bounding boxes, where N is usually much larger than the actual number of objects of interest in an image, an additional special class label ∅ is used to represent that no object is detected within a slot. This class plays a similar role to the “background” class in the standard object detection approaches.[1]

The following explanation may seem like a lot to grasp, but trust me: read it carefully and it boils down to two simple steps.

A short detour into what bipartite matching is, courtesy of GeeksForGeeks:

Fig 5 : Bipartite Matching [4]

The loss function is built on an optimal bipartite matching between predictions and ground-truth objects. Allow me to simplify it.

First, the ground-truth set is padded with ∅ (no object) so that both sets have the same size N. Next, they find a bipartite matching between these two sets by searching for the permutation σ of N elements with the lowest total cost, as follows:

Fig 6 : Best match between pred and gt with lowest cost [1]
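For reference, the matching objective shown in Fig 6, reconstructed in LaTeX from the paper:

```latex
\hat{\sigma} = \underset{\sigma \in \mathfrak{S}_N}{\arg\min}
               \sum_{i}^{N} \mathcal{L}_{\mathrm{match}}\big(y_i, \hat{y}_{\sigma(i)}\big)
```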

where L_match(y_i, ŷ_σ(i)) is a pair-wise matching cost between the ground truth y_i and the prediction with index σ(i). It is formulated as an assignment problem with m ground-truth objects and n predictions, and is computed efficiently with the Hungarian algorithm over the m×n cost matrix.
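A minimal sketch of that assignment step (the official implementation does this inside its HungarianMatcher; the cost values here are random placeholders):

```python
import torch
from scipy.optimize import linear_sum_assignment

m, N = 3, 100            # m ground-truth objects, N predictions
cost = torch.rand(m, N)  # L_match for every (ground truth, prediction) pair

gt_idx, pred_idx = linear_sum_assignment(cost.numpy())
# gt_idx[k] is matched to pred_idx[k]; all other predictions get the ∅ class.
```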

Each element i of the ground-truth set can be seen as y_i = (c_i, b_i), where c_i is the target class label (which may be ∅) and b_i ∈ [0, 1]⁴ is a vector with four attributes: the normalized ground-truth box center coordinates, and its height and width relative to the image size. For the prediction with index σ(i), we define the probability of class c_i as p̂_σ(i)(c_i) and the predicted box as b̂_σ(i). The first part of the loss takes care of the class prediction, and the second part is the loss for the box prediction. The matching cost is defined as follows:[1]

Fig 7 : Matching Cost Function [1]
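The cost shown in Fig 7, reconstructed in LaTeX from the paper:

```latex
\mathcal{L}_{\mathrm{match}}\big(y_i, \hat{y}_{\sigma(i)}\big) =
  -\mathbb{1}_{\{c_i \neq \varnothing\}}\, \hat{p}_{\sigma(i)}(c_i)
  + \mathbb{1}_{\{c_i \neq \varnothing\}}\, \mathcal{L}_{\mathrm{box}}\big(b_i, \hat{b}_{\sigma(i)}\big)
```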

After receiving all matched pairs for the set, the next step is to compute the loss function, the Hungarian loss.

Fig 8 : Hungarian Loss between pred and gt [1]
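The loss shown in Fig 8, reconstructed in LaTeX from the paper:

```latex
\mathcal{L}_{\mathrm{Hungarian}}(y, \hat{y}) =
  \sum_{i=1}^{N} \Big[ -\log \hat{p}_{\hat{\sigma}(i)}(c_i)
  + \mathbb{1}_{\{c_i \neq \varnothing\}}\, \mathcal{L}_{\mathrm{box}}\big(b_i, \hat{b}_{\hat{\sigma}(i)}\big) \Big]
```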

where σ̂ is the optimal assignment (best match) computed in the matching step above.

It computes the negative log-likelihood of the class predictions over all N matched pairs, plus a box loss, penalizing extra and incorrect boxes and the classifications corresponding to them. This part is the same as in most other object detectors out there.

In the paper, the authors down-weight the log-probability term when c_i = ∅ (no object) by a factor of 10 to account for class imbalance. This is similar to how Faster R-CNN and other two-stage detectors account for the positive-to-negative imbalance ratio.[1]
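A minimal sketch of that down-weighting (the official implementation calls this factor eos_coef = 0.1; the logits and targets here are placeholders):

```python
import torch
import torch.nn.functional as F

num_classes = 91
class_weights = torch.ones(num_classes + 1)
class_weights[-1] = 0.1                    # down-weight the ∅ (no object) class

logits = torch.rand(100, num_classes + 1)  # (N, num_classes + 1) for one image
targets = torch.full((100,), num_classes, dtype=torch.long)  # every slot set to ∅
loss = F.cross_entropy(logits, targets, weight=class_weights)
```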

The paper uses a linear combination of the L1 loss and the Generalized IoU (GIoU) loss, which is scale-invariant in nature.

Fig 9 : Box Loss Function [1]
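The box loss shown in Fig 9, reconstructed in LaTeX from the paper (with hyper-parameters λ_iou and λ_L1):

```latex
\mathcal{L}_{\mathrm{box}}\big(b_i, \hat{b}_{\sigma(i)}\big) =
  \lambda_{\mathrm{iou}}\, \mathcal{L}_{\mathrm{iou}}\big(b_i, \hat{b}_{\sigma(i)}\big)
  + \lambda_{\mathrm{L1}}\, \big\| b_i - \hat{b}_{\sigma(i)} \big\|_1
```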

This loss helps predict the box directly, without any anchor reference or scaling issues. The two losses are normalized by the number of objects inside the batch.
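A minimal sketch of this box loss on already-matched pairs, using torchvision ops (the weights 5.0 and 2.0 are the paper's defaults for λ_L1 and λ_iou; the boxes here are random placeholders):

```python
import torch
import torch.nn.functional as F
from torchvision.ops import box_convert, generalized_box_iou

pred = torch.rand(3, 4)                # matched predictions, normalized cxcywh
gt = torch.rand(3, 4)                  # matched ground truth, normalized cxcywh
num_boxes = 3                          # objects in the batch

l1 = F.l1_loss(pred, gt, reduction='sum')
giou = torch.diag(generalized_box_iou( # pairwise GIoU; keep the matched diagonal
    box_convert(pred, 'cxcywh', 'xyxy'),
    box_convert(gt, 'cxcywh', 'xyxy'),
))
loss_box = (5.0 * l1 + 2.0 * (1.0 - giou).sum()) / num_boxes
```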

Fig 10 : DETR Results on COCO [1]

The authors have provided the inference code (fewer than 50 lines of PyTorch) on the last page of the paper.[1]

Fig 11 : DETR Inference code [1]
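If you want to try DETR without retyping that figure, here is a sketch of running the pretrained model through torch.hub (the official repo exposes this entry point; 'image.jpg' and the 0.7 threshold are placeholders):

```python
import torch
import torchvision.transforms as T
from PIL import Image

model = torch.hub.load('facebookresearch/detr', 'detr_resnet50', pretrained=True)
model.eval()

transform = T.Compose([
    T.Resize(800),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
img = transform(Image.open('image.jpg').convert('RGB')).unsqueeze(0)

with torch.no_grad():
    out = model(img)

probs = out['pred_logits'].softmax(-1)[0, :, :-1]  # drop the ∅ column
keep = probs.max(-1).values > 0.7                  # keep confident slots only
boxes = out['pred_boxes'][0, keep]                 # normalized cx, cy, w, h
```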

[3] https://medium.com/inside-machine-learning/what-is-a-transformer-d07dd1fbec04
