In this post, I will discuss my experience deploying and testing the YOLOv3 object detection neural network. YOLO[1] (You Only Look Once) is a state-of-the-art, well-known object detection network in the field of computer vision. The project originated in 2016, and several revisions have led to the current version, YOLOv3[2]. By now there are many variants that differ in model depth and therefore trade off accuracy against speed.
YOLO Introduction
What makes YOLO stand out among object detection algorithms is that it's fast, and it performs quite well. As shown in Figure 1, YOLO sits in the top-left corner of the plot, far from the other recent methods, which means it is much more efficient at the same mAP (mean Average Precision).
The core concept behind YOLO is end-to-end evaluation: it scans the whole image once, divides it into sub-regions, and predicts the objects along with their bounding boxes in a single forward pass.
As shown in Figure 2, YOLO v1 has 24 convolutional layers and 2 fully connected layers, where the convolutional layers are dedicated to feature extraction and the FC layers to bounding box prediction. The basic procedure can be summarized in the following points:
- Resize the image to a fixed size.
- Divide the image into a 7*7 grid. Each grid cell is responsible for detecting objects whose centers fall inside that cell.
- Each grid cell proposes two bounding boxes, where each bounding box consists of five values: x, y, w, h, and confidence.
- Each grid cell also predicts a length-20 vector of class probabilities, so each cell outputs 2*5+20=30 values.
- Finally, non-maximum suppression (NMS) is used to sift the bounding boxes (a minimal sketch follows this list):
- Set a confidence threshold and keep only the bounding boxes whose confidence is above it.
- Then, for each category, add the bounding box with the highest confidence to the output list.
- For the remaining bounding boxes in that category, remove any box that overlaps too much (IoU above a threshold) with the highest-confidence box added above; otherwise, keep it and add it to the output list.
- Repeat the two steps above for all other categories to obtain the final output list.
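The per-class NMS procedure described above can be sketched in a few lines of Python/NumPy. This is only an illustrative sketch (the threshold values and box format are my own choices, not Darknet's code):

import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all given as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms_per_class(boxes, scores, conf_thresh=0.25, iou_thresh=0.5):
    """Keep the most confident box, drop boxes that overlap it heavily, repeat."""
    keep = []
    idx = np.where(scores >= conf_thresh)[0]      # step 1: confidence threshold
    idx = idx[np.argsort(-scores[idx])]           # sort by confidence, descending
    while idx.size > 0:
        best = idx[0]
        keep.append(best)                         # step 2: take the most confident box
        rest = idx[1:]
        overlaps = iou(boxes[best], boxes[rest])
        idx = rest[overlaps < iou_thresh]         # step 3: drop heavily overlapping boxes
    return keep

Running this once per category reproduces the sifting procedure in the list above.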
Although YOLO v1 is fast, it has deficiencies in accuracy and recall. YOLO v2[3] addressed those issues (Figure 3). What was added in v2:
- Adopted Darknet-19 and removed the FC layers, so the network can accept images of different resolutions.
- Added batch normalization after the convolutions (see the sketch after this list), which provides a regularization effect, speeds up convergence, and helps avoid over-fitting.
- Use high-resolution data for multi-scale training.
- Other techniques like fine-grained features, direct location prediction, etc.
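To make the batch normalization point concrete, here is a minimal NumPy sketch of inference-time batch normalization applied to a convolution's output feature map. The shapes, epsilon, and parameter values are illustrative and not taken from Darknet:

import numpy as np

def batch_norm(x, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch normalization over the channel axis.
    x: feature map of shape (N, H, W, C); gamma, beta, mean, var: per-channel arrays of shape (C,)."""
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

# toy usage: normalize a hypothetical (1, 4, 4, 16) convolution output
feat = np.random.randn(1, 4, 4, 16).astype(np.float32)
normed = batch_norm(feat, gamma=np.ones(16), beta=np.zeros(16),
                    mean=feat.mean(axis=(0, 1, 2)), var=feat.var(axis=(0, 1, 2)))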
YOLO v2 fixed the deficiencies of v1 but still could not handle problems like detecting overlapping objects well. Hence we have YOLO v3, which utilizes Darknet-53 (Figure 4). Darknet-53 achieves higher measured floating-point operations per second than ResNet, so it makes efficient use of the GPU. YOLO v3 also predicts at three different scales, from layers of different sizes, and makes the final prediction based on the combined features from those layers. Another improvement is using independent logistic classifiers instead of a softmax, which makes it possible to assign multiple class labels to one object (see the toy example below).
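To illustrate that last point with a toy example (this is just an illustration, not Darknet's actual code): with independent logistic (sigmoid) outputs, several classes can exceed the decision threshold at once, while a softmax forces the class scores to compete and sum to one.

import numpy as np

logits = np.array([2.1, 1.8, -3.0])            # raw scores for, say, "person", "woman", "car"

sigmoid = 1.0 / (1.0 + np.exp(-logits))        # independent per-class probabilities
softmax = np.exp(logits) / np.exp(logits).sum()

print(sigmoid > 0.5)   # [ True  True False] -> multi-label: both "person" and "woman" pass
print(softmax)         # [~0.57 ~0.42 ~0.00] -> scores compete and sum to 1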
All in all, it's thrilling to test a state-of-the-art neural network, and I'm curious to see how it performs on some datasets (e.g., KITTI). Next, I'll start deploying this research work on my desktop.
Desktop Setup
My machine runs Ubuntu Linux 19.10 with an Nvidia RTX 2080 Ti graphics card and an i9-9900KS CPU. The system already has OpenCV and CUDA installed; both are optional dependencies of YOLO used for faster performance and video detection. My target is to run the YOLO neural network in real time.
Dataset
Since I already have the KITTI dataset stored locally, it is convenient to use it here. KITTI is recorded by well-equipped autonomous vehicles using grayscale and color stereo cameras, a lidar, and an IMU. The image quality is about average; in my opinion the color is not great compared with more recent sensors. I downloaded the raw dataset recorded on 2011/09/26, drives 0001, 0005, 0011, 0013, 0014, 0017, 0018, 0048, 0051, and 0056. For the testing, I concatenated those raw camera frames into two 3:29-long videos, one grayscale and one color.
YOLO Compilation
Instructions for installing YOLO are easy to follow, and all the information is on the author's website[4][5].
In fact, we are going to compile an open-source neural network tool called Darknet, which implements several object detection architectures, including our target, YOLOv3, and its variants. Here, I will share the steps I used to compile the YOLO Darknet and run the tests.
Because I need to run the tests on the GPU and the visualization function needs the OpenCV library, I compiled Darknet with CUDA and OpenCV enabled. However, the steps for installing CUDA and OpenCV are not included, since I already have them on my machine.
Another issue I encountered is a compatibility problem[6]. Due to the lack of maintenance of the Darknet project, the code in the author's repo is outdated. The original code cannot be compiled with OpenCV 4.2, because it uses the old, deprecated C library API. Luckily, I found a fork[7] that is still being updated and works with the latest OpenCV; it adds more functionality as well. Hence, the following commands are based on tiagoshibata's version on the master branch.
# Clone the repo to local at any place you want
git clone https://github.com/tiagoshibata/darknet.git
# go to the folder we just downloaded
cd darknet
As mentioned, we will compile Darknet with CUDA and OpenCV. Thus, we need to change two lines in the "Makefile" in the "darknet" folder. After saving the changes, we use the "make" command to compile Darknet.
# change the “Makefile” in the folder
vim Makefile
# change at line 1
GPU=1
# change at line 4
OPENCV=1
# compile Darknet
make
You will see terminal output like this if everything is okay:
mkdir -p obj
gcc -I/usr/local/cuda/include/ -Wall -Wfatal-errors -Ofast....
gcc -I/usr/local/cuda/include/ -Wall -Wfatal-errors -Ofast....
gcc -I/usr/local/cuda/include/ -Wall -Wfatal-errors -Ofast....
.....
gcc -I/usr/local/cuda/include/ -Wall -Wfatal-errors -Ofast -lm....
Here I hit the same error I saw when compiling OpenCV with CUDA: gcc versions newer than 7 are not supported, which means the installed "gcc" and "g++" are too new for CUDA 10[8]. This issue can be solved by switching the default "gcc" and "g++" versions (e.g., with update-alternatives), as I described in my earlier notes[8].
The Darknet’s compilation is done and as simple as one actual command “make” if your system met all prerequisites.
Experiments & Testing
Now that we have the tool and the data, let's look at how the tool can be used.
- If you want to perform detection on a single image, simply run:
./darknet detect cfg/yolov3.cfg yolov3.weights data/dog.jpg
You can change the network configuration in the "cfg" folder and the image to detect in the "data" folder. The corresponding network weights must also be downloaded from the YOLO homepage if you don't want to train the network from scratch. The same rules apply to all the following commands.
- If you want to detect multiple images, change the command to:
./darknet detect cfg/yolov3.cfg yolov3.weights
The terminal will load the model and prompt you to enter an image path, one image at a time. It will prompt again after each detection; you can exit with "Ctrl+C" at any time.
- If you want to perform detection with your webcam, run:
./darknet detector demo cfg/coco.data cfg/yolov3.cfg yolov3.weights
Additionally, you can use "-c <num>" to specify which camera to use, where 0 is the default camera in OpenCV.
- If you want to detect objects in a video stream, add the file name after the previous command:
./darknet detector demo cfg/coco.data cfg/yolov3.cfg yolov3.weights <video file>
This is exactly the command we need for our experiment.
Note that the last two commands need CUDA and OpenCV enabled, while the first two don't need any dependencies. If we run any of the four commands above, the program first builds the neural network and prints its structure in the terminal window.
batch = 1, time_steps = 1, train = 0
layer filters size/strd(dil) input output
0 conv 32 3 x 3/ 1 608 x 608 x 3 -> 608 x 608 x 32 0.639 BF
1 conv 64 3 x 3/ 2 608 x 608 x 32 -> 304 x 304 x 64 3.407 BF
2 conv 32 1 x 1/ 1 304 x 304 x 64 -> 304 x 304 x 32 0.379 BF
3 conv 64 3 x 3/ 1 304 x 304 x 32 -> 304 x 304 x 64 3.407 BF
4 Shortcut Layer: 1, wt = 0, wn = 0, outputs: 304 x 304 x 64 0.006 BF
5 conv 128 3 x 3/ 2 304 x 304 x 64 -> 152 x 152 x 128 3.407 BF
6 conv 64 1 x 1/ 1 152 x 152 x 128 -> 152 x 152 x 64 0.379 BF
7 conv 128 3 x 3/ 1 152 x 152 x 64 -> 152 x 152 x 128 3.407 BF
8 Shortcut Layer: 5, wt = 0, wn = 0, outputs: 152 x 152 x 128 0.003 BF
9 conv 64 1 x 1/ 1 152 x 152 x 128 -> 152 x 152 x 64 0.379 BF
10 conv 128 3 x 3/ 1 152 x 152 x 64 -> 152 x 152 x 128 3.407 BF
11 Shortcut Layer: 8, wt = 0, wn = 0, outputs: 152 x 152 x 128 0.003 BF
12 conv 256 3 x 3/ 2 152 x 152 x 128 -> 76 x 76 x 256 3.407 BF
13 conv 128 1 x 1/ 1 76 x 76 x 256 -> 76 x 76 x 128 0.379 BF
14 conv 256 3 x 3/ 1 76 x 76 x 128 -> 76 x 76 x 256 3.407 BF
15 Shortcut Layer: 12, wt = 0, wn = 0, outputs: 76 x 76 x 256 0.001 BF
16 conv 128 1 x 1/ 1 76 x 76 x 256 -> 76 x 76 x 128 0.379 BF
17 conv 256 3 x 3/ 1 76 x 76 x 128 -> 76 x 76 x 256 3.407 BF
18 Shortcut Layer: 15, wt = 0, wn = 0, outputs: 76 x 76 x 256 0.001 BF
19 conv 128 1 x 1/ 1 76 x 76 x 256 -> 76 x 76 x 128 0.379 BF
20 conv 256 3 x 3/ 1 76 x 76 x 128 -> 76 x 76 x 256 3.407 BF
21 Shortcut Layer: 18, wt = 0, wn = 0, outputs: 76 x 76 x 256 0.001 BF
22 conv 128 1 x 1/ 1 76 x 76 x 256 -> 76 x 76 x 128 0.379 BF
23 conv 256 3 x 3/ 1 76 x 76 x 128 -> 76 x 76 x 256 3.407 BF
24 Shortcut Layer: 21, wt = 0, wn = 0, outputs: 76 x 76 x 256 0.001 BF
25 conv 128 1 x 1/ 1 76 x 76 x 256 -> 76 x 76 x 128 0.379 BF
26 conv 256 3 x 3/ 1 76 x 76 x 128 -> 76 x 76 x 256 3.407 BF
27 Shortcut Layer: 24, wt = 0, wn = 0, outputs: 76 x 76 x 256 0.001 BF
28 conv 128 1 x 1/ 1 76 x 76 x 256 -> 76 x 76 x 128 0.379 BF
29 conv 256 3 x 3/ 1 76 x 76 x 128 -> 76 x 76 x 256 3.407 BF
30 Shortcut Layer: 27, wt = 0, wn = 0, outputs: 76 x 76 x 256 0.001 BF
31 conv 128 1 x 1/ 1 76 x 76 x 256 -> 76 x 76 x 128 0.379 BF
32 conv 256 3 x 3/ 1 76 x 76 x 128 -> 76 x 76 x 256 3.407 BF
33 Shortcut Layer: 30, wt = 0, wn = 0, outputs: 76 x 76 x 256 0.001 BF
34 conv 128 1 x 1/ 1 76 x 76 x 256 -> 76 x 76 x 128 0.379 BF
35 conv 256 3 x 3/ 1 76 x 76 x 128 -> 76 x 76 x 256 3.407 BF
36 Shortcut Layer: 33, wt = 0, wn = 0, outputs: 76 x 76 x 256 0.001 BF
37 conv 512 3 x 3/ 2 76 x 76 x 256 -> 38 x 38 x 512 3.407 BF
38 conv 256 1 x 1/ 1 38 x 38 x 512 -> 38 x 38 x 256 0.379 BF
39 conv 512 3 x 3/ 1 38 x 38 x 256 -> 38 x 38 x 512 3.407 BF
40 Shortcut Layer: 37, wt = 0, wn = 0, outputs: 38 x 38 x 512 0.001 BF
41 conv 256 1 x 1/ 1 38 x 38 x 512 -> 38 x 38 x 256 0.379 BF
42 conv 512 3 x 3/ 1 38 x 38 x 256 -> 38 x 38 x 512 3.407 BF
43 Shortcut Layer: 40, wt = 0, wn = 0, outputs: 38 x 38 x 512 0.001 BF
44 conv 256 1 x 1/ 1 38 x 38 x 512 -> 38 x 38 x 256 0.379 BF
45 conv 512 3 x 3/ 1 38 x 38 x 256 -> 38 x 38 x 512 3.407 BF
46 Shortcut Layer: 43, wt = 0, wn = 0, outputs: 38 x 38 x 512 0.001 BF
47 conv 256 1 x 1/ 1 38 x 38 x 512 -> 38 x 38 x 256 0.379 BF
48 conv 512 3 x 3/ 1 38 x 38 x 256 -> 38 x 38 x 512 3.407 BF
49 Shortcut Layer: 46, wt = 0, wn = 0, outputs: 38 x 38 x 512 0.001 BF
50 conv 256 1 x 1/ 1 38 x 38 x 512 -> 38 x 38 x 256 0.379 BF
51 conv 512 3 x 3/ 1 38 x 38 x 256 -> 38 x 38 x 512 3.407 BF
52 Shortcut Layer: 49, wt = 0, wn = 0, outputs: 38 x 38 x 512 0.001 BF
53 conv 256 1 x 1/ 1 38 x 38 x 512 -> 38 x 38 x 256 0.379 BF
54 conv 512 3 x 3/ 1 38 x 38 x 256 -> 38 x 38 x 512 3.407 BF
55 Shortcut Layer: 52, wt = 0, wn = 0, outputs: 38 x 38 x 512 0.001 BF
56 conv 256 1 x 1/ 1 38 x 38 x 512 -> 38 x 38 x 256 0.379 BF
57 conv 512 3 x 3/ 1 38 x 38 x 256 -> 38 x 38 x 512 3.407 BF
58 Shortcut Layer: 55, wt = 0, wn = 0, outputs: 38 x 38 x 512 0.001 BF
59 conv 256 1 x 1/ 1 38 x 38 x 512 -> 38 x 38 x 256 0.379 BF
60 conv 512 3 x 3/ 1 38 x 38 x 256 -> 38 x 38 x 512 3.407 BF
61 Shortcut Layer: 58, wt = 0, wn = 0, outputs: 38 x 38 x 512 0.001 BF
62 conv 1024 3 x 3/ 2 38 x 38 x 512 -> 19 x 19 x1024 3.407 BF
63 conv 512 1 x 1/ 1 19 x 19 x1024 -> 19 x 19 x 512 0.379 BF
64 conv 1024 3 x 3/ 1 19 x 19 x 512 -> 19 x 19 x1024 3.407 BF
65 Shortcut Layer: 62, wt = 0, wn = 0, outputs: 19 x 19 x1024 0.000 BF
66 conv 512 1 x 1/ 1 19 x 19 x1024 -> 19 x 19 x 512 0.379 BF
67 conv 1024 3 x 3/ 1 19 x 19 x 512 -> 19 x 19 x1024 3.407 BF
68 Shortcut Layer: 65, wt = 0, wn = 0, outputs: 19 x 19 x1024 0.000 BF
69 conv 512 1 x 1/ 1 19 x 19 x1024 -> 19 x 19 x 512 0.379 BF
70 conv 1024 3 x 3/ 1 19 x 19 x 512 -> 19 x 19 x1024 3.407 BF
71 Shortcut Layer: 68, wt = 0, wn = 0, outputs: 19 x 19 x1024 0.000 BF
72 conv 512 1 x 1/ 1 19 x 19 x1024 -> 19 x 19 x 512 0.379 BF
73 conv 1024 3 x 3/ 1 19 x 19 x 512 -> 19 x 19 x1024 3.407 BF
74 Shortcut Layer: 71, wt = 0, wn = 0, outputs: 19 x 19 x1024 0.000 BF
75 conv 512 1 x 1/ 1 19 x 19 x1024 -> 19 x 19 x 512 0.379 BF
76 conv 1024 3 x 3/ 1 19 x 19 x 512 -> 19 x 19 x1024 3.407 BF
77 conv 512 1 x 1/ 1 19 x 19 x1024 -> 19 x 19 x 512 0.379 BF
78 max 5x 5/ 1 19 x 19 x 512 -> 19 x 19 x 512 0.005 BF
79 route 77 -> 19 x 19 x 512
80 max 9x 9/ 1 19 x 19 x 512 -> 19 x 19 x 512 0.015 BF
81 route 77 -> 19 x 19 x 512
82 max 13x13/ 1 19 x 19 x 512 -> 19 x 19 x 512 0.031 BF
83 route 82 80 78 77 -> 19 x 19 x2048
84 conv 512 1 x 1/ 1 19 x 19 x2048 -> 19 x 19 x 512 0.757 BF
85 conv 1024 3 x 3/ 1 19 x 19 x 512 -> 19 x 19 x1024 3.407 BF
86 conv 512 1 x 1/ 1 19 x 19 x1024 -> 19 x 19 x 512 0.379 BF
87 conv 1024 3 x 3/ 1 19 x 19 x 512 -> 19 x 19 x1024 3.407 BF
88 conv 255 1 x 1/ 1 19 x 19 x1024 -> 19 x 19 x 255 0.189 BF
89 yolo
[yolo] params: iou loss: mse (2), iou_norm: 0.75, cls_norm: 1.00, scale_x_y: 1.00
90 route 86 -> 19 x 19 x 512
91 conv 256 1 x 1/ 1 19 x 19 x 512 -> 19 x 19 x 256 0.095 BF
92 upsample 2x 19 x 19 x 256 -> 38 x 38 x 256
93 route 92 61 -> 38 x 38 x 768
94 conv 256 1 x 1/ 1 38 x 38 x 768 -> 38 x 38 x 256 0.568 BF
95 conv 512 3 x 3/ 1 38 x 38 x 256 -> 38 x 38 x 512 3.407 BF
96 conv 256 1 x 1/ 1 38 x 38 x 512 -> 38 x 38 x 256 0.379 BF
97 conv 512 3 x 3/ 1 38 x 38 x 256 -> 38 x 38 x 512 3.407 BF
98 conv 256 1 x 1/ 1 38 x 38 x 512 -> 38 x 38 x 256 0.379 BF
99 conv 512 3 x 3/ 1 38 x 38 x 256 -> 38 x 38 x 512 3.407 BF
100 conv 255 1 x 1/ 1 38 x 38 x 512 -> 38 x 38 x 255 0.377 BF
101 yolo
[yolo] params: iou loss: mse (2), iou_norm: 0.75, cls_norm: 1.00, scale_x_y: 1.00
102 route 98 -> 38 x 38 x 256
103 conv 128 1 x 1/ 1 38 x 38 x 256 -> 38 x 38 x 128 0.095 BF
104 upsample 2x 38 x 38 x 128 -> 76 x 76 x 128
105 route 104 36 -> 76 x 76 x 384
106 conv 128 1 x 1/ 1 76 x 76 x 384 -> 76 x 76 x 128 0.568 BF
107 conv 256 3 x 3/ 1 76 x 76 x 128 -> 76 x 76 x 256 3.407 BF
108 conv 128 1 x 1/ 1 76 x 76 x 256 -> 76 x 76 x 128 0.379 BF
109 conv 256 3 x 3/ 1 76 x 76 x 128 -> 76 x 76 x 256 3.407 BF
110 conv 128 1 x 1/ 1 76 x 76 x 256 -> 76 x 76 x 128 0.379 BF
111 conv 256 3 x 3/ 1 76 x 76 x 128 -> 76 x 76 x 256 3.407 BF
112 conv 255 1 x 1/ 1 76 x 76 x 256 -> 76 x 76 x 255 0.754 BF
113 yolo
[yolo] params: iou loss: mse (2), iou_norm: 0.75, cls_norm: 1.00, scale_x_y: 1.00
Total BFLOPS 141.531
avg_outputs = 1083728
Allocate additional workspace_size = 106.46 MB
Loading weights from weights/yolov3-spp.weights...
seen 64, trained: 32013 K-images (500 Kilo-batches_64)
Done! Loaded 114 layers from weights-file
As shown in the terminal output, we're going to use the YOLOv3-spp network for our experiment, where SPP stands for spatial pyramid pooling. The SPP block produces an output of fixed dimensions regardless of the input's shape and aggregates the strongest responses from the feature maps at several scales. Hence, YOLOv3-spp has the best mAP among the variants listed on the homepage. But it's heavier than the others, requiring 141.5 BFLOPs (billions of floating-point operations) per forward pass. The weights of this network can be downloaded from the homepage as well (around 250 MB).
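As a side note, the SPP block in this model corresponds to layers 78-83 in the printout above: three stride-1 max-pooling operations with kernel sizes 5, 9, and 13, concatenated with the original 512-channel feature map to give 2048 channels. A rough NumPy/SciPy sketch of the idea (not Darknet's actual implementation, and ignoring padding details) looks like this:

import numpy as np
from scipy.ndimage import maximum_filter

def spp_block(feat):
    """Stride-1 max pooling at kernel sizes 5, 9, and 13, concatenated with the input.
    feat: feature map of shape (H, W, C); returns (H, W, 4*C)."""
    pooled = [feat]
    for k in (5, 9, 13):
        # pool over the two spatial dimensions only (size 1 on the channel axis)
        pooled.append(maximum_filter(feat, size=(k, k, 1)))
    return np.concatenate(pooled, axis=-1)

# toy check with the shapes from the printout above: 19 x 19 x 512 -> 19 x 19 x 2048
out = spp_block(np.random.randn(19, 19, 512).astype(np.float32))
print(out.shape)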
Besides the network structure, the output also contains the results of the detection task (shown below). Detecting an image with YOLOv3-spp on the RTX 2080 Ti is surprisingly fast: 348.4 milliseconds. Three objects were detected in the image, and their confidence scores are printed as well.
data/dog.jpg: Predicted in 348.431000 milli-seconds.
bicycle: 98%
dog: 97%
truck: 90%
Video Detection Test
Now, let's dive into video testing. I'm going to use two videos for the experiment: a grayscale one and a color one. The raw KITTI dataset is stored as "png" image files, so I need to make the two videos first. This can be done easily with OpenCV in Python; here's a code snippet from another project of mine. It was written for a different purpose, but it can save the videos for me.
from typing import Optional, Sequence

import cv2


def show_vid_arr(vid: Sequence, title: str = 'default', rate: int = 10,
                 out: Optional[str] = None, color: Optional[bool] = False) -> None:
    """
    Play a video from an ndarray sequence; press 'Esc' to quit the window.
    :param vid: sequence of video frames (ndarrays)
    :param title: window title
    :param rate: frame rate of the video in Hz
    :param out: if a file name is given, the played video is saved (MJPG codec) in the current folder
    :param color: if True, save a color video; only used when 'out' is given
    """
    delay = int(1 / rate * 1000)
    if out is not None and isinstance(out, str):
        if color:
            # color frames are (height, width, 3); VideoWriter expects (width, height)
            resolution = vid[0].shape[:2][::-1]
            writer = cv2.VideoWriter(out, cv2.VideoWriter_fourcc('M', 'J', 'P', 'G'), rate, resolution)
        else:
            # grayscale frames are (height, width); the final 0 means isColor=False
            resolution = vid[0].shape[::-1]
            writer = cv2.VideoWriter(out, cv2.VideoWriter_fourcc('M', 'J', 'P', 'G'), rate, resolution, 0)
        print('Writer object created. Target file name: %s, resolution: (%d, %d)' % ((out,) + resolution))
    else:
        writer = None
    for frame in vid:
        if color:
            # frames are expected in RGB; OpenCV displays and writes BGR
            frame = cv2.cvtColor(frame, cv2.COLOR_RGB2BGR)
        cv2.imshow(title, frame)
        if writer is not None:
            writer.write(frame)
        if cv2.waitKey(delay) & 0xFF == 27:  # 27 = Esc key
            break
    cv2.waitKey(1)
    cv2.destroyAllWindows()
    if writer is not None:
        writer.release()
        print('Writer object released.')
    # a few extra waitKey calls so the window actually closes on all platforms
    for i in range(1, 5):
        cv2.waitKey(1)
    return
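For completeness, here is a hypothetical usage sketch of the snippet above; the drive path is an assumption based on the standard KITTI raw folder layout, so adjust it to wherever your local copy lives:

import glob
import cv2

# hypothetical path: color camera (image_02) of drive 0001 recorded on 2011/09/26
frame_paths = sorted(glob.glob('kitti/2011_09_26/2011_09_26_drive_0001_sync/image_02/data/*.png'))
# cv2.imread returns BGR; convert to RGB because show_vid_arr converts back before writing
frames = [cv2.cvtColor(cv2.imread(p), cv2.COLOR_BGR2RGB) for p in frame_paths]

# KITTI was recorded at 10 Hz, so play and save at rate=10 (MJPG-encoded output file)
show_vid_arr(frames, title='kitti', rate=10, out='kitti_color.avi', color=True)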
Here I renamed the two videos to "kitti_color.mp4" and "kitti_gray.mp4". Then, I executed the command:
./darknet detector demo cfg/coco.data cfg/yolov3-spp.cfg weights/yolov3-spp.weights data/kitti_color.mp4 -out_filename res.avi
where "-out_filename" saves the results to the local disk as a video file called "res.avi".
First, let’s see how it performed on a colored video stream. I took some screenshots below.
As shown in the figures above, YOLOv3-spp is capable of detecting a variety of objects, even though KITTI focuses on traffic scenes. It accurately detected nearly all vehicles in an instant and also identified some trucks and bikes. It can detect stop signs and parking meters as well, although the network often has trouble telling them apart. Some combinations are detected in the video, such as a person riding a bicycle or a person wearing a backpack, and both objects in each combination are detected.
Nevertheless, the results also contain many false detections in the same video stream; the recall is high, but the precision is not. For example, in the first picture of Figure 4, a person was detected even though there is no pedestrian on the road. In the last picture of Figure 4, a guidance sign was detected as a parking meter. Besides, there is a bounding box shifting issue, shown in the middle picture of Figure 4.
The average speed is around 34.7 FPS (frames per second), while the KITTI dataset was collected at 10 Hz (10 FPS)[9]. So the neural network runs in real time on my desktop for the KITTI color video feed. The speed is really impressive, especially combined with the high recall.
Then I performed detection on the grayscale version of the same video stream; the overall performance is nearly identical to the color case. As shown in Figure 5 below, it managed to detect multiple vehicles, trains, trucks, bikes, and pedestrians. However, the same issues appeared in the grayscale video as well. There are some ghost detections, such as a horse in the last picture of Figure 5 that doesn't actually exist. Parking meters and stop signs still confuse the network, and the bounding box shifting issue remains. The average speed is around 34.3 FPS, slightly lower than in the color video task but still real-time.
Advantages
After the experiment, I found that YOLOv3-spp has the following advantages:
- Superior execution speed
- High recall, detecting a large number of objects
- Detection of multiple object classes
- Good performance on vehicles and pedestrians
- Consistent performance on color and grayscale videos
Limitations
YOLOv3-spp has limitations in the following aspects:
- Bounding box offset issue: the bounding boxes appear to lag the video by about one frame.
- Among traffic signs, only stop signs can be detected, and they are often confused with parking meters.
- Some weird detections occurred (e.g., boats in the sky, a horse on the road, ghost pedestrians).
- Detection in overlapping areas: although it can detect a person with a bike or a person with a backpack, it often fails for other combinations. However, this is an issue for all object detection algorithms that rely on images only.
Conclusion
In this post, I deployed a YOLO object detection neural network on my desktop and tested it with the KITTI raw dataset. The information and tutorial materials provided by the author are sufficient for an experienced user to follow. Beginners may need extra Linux knowledge to debug the error messages during compilation and some OpenCV skills to generate video streams from raw images; it would be better to provide a compiled executable binary for the public. A critical problem is that the original project is not well maintained: the latest OpenCV 4.2 isn't compatible with the original Darknet implementation, so I had to use a community-maintained fork.
Still, I'm impressed by this tool and by YOLO's performance. First, Darknet doesn't need any dependencies to run; the two dependencies I used (CUDA and OpenCV) are optional, for GPU speed and video feeds, which means the author coded almost everything in C. This makes the Darknet framework very fast and efficient, which helped me achieve real-time processing. Finally, I got to see what a state-of-the-art object detection neural network looks like. I hope that one day I can improve this kind of real-time object detection by using multiple sensors for cross-validation.
References
[1] Redmon, Joseph, et al. “You only look once: Unified, real-time object detection.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
[2] Redmon, Joseph, and Ali Farhadi. “Yolov3: An incremental improvement.” arXiv preprint arXiv:1804.02767 (2018).
[3] Redmon, Joseph, and Ali Farhadi. “YOLO9000: better, faster, stronger.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
[4] Darknet: Open Source Neural Networks in C, https://pjreddie.com/darknet/install/
[5] YOLO – Real-time Object Detection, https://pjreddie.com/darknet/yolo/
[6] Issue – Fix compilation with latest OpenCV, https://github.com/pjreddie/darknet/pull/1348
[7] tiagoshibata/darknet, https://github.com/tiagoshibata/darknet
[8] Installing OpenCV 4.2.0 With CUDA on Ubuntu 19.10, https://big533.cc/wordpress/index.php/2020/03/14/installing-opencv-4-2-0-with-cuda-on-ubuntu-19-10/
[9] KITTI sensors setup, http://www.cvlibs.net/datasets/kitti/setup.php