In this post, I will discuss my experience deploying and testing the YOLOv3 object detection neural network. YOLO[1] (You Only Look Once) is a state-of-the-art, well-known object detection network in the field of computer vision. The project originated in 2016, and several revisions have led to the current version, YOLOv3[2]. By now there are many variants that differ in model depth and therefore trade off accuracy against speed.
YOLO Introduction
What makes YOLO stand out among object detection algorithms is that it's fast, and it performs quite well. As shown in Figure 1, YOLO sits in the top-left corner of the plot, far from the other recent methods, which means it is much more efficient at the same mAP (mean Average Precision).
The core concept behind YOLO is end-to-end evaluation: it scans the whole image once, divides it into sub-regions, and predicts the objects along with their bounding boxes in a single forward pass.
As shown in Figure 2, YOLO v1 has 24 convolutional layers and 2 fully connected layers, where the convolutional layers are dedicated to feature extraction and the FC layers to bounding box prediction. The basic procedure can be summarized in the following points:
- Resize the image to a fixed size.
- Divide the image into a 7*7 grid. Each grid cell is responsible for detecting objects whose centers fall inside that cell.
- Each grid cell proposes two bounding boxes, where each bounding box consists of five values: x, y, w, h, and confidence.
- Each grid cell also predicts a length-20 vector of class probabilities, so each cell outputs 2*5+20=30 values.
- Finally, non-maximum suppression (NMS) is used to sift the bounding boxes (a minimal sketch follows this list):
- Set a confidence threshold and keep only the bounding boxes whose confidence is above it.
- Then, for each category, add the bounding box with the highest confidence to the output list.
- For the remaining bounding boxes in that category, remove any box that overlaps too much (IoU above a threshold) with the highest-confidence box added above; otherwise, keep it and add it to the output list.
- Repeat the two steps above for all other categories to obtain the final output list.
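The per-class NMS procedure described above can be sketched in a few lines of Python/NumPy. This is only an illustrative sketch (the threshold values and box format are my own choices, not Darknet's code):

import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all given as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms_per_class(boxes, scores, conf_thresh=0.25, iou_thresh=0.5):
    """Keep the most confident box, drop boxes that overlap it heavily, repeat."""
    keep = []
    idx = np.where(scores >= conf_thresh)[0]      # step 1: confidence threshold
    idx = idx[np.argsort(-scores[idx])]           # sort by confidence, descending
    while idx.size > 0:
        best = idx[0]
        keep.append(best)                         # step 2: take the most confident box
        rest = idx[1:]
        overlaps = iou(boxes[best], boxes[rest])
        idx = rest[overlaps < iou_thresh]         # step 3: drop heavily overlapping boxes
    return keep

Running this once per category reproduces the sifting procedure in the list above.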
Although YOLO v1 is fast, it has deficiencies in accuracy and recall. YOLO v2[3] addressed those issues (Figure 3). What was added in v2:
- Adopted Darknet-19 and removed the FC layers, so the network can accept images of different resolutions.
- Added batch normalization after the convolutions (see the sketch after this list), which provides a regularization effect, speeds up convergence, and helps avoid over-fitting.
- Use high-resolution data for multi-scale training.
- Other techniques like fine-grained features, direct location prediction, etc.
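To make the batch normalization point concrete, here is a minimal NumPy sketch of inference-time batch normalization applied to a convolution's output feature map. The shapes, epsilon, and parameter values are illustrative and not taken from Darknet:

import numpy as np

def batch_norm(x, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch normalization over the channel axis.
    x: feature map of shape (N, H, W, C); gamma, beta, mean, var: per-channel arrays of shape (C,)."""
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

# toy usage: normalize a hypothetical (1, 4, 4, 16) convolution output
feat = np.random.randn(1, 4, 4, 16).astype(np.float32)
normed = batch_norm(feat, gamma=np.ones(16), beta=np.zeros(16),
                    mean=feat.mean(axis=(0, 1, 2)), var=feat.var(axis=(0, 1, 2)))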
YOLO v2 fixed the deficiencies of v1 but still could not handle problems like detecting overlapping objects well. Hence we have YOLO v3, which utilizes Darknet-53 (Figure 4). Darknet-53 achieves higher measured floating-point operations per second than ResNet, so it makes efficient use of the GPU. YOLO v3 also predicts at three different scales, from layers of different sizes, and makes the final prediction based on the combined features from those layers. Another improvement is using independent logistic classifiers instead of a softmax, which makes it possible to assign multiple class labels to one object (see the toy example below).
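To illustrate that last point with a toy example (this is just an illustration, not Darknet's actual code): with independent logistic (sigmoid) outputs, several classes can exceed the decision threshold at once, while a softmax forces the class scores to compete and sum to one.

import numpy as np

logits = np.array([2.1, 1.8, -3.0])            # raw scores for, say, "person", "woman", "car"

sigmoid = 1.0 / (1.0 + np.exp(-logits))        # independent per-class probabilities
softmax = np.exp(logits) / np.exp(logits).sum()

print(sigmoid > 0.5)   # [ True  True False] -> multi-label: both "person" and "woman" pass
print(softmax)         # [~0.57 ~0.42 ~0.00] -> scores compete and sum to 1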
All in all, it's thrilling to test a state-of-the-art neural network, and I'm curious to see how it performs on some datasets (e.g., KITTI). Next, I'll start deploying this research work on my desktop.
Desktop Setup
My machine runs Ubuntu Linux 19.10 with an Nvidia RTX 2080 Ti graphics card and an i9-9900KS CPU. The system already has OpenCV and CUDA installed; both are optional dependencies of YOLO used for faster performance and video detection. My target is to run the YOLO neural network in real time.
Dataset
Since I already have the KITTI dataset stored locally, it is convenient to use it here. KITTI is recorded by well-equipped autonomous vehicles using grayscale and color stereo cameras, a lidar, and an IMU. The image quality is about average; in my opinion the color is not great compared with more recent sensors. I downloaded the raw dataset recorded on 2011/09/26, drives 0001, 0005, 0011, 0013, 0014, 0017, 0018, 0048, 0051, and 0056. For the testing, I concatenated those raw camera frames into two 3:29-long videos, one grayscale and one color.
YOLO Compilation
Instructions for installing YOLO are easy to follow, and all the information is on the author's website[4][5].
In fact, we are going to compile an open-source neural network tool called Darknet, which implements several object detection architectures, including our target, YOLOv3, and its variants. Here, I will share the steps I used to compile the YOLO Darknet and run the tests.
Because I need to run the tests on the GPU and the visualization function needs the OpenCV library, I compiled Darknet with CUDA and OpenCV enabled. However, the steps for installing CUDA and OpenCV are not included, since I already have them on my machine.
Another issue I encountered is a compatibility problem[6]. Due to the lack of maintenance of the Darknet project, the code in the author's repo is outdated. The original code cannot be compiled with OpenCV 4.2, because it uses the old, deprecated C library API. Luckily, I found a fork[7] that is still being updated and works with the latest OpenCV; it adds more functionality as well. Hence, the following commands are based on tiagoshibata's version on the master branch.
# Clone the repo to local at any place you want
git clone https://github.com/tiagoshibata/darknet.git
# go to the folder we just downloaded
cd darknet
As mentioned, we will compile Darknet with CUDA and OpenCV. Thus, we need to change two lines in the "Makefile" in the "darknet" folder. After saving the changes, we use the "make" command to compile Darknet.
# change the “Makefile” in the folder
vim Makefile
# change at line 1
GPU=1
# change at line 4
OPENCV=1
# compile Darknet
make
You will see terminal output like this if everything is okay:
mkdir -p obj
gcc -I/usr/local/cuda/include/ -Wall -Wfatal-errors -Ofast....
gcc -I/usr/local/cuda/include/ -Wall -Wfatal-errors -Ofast....
gcc -I/usr/local/cuda/include/ -Wall -Wfatal-errors -Ofast....
.....
gcc -I/usr/local/cuda/include/ -Wall -Wfatal-errors -Ofast -lm....
Here I hit the same error I saw when compiling OpenCV with CUDA: gcc versions newer than 7 are not supported, which means the installed "gcc" and "g++" are too new for CUDA 10[8]. This issue can be solved by switching the default "gcc" and "g++" versions (e.g., with update-alternatives), as I described in my earlier notes[8].
The Darknet’s compilation is done and as simple as one actual command “make” if your system met all prerequisites.
Experiments & Testing
Now that we have the tool and the data, let's look at how the tool can be used.
- If you want to perform detection on a single image, simply run:
./darknet detect cfg/yolov3.cfg yolov3.weights data/dog.jpg
You can change the network configuration in the "cfg" folder and the image to detect in the "data" folder. The corresponding network weights must also be downloaded from the YOLO homepage if you don't want to train the network from scratch. The same rules apply to all the following commands.
- If you want to detect multiple images, change the command to:
./darknet detect cfg/yolov3.cfg yolov3.weights
The terminal will load the model and prompt you to enter an image path, one image at a time. It will prompt again after each detection; you can exit with "Ctrl+C" at any time.
- If you want to perform detection with your webcam, run:
./darknet detector demo cfg/coco.data cfg/yolov3.cfg yolov3.weights
Additionally, you can use "-c <num>" to specify which camera to use, where 0 is the default camera in OpenCV.
- If you want to detect objects in a video stream, add the file name after the previous command:
./darknet detector demo cfg/coco.data cfg/yolov3.cfg yolov3.weights <video file>
This is exactly the command we need for our experiment.
Note that the last two commands need CUDA and OpenCV enabled, while the first two don't need any dependencies. If we run any of the four commands above, the program first builds the neural network and prints its structure in the terminal window.
batch = 1, time_steps = 1, train = 0
layer filters size/strd(dil) input output
0 conv 32 3 x 3/ 1 608 x 608 x 3 -> 608 x 608 x 32 0.639 BF
1 conv 64 3 x 3/ 2 608 x 608 x 32 -> 304 x 304 x 64 3.407 BF
2 conv 32 1 x 1/ 1 304 x 304 x 64 -> 304 x 304 x 32 0.379 BF
3 conv 64 3 x 3/ 1 304 x 304 x 32 -> 304 x 304 x 64 3.407 BF
4 Shortcut Layer: 1, wt = 0, wn = 0, outputs: 304 x 304 x 64 0.006 BF
5 conv 128 3 x 3/ 2 304 x 304 x 64 -> 152 x 152 x 128 3.407 BF
6 conv 64 1 x 1/ 1 152 x 152 x 128 -> 152 x 152 x 64 0.379 BF
7 conv 128 3 x 3/ 1 152 x 152 x 64 -> 152 x 152 x 128 3.407 BF
8 Shortcut Layer: 5, wt = 0, wn = 0, outputs: 152 x 152 x 128 0.003 BF
9 conv 64 1 x 1/ 1 152 x 152 x 128 -> 152 x 152 x 64 0.379 BF
10 conv 128 3 x 3/ 1 152 x 152 x 64 -> 152 x 152 x 128 3.407 BF
11 Shortcut Layer: 8, wt = 0, wn = 0, outputs: 152 x 152 x 128 0.003 BF
12 conv 256 3 x 3/ 2 152 x 152 x 128 -> 76 x 76 x 256 3.407 BF
13 conv 128 1 x 1/ 1 76 x 76 x 256 -> 76 x 76 x 128 0.379 BF
14 conv 256 3 x 3/ 1 76 x 76 x 128 -> 76 x 76 x 256 3.407 BF
15 Shortcut Layer: 12, wt = 0, wn = 0, outputs: 76 x 76 x 256 0.001 BF
16 conv 128 1 x 1/ 1 76 x 76 x 256 -> 76 x 76 x 128 0.379 BF
17 conv 256 3 x 3/ 1 76 x 76 x 128 -> 76 x 76 x 256 3.407 BF
18 Shortcut Layer: 15, wt = 0, wn = 0, outputs: 76 x 76 x 256 0.001 BF
19 conv 128 1 x 1/ 1 76 x 76 x 256 -> 76 x 76 x 128 0.379 BF
20 conv 256 3 x 3/ 1 76 x 76 x 128 -> 76 x 76 x 256 3.407 BF
21 Shortcut Layer: 18, wt = 0, wn = 0, outputs: 76 x 76 x 256 0.001 BF
22 conv 128 1 x 1/ 1 76 x 76 x 256 -> 76 x 76 x 128 0.379 BF
23 conv 256 3 x 3/ 1 76 x 76 x 128 -> 76 x 76 x 256 3.407 BF
24 Shortcut Layer: 21, wt = 0, wn = 0, outputs: 76 x 76 x 256 0.001 BF
25 conv 128 1 x 1/ 1 76 x 76 x 256 -> 76 x 76 x 128 0.379 BF
26 conv 256 3 x 3/ 1 76 x 76 x 128 -> 76 x 76 x 256 3.407 BF
27 Shortcut Layer: 24, wt = 0, wn = 0, outputs: 76 x 76 x 256 0.001 BF
28 conv 128 1 x 1/ 1 76 x 76 x 256 -> 76 x 76 x 128 0.379 BF
29 conv 256 3 x 3/ 1 76 x 76 x 128 -> 76 x 76 x 256 3.407 BF
30 Shortcut Layer: 27, wt = 0, wn = 0, outputs: 76 x 76 x 256 0.001 BF
31 conv 128 1 x 1/ 1 76 x 76 x 256 -> 76 x 76 x 128 0.379 BF
32 conv 256 3 x 3/ 1 76 x 76 x 128 -> 76 x 76 x 256 3.407 BF
33 Shortcut Layer: 30, wt = 0, wn = 0, outputs: 76 x 76 x 256 0.001 BF
34 conv 128 1 x 1/ 1 76 x 76 x 256 -> 76 x 76 x 128 0.379 BF
35 conv 256 3 x 3/ 1 76 x 76 x 128 -> 76 x 76 x 256 3.407 BF
36 Shortcut Layer: 33, wt = 0, wn = 0, outputs: 76 x 76 x 256 0.001 BF
37 conv 512 3 x 3/ 2 76 x 76 x 256 -> 38 x 38 x 512 3.407 BF
38 conv 256 1 x 1/ 1 38 x 38 x 512 -> 38 x 38 x 256 0.379 BF
39 conv 512 3 x 3/ 1 38 x 38 x 256 -> 38 x 38 x 512 3.407 BF
40 Shortcut Layer: 37, wt = 0, wn = 0, outputs: 38 x 38 x 512 0.001 BF
41 conv 256 1 x 1/ 1 38 x 38 x 512 -> 38 x 38 x 256 0.379 BF
42 conv 512 3 x 3/ 1 38 x 38 x 256 -> 38 x 38 x 512 3.407 BF
43 Shortcut Layer: 40, wt = 0, wn = 0, outputs: 38 x 38 x 512 0.001 BF
44 conv 256 1 x 1/ 1 38 x 38 x 512 -> 38 x 38 x 256 0.379 BF
45 conv 512 3 x 3/ 1 38 x 38 x 256 -> 38 x 38 x 512 3.407 BF
46 Shortcut Layer: 43, wt = 0, wn = 0, outputs: 38 x 38 x 512 0.001 BF
47 conv 256 1 x 1/ 1 38 x 38 x 512 -> 38 x 38 x 256 0.379 BF
48 conv 512 3 x 3/ 1 38 x 38 x 256 -> 38 x 38 x 512 3.407 BF
49 Shortcut Layer: 46, wt = 0, wn = 0, outputs: 38 x 38 x 512 0.001 BF
50 conv 256 1 x 1/ 1 38 x 38 x 512 -> 38 x 38 x 256 0.379 BF
51 conv 512 3 x 3/ 1 38 x 38 x 256 -> 38 x 38 x 512 3.407 BF
52 Shortcut Layer: 49, wt = 0, wn = 0, outputs: 38 x 38 x 512 0.001 BF
53 conv 256 1 x 1/ 1 38 x 38 x 512 -> 38 x 38 x 256 0.379 BF
54 conv 512 3 x 3/ 1 38 x 38 x 256 -> 38 x 38 x 512 3.407 BF
55 Shortcut Layer: 52, wt = 0, wn = 0, outputs: 38 x 38 x 512 0.001 BF
56 conv 256 1 x 1/ 1 38 x 38 x 512 -> 38 x 38 x 256 0.379 BF
57 conv 512 3 x 3/ 1 38 x 38 x 256 -> 38 x 38 x 512 3.407 BF
58 Shortcut Layer: 55, wt = 0, wn = 0, outputs: 38 x 38 x 512 0.001 BF
59 conv 256 1 x 1/ 1 38 x 38 x 512 -> 38 x 38 x 256 0.379 BF
60 conv 512 3 x 3/ 1 38 x 38 x 256 -> 38 x 38 x 512 3.407 BF
61 Shortcut Layer: 58, wt = 0, wn = 0, outputs: 38 x 38 x 512 0.001 BF
62 conv 1024 3 x 3/ 2 38 x 38 x 512 -> 19 x 19 x1024 3.407 BF
63 conv 512 1 x 1/ 1 19 x 19 x1024 -> 19 x 19 x 512 0.379 BF
64 conv 1024 3 x 3/ 1 19 x 19 x 512 -> 19 x 19 x1024 3.407 BF
65 Shortcut Layer: 62, wt = 0, wn = 0, outputs: 19 x 19 x1024 0.000 BF
66 conv 512 1 x 1/ 1 19 x 19 x1024 -> 19 x 19 x 512 0.379 BF
67 conv 1024 3 x 3/ 1 19 x 19 x 512 -> 19 x 19 x1024 3.407 BF
68 Shortcut Layer: 65, wt = 0, wn = 0, outputs: 19 x 19 x1024 0.000 BF
69 conv 512 1 x 1/ 1 19 x 19 x1024 -> 19 x 19 x 512 0.379 BF
70 conv 1024 3 x 3/ 1 19 x 19 x 512 -> 19 x 19 x1024 3.407 BF
71 Shortcut Layer: 68, wt = 0, wn = 0, outputs: 19 x 19 x1024 0.000 BF
72 conv 512 1 x 1/ 1 19 x 19 x1024 -> 19 x 19 x 512 0.379 BF
73 conv 1024 3 x 3/ 1 19 x 19 x 512 -> 19 x 19 x1024 3.407 BF
74 Shortcut Layer: 71, wt = 0, wn = 0, outputs: 19 x 19 x1024 0.000 BF
75 conv 512 1 x 1/ 1 19 x 19 x1024 -> 19 x 19 x 512 0.379 BF
76 conv 1024 3 x 3/ 1 19 x 19 x 512 -> 19 x 19 x1024 3.407 BF
77 conv 512 1 x 1/ 1 19 x 19 x1024 -> 19 x 19 x 512 0.379 BF
78 max 5x 5/ 1 19 x 19 x 512 -> 19 x 19 x 512 0.005 BF
79 route 77 -> 19 x 19 x 512
80 max 9x 9/ 1 19 x 19 x 512 -> 19 x 19 x 512 0.015 BF
81 route 77 -> 19 x 19 x 512
82 max 13x13/ 1 19 x 19 x 512 -> 19 x 19 x 512 0.031 BF
83 route 82 80 78 77 -> 19 x 19 x2048
84 conv 512 1 x 1/ 1 19 x 19 x2048 -> 19 x 19 x 512 0.757 BF
85 conv 1024 3 x 3/ 1 19 x 19 x 512 -> 19 x 19 x1024 3.407 BF
86 conv 512 1 x 1/ 1 19 x 19 x1024 -> 19 x 19 x 512 0.379 BF
87 conv 1024 3 x 3/ 1 19 x 19 x 512 -> 19 x 19 x1024 3.407 BF
88 conv 255 1 x 1/ 1 19 x 19 x1024 -> 19 x 19 x 255 0.189 BF
89 yolo
[yolo] params: iou loss: mse (2), iou_norm: 0.75, cls_norm: 1.00, scale_x_y: 1.00
90 route 86 -> 19 x 19 x 512
91 conv 256 1 x 1/ 1 19 x 19 x 512 -> 19 x 19 x 256 0.095 BF
92 upsample 2x 19 x 19 x 256 -> 38 x 38 x 256
93 route 92 61 -> 38 x 38 x 768
94 conv 256 1 x 1/ 1 38 x 38 x 768 -> 38 x 38 x 256 0.568 BF
95 conv 512 3 x 3/ 1 38 x 38 x 256 -> 38 x 38 x 512 3.407 BF
96 conv 256 1 x 1/ 1 38 x 38 x 512 -> 38 x 38 x 256 0.379 BF
97 conv 512 3 x 3/ 1 38 x 38 x 256 -> 38 x 38 x 512 3.407 BF
98 conv 256 1 x 1/ 1 38 x 38 x 512 -> 38 x 38 x 256 0.379 BF
99 conv 512 3 x 3/ 1 38 x 38 x 256 -> 38 x 38 x 512 3.407 BF
100 conv 255 1 x 1/ 1 38 x 38 x 512 -> 38 x 38 x 255 0.377 BF
101 yolo
[yolo] params: iou loss: mse (2), iou_norm: 0.75, cls_norm: 1.00, scale_x_y: 1.00
102 route 98 -> 38 x 38 x 256
103 conv 128 1 x 1/ 1 38 x 38 x 256 -> 38 x 38 x 128 0.095 BF
104 upsample 2x 38 x 38 x 128 -> 76 x 76 x 128
105 route 104 36 -> 76 x 76 x 384
106 conv 128 1 x 1/ 1 76 x 76 x 384 -> 76 x 76 x 128 0.568 BF
107 conv 256 3 x 3/ 1 76 x 76 x 128 -> 76 x 76 x 256 3.407 BF
108 conv 128 1 x 1/ 1 76 x 76 x 256 -> 76 x 76 x 128 0.379 BF
109 conv 256 3 x 3/ 1 76 x 76 x 128 -> 76 x 76 x 256 3.407 BF
110 conv 128 1 x 1/ 1 76 x 76 x 256 -> 76 x 76 x 128 0.379 BF
111 conv 256 3 x 3/ 1 76 x 76 x 128 -> 76 x 76 x 256 3.407 BF
112 conv 255 1 x 1/ 1 76 x 76 x 256 -> 76 x 76 x 255 0.754 BF
113 yolo
[yolo] params: iou loss: mse (2), iou_norm: 0.75, cls_norm: 1.00, scale_x_y: 1.00
Total BFLOPS 141.531
avg_outputs = 1083728
Allocate additional workspace_size = 106.46 MB
Loading weights from weights/yolov3-spp.weights...
seen 64, trained: 32013 K-images (500 Kilo-batches_64)
Done! Loaded 114 layers from weights-file
As shown in the terminal output, we're going to use the YOLOv3-spp network for our experiment, where SPP stands for spatial pyramid pooling. The SPP block produces an output of fixed dimensions regardless of the input's shape and aggregates the strongest responses from the feature maps at several scales. Hence, YOLOv3-spp has the best mAP among the variants listed on the homepage. But it's heavier than the others, requiring 141.5 BFLOPs (billions of floating-point operations) per forward pass. The weights of this network can be downloaded from the homepage as well (around 250 MB).
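As a side note, the SPP block in this model corresponds to layers 78-83 in the printout above: three stride-1 max-pooling operations with kernel sizes 5, 9, and 13, concatenated with the original 512-channel feature map to give 2048 channels. A rough NumPy/SciPy sketch of the idea (not Darknet's actual implementation, and ignoring padding details) looks like this:

import numpy as np
from scipy.ndimage import maximum_filter

def spp_block(feat):
    """Stride-1 max pooling at kernel sizes 5, 9, and 13, concatenated with the input.
    feat: feature map of shape (H, W, C); returns (H, W, 4*C)."""
    pooled = [feat]
    for k in (5, 9, 13):
        # pool over the two spatial dimensions only (size 1 on the channel axis)
        pooled.append(maximum_filter(feat, size=(k, k, 1)))
    return np.concatenate(pooled, axis=-1)

# toy check with the shapes from the printout above: 19 x 19 x 512 -> 19 x 19 x 2048
out = spp_block(np.random.randn(19, 19, 512).astype(np.float32))
print(out.shape)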
Besides the network structure, the output also contains the results of the detection task (shown below). Detecting an image with YOLOv3-spp on the RTX 2080 Ti is surprisingly fast: 348.4 milliseconds. Three objects were detected in the image, and their confidence scores are printed as well.
data/dog.jpg: Predicted in 348.431000 milli-seconds.
bicycle: 98%
dog: 97%
truck: 90%
Video Detection Test
Now, let's dive into video testing. I'm going to use two videos for the experiment: a grayscale one and a color one. The raw KITTI dataset is stored as "png" image files, so I need to make the two videos first. This can be done easily with OpenCV in Python; here's a code snippet from another project of mine. It was written for a different purpose, but it can save the videos for me.
from typing import Optional, Sequence

import cv2


def show_vid_arr(vid: Sequence, title: str = 'default', rate: int = 10,
                 out: Optional[str] = None, color: Optional[bool] = False) -> None:
    """
    Play a video from an ndarray sequence; press 'Esc' to quit the window.
    :param vid: sequence of video frames (ndarrays)
    :param title: window title
    :param rate: frame rate of the video in Hz
    :param out: if a file name is given, the played video is saved (MJPG codec) in the current folder
    :param color: if True, save a color video; only used when 'out' is given
    """
    delay = int(1 / rate * 1000)
    if out is not None and isinstance(out, str):
        if color:
            # color frames are (height, width, 3); VideoWriter expects (width, height)
            resolution = vid[0].shape[:2][::-1]
            writer = cv2.VideoWriter(out, cv2.VideoWriter_fourcc('M', 'J', 'P', 'G'), rate, resolution)
        else:
            # grayscale frames are (height, width); the final 0 means isColor=False
            resolution = vid[0].shape[::-1]
            writer = cv2.VideoWriter(out, cv2.VideoWriter_fourcc('M', 'J', 'P', 'G'), rate, resolution, 0)
        print('Writer object created. Target file name: %s, resolution: (%d, %d)' % ((out,) + resolution))
    else:
        writer = None
    for frame in vid:
        if color:
            # frames are expected in RGB; OpenCV displays and writes BGR
            frame = cv2.cvtColor(frame, cv2.COLOR_RGB2BGR)
        cv2.imshow(title, frame)
        if writer is not None:
            writer.write(frame)
        if cv2.waitKey(delay) & 0xFF == 27:  # 27 = Esc key
            break
    cv2.waitKey(1)
    cv2.destroyAllWindows()
    if writer is not None:
        writer.release()
        print('Writer object released.')
    # a few extra waitKey calls so the window actually closes on all platforms
    for i in range(1, 5):
        cv2.waitKey(1)
    return
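For completeness, here is a hypothetical usage sketch of the snippet above; the drive path is an assumption based on the standard KITTI raw folder layout, so adjust it to wherever your local copy lives:

import glob
import cv2

# hypothetical path: color camera (image_02) of drive 0001 recorded on 2011/09/26
frame_paths = sorted(glob.glob('kitti/2011_09_26/2011_09_26_drive_0001_sync/image_02/data/*.png'))
# cv2.imread returns BGR; convert to RGB because show_vid_arr converts back before writing
frames = [cv2.cvtColor(cv2.imread(p), cv2.COLOR_BGR2RGB) for p in frame_paths]

# KITTI was recorded at 10 Hz, so play and save at rate=10 (MJPG-encoded output file)
show_vid_arr(frames, title='kitti', rate=10, out='kitti_color.avi', color=True)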
Here I renamed the two videos to "kitti_color.mp4" and "kitti_gray.mp4". Then, I executed the command:
./darknet detector demo cfg/coco.data cfg/yolov3-spp.cfg weights/yolov3-spp.weights data/kitti_color.mp4 -out_filename res.avi
where "-out_filename" saves the results to the local disk as a video file called "res.avi".
First, let’s see how it performed on a colored video stream. I took some screenshots below.
As shown in the figures above, YOLOv3-spp is capable of detecting a variety of objects, even though KITTI focuses on traffic scenes. It accurately detected nearly all vehicles in an instant and also identified some trucks and bikes. It can detect stop signs and parking meters as well, although the network often has trouble telling them apart. Some combinations are detected in the video, such as a person riding a bicycle or a person wearing a backpack, and both objects in each combination are detected.
Nevertheless, the results also contain many false detections in the same video stream; the recall is high, but the precision is not. For example, in the first picture of Figure 4, a person was detected even though there is no pedestrian on the road. In the last picture of Figure 4, a guidance sign was detected as a parking meter. Besides, there is a bounding box shifting issue, shown in the middle picture of Figure 4.
The average speed is around 34.7 FPS (frames per second), while the KITTI dataset was collected at 10 Hz (10 FPS)[9]. So the neural network runs in real time on my desktop for the KITTI color video feed. The speed is really impressive, especially combined with the high recall.
Then I performed detection on the grayscale version of the same video stream; the overall performance is nearly identical to the color case. As shown in Figure 5 below, it managed to detect multiple vehicles, trains, trucks, bikes, and pedestrians. However, the same issues appeared in the grayscale video as well. There are some ghost detections, such as a horse in the last picture of Figure 5 that doesn't actually exist. Parking meters and stop signs still confuse the network, and the bounding box shifting issue remains. The average speed is around 34.3 FPS, slightly lower than in the color video task but still real-time.
Advantages
After the experiment, I found that YOLOv3-spp has the following advantages:
- Superior execution speed
- High recall, detecting a large number of objects
- Detection of multiple object classes
- Good performance on vehicles and pedestrians
- Consistent performance on color and grayscale videos
Limitations
YOLOv3-spp has limitations in the following aspects:
- Bounding box offset issue: the bounding boxes appear to lag the video by about one frame.
- Among traffic signs, only stop signs can be detected, and they are often confused with parking meters.
- Some weird detections occurred (e.g., boats in the sky, a horse on the road, ghost pedestrians).
- Detection in overlapping areas: although it can detect a person with a bike or a person with a backpack, it often fails for other combinations. However, this is an issue for all object detection algorithms that rely on images only.
Conclusion
In this post, I deployed a YOLO object detection neural network on my desktop and tested it with the KITTI raw dataset. The information and tutorial materials provided by the author are sufficient for an experienced user to follow. Beginners may need extra Linux knowledge to debug the error messages during compilation and some OpenCV skills to generate video streams from raw images; it would be better to provide a compiled executable binary for the public. A critical problem is that the original project is not well maintained: the latest OpenCV 4.2 isn't compatible with the original Darknet implementation, so I had to use a community-maintained fork.
Still, I'm impressed by this tool and by YOLO's performance. First, Darknet doesn't need any dependencies to run; the two dependencies I used (CUDA and OpenCV) are optional, for GPU speed and video feeds, which means the author coded almost everything in C. This makes the Darknet framework very fast and efficient, which helped me achieve real-time processing. Finally, I got to see what a state-of-the-art object detection neural network looks like. I hope that one day I can improve this kind of real-time object detection by using multiple sensors for cross-validation.
References
[1] Redmon, Joseph, et al. “You only look once: Unified, real-time object detection.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
[2] Redmon, Joseph, and Ali Farhadi. “Yolov3: An incremental improvement.” arXiv preprint arXiv:1804.02767 (2018).
[3] Redmon, Joseph, and Ali Farhadi. “YOLO9000: better, faster, stronger.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
[4] Darknet: Open Source Neural Networks in C, https://pjreddie.com/darknet/install/
[5] YOLO – Real-time Object Detection, https://pjreddie.com/darknet/yolo/
[6] Issue – Fix compilation with latest OpenCV, https://github.com/pjreddie/darknet/pull/1348
[7] tiagoshibata/darknet, https://github.com/tiagoshibata/darknet
[8] Installing OpenCV 4.2.0 With CUDA on Ubuntu 19.10, https://big533.cc/wordpress/index.php/2020/03/14/installing-opencv-4-2-0-with-cuda-on-ubuntu-19-10/
[9] KITTI sensors setup, http://www.cvlibs.net/datasets/kitti/setup.php