#ComputerVision – Object Detection with #YoloV3 and #MobileNetSSD


Hi !

I have a to-do on my list: add some new drone demos. To get there, I planned to run some tests with pretrained models and then use them. The first two on my list are YOLO and MobileNetSSD (see references).

YoloV3

Let’s start with one of the most popular object detection tools, YOLOV3. The official definition:

YOLO (You Only Look Once) is a real-time object detection algorithm. It uses a single deep convolutional neural network that splits the input image into a set of grid cells. Unlike image classification or face detection, each grid cell in the YOLO algorithm has an associated vector in the output that tells us:

– If an object exists in that grid cell.

– The class of that object (i.e., its label).

– The predicted bounding box for that object (its location).
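To make that output vector concrete, here is a minimal sketch in plain Python. It assumes the layout that the source code later in this post relies on: four normalized box values (center x, center y, width, height), one objectness score, and then one score per class. The function name `decode_yolo_vector` and the sample numbers are made up for illustration.

```python
def decode_yolo_vector(vector, frame_w, frame_h, min_confidence=0.5):
    """Decode one YOLO-style output vector into (class_id, confidence, box).

    Assumed layout: [center_x, center_y, width, height, objectness, score_0, ...]
    with the four box values normalized to the 0..1 range.
    Returns None when the best class score is not above min_confidence.
    """
    scores = vector[5:]
    class_id = max(range(len(scores)), key=lambda k: scores[k])
    confidence = scores[class_id]
    if confidence <= min_confidence:
        return None
    center_x = int(vector[0] * frame_w)
    center_y = int(vector[1] * frame_h)
    width = int(vector[2] * frame_w)
    height = int(vector[3] * frame_h)
    # convert from a center-based box to a top-left-based box
    x = int(center_x - width / 2)
    y = int(center_y - height / 2)
    return class_id, confidence, (x, y, width, height)


# hypothetical vector for a 3-class detector on a 640x480 frame
sample = [0.5, 0.5, 0.25, 0.5, 0.9, 0.05, 0.85, 0.10]
print(decode_yolo_vector(sample, 640, 480))
```

The real network emits many of these vectors per frame, so most of them are discarded by the confidence check.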


I picked up some sample code from GitHub repositories and, as usual, from PyImageSearch (see references), and I created a real-time object detection scenario using my webcam as the input feed for YoloV3.

Object Detection live sample with Yolo V3

The final demo works great: we can use the 80 classes that YoloV3 supports, and it runs at ~2 FPS.
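The FPS figure comes from timing each pass through the capture loop. A single-frame reading jumps around a lot, so one option is to smooth it. The sketch below is only an illustration and not what the demo code does: `FpsMeter` and its `smoothing` parameter are hypothetical names, and timestamps are passed in explicitly so the example is deterministic.

```python
class FpsMeter:
    """Smooth frames-per-second readings with an exponential moving average."""

    def __init__(self, smoothing=0.9):
        self.smoothing = smoothing  # weight kept from the previous estimate
        self.last_time = None
        self.fps = None

    def tick(self, now):
        """Register a frame at timestamp `now` (seconds) and return the estimate."""
        if self.last_time is not None:
            elapsed = now - self.last_time
            if elapsed > 0:
                instant = 1.0 / elapsed
                if self.fps is None:
                    self.fps = instant
                else:
                    self.fps = self.smoothing * self.fps + (1 - self.smoothing) * instant
        self.last_time = now
        return self.fps


meter = FpsMeter(smoothing=0.5)
for t in (0.0, 0.5, 1.0, 1.25):  # two frames at 2 FPS, then one at 4 FPS
    fps = meter.tick(t)
print(f"FPS: {fps:.2f}")
```

In a live loop, the call would be `meter.tick(time.time())` once per frame.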

MobileNetSSD

Another very popular object detection tool is MobileNetSSD. The important part here is SSD, Single Shot Detection. Let’s go to the definition:

Single Shot object detection, or SSD, takes one single shot to detect multiple objects within the image. For example, it can detect coffee, an iPhone, a notebook, a laptop and glasses at the same time.

It is composed of two parts:

– Extract feature maps, and

– Apply convolution filters to detect objects
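To give a feel for how these two parts fit together, the sketch below counts how many candidate boxes a single pass scores. The feature map sizes and the number of boxes per location are the ones reported for the original VGG-based SSD300 in the SSD paper. They are an assumption here, and a MobileNet-based SSD may use a different layout.

```python
# (feature map side, default boxes per location) for the original SSD300,
# as reported in the SSD paper; a MobileNet backbone may differ
SSD300_LAYERS = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]


def count_default_boxes(layers):
    """Total number of default boxes scored in a single forward pass."""
    return sum(side * side * boxes for side, boxes in layers)


print(count_default_boxes(SSD300_LAYERS))  # 8732
```

Every one of those boxes gets class scores and box offsets from the convolution filters in the same pass, which is where the "single shot" comes from.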

SSD was developed by a research team that included Google researchers, to balance the trade-offs between two object detection methods, YOLO and R-CNN.

Two SSD models are available:

– SSD300: In this model the input size is fixed to 300×300. It is used with lower-resolution images; it is faster, but less accurate than SSD512.

– SSD512: In this model the input size is fixed to 512×512. It is used with higher-resolution images and it is more accurate than SSD300.

SSD is faster than R-CNN because R-CNN needs two shots, one for generating region proposals and one for detecting objects, whereas SSD does both in a single shot.

The MobileNet SSD method was first trained on the COCO dataset and was then fine-tuned on PASCAL VOC reaching 72.7% mAP (mean average precision).

For this demo, I’ll use the SSD300 model. Even though the drone supports better-quality images and the SSD512 model works with bigger images, SSD300 is a good fit for this.
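The MobileNetSSD code at the end of this post passes a scale of 0.007843 and a mean of 127.5 to `cv2.dnn.blobFromImage`. As a rough illustration of what those two numbers do (resizing to 300×300 is left out), here is a plain-Python sketch. `to_blob` is a hypothetical helper, not an OpenCV function, and the tiny image is made up.

```python
def to_blob(image, scale=0.007843, mean=127.5):
    """Turn an H x W x C image (nested lists, values 0..255) into a C x H x W blob.

    Each value becomes (pixel - mean) * scale, which maps 0..255 to roughly -1..1.
    """
    height = len(image)
    width = len(image[0])
    channels = len(image[0][0])
    return [
        [[(image[y][x][c] - mean) * scale for x in range(width)] for y in range(height)]
        for c in range(channels)
    ]


# tiny hypothetical image: 1 row, 2 pixels, 3 channels per pixel
tiny = [[[0, 127.5, 255], [255, 255, 255]]]
blob = to_blob(tiny)
print(len(blob), len(blob[0]), len(blob[0][0]))  # channels, height, width
```

The mean is subtracted first and the scale is applied afterwards, so a mid-gray pixel ends up at 0.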

Object Detection with MobileNetSSD

This sample works at ~20 FPS, which triggered my curiosity to learn more about the second one. I started to read a lot about it and found some amazing articles and papers. In the end, if you are interested in my personal take, I really enjoyed this 30-minute video comparing the different detectors side by side.

Source Code

YoloV3 webcam live object detection

# Bruno Capuano 2020
# display the camera feed using OpenCV
# display FPS
# load YOLO object detector trained with COCO Dataset (80 classes)
# analyze each camera frame using YoloV3
import numpy as np
import time
import cv2
import os
def initYoloV3():
    global labelColors, layerNames, net
    # random color collection for each class label
    np.random.seed(42)
    labelColors = np.random.randint(0, 255, size=(len(Labels), 3), dtype="uint8")
    # load model
    net = cv2.dnn.readNetFromDarknet(configPath, weightsPath)
    layerNames = net.getLayerNames()
    # getUnconnectedOutLayers() returns nested or flat indices depending on the OpenCV version
    layerNames = [layerNames[i - 1] for i in np.array(net.getUnconnectedOutLayers()).flatten()]
def analyzeFrame(frame, displayBoundingBox = True, displayClassName = True, displayConfidence = True):
    global H, W
    # init
    if W is None or H is None:
        (H, W) = frame.shape[:2]
    if net is None:
        initYoloV3()
    yoloV3ImgSize = (416, 416)
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, yoloV3ImgSize, swapRB=True, crop=False)
    net.setInput(blob)
    start = time.time()
    layerOutputs = net.forward(layerNames)
    end = time.time()
    boxes = []
    confidences = []
    classIDs = []
    for output in layerOutputs:
        for detection in output:
            # the first 5 values are box + objectness, the rest are class scores
            scores = detection[5:]
            classID = np.argmax(scores)
            confidence = scores[classID]
            if confidence > confidenceDef:
                # scale the normalized box back to the frame size
                box = detection[0:4] * np.array([W, H, W, H])
                (centerX, centerY, width, height) = box.astype("int")
                # convert from center coordinates to the top-left corner
                x = int(centerX - (width / 2))
                y = int(centerY - (height / 2))
                boxes.append([x, y, int(width), int(height)])
                confidences.append(float(confidence))
                classIDs.append(classID)
    # non-maximum suppression removes overlapping boxes for the same object
    idxs = cv2.dnn.NMSBoxes(boxes, confidences, confidenceDef, thresholdDef)
    if len(idxs) > 0:
        for i in np.array(idxs).flatten():
            (x, y) = (boxes[i][0], boxes[i][1])
            (w, h) = (boxes[i][2], boxes[i][3])
            # the color is also needed for the label, so compute it before the checks
            color = [int(c) for c in labelColors[classIDs[i]]]
            if (displayBoundingBox):
                cv2.rectangle(frame, (x, y), (x + w, y + h), color, 2)
            if(displayClassName and displayConfidence):
                text = "{}: {:.4f}".format(Labels[classIDs[i]], confidences[i])
                cv2.putText(frame, text, (x, y - 5), cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2)
            elif(displayClassName):
                text = str(f"{Labels[classIDs[i]]}:")
                cv2.putText(frame, text, (x, y - 5), cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2)
# Camera Settings
camera_Width = 640 # 1024 # 1280 # 640
camera_Heigth = 480 # 780 # 960 # 480
frameSize = (camera_Width, camera_Heigth)
video_capture = cv2.VideoCapture(1)
time.sleep(2.0)
(W, H) = (None, None)
# YOLO Settings
weightsPath = "yolov3.weights"
configPath = "yolov3.cfg"
LabelsPath = "coco.names"
Labels = open(LabelsPath).read().strip().split("\n")
confidenceDef = 0.5
thresholdDef = 0.3
net = (None)
labelColors = (None)
layerNames = (None)
i = 0
detectionEnabled = False
while True:
    i = i + 1
    start_time = time.time()
    ret, frameOrig = video_capture.read()
    # stop if the camera did not return a frame
    if not ret:
        break
    frame = cv2.resize(frameOrig, frameSize)
    if(detectionEnabled):
        analyzeFrame(frame)
    if (time.time() - start_time ) > 0:
        fpsInfo = "FPS: " + str(1.0 / (time.time() - start_time)) # FPS = 1 / time to process loop
        font = cv2.FONT_HERSHEY_DUPLEX
        cv2.putText(frame, fpsInfo, (10, 20), font, 0.4, (255, 255, 255), 1)
    cv2.imshow('@elbruno – YoloV3 Object Detection', frame)
    # key controller: D toggles detection, Q quits
    key = cv2.waitKey(1) & 0xFF
    if key == ord("d"):
        if (detectionEnabled == True):
            detectionEnabled = False
        else:
            detectionEnabled = True
    if key == ord("q"):
        break
video_capture.release()
cv2.destroyAllWindows()
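The YoloV3 sample calls `cv2.dnn.NMSBoxes` to remove overlapping boxes that point at the same object. The following is a simplified, plain-Python sketch of the idea (greedy non-maximum suppression), not OpenCV's actual implementation. The function names `iou` and `nms` and the sample boxes are made up for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, confidences, score_threshold, nms_threshold):
    """Greedy non-maximum suppression; returns the indices of the boxes to keep."""
    # visit candidates from the most to the least confident
    order = sorted(
        (i for i, c in enumerate(confidences) if c > score_threshold),
        key=lambda i: confidences[i],
        reverse=True,
    )
    keep = []
    for i in order:
        # keep a box only if it does not overlap too much with a kept one
        if all(iou(boxes[i], boxes[k]) <= nms_threshold for k in keep):
            keep.append(i)
    return keep


# hypothetical detections: the first two boxes cover the same object
boxes = [[100, 100, 50, 50], [105, 105, 50, 50], [300, 300, 40, 40]]
confidences = [0.9, 0.8, 0.7]
print(nms(boxes, confidences, 0.5, 0.3))  # [0, 2]
```

With the thresholds used in the sample (0.5 and 0.3), the second box is dropped because it overlaps the first, more confident one by well over 30%.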

MobileNetSSD webcam live object detection

# Bruno Capuano 2020
# display the camera feed using OpenCV
# display FPS
# load MobileNetSSD object detector (20 PASCAL VOC classes plus background)
# analyze each camera frame using MobileNet
# enable / disable object detection by pressing the D key
import numpy as np
import time
import cv2
import os
def initMobileNetSSD():
    global classesMobileNetSSD, colorsMobileNetSSD, net
    classesMobileNetSSD = ["background", "aeroplane", "bicycle", "bird", "boat",
        "bottle", "bus", "car", "cat", "chair", "cow", "diningtable",
        "dog", "horse", "motorbike", "person", "pottedplant", "sheep",
        "sofa", "train", "tvmonitor"]
    # random color for each class label
    colorsMobileNetSSD = np.random.uniform(0, 255, size=(len(classesMobileNetSSD), 3))
    net = cv2.dnn.readNetFromCaffe(prototxtFile, modelFile)
def analyzeFrame(frame, displayBoundingBox = True, displayClassName = True, displayConfidence = True):
    global H, W
    # init
    if W is None or H is None:
        (H, W) = frame.shape[:2]
    if net is None:
        initMobileNetSSD()
    mobileNetSSDImgSize = (300, 300)
    # resize to 300x300, subtract the mean (127.5) and scale (0.007843)
    blob = cv2.dnn.blobFromImage(cv2.resize(frame, mobileNetSSDImgSize), 0.007843, mobileNetSSDImgSize, 127.5)
    net.setInput(blob)
    detections = net.forward()
    for i in np.arange(0, detections.shape[2]):
        confidence = detections[0, 0, i, 2]
        if confidence > confidenceDef:
            idx = int(detections[0, 0, i, 1])
            # scale the normalized corners back to the frame size
            box = detections[0, 0, i, 3:7] * np.array([W, H, W, H])
            (startX, startY, endX, endY) = box.astype("int")
            if(displayBoundingBox):
                cv2.rectangle(frame, (startX, startY), (endX, endY), colorsMobileNetSSD[idx], 2)
            if(displayClassName and displayConfidence):
                label = "{}: {:.2f}%".format(classesMobileNetSSD[idx], confidence * 100)
                # draw the label above the box, or inside it when there is no room
                y = startY - 15 if startY - 15 > 15 else startY + 15
                cv2.putText(frame, label, (startX, y), cv2.FONT_HERSHEY_SIMPLEX, 0.5, colorsMobileNetSSD[idx], 2)
            elif (displayClassName):
                label = str(f"{classesMobileNetSSD[idx]}")
                y = startY - 15 if startY - 15 > 15 else startY + 15
                cv2.putText(frame, label, (startX, y), cv2.FONT_HERSHEY_SIMPLEX, 0.5, colorsMobileNetSSD[idx], 2)
# Camera Settings
camera_Width = 640 # 1024 # 1280 # 640
camera_Heigth = 480 # 780 # 960 # 480
frameSize = (camera_Width, camera_Heigth)
video_capture = cv2.VideoCapture(1)
time.sleep(2.0)
(W, H) = (None, None)
# MobileNetSSD Settings
confidenceDef = 0.5
thresholdDef = 0.3
prototxtFile = "MobileNetSSD_deploy.prototxt.txt"
modelFile = "MobileNetSSD_deploy.caffemodel"
net = (None)
classesMobileNetSSD = (None)
colorsMobileNetSSD = (None)
i = 0
detectionEnabled = False
while True:
    i = i + 1
    start_time = time.time()
    ret, frameOrig = video_capture.read()
    # stop if the camera did not return a frame
    if not ret:
        break
    frame = cv2.resize(frameOrig, frameSize)
    if(detectionEnabled):
        analyzeFrame(frame)
    if (time.time() - start_time ) > 0:
        fpsInfo = "FPS: " + str(1.0 / (time.time() - start_time)) # FPS = 1 / time to process loop
        font = cv2.FONT_HERSHEY_DUPLEX
        cv2.putText(frame, fpsInfo, (10, 20), font, 0.4, (255, 255, 255), 1)
    cv2.imshow('@elbruno – MobileNetSSD Object Detection', frame)
    # key controller: D toggles detection, Q quits
    key = cv2.waitKey(1) & 0xFF
    if key == ord("d"):
        if (detectionEnabled == True):
            detectionEnabled = False
        else:
            detectionEnabled = True
    if key == ord("q"):
        break
video_capture.release()
cv2.destroyAllWindows()
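The MobileNetSSD sample reads each detection as a row of seven numbers. As a small plain-Python illustration of that step, the sketch below decodes one row. It assumes the layout the sample indexes into: image id, class id, confidence, and then two normalized corners. `decode_ssd_row`, the short class list and the sample row are hypothetical.

```python
def decode_ssd_row(row, frame_w, frame_h, class_names, min_confidence=0.5):
    """Decode one SSD detection row into (label, (startX, startY, endX, endY)).

    Assumed layout: [image_id, class_id, confidence, x1, y1, x2, y2],
    with the four corner values normalized to the 0..1 range.
    Returns None when the confidence is not above min_confidence.
    """
    confidence = row[2]
    if confidence <= min_confidence:
        return None
    idx = int(row[1])
    start_x = int(row[3] * frame_w)
    start_y = int(row[4] * frame_h)
    end_x = int(row[5] * frame_w)
    end_y = int(row[6] * frame_h)
    label = "{}: {:.2f}%".format(class_names[idx], confidence * 100)
    return label, (start_x, start_y, end_x, end_y)


# hypothetical row for a 640x480 frame and a shortened class list
names = ["background", "aeroplane", "bicycle"]
row = [0.0, 2.0, 0.875, 0.25, 0.5, 0.75, 1.0]
print(decode_ssd_row(row, 640, 480, names))
```

Unlike the YoloV3 output, these rows already hold corner coordinates, so no center-to-corner conversion is needed.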

Happy coding!

Greetings

El Bruno

Resources