#ComputerVision – Object Detection with #YoloV3 and #MobileNetSSD


Hi!

I have a ToDo on my list to add some new drone demos. To do this, I planned to run some tests with pretrained models and use them. The first two on my list are Yolo and MobileNetSSD (see references).

YoloV3

Let’s start with one of the most popular object detection tools, YOLOV3. The official definition:

YOLO (You Only Look Once) is a real-time object detection algorithm. It is a single deep convolutional neural network that splits the input image into a set of grid cells; unlike image classification or face detection, each grid cell in the YOLO output has an associated vector that tells us (see the sketch after this list):

If an object exists in that grid cell.

The class of that object (i.e label).

The predicted bounding box for that object (location).
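
To make that output vector concrete, here is a minimal sketch of how a single YoloV3 detection vector is laid out for the COCO-trained model used later in this post. The variable names are mine, for illustration only:

import numpy as np

# one YoloV3 detection vector: 4 box values + 1 objectness score + 80 class scores
detection = np.random.rand(85)  # placeholder values, for illustration only

centerX, centerY, width, height = detection[0:4]  # box, relative to the input size
objectness = detection[4]                         # does this cell contain an object?
classScores = detection[5:]                       # one score per COCO class
classID = np.argmax(classScores)                  # the predicted label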


I picked up some sample code from GitHub repositories and, as usual, from PyImageSearch (see references), and created a real-time object detection scenario using my webcam as the input feed for YoloV3.

Object Detection live sample with Yolo V3

The final demo works great: we can use all 80 classes that YoloV3 supports, and it runs at ~2 FPS.

MobileNetSSD

Another very popular object detection tool is MobileNetSSD. The important part here is SSD, Single Shot Detection. Let's go to the definition:

Single Shot object detection, or SSD, takes one single shot to detect multiple objects within the image; for example, detecting a coffee cup, a phone, a notebook, a laptop and glasses at the same time.

It is composed of two parts:

– extracting feature maps, and

– applying convolution filters to detect objects.

SSD was developed by Google researchers to maintain the balance between the two main object detection methods, YOLO and R-CNN.

There are two SSD models available:

– SSD300: the input size is fixed at 300×300. It is used on lower-resolution images, has faster processing speed, and is less accurate than SSD512.

– SSD512: the input size is fixed at 512×512. It is used on higher-resolution images and is more accurate than SSD300.

SSD is faster than R-CNN because R-CNN needs two shots, one to generate region proposals and one to detect objects, whereas SSD does both in a single shot.
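
With OpenCV's dnn module, this single-shot behaviour is easy to see: one forward() call returns every detection for a frame. A minimal sketch, assuming the same Caffe model files used in the full script below and a local test image named sample.jpg (my placeholder name):

import cv2

net = cv2.dnn.readNetFromCaffe("MobileNetSSD_deploy.prototxt.txt", "MobileNetSSD_deploy.caffemodel")
frame = cv2.imread("sample.jpg")  # any test image
blob = cv2.dnn.blobFromImage(cv2.resize(frame, (300, 300)), 0.007843, (300, 300), 127.5)
net.setInput(blob)
detections = net.forward()  # one shot: an array of shape (1, 1, N, 7)
# each detection row: [batch_id, class_id, confidence, x1, y1, x2, y2], coords in [0, 1]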

The MobileNetSSD model was first trained on the COCO dataset and then fine-tuned on PASCAL VOC, reaching 72.7% mAP (mean average precision).
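
As a reminder, mAP scores come from matching predicted boxes to ground-truth boxes by their intersection over union (IoU), then averaging precision per class and across classes. A minimal IoU helper as a sketch (boxes as (x1, y1, x2, y2) tuples, my own convention):

def iou(boxA, boxB):
    # boxes as (x1, y1, x2, y2); returns intersection over union in [0, 1]
    x1 = max(boxA[0], boxB[0])
    y1 = max(boxA[1], boxB[1])
    x2 = min(boxA[2], boxB[2])
    y2 = min(boxA[3], boxB[3])
    intersection = max(0, x2 - x1) * max(0, y2 - y1)
    areaA = (boxA[2] - boxA[0]) * (boxA[3] - boxA[1])
    areaB = (boxB[2] - boxB[0]) * (boxB[3] - boxB[1])
    return intersection / float(areaA + areaB - intersection)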

For this demo I'll use the SSD300 model. Even if the drone supports better-quality images, and the SSD512 model works with bigger images, SSD300 is a good fit for this scenario.

Object Detection with MobileNetSSD

This sample works at ~20 FPS, which triggered my curiosity to learn more about MobileNetSSD. I started to read a lot about it and found some amazing articles and papers. In the end, if you are interested in my personal take, I really enjoyed this 30-minute video comparing the different detectors side by side.
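
If you want to reproduce the FPS gap on your own machine, timing a single forward pass of each network isolates the inference cost from capture and drawing. A rough sketch (the helper name is mine, and it assumes an OpenCV version that provides net.getUnconnectedOutLayersNames(); real FPS also includes the resize and putText work):

import time
import cv2

def forwardMs(net, blob, runs = 20):
    # average net.forward() time in milliseconds, after one warm-up run
    net.setInput(blob)
    net.forward(net.getUnconnectedOutLayersNames())  # warm-up
    start = time.time()
    for _ in range(runs):
        net.setInput(blob)
        net.forward(net.getUnconnectedOutLayersNames())
    return (time.time() - start) / runs * 1000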

Source Code

YoloV3 webcam live object detection

# Bruno Capuano 2020
# display the camera feed using OpenCV
# display FPS
# load YOLO object detector trained with COCO dataset (80 classes)
# analyze each camera frame using YoloV3 (a banana-only filter is sketched after the listing)
# enable / disable object detection pressing the D key
import numpy as np
import time
import cv2
import os

def initYoloV3():
    global labelColors, layerNames, net
    # random color collection for each class label
    np.random.seed(42)
    labelColors = np.random.randint(0, 255, size=(len(Labels), 3), dtype="uint8")
    # load model
    net = cv2.dnn.readNetFromDarknet(configPath, weightsPath)
    layerNames = net.getLayerNames()
    layerNames = [layerNames[i[0] - 1] for i in net.getUnconnectedOutLayers()]

def analyzeFrame(frame, displayBoundingBox = True, displayClassName = True, displayConfidence = True):
    global H, W
    # init
    if W is None or H is None:
        (H, W) = frame.shape[:2]
    if net is None:
        initYoloV3()

    # build a blob and run a forward pass through the network
    yoloV3ImgSize = (416, 416)
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, yoloV3ImgSize, swapRB=True, crop=False)
    net.setInput(blob)
    start = time.time()
    layerOutputs = net.forward(layerNames)
    end = time.time()

    # collect boxes, confidences and class IDs above the confidence threshold
    boxes = []
    confidences = []
    classIDs = []
    for output in layerOutputs:
        for detection in output:
            scores = detection[5:]
            classID = np.argmax(scores)
            confidence = scores[classID]
            if confidence > confidenceDef:
                # scale the box back to frame size; YOLO returns the box center
                box = detection[0:4] * np.array([W, H, W, H])
                (centerX, centerY, width, height) = box.astype("int")
                x = int(centerX - (width / 2))
                y = int(centerY - (height / 2))
                boxes.append([x, y, int(width), int(height)])
                confidences.append(float(confidence))
                classIDs.append(classID)

    # non-maxima suppression to remove overlapping boxes
    idxs = cv2.dnn.NMSBoxes(boxes, confidences, confidenceDef, thresholdDef)
    if len(idxs) > 0:
        for i in idxs.flatten():
            (x, y) = (boxes[i][0], boxes[i][1])
            (w, h) = (boxes[i][2], boxes[i][3])
            if (displayBoundingBox):
                color = [int(c) for c in labelColors[classIDs[i]]]
                cv2.rectangle(frame, (x, y), (x + w, y + h), color, 2)
            if (displayClassName and displayConfidence):
                text = "{}: {:.4f}".format(Labels[classIDs[i]], confidences[i])
                cv2.putText(frame, text, (x, y - 5), cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2)
            elif (displayClassName):
                text = str(f"{Labels[classIDs[i]]}:")
                cv2.putText(frame, text, (x, y - 5), cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2)

# Camera Settings
camera_Width  = 640  # 1024 # 1280 # 640
camera_Heigth = 480  # 780  # 960  # 480
frameSize = (camera_Width, camera_Heigth)
video_capture = cv2.VideoCapture(1)
time.sleep(2.0)
(W, H) = (None, None)

# YOLO Settings
weightsPath = "yolov3.weights"
configPath = "yolov3.cfg"
LabelsPath = "coco.names"
Labels = open(LabelsPath).read().strip().split("\n")
confidenceDef = 0.5
thresholdDef = 0.3
net = None
labelColors = None
layerNames = None

i = 0
detectionEnabled = False
while True:
    i = i + 1
    start_time = time.time()
    ret, frameOrig = video_capture.read()
    frame = cv2.resize(frameOrig, frameSize)

    if (detectionEnabled):
        analyzeFrame(frame)

    if (time.time() - start_time) > 0:
        fpsInfo = "FPS: " + str(1.0 / (time.time() - start_time))  # FPS = 1 / time to process loop
        font = cv2.FONT_HERSHEY_DUPLEX
        cv2.putText(frame, fpsInfo, (10, 20), font, 0.4, (255, 255, 255), 1)

    cv2.imshow('@elbruno – YoloV3 Object Detection', frame)

    # key controller: D toggles detection, Q quits
    key = cv2.waitKey(1) & 0xFF
    if key == ord("d"):
        detectionEnabled = not detectionEnabled
    if key == ord("q"):
        break

video_capture.release()
cv2.destroyAllWindows()
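
The header comment mentions a banana filter, but the listing above draws every COCO class it finds. If you want to restrict detection to a single class, a minimal tweak inside the detection loop of analyzeFrame would do it (a sketch, not part of the original script):

# inside analyzeFrame, right after classID is computed:
if Labels[classID] != "banana":
    continue  # skip every detection that is not a banana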

MobileNetSSD webcam live object detection

# Bruno Capuano 2020
# display the camera feed using OpenCV
# display FPS
# load MobileNetSSD object detector trained on COCO and fine-tuned on PASCAL VOC (20 classes + background)
# analyze each camera frame using MobileNetSSD
# enable / disable object detection pressing the D key
import numpy as np
import time
import cv2
import os

def initMobileNetSSD():
    global classesMobileNetSSD, colorsMobileNetSSD, net
    classesMobileNetSSD = ["background", "aeroplane", "bicycle", "bird", "boat",
        "bottle", "bus", "car", "cat", "chair", "cow", "diningtable",
        "dog", "horse", "motorbike", "person", "pottedplant", "sheep",
        "sofa", "train", "tvmonitor"]
    # random color collection for each class label
    colorsMobileNetSSD = np.random.uniform(0, 255, size=(len(classesMobileNetSSD), 3))
    # load model
    net = cv2.dnn.readNetFromCaffe(prototxtFile, modelFile)

def analyzeFrame(frame, displayBoundingBox = True, displayClassName = True, displayConfidence = True):
    global H, W
    # init
    if W is None or H is None:
        (H, W) = frame.shape[:2]
    if net is None:
        initMobileNetSSD()

    # build a blob and run a single forward pass (single shot)
    mobileNetSSDImgSize = (300, 300)
    blob = cv2.dnn.blobFromImage(cv2.resize(frame, mobileNetSSDImgSize), 0.007843, mobileNetSSDImgSize, 127.5)
    net.setInput(blob)
    detections = net.forward()

    for i in np.arange(0, detections.shape[2]):
        confidence = detections[0, 0, i, 2]
        if confidence > confidenceDef:
            idx = int(detections[0, 0, i, 1])
            # scale the box back to frame size
            box = detections[0, 0, i, 3:7] * np.array([W, H, W, H])
            (startX, startY, endX, endY) = box.astype("int")
            if (displayBoundingBox):
                cv2.rectangle(frame, (startX, startY), (endX, endY), colorsMobileNetSSD[idx], 2)
            if (displayClassName and displayConfidence):
                label = "{}: {:.2f}%".format(classesMobileNetSSD[idx], confidence * 100)
                y = startY - 15 if startY - 15 > 15 else startY + 15
                cv2.putText(frame, label, (startX, y), cv2.FONT_HERSHEY_SIMPLEX, 0.5, colorsMobileNetSSD[idx], 2)
            elif (displayClassName):
                label = str(f"{classesMobileNetSSD[idx]}")
                y = startY - 15 if startY - 15 > 15 else startY + 15
                cv2.putText(frame, label, (startX, y), cv2.FONT_HERSHEY_SIMPLEX, 0.5, colorsMobileNetSSD[idx], 2)

# Camera Settings
camera_Width  = 640  # 1024 # 1280 # 640
camera_Heigth = 480  # 780  # 960  # 480
frameSize = (camera_Width, camera_Heigth)
video_capture = cv2.VideoCapture(1)
time.sleep(2.0)
(W, H) = (None, None)

# MobileNetSSD Settings
confidenceDef = 0.5
thresholdDef = 0.3
prototxtFile = "MobileNetSSD_deploy.prototxt.txt"
modelFile = "MobileNetSSD_deploy.caffemodel"
net = None
classesMobileNetSSD = None
colorsMobileNetSSD = None

i = 0
detectionEnabled = False
while True:
    i = i + 1
    start_time = time.time()
    ret, frameOrig = video_capture.read()
    frame = cv2.resize(frameOrig, frameSize)

    if (detectionEnabled):
        analyzeFrame(frame)

    if (time.time() - start_time) > 0:
        fpsInfo = "FPS: " + str(1.0 / (time.time() - start_time))  # FPS = 1 / time to process loop
        font = cv2.FONT_HERSHEY_DUPLEX
        cv2.putText(frame, fpsInfo, (10, 20), font, 0.4, (255, 255, 255), 1)

    cv2.imshow('@elbruno – MobileNetSSD Object Detection', frame)

    # key controller: D toggles detection, Q quits
    key = cv2.waitKey(1) & 0xFF
    if key == ord("d"):
        detectionEnabled = not detectionEnabled
    if key == ord("q"):
        break

video_capture.release()
cv2.destroyAllWindows()
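
A quick note on the model files: both scripts expect them next to the script. The yolov3.cfg, yolov3.weights and coco.names files are usually available from the official Darknet website and repository, and the MobileNetSSD prototxt and caffemodel from the chuanqi305/MobileNet-SSD GitHub repository. You may also need to change cv2.VideoCapture(1) to cv2.VideoCapture(0), depending on which index your webcam uses.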

Happy coding!

Greetings

El Bruno

More posts on my blog ElBruno.com.

More info at https://beacons.ai/elbruno


Resources
