Object detection is one of the most fundamental problems in Computer Vision, and has powered many of the significant advances in this field. It has applications in a wide range of areas such as advertising, driverless cars, healthcare, robotics, security, and others. This quasi-technical paper discusses the evolution, critical areas of research, and recent innovations in this area.
Introduction to Object Detection
Object detection is a sub-field of computer vision that addresses the problem of detecting instances of objects within digital images and videos. It generally includes object classification (understanding ‘what’ the object instance is) and localization (understanding ‘where’ the object instance is). The overall goal is to develop a semantic understanding of objects and their complex patterns (e.g., the types of objects, their locations, their concepts or meanings, or their movements over time).
What makes object detection incredibly powerful is that it forms the basis for most real-world computer vision applications today. Apart from its primary function, it serves as the basis for conducting other computer vision tasks, such as activity recognition, event detection, face recognition, instance & semantic segmentation, image captioning, object tracking, scene understanding, vision-based emotion detection, video summarization, and others.
Object detection has a wide range of applications today. Key examples include:
- Advertising: Brand & Logo detection, Human Behavior Analysis
- Autonomous Driving: Pedestrian Detection, Traffic Light Detection
- Defense & Space Exploration: Remote Sensing Target Detection
- Healthcare: Medical Imaging & Diagnostics
- Robotics: Robot Vision & Manipulation
- Security: Video Surveillance, Traffic Monitoring
- Others: Digital Watermarking, Video Communication
Object detection is still in the formative stages. For instance, there are challenges in building detectors that can generalize accurately under changing viewpoints, backgrounds or lighting conditions; and in high clutter, occlusions or low resolution. As a result, while object detection offers significant value in digital transformation, there are still substantial challenges to be addressed.
The Evolution of Object Detection
Object detection has been an area of research for several decades. While it started gaining prominence in the 1990s, it was only in this century that breakthroughs took place.
The evolution of object detection is best described in terms of two phases: Pre-AlexNet and Post-AlexNet.
The Pre-AlexNet (Traditional) Phase
This phase witnessed the development of three major frameworks that shaped a large part of object detection’s journey to date.
- The Viola-Jones (VJ) Detector, 2001
- The Histogram of Oriented Gradients (HOG) Detector, 2005
- The Deformable Part-based Model (DPM) Framework, 2008
Traditional object detection is primarily based on handcrafted features and shallow trainable models (for example, boosted classifiers and support vector machines). However, AlexNet’s performance in the 2012 ImageNet Challenge brought about a significant shift in this field, both in technology development and in operational practices. Deep learning networks gradually became the norm because they can learn deeper, more complex visual features directly from data.
While most traditional detection approaches may have been replaced today, they still have a role to play. For instance, the widely applied ‘sliding windows’ concept heavily borrows from the Viola-Jones Detector. Similarly, the HOG Detector provides a practical framework for crucial computer vision tasks (such as pedestrian detection) that remains relevant today. Finally, modern object detectors borrow heavily from the DPM practices of bounding box regression, mixture models, hard negative mining, and others.
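The sliding-window idea mentioned above can be sketched in a few lines. This is a minimal illustration, not the Viola-Jones implementation: the function name `sliding_windows`, its parameters, and the toy image size are assumptions made for the example. A classifier would then be run on each window that the generator yields.

```python
from typing import Iterator, Tuple

# Hypothetical box format for this sketch: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[int, int, int, int]


def sliding_windows(image_w: int, image_h: int,
                    win_w: int, win_h: int,
                    stride: int) -> Iterator[Box]:
    """Yield every window position that fits fully inside the image."""
    if stride <= 0:
        raise ValueError("stride must be positive")
    # Scan top-to-bottom, left-to-right, moving by `stride` pixels each step.
    for y in range(0, image_h - win_h + 1, stride):
        for x in range(0, image_w - win_w + 1, stride):
            yield (x, y, x + win_w, y + win_h)


# Toy example: an 8x6 image scanned with a 4x4 window and a stride of 2.
windows = list(sliding_windows(image_w=8, image_h=6, win_w=4, win_h=4, stride=2))
```

In practice the scan is usually repeated at several window sizes or image scales, which is one reason exhaustive sliding windows are expensive.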
The Post-AlexNet (Deep Learning) Phase
Two approaches emerged during this phase.
- Dual-Stage Detection: This has been the predominant approach so far, and remains a powerful paradigm. The first stage generates the regions of interest (object proposals), while the second stage classifies these proposals and locates the objects using bounding box regression. Examples include Feature Pyramid Networks (FPN), Spatial Pyramid Pooling Networks (SPP-net), and the R-CNN Family (R-CNN, Fast R-CNN, Faster R-CNN, Mask R-CNN).
- Single-Stage Detection: This paradigm came into prominence with the advent of YOLO (You Only Look Once), and subsequently gained significant traction with the Single Shot Multibox Detector (SSD), RetinaNet, and others. In this paradigm, the detector skips the region proposal stage, and directly classifies and locates the bounding boxes.
The choice between single-stage and dual-stage detection is a trade-off between speed and accuracy.
While single-stage detectors are generally faster, the dual-stage ones are usually more accurate. For instance, SSD seldom beats Faster R-CNN in accuracy metrics though it is inherently much faster. Efforts are in progress to build detectors that can achieve both accuracy and speed.
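Both paradigms rely on bounding box regression to locate objects. The sketch below shows the center/size offset parameterization commonly associated with the R-CNN family: the network predicts four offsets relative to a reference box (a proposal or an anchor), and these are decoded into a refined box. The function name `decode_box` and the sample numbers are assumptions for illustration; individual detectors may scale or clip the offsets differently.

```python
import math
from typing import Tuple

# Hypothetical box format for this sketch: (center_x, center_y, width, height).
CenterBox = Tuple[float, float, float, float]


def decode_box(reference: CenterBox,
               deltas: Tuple[float, float, float, float]) -> CenterBox:
    """Apply predicted offsets (dx, dy, dw, dh) to a reference box."""
    rx, ry, rw, rh = reference
    dx, dy, dw, dh = deltas
    # Center shifts are expressed as a fraction of the reference width/height.
    cx = rx + dx * rw
    cy = ry + dy * rh
    # Size changes are predicted in log space, so the decoded size stays positive.
    w = rw * math.exp(dw)
    h = rh * math.exp(dh)
    return (cx, cy, w, h)


# Shift the center slightly, keep the width, and double the height.
refined = decode_box((50.0, 40.0, 20.0, 10.0), (0.1, -0.2, 0.0, math.log(2.0)))
```

Predicting relative offsets instead of absolute coordinates keeps the regression targets in a similar numeric range for small and large objects.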
A Complex Subject To Understand
Computer vision, in general, and object detection, in particular, are considered to be more complex than most other areas of Artificial Intelligence. Consider the following diagram as an example. This is a high-level architecture of the RetinaNet detector that powers several computer vision applications today.
This architecture is complex to implement and optimize during model development. Furthermore, as evident from the diagram, building on RetinaNet requires practitioners to understand the ResNet architecture and the Feature Pyramid Network, both complex areas in their own right. This is just one of many such examples. As more advanced architectures and engines are developed, both the breadth and depth of technical understanding will need to keep increasing. This implies the need for dedicated computer vision specialists to take full advantage of these innovations.
Key Innovations and Recent Advances
Most innovations in recent times have largely focused on the following areas.
- Advances in Backbone Engines, such as VGG, Deep Residual Networks (ResNet), DenseNet, Squeeze & Excitation Networks (SENet), ResNeXt & DetNet.
- Development of new detectors (such as CenterNet, DeepRegionLet, RefineDet, and others), and upgrades to earlier ones (e.g., new YOLO versions).
- Innovations in feature representations (e.g., feature fusion and high-resolution feature learning with large receptive fields), and context encoding & refinement.
- Localization improvements (e.g., bounding box refinement, and new loss functions) and anchor-free detection methods.
- Robustness in detection through rotation-invariant loss functions, scale adaptive detection, learning with enriched features, multi-task loss functions, and others.
- Other direct and indirect areas, such as adversarial learning, capsule networks, knowledge distillation, light-weight object detection, memory-efficient networks, and transfer learning.
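Localization quality in the areas listed above is usually measured with Intersection over Union (IoU), and several of the newer localization losses build on it. The following is a minimal sketch of plain IoU for axis-aligned boxes; the function name `iou` and the corner-based box format are assumptions for the example, and it is not the loss used by any specific detector named here.

```python
from typing import Tuple

# Hypothetical box format for this sketch: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    # Overlap rectangle; width/height are clamped to zero when boxes are disjoint.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1).
overlap = iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
```

Because plain IoU is zero for all non-overlapping boxes, it gives no gradient signal in that case, which is one motivation for IoU-based loss variants.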
The past two years have witnessed exciting developments in this field. Innovations like CornerNet, CenterNet, ExtremeNet, FCOS & FSAF have pushed the envelope on anchor-free detection. Others like DetNAS, EfficientNet & NAS-FPN have focused on driving Auto-ML capabilities in object detection. Facebook released the Detectron2 framework, its next-generation object detection system. Newer detectors and other related artifacts were released by the open-source community.
Specifically, in 2020, YOLOv4 was released in April with much-improved speed and accuracy compared with its predecessors. The architecture uses CSPDarknet53 as the backbone; Spatial Pyramid Pooling (SPP) and Path Aggregation Network (PAN) as the neck; and the YOLOv3 head. In May, Facebook announced Detection Transformers (DETR), a new approach to object detection using transformers (which have gained significant success in NLP in the past two years). In June, Google and Johns Hopkins released DetectoRS, which combines Recursive Feature Pyramids with Switchable Atrous Convolution. The researchers claim SOTA results in object detection, instance segmentation, and panoptic segmentation. While this still needs to be validated by the global AI community, early results seem to be encouraging.
Current Trends and Future Directions
The key focus areas in the preceding section continue to gain R&D attention. In addition, the current trends and future directions pertain to the following areas:
3D Object Detection: The detection of 3D objects has its own requirements. For instance, these objects do not follow any specific orientation, and this poses considerable challenges. Certain advances have been made in recent years, but a lot still remains to be done to consistently achieve high performance. This is a key area of research today, particularly in the domain of autonomous driving.
Anchor-Free Detection: Anchor-free techniques (such as keypoint-based and center-based methods) have enabled AI researchers to address some of the known limitations of anchor-based methods. For instance, these detectors eliminate the hyperparameter setup related to anchors. However, many anchor-free detectors are known to deliver only middling performance during production inference, particularly for large-scale workloads. This area has received a lot of focus in recent years, and remains significant.
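To make the center-based idea concrete, the sketch below picks object centers as local peaks in a per-class score heatmap, loosely in the spirit of center-based detectors. It is a simplified illustration under stated assumptions: the function name `find_centers`, the 3x3 neighbourhood, the threshold, and the toy heatmap are all chosen for the example, and a real detector would additionally regress a box size at each peak.

```python
from typing import List, Tuple


def find_centers(heatmap: List[List[float]],
                 threshold: float) -> List[Tuple[int, int, float]]:
    """Return (row, col, score) for cells that are 3x3 local maxima above threshold."""
    rows = len(heatmap)
    cols = len(heatmap[0]) if rows else 0
    peaks = []
    for r in range(rows):
        for c in range(cols):
            score = heatmap[r][c]
            if score < threshold:
                continue
            # Collect the 3x3 neighbourhood (including the cell itself),
            # clipped at the heatmap borders.
            neighbours = [heatmap[rr][cc]
                          for rr in range(max(0, r - 1), min(rows, r + 2))
                          for cc in range(max(0, c - 1), min(cols, c + 2))]
            # Keep the cell only if nothing nearby scores higher (ties are kept).
            if score >= max(neighbours):
                peaks.append((r, c, score))
    return peaks


toy_heatmap = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.3, 0.1],
    [0.1, 0.3, 0.2, 0.7],
    [0.0, 0.1, 0.1, 0.2],
]
centers = find_centers(toy_heatmap, threshold=0.5)
```

No anchor shapes, scales, or aspect ratios appear anywhere in this procedure, which is the hyperparameter saving the paragraph above refers to.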
Imbalance Problems: In real-life computer vision applications, imbalances cause issues in achieving high detection accuracy. These include Class Imbalances (foreground-background, foreground-foreground, etc.), Scale Imbalance (box-level, feature-level, etc.), Spatial Imbalance (IoU distribution, regression loss, etc.), and Objective Imbalance. Significant attention is paid today to address these issues.
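As an illustration of how foreground-background class imbalance can be tackled, the sketch below implements the binary focal loss introduced with RetinaNet, one widely used remedy (it is an example chosen here, not a method singled out in the text above). The function name `focal_loss` is an assumption; the defaults `alpha=0.25` and `gamma=2.0` follow the values commonly reported for RetinaNet.

```python
import math


def focal_loss(p: float, is_foreground: bool,
               alpha: float = 0.25, gamma: float = 2.0) -> float:
    """Binary focal loss for a single predicted foreground probability `p`."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
    # p_t is the probability the model assigned to the true class.
    p_t = p if is_foreground else 1.0 - p
    alpha_t = alpha if is_foreground else 1.0 - alpha
    # (1 - p_t) ** gamma shrinks the loss of well-classified (easy) examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# An easy background example contributes far less than a hard foreground one.
easy_background = focal_loss(0.01, is_foreground=False)
hard_foreground = focal_loss(0.10, is_foreground=True)
```

With `gamma=0` the modulating factor disappears and the expression reduces to alpha-weighted cross-entropy, which makes the effect of the focusing term easy to compare.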
Low-shot & Weakly Supervised Detection: Most detectors require significant amounts of labeled data for training. This is highly inefficient as it increases the development cycle time and costs. Efforts are in progress to build detectors with limited labeled data, such as image-level annotations instead of bounding boxes.
Real-time, high-speed Detection: Object detection is resource-intensive, both in terms of human intervention and model processing (compute). As a result, high-speed detection in real-time, particularly for mobile devices, is an important development area. AutoML in object detection is also an important focus area.
Small-Object Detection: Most detectors struggle with small objects. Error rates in small-object detection are considerably higher than those for medium-sized or large objects. For instance, remote sensing of small targets in disaster management or military applications is still under-developed. This is an important area of research today.
Video-based Object Detection: Modern object detection is primarily designed for images, and not explicitly for videos. As a result, videos need to be split into individual frames before detection can happen. This creates inefficiencies such as delays in detection, the overhead of converting videos into frames, and the loss of relationships between frames. Addressing these aspects is a critical area of current and future development.
The above list is, by no means, exhaustive. Other focus areas include domain adaptation, few-shot detection of complex cases, camouflaged object detection, panoptic segmentation, saliency detection, shadow instance detection, etc.
Closing Comments
Object detection has witnessed a steady stream of innovations over the past few years. Combined with the accelerated pace of developments in deep learning and high-end computing, it is driving critical applications in the field of computer vision today. At the same time, its evolutionary journey is still in the early stages. Many problems are yet to be addressed, and newer challenges keep cropping up. The global AI community has rightly focused on the core areas that enable the mainstreaming of this field (such as backbone networks, highly accurate single-stage detectors, faster production inference, and others). Moreover, newer detection paradigms are constantly being explored to push the boundaries of innovation. The next few years are going to be truly exciting!
Acknowledgment:
Object Detection in 20 Years: A Survey – Zhengxia Zou, Zhenwei Shi, Yuhong Guo & Jieping Ye (2019)