DeepBlue Technology | Object Detection Over the Past Twenty Years

2020-07-31 10:00

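(3) Fast RCNN

In 2015, R. Girshick proposed the Fast RCNN detector [19], a further improvement over R-CNN and SPPNet. Fast RCNN makes it possible to train a detector and a bounding box regressor simultaneously under the same network configuration. On the VOC07 dataset, Fast RCNN raised the mAP from 58.5% (RCNN) to 70.0%, with a detection speed more than 200 times that of R-CNN.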
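The joint training of classifier and box regressor can be pictured as a single loss that sums a classification term and a box-regression term for each region of interest. The following is a minimal sketch, assuming the multi-task form described in the Fast RCNN paper [19] (log loss plus a smooth L1 term that is skipped for background regions). The function names, the convention that class 0 is background, and the weight `lam` are illustrative choices, not the authors' code.

```python
import math

def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5

def multitask_loss(class_probs, true_class, pred_box, target_box, lam=1.0):
    """Classification log loss plus box-regression loss for one RoI.

    Assumption: class 0 is background, so the box term is only
    added for foreground RoIs (true_class >= 1).
    """
    l_cls = -math.log(class_probs[true_class])
    if true_class == 0:
        return l_cls
    l_loc = sum(smooth_l1(p - t) for p, t in zip(pred_box, target_box))
    return l_cls + lam * l_loc
```

Because both terms are differentiated through the same network, one backward pass updates the classifier and the regressor together, which is the point of the design.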

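Although Fast RCNN successfully combines the advantages of R-CNN and SPPNet, its detection speed is still limited by proposal detection. A question then arises naturally: "Can we generate object proposals with a CNN model?" The later Faster R-CNN answered this question.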

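(4) Faster RCNN

In 2015, shortly after Fast RCNN, S. Ren et al. proposed the Faster RCNN detector [20]. Faster RCNN is the first end-to-end, and the first near-real-time, deep learning detector (COCO mAP@.5 = 42.7%, COCO mAP@[.5,.95] = 21.9%, VOC07 mAP = 73.2%, VOC12 mAP = 70.4%). Its main contribution is the Region Proposal Network (RPN), which makes region proposals nearly cost-free. From RCNN to Faster RCNN, most of the individual blocks of an object detection system, such as proposal detection, feature extraction, and bounding box regression, have gradually been integrated into a unified end-to-end learning framework.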
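To make the idea of an RPN more concrete: the network slides over the shared feature map and, at each location, scores a fixed set of reference boxes (anchors). The sketch below only enumerates those anchors. It assumes the 3 scales x 3 aspect ratios reported in the Faster RCNN paper [20] and a feature stride of 16 pixels; the function names are hypothetical, and the scoring and regression layers themselves are omitted.

```python
import math

def anchors_at(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (x1, y1, x2, y2) anchors centred at (cx, cy).

    Each anchor has area scale**2 and height/width equal to `ratio`.
    """
    boxes = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes

def all_anchors(feat_h, feat_w, stride=16):
    """Anchors for every cell of a feat_h x feat_w feature map.

    Assumption: each cell maps back to the centre of a
    stride x stride patch of the input image.
    """
    out = []
    for i in range(feat_h):
        for j in range(feat_w):
            cx = (j + 0.5) * stride
            cy = (i + 0.5) * stride
            out.extend(anchors_at(cx, cy))
    return out
```

Since the anchors are computed from the same feature map the detector already needs, proposing regions adds very little extra computation, which is what "nearly cost-free" refers to.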

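Although Faster RCNN breaks through the speed bottleneck of Fast RCNN, there is still computational redundancy in the subsequent detection stage. A variety of improvements were proposed later, including RFCN and Light head RCNN.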

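(5) Feature Pyramid Networks (FPN)

In 2017, T.-Y. Lin et al. proposed Feature Pyramid Networks (FPN) [21] on the basis of Faster RCNN. Before FPN, most deep-learning-based detectors ran detection only on the top layer of the network. Although the features in deeper layers of a CNN are beneficial for category recognition, they are not conducive to localizing objects. To this end, FPN introduces a top-down architecture with lateral connections for building high-level semantics at all scales. Since a CNN naturally forms a feature pyramid through its forward propagation, FPN shows great advances in detecting objects across a wide variety of scales. Using FPN in a basic Faster RCNN system achieves state-of-the-art single-model results on the MS-COCO dataset without bells and whistles (COCO mAP@.5 = 59.1%, COCO mAP@[.5,.95] = 36.2%). FPN has now become a basic building block of many of the latest detectors.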
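The top-down pathway with lateral connections can be sketched in a few lines. This is a toy version under stated assumptions: feature maps are single-channel 2-D lists, each level has exactly half the resolution of the previous one, and the 1x1 lateral and 3x3 smoothing convolutions used in the FPN paper [21] are left out so that only the upsample-and-add step remains. The function names are illustrative.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2-D list."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each column
        out.append(wide)
        out.append(list(wide))                     # repeat each row
    return out

def top_down(bottom_up):
    """Merge a bottom-up pyramid into a top-down one.

    bottom_up: list of 2-D maps ordered fine -> coarse.
    Returns merged maps in the same order.
    """
    merged = [bottom_up[-1]]  # start from the coarsest, most semantic map
    for lateral in reversed(bottom_up[:-1]):
        up = upsample2x(merged[0])
        fused = [[a + b for a, b in zip(r1, r2)]
                 for r1, r2 in zip(lateral, up)]
        merged.insert(0, fused)
    return merged
```

Each output level therefore mixes its own fine-grained features with semantics passed down from every coarser level, so detection can be run on all levels rather than only on the top one.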

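One-stage detectors based on convolutional neural networks

Figure: Development of one-stage detection and the structures of the various detectors [2]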

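(1) You Only Look Once (YOLO)

YOLO was proposed by J. Redmon et al. in 2015 [22]. It was the first one-stage detector of the deep learning era. YOLO is extremely fast: a fast version runs at 155 fps with VOC07 mAP = 52.7%, while its enhanced version runs at 45 fps with VOC07 mAP = 63.4% and VOC12 mAP = 57.9%. YOLO is short for "You Only Look Once". As the name suggests, the authors completely abandoned the previous detection paradigm of "proposal detection + verification". Instead, YOLO follows a totally different design: a single neural network is applied to the full image. The network divides the image into regions and simultaneously predicts bounding boxes and probabilities for each region. A series of improvements was later built on YOLO by Redmon and subsequent authors, including replacing FPN with the Path Aggregation Network (PAN) and defining new loss functions, leading to versions v2, v3 and v4 (as of this writing in July 2020, Ultralytics had released a "YOLO v5", which has not been officially recognized). These versions further improve detection accuracy while keeping a very high detection speed.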
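As a rough illustration of the "divide the image into regions" step: in the original YOLO paper [22] the image is split into an S x S grid, and the cell that contains an object's center is responsible for predicting it. The sketch below assumes the paper's settings of S = 7, B = 2 boxes per cell and C = 20 classes; the function names are hypothetical.

```python
def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Grid cell (row, col) that contains the box centre (cx, cy),
    plus the centre's offset inside that cell along x and y."""
    gx = cx / img_w * S
    gy = cy / img_h * S
    col = min(int(gx), S - 1)  # clamp centres lying on the right/bottom edge
    row = min(int(gy), S - 1)
    return row, col, gx - col, gy - row

def output_size(S=7, B=2, C=20):
    """Length of the prediction vector for one image: every cell predicts
    B boxes (x, y, w, h, confidence) and C class probabilities."""
    return S * S * (B * 5 + C)
```

For a 448 x 448 input, a box centered at (224, 224) falls in cell (3, 3) with offsets (0.5, 0.5), and the whole image is described by a single 7 x 7 x 30 output, which is why one forward pass is enough.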

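It must be pointed out that although YOLO greatly improves detection speed compared with two-stage detectors, its localization accuracy drops, especially for small objects. Later versions of YOLO and the subsequently proposed SSD pay more attention to this problem.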

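(2) Single Shot MultiBox Detector (SSD)

SSD was proposed by W. Liu et al. in 2015 [23]. It was the second one-stage detector of the deep learning era. The main contribution of SSD is the introduction of multi-reference and multi-resolution detection techniques, which significantly improve the detection accuracy of one-stage detectors, especially for small objects. SSD has advantages in both detection speed and accuracy (VOC07 mAP = 76.8%, VOC12 mAP = 74.9%, COCO mAP@.5 = 46.5%, mAP@[.5,.95] = 26.8%; a fast version runs at 59 fps). The main difference between SSD and earlier detectors is that the former detects objects of different scales on different layers of the network, while the latter run detection only on their top layers.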
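One way to see what "multi-reference, multi-resolution" means in practice is to list the default (reference) box sizes assigned to each prediction layer. The sketch assumes the linear scale rule from the SSD paper [23], with s_min = 0.2 and s_max = 0.9 expressed as fractions of the image size. The function names, the aspect ratios, and the choice of six layers in the usage line are illustrative.

```python
import math

def default_box_scales(m, s_min=0.2, s_max=0.9):
    """Scale of the default boxes on each of m >= 2 prediction layers,
    spaced linearly from s_min (finest layer) to s_max (coarsest layer)."""
    return [s_min + (s_max - s_min) * k / (m - 1) for k in range(m)]

def default_box_shapes(scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """(width, height) pairs, relative to image size, for one layer.
    Every shape keeps the same area, scale**2."""
    return [(scale * math.sqrt(a), scale / math.sqrt(a))
            for a in aspect_ratios]

# Six layers: small boxes on the fine layers, large boxes on the coarse ones.
scales = default_box_scales(6)
```

Small objects are matched against the small boxes on high-resolution layers and large objects against the large boxes on low-resolution layers, so no single layer has to cover every scale.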

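(3) RetinaNet

One-stage detectors are fast and structurally simple, but for years their accuracy trailed that of two-stage detectors. T.-Y. Lin et al. identified the reason and proposed RetinaNet in 2017 [24]. They argued that the extreme foreground-background class imbalance encountered while training dense detectors is the central cause. To this end, RetinaNet introduces a new loss function, the "focal loss", which reshapes the standard cross-entropy loss so that the detector puts more focus on hard-to-classify examples during training. Focal loss enables one-stage detectors to reach accuracy comparable to that of two-stage detectors while maintaining a very high detection speed (COCO mAP@.5 = 59.1%, mAP@[.5,.95] = 39.1%).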
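The reshaping of cross entropy can be written down directly. In the focal loss paper [24] the binary form is FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), where p_t is the predicted probability of the true class. The sketch below is a minimal scalar version, assuming the paper's commonly quoted defaults gamma = 2 and alpha = 0.25; the function names are illustrative.

```python
import math

def cross_entropy(p, y):
    """Binary cross entropy for predicted foreground probability p
    and label y (1 = foreground, 0 = background)."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Cross entropy scaled by (1 - p_t)**gamma, so well-classified
    examples (p_t close to 1) contribute very little."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

For an easy background example with p = 0.1 (so p_t = 0.9), the modulating factor is 0.01 and the loss shrinks by roughly two orders of magnitude relative to plain cross entropy, whereas a hard example with p_t = 0.1 keeps about 81% of its weight before the alpha term. This is how the many easy background boxes stop dominating training.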

References:

[1] Z. Zou, Z. Shi, Y. Guo, and J. Ye, “Object detection in 20 years: A survey.”

[2] X. Wu, D. Sahoo, and S. C. H. Hoi, “Recent advances in deep learning for object detection,” arXiv:1908.03673v1.

[3] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016.

[4] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in CVPR, 2014.

[5] K. He, G. Gkioxari, P. Dollar, and R. Girshick, “Mask r-cnn,” in ICCV, 2017.

[6] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Semantic image segmentation with deep convolutional nets and fully connected crfs,” arXiv preprint arXiv:1412.7062, 2014.

[7] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015.

[8] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” in Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, vol. 1. IEEE, 2001, pp. I–I.

[9] P. Viola and M. J. Jones, “Robust real-time face detection,” International journal of computer vision, vol. 57, no. 2, pp. 137–154, 2004.

[10] C. Papageorgiou and T. Poggio, “A trainable system for object detection,” International journal of computer vision, vol. 38, no. 1, pp. 15–33, 2000.

[11] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1. IEEE, 2005, pp. 886–893.

[12] P. Felzenszwalb, D. McAllester, and D. Ramanan, “A discriminatively trained, multiscale, deformable part model,” in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, 2008, pp. 1–8.

[13] P. F. Felzenszwalb, R. B. Girshick, and D. McAllester, “Cascade object detection with deformable part models,” in Computer vision and pattern recognition (CVPR), 2010 IEEE conference on. IEEE, 2010, pp. 2241–2248.

[14] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” IEEE transactions on pattern analysis and machine intelligence, vol. 32, no. 9, pp. 1627–1645, 2010.

[15] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.

[16] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Region-based convolutional networks for accurate object detection and segmentation,” IEEE transactions on pattern analysis and machine intelligence, vol. 38, no. 1, pp. 142–158, 2016.

[17] K. E. Van de Sande, J. R. Uijlings, T. Gevers, and A. W. Smeulders, “Segmentation as selective search for object recognition,” in Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011, pp. 1879–1886.

[18] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” in European conference on computer vision. Springer, 2014, pp. 346–361.

[19] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1440–1448.

[20] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.

[21] T.-Y. Lin, P. Dollar, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie, “Feature pyramid networks for object detection,” in CVPR, vol. 1, no. 2, 2017, p. 4.

[22] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.

[23] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” in European conference on computer vision. Springer, 2016, pp. 21–37.

[24] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar, “Focal loss for dense object detection,” IEEE transactions on pattern analysis and machine intelligence, 2018.
