Posted by Siyuan Qiao, Student Researcher and Liang-Chieh Chen, Research Scientist, Google Research
People are able to retrieve the visual information about 3D environments from a picture quite easily — we can identify objects, determine instance sizes, and reconstruct 3D scene layout, all using the limited signals contained in 2D images. This ability is commonly known as the inverse projection problem, which refers to reconstructing the ambiguous mapping from the retinal images to the sources of retinal stimulation. Real-world computer vision applications, such as autonomous driving, heavily rely on these capabilities to localize and identify 3D objects, which require vision models to infer the spatial location, semantic class, and instance label for each 3D point projected to the 2D images. The ability to reconstruct the 3D world from images can be decomposed into two disjoint computer vision tasks: monocular depth estimation (predicting depth from a single image) and video panoptic segmentation (the unification of instance segmentation and semantic segmentation, in the video domain). However, research has generally considered each task separately. Tackling these tasks jointly with a unified computer vision model could result in easier deployment and greater efficiency by sharing computation among multiple tasks.
Driven by the potential value of a model that predicts depth and video panoptic segmentation at the same time, we present “ViP-DeepLab: Learning Visual Perception with Depth-aware Video Panoptic Segmentation”, accepted to CVPR 2021. In this work, we propose a new task, depth-aware video panoptic segmentation, that aims to simultaneously tackle monocular depth estimation and video panoptic segmentation. For the new task, we present two derived datasets accompanied by a new evaluation metric called depth-aware video panoptic quality (DVPQ). This new metric includes the metrics for depth estimation and video panoptic segmentation, requiring a vision model to simultaneously tackle the two sub-tasks. To this end, we extend Panoptic-DeepLab by adding network branches for depth and video predictions to create ViP-DeepLab,
This article is purposely trimmed, please visit the source to read the full article.
The post Holistic Video Scene Understanding with ViP-DeepLab appeared first on Google AI Blog.