Human pose estimation using convolutional neural networks

Bibliographic record details
Main author: Ντίνου, Ιωάννα
Other authors: Οικονόμου, Γεώργιος
Format: Thesis
Language: English
Published: 2018
Subjects:
Available online: http://hdl.handle.net/10889/11344
Description
Abstract: Human Pose Estimation is an ongoing research topic in Computer Vision whose aim is to locate a sparse set of points in a given image, corresponding to the joints of the human body. It is of wide interest to the fields of Automatic Human Behavior Understanding and Human-Computer Interaction, with applications ranging from animation to medical aid. Like many other Computer Vision disciplines, Human Pose Estimation has benefited from the advent and development of Deep Learning and Convolutional Neural Networks, and now attains impressive results on challenging datasets. State-of-the-art methods for Human Pose Estimation build on deep networks that produce heatmaps as a loose way of spatially locating the joints. One of the appealing factors behind the success of these methods is that very good results can be obtained from a single monocular RGB image, which makes them suitable for most portable camera systems, such as those embedded in mobile phones. However, the use of monocular RGB images has an important drawback: it incurs a loss of information that seems crucial for full scene understanding. Human vision locates objects in 3D space thanks to stereo vision and the parallax effect, cues that are unavailable when working with static monocular images. While there is plenty of research on both depth estimation and human pose estimation, their combination has yet to be explored. This thesis proposes a simple yet effective method for human pose estimation that applies a preprocessing step to the monocular images, tasked with generating depth maps. Namely, it joins both approaches in a single framework that augments the input RGB images with their corresponding depth maps, in a cascaded manner. First, a network generates a pixel-wise depth map from an input monocular image. This depth map is then concatenated with the color information and forwarded to a second network that estimates the heatmaps corresponding to the joint locations. The input to the latter network is the four-channel volume made up of the RGB+Depth information, and the topology of both networks builds upon the recently introduced hourglass architecture. The proposed approach is evaluated on one of the most recent and extensive benchmarks in Human Pose Estimation, showing the importance of using depth maps to achieve better performance.
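
The cascade described in the abstract (depth prediction, concatenation with the color channels, heatmap prediction) can be illustrated with a minimal NumPy sketch. It is not the thesis implementation: the function names `stack_rgbd`, `decode_heatmaps`, and `estimate_pose` are hypothetical, the two networks are passed in as opaque callables (`depth_net`, `pose_net`) rather than built as hourglass models, and taking the peak of each heatmap as the joint location is an assumed decoding rule that the abstract does not specify.

```python
import numpy as np


def stack_rgbd(rgb, depth):
    """Append a depth map to an RGB image as a fourth channel.

    rgb:   array of shape (H, W, 3)
    depth: array of shape (H, W)
    Returns an array of shape (H, W, 4).
    """
    if rgb.shape[:2] != depth.shape:
        raise ValueError("rgb and depth must share height and width")
    return np.concatenate([rgb, depth[..., None]], axis=-1)


def decode_heatmaps(heatmaps):
    """Return the (row, col) of the peak of each joint heatmap.

    heatmaps: array of shape (J, H, W), one map per joint.
    Returns an integer array of shape (J, 2).
    Assumption: the joint is placed at the heatmap maximum.
    """
    num_joints, height, width = heatmaps.shape
    flat_peaks = heatmaps.reshape(num_joints, -1).argmax(axis=1)
    rows, cols = np.unravel_index(flat_peaks, (height, width))
    return np.stack([rows, cols], axis=1)


def estimate_pose(rgb, depth_net, pose_net):
    """Cascade: predict depth, stack RGB+depth, predict heatmaps, decode.

    depth_net and pose_net are hypothetical callables standing in for
    the two trained networks; any function with matching shapes works.
    """
    depth = depth_net(rgb)          # (H, W) pixel-wise depth map
    rgbd = stack_rgbd(rgb, depth)   # (H, W, 4) network input
    heatmaps = pose_net(rgbd)       # (J, H', W') one heatmap per joint
    return decode_heatmaps(heatmaps)
```

Passing the networks in as callables keeps the sketch independent of any deep learning framework; in a real system they would be trained models, and the decoded coordinates would still need rescaling from heatmap resolution to image resolution.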