Abstract: | Human Pose Estimation is an ongoing research topic in Computer Vision,
the aim of which is to locate a sparse set of points in a given image, corresponding
to the human body joints. It is of wide interest to the fields of
Automatic Human Behavior Understanding and Human Computer Interaction,
with applications ranging from animation to medical aid. Like many
other Computer Vision disciplines, the field of Human Pose Estimation has
benefited from the advent and development of Deep Learning and Convolutional
Neural Networks, and has been boosted to attain impressive results
on challenging datasets. State-of-the-art methods for Human Pose Estimation
build on deep networks that produce heatmaps, which loosely indicate the
spatial locations of the joints.
One of the appealing factors behind the success of Human Pose Estimation
methods is that very good results can be obtained from
just a single monocular RGB image, making them suitable for most portable
camera systems, such as those embedded in mobile phones. However, the
use of monocular RGB images has an important drawback: it incurs a
loss of information that seems crucial for full scene understanding. To wit,
human vision locates objects in 3D space thanks to stereo vision
and the parallax effect, cues that are unavailable when working with monocular
static images.
While there is plenty of research on both depth estimation and human pose estimation,
their combination remains largely unexplored. This thesis proposes a
simple yet effective method for human pose estimation, which applies a preprocessing
step to the monocular images, tasked with generating the depth
maps. Namely, this thesis joins both approaches in a single framework that
augments the input RGB images with their corresponding depth maps, in a
cascaded manner. First, this thesis uses a network that generates a pixel-wise
depth map from an input monocular image. This depth map is concatenated
with the color information and subsequently forwarded to another
network that estimates the heatmaps corresponding to the joint locations.
The input to the latter network is the four-channel volume made up of the RGB+Depth
information, and the topology of both networks builds upon the newly introduced
hourglass architecture.
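The cascade described above can be sketched roughly as follows. This is only an illustrative sketch, not the thesis implementation: `estimate_depth` and `estimate_heatmaps` are hypothetical placeholders standing in for the two hourglass networks, the image size and the number of joints (16) are assumed values, and reading each joint off the heatmap maximum is one common decoding choice rather than something stated here.

```python
import numpy as np


def estimate_depth(rgb):
    """Hypothetical stand-in for the depth network: returns an (H, W) map."""
    # Placeholder only: mean intensity instead of a learned depth prediction.
    return rgb.mean(axis=2)


def estimate_heatmaps(rgbd, num_joints):
    """Hypothetical stand-in for the pose network: returns (num_joints, H, W)."""
    height, width, _ = rgbd.shape
    # Placeholder only: random scores instead of learned heatmaps.
    rng = np.random.default_rng(0)
    return rng.random((num_joints, height, width))


def stack_rgbd(rgb, depth):
    """Append the depth map to the color image as a fourth channel."""
    return np.concatenate([rgb, depth[..., np.newaxis]], axis=2)


def decode_joints(heatmaps):
    """Take the highest-scoring pixel of each heatmap as (row, col)."""
    num_joints, _, width = heatmaps.shape
    flat_index = heatmaps.reshape(num_joints, -1).argmax(axis=1)
    return np.stack([flat_index // width, flat_index % width], axis=1)


# Stage 1: monocular RGB image -> pixel-wise depth map.
rgb = np.random.default_rng(1).random((64, 64, 3))  # assumed 64x64 input
depth = estimate_depth(rgb)

# Stage 2: RGB+Depth volume -> one heatmap per joint -> joint locations.
rgbd = stack_rgbd(rgb, depth)
heatmaps = estimate_heatmaps(rgbd, num_joints=16)  # assumed joint count
joints = decode_joints(heatmaps)
```

The only step taken directly from the description is the channel-wise concatenation in `stack_rgbd`; everything else exists to make the sketch self-contained.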
The proposed approach is evaluated on one of the most recent and extensive
benchmarks in Human Pose Estimation, showing the importance of
using the depth maps to achieve better performance.
|