Advances in monocular exemplar-based human body pose analysis

  1. Rogez, Grégory
Supervised by:
  1. Carlos Orrite Uruñuela (Supervisor)

Defending university: Universidad de Zaragoza

Date of defense: 8 June 2012

Committee:
  1. Armando Roy Yarza (Chair)
  2. José María Martínez Montiel (Secretary)
  3. Nicolás Pérez de la Blanca Capilla (Member)
  4. Dimitrios Makris (Member)
  5. Dariu M. Gavrila (Member)

Type: Thesis

Abstract

This thesis makes several contributions to one of the most active research areas in computer vision: the analysis of human body pose from monocular images. The problem has a broad range of potential applications in fields such as human-computer interfaces, safety (surveillance, biometrics) and biomedicine (sport and motion analysis). Exemplar-based techniques have been very successful for human body pose analysis; however, their accuracy depends strongly on how similar the camera viewing angle and scene properties are between training and testing images. Given a typical training dataset captured from a small number of fixed cameras parallel to the ground, three types of testing environments with increasing levels of difficulty have been identified and studied in this thesis: 1) a static camera with a similar viewing angle observing a single individual, 2) a fixed surveillance camera with a considerably different viewing angle and multiple targets, and 3) a moving-camera sequence or a single static image of an unknown scene. Each environment raises different problems, which we consider separately; the thesis is therefore structured in three main parts corresponding to these three testing conditions.

In the first part, we use a standard static background subtraction algorithm to perform foreground detection, and propose a model-based approach that associates the body pose with the 2D silhouette to jointly segment the subject observed in the scene and recover its pose. To cope with viewpoint changes and out-of-plane rotation, local spatio-temporal models corresponding to several views and stages of the same action are trained, concatenated and sorted in a global framework. Temporal and spatial constraints are then used to select the most probable models at each time step.
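As a minimal illustration of the foreground-detection step mentioned above (a hypothetical sketch, not the implementation used in the thesis), static background subtraction can be reduced to per-pixel differencing of each frame against a background model:

```python
import numpy as np

def foreground_mask(frame, background, threshold=30):
    """Flag pixels whose absolute difference from a static background
    model exceeds a threshold. The threshold value is illustrative;
    practical systems adapt it per pixel and update the model over time."""
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    return diff > threshold

# Toy example: a bright 2x2 "subject" on a dark, static background.
background = np.zeros((4, 4), dtype=np.uint8)
frame = background.copy()
frame[1:3, 1:3] = 200
mask = foreground_mask(frame, background)  # True exactly on the subject
```

The resulting binary mask is the 2D silhouette that the model-based approach then couples with the body pose.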
The experiments carried out on indoor and outdoor sequences demonstrate the ability of this approach to adequately segment walking pedestrians and estimate their poses independently of the direction of motion.

In the second part, we present a methodology for view-invariant monocular 3D body pose tracking in man-made environments. First, we model 3D body poses and camera viewpoint with a low-dimensional manifold and learn a generative model of the silhouette mapping this manifold to the training views. During the online stage, 3D body poses are tracked using recursive Bayesian sampling conducted jointly over the scene's ground plane and the pose-viewpoint manifold. For each sample, the homography relating the training plane to the image points is computed from the dominant 3D directions of the scene and used to project the regressed silhouette into the image in order to estimate its likelihood. In our experimental evaluation, we demonstrate the significant improvement of this homographic matching over a commonly used similarity transformation and provide quantitative 3D pose tracking results for monocular sequences with strong perspective effects.

In the third part, we address human detection and pose estimation by formulating them as a classification problem. Our main contribution is a multi-class pose detector that combines the best components of state-of-the-art classifiers, including hierarchical trees, cascades of rejectors and randomized forests. First, we define a set of classes by discretizing camera viewpoint and pose space. A bottom-up approach is then followed to build a hierarchical tree by recursively clustering and merging the classes at each level. For each branch of this decision tree, we take advantage of the alignment of training images to build a list of potentially discriminative HOG (Histograms of Oriented Gradients) features. We then select the HOG blocks that show the best rejection performance.
We finally grow an ensemble of cascades by randomly sampling one of these HOG-based rejectors at each branch of the tree. The resulting multi-class classifier is then used to scan images in a sliding-window scheme. Compared with other pose classifiers, our approach gives fast and efficient detection performance with fixed cameras, moving cameras and static images alike. We present results on several publicly available training and testing datasets.
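The sliding-window scan with early rejection described above can be sketched as follows; this is a hedged illustration in which simple made-up stage functions stand in for the learned HOG-based rejectors, not the thesis code:

```python
import numpy as np

def cascade_classify(window, rejectors):
    """Early-exit cascade: a window is rejected as soon as any stage fails,
    so most background windows are discarded cheaply by the first stages."""
    for stage in rejectors:
        if not stage(window):
            return False
    return True

def sliding_window_detect(image, win, step, rejectors):
    """Scan the image with a square window of size `win` at stride `step`
    and return the top-left corners of windows accepted by the cascade."""
    h, w = image.shape
    hits = []
    for y in range(0, h - win + 1, step):
        for x in range(0, w - win + 1, step):
            if cascade_classify(image[y:y + win, x:x + win], rejectors):
                hits.append((y, x))
    return hits

# Toy example: a bright 4x4 patch plays the role of the target; the two
# stages (mean and max tests) are placeholders for learned HOG rejectors.
image = np.zeros((8, 8))
image[2:6, 2:6] = 255.0
rejectors = [lambda w: w.mean() > 100, lambda w: w.max() == 255]
hits = sliding_window_detect(image, win=4, step=2, rejectors=rejectors)
```

In the detector itself, each stage evaluates one HOG block sampled from the branch's rejector list, and the ensemble of such cascades votes over the discretized viewpoint-pose classes.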