Learning discriminative representations from RGB-D video data

Li Liu, Ling Shao

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

258 Citations (Scopus)


Recently, the low-cost Microsoft Kinect sensor, which captures real-time high-resolution RGB and depth visual information, has attracted increasing attention for a wide range of applications in computer vision. Existing techniques extract hand-tuned features from the RGB and depth data separately and fuse them heuristically, which does not fully exploit the complementarity of the two data sources. In this paper, we introduce an adaptive learning methodology that automatically extracts (holistic) spatio-temporal features from RGB-D video data for visual recognition tasks, fusing the RGB and depth information simultaneously. We address this as an optimization problem using our proposed restricted graph-based genetic programming (RGGP) approach, in which a group of primitive 3D operators is first randomly assembled into graph-based combinations and then evolved generation by generation by evaluating them on a set of RGB-D video samples. Finally, the best-performing combination is selected as the (near-)optimal representation for a pre-defined task.

The proposed method is systematically evaluated on a new hand gesture dataset, SKIG, that we collected ourselves, and on the public MSR Daily Activity 3D dataset. Extensive experimental results show that our approach offers significant advantages over state-of-the-art hand-crafted and machine-learned features.
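
The evolutionary search described in the abstract can be illustrated with a minimal sketch. It is not the authors' RGGP implementation: it evolves linear chains of operators rather than restricted graphs, and the three primitive operators, the nearest-centroid fitness, the toy data generator, and every function name are assumptions made for illustration only.

```python
import random

# Hypothetical primitive operators (the abstract does not list the real ones).
# A video is a list of frames; each frame is a flat list of floats.
def temporal_diff(video):
    """Absolute difference between each frame and the previous one."""
    prev = [video[0]] + video[:-1]
    return [[abs(a - b) for a, b in zip(f, p)] for f, p in zip(video, prev)]

def temporal_smooth(video):
    """Average of each frame with the previous one."""
    prev = [video[0]] + video[:-1]
    return [[(a + b) / 2 for a, b in zip(f, p)] for f, p in zip(video, prev)]

def square(video):
    """Element-wise square."""
    return [[a * a for a in f] for f in video]

PRIMITIVES = {"diff": temporal_diff, "smooth": temporal_smooth, "square": square}

def extract(program, rgb, depth):
    """Fuse RGB and depth per frame, apply the operator chain, pool each frame."""
    video = [r + d for r, d in zip(rgb, depth)]  # assumed fusion: concatenation
    for name in program:
        video = PRIMITIVES[name](video)
    return [sum(f) / len(f) for f in video]

def fitness(program, samples):
    """Nearest-centroid accuracy on the samples (no held-out split in this toy)."""
    feats = [(extract(program, rgb, depth), label) for rgb, depth, label in samples]
    centroids = {}
    for label in sorted({l for _, l in feats}):
        rows = [f for f, l in feats if l == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def predict(f):
        return min(centroids,
                   key=lambda l: sum((a - b) ** 2 for a, b in zip(f, centroids[l])))

    return sum(predict(f) == l for f, l in feats) / len(feats)

def evolve(samples, pop_size=8, generations=5, max_len=3, seed=0):
    """Random initial population, then keep the top half and mutate to refill."""
    rng = random.Random(seed)
    names = sorted(PRIMITIVES)

    def random_program():
        return tuple(rng.choice(names) for _ in range(rng.randint(1, max_len)))

    def mutate(program):
        p = list(program)
        p[rng.randrange(len(p))] = rng.choice(names)
        return tuple(p)

    population = [random_program() for _ in range(pop_size)]
    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        ranked = sorted(population, key=lambda p: fitness(p, samples), reverse=True)
        history.append(fitness(ranked[0], samples))
        survivors = ranked[: pop_size // 2]
        children = [mutate(rng.choice(survivors))
                    for _ in range(pop_size - len(survivors))]
        population = survivors + children
    best = max(population, key=lambda p: fitness(p, samples))
    return best, fitness(best, samples), history

def make_toy_samples(n_per_class=6, frames=4, pixels=3, seed=1):
    """Synthetic stand-in for RGB-D clips: class 1 brightens over time."""
    rng = random.Random(seed)
    samples = []
    for label in (0, 1):
        for _ in range(n_per_class):
            rgb = [[label * t + rng.random() * 0.1 for _ in range(pixels)]
                   for t in range(frames)]
            depth = [[1.0 + rng.random() * 0.1 for _ in range(pixels)]
                     for _ in range(frames)]
            samples.append((rgb, depth, label))
    return samples
```

Calling `evolve(make_toy_samples())` returns the selected operator chain, its fitness, and the per-generation best fitness. Because the top half of each population is carried over unchanged, the best fitness never decreases between generations, which mirrors the "evolved generation by generation" selection the abstract describes.
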
Original language: English
Title of host publication: IJCAI '13 Proceedings of the Twenty-Third international joint conference on Artificial Intelligence
Number of pages: 8
Publication status: Published - 3 Aug 2013
