Multimodal Dynamic Networks for Gesture Recognition

Di Wu, Ling Shao

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

13 Citations (Scopus)


Multimodal input arises naturally in real-world gesture recognition applications such as sign language recognition. In this paper, we propose a novel bi-modal (audio and skeleton joints) dynamic network for gesture recognition. First, state-of-the-art dynamic Deep Belief Networks are deployed to extract high-level audio and skeletal joint representations. Then, instead of traditional late fusion, we add a further perceptron layer for cross-modality learning, which takes its input from the penultimate layer of each individual network. Finally, to account for temporal dynamics, the learned shared representations are used to estimate the emission probabilities from which action sequences are inferred. In particular, we demonstrate that multimodal feature learning extracts semantically meaningful shared representations that outperform the individual modalities, and we show the efficacy of the early fusion scheme compared with the traditional method of late fusion.
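
The pipeline described in the abstract (per-modality features, a shared perceptron layer on top of the penultimate layers, then sequence inference from emission scores) might be sketched roughly as follows. This is an illustrative NumPy sketch and not the authors' implementation. The function names, array shapes, the single softmax layer, and the direct use of softmax posteriors as emission scores are all assumptions made for the example. The per-modality feature extractors are taken as given, and plain Viterbi decoding stands in for whatever inference procedure the paper uses.

```python
import numpy as np

def fuse_and_score(audio_feat, skel_feat, W, b):
    """Hypothetical fusion layer: concatenate the per-frame penultimate-layer
    features of both modalities and apply one softmax layer.
    Shapes (assumed): audio_feat (T, Da), skel_feat (T, Ds), W (Da+Ds, S), b (S,)."""
    joint = np.concatenate([audio_feat, skel_feat], axis=1)  # (T, Da+Ds)
    logits = joint @ W + b                                   # (T, S)
    logits -= logits.max(axis=1, keepdims=True)              # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)          # per-frame state posteriors

def viterbi(log_emis, log_trans, log_prior):
    """Most likely state sequence given per-frame log emission scores (T, S),
    log transition matrix (S, S) indexed [from, to], and log prior (S,)."""
    T, S = log_emis.shape
    score = log_prior + log_emis[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans   # cand[i, j]: best path ending in i, then i -> j
        back[t] = cand.argmax(axis=0)       # best predecessor for each state j
        score = cand.max(axis=0) + log_emis[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):           # follow back-pointers from the last frame
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

In use, the log of the output of `fuse_and_score` would be passed to `viterbi` as `log_emis`, together with a transition matrix and prior estimated elsewhere. Hybrid neural network and hidden Markov model systems commonly divide the posteriors by the state priors before decoding; that step is omitted here for brevity.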
Original language: English
Title of host publication: Proceedings of the 22nd ACM international conference on Multimedia
Publisher: Association for Computing Machinery (ACM)
Number of pages: 4
ISBN (Print): 978-1-4503-3063-3
Publication status: Published - 3 Nov 2014
