Automatic dynamic sign language recognition is more challenging than general gesture recognition because the vocabularies are large and signs are context-dependent. Previous work in this direction tends to build classifiers on complex hand-crafted features computed from the raw inputs. As a type of deep learning model, convolutional neural networks (CNNs) have significantly advanced the accuracy of human gesture classification. However, such methods typically treat video frames as 2D images and recognize gestures at the level of individual frames. In this paper, we present a data-driven system in which 3D-CNNs extract spatial and temporal features from video streams, and motion information is captured from the variation in depth between consecutive frames. To further boost performance, multiple modalities of video streams, including infrared, contour, and skeleton data, are used as inputs to the architecture, and the predictions estimated by the different sub-networks are fused. To validate our method, we introduce a new, challenging multi-modal dynamic sign language dataset captured with Kinect sensors. We evaluate the proposed approach on the collected dataset and achieve superior performance. Moreover, our method achieves a mean Jaccard Index of 0.836 on the ChaLearn Looking at People gesture datasets.
- 3D convolutional neural networks
- Deep learning
- Model combination
- Sign language recognition
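The model combination step described in the abstract, fusing the class predictions of modality-specific sub-networks, can be sketched as a simple late-fusion average of softmax outputs. This is a minimal illustration under assumed shapes and uniform weights; the paper's exact fusion weights and sub-network outputs are not specified here, and the modality names in the comment are taken from the abstract only as labels.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the class axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fuse_predictions(per_modality_logits, weights=None):
    """Late fusion: convert each sub-network's logits to class
    probabilities, then take a (weighted) average across modalities."""
    probs = [softmax(l) for l in per_modality_logits]
    if weights is None:
        weights = [1.0 / len(probs)] * len(probs)
    fused = sum(w * p for w, p in zip(weights, probs))
    return fused.argmax(axis=-1), fused

# Hypothetical logits from three sub-networks (e.g. infrared, contour,
# skeleton streams) for a batch of 2 clips over 4 sign classes.
rng = np.random.default_rng(0)
logits = [rng.normal(size=(2, 4)) for _ in range(3)]
pred_labels, fused_probs = fuse_predictions(logits)
```

With uniform weights this reduces to averaging the per-modality probability estimates, which is a common and robust baseline when the sub-networks are trained independently.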