Modeling multimodal cues in a deep learning-based framework for emotion recognition in the wild