Temporal normalization of videos using visual speech