Audio-visual intent-to-speak detection for human-computer interaction