L-STAP: Learned Spatio-Temporal Adaptive Pooling for Video Captioning