Fusion of multimodal embeddings for ad-hoc video search