Embedding images and sentences in a common space with a recurrent capsule network