Towards a systematic multi-modal representation learning for network data