How to train your antivirus: RL-based hardening through the problem space