Multimodal Computational Attention for Scene Understanding and Robotics