Text this: Echo Depth Estimation via Attention-based Hierarchical Multi-scale Feature Fusion Network.