Aspect-based multimodal sentiment analysis via employing visual-to-emotional-caption translation network using visual-caption pairs.