Cross-Modal Tri-Semantic Correlation-CLIP for Short Video Homogenization Recognition.