Enhanced fine-grained visual classification through lightweight Transformer integration and auxiliary information fusion.