End-to-end temporal attention extraction and human action recognition.