Effectively Obtaining Acoustic, Visual, and Textual Data from Videos