Benchmarking large language model-based agent systems for clinical decision tasks.