VMT: Virtualized Multi-Threading for Accelerating Graph Workloads on Commodity Processors.

Bibliographic Details
Title: VMT: Virtualized Multi-Threading for Accelerating Graph Workloads on Commodity Processors.
Authors: Feliu, Josue (josue.f.p@um.es); Naithani, Ajeya (ajeya.naithani@ugent.be); Sahuquillo, Julio (jsahuqui@disca.upv.es); Petit, Salvador (spetit@disca.upv.es); Qureshi, Moinuddin (moin@gatech.edu); Eeckhout, Lieven (lieven.eeckhout@ugent.be)
Source: IEEE Transactions on Computers, Jun. 2022, Vol. 71, Issue 6, pp. 1386-1398 (13 pages).
Subjects: Simultaneous multithreading processors, Graph algorithms, Computer architecture
Abstract: Modern-day graph workloads operate on huge graphs through pointer chasing which leads to high last-level cache (LLC) miss rates and limited memory-level parallelism (MLP). Simultaneous Multi-Threading (SMT) effectively hides the memory access latencies for multi-threaded graph workloads provided that sufficient threads are supported in hardware. Unfortunately, providing a sufficiently large number of physical threads incurs an unjustifiably high hardware cost for commodity SMT processors which typically implement only two physical hardware threads. Ideally, we would like to achieve aggressive-SMT performance when running graph workloads on modest commodity processors. In this paper, we propose Virtualized Multi-Threading (VMT), a low-overhead multi-threading paradigm for accelerating graph workloads on commodity processors. Unlike prior multi-threading paradigms, VMT virtualizes both the physical hardware threads and the architecture state: VMT maps a large number of logical software threads to a small number of physical hardware threads, while maintaining the architecture state of the logical threads in the processor's cache hierarchy. Implemented on top of a quad-core 2-way SMT processor, VMT achieves an average speedup of 1.74× for a set of representative graph workloads, while incurring minimal hardware cost (195 bytes per core to support up to 32 logical threads). VMT's low hardware cost paves the way for implementation in commodity processors. [ABSTRACT FROM AUTHOR]
Database: Engineering Source
ISSN: 0018-9340
DOI: 10.1109/TC.2021.3086069