Resource-Aware Compiler Prefetching for Fine-Grained Many-Cores.

Bibliographic Details
Title:	Resource-Aware Compiler Prefetching for Fine-Grained Many-Cores.
Authors:	Caragea, George¹ (george@cs.umd.edu); Tzannes, Alexandros¹; Keceli, Fuat²; Barua, Rajeev; Vishkin, Uzi
Source:	International Journal of Parallel Programming. Oct 2011, Vol. 39, Issue 5, pp. 615-638. 24 pp. 4 Diagrams, 1 Chart, 5 Graphs.
Subjects:	Parallelizing compilers, Microprocessor programming, Algorithms, Digital communications, Computer storage devices
Abstract:	Super-scalar, out-of-order processors that can have tens of read and write requests in the execution window place significant demands on Memory Level Parallelism (MLP). Multi- and many-cores with shared parallel caches further increase MLP demand. Current cache hierarchies, however, have been unable to keep up with this trend, with modern designs allowing only 4-16 concurrent cache misses. This disconnect is exacerbated by recent highly parallel architectures (e.g., GPUs) where the per-core power and area budget favors numerous lighter cores with fewer resources, further reducing support for MLP on a per-core basis. Support for hardware and software prefetch increases MLP pressure, since these techniques overlap multiple memory requests with existing computation. In this paper, we propose and evaluate a novel Resource-Aware Prefetching (RAP) compiler algorithm that is aware of the number of simultaneous prefetches supported and is optimized for that number. We implemented our algorithm in a GCC-derived compiler and evaluated its performance using an emerging fine-grained many-core architecture. Our results show that the RAP algorithm outperforms a well-known loop prefetching algorithm by up to 40.15% in run-time on average across benchmarks and the state-of-the-art GCC implementation by up to 34.79%, depending upon hardware configuration. Moreover, we compare the RAP algorithm with a simple hardware prefetching mechanism and show run-time improvements of up to 24.61%. To demonstrate the robustness of our approach, we conduct a design-space exploration (DSE) for the considered target architecture by varying (i) the amount of chip resources designated for per-core prefetch storage and (ii) off-chip bandwidth. We show that the RAP algorithm is robust in that it improves performance across all design points considered. We also identify the Pareto-optimal hardware-software configuration, which delivers 53.66% run-time improvement on average while using only 5.47% more chip area than the bare-bones design. [ABSTRACT FROM AUTHOR]
	Copyright of International Journal of Parallel Programming is the property of Springer Nature and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
Database:	Engineering Source

PLink	https://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=egs&AN=60840949
DOI:	10.1007/s10766-011-0163-8
ISSN:	0885-7458 (print)
Language:	English