Resource-Aware Compiler Prefetching for Fine-Grained Many-Cores.

Saved in:
Bibliographic Details
Title: Resource-Aware Compiler Prefetching for Fine-Grained Many-Cores.
Authors: Caragea, George1 george@cs.umd.edu, Tzannes, Alexandros1, Keceli, Fuat2, Barua, Rajeev, Vishkin, Uzi
Source: International Journal of Parallel Programming. Oct2011, Vol. 39 Issue 5, p615-638. 24p. 4 Diagrams, 1 Chart, 5 Graphs.
Subjects: Parallelizing compilers, Microprocessor programming, Algorithms, Digital communications, Computer storage devices
Abstract: Super-scalar, out-of-order processors that can have tens of read and write requests in the execution window place significant demands on Memory Level Parallelism (MLP). Multi- and many-cores with shared parallel caches further increase MLP demand. Current cache hierarchies however have been unable to keep up with this trend, with modern designs allowing only 4-16 concurrent cache misses. This disconnect is exacerbated by recent highly parallel architectures (e.g. GPUs) where power and area per-core budget favor numerous lighter cores with less resources, further reducing support for MLP on a per-core basis. Support for hardware and software prefetch increases MLP pressure since these techniques overlap multiple memory requests with existing computation. In this paper, we propose and evaluate a novel Resource-Aware Prefetching (RAP) compiler algorithm that is aware of the number of simultaneous prefetches supported, and optimized for the same. We implemented our algorithm in a GCC-derived compiler and evaluated its performance using an emerging fine-grained many-core architecture. Our results show that the RAP algorithm outperforms a well-known loop prefetching algorithm by up to 40.15% in run-time on average across benchmarks and the state-of-the art GCC implementation by up to 34.79%, depending upon hardware configuration. Moreover, we compare the RAP algorithm with a simple hardware prefetching mechanism, and show run-time improvements of up to 24.61%. To demonstrate the robustness of our approach, we conduct a design-space exploration (DSE) for the considered target architecture by varying (i) the amount of chip resources designated for per-core prefetch storage and (ii) off-chip bandwidth. We show that the RAP algorithm is robust in that it improves performance across all design points considered. We also identify the Pareto-optimal hardware-software configuration which delivers 53.66% run-time improvement on average while using only 5.47% more chip area than the bare-bones design. [ABSTRACT FROM AUTHOR]
Copyright of International Journal of Parallel Programming is the property of Springer Nature and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
Database: Engineering Source
FullText Links:
  – Type: pdflink
Text:
  Availability: 0
Header DbId: egs
DbLabel: Engineering Source
An: 60840949
AccessLevel: 6
PubType: Academic Journal
PubTypeId: academicJournal
PreciseRelevancyScore: 0
IllustrationInfo
Items – Name: Title
  Label: Title
  Group: Ti
  Data: Resource-Aware Compiler Prefetching for Fine-Grained Many-Cores.
– Name: Author
  Label: Authors
  Group: Au
  Data: <searchLink fieldCode="AR" term="%22Caragea%2C+George%22">Caragea, George</searchLink><relatesTo>1</relatesTo><i> george@cs.umd.edu</i><br /><searchLink fieldCode="AR" term="%22Tzannes%2C+Alexandros%22">Tzannes, Alexandros</searchLink><relatesTo>1</relatesTo><br /><searchLink fieldCode="AR" term="%22Keceli%2C+Fuat%22">Keceli, Fuat</searchLink><relatesTo>2</relatesTo><br /><searchLink fieldCode="AR" term="%22Barua%2C+Rajeev%22">Barua, Rajeev</searchLink><br /><searchLink fieldCode="AR" term="%22Vishkin%2C+Uzi%22">Vishkin, Uzi</searchLink>
– Name: TitleSource
  Label: Source
  Group: Src
  Data: <searchLink fieldCode="JN" term="%22International+Journal+of+Parallel+Programming%22">International Journal of Parallel Programming</searchLink>. Oct2011, Vol. 39 Issue 5, p615-638. 24p. 4 Diagrams, 1 Chart, 5 Graphs.
– Name: Subject
  Label: Subjects
  Group: Su
  Data: <searchLink fieldCode="DE" term="%22Parallelizing+compilers%22">Parallelizing compilers</searchLink><br /><searchLink fieldCode="DE" term="%22Microprocessor+programming%22">Microprocessor programming</searchLink><br /><searchLink fieldCode="DE" term="%22Algorithms%22">Algorithms</searchLink><br /><searchLink fieldCode="DE" term="%22Digital+communications%22">Digital communications</searchLink><br /><searchLink fieldCode="DE" term="%22Computer+storage+devices%22">Computer storage devices</searchLink>
– Name: Abstract
  Label: Abstract
  Group: Ab
  Data: Super-scalar, out-of-order processors that can have tens of read and write requests in the execution window place significant demands on Memory Level Parallelism (MLP). Multi- and many-cores with shared parallel caches further increase MLP demand. Current cache hierarchies however have been unable to keep up with this trend, with modern designs allowing only 4-16 concurrent cache misses. This disconnect is exacerbated by recent highly parallel architectures (e.g. GPUs) where power and area per-core budget favor numerous lighter cores with less resources, further reducing support for MLP on a per-core basis. Support for hardware and software prefetch increases MLP pressure since these techniques overlap multiple memory requests with existing computation. In this paper, we propose and evaluate a novel Resource-Aware Prefetching (RAP) compiler algorithm that is aware of the number of simultaneous prefetches supported, and optimized for the same. We implemented our algorithm in a GCC-derived compiler and evaluated its performance using an emerging fine-grained many-core architecture. Our results show that the RAP algorithm outperforms a well-known loop prefetching algorithm by up to 40.15% in run-time on average across benchmarks and the state-of-the art GCC implementation by up to 34.79%, depending upon hardware configuration. Moreover, we compare the RAP algorithm with a simple hardware prefetching mechanism, and show run-time improvements of up to 24.61%. To demonstrate the robustness of our approach, we conduct a design-space exploration (DSE) for the considered target architecture by varying (i) the amount of chip resources designated for per-core prefetch storage and (ii) off-chip bandwidth. We show that the RAP algorithm is robust in that it improves performance across all design points considered. We also identify the Pareto-optimal hardware-software configuration which delivers 53.66% run-time improvement on average while using only 5.47% more chip area than the bare-bones design. [ABSTRACT FROM AUTHOR]
– Name: AbstractSuppliedCopyright
  Label:
  Group: Ab
  Data: <i>Copyright of International Journal of Parallel Programming is the property of Springer Nature and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract.</i> (Copyright applies to all Abstracts.)
PLink https://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=egs&AN=60840949
RecordInfo BibRecord:
  BibEntity:
    Identifiers:
      – Type: doi
        Value: 10.1007/s10766-011-0163-8
    Languages:
      – Code: eng
        Text: English
    PhysicalDescription:
      Pagination:
        PageCount: 24
        StartPage: 615
    Subjects:
      – SubjectFull: Parallelizing compilers
        Type: general
      – SubjectFull: Microprocessor programming
        Type: general
      – SubjectFull: Algorithms
        Type: general
      – SubjectFull: Digital communications
        Type: general
      – SubjectFull: Computer storage devices
        Type: general
    Titles:
      – TitleFull: Resource-Aware Compiler Prefetching for Fine-Grained Many-Cores.
        Type: main
  BibRelationships:
    HasContributorRelationships:
      – PersonEntity:
          Name:
            NameFull: Caragea, George
      – PersonEntity:
          Name:
            NameFull: Tzannes, Alexandros
      – PersonEntity:
          Name:
            NameFull: Keceli, Fuat
      – PersonEntity:
          Name:
            NameFull: Barua, Rajeev
      – PersonEntity:
          Name:
            NameFull: Vishkin, Uzi
    IsPartOfRelationships:
      – BibEntity:
          Dates:
            – D: 01
              M: 10
              Text: Oct2011
              Type: published
              Y: 2011
          Identifiers:
            – Type: issn-print
              Value: 08857458
          Numbering:
            – Type: volume
              Value: 39
            – Type: issue
              Value: 5
          Titles:
            – TitleFull: International Journal of Parallel Programming
              Type: main
ResultId 1