ISODAC: A high performance solution for indexing and searching heterogeneous data.

Bibliographic Details
Title: ISODAC: A high performance solution for indexing and searching heterogeneous data.
Authors: Totaro, G. (1), totaro@di.uniroma1.it; Bernaschi, M. (2), massimo.bernaschi@cnr.it; Carbone, G. (1), giancarlo.carbone@uniroma1.it; Cianfriglia, M. (2), m.cianfriglia@iac.cnr.it; Di Marco, A. (2), a.dimarco@iac.cnr.it
Source: Journal of Systems & Software. Aug 2016, Vol. 118, pp. 115-133. 19 pp.
Subjects: High performance computing, Indexing, Database searching, Computer crimes, Criminal investigation, Computer storage devices, Data extraction
Abstract: Searching for words or sentences within large sets of textual documents can be very challenging unless an index of the data has been created in advance. However, indexing can be very time-consuming, especially if the text is not readily available and has to be extracted from files stored in different formats. Several solutions based on the MapReduce paradigm have been proposed to accelerate index creation. These solutions perform well when data are already distributed across the hosts involved in the processing; otherwise, the cost of distributing data can introduce noticeable overhead. We propose ISODAC, a new approach aimed at improving efficiency without sacrificing reliability. Our solution minimizes the number of I/O operations by using a stream of in-memory operations to extract and index text. We further improve performance by using GPUs for the most computationally intensive tasks of the indexing procedure. ISODAC indexes heterogeneous documents up to 10.6x faster than other widely adopted solutions, such as Apache Spark. As a proof of concept, we developed a tool to index forensic disk images that investigators can easily use through a web interface. [ABSTRACT FROM AUTHOR]
Copyright of Journal of Systems & Software is the property of Elsevier B.V. and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
Database: Engineering Source
FullText Text:
  Availability: 0
Header DbId: egs
DbLabel: Engineering Source
An: 115943755
AccessLevel: 6
PubType: Academic Journal
PubTypeId: academicJournal
PLink https://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=egs&AN=115943755
RecordInfo BibRecord:
  BibEntity:
    Identifiers:
      – Type: doi
        Value: 10.1016/j.jss.2015.11.043
    Languages:
      – Code: eng
        Text: English
    PhysicalDescription:
      Pagination:
        PageCount: 19
        StartPage: 115
    Subjects:
      – SubjectFull: High performance computing
        Type: general
      – SubjectFull: Indexing
        Type: general
      – SubjectFull: Database searching
        Type: general
      – SubjectFull: Computer crimes
        Type: general
      – SubjectFull: Criminal investigation
        Type: general
      – SubjectFull: Computer storage devices
        Type: general
      – SubjectFull: Data extraction
        Type: general
    Titles:
      – TitleFull: ISODAC: A high performance solution for indexing and searching heterogeneous data.
        Type: main
  BibRelationships:
    HasContributorRelationships:
      – PersonEntity:
          Name:
            NameFull: Totaro, G.
      – PersonEntity:
          Name:
            NameFull: Bernaschi, M.
      – PersonEntity:
          Name:
            NameFull: Carbone, G.
      – PersonEntity:
          Name:
            NameFull: Cianfriglia, M.
      – PersonEntity:
          Name:
            NameFull: Di Marco, A.
    IsPartOfRelationships:
      – BibEntity:
          Dates:
            – D: 01
              M: 08
              Text: Aug2016
              Type: published
              Y: 2016
          Identifiers:
            – Type: issn-print
              Value: 01641212
          Numbering:
            – Type: volume
              Value: 118
          Titles:
            – TitleFull: Journal of Systems & Software
              Type: main