Learning mDTD Extraction Patterns for Semi-Structured Web Information Extraction.

Bibliographic Details
Title: Learning mDTD Extraction Patterns for Semi-Structured Web Information Extraction.
Authors: Kim, Dongseok; Cha, Jeongwon; Lee, Gary Geunbae
Source: International Journal of Computer Processing of Oriental Languages. Mar2002, Vol. 15 Issue 1, p63. 16p.
Subjects: Information retrieval, Structured techniques of electronic data processing, Document imaging systems
Abstract: This paper presents a new extraction pattern, called a modified Document Type Definition (mDTD), which relies on an analytical interpretation method to identify textual fragments of documents from the Web. We make two major modifications which differ from the conventional DTD. Regarding syntax, we introduced an extended content model with more operators and keywords. For the semantics, we changed the way to interpret the mDTD rules. The mDTD can represent HTML structures and extraction targets. The design goal of mDTD is to overcome major barriers with information extraction, such as domain portability with minimum human intervention while maintaining a high extraction performance. The user composes mDTD as seed rules from which our system extracts instances from structured documents on the Web. These extracted instances are used as inputs to SmL (Sequential mDTD Learner). SmL generates new mDTD rules that are based on the part-of-speech (POS) tag and the lexical similarity features. For learning, a hand-tagged corpus is not required. We experimented with 200 Web documents on audio and video shopping sites. The results show 91.3% average extraction precision. [ABSTRACT FROM AUTHOR]
Copyright of International Journal of Computer Processing of Oriental Languages is the property of World Scientific Publishing Company and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
Database: Engineering Source
Publication Type: Academic Journal
Accession Number: 9278366
PLink https://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=egs&AN=9278366
ISSN: 0219-4279 (print)
Language: English