Learning mDTD Extraction Patterns for Semi-Structured Web Information Extraction.

Bibliographic Details
Title:	Learning mDTD Extraction Patterns for Semi-Structured Web Information Extraction.
Authors:	Kim, Dongseok; Cha, Jeongwon; Lee, Gary Geunbae
Source:	International Journal of Computer Processing of Oriental Languages. Mar 2002, Vol. 15, Issue 1, p. 63. 16 pp.
Subjects:	Information retrieval, Structured techniques of electronic data processing, Document imaging systems
Abstract:	This paper presents a new extraction pattern, called a modified Document Type Definition (mDTD), which relies on an analytical interpretation method to identify textual fragments of documents from the Web. We make two major modifications which differ from the conventional DTD. Regarding syntax, we introduced an extended content model with more operators and keywords. For the semantics, we changed the way to interpret the mDTD rules. The mDTD can represent HTML structures and extraction targets. The design goal of mDTD is to overcome major barriers with information extraction, such as domain portability with minimum human intervention while maintaining a high extraction performance. The user composes mDTD as seed rules from which our system extracts instances from structured documents on the Web. These extracted instances are used as inputs to SmL (Sequential mDTD Learner). SmL generates new mDTD rules that are based on the part-of-speech (POS) tag and the lexical similarity features. For learning, a hand-tagged corpus is not required. We experimented with 200 Web documents on audio and video shopping sites. The results show 91.3% average extraction precision. [ABSTRACT FROM AUTHOR]
	Copyright of International Journal of Computer Processing of Oriental Languages is the property of World Scientific Publishing Company and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
Database:	Engineering Source

Publication Type:	Academic Journal
Language:	English
ISSN:	0219-4279 (print)
Accession Number:	9278366
PLink:	https://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=egs&AN=9278366