Learning mDTD Extraction Patterns for Semi-Structured Web Information Extraction.
| Title: | Learning mDTD Extraction Patterns for Semi-Structured Web Information Extraction. |
|---|---|
| Authors: | Kim, Dongseok; Cha, Jeongwon; Lee, Gary Geunbae |
| Source: | International Journal of Computer Processing of Oriental Languages. Mar. 2002, Vol. 15, Issue 1, p. 63. 16 pp. |
| Subjects: | Information retrieval, Structured techniques of electronic data processing, Document imaging systems |
| Abstract: | This paper presents a new extraction pattern, called a modified Document Type Definition (mDTD), which relies on an analytical interpretation method to identify textual fragments of documents from the Web. We make two major modifications which differ from the conventional DTD. Regarding syntax, we introduced an extended content model with more operators and keywords. For the semantics, we changed the way to interpret the mDTD rules. The mDTD can represent HTML structures and extraction targets. The design goal of mDTD is to overcome major barriers with information extraction, such as domain portability with minimum human intervention while maintaining a high extraction performance. The user composes mDTD as seed rules from which our system extracts instances from structured documents on the Web. These extracted instances are used as inputs to SmL (Sequential mDTD Learner). SmL generates new mDTD rules that are based on the part-of-speech (POS) tag and the lexical similarity features. For learning, a hand-tagged corpus is not required. We experimented with 200 Web documents on audio and video shopping sites. The results show 91.3% average extraction precision. [ABSTRACT FROM AUTHOR] |
| Copyright of International Journal of Computer Processing of Oriental Languages is the property of World Scientific Publishing Company and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) | |
| Database: | Engineering Source |