Structured data extraction from template-generated web pages

Tomas Grigalis

Doctoral dissertation

Description

Most structured data on the Web is found in database-backed web sites. Typically, when a web page is requested from such a site, structured data is retrieved from an underlying database and embedded into the page using a fixed template. This dissertation studies the reverse engineering task: extracting structured data from template-generated web pages. There are thousands of web pages on the Web that differ in visual style and underlying structure. Automatically extracting structured data from many structurally heterogeneous template-generated web pages is a difficult and time-consuming task, and it is regarded as a grand challenge. It is assumed that solving this challenge would improve today's Web search and help companies reduce costs. Thus the main goal of the dissertation is to propose a novel and more effective method for extracting structured data from template-generated web pages. The object of the research is structured data extraction from template-generated web pages.

The dissertation consists of an introduction, four main chapters and general conclusions. The first chapter introduces the problem of structured web data extraction, reviews state-of-the-art data extraction techniques, and discusses real-life applications of structured web data extraction systems.

The second chapter presents a novel method for extracting structured data records from template-generated web pages. The method is based on clustering visually and structurally similar web page elements. It first renders a given web page in a contemporary web browser, and then clusters visually and structurally similar repeating web page elements to identify the underlying pattern of embedded structured data records. Finally, a data extraction wrapper is generated. The wrapper consists of XPath expressions that can be easily reused in many third-party data extraction applications.
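The idea of grouping repeating elements and emitting XPath expressions can be illustrated with a minimal sketch. It is not the dissertation's method: it assumes structure alone (the root-to-element tag path) as the similarity criterion and skips browser rendering and visual features entirely. The names `PathCollector`, `generate_wrapper` and `min_repeats` are hypothetical and chosen only for this illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag; they are not pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class PathCollector(HTMLParser):
    """Groups text values by the tag path of the element that contains them."""

    def __init__(self):
        super().__init__()
        self.stack = []                    # currently open tags
        self.clusters = defaultdict(list)  # tag path -> text values

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching tag (tolerates unclosed tags).
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.clusters["/" + "/".join(self.stack)].append(text)


def generate_wrapper(html, min_repeats=2):
    """Return {xpath: values} for tag paths that repeat at least min_repeats times.

    Assumption: a path that repeats is treated as a data field of a record.
    """
    collector = PathCollector()
    collector.feed(html)
    return {path: values
            for path, values in collector.clusters.items()
            if len(values) >= min_repeats}


page = ("<html><body><ul>"
        "<li><b>Laptop</b><i>900</i></li>"
        "<li><b>Phone</b><i>500</i></li>"
        "</ul><p>Contact us</p></body></html>")
wrapper = generate_wrapper(page)
```

Here the keys of `wrapper` are plain XPath location paths such as `/html/body/ul/li/b`, so they could be handed to any XPath-capable tool. The one-off footer paragraph is dropped because its path does not repeat.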

The third chapter proposes a novel method for structurally clustering template-generated web pages. The method is based on three observations: there is a limited number of different style templates in any one template-generated web site; there is a limited number of inner-site link locations across all templates of the same site; and each individual link location in a web page usually points to structurally similar web pages. The method leverages the XPath locations of inbound inner-site links to significantly speed up web page clustering.
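The third observation suggests that pages can be grouped before they are even downloaded, using only where the links to them sit. The following is a minimal sketch under simplifying assumptions, not the dissertation's algorithm: a single source page is analysed, a link location is approximated by the tag path of the `a` element, and a link counts as inner-site when its host matches the host of the base URL. `LinkLocator`, `cluster_by_link_location` and the example URLs are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Elements that never have a closing tag; they are not pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class LinkLocator(HTMLParser):
    """Records (tag path of the <a> element, href) for every link."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(("/" + "/".join(self.stack), href))

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching tag (tolerates unclosed tags).
            while self.stack and self.stack.pop() != tag:
                pass


def cluster_by_link_location(html, base_url):
    """Group inner-site target URLs by the location of the link pointing to them."""
    locator = LinkLocator()
    locator.feed(html)
    site = urlparse(base_url).netloc
    clusters = defaultdict(list)
    for location, href in locator.links:
        target = urljoin(base_url, href)
        if urlparse(target).netloc == site:  # keep inner-site links only
            clusters[location].append(target)
    return dict(clusters)


page = ('<html><body><div><ul>'
        '<li><a href="/item/1">Laptop</a></li>'
        '<li><a href="/item/2">Phone</a></li>'
        '</ul></div>'
        '<p><a href="/about">About</a></p>'
        '<p><a href="http://other.example/x">Partner</a></p>'
        '</body></html>')
clusters = cluster_by_link_location(page, "http://shop.example/")
```

In this example the two item links share one location and land in the same cluster, the "About" link forms its own cluster, and the external link is ignored. Each cluster is a candidate group of structurally similar pages, formed without fetching or comparing the target pages themselves.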

In the fourth and final chapter, more than one million web pages are used to experimentally evaluate the two proposed methods. The results reveal that both proposed methods consistently outperform other state-of-the-art techniques.

Read electronic version of the book:

DOI: https://doi.org/10.20334/2262-M

Book details

Data sheet

Year: 2014
ISBN: 978-609-457-699-7
Imprint No: 2262-M
Dimensions: 145×205 mm
Pages: 138 p.
Cover: Softcover
Language: English