Abstract
Web structure mining extracts knowledge from the hyperlink
structure of websites in order to improve web design, giving
clearer content presentation and easier navigation. This
paper presents a graph-based methodology for web structure
mining. The structure of a website is first mapped onto a
graph, with nodes representing web pages and links
representing hyperlinks between pages and to other websites. Then
the characteristics of the web graph, such as the degree of each
node, density, connectivity, closeness centralisation, and
node clusters, can be analysed quantitatively. The methodology
is tested on web structural data collected from 110 UK
university websites. After cleansing and pre-processing the
data, the graphs were constructed and analysed to obtain the
aforementioned properties for each website, together with other
useful information, such as page size and the length of the optimal
path, both of which affect navigability. Based on the
evaluation of these properties, guidelines and criteria are
devised for classifying the structural quality of the websites into
five categories, from very poor to very good. The average
degree and the percentage of pages in the strongly connected
component (SCC), together with the average distance, were found to
be the most important properties in determining the structural
quality of a website.
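As a rough illustration of the kind of graph properties named in the abstract, the sketch below computes average degree, density, the share of pages in the largest strongly connected component, and the average distance for a small directed graph. It is not the authors' implementation: the toy site, the page names, the `summarise` helper, and the exact definitions (out-degree for "degree", distance averaged over reachable ordered pairs only) are all assumptions made for the example.

```python
from collections import deque

# Hypothetical toy site: each key is a page, each value lists the pages it links to.
site = {
    "home": ["about", "courses"],
    "about": ["home"],
    "courses": ["home", "apply"],
    "apply": ["courses"],
    "old-news": ["home"],  # links into the site, but no page links back to it
}


def reachable(graph, start):
    """Return {page: number of clicks from start} using breadth-first search."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for nxt in graph.get(page, []):
            if nxt not in dist:
                dist[nxt] = dist[page] + 1
                queue.append(nxt)
    return dist


def summarise(graph):
    """Compute a few structural properties (assumed definitions, see lead-in)."""
    n = len(graph)
    m = sum(len(targets) for targets in graph.values())
    dists = {p: reachable(graph, p) for p in graph}
    # Two pages share an SCC when each can reach the other.
    largest_scc = max(
        sum(1 for q in graph if q in dists[p] and p in dists[q]) for p in graph
    )
    # Shortest path lengths over ordered pairs where the target is reachable.
    path_lengths = [d for p in graph for q, d in dists[p].items() if q != p]
    return {
        "avg_out_degree": m / n,
        "density": m / (n * (n - 1)),
        "scc_share": largest_scc / n,
        "avg_distance": sum(path_lengths) / len(path_lengths),
    }


stats = summarise(site)
```

In this toy graph, four of the five pages can all reach one another, so they form one strongly connected component, while `old-news` sits outside it. The all-pairs breadth-first search is chosen for readability; for sites with many thousands of pages a dedicated graph library would likely be a better fit.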
Original language | English |
---|---|
Publication status | Published - Jun 2014 |
Event | International Conference on Web Intelligence, Mining and Semantics - Thessaloniki, Greece. Duration: 2 Jun 2014 → 5 Jun 2014 |
Conference
Conference | International Conference on Web Intelligence, Mining and Semantics |
---|---|
Country/Territory | Greece |
City | Thessaloniki |
Period | 2/06/14 → 5/06/14 |
Keywords
- Web Mining
- Graph theory
- web structure
Profiles
- Wenjia Wang
  - School of Computing Sciences - Professor of Artificial Intelligence
  - Data Science and AI - Member
Person: Research Group Member, Academic, Teaching & Research