Abstract
In some situations it is important to classify a web page efficiently and reliably using only the information contained in its Uniform Resource Locator (URL), without visiting the page itself. For example, a social media website may need to quickly identify status updates that link to malicious websites in order to block them. A URL is very concise and may be composed of concatenated words, so classification using only this information is a very challenging task. Methods proposed for this task, such as the all-grams approach, which extracts all possible sub-strings as features, provide reasonable accuracy but do not scale well to large datasets. We have recently proposed a new method for URL-based web page classification: an n-gram language model that provides competitive accuracy and scales to larger datasets. Our method allows for the classification of new URLs with unseen sub-sequences. In this paper we extend our presentation and include additional results to validate the proposed approach. We explain the parameters associated with the n-gram language model and test their impact on the models produced. Our results show that our method is competitive in terms of accuracy with the best known methods and also scales well to larger datasets.
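As an illustration of the general idea of classifying URLs with per-class n-gram language models, the following is a minimal sketch and not the authors' implementation. It assumes character trigrams, add-one (Laplace) smoothing so that unseen sub-sequences still receive a non-zero probability, and a toy training set; the names `char_ngrams` and `NgramUrlClassifier`, the boundary markers, and the example URLs are all hypothetical choices made for this sketch.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n):
    """Return the character n-grams of text, padded with boundary markers."""
    padded = "^" * (n - 1) + text.lower() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramUrlClassifier:
    """One character n-gram language model per class (illustrative only)."""

    def __init__(self, n=3):
        self.n = n
        self.ngram_counts = defaultdict(Counter)    # label -> n-gram counts
        self.context_counts = defaultdict(Counter)  # label -> (n-1)-gram context counts
        self.vocab = set()                          # characters seen in training

    def fit(self, urls, labels):
        for url, label in zip(urls, labels):
            for gram in char_ngrams(url, self.n):
                self.ngram_counts[label][gram] += 1
                self.context_counts[label][gram[:-1]] += 1
                self.vocab.add(gram[-1])
        return self

    def log_prob(self, url, label):
        # Add-one smoothing is an assumption of this sketch: every n-gram,
        # including ones never seen in training, gets a non-zero probability.
        vocab_size = len(self.vocab) + 1  # +1 reserves mass for unseen characters
        total = 0.0
        for gram in char_ngrams(url, self.n):
            numerator = self.ngram_counts[label][gram] + 1
            denominator = self.context_counts[label][gram[:-1]] + vocab_size
            total += math.log(numerator / denominator)
        return total

    def predict(self, url):
        # Pick the class whose language model assigns the URL the highest score.
        return max(self.ngram_counts, key=lambda label: self.log_prob(url, label))


# Toy data, invented for illustration only.
train_urls = [
    "www.cs.uni.edu/courses/algorithms",
    "www.uni.edu/course/databases",
    "www.uni.edu/faculty/smith",
    "www.cs.uni.edu/faculty/jones",
]
train_labels = ["course", "course", "faculty", "faculty"]
model = NgramUrlClassifier(n=3).fit(train_urls, train_labels)
print(model.predict("www.math.edu/courses/calculus"))
```

Because each class keeps only n-gram and context counts, scoring a new URL costs time proportional to its length times the number of classes, which is one way to see why a language-model formulation can scale more gracefully than enumerating all sub-strings as features. The order `n` and the smoothing scheme are the kind of parameters whose impact would need to be tested empirically.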
Original language | English |
---|---|
Title of host publication | Knowledge Discovery, Knowledge Engineering and Knowledge Management |
Subtitle of host publication | 6th International Joint Conference, IC3K 2014, Rome, Italy, October 21-24, 2014, Revised Selected Papers |
Editors | Ana Fred, Jan L.G. Dietz, David Aveiro, Liu Kecheng, Joaquim Filipe |
Publisher | Springer |
Pages | 19-33 |
Number of pages | 15 |
Volume | 553 |
ISBN (Electronic) | 978-3-319-25840-9 |
ISBN (Print) | 978-3-319-25839-3 |
DOIs | |
Publication status | Published - 2015 |
Publication series
Name | Communications in Computer and Information Science |
---|
Keywords
- Language models
- Information retrieval
- Web classification
- Web mining
- Machine learning
Profiles
- Beatriz De La Iglesia
  - School of Computing Sciences - Professor & Head of School
  - Norwich Institute for Healthy Aging - Member
  - Norwich Epidemiology Centre - Member
  - Data Science and AI - Member
Person: Research Group Member, Research Centre Member, Academic, Teaching & Research