URL-Based Web Page Classification: With n-Gram Language Models

Tarek Amr Abdallah, Beatriz De La Iglesia

Research output: Chapter in Book/Report/Conference proceeding; Chapter (peer-reviewed)

4 Citations (Scopus)

Abstract

In some situations it is important to classify a web page efficiently and reliably from the information contained in its Uniform Resource Locator (URL) alone, without visiting the page itself. For example, a social media website may need to quickly identify status updates that link to malicious websites in order to block them. A URL is very concise and may be composed of concatenated words, so classification using only this information is a very challenging task. Methods proposed for this task, such as the all-grams approach, which extracts all possible sub-strings as features, provide reasonable accuracy but do not scale well to large datasets. We have recently proposed a new method for URL-based web page classification: an n-gram language model that provides competitive accuracy and scales to larger datasets. Our method also allows for the classification of new URLs with unseen sub-sequences. In this paper we extend our presentation and include additional results to validate the proposed approach. We explain the parameters associated with the n-gram language model and test their impact on the models produced. Our results show that our method is competitive in accuracy with the best known methods and also scales well to larger datasets.
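
To make the idea of classifying a URL with per-class n-gram language models concrete, the following is a minimal sketch and not the authors' implementation. It assumes character trigrams, additive (Laplace) smoothing so that unseen sub-sequences still receive a non-zero probability, and uniform class priors. The names char_ngrams and NGramLMClassifier, the parameters n and alpha, and the example URLs are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(url, n):
    """Return the character n-grams of a lowercased URL, padded at the start."""
    text = "^" * (n - 1) + url.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NGramLMClassifier:
    """One smoothed character n-gram language model per class (illustrative)."""

    def __init__(self, n=3, alpha=1.0):
        self.n = n          # n-gram order (assumed default, not from the paper)
        self.alpha = alpha  # additive smoothing constant (assumed default)
        self.ngram_counts = defaultdict(Counter)    # class -> n-gram counts
        self.context_counts = defaultdict(Counter)  # class -> (n-1)-gram counts
        self.vocab = set()                          # characters seen in training

    def fit(self, urls, labels):
        for url, label in zip(urls, labels):
            for gram in char_ngrams(url, self.n):
                self.ngram_counts[label][gram] += 1
                self.context_counts[label][gram[:-1]] += 1
                self.vocab.add(gram[-1])
        return self

    def log_score(self, url, label):
        """Log-probability of the URL under the language model of one class."""
        v = len(self.vocab) + 1  # one extra slot for characters never seen
        total = 0.0
        for gram in char_ngrams(url, self.n):
            num = self.ngram_counts[label][gram] + self.alpha
            den = self.context_counts[label][gram[:-1]] + self.alpha * v
            total += math.log(num / den)
        return total

    def predict(self, url):
        """Return the class whose language model scores the URL highest."""
        return max(self.ngram_counts, key=lambda c: self.log_score(url, c))


# Toy training data, invented purely for illustration.
train_urls = [
    "http://www.cs.example.edu/courses/algorithms",
    "http://www.example.edu/courses/cs101/syllabus",
    "http://www.example.edu/~jsmith/homepage",
    "http://www.example.edu/faculty/~adoe/",
]
train_labels = ["course", "course", "faculty", "faculty"]

model = NGramLMClassifier(n=3, alpha=1.0).fit(train_urls, train_labels)
print(model.predict("http://uni.example.org/courses/databases"))
```

Because training only accumulates counts and scoring only looks up the n-grams present in the URL being classified, the cost of this sketch grows linearly with the amount of URL text, in contrast to an approach that materialises every possible sub-string as a feature. Smoothing is what lets a URL containing trigrams never seen in training still receive a finite score for every class.
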
Original language: English
Title of host publication: Knowledge Discovery, Knowledge Engineering and Knowledge Management
Subtitle of host publication: 6th International Joint Conference, IC3K 2014, Rome, Italy, October 21-24, 2014, Revised Selected Papers
Editors: Ana Fred, Jan L.G. Dietz, David Aveiro, Liu Kecheng, Joaquim Filipe
Publisher: Springer
Pages: 19-33
Number of pages: 15
Volume: 553
ISBN (Electronic): 978-3-319-25840-9
ISBN (Print): 978-3-319-25839-3
Publication status: Published - 2015

Publication series

Name: Communications in Computer and Information Science

Keywords

  • Language models
  • Information retrieval
  • Web classification
  • Web mining
  • Machine learning
