Web Data Mining Exploring Hyperlinks Contents And Usage Data Pdf

Published: 31.03.2021

Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data

The review concludes that the breadth and depth of this book make it a required staple for every Web mining researcher, student, or practitioner. Keywords: Data mining, Web mining, text mining, Web information retrieval, crawling, Web search, data extraction, link analysis.

Bing Liu is a well-seasoned researcher who has made significant contributions to association rule mining, in particular classification using association rule mining and association rule mining with multiple supports. He has also worked on Web data extraction, and more recently on opinion mining. In addition to the expertise of the author, two of the chapters, Chapter 8 (Web Crawling) and Chapter 12 (Web Usage Mining), were contributed by two leading experts in these respective areas, Filippo Menczer for the former and Bamshad Mobasher for the latter.

This book is appropriate for students at the graduate or senior undergraduate level, for practitioners in industry, and even as a good comprehensive reference for researchers in academia. The Table of Contents held a surprise for someone who had always found it hard to limit the number of textbooks to one book in a web mining course that does not have data mining as a prerequisite, and thus typically prescribes a good data mining book to introduce data mining techniques, in addition to a second book related to web mining.

This book, on the other hand, has two parts, one devoted to data mining, and the other devoted to Web mining. While it was not a problem to find a very good data mining book (I have a few of them on my bookshelf), it was harder to find a book that addressed data mining and Web mining.

It was also hard to find a good and comprehensive Web mining book, since most of them tend to focus on one or only two of the three main Web mining areas of Web structure, content, and usage mining (typically leaving Web usage mining in the dark, with just a small section, citing that it is an emerging area). This book, on the other hand, is a serious book on Web mining that also devotes a decent portion to data mining.

I would describe the way the topics are presented as deep and rigorous enough in most chapters, which is in contrast to a large number of books on data mining and web mining. That said, because the book is full of simple examples that illustrate the methods being discussed, it is useful even for beginners, making it also appropriate for an introductory level course.

The first part covers the data mining tasks of association rule and sequential pattern mining, supervised learning, unsupervised learning, and partially supervised learning. The second part covers information retrieval and search, link analysis, Web crawling, structured data extraction (wrapper generation), information integration, opinion mining, and Web usage mining. Typically, a topic is presented starting with a motivation, followed by the pertinent notations, equations and definitions, then an algorithm and a concrete example illustrating the ideas.

Furthermore, every chapter ends with a section titled Bibliographic Notes that places the presented methods in a historical context, and then points the curious reader toward more literature on the topic. The book starts with an introduction (Chapter 1) that overviews the history of the WWW, and then discusses the challenges that distinguish Web mining from data mining.

It does so by emphasizing those aspects of the Web that are not typical in most other data sets. For example, the Web is a place of interaction between people and automated services; it is huge, noisy, heterogeneous, and full of unstructured and semi-structured data. Also, because of reputation concerns, not all data is considered equal on the Web. After this motivation, definitions are given for data mining and for Web mining. The author stresses that Web mining is not to be viewed as an application of traditional data mining!

This view was supported by citing the distinguishing characteristics of Web data, such as its heterogeneity and lack of structure, which have led to the invention of new specialized mining tasks and algorithms in Web mining. As a researcher in this area, I concur with this view, which is not shared by some other data mining books that typically stow Web mining as an application.

This, despite the fact that the Web mining area is so vast that many problems have no counterpart in traditional data mining, and these problems, and interest in them, have expanded over the years.

The Web mining process is then distinguished further from that of data mining, citing the issue of data collection, which can be substantial in the former compared to the latter. This claim, I found, provided a good justification for including an entire chapter (Chapter 8) on Web crawling in the book, as well as other chapters devoted to data extraction and to data integration.

That said, in addition to being directed toward data gathering, I would add that Web crawling often requires powerful data mining methods (for example, classification) in order to guide the crawler, particularly in the case of focused crawling.

I would add here that, compared to conventional data mining, even the data pre-processing and the evaluation and post-processing can be daunting in Web mining.

Chapter 2 (Association Rules and Sequential Patterns) covers two important data mining tasks that are particularly important in Web usage mining. This is due to the vast difference between the supports of the different items, stemming from the power law distribution of most Web-related data. For instance, we are all too familiar with the presence of a long tail of infrequent items in most e-commerce transactions that are less constrained by the physical limitations of the warehouse of their offline counterparts.

The same phenomenon occurs when one considers the support of words in text corpora. After this, the chapter covers mining class association rules, which are useful for transactional data and certain kinds of categorical data that are common in e-commerce applications.
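To make the long-tail problem concrete, here is a minimal sketch of why a single support threshold struggles with skewed data. The toy sessions, the threshold values, and the per-item rule `max(beta * support, floor)` are all assumptions chosen for illustration; this is not presented as the book's algorithm, only as one plausible way to assign per-item minimum supports.

```python
from collections import Counter

def item_supports(transactions):
    """Fraction of transactions that contain each item."""
    counts = Counter(item for t in transactions for item in set(t))
    n = len(transactions)
    return {item: c / n for item, c in counts.items()}

def frequent_items(supports, min_support):
    """Items whose support meets their own minimum support threshold."""
    return sorted(i for i, s in supports.items() if s >= min_support[i])

# Hypothetical sessions: "home" is very common, "manual" is a long-tail page.
transactions = [
    ["home", "cart"], ["home", "search"], ["home", "cart"], ["home"],
    ["home", "search"], ["home", "manual"], ["home", "cart"],
    ["home", "search"], ["home"], ["home", "cart"],
]
supports = item_supports(transactions)

# One global threshold (assumed value): the rare item is dropped.
single = {item: 0.25 for item in supports}

# Per-item thresholds (assumed rule): scale with the item's own support,
# but never go below a small floor.
beta, floor = 0.5, 0.05
multiple = {item: max(beta * s, floor) for item, s in supports.items()}

print(frequent_items(supports, single))
print(frequent_items(supports, multiple))
```

With the global threshold the rare page disappears, while the per-item thresholds keep it without lowering the bar for the very frequent pages.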

Finally, the chapter concludes with mining sequential rules based on GSP and based on PrefixSpan, both with and without multiple minimum supports. Missing in this chapter is the FP-tree approach, which can significantly compress Web transaction data for subsequent AR mining. Chapter 3 discusses Supervised Learning, i. The emphasis on text is further exemplified in the choice of evaluation metrics (precision, recall, F1) that were discussed, which are suitable for imbalanced data, as is often the case in text classification, and especially in information filtering.
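The point about imbalanced data can be illustrated with a small sketch using the standard definitions of precision, recall, and F1. The label vectors below are made up: 5 relevant documents out of 100, and a hypothetical classifier that finds only 2 of them.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Made-up data: 5 relevant documents (label 1) among 100.
y_true = [1] * 5 + [0] * 95
# Hypothetical classifier: finds 2 of the 5, plus 1 false alarm.
y_pred = [1, 1, 0, 0, 0] + [1] + [0] * 94

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

Accuracy looks excellent here because the negative class dominates, whereas recall and F1 expose that most relevant documents were missed.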

The chapter could have benefited from also presenting ROC curves for more than two classes. That said, the text is rigorous, including for instance the derivations of the equations for Support Vector Machines (SVM), for the separable and non-separable cases. Chapter 4 (Unsupervised Learning) presents clustering algorithms, in particular the K-means algorithm and hierarchical clustering, which in my opinion are insufficient for the Web mining field. I would have preferred a more thorough chapter on clustering, including the EM algorithm for mixture models and Spherical K-Means, which in my opinion are more scalable and more suitable for Web documents and Web sessions.

The author also discussed data standardization at the end of this chapter. This is an important part of pre-processing, and should have been presented at the start of the Supervised Learning chapter, which came earlier. For some of the presented supervised learning methods e. Chapter 5 (Partially Supervised Learning) talks about the case when some of the data is labeled, while the rest is unlabeled (LU learning), then moves to the two-class case of positive labels versus no labels (PU learning).

This chapter is very valuable in Web mining, as one is often overwhelmed by the massive number of web pages to label on the Web, thus ending up labeling only a small sample. Because the sample is very small, however, the model accuracy may not be satisfactory. Fortunately, LU learning (or semi-supervised classification) is based on taking advantage of a larger set of unlabeled samples in order to improve the accuracy of the learned classification model.

The latter case is particularly suitable for text information filtering where one has only a few examples of documents from a certain topic, and would like to find more similar topic-wise documents from a large collection that contains all kinds of topics. The only thing missing in this chapter is the mention that LU learning is also useful for clustering (semi-supervised clustering), where a limited sample of labeled web pages can guide the clustering of a larger set of unlabeled samples.

Chapter 6 (Information Retrieval and Web Search) presents Web search, a very popular problem, as the single most important application of the much older field of Information Retrieval (IR). Yet, the author stresses that Web search is not only a simple application of IR because there are many unique characteristics in Web data, for example the hyperlink information, the deceptive Web content such as Web spam, and the massive size that rules out all but the most scalable search engines.

This is analogous to the distinctions made earlier between Web mining and data mining. Web IR was presented rather rigorously. It started with the theory, i. The chapter then presents some IR evaluation measures and delves into text pre-processing, which includes very search-specific issues such as duplicate detection.

Part of this section (tokenization, stop word elimination, stemming) should have been included at the start of Chapter 3 (Supervised Learning) because it is a prerequisite to text classification. The chapter continues with a detailed discussion of inverted indexing, including various compression methods (e.g. Elias gamma, Elias delta, and Golomb coding) that are well illustrated with examples. Section 6. The chapter ends with a section on Web spamming.
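As an illustration of one of the compression methods named above, the following sketch encodes the gaps of a postings list with Elias gamma coding. It assumes the common convention of a run of zeros followed by the binary form of the number; textbooks differ on whether the unary prefix uses zeros or ones, so the bit patterns may not match the book's examples exactly. The document IDs are made up.

```python
def elias_gamma_encode(n):
    """Encode a positive integer: (bit length - 1) zeros, then n in binary."""
    if n < 1:
        raise ValueError("Elias gamma is defined for positive integers only")
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary

def elias_gamma_decode_stream(bits):
    """Decode a concatenation of Elias gamma codewords into integers."""
    values, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i + zeros] == "0":   # count the zero prefix
            zeros += 1
        end = i + 2 * zeros + 1         # codeword spans 2 * zeros + 1 bits
        values.append(int(bits[i + zeros:end], 2))
        i = end
    return values

# Hypothetical postings list (sorted document IDs) for one term.
doc_ids = [3, 7, 12, 20]
# Store the first ID and then the gaps, which are small and compress well.
gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
encoded = "".join(elias_gamma_encode(g) for g in gaps)

decoded_gaps = elias_gamma_decode_stream(encoded)
restored, total = [], 0
for g in decoded_gaps:
    total += g
    restored.append(total)
print(gaps, encoded, restored)
```

Small gaps get short codewords, which is why gap encoding followed by a variable-length code suits long, dense postings lists.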

Chapter 7 (Link Analysis) starts with a nice overview of social network analysis, thus defining important metrics in this area, such as closeness and betweenness centrality, then it moves to citation analysis, before delving into the PageRank and HITS algorithms in detail (again with several examples), and then ending with community discovery.
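For readers who have not met PageRank, here is a minimal power-iteration sketch. It uses the normalized formulation in which ranks sum to 1, a damping factor of 0.85, and an even redistribution of rank from pages with no out-links; these are common choices and assumptions on my part, and the book's own formulation may differ. The four-page graph is hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict mapping each page to its out-links.

    Assumes every link target also appears as a key of `links`.
    """
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph: D links out but nobody links to D.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

Page C collects links from every other page and ends up with the highest score, while D, which has no in-links, keeps only the baseline share.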

Chapter 8 (Web Crawling) was contributed by an expert on this topic, Filippo Menczer, who starts with universal crawlers or spiders, and finishes with focused crawlers (examples of positive and negative pages are available) and topical crawlers (only some positive examples are used as seeds). The chapter even addresses certain implementation issues such as parsing, spider traps, and concurrency.

Then Menczer presents a brilliant discussion on adaptation in crawlers, delighting the reader with details about InfoSpiders, which adapt through reinforcement learning. The last discussion on new developments is enlightening and up to date, as it mentions the future of peer-to-peer search, for example.

Chapter 9 is titled Structured Data Extraction: Wrapper Generation. The latter uses training samples to learn automated extractors that are essential to some data gathering tasks where one wants to collect, say, the Data Rich Pages that list product features and prices on an e-commerce website. The chapter is detailed and contains numerous examples with nice diagrams that make it easy to quickly grasp the concepts.

Chapter 10 (Information Integration) may at first appear to be in the wrong book, belonging instead in one on databases. However, starting with the first page, the author makes it clear that this is a sequel to the previous chapter on structured data extraction, for the case when we want to extract data from multiple websites.

This subject turns out to be essential to extracting the right data from the deep Web (Web databases), and should thus be considered a crucial step in deep Web mining. Chapter 11 is devoted to Opinion Mining, and thus focuses attention on user-generated content, as is common on blogs and discussion forums. The author presents methods for sentiment mining, then feature-based opinion mining and summarization, then comparative sentence and relation mining, and finally opinion search and opinion spam and its detection.

Just to emphasize the thoroughness of this chapter, I would say that it contains 12 examples! This finally leads us to Chapter 12 (Web Usage Mining), itself contributed by a leading expert in the field, Bamshad Mobasher. The chapter starts with a detailed explanation of data collection and pre-processing, followed by a discussion of data modeling for web usage mining.

It then delves into the methods of Web usage pattern discovery, including clustering, association rule and sequential pattern mining, and classification and prediction with collaborative filtering, an essential social filtering method used to provide recommendations on e-commerce websites.
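To give a feel for collaborative filtering, here is a minimal user-based sketch: a missing rating is predicted as a similarity-weighted average of the ratings given by other users. The cosine similarity over co-rated items, the user names, and the ratings are all assumptions for illustration, not the chapter's specific method.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two users' rating dictionaries."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)

def predict(ratings, user, item):
    """Similarity-weighted average of other users' ratings for `item`."""
    num = den = 0.0
    for other, their in ratings.items():
        if other == user or item not in their:
            continue
        sim = cosine(ratings[user], their)
        num += sim * their[item]
        den += abs(sim)
    return num / den if den else None

# Hypothetical ratings on a 1-5 scale; "ann" has not rated "desk".
ratings = {
    "ann": {"book": 5, "lamp": 1},
    "bob": {"book": 5, "lamp": 1, "desk": 4},
    "cat": {"book": 1, "lamp": 5, "desk": 2},
}
prediction = predict(ratings, "ann", "desk")
print(round(prediction, 2))
```

The prediction lands closer to bob's rating than to cat's because ann's past ratings agree with bob's, which is the core idea behind neighborhood-based recommendation.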

This chapter provides an excellent foundation in Web usage mining (WUM), but could benefit from adding more contemporary Web mining tasks such as query log mining and addressing challenges such as scalability and adapting to the evolution of user activity. A section on privacy would also make this chapter more rounded. That said, this chapter is a welcome and practical treatment of WUM that has not been thoroughly presented in previous books.

The book could benefit from adding problems, at the end of each chapter, that would not only spark the curiosity of the casual reader, but also support using the book for teaching.

The few shortcomings that were mentioned in this review are actually easy to mend, because the book started on a solid foundation. Another very good book, by Baldi et al. Also, most other books are in need of an update to cover the more recent methods in this fast moving area and the interesting problems that have emerged in the last few years e.

Baldi, P. Frasconi, and P. Modeling the Internet and the Web - Probabilistic methods and algorithms, Wiley,

Olfa Nasraoui is an Associate Professor in computer engineering and computer science at the University of Louisville, where she is also the endowed Chair of e-commerce. Her research activities include data mining, Web mining, mining evolving data streams, personalization, and computational intelligence.

Web Data Mining

Berendt, B. Mobasher, M. Spiliopoulou 1. Goal: analyze the behavioral patterns and profiles of users interacting with a Web site. The discovered patterns are usually represented as collections of pages, objects, or resources that are frequently accessed by groups of users with common interests. When they are, they must be integrated. After that, various data mining algorithms can be applied.

Web Data Mining (eBook)

Web mining aims to discover useful information and knowledge from Web hyperlinks, page contents, and usage data.

3 Response
  1. Sean N.

    Five of the chapters - partially supervised learning, structured data extraction, information integration, opinion mining and sentiment analysis, and Web usage.