Web Mining Notes
1. Web mining is mining of data related to the World Wide Web. This may be the data actually present in Web pages or data related to Web activity. Web data can be:
a. Content of actual Web pages
b. Intra-page structure, which includes the HTML or XML code for the page.
c. Inter-page structure, which is the actual linkage structure between Web pages.
d. Usage data that describe how Web pages are accessed by visitors.
e. User profiles, which include demographic and registration information obtained about users.
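The first three kinds of Web data can be illustrated with a small sketch. It uses Python's standard html.parser module to pull the text content, the tag sequence (intra-page structure), and the outgoing links (inter-page structure) out of one page. The sample page, the class name PageDataExtractor, and the URLs are assumptions made up for illustration.

```python
from html.parser import HTMLParser


class PageDataExtractor(HTMLParser):
    """Collects three kinds of Web data from a single HTML page."""

    def __init__(self):
        super().__init__()
        self.text_parts = []  # content: visible text fragments
        self.tags = []        # intra-page structure: opening tags in document order
        self.links = []       # inter-page structure: targets of outgoing hyperlinks

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.text_parts.append(stripped)


# Hypothetical page used only for this example.
SAMPLE_PAGE = """<html><body>
<h1>Web Mining</h1>
<p>See the <a href="/usage.html">usage notes</a> and
<a href="http://example.org/structure.html">structure notes</a>.</p>
</body></html>"""

extractor = PageDataExtractor()
extractor.feed(SAMPLE_PAGE)
content = " ".join(extractor.text_parts)

print("content:", content)
print("tags:   ", extractor.tags)
print("links:  ", extractor.links)
```

Usage data and user profiles do not come from the page itself. They would typically come from server logs and registration records.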
Taxonomy of Web Mining:
Figure: Web mining Taxonomy
2. Web content mining examines the content of Web pages as well as results of Web searching. The content includes text as well as graphics data. Web content mining is further divided into Web page content mining and search results mining.
3. Web page content mining is traditional searching of Web pages via content, while search results mining is a further search of pages found from a previous search.
4. With Web structure mining, information is obtained from the actual organization of pages on the Web.
5. Web usage mining looks at the logs of Web access. General access pattern tracking is a type of usage mining that looks at the history of Web pages visited. This tracking may be general or may be targeted to specific usage or users. Usage mining also involves mining the sequential patterns found in these histories of page visits.
6. Uses of Web Mining: Personalization for a user can be achieved by keeping track of previously accessed pages. Web usage patterns can be used to gather business intelligence to improve sales and advertisement. Information can be collected in new ways. The relevance of content and the architecture of a Web site can be tested.
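A minimal sketch of usage mining on an access log follows. It groups page requests into sessions per visitor and counts which consecutive page pairs occur in at least a minimum number of sessions. The log entries, the 30-minute session gap, and the support threshold are all assumptions chosen for illustration. Real sequential pattern mining uses longer patterns and more careful session identification.

```python
from collections import Counter, defaultdict

# Hypothetical, already-parsed access log: (visitor id, time in seconds, page).
ACCESS_LOG = [
    ("u1", 0, "/home"), ("u1", 30, "/products"), ("u1", 75, "/cart"),
    ("u2", 10, "/home"), ("u2", 50, "/products"),
    ("u3", 20, "/home"), ("u3", 40, "/about"),
    ("u1", 5000, "/home"), ("u1", 5020, "/products"),
]

SESSION_GAP = 1800  # assumption: a pause longer than 30 minutes starts a new session


def build_sessions(log, gap=SESSION_GAP):
    """Split each visitor's time-ordered page requests into sessions."""
    by_visitor = defaultdict(list)
    for visitor, ts, page in sorted(log, key=lambda entry: (entry[0], entry[1])):
        by_visitor[visitor].append((ts, page))

    sessions = []
    for visits in by_visitor.values():
        current = [visits[0][1]]
        for (prev_ts, _), (ts, page) in zip(visits, visits[1:]):
            if ts - prev_ts > gap:
                sessions.append(current)
                current = []
            current.append(page)
        sessions.append(current)
    return sessions


def frequent_pairs(sessions, min_support=2):
    """Return consecutive page pairs that appear in at least min_support sessions."""
    counts = Counter()
    for session in sessions:
        # A set ensures each pair is counted at most once per session.
        counts.update(set(zip(session, session[1:])))
    return {pair: n for pair, n in counts.items() if n >= min_support}


sessions = build_sessions(ACCESS_LOG)
patterns = frequent_pairs(sessions)
print(patterns)  # {('/home', '/products'): 3}
```

A pattern such as /home followed by /products could then feed personalization, for example by suggesting the products page to visitors who land on the home page.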
HARVEST SYSTEM
1. The Harvest system is based on the use of caching, indexing, and crawling. Harvest is actually a set of tools that facilitate gathering of information from diverse sources.
2. The Harvest design is centered around the use of gatherers and brokers.
3. A gatherer obtains information for indexing from an Internet service provider, while a broker provides the index and query interface. The relationship between brokers and gatherers can vary. Brokers may interface directly with gatherers or may go through other brokers to get to the gatherers.
4. Indices in Harvest are topic-specific, as are brokers.
5. Harvest gatherers use the Essence system to assist in collecting data. Essence classifies documents by creating a semantic index.
6. Semantic indexing generates different types of information for different types of files. It then creates indices on this information.
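The gatherer and broker roles can be sketched in a few lines. This is not Harvest's actual code or interface. The classes Gatherer and Broker, the summarize function, and the file-type rules are hypothetical names and rules invented to show the idea: gatherers extract type-specific information, brokers index it and answer queries, and a broker may sit on top of other brokers.

```python
from collections import defaultdict


def summarize(name, body):
    """Extract different index terms for different file types (rules are assumed)."""
    if name.endswith(".py"):
        # For source code, keep only the names of defined functions.
        return [line[len("def "):].split("(")[0].strip().lower()
                for line in body.splitlines() if line.startswith("def ")]
    # For everything else, treat the body as plain text and keep every word.
    return body.lower().split()


class Gatherer:
    """Collects indexing information from one source of documents."""

    def __init__(self, documents):
        self.documents = documents  # {file name: file body}

    def gather(self):
        return {name: summarize(name, body) for name, body in self.documents.items()}


class Broker:
    """Builds an index over its sources and answers term queries."""

    def __init__(self, sources):
        self.sources = sources  # gatherers or other brokers
        self.index = defaultdict(set)

    def gather(self):
        # Lets another broker use this broker as a source.
        merged = {}
        for source in self.sources:
            merged.update(source.gather())
        return merged

    def build_index(self):
        self.index.clear()
        for name, terms in self.gather().items():
            for term in terms:
                self.index[term].add(name)

    def query(self, term):
        return sorted(self.index.get(term.lower(), set()))


g1 = Gatherer({"notes.txt": "Web mining uses crawling",
               "crawl.py": "def fetch(url):\n    pass\n"})
g2 = Gatherer({"intro.txt": "Crawling and indexing the Web"})
inner = Broker([g2])        # a broker that talks directly to a gatherer
top = Broker([g1, inner])   # a broker that goes through another broker
top.build_index()

print(top.query("crawling"))  # ['intro.txt', 'notes.txt']
print(top.query("fetch"))     # ['crawl.py']
```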
VIRTUAL WEB VIEW
1. A multiple layered database (MLDB) is used to handle the large amounts of unstructured data on the Web.
2. This database is massive and distributed. Each layer is more generalized than the layer beneath it.
3. The MLDB provides an abstracted and condensed view of a portion of the Web. A view of the MLDB, called a Virtual Web View (VWV), can be constructed.
4. Generalization tools are proposed, and concept hierarchies are used to assist in the generalization process for constructing the higher levels of the MLDB.
5. WebML, a Web data mining query language, is proposed to provide data mining operations on the MLDB. It is an extension of DMQL (Data Mining Query Language).
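How a concept hierarchy produces the higher layers can be sketched as follows. Each call to generalize replaces a record's topic with its parent concept and merges records that end up with the same topic, so every layer is smaller and more general than the one beneath it. The hierarchy, the layer-0 records, and the function name are assumptions for illustration only. The sketch does not use WebML syntax.

```python
from collections import Counter

# Assumed concept hierarchy: specific topic -> more general concept.
CONCEPT_HIERARCHY = {
    "association rules": "data mining",
    "clustering": "data mining",
    "sql": "databases",
    "indexing": "databases",
    "data mining": "computer science",
    "databases": "computer science",
}

# Layer 0: hypothetical descriptors of individual Web pages.
LAYER_0 = [
    {"url": "http://example.org/a", "topic": "clustering"},
    {"url": "http://example.org/b", "topic": "association rules"},
    {"url": "http://example.org/c", "topic": "sql"},
    {"url": "http://example.org/d", "topic": "clustering"},
]


def generalize(layer, hierarchy):
    """Build the next layer: lift each topic one level and merge equal topics."""
    counts = Counter()
    for record in layer:
        parent = hierarchy.get(record["topic"], record["topic"])
        # Layer-0 records describe one page each; higher layers carry a page count.
        counts[parent] += record.get("pages", 1)
    return [{"topic": topic, "pages": n} for topic, n in sorted(counts.items())]


layer_1 = generalize(LAYER_0, CONCEPT_HIERARCHY)
layer_2 = generalize(layer_1, CONCEPT_HIERARCHY)

print(layer_1)  # [{'topic': 'data mining', 'pages': 3}, {'topic': 'databases', 'pages': 1}]
print(layer_2)  # [{'topic': 'computer science', 'pages': 4}]
```

A Virtual Web View would then correspond to a query over one or more of these layers instead of over the raw pages.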
AUTOMATIC CLASSIFICATION OF DOCUMENTS
1. In automatic classification, each document is assigned a class label from a set of predefined topic categories.
2. For example, Yahoo!’s taxonomy and its associated documents can be used as training and test sets to derive a Web document classification scheme. This scheme may then be used to classify new Web documents.
3. Keyword-based document classification methods and keyword-based association analysis methods can be used for Web document classification. Such term-based classification schemes have shown good results in Web page classification.
4. A Web page may contain multiple themes, advertisements, and navigation information. Therefore, block-based page content analysis may play an important role in the construction of high-quality classification models.
5. Hyperlinks contain high-quality semantic clues to a page’s topic. It is beneficial to make good use of such semantic information in order to achieve even better accuracy than pure keyword-based classification. But, because the hyperlinks surrounding a document may be quite noisy, naïve use of terms in a document’s hyperlink neighborhood can even degrade accuracy. The use of block-based Web linkage analysis will reduce such noise and enhance the quality of Web document classification.
6. The semantic Web is a Web information infrastructure that is expected to bring structure to the Web based on the semantic meaning of the contents of Web pages. Web document classification by Web mining will help in the automatic extraction of the semantic meaning of Web pages and build up an ontology for the semantic Web. Conversely, the semantic Web, if successfully constructed, will greatly help automated Web document classification as well.
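A keyword-based classifier of the kind described in the list above can be sketched with a small multinomial naive Bayes model. Naive Bayes is one common term-based method and is used here as an assumption, since the notes do not name a specific algorithm. The training documents, the category labels, and the function names are invented for illustration. A real system would train on a labelled taxonomy and might add block-based and hyperlink features.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labelled documents standing in for a predefined topic taxonomy.
TRAINING = [
    ("sports", "football match score goal team"),
    ("sports", "team wins tennis match"),
    ("finance", "stock market shares price"),
    ("finance", "bank interest rate market"),
]


def train(examples):
    """Count documents per class and keyword occurrences per class."""
    term_counts = defaultdict(Counter)
    doc_counts = Counter()
    for label, text in examples:
        doc_counts[label] += 1
        term_counts[label].update(text.lower().split())
    vocabulary = {term for counts in term_counts.values() for term in counts}
    return term_counts, doc_counts, vocabulary


def classify(text, model):
    """Return the class with the highest log-probability for the given text."""
    term_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)  # class prior
        total_terms = sum(term_counts[label].values())
        for term in text.lower().split():
            if term not in vocabulary:
                continue  # ignore keywords never seen in training
            # Laplace (add-one) smoothing avoids zero probabilities.
            score += math.log((term_counts[label][term] + 1)
                              / (total_terms + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAINING)
print(classify("market price of shares", model))  # finance
print(classify("tennis team", model))             # sports
```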