Literature Review on Web Usage Mining

The Internet has become the largest data repository that has ever existed. In the early part of the last decade it was estimated that the Internet contained more than 350 million pages [11], but a study conducted a few years ago found that the indexed part of the World Wide Web alone consists of at least 11.5 billion pages [12]. The number of people using the Internet is also growing rapidly. A survey conducted by Computer Industry Almanac is evidence of this: according to its results, the number of online users crossed one billion in 2005, up from only 45 million in 1995, and the number was predicted to cross two billion by 2011 [13].

For users of the Internet, finding the required information in this large volume of data has become extremely difficult, so efficient methods of information retrieval are essential. It has also been found that more than 90% of the data is in unstructured form, which makes organizing and structuring this data an important research issue. With this large amount of information available on the web, business processes need to move beyond simple document retrieval to knowledge discovery. Businesses try to extract useful patterns from the available data that help them better understand their customers' needs, which in turn leads to better customer satisfaction.

Literature Review on Web Usage Mining

Web mining helps web designers discover knowledge from the information available on the web, and it helps users retrieve the information they are looking for quickly. The three major areas of web mining are:

Web content mining - extracting useful information from the text, images, audio and video in web pages.

Web structure mining - analysing the link structure of the Web, which helps in categorizing web pages.

Web usage mining - extracting useful information from server logs to understand what users are looking for; it also supports personalization of web pages.

Though the three categories of web mining are interlinked, this research focuses on web usage mining. Web usage mining helps web masters understand what users are looking for so that they can develop strategies that help users reach the required information quickly. It is generally implemented using the navigational traces of users, which provide knowledge about user preferences and behaviour. The navigational patterns are analysed and the users are grouped into clusters. Classifying navigational patterns into groups helps to improve the quality of personalized web recommendations, which are used to predict the web pages a user is most likely to access in the near future. This kind of personalization also helps to reduce the network traffic load and to identify the search pattern of a particular group of users.

Data mining techniques such as clustering, sequential pattern mining and association rule mining are used in web mining. All of these techniques extract interesting and frequent patterns from the information recorded in web server logs. The patterns are used to understand user needs and help web designers improve web services and the personalization of web sites.

Web Access Sequence

Generally, web usage mining is performed on the navigation history stored in the logs of the web server. This navigation history is also called a web access sequence, and it contains information about the pages a user visits, the time spent on each page and the path the user traverses within the website. A web access sequence therefore contains all the details of the pages a user visited during a single session. The data obtained from the log files is subjected to various data mining techniques to obtain useful patterns that describe the user profile or behaviour. These patterns act as the base knowledge for developing intelligent online applications and for improving the quality of web personalization and web recommendations. Web mining can generally be classified into two categories: online mining and offline mining. Offline mining uses the data stored in the log files to find the navigational patterns, while online mining uses the requests a user makes in the current active session. The current user profile is decided by matching the recommendations from both the online and offline methods.
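
As a concrete illustration of the data involved, the sketch below shows one minimal way a single session and its web access sequence could be represented in code. The field names and example values are illustrative assumptions, not taken from any of the cited systems.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PageVisit:
    url: str              # page requested within the site
    seconds_spent: float  # approximated from the gap to the next request

@dataclass
class Session:
    user_id: str
    visits: List[PageVisit] = field(default_factory=list)

    def access_sequence(self) -> List[str]:
        """The web access sequence: the ordered list of pages in this session."""
        return [v.url for v in self.visits]

# Example: one session reconstructed from a server log
session = Session("user-42", [
    PageVisit("/index.html", 12.0),
    PageVisit("/products.html", 47.5),
    PageVisit("/products/item3.html", 95.0),
])
print(session.access_sequence())
```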

Several systems have been designed to implement web usage mining; Analog is one of the first of them. It has two components, an online component and an offline component. The offline component reformats the data available in the log file. Generally, a web server log contains information such as the IP address of the client, the time at which the web page was requested, the URL of the page and the HTTP status code. The available data is cleaned by removing unwanted information, after which the system analyses the users' past activities recorded in the server log files and groups the user sessions into clusters. The online component then classifies the active user sessions based on the model generated by the offline component. Once the user group is found, the system attaches a list of suggestions to each user request; the suggestions depend on the group to which the user belongs.
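
The cleaning step mentioned above usually amounts to parsing the raw log lines and discarding requests that do not correspond to page views. The sketch below, assuming the widely used Common Log Format and illustrative filtering rules, shows the general idea.

```python
import re

# Common Log Format:  host ident authuser [date] "request" status bytes
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d{3}) \S+'
)

# Typical cleaning rules: drop embedded resources and unsuccessful requests
IGNORED_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js", ".ico")

def parse_and_clean(lines):
    for line in lines:
        m = LOG_PATTERN.match(line)
        if not m:
            continue                      # malformed entry
        rec = m.groupdict()
        if rec["url"].lower().endswith(IGNORED_SUFFIXES):
            continue                      # not a page view
        if not rec["status"].startswith("2"):
            continue                      # keep successful requests only
        yield rec["ip"], rec["time"], rec["url"]

sample = ['10.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326']
print(list(parse_and_clean(sample)))
```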

Clustering Techniques

An important part of web usage mining is clustering users into groups based on their profiles and search patterns. The clustering of user sessions can be done in several ways. Christos et al. represent each page as a unique symbol, which turns a web access sequence into a string [1]. Consider S as the set of all possible web access sequences. The web mining system processes this set S offline, as a background process or during idle time, to group the sequences into clusters such that similar sequences fall in the same cluster. The resulting clusters are represented by means of a weighted suffix tree. The clustering is done by constructing a similarity matrix [1], which is then given as input to the k-windows clustering algorithm [10] to generate clusters of very similar web access sequences. When two web access sequences have the same length, global alignment has to be taken into account rather than local alignment. Scores are calculated for both the local and the global alignment; a simple way to calculate them is to assign a positive value to a matching position and a negative value to a mismatch.

Two web access sequences are said to be similar if they have maximal alignment. Sometimes the web pages listed in a sequence may be unimportant to the user; the user may have reached such a page by a wrong click, and in such cases will leave it immediately. The user therefore stays only a short time on these unimportant pages. Before aligning web access sequences, all these factors have to be taken care of in order to obtain useful patterns.
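
A minimal sketch of a global alignment score for two web access sequences is given below, in the spirit of the approach in [1]. The concrete scoring values (+1 for a match, -1 for a mismatch or gap) are illustrative assumptions rather than the scores used in the cited work.

```python
def global_alignment_score(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch style global alignment score between two
    web access sequences (each a list of page symbols)."""
    n, m = len(seq_a), len(seq_b)
    # dp[i][j] = best score aligning seq_a[:i] with seq_b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + gap
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + sub,   # align the two pages
                           dp[i - 1][j] + gap,       # gap in seq_b
                           dp[i][j - 1] + gap)       # gap in seq_a
    return dp[n][m]

# Pages encoded as symbols: the two sequences differ by one skipped page
print(global_alignment_score(list("ABCD"), list("ABD")))   # -> 2
```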


C. Suresh et al. proposed an approach in which the clusters are identified using distance-based clustering methods, and they also developed a framework to compare the performance of various clustering methods based on replicated clustering. In traditional methods the distance between two user sessions is calculated using the Euclidean distance measure, but experiments show that the sequence alignment method represents the behavioural characteristics of web users better than the Euclidean distance. Cadez et al. [14] categorize user sessions into general topics and represent the behaviour of each topic by a Markov chain. Fu et al. [15] use the Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) algorithm for clustering at the page level. BIRCH is a distance-based hierarchical algorithm used for clustering web user sessions, but it has been observed that an increase in the number of pages diminishes its performance. Since each website contains hundreds of pages, treating each page as a separate state makes the clustering unmanageable. To overcome this difficulty the authors proposed generalizing the sessions using attribute-oriented induction, so that the clustering of pages is done at the category level. Mapping a particular page to a specific category has always been a difficult job, but it can be done using clustering algorithms.

The most commonly used clustering algorithm is k-means, but its major disadvantage is that the number of clusters has to be specified in advance, which is not possible in a real-world scenario. To overcome this problem, researchers have used fuzzy ART neural networks, an unsupervised learning approach in which there is no need to specify the number of clusters in advance. The main issue with the fuzzy ART network is "category proliferation", which leads to unrestricted growth in the number of clusters; sometimes the network produces a large number of clusters with only a few members each. After considering the merits and demerits of both algorithms, the authors proposed a hybrid approach called FAK. The FAK algorithm has two phases: in the first phase fuzzy ART is used as an initial seed generator, and clusters whose centroids are too close to others are removed, thereby addressing the category proliferation problem; in the second phase the k-means algorithm is applied to obtain the final clusters (a rough sketch of this refinement step is given below). The authors found that FAK performs much better than the other methods. An important consideration during clustering is the number of user sessions to take into account. In most cases the designers decide that the first N sessions of a user are enough for a decent recovery of his web cluster, and they also decide whether to include sessions with short session lengths, because those sessions may not be helpful in identifying the clusters. So the two main factors to consider while performing the clustering are the number of user sessions and the minimum session length.
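
The sketch below illustrates only the second, k-means refinement phase of such a pipeline; the initial seeds are assumed to come from a fuzzy ART pass (not shown), and the toy session vectors are invented for illustration.

```python
import numpy as np

def kmeans_refine(sessions, seeds, iterations=20):
    """Second phase of a FAK-style pipeline: refine initial cluster
    centroids (assumed to come from a fuzzy ART seed generator) with
    ordinary k-means. `sessions` is an (n, d) array of session feature
    vectors, `seeds` a (k, d) array of starting centroids."""
    centroids = np.asarray(seeds, dtype=float).copy()
    for _ in range(iterations):
        # assign each session to its nearest centroid
        dists = np.linalg.norm(sessions[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute centroids; keep the old one if a cluster becomes empty
        for k in range(len(centroids)):
            members = sessions[labels == k]
            if len(members):
                centroids[k] = members.mean(axis=0)
    return labels, centroids

# toy example: six sessions described by two features, two initial seeds
X = np.array([[0.1, 0.2], [0.0, 0.1], [0.2, 0.0],
              [0.9, 1.0], [1.0, 0.8], [0.8, 0.9]])
labels, cents = kmeans_refine(X, seeds=[[0.0, 0.0], [1.0, 1.0]])
print(labels)
```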

Combining Web Content Mining and Web Usage Mining

An experiment was conducted to extract the navigational patterns of a website's users [6]. The experiment aimed at predicting the users' gender and whether they were interested in certain website sections. When the results were analysed, the model was found to be only 56% accurate. The reason for the low accuracy was the failure to include the web content in the classification model; it is believed that exploring the content of a page helps in better understanding the user profile and thereby improves classification accuracy. Web usage mining and web content mining can be combined and used in web personalization, where the contents of a web page differ for each user according to their navigational pattern. In this technique the links a user is likely to visit in the near future are predicted based on the user's profile, and the predicted links are displayed dynamically in the requested page. Links to frequently visited pages are highlighted, while pages that have not been visited for a long time are removed.

This hybrid approach is implemented by performing an initial clustering based on the contents of the web pages, followed by web access sequence alignment. Text clustering can be done effectively with the spherical k-means algorithm [10]. Since multiple sequence alignment consumes a lot of time and space, it can be replaced effectively by iterative progressive alignment. A weighted suffix tree is used to find the most frequent and important navigational patterns with little memory and computational cost. It has been shown that exploiting the content improves performance by 2-3%.
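
For reference, the sketch below shows a bare-bones spherical k-means over term vectors: both documents and centroids are kept on the unit sphere and cosine similarity drives the assignment. The vectors and the choice of k are illustrative, not taken from the cited work.

```python
import numpy as np

def spherical_kmeans(docs, k, iterations=20, seed=0):
    """Minimal spherical k-means: documents and centroids live on the
    unit sphere and similarity is the cosine (dot product)."""
    rng = np.random.default_rng(seed)
    X = docs / np.linalg.norm(docs, axis=1, keepdims=True)   # unit-length rows
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iterations):
        sims = X @ centroids.T                 # cosine similarity to each centroid
        labels = sims.argmax(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                c = members.sum(axis=0)
                centroids[j] = c / np.linalg.norm(c)  # re-project onto the sphere
    return labels, centroids

# toy term-frequency vectors for six pages
pages = np.array([[3, 0, 1], [4, 0, 0], [2, 1, 0],
                  [0, 3, 2], [0, 4, 1], [1, 2, 3]], dtype=float)
labels, _ = spherical_kmeans(pages, k=2)
print(labels)
```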

In the model proposed by Liu [7], the contents of the web pages and the information from the web server log files are used together to extract meaningful patterns; the extracted page content is represented by means of character N-grams. The users of a web site can be classified by two approaches, proactive and reactive. The proactive approach tries to map each request to a user before or during the user's interaction with the site, while the reactive approach maps each request to a user after the user completes the interaction. The proactive approach requires browser cookies to be enabled and the user to be aware of cookies, so it is usually easier to use the reactive approach, which does not require any prior knowledge from the user. An experiment with 1,500 sample sessions was conducted to evaluate the proposed method; the results show the system is 70% accurate in classification and 65% accurate in prediction.
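
Character N-gram profiles of the kind used for this content representation can be computed very simply. The sketch below (with an arbitrary choice of n = 3 and a toy string; the exact profile construction in the cited work may differ) shows the idea.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Character N-gram profile of a page's extracted text; the
    normalisation and choice of n here are illustrative assumptions."""
    text = " ".join(text.lower().split())      # normalise whitespace
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    return Counter(grams)

profile = char_ngrams("Web usage mining and web content mining")
print(profile.most_common(5))
```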

The success of a website also depends on the user-perceived latency (UPL) of the documents hosted on the web server. A short user-perceived latency obviously has a positive effect on user satisfaction, as shown by a study conducted by Zona Research Inc. in 1999: "if a web site takes more than eight seconds to download then 30% of the visitors are likely to leave the site" [35]. User-perceived latency is influenced by many factors, such as the speed of the Internet connection and the bandwidth of the ISP. One way to reduce UPL is to use the browser cache, in which frequently accessed documents are pre-fetched and stored.


A web cache is generally implemented using a proxy server. All requests from users to a web server are intercepted by the proxy server; if the proxy has a valid copy of the response it returns the result to the user, otherwise the request is forwarded to the original server, which sends the response back to the proxy. The proxy retains a copy of the response in its cache and then returns the result to the user. The main problem with a web cache is that if the cache is not up to date the users are served stale data. Also, if a large number of users access a web server simultaneously, severe caching problems can arise, which may result in the unavailability of web pages. To overcome these issues the authors suggested an approach that combines web prefetching and caching. In this approach the objects that need to be pre-fetched in a web cache environment are first identified from the information available in the log files; these objects are then grouped into clusters for each client group. When a user requests an object, the user is first assigned to one of the client groups, then the proxy server fetches all the cluster objects of that client group, and finally the requested object is delivered to the user.
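
The flow just described can be sketched as below. The helpers `origin_fetch`, `client_group_of` and `cluster_objects_for` are hypothetical stand-ins for the real origin server, client classifier and per-group cluster store; only the request-handling logic follows the description above.

```python
# A rough sketch of the combined caching/pre-fetching flow described above.
cache = {}

def origin_fetch(url):
    return f"<contents of {url}>"          # placeholder origin response

def client_group_of(client_ip):
    return hash(client_ip) % 3             # pretend classifier: 3 client groups

def cluster_objects_for(group):
    # objects previously clustered for this client group from the log files
    return {0: ["/a.html", "/b.html"], 1: ["/c.html"], 2: ["/d.html"]}[group]

def handle_request(client_ip, url):
    group = client_group_of(client_ip)
    # pre-fetch the cluster objects of this client group into the cache
    for obj in cluster_objects_for(group):
        if obj not in cache:
            cache[obj] = origin_fetch(obj)
    # serve the requested object, from cache if possible
    if url not in cache:
        cache[url] = origin_fetch(url)
    return cache[url]

print(handle_request("10.0.0.7", "/a.html"))
```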

To achieve minimal UPL we have to predict the user's preferences based on the pages already visited. The importance of a page is determined by the weight the algorithm assigns to it; if two or more pages have the same weight, we rely on the page rank.

PageRank [22] is "a probability distribution used to represent the likelihood that a person randomly clicking on links will arrive at any particular page". A page rank can be assigned to documents of any size. If a document has a page rank of 0.5, it means that a user clicking on links at random has a 50% probability of landing on that document. Consider a web site consisting of four pages P1, P2, P3 and P4. Initially the page rank of all the pages is the same: since there are four pages, each is assigned a page rank of 0.25. If pages P2, P3 and P4 link only to P1, then each of the three pages contributes its 0.25 page rank to P1. The page rank of P1 can then be calculated using the following formula:

PR(P1) = PR(P2) + PR(P3) + PR(P4)

Now suppose page P2 also has a link to page P3, and P4 has links to all three other pages; then the link vote of each page is divided among all of its outbound links. Page P2 contributes 0.125 to page P1 and 0.125 to page P3, and P4 contributes one third of its page rank value to P1. The general formula for calculating the page rank of any page u is as follows:

PR(u) = Σ_{v ∈ B_u} PR(v) / L(v)

where B_u is the set of all pages linking to page u and L(v) is the number of outbound links from page v.
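
The four-page example can be computed by iterating this formula, as sketched below with no damping factor, matching the simplified formula above. One assumption is added for illustration: P1 is given an outbound link back to P2 so that it is not a dangling node and the iteration converges to something meaningful.

```python
def simple_pagerank(links, iterations=50):
    """Iterate PR(u) = sum over v in B_u of PR(v) / L(v).
    `links` maps each page to the pages it links to."""
    pages = list(links)
    pr = {p: 1.0 / len(pages) for p in pages}          # start at 0.25 each
    for _ in range(iterations):
        new = {p: 0.0 for p in pages}
        for v, outs in links.items():
            share = pr[v] / len(outs)                  # split the vote over outlinks
            for u in outs:
                new[u] += share
        pr = new
    return pr

# P2 links to P1 and P3, P3 links to P1, P4 links to P1, P2 and P3.
# The P1 -> P2 link is an assumption added here to avoid a dangling node.
links = {"P1": ["P2"], "P2": ["P1", "P3"], "P3": ["P1"], "P4": ["P1", "P2", "P3"]}
print(simple_pagerank(links))
```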

Web page recommendation is an important part of web personalization. Weighted association rule mining is used to predict the page recommendations. In this method a weight is assigned to each web page, and the importance of a page is determined by that weight. The weight of each page is based on the frequency with which the page is visited and the time the user spends on it.

The frequency weight FW of a page p is calculated using the following formula:

FW(p) = (number of visits on page p / total number of visits on all pages) × PR(p)

where PR(p) is the page rank of p.

The time spent on each page reflects the relative importance of that page for a user, because the user spends more time on pages of interest and quickly traverses unwanted pages. Two factors have to be considered when calculating the actual time spent on a page: the size of the web page and the transfer rate. Assuming the transfer rate is constant for a user, the time spent on a page is normalized by the page size, so that a page does not appear more important simply because it delivers more content and therefore takes longer to load and read.

The time weight TW of a page p is calculated using the following formula:

TW(p) = (time spent on page p / size of page p) / max_{q ∈ P} (time spent on page q / size of page q)

Based on these two values, the total page weight is calculated as:

W(p) = FW(p) + TW(p)
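
Putting the two weights together is straightforward, as the sketch below shows; all input values (visit counts, times, sizes, page ranks) are invented for illustration.

```python
def page_weights(visits, total_visits, time_spent, page_size, pagerank):
    """Combine frequency weight and time weight as described above.
    All inputs are dicts keyed by page."""
    # frequency weight: visit share scaled by the page's PageRank
    fw = {p: (visits[p] / total_visits) * pagerank[p] for p in visits}
    # time weight: time per byte, normalised by the best page
    time_per_size = {p: time_spent[p] / page_size[p] for p in visits}
    best = max(time_per_size.values())
    tw = {p: time_per_size[p] / best for p in visits}
    return {p: fw[p] + tw[p] for p in visits}

visits     = {"/a": 30, "/b": 10, "/c": 60}
time_spent = {"/a": 120.0, "/b": 15.0, "/c": 300.0}      # seconds
page_size  = {"/a": 40_000, "/b": 20_000, "/c": 60_000}  # bytes
pagerank   = {"/a": 0.4, "/b": 0.2, "/c": 0.4}
print(page_weights(visits, total_visits=100, time_spent=time_spent,
                   page_size=page_size, pagerank=pagerank))
```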

According to the PageRank algorithm, links to an important page appear as outbound links on many other pages.

Web prefetching reduces latency: the network's idle time is used to fetch anticipated web pages. In [7] Chen et al. showed that the cache hit ratio was enhanced to 30-75% through pre-fetching, and access latency can be reduced by 60% when caching and pre-fetching are combined [8]. Pre-fetching takes place only if the network bandwidth usage is below a predetermined threshold, and only web pages that are not available in the cache are pre-fetched. Pre-fetching increases network traffic, but at the same time it helps to reduce latency.

Several approaches are available for web pre-fetching, such as the Top-10 approach and the Domain-top approach. In the Top-10 approach, web servers periodically update web proxies with information about the most popular documents, and the proxies then forward this information to the clients [9]. In the Domain-top approach, web proxies first search for the popular domains and then look for the important documents in each domain; a suggestion list is prepared for the user based on the proxy server's knowledge of the popularity of domains and documents, and this list is used for the user's future requests.

In the dynamic web pre-fetching technique, a user preference list is maintained for each user, containing the web sites that are available to the user for immediate access; this list is stored in the database of the proxy server. The technique uses intelligent agents to monitor the network traffic. Whenever the network traffic is low the system increases pre-fetching, and under heavy traffic it reduces pre-fetching, thereby utilizing the idle time of the network while keeping the traffic roughly constant. The number of web links to be pre-fetched depends on the bandwidth usage and the weights of the web pages; when assigning weights to the pages, preference is given to links that are accessed frequently and recently. Using this technique the cache hit ratio has been increased by 40-75% and latency is reduced to 20-63%.
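
A simple way to picture the traffic-aware decision is sketched below: the pre-fetch budget shrinks as bandwidth usage approaches a threshold, and the highest-weighted pages from the user's preference list are chosen first. The threshold, budget sizes and weights are illustrative assumptions, not values from the cited system.

```python
def pages_to_prefetch(preference_list, weights, bandwidth_usage,
                      threshold=0.6, max_low=10, max_high=2):
    """Pick which pages from a user's preference list to pre-fetch,
    given the current fraction of bandwidth in use (0.0 - 1.0)."""
    if bandwidth_usage >= threshold:
        budget = max_high                  # heavy traffic: pre-fetch very little
    else:
        # scale the budget with the amount of idle bandwidth
        idle = (threshold - bandwidth_usage) / threshold
        budget = max(max_high, int(max_low * idle))
    ranked = sorted(preference_list, key=lambda p: weights.get(p, 0.0), reverse=True)
    return ranked[:budget]

weights = {"/a": 1.3, "/b": 0.9, "/c": 0.4, "/d": 1.1}
print(pages_to_prefetch(["/a", "/b", "/c", "/d"], weights, bandwidth_usage=0.2))
print(pages_to_prefetch(["/a", "/b", "/c", "/d"], weights, bandwidth_usage=0.8))
```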


Log file modelling is an important task in web usage mining: the more accurate the model of the web log file, the more accurate the web page prediction scheme. The most commonly used model is the Markov model, in which each page represents a state and each visited pair of pages represents a transition between two states. The accuracy of the traditional first-order Markov model is low because of the lack of in-depth analysis; the second-order Markov model is more accurate, but its time complexity is high. In [11] a hybrid approach called the dynamic nested Markov model (DNMM) was proposed, in which the second-order Markov model is nested within the first-order model. In the dynamic model the insertion and removal of nodes is much easier. Each node contains all the information about a web page: the page name, an inlink list holding the names of the previous pages together with counts of how many times the current page was reached from each of them, and an outlink list holding the names of the next pages and their counts. In this model the number of nodes is always the same as the number of web pages. Since the transition matrix of the traditional Markov model is replaced by dynamic linked lists, the time complexity of the proposed model is lower than that of the traditional model. Also the model covers
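
The sketch below illustrates only the per-node bookkeeping (inlink and outlink counts) and a plain first-order prediction; the nesting of the second-order model inside the first-order one is not shown, and the class and session data are illustrative.

```python
from collections import defaultdict

class PageNode:
    """One node of the model sketched above: for a page, it keeps how
    often it was reached from each previous page (inlinks) and how
    often each next page followed it (outlinks)."""
    def __init__(self, name):
        self.name = name
        self.inlinks = defaultdict(int)    # previous page -> count
        self.outlinks = defaultdict(int)   # next page -> count

def build_model(sessions):
    nodes = {}
    for seq in sessions:
        for prev, curr in zip(seq, seq[1:]):
            nodes.setdefault(prev, PageNode(prev)).outlinks[curr] += 1
            nodes.setdefault(curr, PageNode(curr)).inlinks[prev] += 1
    return nodes

def predict_next(nodes, current_page):
    """First-order prediction: most frequent successor of the current page."""
    node = nodes.get(current_page)
    if not node or not node.outlinks:
        return None
    return max(node.outlinks, key=node.outlinks.get)

sessions = [["A", "B", "C"], ["A", "B", "D"], ["A", "B", "C"]]
model = build_model(sessions)
print(predict_next(model, "B"))   # -> "C"
```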

The experiment was conducted with a web site that serves 1,200 users and receives at least 10,000 requests per day. The experimental data was split into three sets: DS1 with 3,000 pages, DS2 with 1,000 pages and DS3 with 1,500 pages. Model generation took 537 ms for DS1, 62 ms for DS2 and 171 ms for DS3, so it is evident that the time taken for DNMM generation grows with the number of web pages and the size of the log file.

Latency can also be reduced by client-side pre-fetching. A prefetching model proposed by Jaing [3] is based on the user's search pattern and the access rate of all the links on a web page. Each link has a counter that is incremented whenever a user clicks it, and the access rate is the ratio of the link's counter value to the counter value of the page itself. The pre-fetcher fetches the web pages whose access rate is high. The main advantage of this model is that it can be executed independently on the client's machine; the disadvantage is that it increases the processing overhead on the client's computer.
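
The access-rate computation is just a ratio of counters, as sketched below; the counter values and the pre-fetch threshold are illustrative assumptions.

```python
def access_rates(page_counter, link_counters):
    """Access rate of each link on a page: the link's click counter
    divided by the page's own counter."""
    if page_counter == 0:
        return {}
    return {link: count / page_counter for link, count in link_counters.items()}

rates = access_rates(page_counter=200,
                     link_counters={"/news": 120, "/about": 10, "/shop": 55})
# pre-fetch the links whose access rate exceeds some chosen threshold
print([link for link, r in rates.items() if r > 0.25])
```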

Originally a web page consisted of a single HTML document, possibly including some images, but nowadays several HTML documents are often embedded in a single web page, and the browser displays the embedded documents along with the requested one. These embedded documents decrease the prediction accuracy of the system. Also, if the user requests a page by typing the URL into the browser's navigation bar, the request is not taken into account by any link analysis method. To overcome these drawbacks Kim proposed a prefetching algorithm in which the request patterns are represented by a link graph. The nodes of the graph represent the unique URLs of the HTML documents, and the edges represent hyperlinks or embedded links, directed from the referring document to the referred document. When a user requests a web page, the access counter of the node corresponding to that page or document is incremented by one; when a user traverses from one page to another, the access counter of the corresponding edge is incremented by one. The user is assumed to be reading the page displayed in the browser if no further request is made within a minimum interval of time; during this time the prefetching module is executed and the prefetched documents are stored in the cache.
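
A minimal sketch of such a request-pattern graph is given below: nodes carry access counters, directed edges carry traversal counters, and the pre-fetch candidates for the currently displayed page are its most frequently followed outgoing edges. The class layout and the candidate-selection rule are illustrative, not the exact data structures of the cited scheme.

```python
from collections import defaultdict

class LinkGraph:
    """Request-pattern graph: nodes are document URLs with access
    counters, directed edges carry counters for traversals from the
    referring to the referred document."""
    def __init__(self):
        self.node_count = defaultdict(int)
        self.edge_count = defaultdict(int)     # (referrer, target) -> count

    def record(self, url, referrer=None):
        self.node_count[url] += 1
        if referrer is not None:
            self.edge_count[(referrer, url)] += 1

    def candidates(self, current_url, top=3):
        """Documents most often reached from the currently displayed page,
        i.e. what the pre-fetcher would load while the user is reading."""
        outgoing = [(t, c) for (s, t), c in self.edge_count.items() if s == current_url]
        return [t for t, _ in sorted(outgoing, key=lambda x: x[1], reverse=True)[:top]]

g = LinkGraph()
g.record("/home")
g.record("/news", referrer="/home")
g.record("/news", referrer="/home")
g.record("/about", referrer="/home")
print(g.candidates("/home"))
```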

Agarwal, R. (2010). An Architectural Framework for Web Information Retrieval based on User’s Navigational Pattern. Time, 195-200.

Dimopoulos, C., Makris, C., Panagis, Y., Theodoridis, E., & Tsakalidis, A. (2010). A web page usage prediction scheme using sequence indexing and clustering techniques. Data & Knowledge Engineering, 69(4), 371-382. Elsevier B.V. doi: 10.1016/j.datak.2009.04.010.

Georgakis, A., & Li, H. (2006). User behavior modeling and content based speculative web page prefetching. Data & Knowledge Engineering, 59(3), 770-788. doi: 10.1016/j.datak.2005.11.005.

Jalali, M., Mustapha, N., Mamat, A., & Sulaiman, N. B. (2008). A new classification model for online predicting users future movements. Architecture, 0-6.

Kim, Y., & Kim, J. (2003). Web Prefetching Using Display-Based Prediction. Science And Technology, 0-3.

Liu, H., & Keselj, V. (2007). Combined mining of Web server logs and web contents for classifying user navigation patterns and predicting users’ future requests. Data & Knowledge Engineering, 61(2), 304-330. doi: 10.1016/j.datak.2006.06.001.

Nair, A. S. (2007). Dynamic Web Pre-fetching Technique for Latency Reduction. Science, 202-206. doi: 10.1109/ICCIMA.2007.303.

Nigam, B., & Jain, S. (2010). Generating a New Model for Predicting the Next Accessed Web Page in Web Usage Mining. 2010 3rd International Conference on Emerging Trends in Engineering and Technology, 485-490. IEEE. doi: 10.1109/ICETET.2010.56.

Pallis, G., Vakali, A., & Pokorny, J. (2008). A clustering-based prefetching scheme on a Web cache environment. Computers & Electrical Engineering, 34(4), 309-323. doi: 10.1016/j.compeleceng.2007.04.002.

Park, S., Suresh, N., & Jeong, B. (2008). Sequence-based clustering for Web usage mining: A new experimental framework and ANN-enhanced K-means algorithm. Data & Knowledge Engineering, 65(3), 512-543. doi: 10.1016/j.datak.2008.01.002.

Chakrabarti, S., van den Berg, M., & Dom, B. (1999). Focused crawling: A new approach to topic-specific Web resource discovery. In Proceedings of the 8th International World Wide Web Conference (WWW8).

Gulli, A., & Signorini, A. (2005). The indexable web is more than 11.5 billion pages. In Special Interest Tracks and Posters of the 14th International Conference on World Wide Web, Chiba, Japan.

Banerjee, A., & Ghosh, J. (2001). Clickstream clustering using weighted longest common subsequences. In Proceedings of the Web Mining Workshop at the 1st SIAM Conference on Data Mining.

Cadez, I., Heckerman, D., Meek, C., Smyth, P., & White, S. (2000). Visualization of navigation patterns on a Web site using model-based clustering. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2000), Boston, MA, pp. 280-284.

Fu, Y., Sandhu, K., & Shih, M.-Y. (2000). A generalization-based approach to clustering of Web usage sessions. In B. Masand & M. Spiliopoulou (Eds.), Web Usage Analysis and User Profiling: International WEBKDD'99 Workshop, San Diego, CA, Lecture Notes in Computer Science, vol. 1836 (pp. 21-38). Springer, Berlin/Heidelberg.
