Web Usage Mining for Web Page Recommendation
A Survey On Web Usage Mining For Web Page Recommendation Using Biclustering
ABSTRACT
The World Wide Web contains a growing number of websites, each of which contains a growing number of web pages. A user who visits a new website often has to go through a large number of pages to find what they need. Web usage mining is the process of extracting useful knowledge from server logs. This knowledge can be applied to target marketing and to the design of web portals. A recommender system is one of the most useful applications of web usage mining: it reduces the effort users spend meeting their requirements by recommending pages of interest to them. This paper surveys different clustering and biclustering techniques and discusses the advantages of the biclustering approach over the traditional clustering approach.
Keywords: Web usage mining, Recommender system, Biclustering
I. INTRODUCTION
The World Wide Web stores, shares, and distributes information on a large scale. The web has a large number of users, who face problems such as information overload caused by the rapid growth in both the amount of information and the number of users. As a result, providing web users with exactly the information they need is becoming a critical issue in web applications. Web mining extracts interesting patterns or knowledge from web data. It is classified into three types: web content mining, web structure mining, and web usage mining. Web usage mining is one of the most important areas of web mining and deals with the extraction of useful knowledge from web usage data. It can be performed on different kinds of datasets, which take the form of log files. These log files can be stored at the server side, the proxy side, or the client side; server-side log files are the ones most commonly used. Before mining, the log files are pre-processed; the overall process comprises pre-processing, pattern discovery, and pattern analysis. Data mining techniques such as association rule mining, sequential pattern analysis, classification, and clustering are used to mine the web usage data. The mined knowledge can be helpful in different web applications, such as personalization of web content, support for site design, and e-commerce.
In this paper we discuss the clustering technique of data mining as applied to web usage data. Clustering is one of the important data mining techniques for discovering usage patterns from web usage data. Users with the same browsing pattern are placed in the same cluster, and other users are placed in different clusters. In this survey we consider a biclustering algorithm based on genetic algorithms (GAs) for effective clustering. In general, a genetic algorithm is a search heuristic that mimics the process of natural selection. This heuristic (sometimes also called a metaheuristic) is routinely used to generate useful solutions to optimization and search problems [10]. We therefore believe that a clustering technique combined with a genetic algorithm can provide relevant clusters more effectively.
A traditional clustering method clusters users according to the similarity of their browsing behaviour across all pages. However, it is often the case that some users behave similarly only on a subset of pages. For example, consider the user page matrix in Table-1 [2].
TABLE-1: USER PAGE MATRIX
User | Page 1 | Page 2 | Page 3 | Page 4 |
User 1 | 5 | 3 | 6 | |
User 2 | 1 | 2 | 4 | 7 |
User 3 | 1 | 1 | 2 | 6 |
User 4 | 5 | 8 | 11 | |
When all pages are considered, users 1, 2, and 4 do not show similar behaviour, since their hit count values are uncorrelated under page 2: while users 1 and 2 have an increased hit count from page 1 to page 2, the hit count of user 4 drops from page 1 to page 2. However, these users behave similarly under pages 1, 3, and 4, since all their hit count values increase from page 1 to page 3 and increase again for page 4. A traditional clustering method will fail to recognize such a cluster, since it requires the three users to behave similarly under all pages, which is not the case [2]. Biclustering, or two-way clustering, was introduced to overcome this problem. It was first introduced by Hartigan, who called it direct clustering [1]. The following sections describe some of the clustering and biclustering methods, together with the genetic algorithm, available in the literature.
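The argument above can be checked mechanically. The following is a minimal sketch, not taken from [2]: the hit counts are hypothetical (they are not the values of Table-1), and the helper names `trend` and `same_trend` are introduced here only for illustration. It compares the rise/fall pattern of each user over all pages and over a subset of pages.

```python
def trend(row, pages):
    """Return the rise (+1) / fall (-1) / flat (0) pattern of hit counts over the chosen pages."""
    picked = [row[p] for p in pages]
    return tuple((b > a) - (b < a) for a, b in zip(picked, picked[1:]))


def same_trend(matrix, users, pages):
    """True when every listed user shows the same trend over the listed pages."""
    return len({trend(matrix[u], pages) for u in users}) == 1


# Hypothetical hit counts (columns are pages 1 to 4); NOT the values of Table-1.
hits = {
    "user_a": [1, 2, 4, 7],
    "user_b": [2, 3, 5, 8],
    "user_c": [5, 3, 8, 11],
}
users = list(hits)

# Over all four pages the users disagree, because user_c drops on page 2.
all_pages = same_trend(hits, users, [0, 1, 2, 3])

# Over pages 1, 3 and 4 (indices 0, 2, 3) all three users rise twice.
subset = same_trend(hits, users, [0, 2, 3])
```

A method that compares users over every column sees only the disagreement; restricting the comparison to a page subset exposes the shared pattern, which is what biclustering searches for.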
II. LITERATURE SURVEY
2.1 WEB MINING
Web mining is categorized into three areas: Web usage mining, Web content mining, and Web structure mining [6]. Web usage mining makes use of the logs generated by the Web server to make sense of users’ behaviour on the Web. The logs captured by web servers are the primary source of data in web usage mining, and they are important because they explicitly record the browsing behaviour of site visitors. The greatest advantage of web server logs is that they are records of what people have actually done, and not of what they might do or thought they did [4].
Web personalization based on Web usage mining involves three phases: data preparation and transformation, pattern discovery, and recommendation. In the first phase, the web server logs undergo an intensive pre-processing stage that removes all irrelevant information and prepares the logs for pattern discovery, from which the user profile is derived. A previous study used frequency and duration as indicators of the degree of interest of a Web page to a user in a session. A separate study indicates that contiguous sequential patterns found in frequent navigational paths are more suitable for predictive tasks, such as predicting which item the user will access next during navigation. Recent studies on sequential patterns in web log data show that ordered sequences of events can reveal web users’ navigational patterns [4].
Web content mining is the process of extracting knowledge from the content of Web documents [6]. One of the challenges in Web content mining is to extract useful information from the pages. This stage is known as Web content cleaning. A Web page typically contains a mixture of many kinds of information, such as the main content, advertisements, navigation panels, and copyright notices [5]. Web content mining techniques alone are unable to handle dynamic content changes in news sites. On the other hand, personalization based on web usage alone cannot reflect changes in site content, because these changes are not recorded in the Web logs. As Web usage mining and Web content mining both have limitations, combining the two areas harnesses the strengths of both for personalization [4].
2.2 WEB LOG
A Web log is a file to which the Web server writes information each time a user requests a resource from the site. All of the users’ access activities on a website are recorded by its WWW server and stored in the Web server logs. Each access record contains the client IP address, request time, requested URL, user ID, HTTP status code, and so on. A Web log thus consists of records of attribute values. The information contained in web logs has been used in many different ways. In various studies, researchers and search engine administrators have used information from web logs to learn about the search process and to improve search engines. Besides learning about search engines and their users, query web logs are also being used to infer semantic concepts or relations [3].
2.3 DATA COLLECTION
There are three main sources of raw log data: 1) Client Log File, 2) Proxy Log File, and 3) Web Server Log File.
Web Server Log File:
The most significant and frequently used source for web usage mining is web server log data. This data is generated automatically by the web server when it services user requests, and it contains all the information about visitors’ activity. The common server log file types are the access log, agent log, error log, and referrer log [7]. Table-2 summarizes each.
TABLE-2: WEB SERVER LOG FILE TYPES AND CONTENT [7]
Log File Type | What it records |
Access log | All resource access requests sent by users |
Agent log | User’s browser, version, OS, etc. |
Error log | Details of errors that occurred while processing user access requests |
Referrer log | Information about the referrer page |
Depending on the web server, web log files vary in the number and type of attributes and in format. The W3C maintains a standard log file format; however, custom log file formats can also be configured. Many formats are available, such as 1. Common log format, 2. Extended common log format, 3. Centralized log format, 4. NCSA common log format, 5. ODBC logging, and 6. Centralized binary logging. Of these, the common and extended log formats are the ones mainly implemented by web servers [7].
Common Log Format (CLF) may contain the following fields:
[host/IP rfcname logname [DD/MMM/YYYY:HH:MM:SS -0000] “METHOD /PATH HTTP/1.0” bytes] [7]
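A log line in this format can be split into its fields with a regular expression. The sketch below is illustrative only. It assumes the standard NCSA layout, in which an HTTP status code sits between the quoted request and the byte count. The sample line, the pattern `CLF_PATTERN`, and the function `parse_clf` are hypothetical names introduced here and do not come from [7].

```python
import re
from datetime import datetime

# Field names follow the list above; "status" is assumed from the standard NCSA layout.
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<rfcname>\S+) (?P<logname>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)


def parse_clf(line):
    """Parse one CLF line into a dict, or return None if the line does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Timestamps look like 21/Dec/2014:18:57:00 +0000.
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    # A dash means that no response body was sent.
    record["bytes"] = 0 if record["bytes"] == "-" else int(record["bytes"])
    return record


# Made-up sample line for illustration.
sample = '192.0.2.1 - alice [21/Dec/2014:18:57:00 +0000] "GET /index.html HTTP/1.0" 200 2326'
record = parse_clf(sample)
```

Records parsed this way give the (user, URL, time) triples that the later pre-processing and clustering steps work on; lines that do not match are returned as `None` so that they can be filtered out during cleaning.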
2.4 RECOMMENDATION SYSTEM
Recommender systems or recommendation systems are a subclass of information filtering system that seek to predict the ‘rating’ or ‘preference’ that a user would give to an item. The most popular application areas are probably movies, music, news, books, research articles, search queries, social tags, and products in general. However, there are also recommender systems for experts, jokes, restaurants, financial services, life insurance, persons (online dating), and Twitter followers [9].
Various data mining techniques are applied in web recommendation systems, starting with the pre-processing of web server log data.
III. METHODS AND MATERIALS
3.1 BICLUSTER
Bicluster Types [8]
Different biclustering algorithms have different definitions of bicluster.
1) Bicluster with constant values (a),
2) Bicluster with constant values on rows (b) or columns (c),
3) Bicluster with coherent values (d).
a | a | a | a |
a | a | a | a |
a | a | a | a |
a | a | a | a |
(a)
a | a | a | a |
a+i | a+i | a+i | a+i |
a+j | a+j | a+j | a+j |
a+k | a+k | a+k | a+k |
(b)
a | a+i | a+j | a+k |
a | a+i | a+j | a+k |
a | a+i | a+j | a+k |
a | a+i | a+j | a+k |
(c)
a | b | c | d |
a+i | b+i | c+i | d+i |
a+j | b+j | c+j | d+j |
a+k | b+k | c+k | d+k |
(d)
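The four patterns can be told apart with a few simple checks. The sketch below is an illustration under stated assumptions: the numeric values chosen for a, b, c, d and for the offsets i, j, k are arbitrary, and the function names are introduced here and are not taken from [8]. Pattern (d) is read as the additive model, in which every row equals the first row plus a row-specific offset.

```python
def is_constant(block):
    """(a) every cell holds the same value."""
    return len({value for row in block for value in row}) == 1


def has_constant_rows(block):
    """(b) each row repeats a single value."""
    return all(len(set(row)) == 1 for row in block)


def has_constant_columns(block):
    """(c) each column repeats a single value."""
    return has_constant_rows(list(zip(*block)))


def is_additive_coherent(block):
    """(d) every row equals the first row plus one row-specific offset."""
    first = block[0]
    return all(len({v - f for v, f in zip(row, first)}) == 1 for row in block)


# Arbitrary illustrative values for the symbols used in the patterns.
a, b, c, d = 1, 4, 2, 7
i, j, k = 10, 20, 30

coherent = [[a + off, b + off, c + off, d + off] for off in (0, i, j, k)]
constant_rows = [[a + off] * 4 for off in (0, i, j, k)]
```

Patterns (a), (b), and (c) are special cases of the additive model, so `is_additive_coherent` also accepts them; the stricter checks are needed only to tell the special cases apart.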
3.2 CLICKSTREAM DATA PATTERN
Clickstream data is a sequence of Uniform Resource Locators (URLs) browsed by a user within a particular period of time. By analyzing these data we can discover web users with similar browsing patterns. The data require some pre-processing before they can be analyzed [1].
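One common pre-processing step is to turn the clickstream into a user-by-page matrix. The sketch below assumes that each cell holds a hit count, as in Table-1; [1] may weight the cells differently. The click data and the function name `build_access_matrix` are hypothetical.

```python
from collections import Counter


def build_access_matrix(clicks):
    """Build a user-by-page hit-count matrix from (user, url) pairs."""
    counts = Counter(clicks)
    users = sorted({user for user, _ in counts})
    pages = sorted({page for _, page in counts})
    # A Counter returns 0 for (user, page) pairs that never occurred.
    matrix = [[counts[(user, page)] for page in pages] for user in users]
    return users, pages, matrix


# Made-up clickstream in browsing order.
clicks = [
    ("u1", "/home"), ("u1", "/news"), ("u1", "/home"),
    ("u2", "/news"), ("u2", "/sports"),
]
users, pages, A = build_access_matrix(clicks)
```

Rows of `A` follow the order of `users` and columns follow the order of `pages`, so the matrix can be passed directly to a clustering routine that works on rows or columns.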
3.3 INITIAL BICLUSTERS[1]
The K-Means clustering method is applied to the web user access matrix A(U, P) along both dimensions separately to generate ku user clusters and kp page clusters. The results are then combined to obtain ku × kp small co-regulated submatrices called biclusters. These correlated biclusters are also called seeds.
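The seed-generation step can be sketched as follows. This is a simplified illustration, not the implementation of [1]: the k-means routine is a plain Lloyd iteration that takes the first k points as initial centroids (a production version would use random restarts or a better initialization), and the matrix and the names `kmeans` and `seed_biclusters` are hypothetical.

```python
def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means; the first k points serve as initial centroids (a simplification)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def seed_biclusters(A, ku, kp):
    """Cluster rows and columns separately, then pair every user cluster with every page cluster."""
    user_labels = kmeans(A, ku)
    page_labels = kmeans([list(col) for col in zip(*A)], kp)
    seeds = []
    for uc in range(ku):
        rows = [r for r, lab in enumerate(user_labels) if lab == uc]
        for pc in range(kp):
            cols = [c for c, lab in enumerate(page_labels) if lab == pc]
            if rows and cols:
                seeds.append((rows, cols))
    return seeds


# Hypothetical access matrix: users 0 and 2 favour pages 0 and 2, users 1 and 3 favour pages 1 and 3.
A = [
    [9, 0, 8, 1],
    [1, 8, 0, 9],
    [8, 1, 9, 0],
    [0, 9, 1, 8],
]
seeds = seed_biclusters(A, ku=2, kp=2)
```

Each seed is a pair of row and column index lists, so at most ku × kp seeds are produced; empty clusters are skipped.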
3.4 COHERENT BICLUSTERING FRAMEWORK USING GENETIC ALGORITHM (GA) [1]
Usually, a GA is initialized with a population of random solutions. In the approach of [1], the genetic algorithm is instead applied, after the greedy local search procedure, to the resulting biclusters in order to obtain the optimum bicluster. Starting from these biclusters results in faster convergence than random initialization.
Algorithm: Evolutionary Biclustering Algorithm [1]
Input: Set of enlarged and refined seeds
Output: Optimal bicluster
Step 1. Initialize the population.
Step 2. Evaluate the fitness of individuals.
Step 3. For i = 1 to max_iteration
    Selection()
    Crossover()
    Mutation()
    Evaluate the fitness
End (For)
Step 4. Return the optimal bicluster.
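The steps above can be sketched in Python. Everything beyond the step structure is an assumption made for illustration and is not confirmed by [1]: a bicluster is encoded as a bit string (one bit per user followed by one bit per page), fitness is the bicluster volume when its mean squared residue stays below a threshold `delta` and zero otherwise, selection is a binary tournament, crossover is one-point, and mutation flips bits. The matrix, seeds, and parameter values are hypothetical.

```python
import random


def mean_squared_residue(A, rows, cols):
    """Residue of a submatrix; it is 0 for a perfectly additive-coherent block."""
    n, m = len(rows), len(cols)
    row_mean = {r: sum(A[r][c] for c in cols) / m for r in rows}
    col_mean = {c: sum(A[r][c] for r in rows) / n for c in cols}
    total = sum(row_mean.values()) / n
    return sum(
        (A[r][c] - row_mean[r] - col_mean[c] + total) ** 2 for r in rows for c in cols
    ) / (n * m)


def fitness(A, bits, delta):
    """Assumed objective: volume of the bicluster if it is coherent enough, else 0."""
    n_rows = len(A)
    rows = [r for r in range(n_rows) if bits[r]]
    cols = [c for c in range(len(A[0])) if bits[n_rows + c]]
    if len(rows) < 2 or len(cols) < 2:
        return 0.0
    if mean_squared_residue(A, rows, cols) > delta:
        return 0.0
    return float(len(rows) * len(cols))


def evolve(A, seeds, delta=0.5, max_iteration=50, mutation_rate=0.05, rng=None):
    rng = rng or random.Random(0)

    def score(bits):
        return fitness(A, bits, delta)

    def select(population):
        # Binary tournament selection (an assumed operator).
        first, second = rng.sample(population, 2)
        return first if score(first) >= score(second) else second

    population = [list(seed) for seed in seeds]        # Step 1: initialise from the seeds
    best = max(population, key=score)                  # Step 2: evaluate the fitness
    for _ in range(max_iteration):                     # Step 3
        children = []
        for _ in population:
            p1, p2 = select(population), select(population)
            cut = rng.randrange(1, len(p1))            # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation.
            child = [1 - bit if rng.random() < mutation_rate else bit for bit in child]
            children.append(child)
        population = children
        best = max(population + [best], key=score)     # keep the best individual seen so far
    return best                                        # Step 4


# Hypothetical matrix: rows 0-2 and columns 0-2 form an additive-coherent block.
A = [
    [1, 2, 3, 9],
    [2, 3, 4, 0],
    [3, 4, 5, 7],
    [8, 0, 6, 1],
]
# Seeds as bit strings: four row bits followed by four column bits.
seeds = [
    [1, 1, 0, 0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0, 1, 1, 0],
    [1, 0, 1, 0, 1, 0, 1, 0],
    [0, 0, 1, 1, 0, 0, 1, 1],
]
best = evolve(A, seeds)
```

Because the best individual seen so far is carried across iterations, the returned bicluster is never worse than the best seed under the assumed fitness. How close it comes to the largest coherent block depends on the population size, the operators, and the random seed.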
Using the above algorithm, optimum biclusters that exhibit high coherence between web users and the pages they visit can be generated from web usage data. Analyzing these overlapping coherent biclusters can be very beneficial for direct marketing and target marketing, and is also useful for recommender systems, web personalization systems, web usage categorization, and user profiling. Companies can also use the interpretation of biclustering results in focalized marketing campaigns to improve business performance [1].
IV. CONCLUSION
The biclustering approach overcomes the problem associated with traditional clustering methods by revealing the high coherence between web users and the subset of pages they visit. The results of biclustering can be used in focalized marketing strategies such as direct marketing and target marketing. The recommendation system gives the website the pages most visited by all its users. It also identifies users who behave similarly on a subset of pages. It therefore targets improvements in the website’s design, information availability, and quality of service. Future work aims at extending this framework by using it as a pre-processing tool for a web page recommendation system.
REFERENCES
[1] R. Rathipriya, K. Thangavel, J. Bagyamani, “Evolutionary Biclustering of Clickstream Data,” IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 3, No. 1, May 2011.
[2] R. Rathipriya, K. Thangavel, J. Bagyamani, “Binary Particle Swarm Optimization based Biclustering of Web Usage Data,” International Journal of Computer Applications (0975 – 8887), Volume 25, No. 2, July 2011.
[3] Ravi Bhushan and Rajender Nath, “Recommendation of Optimized Web Pages to Users Using Web Log Mining Techniques,” Advance Computing Conference (IACC), Ghaziabad, February 22-23, IEEE, 2013.
[4] Husna Sarirah Husin, James A. Thom, Xiuzhen Zhang, “News Recommendation Based on Web Usage and Web Content Mining,” Data Engineering Workshops (ICDEW), Brisbane, QLD, April 8-12, IEEE, 2013.
[5] B. Liu and K. Chen-Chuan-Chang, “Editorial: special issue on web content mining,” SIGKDD Explor. Newsl., vol. 6, pp. 1-4, 2004.
[6] R. Kosala and H. Blockeel, “Web mining research: a survey,” SIGKDD Explor. Newsl., vol. 2, pp. 1-15, 2000.
[7] Chintan R. Varnagar, Nirali N. Madhak, Trupti M. Kodinariya, Rashmi S. Agrawal, “A Devised Framework for Content Recommendation System Using Collaborative Log Mining,” Automation, Computing, Communication, Control and Compressed Sensing (iMac4s), Kottayam, March 22-23, IEEE, 2013.
[8] Date: 21/12/2014, Time: 18:57:00
[9] Date: 21/12/2014, Time: 19:06:00
[10] Date: 21/12/2014, Time: 19:15:00