Context Oriented Search Engine Using Web Crawler Information Technology Essay

The number of web pages around the world has grown into the billions. To make searching easier for users, web search engines came into existence. Web search engines are used to find specific information on the World Wide Web; without them, it would be almost impossible to locate anything on the Web unless we already knew a specific URL address.

Every search engine maintains a central repository, or database, of HTML documents in indexed form. Whenever a user query comes in, searching is performed within that database of indexed web pages. The repository of a search engine cannot accommodate every page available on the WWW, so it is desirable that only the most relevant pages are stored, in order to increase the efficiency of the search engine. To store the most relevant pages from the World Wide Web, a suitable approach has to be followed by the search engine. This database of HTML documents is maintained by special software; the software that traverses the web to capture pages is called a "crawler" or "spider".

Purpose of System:

The proposed system is an attempt to design an information retrieval system: a search engine backed by a web crawler that searches the web quickly. Different types of search engines are available today, each following its own architecture and techniques, but from research and analysis the developer found web crawler based search engines to be the most efficient and effective. Seeing the demand for this kind of search engine, the developer decided to design a similar system with some extra features for ease of use.

Target Audience:

The types of target audience / users of the system are:

Students:

Education today is not limited to books, so students always want the option of using the internet to increase their knowledge. They love to surf the internet for answers to their queries.

Employees:

When developing projects, theses, seminar papers and so on, people feel the need for search engines where they can simply type a query and get solutions. In organizations, search engines are widely used for work purposes.

Teenagers:

Whether the information sought is about games, books or any other fun topic, a search engine fulfils all such demands, giving a lot of options from only one request.

Home users:

Housewives always want to know the latest trends and facilities available, and a search engine can serve them very well.

Rationale: The following highlights some of the most common benefits of adopting this system:

Tangible benefits:

Inexpensive to implement.

Users need to spend very little effort while retrieving information.

Users do not need to be experts to use it.

Intangible benefits:

Easy to use for less-educated users as well as professionals.

User friendly.

Leads to customer satisfaction.

The nature of the challenge in building such a system:

Building a search engine is not an easy task. The main challenge in developing such a system is understanding the basic concepts of searching algorithms as well as crawling. Integrating the web crawler with the search engine is itself a big challenge, as is understanding the techniques used behind the crawler and the search engine.

Learning Objectives

The main objective of this project is to develop a search engine with an automated web crawler which can serve users according to their requirements. Information retrieval systems are in very high demand among users, so, seeing this interest, the developer decided to develop this project.

As a result, in order to provide a highly acceptable search engine, the major components are highlighted as follows:

Core functionalities:

Keyword searching: The search engine performs its action based on determining keywords. It tries to pull out and index words that appear to be significant. The title of the page can give useful information about the document. Words that are mentioned towards the beginning of the document are given more weight, and words that appear several times are also given more weight. (A small scoring sketch is given after this list of functionalities.)

News search: This search engine will also provide a news search facility, which will be implemented using some external APIs.

Database management: This part covers crawler management and the way links are stored in the database. The first component is the spider, also called the crawler. The spider visits a web page, reads it, and then follows links to other pages within the site. This is what it means when someone refers to a site being "spidered" or "crawled."

Prioritization basis: The crawler indexes and stores links based on some priority level, e.g. page rank, back links, etc. Everything the spider finds goes into the second part of the search engine, the index. The index, sometimes called the catalog, is like a giant book containing a copy of every web page that the spider finds. If a web page changes, this book is updated with the new information.

User profiling: This feature enables users to search according to their priorities by specifying the field for their search. For example, if a user enters a keyword and wants to search within a particular field, they select that field and the results are shown accordingly.

Language tools: Provide different language options. This gives flexibility to users from different language backgrounds.
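To make the keyword-weighting idea above concrete, the following is a minimal C# sketch (not the actual implementation) of how a page's score for a single query term might be computed. The class name KeywordScorer and the specific weight values are illustrative assumptions only.

using System;

// Minimal illustration of keyword weighting: a title match, early
// occurrence and repetition all raise a page's score for a query term.
// The weight values themselves are arbitrary placeholders.
static class KeywordScorer
{
    public static double Score(string title, string body, string keyword)
    {
        var words = body.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        double score = 0;

        // Appearing in the title is a strong signal (weight chosen arbitrarily).
        if (title.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0)
            score += 5.0;

        for (int i = 0; i < words.Length; i++)
        {
            if (!string.Equals(words[i], keyword, StringComparison.OrdinalIgnoreCase))
                continue;
            score += 1.0;                  // every occurrence adds weight
            if (i < words.Length / 10)     // words near the start add extra weight
                score += 0.5;
        }
        return score;
    }
}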

Enhanced functionalities

Optional search: Provide the option to search from either live data or crawled data.

Exclusion of words: Some words or sites that are not meant to be seen or accessed by the general public are excluded because of the issues related to them.

Special features

Advanced search: Shows the page rank and any prohibited words along with the results.

Direct download: Provides links for direct downloading.

Learning Objective:

The primary learning objective of this project is the analysis of the algorithms and architecture behind crawlers and search engines. It will also help me understand the basic concepts of project management and HCIU principles, and it gives me wide scope to learn new technologies. Building a search engine requires thorough research and quite deep knowledge.

Chapter 2: Problem Description

Chapter 3: Literature Review

Domain Research

Working Of Web Search Engine

As the web of pages around the world is increasing day by day, the need for search engines has also emerged. In this chapter, we explain the basic components of any basic search engine along with its working. After this, the role of web crawlers, one of the essential components of any search engine, is explained.

2.1 Basic Web Search Engine

The World Wide Web is full of important information which can be useful for the millions of users on the internet today. Information seekers use a search engine to perform their search activity: they supply a list of keywords for the search and in return get a number of relevant web pages which contain the keywords they entered.

By a search engine in relation to the Web, we mean the software that performs the actual search over databases containing many HTML documents.

2.2 Types of Search Engine

Typically, three types of search engines exist:

Web crawler based: those that are powered by web spiders, also called web robots.

Human-powered directory: those that are controlled by humans.

Hybrid search engines: those that combine both approaches.

2.2.1 Crawler Based Search Engine

This is the type of search engine which uses automated software, called a web crawler, that visits websites and maintains the database accordingly.

A web crawler basically performs the following actions:

visits the website

reads the information

identifies all the links

adds them to the list of URLs to visit

returns the data to the database to be indexed

The search engine then uses this database to retrieve data for the query entered by the user.

The following figure illustrates the life span of a typical query, which requires a number of steps to be executed before results are shown to the user:


Figure: Life span of a query (images not reproduced).

2. The query travels to the doc servers, which actually retrieve the stored documents. Snippets are generated to describe each search result.

3. The search results are returned to the user in a fraction of a second.

2.2.2 Human-powered Search Engine

A search engine which depends on humans to submit information that is subsequently indexed and catalogued is known as a human-powered search engine. These are rarely used on a large scale.

2.2.3 Hybrid Search Engines

Such a search engine is a mixed type which combines the results of the web crawler type as well as the human-powered directory type. A hybrid search engine favours one type of listings over the other; MSN is an example.

http://abclive.in/abclive_investigative_news/search_engines_working.html

2.3 Structure & Working of Search Engine

2.3.1 Gathering also called “Crawling”

For this purpose, crawlers are used by the search engine to browse the web. They extract URLs from the web pages and give them to a controller module, which then decides which links to visit next and feeds those links back to the crawlers.

2.3.2 Maintaining Database/Repository

All the data of the search engine is stored in a database, as shown in Figure 2.1. All the searching is performed through that database and it needs to be updated frequently. During a crawling process, and after completing the crawling process, search engines must store all the new useful pages that they have retrieved from the Web. The page repository (collection) in Figure 2.1 represents this possibly temporary collection. Sometimes search engines maintain a cache of the pages they have visited beyond the time required to build the index. This cache allows them to serve out result pages very quickly, in addition to providing basic search facilities.
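As a rough illustration of the repository-plus-cache idea, the following C# sketch stores crawled pages by URL so that a cached copy can be served without re-fetching the live page. The PageRepository class and its method names are hypothetical, not taken from any described implementation.

using System.Collections.Generic;

// Minimal sketch of the page repository: crawled pages are stored by URL,
// and the same store doubles as a cache so that result pages (or cached
// copies) can be served without re-fetching the live page.
class PageRepository
{
    private readonly Dictionary<string, string> _pages = new Dictionary<string, string>();

    // Called by the crawler when a page has been fetched.
    public void Store(string url, string html) => _pages[url] = html;

    // Used when serving results or cached copies.
    public bool TryGetCached(string url, out string html) =>
        _pages.TryGetValue(url, out html);
}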

2.3.3 Indexing

Once the pages are stored in the repository, the next job of the search engine is to make an index of the stored data. The indexer module extracts all the words from each page and records the URL where each word occurred. The result is a generally very large "lookup table" that can provide all the URLs that point to pages where a given word occurs. The table is of course limited to the pages that were covered in the crawling process. As mentioned earlier, text indexing of the Web poses special difficulties, due to its size and its rapid rate of change. In addition to these quantitative challenges, the Web calls for some special, less common kinds of indexes. For example, the indexing module may also create a structure index, which reflects the links between pages.
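A minimal sketch of such a "lookup table" (an inverted index) in C# is shown below, assuming pages arrive as plain text already extracted from HTML. The InvertedIndex class and its method names are illustrative, not part of the proposed system.

using System;
using System.Collections.Generic;

// Minimal sketch of the indexer's "lookup table" (an inverted index):
// for every word, the set of URLs whose pages contain that word.
class InvertedIndex
{
    private readonly Dictionary<string, HashSet<string>> _table =
        new Dictionary<string, HashSet<string>>(StringComparer.OrdinalIgnoreCase);

    // Called once per crawled page stored in the repository.
    public void AddPage(string url, string text)
    {
        foreach (var word in text.Split(new[] { ' ', '\t', '\n', '.', ',' },
                                        StringSplitOptions.RemoveEmptyEntries))
        {
            if (!_table.TryGetValue(word, out var urls))
                _table[word] = urls = new HashSet<string>();
            urls.Add(url);
        }
    }

    // All URLs that contain the given word (empty if none).
    public IReadOnlyCollection<string> Lookup(string word) =>
        _table.TryGetValue(word, out var urls)
            ? urls
            : (IReadOnlyCollection<string>)Array.Empty<string>();
}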

Figure 2.2: Working steps of search engine [4]

2.3.4 Querying

This section deals with user queries. The query engine module is responsible for receiving and filling search requests from users. The engine relies heavily on the indexes, and sometimes on the page repository. Because of the Web's size, and the fact that users typically only enter one or two keywords, result sets are usually very large.

2.3.5 Ranking

Since the user query results in a large number of matches, it is the job of the search engine to display the most appropriate results to the user. To achieve this efficient searching, ranking of the results is performed. The ranking module therefore has the task of sorting the results such that results near the top are the most likely ones to be what the user is looking for. Once the ranking is done by the ranking component, the final results are displayed to the user. This is how any search engine works.
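As a hedged illustration of querying plus ranking, the sketch below reuses the hypothetical InvertedIndex from the indexing section: it looks up each keyword and sorts candidate URLs so that pages matching more of the query terms come first. Real ranking modules combine many more signals (link structure, term positions, freshness), so this is only a toy ordering.

using System.Collections.Generic;
using System.Linq;

// Minimal sketch of the query engine plus ranking module: look up each
// keyword in the index, then sort candidate URLs so that pages matching
// more of the query terms appear first.
static class QueryEngine
{
    public static List<string> Search(InvertedIndex index, IEnumerable<string> keywords)
    {
        var hits = new Dictionary<string, int>();           // url -> matched-term count
        foreach (var word in keywords)
            foreach (var url in index.Lookup(word))
                hits[url] = hits.TryGetValue(url, out var n) ? n + 1 : 1;

        return hits.OrderByDescending(kv => kv.Value)       // best matches first
                   .Select(kv => kv.Key)
                   .ToList();
    }
}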

3.3 Basic Crawling Terminology

Before we discuss the working of crawlers, it is worth explaining some of the basic terminology related to crawlers. These terms will be used in the forthcoming chapters as well.

3.3.1 Seed Page: By crawling, we mean traversing the Web by recursively following links from a starting URL or a set of starting URLs. This starting URL set is the entry point through which any crawler starts its searching procedure. This set of starting URLs is known as the "Seed Page". The selection of a good seed is the most important factor in any crawling process.

3.3.2 Frontier (Processing Queue): The crawling method starts with a given URL (seed), extracting links from it and adding them to an un-visited list of URLs. This list of un-visited links or URLs is known as the "Frontier". Each time, a URL is picked from the frontier by the crawler scheduler. The frontier is implemented using a Queue or a Priority Queue data structure. Maintaining the frontier is also a major function of any crawler.
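A minimal C# sketch of a frontier follows, using a plain FIFO queue plus a set of already-seen URLs; a priority queue keyed on page importance could be substituted, as the text notes. The Frontier class name and its members are assumptions for illustration.

using System.Collections.Generic;

// Minimal sketch of a frontier: a FIFO queue of un-visited URLs plus a
// set of everything already seen, so the same link is never enqueued twice.
// A priority queue keyed on page importance could replace the plain queue.
class Frontier
{
    private readonly Queue<string> _queue = new Queue<string>();
    private readonly HashSet<string> _seen = new HashSet<string>();

    public bool IsEmpty => _queue.Count == 0;

    public void Add(string url)
    {
        if (_seen.Add(url))      // Add returns false if the URL was seen before
            _queue.Enqueue(url);
    }

    public string Next() => _queue.Dequeue();
}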

3.3.3 Parser: Once a page has been fetched, we need to parse its content to extract information that will feed and possibly guide the future path of the crawler. Parsing may imply simple hyperlink/URL extraction, or it may involve the more complex process of tidying up the HTML content in order to analyse the HTML tag tree. The job of any parser is to parse the fetched web page, extract the list of new URLs from it, and return the new un-visited URLs to the frontier.
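The following is a simplified C# sketch of the parser's job, extracting href targets with a regular expression and resolving them against the page's own URL. This is an assumption-laden simplification: a production crawler would use a proper HTML parser to tidy the tag tree, as described above, and the LinkParser name is purely illustrative.

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Minimal sketch of the parser's job: pull href targets out of fetched HTML
// and resolve them against the page's own URL.
static class LinkParser
{
    private static readonly Regex Href =
        new Regex("href\\s*=\\s*[\"']([^\"'#]+)", RegexOptions.IgnoreCase);

    public static IEnumerable<string> ExtractLinks(string pageUrl, string html)
    {
        var baseUri = new Uri(pageUrl);
        foreach (Match m in Href.Matches(html))
        {
            // Resolve relative links ("/about") against the base URL.
            if (Uri.TryCreate(baseUri, m.Groups[1].Value, out var absolute))
                yield return absolute.ToString();
        }
    }
}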

3.4 Working of Basic Web Crawler

From the beginning, a key motivation for designing Web crawlers has been to retrieve web pages and add them or their representations to a local repository. Such a repository may then serve particular application needs such as those of a Web search engine. In its simplest form, a crawler starts from a seed page and then uses the external links within it to attend to other pages. The structure of a basic crawler is shown in Figure 3.1. The process repeats with the new pages offering more external links to follow, until a sufficient number of pages are identified or some higher-level objective is reached. Behind this simple description lies a host of issues related to network connections and parsing of fetched HTML pages to find new URL links.

Figure 3.1: Components of a web-crawler

A common web crawler implements a method composed of the following steps:

Acquire the URL of the web document to be processed from the processing queue

Download the web document

Parse the document's content to extract the set of URL links to other resources and update the processing queue

Store the web document for further processing

http://nazou.fiit.stuba.sk/home/?page=webcrawler

The basic working of a web-crawler can be discussed as follows:

1. Select a starting seed URL or URLs

2. Add it to the frontier

3. Pick a URL from the frontier

4. Fetch the web page corresponding to that URL

5. Parse that web page to find new URL links

6. Add all the newly found URLs to the frontier

7. Go to step 3 and repeat while the frontier is not empty

Thus a crawler will recursively keep on adding newer URLs to the database repository of the search engine. So we can see that the main function of a crawler is to add new links into the frontier and to select a new URL from the frontier for further processing after each recursive step.

The working of the crawlers can also be shown in the form of a flow chart (Figure 3.2). Note that it also depicts the 7 steps given earlier [8]. Such crawlers are called sequential crawlers because they follow a sequential approach.

In simple form, the crawling loop of a web crawler can be stated as below.
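Since the flow chart itself is not reproduced here, the loop can be sketched in C# instead. The sketch below reuses the hypothetical Frontier and LinkParser classes from the earlier sections and uses HttpClient for the download step; the page limit is arbitrary, and politeness delays, robots.txt handling and full error handling are deliberately omitted.

using System.Net.Http;
using System.Threading.Tasks;

// Minimal sequential crawler loop: seed the frontier, then repeatedly
// pick a URL, download the page, parse out new links, and add them back
// to the frontier until it is empty or a page limit is reached.
static class SimpleCrawler
{
    public static async Task CrawlAsync(string seedUrl, int maxPages = 100)
    {
        var frontier = new Frontier();
        frontier.Add(seedUrl);

        using var http = new HttpClient();
        int fetched = 0;

        while (!frontier.IsEmpty && fetched < maxPages)
        {
            string url = frontier.Next();
            string html;
            try { html = await http.GetStringAsync(url); }
            catch (HttpRequestException) { continue; }     // skip unreachable pages

            fetched++;
            // Here the page would be handed to the repository / indexer.
            foreach (var link in LinkParser.ExtractLinks(url, html))
                frontier.Add(link);
        }
    }
}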

Market Research:

Similar web system:

Chapter 4: Research Methods

Abstract:

INTRODUCTION

This primary research is being conducted in order to know your interests and requirements. It is very important for us to know your specifications so that the developer can proceed in the direction of your satisfaction. The purpose of conducting this questionnaire is to obtain information to keep on record, to make decisions about important issues, and to pass information on to others. Primarily, data is collected to provide information regarding a specific topic. The process provides both a baseline from which to measure and, in certain cases, a target for what to improve.

4.1 Primary Research

As the complete project will be used by many different people, the opinions and suggestions of the users are very important for the project. It was a basic necessity to understand the users' mindset regarding this kind of software. Therefore primary research was carried out in order to understand the users' requirements properly. The research was conducted under a defined set of standards mentioned in the ethical form (Fast Track Form).


To know the user requirements and develop the system to their satisfaction, it is necessary to involve users. The data-gathering techniques need to be decided depending on the type of system being developed and its users.

For gathering information and user requirements, the following three fact-finding techniques will be used throughout the research stages.

These techniques are listed below:

Research

Research involves going through already existing documentation and systems.

Usually there is a large amount of data that has already been collected by others, although it may not necessarily have been analysed or published. Locating these sources and retrieving the information is a good starting point in any data collection effort.

For example, analysis of existing search engines can be very useful for identifying problems with current approaches.

Interview

An interview is a data-collection technique that involves one-to-one questioning. Answers to the questions posed during an interview can be recorded by writing them down (either during the interview itself or immediately afterwards), by tape-recording the responses, or by a combination of both.

Interviews should be carried out in order to obtain opinions and perspectives from others who have experience in implementing and using such systems, allowing the developer to further enhance and refine the ideas and features of both the existing systems and the proposed system.

Questionnaire

Questionnaires are an inexpensive way to gather data from a potentially large number of respondents. Often they are the only feasible way to reach a number of reviewers large enough to allow statistical analysis of the results. A well-designed questionnaire that is used effectively can gather information on both the overall performance of the test system and on specific components of the system. If the questionnaire includes demographic questions on the participants, the results can be used to correlate performance and satisfaction with the test system among different groups of users.

Questionnaires for the pre-analysis of the system:

Question: What grade level do you work with?

Justification: This question will let me know the type of user answering the questionnaire and which users would use the proposed system.

Question: Do you think a search engine is the right tool for searching for your interests?

Justification: This question will help me know how much users prefer search engines for their queries.

Question: How frequently do you use a search engine?

Justification: From this question, I will come to know how frequently users use this kind of system, which will help me make the proposed system more efficient.

Question: Which search engine do you mostly use?

Justification: This will let me know which search engine users prefer and which of its features they like most.

Question: Which of the following color schemes would you prefer for the interface of the search engine?

Justification: Asking users about color schemes will let me know what kind of interface they want. Attractive interfaces draw users towards using the system.

Question: Are you satisfied with the working of the current search engines?

Justification: This question will help me understand the users' current mindset towards the existing systems, so that I can develop the proposed system to be much better than the existing ones.

Question: On what basis do you rate search engines?

Justification: From this question, I will come to know which factors affect users most, so that I can work on them to make the system meet user requirements.

Question: Do you want search engines to have the feature of different languages?

Justification: From this question, the developer will come to know whether users would like the language tools feature to be included in the search engine.

Question: How frequently do you perform a news search?

Justification: Knowing how frequently users perform this particular kind of search is important for the developer, so that the feature can be implemented properly.

Questionnaires for expert users of the system:

Question: What profession do you belong to?

Justification: This type of question lets us know which type of user is going to answer the given questions and what their expectations are.

Question: Which type of search engine do you prefer more?

Justification: Knowing what professionals think of the system and what they prefer helps the developer work in the right direction; this question also contrasts the importance of this particular type of system with others.

Question: Do you think crawling is the most effective technique for a search engine?

Justification: I want to know more about the crawling technique from this question and to see its advantages over others.

Question: Is a web crawler based search engine more efficient than others?

Justification: From this question, the view of the system as compared to others will become clearer. The developer will come to know whether the proposed system is worthwhile or not.

Interview questions for the system:

Question: What problems do you come across while using the existing search engines?

Justification: It is very important to know what problems users are facing with the existing systems, so that the developer can work more on those areas.

Question: What extra features do you suggest should be in the new search engine?

Justification: From this question, I will come to know about user expectations and requirements, as well as what else they want in the new system.

Question: According to you, what strategy should we follow in designing this search engine?

Justification: With this question, I want some more suggestions on the strategy to be followed, so that I can broaden my thinking about the system.

Question: What suggestions do you have for making the interface of the system more user-friendly (considering your convenience)?

Justification: The interface is one of the most important parts of the system and should be as interactive as possible for the user, so this question will help me design a more attractive as well as interactive user interface.

4.2 Secondary Research

4.2.1 Selection of methodology:

Methodology is a collection of processes, methods and tools for accomplishing an objective. Methodologies provide a checklist of key deliverables and activities to avoid missing key tasks. This consistency simplifies the process and reduces training. It also ensures all team members are marching to the same drummer.

Most IT projects use an SDLC that defines phases and specific activities within a typical project. These SDLCs reflect different approaches to completing the product deliverables. There are many different SDLCs that can be applied based on the type of project or product.

A suitable methodology needs to be selected for the project, which provides a framework through which we can manage the project more efficiently. Different software methodologies cater for different projects, because each project has its own characteristics and needs with regard to an appropriate process.

After a lot of research in this area, I selected the "Iterative Enhancement model" as the software development process model for the proposed system.

Incremental Model

The Incremental methodology is a derivative of the Waterfall. It maintains a series of phases which are distinct and cascading in nature. Each phase is dependent on the preceding phase before it can begin and requires a defined set of inputs from the prior phase. However, as the graphic below portrays, in the design phase development is broken into a series of increments that can be constructed sequentially or in parallel. The methodology then continues focusing only on achieving the subset of requirements for that development increment. The process continues all the way through Implementation. Increments can be discrete components (e.g., database build), functionality (e.g., order entry), or integration activities (e.g., integrating a Human Resources package with your Enterprise Resource Planning application). Again, subsequent phases do not change the requirements but rather build upon them in driving to completion.


Iterative Enhancement Model (Evolutionary)

The Evolutionary methodology also maintains a series of phases that are distinct and cascading in nature. As in the other methodologies, each phase is dependent on the preceding phase before it can begin and requires a defined set of inputs from the prior phase. As the graphic below portrays, the Evolutionary methodology is similar to the Incremental in that during the design phase development is broken into a distinct increment or subset of requirements. However, only this limited set of requirements is constructed through to implementation. The process then repeats itself, with the remaining requirements becoming an input to a new requirements phase. The "left over" requirements are given consideration for development along with any new functionality or changes. Another iteration of the process is carried through to implementation, with the result being an "evolved" form of the same software product. This cycle continues, with the full functionality "evolving" over time as multiple iterations are completed.

http://www.newmediacomm.com/publication/outsourcing/marapr08/techno.html

Waterfall Model

This methodology was the first formalization of a process for controlling software development. The Waterfall methodology was, and still is, the foundation for all SDLC methodologies. The basic phases in this methodology are utilized in all other methodologies as descriptors of processes within a given SDLC. The Waterfall methodology, as its name implies, is a series of phases that are distinct and cascading in nature. As the graphic below portrays, each phase is dependent on the preceding phase before it can begin, and requires a defined set of inputs from the prior phase. Subsequent phases are driven to complete the requirements defined in the Analysis phase to assure the resulting software meets these requirements. A slight derivative of this methodology exists, typically known as Modified Waterfall, whereby the end of one phase may overlap with the beginning of another, allowing the phases to operate in parallel for a short time. This is normally done to avoid gaps in phase schedules and still necessitates the completion of the prior phase primary deliverables before the subsequent phase is fully started.

Spiral Model

Much as the other methodologies, the Spiral methodology maintains a series of phases that are distinct and cascading in nature. As in the other methodologies, each phase is dependent on the preceding phase before it can begin and requires a defined set of inputs from the prior phase. However, as the graphic below portrays, the Spiral methodology iterates within the Requirements and Design phases. Unlike the other models, multiple iterations are utilized to better define requirements and design by assessing risk, simulating, and validating progress. The Spiral methodology also relies heavily on the use and evolution of prototypes to help define requirements and design. Prototypes become operational and are utilized to finalize detailed design. The objective of this being to thoroughly understand requirements and have a valid design prior to completing the other phases. Much like the Waterfall methodology, subsequent processes of Development through Implementation continues. Unlike Incremental and Evolutionary, no breakdown of development tasks nor iteration after implementation respectively are utilized.

http://www.newmediacomm.com/publication/outsourcing/marapr08/techno.html

Comparison of Methodologies:

Justification for choosing Iterative Enhancement model

Main reason: There is no need to look back within a phase in this model; one activity is performed at a time, and it is easy to check development progress (e.g. 90% coded, 20% tested). First of all, requirements are collected. After those requirements are analyzed, the PSF (Project Specification Form) document will be created. Implementation will be started only after completion of the design phase, which is our next semester FYP subject. The project is released to the supervisor near the end of the software life cycle, i.e. the semester. Another important reason for choosing this model is that, like the Waterfall model from which it derives, it is document driven: documentation is produced at every stage. It is a well-organized process model which will lead to concrete, more secure and reliable software. Only very small risks are involved in this project, which is another reason for this choice: a lot of risk assessment is not required.

4.2.2 Programming Language Research

Selecting the right platform for project development is one of the prior requirements. While selecting a platform, the developer needs to take care of several issues regarding the development of the project. The factors which need attention while selecting a programming language are:

Interface designing

Functionalities

Implementation of algorithms

Required time

As per the project requirements, the developer decided to use an object-oriented language because this will help the developer to reuse code.

As per the project's as well as the developer's requirements, it was found that C#.NET would be the best option for the programming language, as it reduces design time and supports a good GUI.

C#.NET allows quicker development of applications and supports the following:

Interactive GUI

Standalone Application

Web based Application

Supports ASP.NET for server-side web development

Many more

The proposed system is a web-based application, so the developer had many options for this, such as J2EE, PHP and C#.NET.

Here are brief descriptions of these languages, to better understand which one will be best for the proposed system.

C#.NET:

C# is the language of choice on the .NET platform. It is a newer language, free of the backward-compatibility curse, with a whole new set of exciting and interesting features. It is an object-oriented language whose core has many similarities to Java, C++ and VB; in a sense, you can say it combines the power and efficiency of these three existing languages.

Merits of C#.NET

C# compiles to CIL, which means that it can interface with all the classes and interfaces of the .NET platform.

C# has been derived from C++ and Java.

It allows OOP concepts to be implemented on the .NET platform.

Code becomes more legible because of get and set accessors (properties).

Delegates provide cleaner event management.

Concepts such as enumerations, indexers and properties, which are not available in many other languages, help increase robustness.

Demerits of C#

C# code can be harder as well as slower to debug and run.

Locks the developer into the Microsoft platform.

JSP- Java server pages:

Merits of JSP:

Demerits of JSP:

PHP:

Merits of PHP:

Demerits of PHP:

Conclusion of Programming language research

C#.NET has been selected by the developer for the development of the required project.

C#.NET supports GUI development, which is very important for this system.

Its database compatibility will allow the developer to use any required database.

Chapter 5: Analysis And Design

5.1 Analysis

5.1.1 Questionnaire Analysis

Question: What grade level do you work with?

Analysis:

Question: Do you think a search engine is the right tool for searching for your interests?

Analysis:

Question: How frequently do you use a search engine?

Analysis:

Question: Which search engine do you mostly use?

Analysis:

Question: Which of the following color schemes would you prefer for the interface of the search engine?

Analysis:

Question: Are you satisfied with the working of the current search engines?

Analysis:

Question: On what basis do you rate search engines?

Analysis:

Question: Do you want search engines to have the feature of different languages?

Analysis:

Question: How frequently do you perform a news search?

Analysis:

Question: What profession do you belong to?

Analysis:

Question: Which type of search engine do you prefer more?

Analysis:

Question: Do you think crawling is the most effective technique for a search engine?

Analysis:

Question: Is a web crawler based search engine more efficient than others?

Analysis:
