How Do Web Search Engines Work Information Technology Essay

Introduction

A search engine is a program that searches documents for specified keywords and returns a list of the documents where the keywords were found. Although “search engine” is really a general class of programs, the term is often used specifically to describe systems like Google, AltaVista and Excite that enable users to search for documents on the World Wide Web and USENET newsgroups.

Typically, a search engine works by sending out a spider to fetch as many documents as possible. Another program, called an indexer, then reads these documents and creates an index based on the words contained in each document. Each search engine uses a proprietary algorithm to create its indices such that, ideally, only meaningful results are returned for each query.

How Do Web Search Engines Work?

Search engines are the key to finding specific information on the vast expanse of the World Wide Web. Without sophisticated search engines, it would be virtually impossible to locate anything on the Web without knowing a specific URL. But do you know how search engines work? And do you know what makes some search engines more effective than others?

When people use the term search engine in relation to the Web, they are usually referring to the actual search forms that search through databases of HTML documents initially gathered by a robot.

There are basically three types of search engines: those that are powered by robots (called crawlers, ants or spiders), those that are powered by human submissions, and those that are a hybrid of the two.

Crawler-based search engines use automated software agents (called crawlers) that visit a Web site, read the information on the actual site, read the site’s meta tags and follow the links that the site connects to, indexing all linked Web sites as well. The crawler returns all that information to a central repository, where the data is indexed. The crawler periodically returns to the sites to check for any information that has changed. The frequency with which this happens is determined by the administrators of the search engine.
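The link-following behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `TOY_WEB` dictionary, the `crawl` function and the `toy_fetch_links` helper are hypothetical names invented here, and the "web" is a small in-memory table rather than real pages fetched over the network.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
TOY_WEB = {
    "a.example/": ["a.example/about", "b.example/"],
    "a.example/about": ["a.example/"],
    "b.example/": ["c.example/", "a.example/"],
    "c.example/": [],
}


def toy_fetch_links(url):
    """Stand-in for downloading a page and extracting its links."""
    return TOY_WEB.get(url, [])


def crawl(start, fetch_links, max_pages=100):
    """Visit pages breadth-first, following links and skipping repeats."""
    seen = {start}
    queue = deque([start])
    visited_order = []
    while queue and len(visited_order) < max_pages:
        url = queue.popleft()
        visited_order.append(url)  # a real crawler would hand the page to the indexer here
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited_order


order = crawl("a.example/", toy_fetch_links)
print(order)
```

A production crawler would also need politeness delays, error handling and a schedule for revisiting pages, none of which is shown here.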

Human-powered search engines rely on humans to submit information that is subsequently indexed and catalogued. Only information that is submitted is put into the index.

In both cases, when you query a search engine to locate information, you’re actually searching through the index that the search engine has created; you are not actually searching the Web. These indices are giant databases of information that is collected, stored and subsequently searched. This explains why sometimes a search on a commercial search engine, such as Yahoo! or Google, will return results that are, in fact, dead links. Since the search results are based on the index, if the index hasn’t been updated since a Web page became invalid, the search engine treats the page as still an active link even though it no longer is. It will remain that way until the index is updated.

So why will the same search on different search engines produce different results? Part of the answer is that not all indices are exactly the same. It depends on what the spiders find or what the humans submit. More importantly, not every search engine uses the same algorithm to search through the indices. The algorithm is what the search engines use to determine the relevance of the information in the index to what the user is searching for.

One of the elements that a search engine algorithm scans for is the frequency and location of keywords on a Web page. Pages with a higher keyword frequency are typically considered more relevant. But search engine technology is becoming more sophisticated in its attempts to discourage what is known as keyword stuffing, or spamdexing.
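As a rough illustration of scoring by keyword frequency and location, the sketch below counts keyword occurrences, weights matches in the title more heavily than matches in the body, and caps the body count so that simple keyword stuffing stops paying off. The function names, the title weight of 3 and the cap of 10 are assumptions made for this example; real engines do not publish their formulas.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_score(keyword, title, body, title_weight=3.0, body_cap=10):
    """Score a page for one keyword using frequency and location.

    Title matches count more than body matches (location), and body
    matches are capped so repeating a word endlessly does not help.
    """
    kw = keyword.lower()
    title_hits = tokenize(title).count(kw)
    body_hits = min(tokenize(body).count(kw), body_cap)
    return title_weight * title_hits + body_hits


honest = keyword_score(
    "engine",
    "Search Engine Basics",
    "An engine indexes pages. The engine ranks them.",
)
stuffed = keyword_score("engine", "Cheap Deals", "engine " * 50)
print(honest, stuffed)
```

With these particular weights the stuffed page still scores well, which shows why frequency alone is a weak signal and why engines combine it with other evidence.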

Another common element that algorithms analyze is the way that pages link to other pages in the Web. By analyzing how pages link to each other, an engine can both determine what a page is about (if the keywords of the linked pages are similar to the keywords on the original page) and whether that page is considered “important” and deserving of a boost in ranking. Just as the technology is becoming increasingly sophisticated to ignore keyword stuffing, it is also becoming more savvy to Web masters who build artificial links into their sites in order to build an artificial ranking.
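Link analysis of this kind can be illustrated with a simplified, PageRank-style iteration. This is a sketch only and not any engine's actual algorithm: the `link_scores` function, the damping value of 0.85 and the three-page example graph are assumptions chosen for the illustration.

```python
def link_scores(links, damping=0.85, iterations=50):
    """Estimate page importance from the link graph.

    `links` maps each page to the list of pages it links to. Every page
    repeatedly passes a share of its score to the pages it links to, so
    pages with many (or important) incoming links end up scoring higher.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    score = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * score[page] / len(targets)
                for target in targets:
                    new[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * score[page] / n
                for p in pages:
                    new[p] += share
        score = new
    return score


# Pages "a" and "b" both link to "c"; "c" links back to "a"; nothing links to "b".
scores = link_scores({"a": ["c"], "b": ["c"], "c": ["a"]})
print(sorted(scores, key=scores.get, reverse=True))
```

Here "c" ranks first because two pages link to it, and "b" ranks last because nothing links to it. Artificial link building tries to exploit exactly this effect.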

Did You Know?

The first tool for searching the Internet, created in 1990, was called “Archie”. It downloaded directory listings of all files located on public anonymous FTP servers, creating a searchable database of filenames. A year later “Gopher” was created. It indexed plain text documents. “Veronica” and “Jughead” came along to search Gopher’s index systems. The first actual Web search engine was developed by Matthew Gray in 1993 and was called “Wandex”.

Basic Fundamentals of Search Engines

A search engine operates in the following order:

Web crawling

Indexing

Searching

Web search engines work by storing information about many web pages, which they retrieve from the HTML itself. These pages are retrieved by a Web crawler (sometimes also known as a spider), an automated Web browser which follows every link on the site. Exclusions can be made by the use of robots.txt. The contents of each page are then analyzed to determine how it should be indexed (for example, words are extracted from the titles, headings, or special fields called meta tags). Data about web pages are stored in an index database for use in later queries. A query can be a single word. The purpose of an index is to allow information to be found as quickly as possible.

Some search engines, such as Google, store all or part of the source page (referred to as a cache) as well as information about the web pages, whereas others, such as AltaVista, store every word of every page they find. The cached page always holds the actual search text, since it is the version that was actually indexed, so it can be very useful when the content of the current page has been updated and the search terms are no longer in it. This problem might be considered a mild form of linkrot, and Google’s handling of it increases usability by satisfying the user’s expectation that the search terms will be on the returned page, in keeping with the principle of least astonishment. Increased search relevance makes these cached pages very useful, even beyond the fact that they may contain data that is no longer available elsewhere.
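The robots.txt exclusion and the extraction of titles, headings and meta tags mentioned above can be sketched with Python's standard library. The robots.txt rules, the sample page, the `ExampleBot` user agent and the `FieldExtractor` class are all made up for this illustration. Real crawlers fetch robots.txt from each site and deal with far messier HTML.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content that a site might publish.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())


def allowed(url, agent="ExampleBot"):
    """Return True if the robots.txt rules let this agent fetch the URL."""
    return robots.can_fetch(agent, url)


class FieldExtractor(HTMLParser):
    """Collect the title, headings and named meta tags of a page."""

    def __init__(self):
        super().__init__()
        self.fields = {"title": "", "headings": [], "meta": {}}
        self._current = None  # tag whose text is being collected, if any

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "h1", "h2", "h3"):
            self._current = tag
        elif tag == "meta":
            attr = dict(attrs)
            name, content = attr.get("name"), attr.get("content")
            if name and content is not None:
                self.fields["meta"][name.lower()] = content

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        text = data.strip()
        if not text or self._current is None:
            return
        if self._current == "title":
            self.fields["title"] += text
        else:
            self.fields["headings"].append(text)


# Hypothetical page used only for this example.
PAGE = (
    "<html><head><title>Search Engine Basics</title>"
    '<meta name="keywords" content="search, index"></head>'
    "<body><h1>Crawling</h1><p>Body text.</p><h2>Indexing</h2></body></html>"
)

extractor = FieldExtractor()
extractor.feed(PAGE)
fields = extractor.fields
print(allowed("https://example.com/private/data.html"), fields)
```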

When a user enters a query into a search engine (typically by using key words), the engine examines its index and provides a listing of best-matching web pages according to its criteria, usually with a short summary containing the document’s title and sometimes parts of the text. The index is built from the information stored with the data and the method by which the information is indexed. Unfortunately, there are currently no known public search engines that allow documents to be searched by date. Most search engines support the use of the Boolean operators AND, OR and NOT to further specify the search query. Boolean operators are for literal searches that allow the user to refine and extend the terms of the search. The engine looks for the words or phrases exactly as entered. Some search engines provide an advanced feature called proximity search, which allows users to define the distance between keywords. There is also concept-based searching, which uses statistical analysis of pages containing the words or phrases you search for. Finally, natural language queries allow the user to type a question in the same form one would ask it of a human. Ask.com is an example of such a site.
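Boolean and proximity queries map naturally onto set operations over an index. The sketch below assumes a tiny three-document collection. The names `DOCS`, `build_postings`, `docs_with` and `near` are hypothetical and exist only for this example.

```python
# Hypothetical document collection: id -> text.
DOCS = {
    1: "web crawler follows links",
    2: "search engine index",
    3: "web search engine",
}


def build_postings(docs):
    """Map each word to the set of document ids that contain it."""
    postings = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            postings.setdefault(word, set()).add(doc_id)
    return postings


POSTINGS = build_postings(DOCS)


def docs_with(term):
    """Return the ids of documents containing the term (empty set if none)."""
    return POSTINGS.get(term.lower(), set())


def near(text, first, second, max_distance):
    """Proximity test: do the two words occur within max_distance words?"""
    words = text.lower().split()
    pos_a = [i for i, w in enumerate(words) if w == first]
    pos_b = [i for i, w in enumerate(words) if w == second]
    return any(abs(i - j) <= max_distance for i in pos_a for j in pos_b)


and_result = docs_with("web") & docs_with("search")    # web AND search
or_result = docs_with("crawler") | docs_with("index")  # crawler OR index
not_result = docs_with("web") - docs_with("search")    # web NOT search
print(and_result, or_result, not_result)
```

Because each operator is a single set operation on precomputed postings, the engine never has to rescan the documents themselves to answer a Boolean query.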

The usefulness of a search engine depends on the relevance of the result set it gives back. While there may be millions of web pages that include a particular word or phrase, some pages may be more relevant, popular, or authoritative than others. Most search engines employ methods to rank the results to provide the “best” results first. How a search engine decides which pages are the best matches, and what order the results should be shown in, varies widely from one engine to another. The methods also change over time as Internet usage changes and new techniques evolve. There are two main types of search engine that have evolved: one is a system of predefined and hierarchically ordered keywords that humans have programmed extensively. The other is a system that generates an “inverted index” by analyzing texts it locates. This second form relies much more heavily on the computer itself to do the bulk of the work.
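To make the “inverted index” idea concrete, the sketch below builds one from a few short texts and ranks matching documents by how often the term occurs. The function names and sample documents are assumptions for illustration, and ranking by raw term count is deliberately simplistic compared with what real engines do.

```python
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Map each term to {document id: number of occurrences}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(text.lower().split()).items():
            index[term][doc_id] = count
    return index


def search(index, term):
    """Return matching document ids, most occurrences first (ties by id)."""
    hits = index.get(term.lower(), {})
    return sorted(hits, key=lambda doc_id: (-hits[doc_id], doc_id))


# Hypothetical documents used only for this example.
docs = {
    "p1": "search engines rank pages",
    "p2": "search search search tips",
    "p3": "cooking tips",
}
index = build_inverted_index(docs)
print(search(index, "search"))
```

The index is “inverted” because it goes from words to documents rather than from documents to words, which is what lets a query be answered without reading every page.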

Most Web search engines are commercial ventures supported by advertising revenue and, as a result, some employ the practice of allowing advertisers to pay money to have their listings ranked higher in search results. Those search engines which do not accept money for their search engine results make money by running search related ads alongside the regular search engine results. The search engines make money every time someone clicks on one of these ads.

Different types of search engines

Aesop Search – The Aesop spider looks for a new meta tag that allows webmasters to automatically describe their site.

AltaVista – The default search results consist of GoTo and results from the AltaVista spider (over 500M pages). Displays related searches. Offers translation services and multimedia searches.

Ask Jeeves – The polite butler Jeeves answers all your questions asked in plain English. If Jeeves doesn’t understand your question, it gives you the top-results from other search engines.

Brand New Sites – Directory of newly launched sites (less than 6 months old) classified in 284 categories.

Direct Hit – Search engine which ranks its search results based on user popularity. Often provides good results for popular queries.

Entireweb.com – Search engine claiming over 80M documents.

Excite – Matching content from the Overture website is displayed first. After that come the search results from Dogpile and directory results from ODP.

Fast Search – Search with a clear interface through a database of over 300 million web pages. Also offers FTP and MP3 search.

First-Search.com – Returns only the homepage of sites. Targeted at users who are searching for good sites, rather than particular pages.

Google – Lists the results in the order of popularity, determined by the number of links from other sites. Frequently gives you the right results first. All pages in the Google index are cached, and you can search for pages related to a specific page.

HotBot – An advanced search engine. There are many configurable options, in both simple and advanced search modes.

ILor search – Allows users to create annotated comments on top of search results.

Lycos – Displays matches from sites that are part of the Lycos Network and very popular sites first. Then follow Open Directory results, sometimes followed by results from the Lycos crawler. At the bottom there are links to relevant news articles and products to buy.

Northern Light – A search engine for professional web users. They have a general search engine, and a “Special Collection” of 4M journals/books/mags which are accessible on a pay-per-view basis.

PageSeeker – Search engine with an interactive interface. [Requires Flash]

Raging Search – No-nonsense search engine from AltaVista. It even returns the same search results as AltaVista. There are no banners or any other content that would distract you from your mission.

7Search.com – Search results include web site information, such as email addresses, location, age and site popularity (when available). You can choose to be notified when sites matching your criteria are added to their database.

SearchHippo – A crisp and clean spider-based web search with free PHP, XSLT and XML code for integration.

SearchKing – Search engine that uses searchers’ input to determine relevancy and placement, and offers instant indexing.

Teoma Search – Searches deliver pages grouped by subject and as a listing. Searches can be modified to search for an exact phrase and to include or exclude specific terms.

TrueSearch – Search engine that actively removes dead links.

WebCrawler – Search engine and web directory. Displays matching categories first. After that come the results from the WebCrawler spider, without descriptions.

WISEnut – Up-to-date index on almost 1.5 billion pages, including site categorization and international search support.

Yep – A portal and search engine that ranks sites by popularity.

Zerx – You can view sites related to another site, or refine your existing search using that site.

Google Search Engine

Google Search, a web search engine, is the company’s most popular service. According to market research published by comScore in November 2009, Google is the dominant search engine in the United States market, with a market share of 65.6%. Google indexes billions of web pages, so that users can search for the information they desire, through the use of keywords and operators. Despite its popularity, it has received criticism from a number of organizations. In 2003, The New York Times complained about Google’s indexing, claiming that Google’s caching of content on their site infringed on their copyright for the content. In this case, the United States District Court of Nevada ruled in favor of Google in Field v. Google and Parker v. Google. Furthermore, the publication 2600: The Hacker Quarterly has compiled a list of words that the web giant’s new instant search feature will not search. Google Watch has also criticized Google’s PageRank algorithms, saying that they discriminate against new websites and favor established sites, and has made allegations about connections between Google and the NSA and the CIA. Despite criticism, the basic search engine has spread to specific services as well, including an image search engine, the Google News search site, Google Maps, and more. In early 2006, the company launched Google Video, which allowed users to upload, search, and watch videos from the Internet. In 2009, however, uploads to Google Video were discontinued so that Google could focus more on the search aspect of the service. The company even developed Google Desktop, a desktop search application used to search for files local to one’s computer. Google’s most recent development in search is their partnership with the United States Patent and Trademark Office to create Google Patents, which enables free access to information about patents and trademarks.

One of the more controversial search services Google hosts is Google Books. The company began scanning books and uploading limited previews, and full books where allowed, into their new book search engine. The Authors Guild, a group that represents 8,000 U.S. authors, filed a class action suit in a Manhattan federal court against Google in 2005 over this new service. Google replied that it is in compliance with all existing and historical applications of copyright laws regarding books. Google eventually reached a revised settlement in 2009 to limit its scans to books from the U.S., the U.K., Australia and Canada. Furthermore, the Paris Civil Court ruled against Google in late 2009, asking them to remove the works of La Martinière (Éditions du Seuil) from their database. In competition with Amazon.com, Google plans to sell digital versions of new books. Similarly, in response to newcomer Bing, on July 21, 2010, Google updated their image search to display a streaming sequence of thumbnails that enlarge when pointed at. Though web searches still appear in a batch per page format, on July 23, 2010, dictionary definitions for certain English words began appearing above the linked results for web searches.

Productivity tools

In addition to its standard web search services, Google has released over the years a number of online productivity tools. Gmail, a free webmail service provided by Google, was launched as an invitation-only beta program on April 1, 2004, and became available to the general public on February 7, 2007. The service was upgraded from beta status on July 7, 2009, at which time it had 146 million users monthly. The service would be the first online email service with one gigabyte of storage, and the first to keep emails from the same conversation together in one thread, similar to an Internet forum. The service currently offers over 7400 MB of free storage with additional storage ranging from 20 GB to 16 TB available for US$0.25 per 1 GB per year. Furthermore, software developers know Gmail for its pioneering use of AJAX, a programming technique that allows web pages to be interactive without refreshing the browser. One criticism of Gmail has been the potential for data disclosure, a risk associated with many online web applications. Steve Ballmer (Microsoft’s CEO), Liz Figueroa, Mark Rasch, and the editors of Google Watch believe the processing of email message content goes beyond proper use, but Google claims that mail sent to or from Gmail is never read by a human being beyond the account holder, and is only used to improve relevance of advertisements.

Google Docs, another part of Google’s productivity suite, allows users to create, edit, and collaborate on documents in an online environment, not dissimilar to Microsoft Word. The service was originally called Writely, but was acquired by Google on March 9, 2006, after which it was released as an invitation-only preview. On June 6, after the acquisition, Google created an experimental spreadsheet editing program, which would be combined with Google Docs on October 10. A program to edit presentations would complete the set on September 17, 2007, before all three services were taken out of beta along with Gmail, Google Calendar and all products from the Google Apps Suite on July 7, 2009.

Enterprise products

Google entered the enterprise market in February 2002 with the launch of its Google Search Appliance, targeted toward providing search technology for larger organizations. Google launched the Mini three years later, which was targeted at smaller organizations. Late in 2006, Google began to sell Custom Search Business Edition, providing customers with an advertising-free window into Google.com’s index. The service was renamed Google Site Search in 2008.

Another one of Google’s enterprise products is Google Apps Premier Edition. The service, and its accompanying Google Apps Education Edition and Standard Edition, allow companies, schools, and other organizations to bring Google’s online applications, such as Gmail and Google Documents, into their own domain. The Premier Edition specifically includes extras over the Standard Edition such as more disk space, API access, and premium support, and it costs $50 per user per year. A large implementation of Google Apps with 38,000 users is at Lakehead University in Thunder Bay, Ontario, Canada. In the same year Google Apps was launched, Google acquired Postini and proceeded to integrate the company’s security technologies into Google Apps under the name Google Postini Services.

Company Perspectives:

Google’s founders have often stated that the company is not serious about anything but search. They built a company around the idea that work should be challenging and the challenge should be fun. To that end, Google’s culture is unlike any in corporate America, and it’s not because of the ubiquitous lava lamps and large rubber balls, or the fact that the company’s chef used to cook for the Grateful Dead. In the same way Google puts users first when it comes to our online service, Google Inc. puts employees first when it comes to daily life in our Googleplex headquarters. There is an emphasis on team achievements and pride in individual accomplishments that contribute to the company’s overall success. Ideas are traded, tested and put into practice with an alacrity that can be dizzying. Meetings that would take hours elsewhere are frequently little more than a conversation in line for lunch and few walls separate those who write the code from those who write the checks. This highly communicative environment fosters a productivity and camaraderie fueled by the realization that millions of people rely on Google results. Give the proper tools to a group of people who like to make a difference, and they will.

Key Dates:

1995: Google founders Sergey Brin and Larry Page meet at Stanford University.

1997: BackRub, the precursor to the Google search engine, is founded.

1998: Google is incorporated and moves into its first office in a Menlo Park, California, garage.

1999: Google moves its headquarters to Palo Alto, California, and later to Mountain View, California; Red Hat becomes Google’s first commercial customer.

2000: Yahoo! Internet Life magazine names Google the Best Search Engine on the Internet; Google becomes the largest search engine on the Web and launches the Google Toolbar.

2001: Google acquires Deja.com’s Usenet archive and launches Google PhoneBook; Dr. Eric Schmidt joins Google as chairman of the board of directors and is later appointed CEO.

2002: Google launches the Google Search Appliance, AdWords Select, the 2001 Search Engine Awards, and Google Compute.

Conclusion

Online research has become an essential skill for writers. What typically took place in libraries, by phone calls or visits to experts in the field is being changed because of the Internet. Experts can sometimes be contacted by email and information, whether it is addresses, phone numbers, or detailed specifics on a certain subject, can be accessed on the World Wide Web. Search engines have become the most important tools in locating this information, so it is important to know how to use them effectively. Search skills can be developed through practice in using the search engines and by reading the help pages provided by the search engines themselves. Over time, you will learn which search engine is good for pulling up what kind of information. This article will provide a general overview of the various search engines and some of their advanced search features which will help you with your online research.