Search engines: keyword search and the search engine web server

Search engines have long been an integral part of the Russian Internet. They are now huge, complex mechanisms that serve not only as an information search tool but also as an attractive arena for business.

Most search engine users have never thought about (or have thought about, but found no answer to) how search engines work: how user requests are processed, what these systems consist of, and how they function...

This master class is designed to answer the question of how search engines work. However, you will not find here the factors that influence the ranking of documents, and you certainly should not count on a detailed explanation of the Yandex algorithm. According to Ilya Segalovich, director of technology and development at Yandex, it could be learned only “under torture” from Ilya Segalovich himself...

2. Concept and functions of a search engine

A search engine is a software and hardware complex designed to search the Internet and to respond to a user request, given as a text phrase (the search query), by producing a list of links to information sources ordered by relevance to that request. The largest international search engines are Google, Yahoo, and MSN. On the Russian Internet they are Yandex, Rambler, and Aport.

Let's take a closer look at the concept of a search query, using the Yandex search engine as an example. The user should formulate the query according to what he wants to find, as briefly and simply as possible. Say we want to find information in Yandex on how to choose a car. To do this, we open the Yandex main page and enter the query text “how to choose a car.” Our task then comes down to opening the links to information sources that are returned for our query. It is quite possible, however, that we will not find the information we need. If that happens, either the query needs to be rephrased, or the search engine database really contains nothing relevant to it (this can happen with very “narrow” queries such as “how to choose a car in Arkhangelsk”).

The primary goal of any search engine is to deliver to people exactly the information they are looking for. It is impossible, however, to teach users to make “correct” requests, that is, queries that conform to the operating principles of search engines. Developers therefore create algorithms and operating principles that let users find the information they are looking for anyway.

This means the search engine must “think” the same way the user thinks when searching for information. When a user makes a request to a search engine, he wants to find what he needs as quickly and easily as possible. Receiving the result, he evaluates the performance of the system, guided by several basic parameters. Did he find what he was looking for? If he didn’t find it, how many times did he have to rephrase the query to find what he was looking for? How much relevant information could he find? How quickly did the search engine process the request? How convenient were the search results presented? Was the result you were looking for the first or the hundredth? How much unnecessary garbage was found along with useful information? Will the necessary information be found when accessing a search engine, say, in a week, or in a month?

To give good answers to all of these questions, search engine developers constantly improve search algorithms and principles, add new functions and capabilities, and try in every possible way to speed up the system.

3. Main characteristics of the search engine

Let us describe the main characteristics of search engines:

  • Completeness

    Completeness is one of the main characteristics of a search system, which is the ratio of the number of documents found by request to the total number of documents on the Internet that satisfy the given request. For example, if there are 100 pages on the Internet containing the phrase “how to choose a car,” and only 60 of them were found for the corresponding query, then the completeness of the search will be 0.6. Obviously, the more complete the search, the less likely it is that the user will not find the document he needs, provided that it exists on the Internet at all.

  • Accuracy

    Accuracy is another main characteristic of a search engine; it is determined by the degree to which the found documents match the user's request. For example, if the query “how to choose a car” returns 100 documents, 50 of which contain the phrase “how to choose a car” while the rest merely contain these words (“how to choose the right radio and install it in a car”), then the search accuracy is 50/100 = 0.5. The more accurate the search, the faster the user finds the documents he needs, the less “garbage” appears among them, and the less often the found documents fail to match the request.

  • Relevance

    Relevance, understood here as the timeliness of the index, is an equally important component of search. It is characterized by the time that passes between a document being published on the Internet and its being entered into the search engine's index database. For example, the day after some interesting news appears, a large number of users turn to search engines with related queries. Less than a day has passed since the news was published, yet the main documents have already been indexed and are available for search, thanks to the so-called “fast database” of large search engines, which is updated several times a day.

  • Search speed

    Search speed is closely related to its load resistance. For example, according to Rambler Internet Holding LLC, today during business hours the Rambler search engine receives about 60 requests per second. Such workload requires reducing the processing time of an individual request. Here the interests of the user and the search engine coincide: the visitor wants to get results as quickly as possible, and the search engine must process the request as quickly as possible, so as not to slow down the calculation of subsequent queries.

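  • Visibility

    Visual presentation of results is an important component of convenient search. For most queries, the search engine finds hundreds or even thousands of documents. Because of unclear queries or inaccurate search, even the first pages of results do not always contain only the necessary information, so the user often has to perform his own search within the list that was found. Various elements of the results page help him navigate it. Detailed explanations of the search results page, for example for Yandex, can be found at http://help.yandex.ru/search/?id=481937.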
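The completeness and accuracy figures used as examples in this list can be reproduced with a few lines of Python. This is only a sketch: the function names are chosen for the illustration, and in the information retrieval literature these two measures are usually called recall and precision.

```python
def recall(found_relevant: int, total_relevant: int) -> float:
    """Completeness: share of all matching documents that the engine returned."""
    if total_relevant <= 0:
        raise ValueError("total_relevant must be positive")
    return found_relevant / total_relevant


def precision(found_relevant: int, total_found: int) -> float:
    """Accuracy: share of the returned documents that really match the query."""
    if total_found <= 0:
        raise ValueError("total_found must be positive")
    return found_relevant / total_found


# 100 matching pages exist on the Internet, 60 of them were found.
completeness = recall(60, 100)

# 100 documents were returned, 50 of them contain the exact phrase.
accuracy = precision(50, 100)

print(completeness, accuracy)
```

The two values pull in different directions: returning more documents tends to raise completeness but lower accuracy, which is why both are tracked.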

4. Brief history of the development of search engines

In the initial period of Internet development, the number of its users was small, and the amount of available information was relatively small. For the most part, only research staff had access to the Internet. At this time, the task of searching for information on the Internet was not as urgent as it is now.

One of the first ways to organize access to network information resources was the creation of open site directories, in which links to resources were grouped by topic. The first such project was the Yahoo.com website, which opened in the spring of 1994. After the number of sites in the directory had grown significantly, the ability to search within the directory was added. It was not yet a search engine in the full sense, since the search area was limited to the resources present in the directory rather than all Internet resources.

Link directories were widely used in the past but have almost completely lost their popularity, since even modern directories, huge in volume, cover only a negligible part of the Internet. The largest directory, DMOZ (also called the Open Directory Project), contains information about 5 million resources, while the Google search engine database consists of more than 8 billion documents.

The first full-fledged search engine was the WebCrawler project, launched in 1994. In 1995, the search engines Lycos and AltaVista appeared. The latter was a leader in Internet information search for many years.

In 1997, Sergey Brin and Larry Page created the Google search engine as part of a research project at Stanford University. Google is currently the most popular search engine in the world.

In September 1997, the Yandex search engine, which is the most popular on the Russian-language Internet, was officially announced.

Currently, there are three main international search engines, Google, Yahoo and MSN, which have their own databases and search algorithms. Most other search engines (of which there are a large number) use the results of these three in one form or another. For example, AOL search (search.aol.com) uses the Google database, while AltaVista, Lycos and AllTheWeb use the Yahoo database.

5. Composition and principles of operation of the search system

In Russia, the main search engine is Yandex, followed by Rambler.ru, Google.ru, Aport.ru, Mail.ru. Moreover, at the moment, Mail.ru uses the Yandex search engine and database.

Almost every major search engine has its own structure, different from the others. However, it is possible to identify the main components common to all search engines; the structural differences lie only in how the mechanisms of interaction between these components are implemented.

Indexing module

The indexing module consists of three auxiliary programs (robots):

Spider is a program designed to download web pages. The spider downloads a page's HTML code and retrieves all internal links from that page. Robots use the HTTP protocol to download pages: the robot sends the server a request of the form “GET /path/document” together with other HTTP request headers, and in response receives a text stream containing service information and the document itself. The following is recorded for each downloaded page:

  • page URL
  • date the page was downloaded
  • HTTP header of the server response
  • page body (HTML code)
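The request a spider sends and the four fields listed above can be sketched in Python as follows. This is an illustration under stated assumptions, not the code of any real search engine: the PageRecord class, the function names and the ExampleSpider user agent are hypothetical, and a canned response string stands in for a real network connection.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from urllib.parse import urlsplit


@dataclass
class PageRecord:
    """What the spider keeps for one downloaded page (hypothetical layout)."""
    url: str
    downloaded_at: datetime
    headers: dict
    body: str


def build_request(url: str, user_agent: str = "ExampleSpider/0.1") -> str:
    """Build the text of an HTTP/1.1 GET request for the given URL."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {parts.netloc}\r\n"
        f"User-Agent: {user_agent}\r\n"
        "Connection: close\r\n"
        "\r\n"
    )


def parse_response(url: str, raw: str) -> PageRecord:
    """Split a raw response into service information (headers) and the document."""
    head, _, body = raw.partition("\r\n\r\n")
    headers = {}
    for line in head.split("\r\n")[1:]:  # the first line is the status line
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return PageRecord(url, datetime.now(timezone.utc), headers, body)


# A canned response replaces the network so the sketch runs offline.
canned = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<html><body>Hi</body></html>"
)
request_text = build_request("http://example.com/path/document")
record = parse_response("http://example.com/path/document", canned)
```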

Crawler (“traveling” spider) is a program that automatically follows all the links found on a page. It selects all the links present on the page, and its job is to determine where the spider should go next, based either on those links or on a predetermined list of addresses. By following the links it finds, the crawler discovers new documents that are still unknown to the search engine.
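The crawler's job of collecting links and deciding where to go next can be sketched as a breadth-first traversal. The sketch below is an assumption about one simple way to do it: the LinkExtractor class, the crawl function and the three-page FAKE_WEB dictionary are invented for the example, and the dictionary replaces real downloads.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the absolute URLs of all <a href="..."> links on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def crawl(start_url, fetch, limit=100):
    """Visit pages breadth-first; fetch(url) returns HTML or None."""
    seen = {start_url}
    frontier = deque([start_url])  # addresses the spider should visit next
    visited_order = []
    while frontier and len(visited_order) < limit:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:  # page could not be downloaded
            continue
        visited_order.append(url)
        extractor = LinkExtractor(url)
        extractor.feed(html)
        for link in extractor.links:
            if link not in seen:  # only documents not yet known
                seen.add(link)
                frontier.append(link)
    return visited_order


# A tiny in-memory "web" (hypothetical pages) instead of real HTTP requests.
FAKE_WEB = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.com/b": '<a href="/c">C</a>',
}

visited = crawl("http://example.com/", FAKE_WEB.get)
```

The `seen` set is what keeps the crawler from revisiting pages that link back to each other.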

Indexer (robot indexer) is a program that analyzes web pages downloaded by spiders. The indexer parses the page into its component parts and analyzes them using its own lexical and morphological algorithms. Various page elements are analyzed, such as text, headings, links, structural and style features, special service HTML tags, etc.

Thus, the indexing module allows you to crawl a given set of resources using links, download encountered pages, extract links to new pages from received documents, and perform a complete analysis of these documents.

Database

A database, or search engine index, is a data storage system, an information array in which specially converted parameters of all documents downloaded and processed by the indexing module are stored.
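The text does not say how the “specially converted parameters” are laid out. A common choice, assumed here purely for illustration, is an inverted index that maps each word to the documents and positions where it occurs. The function names and the two sample documents are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"\w+", text.lower())


def build_index(documents):
    """Map every word to {document id: [positions of the word]}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, word in enumerate(tokenize(text)):
            index[word].setdefault(doc_id, []).append(position)
    return index


docs = {
    1: "How to choose a car",
    2: "How to choose the right radio and install it in a car",
}
index = build_index(docs)

# Which documents mention "car", and where?
print(index["car"])
```

Storing positions as well as document ids makes it possible to tell an exact phrase match from documents that merely contain the same words.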

Search server

The search server is the most important element of the entire system, since the quality and speed of the search directly depend on the algorithms that underlie its functioning.

The search server works as follows:

  • The request received from the user is subjected to morphological analysis. The information environment of each document contained in the database is generated (it will subsequently be displayed in the form of a snippet, that is, text information corresponding to the request on the search results page).
  • The resulting data is passed as input parameters to a special ranking module. The data is processed for all documents, and as a result each document receives its own rating, which characterizes how well the various components of the document stored in the search engine index match the query entered by the user.
  • Depending on the user’s choice, this rating can be adjusted by additional conditions (for example, the so-called “advanced search”).
  • Next, a snippet is generated: for each document found, the title, a short abstract that best matches the query, and a link to the document itself are extracted from the document table, and the words found are highlighted.
  • The resulting search results are transmitted to the user in the form of a SERP (Search Engine Result Page) – a search results page.
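The steps listed above can be sketched as a toy pipeline. Everything in it is a simplifying assumption: lowercasing stands in for real morphological analysis, the rating is just a count of query words, and the snippet is the start of the text with matches marked by asterisks. The DOCS dictionary and all function names are hypothetical.

```python
import re

DOCS = {
    "http://example.com/cars": "How to choose a car: compare price, safety and fuel use.",
    "http://example.com/radio": "How to choose the right radio and install it in a car.",
    "http://example.com/pies": "A recipe for pies with cabbage.",
}


def tokenize(text):
    """Crude stand-in for morphological analysis: lowercase words."""
    return re.findall(r"\w+", text.lower())


def rate(query_terms, text):
    """Rating of a document: how many times the query words occur in it."""
    words = tokenize(text)
    return sum(words.count(term) for term in query_terms)


def make_snippet(query_terms, text, width=40):
    """Take the start of the text and highlight the words that were found."""
    def highlight(match):
        word = match.group(0)
        return f"*{word}*" if word.lower() in query_terms else word

    return re.sub(r"\w+", highlight, text[:width])


def search(query, docs):
    """Return a results page: documents ordered by descending rating."""
    terms = set(tokenize(query))
    ranked = sorted(
        ((rate(terms, text), url) for url, text in docs.items()),
        key=lambda pair: (-pair[0], pair[1]),
    )
    return [
        {"url": url, "rating": rating, "snippet": make_snippet(terms, docs[url])}
        for rating, url in ranked
        if rating > 0
    ]


results = search("car price", DOCS)
for item in results:
    print(item["rating"], item["url"], item["snippet"])
```

A real ranking module weighs far more signals than word counts; the sketch only shows how the stages hand data to one another.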

As you can see, all these components are closely related to each other and work in interaction, forming a clear, rather complex mechanism for the operation of a search system, requiring huge amounts of resources.

6. Conclusion

Now let's summarize all of the above.

  • The primary goal of any search engine is to deliver to people exactly the information they are looking for.
  • Main characteristics of search engines:
    1. Completeness
    2. Accuracy
    3. Relevance
    4. Search speed
    5. Visibility
  • The first full-fledged search engine was the WebCrawler project, published in 1994.
  • The search system includes the following components:
    1. Indexing module
    2. Database
    3. Search server

We hope that our master class will allow you to become more familiar with the concept of a search engine and better understand the main functions, characteristics and operating principles of search engines.

Hello, dear readers! Ekaterina Kalmykova is with you. Today’s article will be devoted to such a concept as a search engine, what it is, what it is needed for. We will also consider in detail the types of search engines on the Internet.

If you have a question: “Why do I need to know about these search engines?”, then I will answer this way. When you eat a delicious soup in a restaurant, would you like to know what ingredients it is made from so you can recreate it yourself at home? After all, if you are satisfied with the end result, that is, the taste of the soup, then you would probably be interested in knowing what led to this result?

The same can be said about working with a search engine (SE). If you create your own blog in the future, then, knowing how an SE works, you will not have to turn to specialists for help. You will be able to manage your project yourself in such a way that the search engine can see it and show it to other users. After all, the traffic to your resource and, accordingly, your earnings will depend on this.

So let's get started.

What is a search engine?

A search engine is a special resource on the Internet that provides information to the user in accordance with his request. That is, this resource collects data about all the web projects on the global network and, when it receives a specific request from a user, provides the information sought by directing the user, for example, to a thematic blog or website.

Thus, after creating your project, your task will be to get into the search results, that is, into the “list” or database of the search engine. Since website promotion on the Internet is simply not possible without using some kind of search engine, you will need to take care of the quality of your resource, its internal and external optimization. We will discuss how to do this in the following articles. So don't miss out.

Since new web resources appear almost every day, the search engine database must be constantly updated. Each newly created site must be indexed by a robot. Put simply, the search engine's assistants, the robots, must get acquainted with the new resource and pass this data on to the search engine itself.

Well, here you probably guessed that when a robot visits your blog, he should like everything. Your future fate will depend on this guest.

I will tell you how to make the robot completely delighted with your project in one of the following articles. Don't miss it: I will share some very interesting information with you.

How search engines work

All work with an SE begins with entering the desired query in the search bar. What can users search for? Anything at all, from a recipe for pies with cabbage to the eternal question “how to make more money without doing anything.”

In order for your resource to be the answer to the question, you need to get ahead of your competitors. To do that, you need to pay special attention to promoting your project. This includes writing high-quality optimized content, that is, articles that answer the query; improving behavioral factors, that is, keeping your reader interested in staying on the resource; improving usability, that is, visitor convenience; and many other factors. We will learn to do all of this together.

Search Engine Components

And what helps search engines such as Google index your resource?

  1. Agents: the workers that do the bulk of the work, indexing and analyzing sites.
  2. Spiders: programs that download the pages of a web resource and collect general information about it.
  3. Crawlers: programs that find all the links on a page and follow them in search of new data not yet known to the search engine.
  4. Indexers: programs that analyze text, headings, style, and so on.
  5. Robots: programs that index your content pages and also examine the various links.

In order for indexing to happen the way you need, you create a special file, “robots.txt”. It tells the robots which pages they may visit and which ones they should leave out.
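To see how a robot might consult such a file, here is a minimal sketch that uses Python's standard urllib.robotparser module. The rules, the ExampleBot name and the example.com addresses are made up for the illustration; a real blog would list its own sections.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: every robot may visit everything
# except the admin area and unpublished drafts.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /drafts/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A well-behaved robot asks before downloading each page.
can_visit_post = parser.can_fetch("ExampleBot", "http://example.com/blog/post-1")
can_visit_admin = parser.can_fetch("ExampleBot", "http://example.com/admin/settings")

print(can_visit_post, can_visit_admin)
```

Keep in mind that robots.txt is a request to robots rather than a lock: it is honored by well-behaved crawlers, but it does not protect private pages.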

Types of search engines

There are several options for information search systems:

  • Catalogs. A simple comparison for this kind of search is a bookshelf in a library: everything is stored in categories and subcategories on specific topics. If you find yourself in such a search engine, then believe me, the information you find there will be more than useful and easy to understand. Can you guess which well-known site we are talking about? Wikipedia, of course, which has gathered a whole directory of useful information.
  • Search indexes. Data is searched by key phrases. This is convenient and inconvenient at the same time. I think I will be understood by anyone who has searched for, say, “A girl shows her class,” hoping to find a girl giving a thumbs up, only to get something not very decent in the results. 🙂 This type of search is typical of most search engines.
  • Rating systems. They determine a resource's popularity by the number of visits. Of course, this is not the best criterion, since the usefulness and quality of the resource itself are not always taken into account. An example of such a system is the Internet resource alexa.com.

Search engines are also divided into general and specialized. General search engines sort information without any selection across all the web resources known to them; these include Yandex, Rambler, and Google. Specialized ones sort by the language used.

Search engines can also be divided into regional and global distribution.

Today, all search engines are constantly improving their algorithms for selecting high-quality, relevant resources.

A little history

Search engines appeared on the RuNet in 1996: these were Aport and Rambler. A year later, in 1997, Yandex was launched, and a year after that, in 1998, another competitor appeared: Google. Currently the most popular are Yandex and Google.

What search engines are the most popular now?

According to the statistics, Yandex is now the most popular search engine in Russia, along with Google and Mail.

This way, you can see the leading search engines that you should focus on when creating and promoting your project.

Search engine Yandex

The principle of operation is as follows: enter the desired query into the search bar, click “Find”, and look at the results. Yandex also shows how many results it found (13 million in this example). You can also search in pictures, videos, and the market (see the left column).

Additionally, you can configure the search region. To do this, click on the icon next to the cross in the search bar and select the desired region in the filter window.

Google search engine

Google works similarly to Yandex. You can search for information in different sections: pictures, videos, news, maps, etc.

If you click on “Search Tools”, a panel with settings will open, where you can select the region, language and for what time to search for information.

Now you know what search engines exist on the Internet, you have also seen the most popular of them, and now, armed with information, you can establish your connections and interaction with search engines.

That's all for today. How do you like the article?

Bye everyone.

I advise you to subscribe to blog updates so as not to miss the latest news.

Ekaterina Kalmykova

No search engine covers all Internet resources.

Each search engine collects information about Internet resources using its own unique methods and forms its own periodically updated database. Access to this database is granted to the user.

Search engines implement two ways to search for a resource:

    Search by topic catalogs - information is presented in the form of a hierarchical structure. At the top level there are general categories (“Internet”, “Business”, “Art”, “Education”, etc.), at the next level the categories are divided into sections, etc. The lowest level is links to specific web pages or other information resources.

    Keyword search (index search or detailed search) - the user sends the search engine a request consisting of keywords. The system returns a list of the resources found for the request.

Most search engines combine both search methods.

Search engines can be local, global, regional and specialized.

In the Russian part of the Internet (Runet), the most popular general purpose search engines are Rambler (www.rambler.ru), Yandex (www.yandex.ru), Aport (www.aport.ru), Google (www.google.ru).

Most search engines are implemented in the form of portals.

A portal (from the English portal: main entrance, gate) is a website that integrates various Internet services: search tools, mail, news, dictionaries, etc.

Portals can be specialized (for example, www.museum.ru) and general (for example, www.km.ru).

Search by keywords

The set of keywords used to search is also called the search criterion or search topic.

A request can consist of either one word or a combination of words combined by operators - symbols by which the system determines what action it needs to perform. For example: the request “Moscow St. Petersburg” contains the AND operator (this is how a space is perceived), which indicates that one should search for documents that contain both words - Moscow and St. Petersburg.
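The AND behaviour of a space can be illustrated with a minimal sketch. The function name, the sample documents and the simplified query "Moscow Petersburg" are assumptions made for the illustration; real query languages differ between search engines.

```python
def matches_all(query: str, text: str) -> bool:
    """Return True if every whitespace-separated query word occurs in text."""
    words_in_text = set(text.lower().split())
    return all(word in words_in_text for word in query.lower().split())


# Hypothetical sample documents.
documents = {
    "doc1": "trains from Moscow to Petersburg",
    "doc2": "hotels in Moscow",
    "doc3": "Petersburg museums",
}

# The space between the two words acts as AND: both must be present.
hits = [name for name, text in documents.items()
        if matches_all("Moscow Petersburg", text)]
print(hits)
```

Only the document that contains both words is returned; documents that mention just one of the cities are rejected.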

In order for the search to be relevant (from the English relevant, meaning pertinent to the matter at hand), several general rules should be taken into account:

    Regardless of the form in which the word is used in the query, the search takes into account all its word forms according to the rules of the Russian language. For example, the query “ticket” will also find other grammatical forms of the same word, such as “tickets”.

    Capital letters should only be used in proper names to avoid viewing unnecessary references. A lowercase query for “blacksmiths,” for example, will find documents that talk about both blacksmiths and people with the surname Kuznetsov (in Russian the surname coincides with a form of the word “blacksmith”).

    It is advisable to narrow your search using a few keywords.

    If the required address is not among the first twenty addresses found, you should change the request.
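The first rule, matching all forms of a word, can be sketched as follows. This is only an illustration: the `WORD_FORMS` table is a hypothetical hand-written stand-in for the morphological dictionary a real search engine would use, and the capitalization rule is not modelled here.

```python
# Hypothetical table mapping word forms to a base form.
WORD_FORMS = {"ticket": "ticket", "tickets": "ticket", "ticketed": "ticket"}


def normalize(word: str) -> str:
    """Lowercase a word, strip punctuation and map it to its base form."""
    word = word.lower().strip(".,!?")
    return WORD_FORMS.get(word, word)


def find(query_word: str, documents: list[str]) -> list[str]:
    """Return the documents that contain any form of query_word."""
    target = normalize(query_word)
    return [doc for doc in documents
            if target in {normalize(w) for w in doc.split()}]


documents = ["Buy tickets online", "A ticket to Moscow", "Train timetable"]
found = find("ticket", documents)
print(found)
```

Both the query and the document words are reduced to the same base form before comparison, so the query form does not have to match the document form exactly.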

Each search engine uses its own query language. To get acquainted with it, use the built-in help of the search engine.

Large sites may have built-in information retrieval systems within their web pages.

Queries in such search systems are, as a rule, built according to the same rules as in global search engines; nevertheless, it is worth reading their help pages as well.

Advanced Search

Search engines can provide a mechanism for the user to create a complex query. Following the Advanced Search link makes it possible to edit search parameters, specify additional parameters and select the most convenient form for displaying search results. The parameters that can be set during an advanced search in the Yandex and Rambler systems are described below.

Parameter description, with its name in Yandex and in Rambler (some entries list only one name):

    Where to look for keywords (document title, body text, etc.). Yandex: Dictionary filter. Rambler: Search by text...

    What words should or should not be present in the document and how accurate the match should be. Yandex: Dictionary filter. Rambler: Search for query words... Exclude documents containing the following words...

    How far apart keywords should be located. Yandex: Dictionary filter. Rambler: Distance between query words...

    Restriction on document date. Name: Document date...

    Limiting the search to one or more sites. Yandex: Site/Top. Rambler: Search documents only on the following sites...

    Limiting the search by document language. Name: Document language...

    Search for documents containing a picture with a specific name or caption. Name: Image

    Finding pages containing objects. Name: Special objects

    Search results presentation form. Yandex: Issue format. Rambler: Displaying search results

Some search engines (for example, Yandex) allow you to enter queries in natural language. You write what you need to find (for example: ordering train tickets from Moscow to St. Petersburg). The system analyzes the request and produces the result. If you are not satisfied with it, switch to the query language.

At the first stage of the formation of the Internet, the number of its users was extremely small, and the volume of information posted on it was minimal. At that time, the Network was used as a specialized tool and mainly for scientific purposes, so only employees of various laboratories, universities, and military institutions had access to it. Much less attention was paid to searching for information then than in our time.

However, with the increase in volumes of information, the problem of quick search and convenient access to the information resource of interest to the user has arisen. The first solution to this problem was the emergence of website directories. Such directories were groups of links to resources, which were compiled according to the subject of the resources. The founder of such projects was Yahoo, a website that appeared in April 1994. With the increase in the number of sites in the catalog, Yahoo implemented the ability to search the catalog. However, the site was not a full-fledged search engine, since it only allowed searching for those resources that were included in the catalog.

Link directories were a good idea, but their usefulness declined as the number of sites on the Internet grew. Even the most modern directory, containing several million resources, provides access to only a small part of the information stored on the Internet. For example, the largest catalog on the network, the Open Directory Project, contains information about 5 million resources, while the Google search engine database holds over 8 billion documents, and their number is growing every minute.

Chronology of the emergence of search engines

  • In 1994, the first full-fledged search engine appeared - the WebCrawler project.
  • In 1995, two search engines were released at once - AltaVista and Lycos. The first of them remained the main information search engine on the Internet for several years.
  • In 1997, Sergey Brin and Larry Page created Google as part of a research project at Stanford University; today it is the most popular search engine in the world.
  • On September 23, 1997, a project called Yandex was officially presented; today it is the most popular search engine in the Russian-language segment of the Internet (Runet).

Today, there are 3 main international search engines: Google, Yahoo and MSN Search, which operate using their own search algorithms and have their own databases. Other search engines use their technologies and capabilities to varying degrees. For example, the Google database is used by search engines such as Mail.ru and AOL (search.aol.com), and the Yahoo database is used by search engines AllTheWeb, Lycos and AltaVista. In Russia and the CIS countries, the main search engine is Yandex, followed by Rambler and Google, and the search engines Mail.ru, Aport and KM.ru are also widely used.

Basic components of search engines

All search engines work on the same principles, using similar approaches to finding information. In general, a search engine consists of the following components:

  • Web server - responsible for interaction between the user and the other search engine components
  • Spider - a browser-like program that visits Internet resources and downloads web pages
  • Crawler (a “traveling” spider) - a specialized kind of spider that automatically follows all links found on the pages of a resource
  • Indexer - a program that analyzes the pages downloaded by spiders
  • Database - the search engine's store of downloaded and analyzed pages
  • Results delivery system (search engine results engine) - retrieves search results from the database

The specific implementation of these components may differ from one search engine to another (for example, the spider and the crawler may be a single program), but the features described are common to all search engines.

How search engine components work

Spider. The spider program downloads web pages in the same way as a regular user browser. The only difference between them is that the browser displays all information on the screen (graphic, text, audio, etc.), while spider works directly with the html code of the page.

Crawler. A spider responsible for finding new documents that are not yet in the search engine database. The crawler also determines the path along which the spider should move: it selects all the links on the page and follows them.
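Selecting all the links on a page can be sketched with the standard-library HTML parser. This is a minimal illustration, not the method of any particular search engine; the class name, the sample page and the example.com addresses are assumptions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the targets of all <a href="..."> links on a page."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page address.
                    self.links.append(urljoin(self.base_url, value))


# Hypothetical page used only for the illustration.
page = ('<html><body><a href="/about.html">About</a> '
        '<a href="http://example.org/">Other</a></body></html>')
extractor = LinkExtractor("http://example.com/index.html")
extractor.feed(page)
print(extractor.links)
```

Relative addresses are converted to absolute ones so that each extracted link can be handed to the spider as is.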

Indexer. The job of the indexer is to analyze the newly found pages. It breaks them down into separate parts and examines each one. For example, the indexer highlights page elements such as headings, text, service HTML tags, style and structural features, etc.
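How a page might be broken down into parts can be sketched as follows. The sketch separates only the title, the headings and the remaining text; the class name and the sample page are assumptions, and a real indexer analyzes far more features than this.

```python
from html.parser import HTMLParser


class PageAnalyzer(HTMLParser):
    """Split a page into its title, headings and remaining text."""

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.headings = []
        self.text = []
        self._current = ""  # tag whose content is being read

    def handle_starttag(self, tag, attrs):
        self._current = tag

    def handle_endtag(self, tag):
        self._current = ""

    def handle_data(self, data):
        data = data.strip()
        if not data:
            return
        if self._current == "title":
            self.title = data
        elif self._current in self.HEADING_TAGS:
            self.headings.append(data)
        else:
            self.text.append(data)


# Hypothetical page used only for the illustration.
page = ("<html><head><title>Choosing a car</title></head>"
        "<body><h1>Budget</h1><p>Set a price limit first.</p></body></html>")
analyzer = PageAnalyzer()
analyzer.feed(page)
print(analyzer.title, analyzer.headings, analyzer.text)
```

Keeping the parts separate lets later stages treat a word in a title or heading differently from the same word in body text.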

Database. All data found, downloaded and analyzed by the search engine from the Internet is entered into the search engine database.

Search engine results engine. The main element of the search engine, which is of primary interest to site owners and users, is the search results system. It is responsible for ranking pages (sites), that is, it decides which page will be in first place and which in last. Pages are sorted according to a ranking algorithm that is specific to each search engine and is its most closely guarded secret. It is the results delivery system that SEO specialists study, since they have to work with it in order to improve a site’s position in search results.

Web server. Typically, the web server serves an HTML page with a form for entering a search query. It also returns the search results to the user in the form of an HTML page. For each search engine, these pages are designed in a specific corporate style.

A search system is a software and hardware complex designed to search the Internet and respond to a user request, specified in the form of a text phrase (search query), by producing a list of links to sources of information, in order of relevance (in accordance with the request). The largest international search engines are Google, Yahoo and MSN. On the Russian Internet these are Yandex, Rambler and Aport.

Let us describe the main characteristics of search engines:

    Completeness

Completeness is one of the main characteristics of a search system, which is the ratio of the number of documents found by request to the total number of documents on the Internet that satisfy the given request. For example, if there are 100 pages on the Internet containing the phrase “how to choose a car,” and only 60 of them were found for the corresponding query, then the completeness of the search will be 0.6. Obviously, the more complete the search, the less likely it is that the user will not find the document he needs, provided that it exists on the Internet at all.

    Accuracy

Accuracy is another main characteristic of a search engine, which is determined by the degree to which the found documents match the user's request. For example, if the query “how to choose a car” returns 100 documents, 50 of which contain the phrase “how to choose a car” while the rest simply contain these words (for example, “how to choose the right radio and install it in a car”), then the search accuracy is 50/100 = 0.5. The more accurate the search, the faster the user finds the documents he needs, the less “garbage” appears among them, and the less often the found documents fail to correspond to the request.
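Both ratios can be written down directly. The sketch below reproduces the two worked examples above; the function names are assumptions, chosen after the terms under which these measures are usually known in information retrieval (recall for completeness, precision for accuracy).

```python
def recall(relevant_found: int, relevant_total: int) -> float:
    """Completeness: share of all matching documents that were found."""
    return relevant_found / relevant_total


def precision(relevant_found: int, found_total: int) -> float:
    """Accuracy: share of the found documents that really match."""
    return relevant_found / found_total


# 60 of the 100 matching pages on the Internet were found.
completeness = recall(60, 100)

# 50 of the 100 returned documents contain the exact phrase.
accuracy = precision(50, 100)

print(completeness, accuracy)
```

The two measures share a numerator but differ in the denominator: completeness divides by everything that exists, accuracy by everything that was returned.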

    Timeliness

Timeliness is an equally important component of search. It is characterized by the time that passes from the moment documents are published on the Internet until they are entered into the search engine's index database. For example, the day after an interesting piece of news appears, a large number of users turn to search engines with related queries. Less than a day has passed since the news was published, but the main documents have already been indexed and are available for search, thanks to the so-called “fast database” of large search engines, which is updated several times a day.

    Search speed

Search speed is closely related to its load resistance. For example, according to Rambler Internet Holding LLC, today during business hours the Rambler search engine receives about 60 requests per second. Such workload requires reducing the processing time of an individual request. Here the interests of the user and the search engine coincide: the visitor wants to get results as quickly as possible, and the search engine must process the request as quickly as possible, so as not to slow down the calculation of subsequent queries.

    Visibility

Visual presentation of results is an important component of convenient search. For most queries, the search engine finds hundreds, or even thousands, of documents. Due to unclear queries or inaccurate searches, even the first pages of search results do not always contain only the necessary information. This means that the user often has to perform his own search within the found list. Various elements of the search engine results page help you navigate the search results. Detailed explanations of the search results page, for example for Yandex, can be found at the link http://help.yandex.ru/search/?id=481937.

4. Brief history of the development of search engines

In the initial period of Internet development, the number of its users was small, and the amount of available information was relatively small. For the most part, only research staff had access to the Internet. At this time, the task of searching for information on the Internet was not as urgent as it is now.

One of the first ways to organize access to network information resources was the creation of open directories of sites, links to resources in which were grouped according to topic. The first such project was the Yahoo.com website, which opened in the spring of 1994. After the number of sites in the Yahoo directory increased significantly, the ability to search for the necessary information in the directory was added. In the full sense, it was not yet a search engine, since the search area was limited only to the resources present in the catalog, and not to all Internet resources.

Link directories were widely used in the past but have almost completely lost their popularity, since even modern catalogs, huge in volume, cover only a negligible part of the Internet. The largest directory on the network, DMOZ (also called the Open Directory Project), contains information about 5 million resources, while the Google search engine database consists of more than 8 billion documents.

The first full-fledged search engine was the WebCrawler project, launched in 1994.

In 1995, search engines Lycos and AltaVista appeared. The latter has been a leader in the field of information search on the Internet for many years.

In 1997, Sergey Brin and Larry Page created the Google search engine as part of a research project at Stanford University. Google is currently the most popular search engine in the world!

In September 1997, the Yandex search engine, which is the most popular on the Russian-language Internet, was officially announced.

Currently, there are three main international search engines - Google, Yahoo and MSN, which have their own databases and search algorithms. Most other search engines (of which there are a large number) use in one form or another the results of the three listed. For example, AOL search (search.aol.com) uses the Google database, while AltaVista, Lycos and AllTheWeb use the Yahoo database.

In Russia, the main search engine is Yandex, followed by Rambler.ru, Google.ru, Aport.ru, Mail.ru. Moreover, at the moment, Mail.ru uses the Yandex search engine and database.

5. Composition and principles of operation of the search system

Almost all major search engines have their own structure, different from others. However, it is possible to identify the main components common to all search engines. Differences in structure can only be in the form of implementation of the mechanisms of interaction of these components.

Indexing module

The indexing module consists of three auxiliary programs (robots):

Spider is a program designed to download web pages. The spider downloads the HTML code of each page and retrieves all internal links from it. Robots use the HTTP protocol to download pages: the robot sends the server the request “GET /path/document” together with some other HTTP request commands. In response, the robot receives a text stream containing service information and the document itself. For each downloaded page, the following is recorded:

    page URL

    date the page was downloaded

    HTTP header of the server response

    page body (HTML code)
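The exchange described above can be sketched without touching the network. The request text and the splitting of a raw response into the four recorded items are shown under stated assumptions: the function names, the record layout and the sample response are hypothetical, and a real spider would send the request over a socket or an HTTP library.

```python
from datetime import datetime, timezone


def build_request(host: str, path: str) -> str:
    """Build the text of a minimal HTTP GET request."""
    return f"GET {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"


def parse_response(url: str, raw: str) -> dict:
    """Split a raw HTTP response into the items stored for each page."""
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return {
        "url": url,
        "downloaded_at": datetime.now(timezone.utc),
        "status": status_line,
        "headers": headers,
        "body": body,
    }


# Hypothetical server response used only for the illustration.
raw = ("HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n"
       "<html><body>Hello</body></html>")
record = parse_response("http://example.com/path/document", raw)
print(build_request("example.com", "/path/document"))
print(record["status"], record["headers"])
```

The empty line in the response separates the service information (status line and headers) from the document itself, which is why the response is split on the first blank line.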

Crawler (“traveling” spider) is a program that automatically follows all the links found on the page. It extracts all links present on the page and determines where the spider should go next, based either on those links or on a predetermined list of addresses. By following the links it finds, the crawler discovers new documents that are still unknown to the search engine.
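Deciding where the spider should go next can be sketched as a queue of addresses still to be visited plus a set of addresses already seen. The breadth-first order, the function name and the tiny in-memory link graph are assumptions made for the illustration; the source does not say which traversal order real crawlers use.

```python
from collections import deque


def crawl_order(start: str, links: dict[str, list[str]]) -> list[str]:
    """Return the order in which pages are visited, starting from start."""
    queue = deque([start])
    seen = {start}
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for target in links.get(url, []):
            # Only documents not yet known are scheduled for download.
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order


# Hypothetical site: each page maps to the links found on it.
LINKS = {"/": ["/a", "/b"], "/a": ["/b", "/c"], "/b": ["/"], "/c": []}
order = crawl_order("/", LINKS)
print(order)
```

The `seen` set is what keeps the crawler from looping forever when pages link back to each other, as "/b" does to "/" here.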

Indexer (robot indexer) is a program that analyzes web pages downloaded by spiders. The indexer parses the page into its component parts and analyzes them using its own lexical and morphological algorithms. Various page elements are analyzed, such as text, headings, links, structural and style features, special service HTML tags, etc.

Thus, the indexing module allows you to crawl a given set of resources using links, download encountered pages, extract links to new pages from received documents, and perform a complete analysis of these documents.

Database

A database, or search engine index, is a data storage system, an information array in which specially converted parameters of all documents downloaded and processed by the indexing module are stored.
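The source does not describe how the index is organized. One widely used structure is the inverted index, which maps each word to the documents containing it; the sketch below assumes that structure, and the function name and sample documents are hypothetical.

```python
from collections import defaultdict


def build_index(documents: dict[int, str]) -> dict[str, set[int]]:
    """Map every word to the set of ids of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


documents = {
    1: "how to choose a car",
    2: "how to choose a radio",
    3: "car radio installation",
}
index = build_index(documents)
print(sorted(index["car"]), sorted(index["choose"]))
```

With such a structure, answering a query means looking up a few words rather than scanning every stored document.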

Search server

The search server is the most important element of the entire system, since the quality and speed of the search directly depend on the algorithms that underlie its functioning.

The search server works as follows:

    The request received from the user is subjected to morphological analysis. The information environment of each document contained in the database is generated (which will subsequently be displayed in the form of a snippet, that is, text information corresponding to the request on the search results page).

    The received data is passed as input parameters to a special ranking module. The data is processed for all documents, and as a result each document receives its own rating, which characterizes how well the various components of the document stored in the search engine index match the query entered by the user.

    Depending on the user’s choice, this rating can be adjusted by additional conditions (for example, the so-called “advanced search”).

    Next, a snippet is generated, that is, for each document found, the title, a short abstract that best matches the query, and a link to the document itself are extracted from the document table, and the words found are highlighted.

    The resulting search results are transmitted to the user in the form of a SERP (Search Engine Result Page) – a search results page.
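The steps above can be sketched in a heavily simplified form. The rating here is just a count of query-word occurrences, which is an assumption for illustration only (real ranking algorithms are secret and far more elaborate); morphological analysis and advanced-search adjustments are left out, and all names and sample documents are hypothetical.

```python
def score(query_words: set[str], text: str) -> int:
    """Toy rating: how many times the query words occur in the text."""
    words = text.lower().split()
    return sum(words.count(q) for q in query_words)


def snippet(query_words: set[str], text: str, width: int = 6) -> str:
    """Cut a short fragment around the first match and mark found words."""
    words = text.split()
    start = 0
    for i, word in enumerate(words):
        if word.lower() in query_words:
            start = max(0, i - width // 2)
            break
    window = words[start:start + width]
    return " ".join(f"[{w}]" if w.lower() in query_words else w
                    for w in window)


def search(query: str, documents: dict[str, str]) -> list[tuple[str, str]]:
    """Return (document id, snippet) pairs, best rated first."""
    query_words = set(query.lower().split())
    rated = [(score(query_words, text), doc_id)
             for doc_id, text in documents.items()]
    ranked = sorted((pair for pair in rated if pair[0] > 0),
                    key=lambda pair: (-pair[0], pair[1]))
    return [(doc_id, snippet(query_words, documents[doc_id]))
            for _, doc_id in ranked]


documents = {
    "a": "Tips on how to choose a car and a car loan",
    "b": "Choose a radio for the kitchen",
    "c": "Gardening in spring",
}
results = search("choose car", documents)
for doc_id, fragment in results:
    print(doc_id, fragment)
```

The list returned by `search` plays the role of the results page: documents with no match are dropped, the rest are ordered by rating, and each comes with a fragment in which the found words are marked.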
