Review of programs for searching documents and data: software, services, and Internet resources for professional information search

Professional Internet search requires specialized software, as well as specialized search engines and search services.

PROGRAMS

http://dr-watson.wix.com/home – a program for analyzing bodies of text in order to identify entities and the connections between them. The result is a report on the object under study.

http://www.fmsasg.com/ – Sentinel Visualizer, one of the world's best programs for visualizing links and relationships. The company has fully localized its products into Russian and offers a Russian-language hotline.

http://www.newprosoft.com/ – “Web Content Extractor”, powerful, easy-to-use software for extracting data from websites. The company also offers an effective Visual Web Spider.

SiteSputnik – a software package with no direct analogues in the world, which lets you search the visible and invisible Internet and process the results, using whatever combination of search engines the user needs.

WebSite-Watcher – monitors web pages (including password-protected ones), forums, RSS feeds, newsgroups, and local files. It has a powerful filter system; monitoring runs automatically and results are delivered in a user-friendly form. The full-featured version costs 50 euros. Constantly updated.

http://www.scribd.com/ – the world's most popular platform, increasingly used in Russia as well, for posting documents, books, and other materials for free access, with a very convenient search by title, topic, and so on.

http://www.atlasti.com/ – the most powerful and effective tool for qualitative information analysis available to individual users and small and even medium-sized businesses. The program is multifunctional: it combines a unified environment for working with text, tabular, audio, and video files with tools for qualitative analysis and visualization.

Ashampoo ClipFinder HD – an ever-increasing share of the information flow comes as video, so competitive intelligence officers need tools for working with this format. One such product is this free utility. It searches for videos by specified criteria on video hosting sites such as YouTube. The program is easy to use and displays all search results on one page with detailed information: titles, duration, upload time, and so on. There is a Russian interface.

http://www.advego.ru/plagiatus/ – the program was made by SEO specialists but is quite suitable as an Internet intelligence tool. Plagiatus shows the degree of uniqueness of a text, the sources of the text, and the percentage of matching text. It can also check the uniqueness of a specified URL. The program is free.

http://neiron.ru/toolbar/ – an add-on that combines Google and Yandex search and also enables competitive analysis based on assessing the effectiveness of sites and contextual advertising. Implemented as a plugin for Firefox and Chrome.

http://web-data-extractor.net/ – a universal solution for obtaining any data available on the Internet. Extraction from any page is configured in a few mouse clicks: you simply select the data area you want to save, and Datacol automatically builds a rule for extracting that block.

CaptureSaver – a professional Internet research tool. A simply indispensable working program that lets you capture, store, and export any Internet information: not only web pages and blogs but also RSS news, e-mail, images, and much more. It has very broad functionality, an intuitive interface, and a ridiculously low price.

http://www.orbiscope.net/en/software.html – web monitoring system at more than affordable prices.

http://www.kbcrawl.co.uk/ – software for working with, among other things, the “invisible Internet”.

http://www.copernic.com/en/products/agent/index.html – the program searches across more than 90 search engines using more than 10 parameters. It can combine results, eliminate duplicates, filter out broken links, and show the most relevant results. Available in free, personal, and professional versions; used by more than 20 million people.

Maltego – fundamentally new software that lets you establish relationships between people, events, and objects both in real life and on the Internet.

SERVICES

new – web browser with dozens of pre-installed tools for OSINT.

– an effective search engine-aggregator for finding people on major Russian social networks.

https://hunter.io/ – an efficient service for finding and verifying e-mail addresses.

https://www.whatruns.com/ – an easy-to-use yet effective scanner that reveals what a website runs, what does not work on it, and where its security holes are. Also available as a Chrome plugin.

https://www.crayon.co/ is an American budget platform for market and competitive intelligence on the Internet.

http://www.cs.cornell.edu/~bwong/octant/ – host identifier.

https://iplogger.ru/ – a simple and convenient service for determining someone else’s IP.

http://linkurio.us/ is a powerful new product for economic security workers and corruption investigators. Processes and visualizes huge amounts of unstructured information from financial sources.

http://www.intelsuite.com/en – English-language online platform for competitive intelligence and monitoring.

http://yewno.com/about/ is the first operating system for translating information into knowledge and visualizing unstructured information. Currently supports English, French, German, Spanish and Portuguese.

https://start.avalancheonline.ru/landing/?next=%2F – forecasting and analytical services by Andrey Masalovich.

https://www.outwit.com/products/hub/ – a complete suite of offline programs for professional work on the Web.

https://github.com/search?q=user%3Acmlh+maltego – extensions for Maltego.

http://www.whoishostingthis.com/ – search engine for hosting, IP addresses, etc.

http://appfollow.ru/ – analysis of applications based on reviews, ASO optimization, and positions in top charts and search results for the App Store, Google Play, and Windows Phone Store.

http://spiraldb.com/ – a service, implemented as a Chrome plugin, that provides a wealth of valuable information about any electronic resource.

https://millie.northernlight.com/dashboard.php?id=93 – a free service that collects and structures key information by industry and company. Dashboards based on text analysis are available.

http://byratino.info/ – collection of factual data from publicly available sources on the Internet.

http://www.datafox.co/ – CI platform collects and analyzes information on companies of interest to clients. There is a demo.

https://unwiredlabs.com/home - a specialized application with an API for searching by geolocation of any device connected to the Internet.

http://visualping.io/ – a service for monitoring sites and, above all, the photographs and images on them. Even if a photo appears for only a second, it will land in the subscriber's e-mail. Has a Google Chrome plugin.

http://spyonweb.com/ is a research tool that allows for in-depth analysis of any Internet resource.

http://bigvisor.ru/ – the service allows you to track advertising campaigns for certain segments of goods and services, or specific organizations.

http://www.itsec.pro/2013/09/microsoft-word.html – Artem Ageev's instructions on using Windows programs for competitive-intelligence needs.

http://granoproject.org/ – an open-source tool for researchers who track networks of connections between individuals and organizations in politics, economics, crime, and so on. It lets you connect, analyze, and visualize information obtained from various sources and reveal significant connections.

http://imgops.com/ – a service for extracting metadata from graphic files and working with them.

http://sergeybelove.ru/tools/one-button-scan/ – a small online scanner for checking security holes in websites and other resources.

http://isce-library.net/epi.aspx – a service for finding primary sources from a fragment of English text.

https://www.rivaliq.com/ is an effective tool for conducting competitive intelligence in Western, primarily European and American markets for goods and services.

http://watchthatpage.com/ is a service that allows you to automatically collect new information from monitored Internet resources. The service is free.

http://falcon.io/ – a kind of Rapportive for the Web. It is not a replacement for Rapportive but provides additional tools: Rapportive, by contrast, gives a general profile of a person, glued together from social-network data and mentions on the web.

https://addons.mozilla.org/ru/firefox/addon/update-scanner/ – add-on for Firefox. Monitors web page updates. Useful for websites that do not have news feeds (Atom or RSS).

http://agregator.pro/ – aggregator of news and media portals. Used by marketers, analysts, etc. to analyze news flows on certain topics.

http://price.apishops.com/ – an automated web service for monitoring prices for selected product groups, specific online stores and other parameters.

http://www.la0.ru/ is a convenient and relevant service for analyzing links and backlinks to an Internet resource.

www.recordedfuture.com is a powerful tool for data analysis and visualization, implemented as an online service built on cloud computing.

http://advse.ru/ is a service with the slogan “Find out everything about your competitors.” Allows you to obtain competitors' websites in accordance with search queries and analyze competitors' advertising campaigns in Google and Yandex.

http://spyonweb.com/ – the service allows you to identify sites with the same characteristics, including those using the same statistics service identifiers Google Analytics, IP addresses, etc.

http://www.connotate.com/solutions – a line of products for competitive intelligence, managing information flows and converting information into information assets. It includes both complex platforms and simple, cheap services that allow for effective monitoring along with information compression and obtaining only the necessary results.

http://www.clearci.com/ – a competitive intelligence platform for businesses of various sizes, from start-ups and small companies to Fortune 500 companies. Delivered as SaaS.

http://startingpage.com/ – a front-end to Google that lets you search without your IP address being recorded. Fully supports all of Google's search features, including in Russian.

http://newspapermap.com/ – a unique service, very useful for a competitive intelligence officer, that links geolocation to an online media search engine. That is, you select a region, or even a city, or a language, see the place on the map together with a list of online newspapers and magazines, click the appropriate button, and read. Supports Russian; very user-friendly interface.

http://infostream.com.ua/ – “Infostream”, a very convenient news monitoring system from D.V. Lande, one of the classics of Internet search; distinguished by first-class source selection and affordable for any budget.

http://www.instapaper.com/ is a very simple and effective tool for saving the necessary web pages. Can be used on computers, iPhones, iPads, etc.

http://screen-scraper.com/ – allows you to automatically extract all information from web pages, download the vast majority of file formats, and automatically enter data into various forms. Saves downloaded files and pages in databases, performs many other extremely useful functions. Works on all major platforms, has fully functional free and very powerful professional versions.

http://www.mozenda.com/ – a web service for multifunctional web monitoring and delivery of the information the user needs from selected sites; several pricing plans make it affordable even for small businesses.

http://www.recipdonor.com/ - the service allows you to automatically monitor everything that happens on competitors' websites.

http://www.spyfu.com/ – and this is if your competitors are foreign.

www.webground.su – a Runet monitoring service created by Internet search professionals; it covers all the major information and news providers and supports individual monitoring settings for the user's needs.

SEARCH ENGINES

https://www.idmarch.org/ – the best search engine, in terms of quality, over the world's archive of PDF documents. More than 18 million PDF documents are currently indexed, ranging from books to secret reports.

http://www.marketvisual.com/ is a unique search engine that allows you to search for owners and top management by full name, company name, position, or a combination thereof. The search results contain not only the objects you are looking for, but also their connections. Designed primarily for English-speaking countries.

http://worldc.am/ is a search engine for freely accessible photographs linked to geolocation.

https://app.echosec.net/ is a publicly available search engine that describes itself as the most advanced analytical tool for law enforcement and security and intelligence professionals. Allows you to search for photos posted on various sites, social platforms and social networks in relation to specific geolocation coordinates. There are currently seven data sources connected. By the end of the year their number will be more than 450. Thanks to Dementy for the tip.

http://www.quandl.com/ is a search engine for seven million financial, economic and social databases.

http://bitzakaz.ru/ – a search engine for tenders and government orders, with additional paid features.

Website-Finder – makes it possible to find sites that Google indexes poorly. The only limitation is that it searches just 30 websites per keyword. The program is easy to use.

http://www.dtsearch.com/ – a powerful search engine that can process terabytes of text. Works on the desktop, the web, and intranets. Supports both static and dynamic data, and can search inside all MS Office formats. Searches can use phrases, words, tags, indexes, and much more. It is the only affordable federated-search system, and it has both paid and free versions.

http://www.strategator.com/ – searches, filters and aggregates information about the company from tens of thousands of web sources. Searches in the USA, Great Britain, major EEC countries. It is highly relevant, user-friendly, and has free and paid options ($14 per month).

http://www.shodanhq.com/ – an unusual search engine, nicknamed “Google for hackers” immediately after its appearance. It does not search pages; instead it identifies IP addresses and the types of routers, computers, servers, and workstations at a given address, traces DNS server chains, and enables many other capabilities of interest for competitive intelligence.

http://search.usa.gov/ – a search engine over the websites and open databases of all US government agencies. The databases contain a great deal of practically useful information, including for use in our country.

http://visual.ly/ – visualization is increasingly used today to present data, and this is the first infographic search engine on the Web. Alongside the search engine, the portal offers powerful data-visualization tools that require no programming skills.

http://go.mail.ru/realtime – search for discussions of topics, events, objects, and people in real or configurable time. The previously much-criticized Mail.ru search works very effectively here and returns interesting, relevant results.

Zanran – just launched but already working well: the first and only search engine for data, extracting it from PDF files, Excel tables, and data on HTML pages.

http://www.ciradar.com/Competitive-Analysis.aspx is one of the world's best information retrieval systems for competitive intelligence on the deep web. Retrieves almost all types of files in all formats on the topic of interest. Implemented as a web service. The prices are more than reasonable.

http://public.ru/ – Effective search and professional analysis of information, media archive since 1990. The online media library offers a wide range of information services: from access to electronic archives of Russian-language media publications and ready-made thematic press reviews to individual monitoring and exclusive analytical research based on press materials.

Cluuz is a young search engine with ample opportunities for competitive intelligence, especially on the English-language Internet. Allows you not only to find, but also to visualize and establish connections between people, companies, domains, e-mails, addresses, etc.

www.wolframalpha.com – the search engine of tomorrow. In response to a search request, it provides statistical and factual information available on the request object, including visualized information.

www.ist-budget.ru – universal search in databases of government procurement, tenders, auctions, etc.

Introduction

Currently, the Internet unites hundreds of millions of servers that host billions of different sites and individual files containing various types of information. This is a giant repository of information. There are various methods for searching information on the Internet.

Search by known address. The necessary addresses are taken from directories. Knowing the address, you simply enter it in the browser's address bar.

Example 1. www.gov.ru is a server of Russian government authorities.

Constructing an address by the user. Knowing the system for forming Internet addresses, you can construct addresses when searching for Web sites.

To the keyword (the name of a company, enterprise, or organization, or a simple English noun), you add a thematic or geographic domain, relying on your intuition.

Example 2. Commercial Web page addresses:

www.samsung.com (SAMSUNG company),

www.mtv.com (MTV music news).

Example 3. Addresses of educational institutions:

www.ntu.edu (National University, USA).
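The address-construction rule above (keyword plus a thematic or geographic domain) can be sketched in Python. The function name and the default domain list are illustrative assumptions, not part of any real tool:

```python
def candidate_urls(keyword, tlds=("com", "org", "net"), prefix="www"):
    """Return plausible Web addresses to try for a company or organization name."""
    name = keyword.lower().replace(" ", "")  # 'SAMSUNG' -> 'samsung'
    return [f"{prefix}.{name}.{tld}" for tld in tlds]

# The first candidate matches the pattern of the examples above.
print(candidate_urls("Samsung"))
# ['www.samsung.com', 'www.samsung.org', 'www.samsung.net']
```

In practice you would try each candidate in the browser's address bar and keep the ones that resolve.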

Internet search engines

Special information retrieval systems have been developed for searching the Internet. A search engine has an ordinary address and is displayed as a Web page containing special search facilities (a search box, a subject catalog, links). To open a search engine, simply enter its address in the browser's address bar.

According to the statistics service LiveInternet.ru, the distribution of search engines in Russia is approximately as follows:

2) Google – 35.0%

3) Search Mail.ru – 8.3%

4) Rambler – 0.9%

According to the method of organizing information, information retrieval systems are divided into two types: classification (rubricators) and dictionary.

Classification systems (rubricators) are search engines that use a hierarchical (tree) organization of information. When searching, the user browses thematic headings, gradually narrowing the search field (for example, to find the meaning of a word, you first find a dictionary in the classifier and then look up the word in it).



Dictionary search engines are powerful automatic software and hardware systems. They scan information on the Internet and record the location of each piece of information in special index directories. In response to a request, a search is performed over the query string, and the user is offered the addresses (URLs) where the sought word or group of words was found at the time of scanning. Following any of the proposed links takes you to the found document. Most modern search engines are mixed.
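The index directory described above can be sketched as a minimal inverted index: a mapping from each word to the addresses where it was seen at scan time. The sample pages and their text are invented for illustration:

```python
from collections import defaultdict

# Invented pages standing in for scanned Internet content.
pages = {
    "site-a.ru": "search engines index the web",
    "site-b.ru": "web directories organize links",
}

# Build the index directory: word -> set of addresses (URLs).
index = defaultdict(set)
for url, text in pages.items():
    for word in text.split():
        index[word].add(url)

# Answering a one-word query is then a single lookup.
print(sorted(index["web"]))
# ['site-a.ru', 'site-b.ru']
```

A real engine additionally stores positions, ranks results, and normalizes word forms, but the lookup principle is the same.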

The most famous and popular search engines:

There are systems that specialize in searching information resources in various directions.

https://my.mail.ru

https://ru-ru.facebook.com

https://twitter.com

https://www.tumblr.com

https://www.instagram.com, etc.

Subject search engines:

Search software:

Catalogs (thematic collections of links with annotations):

http://www.atrus.ru

Rules for executing requests

Each search engine's Help section provides information on how to search and how to construct a query string. Below is information about a typical, “average” query language.

Simple request

Enter one word that defines the search topic. For example, in the search engine Rambler.ru it is enough to enter: automation.

Documents are found that contain the words specified in the request. All forms of Russian words are recognized; as a rule, letter case is ignored.

You can use the “*” and “?” characters in a query. The “?” sign in a keyword stands for exactly one character, in place of which any letter may appear; the “*” sign stands for any sequence of characters.

For example, the query automat* will find documents that include the words automatic, automation, and so on.
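The wildcard semantics just described (“?” for one character, “*” for any sequence) match the shell-style patterns of Python's standard fnmatch module, which makes for a quick sketch; the word list is invented:

```python
from fnmatch import fnmatch

# Invented candidate words, as if taken from indexed documents.
words = ["automatic", "automation", "automate", "autograph"]

# "*" matches any sequence of characters after the stem.
matches = [w for w in words if fnmatch(w, "automat*")]
print(matches)
# ['automatic', 'automation', 'automate']

# "?" matches exactly one character: 'automats' fits, bare 'automat' does not.
print(fnmatch("automats", "automat?"), fnmatch("automat", "automat?"))
# True False
```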

Complex request

There is often a need to combine keywords to obtain more specific information. In this case, additional linking words, operators, and symbols are used, with parentheses grouping combinations of operators.

For example, the query music & (beatles | битлз) means that the user is looking for documents containing the words music and beatles, or music and битлз.
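A minimal sketch of how such a complex query is evaluated over a document set. The documents, words, and the way & and | map onto Python's and/or are illustrative assumptions; every real engine defines its own syntax:

```python
# Invented mini-collection of documents.
docs = {
    1: "music news about the beatles",
    2: "music charts",
    3: "the beatles discography",
}

def has(word, doc_id):
    """True if the document contains the word."""
    return word in docs[doc_id].split()

# Evaluate: music & (beatles | rolling)
hits = [d for d in docs if has("music", d) and (has("beatles", d) or has("rolling", d))]
print(hits)
# [1]
```

Only document 1 satisfies the conjunction: document 2 lacks both alternatives, and document 3 lacks "music".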

List of search engines and directories

Address Description
www.excite.com Search engine with site reviews and guides
www.alta-vista.com Search server, advanced search capabilities available
www.hotbot.com Search server
www.ifoseek.com Search server (easy to use)
www.ipl.org Internet Public Library, a public library operating within the framework of the World Village project
www.wisewire.com WiseWire - organization of search using artificial intelligence
www.webcrawler.com WebCrawler - search server, easy to use
www.yahoo.com Web catalog and interface for accessing full-text search on the AltaVista server
www.aport.ru Aport - Russian-language search server
www.yandex.ru Yandex - Russian-language search server
www.rambler.ru Rambler - Russian-language search server
Internet Help Resources
www.yellow.com Internet Yellow Pages
monk.newmail.ru Search engines of various profiles
www.top200.ru Top 200 Websites
www.allru.net
www.ru Catalog of Russian Internet resources
www.allru.net/z09.htm Educational Resources
www.students.ru Russian student server
www.cdo.ru/index_new.asp Distance Learning Center
www.open.ac.uk UK Open University
www.ntu.edu US National University
www.translate.ru Electronic text translator
www.pomorsu.ru/guide.library.html List of links to network libraries
www.elibrary.ru Scientific electronic library
www.citforum.ru Digital library
www.infamed.com/psy Psychological tests
www.pokoleniye.ru Website of the Internet Education Federation
www.metod.narod.ru Educational Resources
www.spb.osi.ru/ic/distant Distance learning on the Internet
www.examen.ru Exams and tests
www.kbsu.ru/~book/ Computer Science Textbook
Mega.km.ru Encyclopedias and dictionaries

Professional search for information on the Internet

Searching for information is one of the most common and at the same time most difficult tasks any Internet user faces. However, while for an ordinary member of the online community knowledge of effective information-retrieval methods is a desirable but far from obligatory quality, for information professionals the ability to navigate Internet resources quickly and find the required sources is one of the basic qualifications.

The difficulties that arise when searching for information on the Internet stem from two main factors. First, the number of sources is extremely large: at the end of 2001, even the roughest estimates put the figure at about 7.5 billion documents on servers around the world. Second, the information array on the Internet is not only colossal in volume but also extremely dynamic. In the half minute you spent reading the first lines of this section, about a hundred new or changed documents appeared in the virtual universe, dozens moved to new addresses, and a few ceased to exist forever. The Internet never “sleeps”, just as our planet never sleeps: a wave of human business activity rolls across it continuously, in exact accordance with the change of time zones.

Unlike a stable and controlled collection of documents in a library, on the Internet we are dealing with a gigantic and constantly changing information array, the search for data in which is a very, very complex process. The situation is often very reminiscent of the well-known problem of finding a needle in a haystack, and sometimes information of great value remains unclaimed solely because of the difficulty of finding it.

Most users of global computer networks, amateurs and professionals alike, use the same tools. The results of their searches, and the time spent on them, nevertheless vary greatly.

The purpose of this section is to introduce the tools and methods of information retrieval in detail and to develop stable skills of professional Internet search for all types of data: from texts in any format to video and animation.

The machines must work.
People must think.

The “Professional Internet Search” course is a convenient way to learn to search for and find the necessary information on the Internet competently and effectively.

What is professional search?

The paradox of the Internet is that while there is more and more information, finding the information you need becomes ever harder. Professional search is the efficient search for necessary and reliable information.
In the modern world information becomes capital, and the Internet a convenient means of obtaining it, which is why the ability to find valuable information marks a person as a high-class professional. A professional search should always be effective. Moreover, during a search professionals not only locate where information is stored but also assess the authority of the resource and the relevance, accuracy, and completeness of the published information. Internet heuristics, a set of useful search rules and criteria for selecting and evaluating network information, helps with this.

What will you learn and what will you learn?

Have you been searching and failing to find? Then this course will be extremely useful to you. You will get comprehensive instructions for finding things that are already on the Internet but at first glance seem impossible to find... It is possible! You will learn how to search so that you find. Each lesson combines knowledge and experience, and everything learned is tested in practice.

In the course classes you will learn how the modern Internet develops and how electronic information spreads, how directories are created and how search engines work, why metasearch engines are needed and where the “hidden” Web came from, how forums differ from blogs, and what fundraising is.

In the workshops you will learn to use the query language correctly, choose keywords wisely, find information on the “hidden” Web, locate the images and files you need, gauge public opinion in the blogosphere, search for personal information, and, most importantly, correctly assess the reliability, relevance, and completeness of the information found.

The Internet search course will allow you to significantly develop your cognitive, information and communication abilities.

What topics are covered in the Professional Search course?

The goal of the course is to teach the possibilities and subtleties of modern professional information search on the Internet.

Each lesson (module) includes a lecture, a seminar in forum format, a test on the material covered, and several exercises and search tasks.

The updated course features weekly one-hour webinars: interactive virtual online seminars devoted to discussing the key tasks of professional Internet search.

Each training module comes with useful additional materials on the course topics and print-friendly handouts.

The thematic plan of the course consists of 10 interrelated modules:

1. Internet: history, technology and Internet research.

2. Information search. Search directories.

3. Information retrieval systems. Search engines up close (Google, Yandex and others).

4. Metasearch engines and programs.

5. Internet Help Desk: factual search in encyclopedias, reference books, dictionaries.

6. Bibliographic search: libraries, catalogs, programs.

7. Documentary search: electronic documents, digital libraries, electronic magazines.

8. The "hidden" Web: searching multimedia, databases, knowledge bases and files.

9. Searching news (blogs and forums), contacts, institutions, fundraising.

10. Information Retrieval Strategies: Generalization of Internet heuristics skills.

Why is the course distance learning?

A distance course has several advantages.

Firstly, each lesson is allocated not one or two academic hours per week but a whole week. You can master the lecture material and complete the exercises and search tasks without haste.

Secondly, a distance course is interactive. This means you can always ask the teacher to clarify whatever you consider important. Your question will not go unanswered, and complex search tasks can be discussed as a group so that everyone can compare skills.

Thirdly, you can study at a time convenient for you, without wasting time traveling to classes. Moreover, you can study anywhere in the world with Internet access.

How much does the course cost?

The “Internet Heuristics” course lasts one month and consists of 10 modules; each module is made up of short “quantum” lessons that help you keep the pace needed to master new material. Each module costs only 300 rubles, so the whole course costs only 3,000 rubles. Note that you do not have to buy additional textbooks: the course is fully provided with all the necessary study materials. On successful completion you will receive a Moscow State University certificate for the “Professional Internet Search” course.

If you want to master Internet resourcefulness, choose a convenient time to take the course and sign up (just click the sign-up link opposite a convenient time slot at the top of the page)!

After registration, you will still have time to think and make a final decision. By the way, you can meet

PROFESSIONAL INFORMATION SEARCH ON THE INTERNET

Internet search is an important element of working on the Internet. Hardly anyone knows the exact number of web resources on the modern Internet; in any case, the count runs into the billions. To use the information you need at a given moment, whether for work or entertainment, you first have to find it in this constantly replenished ocean of resources.

For an Internet search to be successful, two conditions must be met: queries must be well formulated, and they must be asked in the right places. In other words, the user must, on the one hand, be able to translate his search interests into the language of a search query and, on the other, know the search engines and available search tools well, with their advantages and disadvantages, so as to choose the most suitable tool in each specific case.

Currently there is no single resource that satisfies all Internet search requirements. Therefore, anyone who takes search seriously inevitably has to use different tools, applying each where it fits best.

The basic Internet search tools can be divided into the following main groups:

Search engines;

Web directories;

Help Resources;

Local programs for searching the Internet.

The most popular search tools are search engines (Search Engines). The top three on a global scale are quite stable: Google, Yahoo! and Bing. In many countries, local search engines optimized for local content join this list. With their help you can, in theory, find any specific word on the pages of many millions of sites. From the user's point of view, the main disadvantage of search engines is the inevitable presence of information noise in the results. This is the customary name for results that make it into the list for one reason or another but do not correspond to the request.

Despite many differences, all Internet search engines work on similar principles and, from a technical point of view, consist of similar subsystems. The first structural part of a search engine is the special programs used to automatically find and then index web pages. Such programs are usually called spiders, or bots. They examine the code of web pages, find the links on them, and thereby discover new web pages. There is also an alternative way for a site to get into the index: many search engines offer resource owners the opportunity to add a site to the database themselves. Either way, the web pages are then downloaded, analyzed and indexed: structural elements are identified, keywords are found, and their connections with other sites and web pages are determined. Other operations are also performed, the result of which is the search engine's index database. This database is the second main element of any search engine. Currently there is no single, absolutely complete index database containing information about all Internet content. Because different search engines use different programs to find web pages and build their indexes with different algorithms, their index databases can vary significantly. Some sites are indexed by several search engines, but there is always a certain percentage of resources included in the database of only one. The presence of such an original, non-overlapping part of the index in each search engine leads to an important practical conclusion: if you use only one search engine, even the largest, you will inevitably lose a certain percentage of useful links.
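The spider-and-index cycle described above can be sketched in a few lines of Python. This is an illustrative model, not the code of any real engine: the in-memory `web` dictionary stands in for HTTP fetches, and a real indexer would also extract keywords, structure and cross-site connections.

```python
from html.parser import HTMLParser
from collections import deque

class LinkExtractor(HTMLParser):
    """Collects href targets from <a> tags, as a search-engine spider would."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start, pages):
    """Breadth-first discovery of pages, following links found in each page."""
    seen, queue, index = set(), deque([start]), {}
    while queue:
        url = queue.popleft()
        if url in seen or url not in pages:
            continue
        seen.add(url)
        parser = LinkExtractor()
        parser.feed(pages[url])
        index[url] = parser.links   # a real engine would also index keywords here
        queue.extend(parser.links)
    return index

# A tiny in-memory "web" standing in for real HTTP fetches.
web = {
    "/home":  '<a href="/news">news</a> <a href="/about">about</a>',
    "/news":  '<a href="/home">home</a>',
    "/about": '',
}
index = crawl("/home", web)
print(sorted(index))   # all three pages are discovered from the start page
```

Starting from a single known page, the spider reaches every page connected to it by links, which is exactly why unlinked ("invisible") pages never enter the index.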

The next part of an Internet search engine is the actual search and sorting programs. These programs solve two main tasks: first, they find the pages and files in the database that match the incoming request, and then they sort the resulting data according to various criteria. Success in achieving search goals largely depends on the effectiveness of their work.

The last element of an Internet search engine is the user interface. Besides the usual requirements of aesthetics and convenience that apply to any website, search engine interfaces face another important requirement: they must offer various tools for composing and refining queries, as well as for sorting and filtering results. The advantages of search engines are excellent coverage of sources, comparatively quick updating of the database, and a good choice of additional functions.

The main tool for working with search engines is a query.

Special applications installed on the local computer are also used for Internet search. These range from simple programs to quite complex data search and analysis suites. The most common are search plugins for browsers, browser toolbars designed to work with a specific search service, and metasearch packages with capabilities for analyzing results.

Web directories are resources in which sites are divided into thematic categories. While the user works with search engines only through queries, in a directory it is possible to browse entire thematic sections. The second fundamental difference from automatic search engines is that, as a rule, people are directly involved in filling directories, reviewing resources and assigning each site to one category or another. Web directories are usually divided into universal and thematic. Universal ones try to cover as many topics as possible; you can find anything there, from websites about poetry to computer resources. In other words, their search breadth is maximal. Thematic directories specialize in a specific topic, providing maximum search depth at the cost of breadth of coverage.

The advantages of directories are the comparatively high quality of resources, since every site in them is reviewed and selected by a person. Thematic grouping conveniently arranges sites on similar subjects. This mode of operation is good for discovering sites that are new to you on a topic of interest, and it is more precise than using a search engine. Web directories are recommended for a first acquaintance with a subject area, as well as for vague queries: you can “wander” through the sections of the directory and determine more precisely what exactly you need.

The disadvantages of web directories are well known. First of all, the database is replenished slowly, since including a site in a directory requires human participation. In terms of speed, a web directory is no rival to search engines. In addition, web directories are significantly inferior to search engines in database size.

When talking about Internet search, we cannot ignore a number of terms closely related to this area that are often used to describe and evaluate search engines, for example the breadth and depth of Internet search. A broad search is one that covers as many sources of information as possible; here even a mere mention of a site relevant to the request is considered sufficient. Search depth refers to how thoroughly each specific resource is indexed and subsequently searched. For example, many search engines index different sites differently: large and popular sites are indexed as fully as possible, with robots trying not to miss a single page, while on other sites only the title page and a couple of content pages may be indexed. These circumstances naturally affect subsequent searches. Deep search works on the principle that it is better to include unnecessary information in the results than to miss any data relevant to the search topic.

Quite often you will come across the concepts of global and local Internet search. A local Internet search takes into account the user's geographic location and gives preference to results connected in some way to a specific country or locality. A global search ignores this information and covers all available resources.

When composing a query, Internet search engines offer various search modes. The typical modes found on most engines are simple and advanced search. A simple search allows you to specify only one search criterion per request. Advanced search makes it possible to build a query from several conditions, linking them with logical operators.

Various filters are used to refine search queries. Filters are auxiliary means of composing a query that do not concern the content of the query conditions but limit the search results by some formal feature. For example, when using a file type filter, the user does not give the system any information related to the topic of his request, but simply limits the results to the file type specified in the request conditions.
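A filter of this kind is purely formal: it never touches the topical part of the query. A minimal Python sketch of the idea (the `apply_filters` function and the sample result list are invented for illustration):

```python
def apply_filters(results, filetype=None, site=None):
    """Filter a result list by formal features (file type, host), not by topic."""
    out = results
    if filetype:
        out = [r for r in out if r["url"].lower().endswith("." + filetype)]
    if site:
        out = [r for r in out if site in r["url"]]
    return out

results = [
    {"url": "http://example.com/report.pdf"},
    {"url": "http://example.com/report.html"},
    {"url": "http://other.org/data.pdf"},
]
print(len(apply_filters(results, filetype="pdf")))                      # 2
print(len(apply_filters(results, filetype="pdf", site="example.com"))) # 1
```

Note that neither filter knows anything about what the user is actually looking for; they only narrow the already-retrieved set by formal attributes.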

For most users, universal search engines are the main, and often the only means of Internet search. They offer good coverage of sources, as well as a set of tools sufficient to solve basic search problems.

The market for universal search engines is quite large. We tried to analyze the most famous search engines, and presented the results in Table 1.

When choosing a universal search engine, the quality of the resources it finds plays an important role. You can determine the preferred search engine for specific tasks using the “marker method”. Its essence is as follows: first a thematic search query is compiled, then a group of experts in the field is surveyed to identify the best, in their opinion, Internet resources on the chosen topic. Based on the survey, a list of marker sites is generated that are guaranteed to be relevant to the request and contain high-quality information. The query is then submitted to the search engines under test. The logic of the assessment is simple: the higher the marker sites appear in the search results, the better that engine suits searches on the test topic.
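The marker method lends itself to a simple numeric form: score each engine by the average position of the marker sites in its results, penalizing markers it misses entirely (lower score is better). A sketch in Python; the function name, penalty value and sample result lists are assumptions for illustration:

```python
def marker_score(serp, markers, penalty=100):
    """Average rank of marker sites in one engine's result list (lower is better).
    Markers absent from the list get a fixed penalty rank."""
    ranks = []
    for m in markers:
        ranks.append(serp.index(m) + 1 if m in serp else penalty)
    return sum(ranks) / len(ranks)

markers = ["siteA", "siteB"]
engine1 = ["siteA", "x", "siteB", "y"]   # markers at positions 1 and 3
engine2 = ["x", "y", "z", "siteA"]       # siteB missing entirely
print(marker_score(engine1, markers))    # 2.0
print(marker_score(engine2, markers))    # (4 + 100) / 2 = 52.0
```

Here engine1 would be preferred for the test topic, since its marker sites sit much higher in the results.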


By mid-2015 the global Internet had already connected 3.2 billion users, almost 43.8% of the planet's population. For comparison: 15 years earlier only 6.5% of the population were Internet users, so their share has grown more than six times! But what is more impressive is not the quantitative but the qualitative expansion of Internet technologies into various areas of human activity: from the global communications of social networks to the household Internet of Things. The mobile Internet gave users the opportunity to be online outside the office and home: on the road, out of town, in nature.
Currently there are hundreds of systems for searching information on the Internet. The most popular are available to the vast majority of users because they are free and easy to use: Google, Yandex, Nigma, Yahoo!, Bing... For more experienced users there are “advanced search” interfaces and specialized searches across social networks, news flows and classified ads... But all these wonderful search engines have a significant drawback, which I already noted above as an advantage: they are free.
If investors pour billions of dollars into the development of search engines, a natural question arises: where do they make their money?
They make money, in particular, by returning in response to user requests not so much the information that would be useful from the user's point of view as the information the owners of the search engines consider useful for the user. This is done by manipulating the order of the result lists returned for users' search queries: both open advertising of certain Internet resources and hidden manipulation of the relevance of answers in the commercial, political and ideological interests of the search engines' owners.
Therefore, among professionals who search for information on the Internet, the problem of the pertinence of search engine results is very relevant.
Pertinence is the correspondence of the documents found by an information retrieval system to the user's information need, regardless of how fully and accurately that need is expressed in the text of the query itself. It is the ratio of the amount of useful information to the total amount of information received; roughly speaking, search efficiency.
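As a ratio, pertinence can be computed directly once the user has judged which results were actually useful. A minimal illustration in Python (the function name and data are invented for the example):

```python
def pertinence(results, useful):
    """Share of returned results the user actually found useful."""
    if not results:
        return 0.0
    return len([r for r in results if r in useful]) / len(results)

returned = ["a", "b", "c", "d"]     # what the engine gave back
print(pertinence(returned, useful={"a", "c"}))   # 2 useful out of 4 -> 0.5
```

A pertinence of 0.5 means half of what came back was information noise; professional search tools aim to push this ratio toward 1.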
Specialists performing qualified information searches on the Internet have to make an effort to filter the results, weeding out unnecessary information “noise”. For this, professional-level search tools are used.
One such professional system is the Russian program FileForFiles & SiteSputnik (SiteSputnik), developed by Alexey Mylnikov from Volgograd.

"The FileForFiles & SiteSputnik program (SiteSputnik) is designed to organize and automate professional search, collection and monitoring of information posted on the Internet. Special attention is paid to receiving new incoming information on topics of interest. Several information analysis functions are implemented."


1. Monitoring and categorization of information flows


First, a few words about monitoring information flows, a special case of which is media and social network monitoring:

  • the user specifies the Sources that may contain the necessary information and the Rules for selecting this information;

  • the program downloads fresh links from the Sources, cleans their content of garbage and repetitions, and arranges it into Sections according to the Rules.

  • To see a simple but real monitoring process live, involving 6 sources and 4 rubrics:

  • open the Demo version of the program;

  • then, in the window that appears, click the Together button;

  • and when SiteSputnik executes this Project in real time, you will see:
    — in the “Clean Stream” list, all the new information from the Sources,
    — in the “Post-request” section, only the economic and financial news that satisfies the rule,
    — in the Rubrics “About the President”, “About the Premier” and “Central Bank”, information related to the corresponding objects.

  • In real Projects you can use almost any number of Sources and Rubrics.
    You can create your first working Projects in a few hours and improve them during operation.
    The described information processing is available in the SiteSputnik Pro+News package and higher.

2. Simple and batch search, information collection

To get acquainted with the capabilities of SiteSputnik Pro (the basic version of the program):

  • open the Demo version of the program;

  • enter your first request, for example your full name, as I did, and click the Search button.


  • The program (see the table that SiteSputnik built) polled 7 sources in a few seconds, opened 24 search pages in them, found 227 relevant links, removed duplicates, and combined the remaining 156 unique links into a merged list.

    Source        Ordered  Downloaded  Found  Search   Search      New    New
                  pages    pages       links  time     efficiency  links  efficiency
    Yandex           5        5         50   0:00:05     32%        0      0
    Google           5        5         44   0:00:03     28%        0      0
    Yahoo            5        5         50   0:00:05     32%        0      0
    Rambler          5        4         56   0:00:07     36%        0      0
    MSN (Bing)       5        3         23   0:00:04     15%        0      0
    Yandex.Blogs     5        1          1   0:00:01      1%        0      0
    Google.Blogs     5        1          3   0:00:01      2%        0      0
    Total:          35       24        227   0:00:26                0      0

    Total: number of unique links - 156, duplicate links - 46%.

  • (!) Repeat your request after a few hours or days, and you will see only the new links that appeared in the Sources during that time. The last two columns of the table show how many new links each Source brought and its efficiency in terms of “novelty”. When a query is executed repeatedly, a list containing only new links is created, relative to all previous executions of that request. It would seem an elementary and necessary function, but the author is not aware of any other program in which it is implemented.

  • (!!) The described capabilities are supported not only for individual requests but also for entire request packages:

    The package you see consists of seven different queries that collect information about Vasily Shukshin from several Sources, including search engines, Wikipedia, exact search in Yandex news, metasearch, and search for mentions on TV and radio stations. The TV and Radio script includes Channel One, TV Russia, NTV, RBC TV, Echo of Moscow, the Mayak radio company, and other sources of information. Each Source has its own search or browsing depth in pages, listed in the third column.

    Batch search allows you to perform comprehensive collection of information on a given topic with one click.
    A separate list of new links, on repeated executions of the package, will contain only links that were not previously found.
    You do not need to remember what you asked the Internet, when, and what it answered: everything is automatically saved in libraries and in the program's databases.
    I repeat that the capabilities described in this paragraph are fully included in the SiteSputnik Pro package.


  • More details in the instructions: SiteSputnik Pro for beginners.
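The "only new links" behaviour described in this section amounts to keeping a history of every link ever returned and reporting the set difference on each repeated run. A minimal Python sketch of the idea (not SiteSputnik's actual code; all names are illustrative):

```python
def new_links(run_results, history):
    """Return only links never seen in any previous run, then record them."""
    fresh = [u for u in run_results if u not in history]
    history.update(run_results)
    return fresh

seen = set()                                   # persists across runs
print(new_links(["u1", "u2"], seen))           # first run: everything is new
print(new_links(["u2", "u3"], seen))           # second run: only 'u3' is new
print(new_links(["u1", "u3"], seen))           # nothing new -> empty list
```

In a real implementation the history would be stored in a database so that novelty is judged against all previous executions, not just the last one.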

3. Objects and search monitoring

Quite often the user faces the following task: find out what the Internet contains about a specific object, a person or a company. For example, when hiring a new employee or dealing with a new counterparty, you always know the full name, company name and telephone numbers, often the INN, OGRN or OGRNIP, and you may also have ICQ, Skype and other data. Then, using the special SiteSputnik function "Collecting information about the object" (the SiteSputnik Pro+Objects edition):

You enter the data you know and, with one click of the mouse, carry out an accurate and full search for links containing the specified information. The search runs on several search engines at once, uses all the details at once, and tries several possible ways of writing each detail: remember how many different ways a phone number can be written down. After a while, without doing any boring routine work, you receive a list of links cleared of repetitions and, most importantly, ordered by relevance to the object you are looking for. Relevance (significance) is achieved because the first links in SiteSputnik's results are those containing the largest number of the details you specified, not those pushed up the search engine rankings by webmasters.
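The idea of ranking by the number of matched details, including different spellings of the same detail, can be sketched as follows. This is a simplified illustration, not the program's actual logic; the phone formats and helper names are assumptions:

```python
def phone_variants(digits):
    """A few common ways a Russian mobile number might be written (illustrative)."""
    d = digits   # e.g. "9161234567" - the ten digits after the country code
    return {
        "+7" + d,
        "8" + d,
        "+7 (%s) %s-%s-%s" % (d[:3], d[3:6], d[6:8], d[8:]),
    }

def score_page(text, details):
    """Rank pages by how many known details (in any spelling) they mention."""
    return sum(1 for variants in details if any(v in text for v in variants))

details = [phone_variants("9161234567"), {"Ivanov I.I.", "Ivanov Ivan"}]
page = "Contact Ivanov Ivan, tel. +7 (916) 123-45-67"
print(score_page(page, details))   # both details matched -> 2
```

Pages mentioning more of the supplied details score higher, regardless of how any single search engine happened to rank them.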

Important.
The SiteSputnik program is better than other programs at extracting real, rather than official, information about the Object. For example, a mobile operator's official database may record that a phone belongs to Vasily Terekhin, while the Internet shows that in 2013 someone named Alexander used this phone to sell a Ford Focus, which is additional information to consider.

Search monitoring.
Search monitoring means the following: if you need to track the appearance of new links for a given object or an arbitrary package of queries, you simply repeat the corresponding search periodically. As with a simple request, SiteSputnik creates a "New" list containing only those links that were not found in any of the previous searches.

Search monitoring is interesting not only in itself. It can be used for monitoring the media, social networks and other news sources, as mentioned in paragraph 1 above. Unlike other programs, which can obtain new information only from RSS feeds, SiteSputnik can also use the searches built into websites and search engines. It is also possible to emulate (create your own) RSS feeds from arbitrary pages, and even to emulate an RSS feed for a request or a whole batch of requests.


  • To get the most out of the program, use its main functions, namely:

    • request packages, packages with parameters, the Assembler, and the "Analytical merging" operation over the results of several tasks; where necessary, apply the basic search functions on the invisible Internet;

    • connect your own sources to the information sources built into the program: other search engines and searches built into sites, existing RSS feeds and your own RSS feeds created from arbitrary pages; use the function for finding new sources;

    • use the available kinds of monitoring: media, social networks and other sources, monitoring of comments to news and messages, and tracking the appearance of new information on existing pages;

    • engage Rubrics, external functions, the Task Scheduler, mailing lists, multiple computers and the Project Instructor; set an alarm to notify you of significant events; and use the other functions listed below.



4. The SiteSputnik program: options and features

The SiteSputnik program is constantly being improved in one direction: "I need to find everything, with a guarantee".
"Interrogation software for the Internet" is another user's definition of the program's purpose.

A. Functions for searching and collecting information.

• Request package - execution of several queries at once, with the search results combined or kept separate. When generating the combined result, repeatedly found links are removed. More details about packages can be found in the introduction to SiteSputnik, and visually in the videos on joint and separate execution of requests. There are no analogues in domestic or foreign developments.

• Packages with parameters. Any queries and query packages designed for standard search tasks, for example search by phone number, full name or e-mail, can be parameterized, saved and executed from a library of ready-made queries with the substitution of actual parameter values. Each package with parameters is in effect its own advanced search form; it can use not one but several search engines, and forms of very complex functional purpose can be built. It is extremely important that users can create forms themselves, without the participation of the program's author or a programmer. The instructions describe this very simply; more details in a separate publication on search parameterization and on the forum, and visually in the videos: searching at once for all ways of writing a mobile phone number, and for several ways of writing an e-mail address. There are no analogues.

• Assembler NEW - assembling a search task from several ready-made ones: requests, request packages and parameter packages. Packages may contain other packages in their text; the depth of nesting is unlimited. You can create several search tasks, for example about several legal entities and individuals, and execute them simultaneously. More details on the forum and in a separate publication about the Assembler, and visually in the video. There are no analogues.

• Metasearch - execution of a specific request on several search engines simultaneously, with a given search "depth" for each of them. Metasearch is possible using the built-in search engines, which include Yandex, Rambler, Google, Yahoo, MSN (Bing), Mail, Yandex and Google blogs, as well as connected search tools. Working with multiple search engines looks like working with a single one: re-found links are deleted. Metasearch across three connected social networks, VKontakte, Twitter and YouTube, is shown in the video.

• Metasearch on a site - combining site search in Google, Yahoo, Yandex and MSN (Bing). Shown in the video.

• Metasearch in office documents - combining search in PDF, XLS, DOC, RTF, PPT and FLASH files in Google, Yahoo, Yandex and MSN (Bing). You can choose any combination of file formats.

• Metasearch for cached copies of links in Yandex, Google, Yahoo and MSN (Bing). A list is compiled in which each item contains all the snippets found for that link by each search engine. There are no analogues.

• Deep search for Yandex, Google and Rambler allows you to combine into one list all links from the regular search together with all links from the lists "More from the site", "Additional results from the site" and "Search on the site (Total ...)", respectively. Read more about deep search on the forum. There are no analogues.

• Accurate and complete search. This means the following. On the one hand, each query is executed on that source, and only on that source, in whose query language it is written: this is exact search. On the other hand, there can be an arbitrary number of such requests and sources: this provides full search. Read more in a separate post about procedural search. There are no analogues.

• Searching the invisible Internet.

    It includes the following basic features:

    - a special package of requests, which the user can improve,
    - search for invisible links using a spider,
    - search for invisible links in the vicinity of a visible link or folder, by "image and likeness",
    - special searches for open folders,
    - search for invisible links and folders with standard names using special dictionaries,
    - use of the searches built into sites themselves.

    More details in a separate publication on SiteSputnik Invisible. The basic functions are "well known in narrow circles", but the method of their application has no analogues. The essence of the method is to build a map of the site as it is visible from the Internet (in other words, to materialize the visible Internet), and to search for invisible links only on the basis of the visible links, relative to them. Links that are already visible are not searched for again by "invisible" methods.
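The "standard names from special dictionaries" technique can be illustrated in a few lines of Python: starting from a visible folder, candidate invisible URLs are generated from a wordlist and would then be probed over HTTP. This sketch only builds the candidates; the wordlist and function name are invented examples:

```python
def candidate_urls(visible_folder, names):
    """Guess 'invisible' neighbours of a visible folder using a dictionary of
    standard file and folder names; the candidates would then be probed over HTTP."""
    base = visible_folder.rstrip("/")
    return [base + "/" + name for name in names]

wordlist = ["admin/", "backup/", "old/", "index.bak"]
for url in candidate_urls("http://example.com/docs/", wordlist):
    print(url)
```

This matches the method described above: candidates are generated only relative to links already known to be visible, rather than by blind enumeration of the whole site.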

B. Information monitoring functions.

• Monitoring the appearance of new links on the Internet on a given topic. The appearance of new links can be monitored using entire request packages involving any of the search methods mentioned above, rather than the front pages of individual search engines. Union and intersection of the new links from multiple separate searches are implemented. More details in the publication on monitoring (see § 1) and on the forum. There are no analogues.

• Collective information processing. Creation of a corporate or professional network for collective collection, monitoring and analysis of information. The participants and creators of such a network are corporation employees, members of a professional community, or interest groups. The geographic location of the participants does not matter. More details in a separate publication about organizing a network for collective collection, monitoring and analysis of information.

• Monitoring links (web pages) to detect changes in their content. Beta version. Detected changes are highlighted with color and special marks. More details in a separate publication on monitoring (see § 2 and 3).

C. Information analysis functions.

• Rubrics for materials, already described above. More details in a separate publication about Rubrics. The rules for rubric entries allow you to specify keywords and the distance between them, use logical "AND", "OR" and "NOT", and apply a multi-level bracket structure and dictionaries (insert files), to which logical operations can also be applied.
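A rule of this kind (logical AND, OR and NOT over keywords) can be modelled in a few lines of Python. This sketch ignores distance constraints and nested brackets, which the real rule language also supports; the function name and sample rule are assumptions:

```python
def matches(text, all_of=(), any_of=(), none_of=()):
    """Rule check in the spirit of rubric entry rules: logical AND, OR and NOT
    over keywords (distance and bracket structure are omitted here)."""
    t = text.lower()
    return (all(w in t for w in all_of)
            and (not any_of or any(w in t for w in any_of))
            and not any(w in t for w in none_of))

news = "The Central Bank raised the key rate"
print(matches(news, all_of=["central bank"], any_of=["rate", "ruble"]))  # True
print(matches(news, all_of=["central bank"], none_of=["rate"]))          # False
```

Each incoming item would be tested against every rubric's rule, and filed into each rubric whose rule it satisfies.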

• VF technology - almost arbitrary extension of the categorization capabilities through external functions that integrate organically into the rules for rubric entries and can be implemented by a programmer independently, without the participation of the program's author.

• Numerical analysis of rubric occupancy, an alarm facility, and notification of significant events by highlighting rubrics in color and/or sending an alarm report by e-mail.

• Factual relevance. There is an option to arrange links in an order close to their significance for the problem being solved, bypassing the tricks of webmasters who use various ways of boosting website rankings in search engines. This is achieved by analyzing the results of several "diverse" queries on a given topic. Links containing the maximum amount of the required information rise to the top in the literal sense of the word. Read more in the description of how to find the optimal supplier and on the forum. There are no analogues.
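One way to read "analyzing the results of several diverse queries" is: a link returned by many independent formulations of the same information need is likely to be genuinely relevant, whatever its position in any single engine's ranking. A Python sketch of that idea (an assumption about the approach, not the program's actual algorithm):

```python
from collections import Counter

def factual_rank(result_lists):
    """Order links by how many of several 'diverse' queries returned them:
    a page matching many independent formulations likely carries the facts sought."""
    hits = Counter()
    for links in result_lists:
        hits.update(set(links))   # count each query at most once per link
    return [u for u, _ in hits.most_common()]

q1 = ["a", "b", "c"]
q2 = ["b", "c"]
q3 = ["c", "d"]
print(factual_rank([q1, q2, q3]))   # 'c', returned by all three queries, ranks first
```

Because the score depends on agreement across queries rather than on any one ranking, SEO tricks that boost a page in a single results list have little effect on it.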

• Calculating object relationships - searching for links, resources (sites), folders and domains on which objects are mentioned simultaneously. The most common objects are people and companies. All the SiteSputnik tools mentioned on this page can be used to search for connections, which significantly increases the efficiency of the work. The operation is performed on any number of objects. More details in the introduction to the program, as well as in the description of the new "objects and their connections" feature. There are no analogues.

• Formation, merging and intersection of information flows on a variety of topics, and comparison of flows. More details in a separate post on flows.

• Building web maps of sites, resources, folders and searched objects, based on the links belonging to the site found on the Internet with the help of Google, Yahoo, Yandex, MSN (Bing) and Altavista. Experts can find out whether "extra" information on their websites is visible from the Internet, and research competitors' websites in the same way. A web sitemap is a materialization of the visible Internet. More details in a separate publication about building web maps, and visually in the video. There are no analogues.

• Finding new sources of information on a given topic, which can then be used to track the emergence of new relevant information. More details at.

D. Service functions.

• The Task Scheduler provides scheduled work: it performs specified program functions at a given time. More details in a separate publication about the Scheduler.

• The Project Instructor NEW is an assistant for creating and maintaining Projects for searching, collecting, monitoring and analyzing information (categorization and signaling). More details on the forum.

• Automatic archiving. All the results of your work are automatically stored in databases: requests, request packages, search and monitoring protocols, and the results of any of the other functions listed above. Work can be structured by topics and subtopics.

• The database supports sorting, simple search and custom search by SQL query; for the latter there is a wizard for composing SQL queries. Using these tools you can find and review the work you did yesterday, last month or a year ago, use a topic as a search criterion, or define another criterion based on the contents of the database.

• Technical limitations of search engines. Some limitations, such as the length of the query string, can be overcome: the program executes not one but several queries, combining the search results or keeping them separate. You can read about a way to overcome the violation of the law of additivity in major search engines. For a single word or a phrase enclosed in quotes, case-sensitive search in search engines is implemented, in particular search by abbreviation.
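Overcoming a query-length limit by splitting one long OR-query into several shorter ones, whose results are then merged and deduplicated, can be sketched as follows (an illustrative Python function; the limit and terms are invented):

```python
def split_or_query(terms, max_len):
    """Split one long OR-query into several shorter ones that each respect an
    engine's query-length limit; the parts' results would then be merged."""
    batches, current = [], []
    for term in terms:
        candidate = " OR ".join(current + [term])
        if current and len(candidate) > max_len:
            batches.append(" OR ".join(current))
            current = [term]
        else:
            current.append(term)
    if current:
        batches.append(" OR ".join(current))
    return batches

terms = ["alpha", "beta", "gamma", "delta", "epsilon"]
for q in split_or_query(terms, max_len=20):
    print(q)   # three sub-queries, each at most 20 characters long
```

Merging the sub-queries' results with duplicate removal then behaves as if the engine had accepted the original long query.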

• Built-in browser. Page navigator. A multicolor marker for highlighting key and arbitrary words. Bilisting and N-listing of generated documents.

• Unloading news feeds into a tabular view designed for import into Excel, MySQL, Access, Kronos and other applications.


5. Installation and launch of the program; computer requirements.

To install and run the program:

  • Download the file and copy the FileForFiles folder from it to your hard drive, for example to D:\;

  • the Demo version of the program will be installed and will open;

  • the program will work on any computer running any version of Windows.