
Search Engine Scraping


Search Engine Scraping is the process of harvesting URLs, descriptions, or other information from search engines such as Google, Bing, Yahoo, Petal, or Sogou. This is a specific form of screen scraping or web scraping dedicated to search engines only.

Most commonly, larger search engine optimization (SEO) providers depend on regularly scraping keyword results from search engines, especially Google, to monitor the competitive position of their customers' websites for relevant keywords or to check their indexing status.

Search engines like Google have implemented various forms of human detection to block any sort of automated access to their services, with the intent of driving scraper users towards buying access to their official APIs instead.

The process of entering a website and extracting data in an automated fashion is also often called "crawling". Search engines such as Google, Bing, and Yahoo get almost all of their data from automated crawling bots.

Difficulties
Google is by far the largest search engine, with the most users as well as the most advertising revenue, which makes it the most important search engine to scrape for SEO-related companies.

Although Google does not take legal action against scraping, it uses a range of defensive methods that make scraping its results a challenging task, even when the scraping tool realistically spoofs a normal web browser.

  • Google uses a complex system of request rate limiting, which can vary by language, country, User-Agent, and the keywords or search parameters. Rate limiting makes automated access unpredictable, because the thresholds and behavior patterns are not known to the outside developer or user; this is why scrapers randomize their request pacing, as sketched after this list.
  • Network and IP limitations are part of the scraping defense systems as well. Search engines cannot easily be tricked simply by switching to another IP address, yet using proxies is a very important part of successful scraping. The diversity and abuse history of an IP address matter as well.
  • Offending IPs and IP networks can easily be stored in a blacklist database to detect repeat offenders much faster. Because most ISPs assign dynamic IP addresses to customers, such automated bans must be only temporary, so as not to block innocent users.
  • Behavior-based detection is the most difficult defense system. Search engines serve their pages to millions of users every day, which provides a large amount of behavior information. A scraping script or bot does not behave like a real user: aside from non-typical access times, delays, and session durations, the keywords being harvested may be related to each other or include unusual parameters. Google, for example, has a very sophisticated behavior analysis system, possibly using deep learning software to detect unusual patterns of access, and it can detect unusual activity much faster than other search engines can.
  • HTML markup changes: depending on the methods used to harvest the content of a website, even a small change in the HTML can break a scraping tool until it is updated.
  • General changes in detection systems. In recent years, search engines have tightened their detection systems nearly month by month, making reliable scraping more and more difficult; developers need to experiment with and adapt their code regularly.
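
Because the rate limits and behavioral thresholds above are not published, scrapers typically randomize their request pacing instead of querying at a fixed interval. A minimal Python sketch, assuming illustrative delay bounds (the real limits are unknown and vary with the factors above):

    import random
    import time

    def paced_keywords(keywords, min_delay=45.0, max_delay=180.0):
        """Yield keywords one at a time, sleeping a jittered interval between them.

        The delay bounds are illustrative assumptions, not known limits; the
        long-term rates discussed later (a few requests per hour per IP)
        imply far larger gaps in practice.
        """
        for keyword in keywords:
            yield keyword
            # A random interval removes any fixed period from the access pattern.
            time.sleep(random.uniform(min_delay, max_delay))

    # Usage (fetch_results is a hypothetical fetch function):
    # for kw in paced_keywords(["keyword one", "keyword two"]):
    #     fetch_results(kw)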

Detection
When a search engine's defenses suspect that access might be automated, the search engine can react in several ways.

The first layer of defense is a captcha page, where the user is prompted to verify that they are a real person and not a bot or tool. Solving the captcha creates a cookie that permits access to the search engine again for a while; after about one day, the captcha page is removed.

The second layer of defense is a similar error page, but without a captcha; in this case, the user is completely blocked from using the search engine until the temporary block is lifted or the user changes their IP address.

The third layer of defense is a long-term block of an entire network segment. Google has blocked large network ranges for months at a time. This sort of block is likely triggered by an administrator and only happens if a scraping tool is sending a very high number of requests.

All of these forms of detection may also affect a normal user, especially users sharing the same IP address or network range (IPv4 as well as IPv6 ranges).
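
A scraper has to distinguish these layers from the HTTP response it receives. Below is a minimal Python sketch using the requests library; the status code and the "/sorry/" redirect marker reflect commonly observed Google behavior, not documented guarantees, and may change at any time:

    import requests

    def classify_response(resp: requests.Response) -> str:
        """Roughly map a response to the defense layers described above.

        The markers are assumptions based on commonly observed behavior
        (e.g. Google redirecting rate-limited clients to a /sorry/ captcha
        page); they are not documented and may change.
        """
        if resp.status_code == 429:
            return "blocked"        # explicit rate-limit / error page
        if "/sorry/" in resp.url:   # resp.url is the final URL after redirects
            return "captcha"        # captcha interstitial
        if resp.status_code == 200:
            return "ok"
        return "unknown"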

Methods of Scraping Google, Bing, Yahoo, and Other Search Engines
To scrape a search engine successfully, the two major factors are time and volume. The more keywords the user needs to scrape and the shorter the time window for the job, the more difficult scraping will be and the more sophisticated the scraping script or tool needs to be.

Scraping scripts need to overcome a few technical challenges:

  • IP rotation using proxies (proxies should be unshared and not listed in blacklists); see the session sketch after this list.
  • Proper time management: the time between keyword changes and pagination, as well as correctly placed delays (as in the pacing sketch above). Effective long-term scraping rates can vary from 3-5 requests (keywords or pages) per hour up to 100 and more per hour for each IP address or proxy in use. The quality of the IPs, the scraping method, the keywords requested, and the language/country requested can greatly affect the possible maximum rate.
  • Correct handling of URL parameters, cookies, as well as HTTP headers to emulate a user with a typical browser.
  • HTML DOM parsing (extracting URLs, descriptions, ranking position, site links, and other relevant data from the HTML code); see the parsing sketch after this list.
  • Error handling and automated reaction to captchas, blocks, and other unusual responses; see the error-handling sketch after this list.
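
A minimal Python sketch of the first three points, rotating through a proxy pool and sending browser-like headers with the requests library. The proxy addresses and header values are placeholder assumptions; a real browser profile involves many more headers, kept consistent with the claimed User-Agent:

    import itertools
    import requests

    # Placeholder proxies; real setups need unshared, non-blacklisted addresses.
    PROXIES = ["http://10.0.0.1:8080", "http://10.0.0.2:8080"]
    proxy_pool = itertools.cycle(PROXIES)

    # Headers approximating a typical desktop browser (values are illustrative).
    BROWSER_HEADERS = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/120.0.0.0 Safari/537.36",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept": "text/html,application/xhtml+xml",
    }

    def fetch_serp(query: str, start: int = 0) -> requests.Response:
        """Fetch one result page through the next proxy in the pool."""
        proxy = next(proxy_pool)
        session = requests.Session()            # keeps cookies per session
        session.headers.update(BROWSER_HEADERS)
        return session.get(
            "https://www.google.com/search",
            params={"q": query, "start": start},  # "start" drives pagination
            proxies={"http": proxy, "https": proxy},
            timeout=30,
        )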
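For the DOM-parsing step, here is a sketch using BeautifulSoup. Because the markup changes regularly (see Difficulties above), the tag and class names below are illustrative assumptions, not stable identifiers; a production scraper keeps fallback selectors and re-checks them after every markup change:

    from bs4 import BeautifulSoup

    def parse_results(html: str) -> list:
        """Extract URL, title, description, and rank from a result page.

        The "div.result", "h3", and "span.snippet" selectors are assumed
        placeholders; the real markup must be inspected per search engine.
        """
        soup = BeautifulSoup(html, "html.parser")
        results = []
        for rank, block in enumerate(soup.select("div.result"), start=1):
            link = block.find("a", href=True)
            title = block.find("h3")
            snippet = block.find("span", class_="snippet")
            if link and title:
                results.append({
                    "position": rank,
                    "url": link["href"],
                    "title": title.get_text(strip=True),
                    "description": snippet.get_text(strip=True) if snippet else "",
                })
        return results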
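Finally, a sketch of the error-handling loop: on a captcha or block, the scraper backs off and rotates to a fresh proxy rather than retrying immediately from the same address. The attempt count and back-off times are assumed values, and fetch_serp is the hypothetical function from the session sketch above:

    import time

    def fetch_with_retries(query: str, max_attempts: int = 4):
        """Retry a query with exponential back-off, rotating proxies per try."""
        for attempt in range(max_attempts):
            resp = fetch_serp(query)    # picks the next proxy internally
            if resp.status_code == 200 and "/sorry/" not in resp.url:
                return resp             # looks like a normal result page
            # Captcha or block: wait increasingly long before the next proxy.
            time.sleep(60 * 2 ** attempt)   # 1, 2, 4, 8 minutes (assumed)
        raise RuntimeError("query still blocked after %d attempts" % max_attempts)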

Programming Languages
When developing a scraper for a search engine, almost any programming language can be used, although, depending on performance requirements, some languages are preferable.

PHP is a commonly used language for writing scraping scripts for websites or backend services, since it has powerful capabilities built in (DOM parsers, libcURL); however, its memory usage is typically around 10 times that of comparable C/C++ code.

Ruby on Rails as well as Python are also frequently used for automated scraping jobs. For the highest performance, C++ DOM parsers should be considered.

Additionally, Bash scripting can be used together with cURL as a command-line tool to scrape a search engine.
