
Web Scraping

    Wilson Levi



    Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. The web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or a web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.

    Web scraping a web page involves fetching it and then extracting data from it. Fetching is the downloading of a page (which a browser does when a user views a page). Web crawling is therefore a main component of web scraping: it fetches pages for later processing.

    Once a page has been fetched, extraction can take place. The content of the page may be parsed, searched, reformatted, copied into a spreadsheet, or loaded into a database. Web scrapers typically take something out of a page to make use of it for another purpose somewhere else.

    An example would be finding and copying names and telephone numbers, companies and their URLs, or email addresses to a list (contact scraping).

    Web scraping is used for contact scraping and as a component of applications for web indexing, web mining and data mining, online price-change monitoring and price comparison, product review scraping (to watch the competition), gathering real estate listings, weather data monitoring, website change detection, research, tracking online presence and reputation, web mashups, and web data integration.

    Web pages are built using text-based mark-up languages (HTML and XHTML) and frequently contain a wealth of useful data in text form. However, most web pages are designed for human end-users, not for ease of automated use. As a result, specialized tools and software have been developed to facilitate the scraping of web pages.

    Newer forms of web scraping involve monitoring data feeds from web servers. For example, JSON is commonly used as a transport mechanism between the client and the web server.
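
    As a small sketch of this idea, the snippet below reads a JSON feed with Python's standard library. The endpoint URL and the "name"/"price" keys are hypothetical placeholders, not a real API.

```python
# Minimal sketch: reading a JSON data feed with Python's standard library.
# The endpoint URL is hypothetical; substitute a feed you are allowed to access.
import json
import urllib.request

url = "https://example.com/api/products.json"  # hypothetical endpoint

with urllib.request.urlopen(url) as response:
    data = json.loads(response.read().decode("utf-8"))

# The structure of `data` depends entirely on the feed; here we assume
# a list of objects with "name" and "price" keys.
for item in data:
    print(item["name"], item["price"])
```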

    There are methods that some websites use to prevent web scraping, such as detecting and disallowing bots from crawling (viewing) their pages. In response, there are web scraping systems that use techniques such as DOM parsing, computer vision, and natural language processing to simulate human browsing, enabling the gathering of web page content for offline parsing.

    History
    The history of web scraping dates back nearly to the time when the World Wide Web was born.

    • After the birth of the World Wide Web in 1989, the first web robot, the World Wide Web Wanderer, was created in June 1993; it was intended only to measure the size of the web.
    • In December 1993, the first crawler-based web search engine, JumpStation, was launched. As there were few websites available on the web at the time, search engines relied on human website administrators to collect and edit links into a particular format. JumpStation brought a new leap, being the first WWW search engine to rely on a web robot.
    • In 2000, the first web API and API crawler appeared. API stands for Application Programming Interface: an interface that makes it much easier to develop a program by providing the building blocks. In 2000, Salesforce and eBay launched their own APIs, enabling programmers to access and download some of the data available to the public. Since then, many websites have offered web APIs for people to access their public databases.

    Techniques
    Web scraping is the process of automatically mining data or collecting information from the World Wide Web. It is a field with active developments sharing a common goal with the semantic web vision, an ambitious initiative that still requires breakthroughs in text processing, semantic understanding, artificial intelligence, and human-computer interactions. Current web scraping solutions range from the ad-hoc, requiring human effort, to fully automated systems that are able to convert entire websites into structured information, with limitations.

    Human Copy & Paste
    The simplest form of web scraping is manually copying and pasting data from a web page into a text file or spreadsheet. Sometimes even the best web-scraping technology cannot replace a human's manual examination and copy-and-paste, and sometimes this may be the only workable solution when the websites for scraping explicitly set up barriers to prevent machine automation.

    Text Pattern Matching
    A simple yet powerful approach to extracting information from web pages can be based on the UNIX grep command or regular expression-matching facilities of programming languages (for instance Perl or Python).
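
    As a minimal illustration, the sketch below pulls phone-number-like strings out of a page using Python's built-in re module. The URL and the pattern are illustrative assumptions; real-world phone formats vary widely and need a more careful pattern.

```python
# Minimal sketch: extracting phone-number-like strings from a page with
# Python's built-in `re` module. The URL and pattern are illustrative only.
import re
import urllib.request

url = "https://example.com/contacts"  # hypothetical page
html = urllib.request.urlopen(url).read().decode("utf-8", errors="ignore")

# A deliberately simple pattern for US-style numbers such as 555-123-4567.
phone_pattern = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")
for match in phone_pattern.findall(html):
    print(match)
```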

    HTTP Programming
    Static and dynamic web pages can be retrieved by posting HTTP requests to the remote web server using socket programming.
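
    The sketch below fetches a static page by writing an HTTP request directly over a socket, as the paragraph describes; the host and path are placeholders. In practice, an HTTP client library would usually be preferred over raw sockets.

```python
# Minimal sketch: fetching a page by sending an HTTP request over a socket.
# Host and path are illustrative placeholders.
import socket

host, path = "example.com", "/"

request = (
    f"GET {path} HTTP/1.1\r\n"
    f"Host: {host}\r\n"
    "Connection: close\r\n"
    "\r\n"
)

with socket.create_connection((host, 80)) as sock:
    sock.sendall(request.encode("ascii"))
    chunks = []
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            break
        chunks.append(chunk)

response = b"".join(chunks).decode("utf-8", errors="ignore")
print(response.split("\r\n\r\n", 1)[0])  # print just the response headers
```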

    HTML Parsing
    Many websites have large collections of pages generated dynamically from an underlying structured source like a database. Data of the same category are typically encoded into similar pages by a common script or template. In data mining, a program that detects such templates in a particular information source, extracts its content, and translates it into a relational form, is called a wrapper.

    Wrapper generation algorithms assume that input pages of a wrapper induction system conform to a common template and that they can be easily identified in terms of a common URL scheme. Moreover, some semi-structured data query languages, such as XQuery and HTQL, can be used to parse HTML pages and to retrieve and transform page content.
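
    The sketch below shows a hand-written wrapper of this kind: pages assumed to share one template are parsed with the same selectors and translated into a relational form. It relies on the third-party requests and beautifulsoup4 packages, and the URLs and class names are hypothetical.

```python
# Minimal sketch of a hand-written "wrapper": pages generated from one template
# are parsed with the same selectors. Requires the third-party packages
# `requests` and `beautifulsoup4`; URLs and class names are hypothetical.
import requests
from bs4 import BeautifulSoup

product_urls = [
    "https://example.com/products/1",
    "https://example.com/products/2",
]

records = []
for url in product_urls:
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    # The template is assumed to render each field in a fixed element/class.
    records.append({
        "name": soup.select_one("h1.product-name").get_text(strip=True),
        "price": soup.select_one("span.price").get_text(strip=True),
    })

for record in records:  # the extracted data in relational (row-like) form
    print(record)
```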

    DOM Parsing
    By embedding a full-fledged web browser, such as Internet Explorer or the Chrome browser control, programs can retrieve the dynamic content generated by client-side scripts. These browser controls also parse web pages into a DOM tree, from which programs can retrieve parts of the pages. Languages such as XPath can be used to query the resulting DOM tree.
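
    The sketch below queries a parsed DOM tree with XPath via the third-party lxml library. It assumes server-rendered HTML; content generated by client-side scripts would first need to be rendered by an embedded browser (for example, through Selenium). The selectors are illustrative.

```python
# Minimal sketch: querying a parsed DOM tree with XPath using the third-party
# `lxml` library. Assumes server-rendered HTML; selectors are illustrative.
import urllib.request
from lxml import html

page = urllib.request.urlopen("https://example.com").read()
tree = html.fromstring(page)

# XPath addresses nodes by their position and attributes in the DOM tree.
for heading in tree.xpath("//h1/text()"):
    print(heading)
for link in tree.xpath("//a/@href"):
    print(link)
```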

    Vertical Aggregation
    Several companies have developed vertical-specific harvesting platforms. These platforms create and monitor a multitude of "bots" for specific verticals with no "man in the loop" (no direct human involvement) and no work related to a specific target site.

    The preparation involves establishing the knowledge base for the entire vertical, after which the platform creates the bots automatically. The platform's robustness is measured by the quality of the information it retrieves (usually the number of fields) and by its scalability (how quickly it can scale up to hundreds or thousands of sites). This scalability is mostly used to target the long tail of sites that common aggregators find complicated or too labor-intensive to harvest content from.

    Semantic annotation recognizing
    The pages being scraped may include metadata or semantic markups and annotations, which can be used to locate specific data snippets. If the annotations are embedded in the pages, as with Microformats, this technique can be viewed as a special case of DOM parsing. In another case, the annotations, organized into a semantic layer, are stored and managed separately from the web pages, so scrapers can retrieve the data schema and instructions from this layer before scraping the pages.
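
    As one concrete case, the sketch below reads JSON-LD annotations (a common way of embedding schema.org markup) from a page using beautifulsoup4; the URL and the fields printed are assumptions for illustration.

```python
# Minimal sketch: reading embedded semantic annotations (JSON-LD blocks, as
# used for schema.org markup) with `beautifulsoup4`. URL is illustrative.
import json
import urllib.request
from bs4 import BeautifulSoup

html_text = urllib.request.urlopen("https://example.com/article").read()
soup = BeautifulSoup(html_text, "html.parser")

# JSON-LD annotations live in <script type="application/ld+json"> elements.
for script in soup.find_all("script", type="application/ld+json"):
    try:
        annotation = json.loads(script.string or "")
    except json.JSONDecodeError:
        continue
    if isinstance(annotation, dict):  # a feed may also contain lists
        print(annotation.get("@type"), annotation.get("name"))
```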

    Computer Vision Web-Page Analysis
    There are efforts using machine learning and computer vision that attempt to identify the exact information from web pages by interpreting pages visually as a human being might.

    Software
    Many software tools are available to customize web-scraping solutions. This software may attempt to automatically recognize the data structure of a page, provide a recording interface that removes the need to write web-scraping code by hand, offer scripting functions for extracting and transforming content, or provide database interfaces for storing the scraped data in local databases. Some web scraping software can also be used to extract data from an API directly.

    Example
    Below is an example of a web scraper that crawls a website and extracts email addresses, so you don't have to search for them by following links yourself.

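
    This is a minimal stand-alone sketch using only the Python standard library. The seed URL is a placeholder, and the crawl is capped at a small number of URLs; check a site's robots.txt and terms of service before running anything like this against it.

```python
# Minimal sketch: crawl a site's pages and extract email addresses using only
# the standard library. The seed URL is a placeholder.
import re
import urllib.parse
import urllib.request
from collections import deque

seed = "https://example.com"  # placeholder seed URL
email_re = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
link_re = re.compile(r'href=["\'](.*?)["\']')

seen, emails = {seed}, set()
queue = deque([seed])

while queue and len(seen) <= 20:  # cap on discovered URLs for the sketch
    url = queue.popleft()
    try:
        html = urllib.request.urlopen(url, timeout=10).read().decode(
            "utf-8", errors="ignore")
    except Exception:
        continue  # skip pages that fail to download
    emails.update(email_re.findall(html))
    for href in link_re.findall(html):  # follow same-site links only
        link = urllib.parse.urljoin(url, href)
        if link.startswith(seed) and link not in seen:
            seen.add(link)
            queue.append(link)

print("\n".join(sorted(emails)))
```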
