What is Web Scraping and Is it Legal?

Many people wonder whether it is legal to collect and store data from public web pages. In this article, we will try to clarify the situation.

Jan 22, 2020

Today, many internet users need to obtain public data from web pages. The simplest way to do this is to select and copy the fields we want from a page, save the link to the page, or save the whole page. This works for simple cases, but sometimes data has to be collected systematically. Two terms are commonly used for systematic data collection over the web: web crawling and web scraping. Let me examine the two concepts, the differences between them, and whether they are legal.

 

Web Crawling

 

Web crawling is the process of sending a GET request to a web page and saving the HTML that comes back. Its most important feature is that the elements in the retrieved HTML are parsed and stored systematically. Most crawling engines work roughly as follows.

  1. A URL is taken as input from the crawling URL queue.
  2. The URL is checked for compliance with the URL rules.
  3. The URL's properties are parsed.
  4. A GET request is sent for the HTML content.
  5. If the response comes back without an error, the HTML content is obtained.
  6. The elements in the HTML content are parsed.
  7. The desired tags are extracted and their data is stored.
  8. Internal links within the tags are detected and added to the crawling queue.
  9. Crawling of the current URL ends and the URL is removed from the queue.
  10. If the crawling URL queue is not empty, go back to step 1.
  11. The crawling process ends.
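
The crawling steps above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library, not the code of any particular crawling engine. The names `crawl`, `http_get`, `LinkParser` and the `max_pages` limit are assumptions made for the example, and "URL rules" is reduced here to "the scheme must be http or https".

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href values of <a> tags while the HTML is parsed (step 6)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def http_get(url):
    """Steps 4-5: send a GET request and return the HTML as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=http_get, max_pages=50):
    """Crawl pages on the host of start_url; return {url: html}."""
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:      # step 10: loop while queue is not empty
        url = queue.popleft()                    # steps 1 and 9: take the URL off the queue
        parts = urlparse(url)                    # step 3: parse the URL's properties
        if parts.scheme not in ("http", "https"):
            continue                             # step 2: URL breaks our (assumed) rule
        try:
            html = fetch(url)                    # steps 4-5
        except OSError:
            continue                             # network or HTTP error: skip this URL
        parser = LinkParser()
        parser.feed(html)                        # step 6
        pages[url] = html                        # step 7: store the content
        for href in parser.links:                # step 8: queue internal links
            link = urljoin(url, href).split("#")[0]
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages                                 # step 11
```

Passing the fetch function as a parameter is a design choice for the sketch: it lets the loop be tried against an in-memory dictionary of pages before it is pointed at a real site.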

 

Web Scraping

 

Web scraping is the production of meaningful data from the HTML content obtained by sending a GET request to a web page. Operationally it does the same thing as crawling; the difference between the two terms lies in how the retrieved data is processed. Crawling is one step in the scraping pipeline. Scraping begins after this step, when the data is parsed and processed. The scraping steps in the Koovan ecosystem work as follows.

  1. A URL is taken as input from the crawling URL queue.
  2. The URL is checked for compliance with the URL rules.
  3. The URL's properties are parsed.
  4. A GET request is sent for the HTML content.
  5. If the response comes back without an error, the HTML content is obtained.
  6. The JavaScript in the HTML is executed and a new HTML document is generated.
  7. The elements in the new HTML content are parsed.
  8. The desired tags are extracted and their data is stored.
  9. The data of each element matching the specified XPath is extracted, and the path of the matching element is recorded along with it.
  10. Internal links within the tags are detected and added to the crawling queue.
  11. Crawling of the current URL ends and the URL is removed from the queue.
  12. If the crawling URL queue is not empty, return to step 1.
  13. The crawling process is done.

XPath here refers to a language that describes the path to an element in the HTML content. With it, each element can be addressed by a text expression and the desired data retrieved.
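
To make the XPath step concrete, here is a small sketch that pulls fields out of a page by path. It uses the limited XPath subset supported by Python's built-in `xml.etree.ElementTree`, which only accepts well-formed markup; real-world HTML usually needs a more lenient parser. The sample page, the `extract` function, and the field paths are all hypothetical and are not taken from the Koovan system.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page used only for this illustration.
PAGE = """
<html>
  <body>
    <div class="product">
      <h1>Blue Kettle</h1>
      <span class="price">24.90</span>
    </div>
    <div class="product">
      <h1>Red Toaster</h1>
      <span class="price">39.00</span>
    </div>
  </body>
</html>
"""

# Field name -> XPath expression, relative to each matched item element.
FIELDS = {
    "name": "h1",
    "price": "span[@class='price']",
}


def extract(html, item_path, fields):
    """Return one dict per element matching item_path, with the given fields."""
    root = ET.fromstring(html.strip())
    records = []
    for item in root.findall(item_path):
        record = {}
        for field, path in fields.items():
            node = item.find(path)
            # A missing element or empty text is stored as None.
            has_text = node is not None and node.text
            record[field] = node.text.strip() if has_text else None
        records.append(record)
    return records


records = extract(PAGE, ".//div[@class='product']", FIELDS)
for record in records:
    print(record)
```

Each expression is plain text, so the mapping from field names to paths can be kept in configuration and changed without touching the extraction code.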

 

Is Web Scraping Legal?

 

Collecting data from public web pages can be legal as long as certain conditions are met. These conditions are as follows.

  • No crime is committed (as defined in the Computer Fraud and Abuse Act), no contract is violated (such as the terms of use), and the scraping is not done at a destructive rate.
  • A company's user agreement cannot be enforced as a browsewrap agreement when the company does not give site visitors adequate notice of its terms.
  • Website data can be accessed as a visitor, in much the same way a search engine accesses it. This can be done without explicitly accepting any terms as a user.
  • In one ruling, the judge concluded that linking to the terms of use at the bottom of a web page was not sufficient to give "constructive notice". In other words, nothing implies that merely accessing information on a public page is subject to any contractual terms. Neither explicit nor implicit consent is given to any agreement, so no contract is violated.