Koovan Documentation



Getting Started


There is no need to install anything to use the Koovan application: Koovan is a fully web-based cloud platform. The data you collect is stored in your own cloud account; you can access it whenever you want and download a copy to your computer.

This document describes how to use Koovan. The steps are presented in the order a scraping workflow follows them; by working through these steps, you can start using the system very easily.

Purpose and Scope

Purpose

Koovan is a web-based web scraping application. It can collect data from public web pages and also facilitates data collection from web pages that have access restrictions. You can start using it immediately by following the steps below.

The purpose of this document is to explain how you can collect data from a web page, and how you can process and download the data you collect. If you cannot find what you are looking for in this document, please contact the support email address.

Scope

To crawl a website on Koovan, follow these steps:

  1. Create a domain.
  2. Create a template for the domain you created.
  3. Create a crawl task for a web address that matches the template.
  4. Wait for the crawl task to finish.

You will find the details of these steps later in the document.

Screen Definitions

This section describes the screens in the application. Each screen is explained in detail below.

1. Statistics Screen

There are two sections on this screen: one shows statistical data for the last ten days, and the other is a graph of the data rows collected from websites over the last month.

2. General Statistics Screen

This screen shows the number of registered domain names, the number of binary files downloaded, the number of rows downloaded daily and monthly, the cloud account occupancy rate, and a graph of the crawled data counts for the last ten days.

3. Account Detail

This screen displays detailed information about the registered account.

4. Domain Screen

This screen lists the registered website domains. In addition to the total number of registered domains, it shows the ten most recently added domains.

5. Transaction Logs Screen

This screen shows the logs generated by tasks and system processes initiated by the user.

6. Task List Screen

Adding a Website

To add a new website, open the Websites page, click the Add New Domain button, and enter the domain name you want to add. If the entered domain name conforms to the rules, the registration completes successfully.
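
Koovan does not document its exact validation rules, so as an illustration only, a conventional domain-name check could look like the sketch below. The regex and the function name are assumptions, not Koovan's actual implementation.

```python
import re

# Hypothetical domain-name check: labels of letters/digits/hyphens separated
# by dots, ending in an alphabetic top-level domain of at least two letters.
# Koovan's real rules may differ.
DOMAIN_RE = re.compile(
    r"^(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,}$",
    re.IGNORECASE,
)

def is_valid_domain(name: str) -> bool:
    """Return True if the string looks like a conventional domain name."""
    return bool(DOMAIN_RE.match(name))

print(is_valid_domain("example.com"))    # True
print(is_valid_domain("www.imdb.com"))   # True
print(is_valid_domain("not a domain"))   # False
```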

Creating a Template

Data on public web pages can be collected using templates that you define. You can create as many templates as you want for a website. This section explains how to create a template for a website.

1. The template creation screen opens.

  • Select the desired domain name on the Templates page and click the Add New Templates tab.
  • In the template creation form that opens, enter the link of the web page whose template you want to create in the text box.
  • After a short wait, a copy of the requested page is rendered in the web page viewer on the right.

2. The template name and link rule are defined.

  • Enter text in the template name field so that you can easily find the template later.
  • The link rule for which the template is considered valid is defined in the link field.
    • A basic, regular-expression-like syntax is used to create the link rule.
    • The <:s> definition indicates that a field can contain dynamic text. (Example: the link rule https://www.imdb.com/chart/<:s>/ means the template applies to both the https://www.imdb.com/chart/top/ and the https://www.imdb.com/chart/moviemeter/ pages.)
    • The <:d> definition indicates that a field can contain a dynamic numeric value.
    • When the Collect data with iteration tab is enabled, data is collected repeatedly from the selected elements.
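
Koovan's matching engine is internal, but one plausible reading of the placeholder syntax is that `<:s>` matches any single text segment and `<:d>` matches a run of digits. A minimal sketch of that interpretation in Python follows; the function name and the regex translation are assumptions for illustration.

```python
import re

def link_rule_to_regex(rule: str) -> re.Pattern:
    """Translate a Koovan-style link rule into a regex (assumed semantics:
    <:s> -> any non-slash text segment, <:d> -> a run of digits)."""
    escaped = re.escape(rule)
    escaped = escaped.replace(re.escape("<:s>"), r"[^/]+")
    escaped = escaped.replace(re.escape("<:d>"), r"\d+")
    return re.compile(escaped + r"$")

rule = link_rule_to_regex("https://www.imdb.com/chart/<:s>/")
print(bool(rule.match("https://www.imdb.com/chart/top/")))         # True
print(bool(rule.match("https://www.imdb.com/chart/moviemeter/")))  # True
print(bool(rule.match("https://www.imdb.com/title/tt0111161/")))   # False
```

Under this reading, one rule covers a whole family of pages, which is why a single template can apply to many crawled URLs.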

3. Data fields are selected.

  • The data field to be labeled is selected from the screen that opens after clicking any element in the web page viewer.
  • If the Collect data with iteration tab is active, two different repeating elements must be selected among the data fields, because repetitive elements are collected in a loop.
  • The data shown under the selected element is a sample; it serves as a road map for the crawling engine during later collection runs.
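
The iteration idea can be illustrated with plain Python: once two sample rows are marked, an engine can generalize to every sibling that matches the shared pattern. This sketch is an assumption about the mechanism, not Koovan's implementation, and uses a static fragment in place of a live page.

```python
import xml.etree.ElementTree as ET

# Hypothetical illustration of "Collect data with iteration": the user marks
# two sample rows, the engine infers the shared pattern (here simply <li> with
# a span.title child) and then loops over every matching sibling.
fragment = """
<ul>
  <li><span class="title">Movie A</span></li>
  <li><span class="title">Movie B</span></li>
  <li><span class="title">Movie C</span></li>
</ul>
"""

root = ET.fromstring(fragment)
titles = [span.text for span in root.findall("./li/span[@class='title']")]
print(titles)  # ['Movie A', 'Movie B', 'Movie C']
```

Note that only two samples were needed to describe the pattern, yet all three rows are collected; that is why the iteration tab requires selecting two different repeating elements.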

4. The template is saved.

Creating Crawling

A crawling task can be created for one or more web pages. There are three methods for this: entering a single link, entering multiple links, or uploading a text file of links. Creating a crawling task from a single link works as follows.

  1. Go to the Create New Crawling page under the Crawling Menu.
  2. Enter the link for the crawling task you want to start in the From Single Url tab.
  3. If you do not have a template that matches this link, only the following data can be collected:
    1. Page headers
    2. Page keywords
    3. Page description
    4. Binary files
  4. When Koovan Engine detects the most suitable template for this web page, it collects the specified fields in the specified format.
  5. If a repeating template has been created for the added URL, pagination can be applied to that URL.
  6. Two methods can be used for paging: replace the numeric page key in the URL with the <:p> definition (for click paging), or with the <:o> definition (for automatic refreshing by scrolling).
  7. You can choose the output format in which the result of the started crawling task will be created.
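
The fallback data listed in step 3 (page headers, keywords, description) is standard HTML metadata. The sketch below shows how such metadata can be extracted with only the Python standard library; it illustrates the kind of data involved, not Koovan's actual engine, and parses a hard-coded page instead of fetching one.

```python
from html.parser import HTMLParser

class MetadataParser(HTMLParser):
    """Collects the <title> text and the keywords/description meta tags."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.meta = {"title": "", "keywords": "", "description": ""}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.meta[name] = attrs.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.meta["title"] += data

# Hard-coded sample page standing in for a fetched URL.
html = """<html><head><title>Example</title>
<meta name="keywords" content="scraping, cloud">
<meta name="description" content="A demo page.">
</head><body></body></html>"""

parser = MetadataParser()
parser.feed(html)
print(parser.meta)
# {'title': 'Example', 'keywords': 'scraping, cloud', 'description': 'A demo page.'}
```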