Spider Configuration

Sometimes only particular page types need to be crawled: for example, a user may want to crawl only HTML pages, may not want robots.txt to be followed, may not want sitemaps extracted, or may want to crawl one specific directory of a website. In these cases, and others like them, the spider configuration is the place to look.

Where to Find Spider Configuration

This option is available in the Spider menu. See the screenshots below.

spider configuration 1

spider configuration 2

Options Explained

Ignore Robots.txt

This option ignores robots.txt entirely: if the website has a robots.txt in its root directory, it will not be followed. If a custom robots.txt has been set, this option overrides that customization and no robots.txt is followed at all.
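To make the effect concrete, here is a minimal sketch in Python of what following versus ignoring robots.txt means for a single URL. It is not Webbee's implementation: the function name `is_allowed`, the `ignore_robots` flag, the `ExampleBot` user agent, and the sample robots.txt rules are all assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(url: str, ignore_robots: bool, user_agent: str = "ExampleBot") -> bool:
    """Return True if a crawler may fetch `url` under the given setting."""
    if ignore_robots:
        # With the option on, the robots.txt rules are never consulted.
        return True
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "http://www.example.com/private/page"
    print(is_allowed(url, ignore_robots=False))
    print(is_allowed(url, ignore_robots=True))
```

With the flag off, the disallowed path is skipped; with the flag on, every URL is treated as fetchable regardless of the rules.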

Ignore Sitemaps

This option disables sitemap crawling everywhere, even if a sitemap is specified in a custom robots.txt or actually exists on the website.

Note: When sitemaps are ignored from the configuration panel, the “Crawl Sitemap Option” in the Spider menu cannot be selected, and vice versa, because enabling both options would put the bot into a dead state in which nothing is crawled.
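The mutual exclusion described in the note can be sketched as a small settings check. This is only an illustration in Python under assumed names: `SpiderSettings`, `ignore_sitemaps`, and `crawl_sitemap` are hypothetical and do not come from Webbee itself.

```python
from dataclasses import dataclass


@dataclass
class SpiderSettings:
    """Hypothetical settings holder; field names are illustrative only."""

    ignore_sitemaps: bool = False
    crawl_sitemap: bool = False

    def validate(self) -> None:
        # The two options contradict each other, so reject the combination
        # up front instead of starting a crawl that can never find anything.
        if self.ignore_sitemaps and self.crawl_sitemap:
            raise ValueError(
                "ignore_sitemaps and crawl_sitemap cannot both be enabled"
            )


if __name__ == "__main__":
    SpiderSettings(ignore_sitemaps=True).validate()  # a valid combination
    try:
        SpiderSettings(ignore_sitemaps=True, crawl_sitemap=True).validate()
    except ValueError as err:
        print(f"rejected: {err}")
```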

Ignore Subdomains

What if a user wants to crawl one particular subdomain of a website? The example below illustrates the problem and its solution.

Case:

I want to crawl only webpages on www.example.com.

Normal Input
example.com
Output
All links and subdomains under the domain name are crawled if they exist, e.g. abc.example.com, xyz.example.com, abc.example.com/an-other-example-page, etc. This happens because the input field contains only the domain name. To crawl a single subdomain, enter that subdomain as the input: if results are required only from www.example.com, the input should be www.example.com rather than the bare domain name (example.com).
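The relationship between the input and the hosts that end up in scope can be sketched as follows. This is a rough Python approximation of the behavior described above, not Webbee's actual matching logic; the function `in_host_scope` and its suffix-matching rule are assumptions.

```python
from urllib.parse import urlparse


def in_host_scope(url: str, seed: str) -> bool:
    """Return True if `url` is on the seed host or on a subdomain of it."""
    host = urlparse(url).hostname or ""
    # Accept seeds typed without a scheme, such as "example.com".
    seed_host = urlparse(seed if "//" in seed else "//" + seed).hostname or ""
    # Assumed rule: an exact host match, or any host ending in ".<seed host>".
    return host == seed_host or host.endswith("." + seed_host)


if __name__ == "__main__":
    page = "http://abc.example.com/an-other-example-page"
    print(in_host_scope(page, "example.com"))      # bare domain as input
    print(in_host_scope(page, "www.example.com"))  # subdomain as input
```

Under this rule, the input example.com covers abc.example.com, while the input www.example.com does not.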

Crawl External Folders

By default, Webbee crawls only internal folders, i.e. only the pages inside the specified directory or domain are crawled. For example:

Input: http://www.example.com/directory/sub-directory
Output:
http://www.example.com/directory/sub-directory/1
http://www.example.com/directory/sub-directory/2
http://www.example.com/directory/sub-directory/3
etc.

This default behavior can be changed with this option. When it is checked, Webbee crawls every internally linked page of the website reachable from the given input, whether that input is a directory or a domain.
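The difference between the two modes can be sketched as a path-prefix check. This is a simplified Python illustration under stated assumptions, not Webbee's code: the function `in_folder_scope` and the `crawl_external_folders` parameter are hypothetical names.

```python
from urllib.parse import urlparse


def in_folder_scope(url: str, seed: str, crawl_external_folders: bool) -> bool:
    """Decide whether `url` falls inside the crawl scope defined by `seed`."""
    target, start = urlparse(url), urlparse(seed)
    if target.hostname != start.hostname:
        # Pages on other hosts are out of scope in both modes.
        return False
    if crawl_external_folders:
        # Option checked: any page on the same host is in scope.
        return True
    # Option unchecked: only the seed directory and the pages beneath it.
    base = start.path.rstrip("/")
    return target.path == base or target.path.startswith(base + "/")


if __name__ == "__main__":
    seed = "http://www.example.com/directory/sub-directory"
    inside = "http://www.example.com/directory/sub-directory/1"
    outside = "http://www.example.com/other-directory/page"
    print(in_folder_scope(inside, seed, crawl_external_folders=False))
    print(in_folder_scope(outside, seed, crawl_external_folders=False))
    print(in_folder_scope(outside, seed, crawl_external_folders=True))
```

The trailing slash appended to the base path keeps a sibling folder such as /directory/sub-directory-2 from being mistaken for part of the seed directory.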

Other Ignore Options

Each of the other ignore options excludes the page type it names from the crawl.

About Ahmad Ali

Ahmad is the co-founder and CEO of Webbee Inc. He has been working as a digital marketer for the past few years and has worked with some notable names across different industries. He is also the creator of Webbee SEO spider, one of the most advanced SEO spider tools on the internet.
