There can be instances when only particular page types need to be crawled, e.g. the user wants to crawl only HTML pages, does not want robots.txt to be followed, does not want sitemaps to be extracted, or wants to crawl only a specific directory of a website. In these and similar cases, the spider configuration is the place to look.
Where to Find Spider Configuration
This option is available in the spider menu. See the snapshots below.
The option to ignore robots.txt disregards robots.txt entirely: if the website has a robots.txt in its root directory, it will not be followed. If a custom robots.txt has been set, this option overrides that customization as well, so no robots.txt of any kind is followed.
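The effect of ignoring robots.txt can be illustrated with a minimal Python sketch. This is not Webbee's code: the function name `is_allowed` and the `ignore_robots` flag are hypothetical, and the rules are passed in as plain lines so that the same check covers both a site's own robots.txt and a custom one.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(url, robots_lines, ignore_robots=False, user_agent="*"):
    """Return True if the crawler may fetch `url`.

    `ignore_robots` is a hypothetical flag standing in for the
    configuration option: when set, no robots.txt rules are consulted.
    """
    if ignore_robots:
        return True
    parser = RobotFileParser()
    parser.parse(robots_lines)  # works for a site's file or a custom one
    return parser.can_fetch(user_agent, url)


# Example rules, whether found at the site root or supplied as custom rules.
rules = ["User-agent: *", "Disallow: /private/"]
url = "http://www.example.com/private/page.html"

print(is_allowed(url, rules))                      # rules are honoured
print(is_allowed(url, rules, ignore_robots=True))  # rules are skipped
```

With the flag off, the disallowed URL is rejected; with it on, the same URL is accepted regardless of where the rules came from.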
The option to ignore sitemaps skips sitemap crawling from every source, even if a sitemap is specified in a custom robots.txt or actually exists on the website.
Note: When the sitemap is ignored from the configuration panel, “Crawl Sitemap Option” in the spider menu cannot be selected, and vice versa, because enabling both options would put the bot into a dead state in which nothing is crawled.
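The mutual exclusion described in the note can be sketched as a small validation step. The class and field names below (`SpiderOptions`, `ignore_sitemap`, `crawl_sitemap`) are assumptions made for illustration and do not come from Webbee itself.

```python
from dataclasses import dataclass


@dataclass
class SpiderOptions:
    # Hypothetical stand-ins for the two conflicting settings.
    ignore_sitemap: bool = False  # set in the configuration panel
    crawl_sitemap: bool = False   # set in the spider menu


def validate(options):
    """Reject the combination that would leave nothing to crawl."""
    if options.ignore_sitemap and options.crawl_sitemap:
        raise ValueError(
            "ignore_sitemap and crawl_sitemap cannot both be enabled"
        )
    return options


validate(SpiderOptions(ignore_sitemap=True))   # accepted
validate(SpiderOptions(crawl_sitemap=True))    # accepted
```

Checking the combination up front, rather than at crawl time, mirrors the way the interface prevents the second option from being selected at all.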
What if a user wants to crawl a particular subdomain of a website? Below is an example and its solution.
I want to crawl only webpages on www.example.com.
If only the domain name (example.com) is entered, all links and subdomains on that domain will be crawled if they exist, e.g. abc.example.com, xyz.example.com, abc.example.com/an-other-example-page, etc. The reason for this behavior is that the input field contains only the domain name. To crawl only a subdomain, that subdomain must be given as the input: if results are required only from www.example.com, the input should be www.example.com rather than the domain name (example.com).
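The scoping rule can be approximated in a few lines of Python. This is only a sketch under an assumption: a URL is treated as in scope when its host equals the input host or is a subdomain of it. The helper names `host_of` and `in_scope` are hypothetical, and Webbee's actual matching may differ.

```python
from urllib.parse import urlsplit


def host_of(value):
    """Extract the lowercase host from a URL or a bare host name."""
    if "://" not in value:
        value = "//" + value  # lets urlsplit treat the text as a host
    return (urlsplit(value).hostname or "").lower()


def in_scope(url, seed):
    """Assumed rule: same host as the input, or a subdomain of it."""
    host, seed_host = host_of(url), host_of(seed)
    return host == seed_host or host.endswith("." + seed_host)


link = "http://abc.example.com/an-other-example-page"
print(in_scope(link, "example.com"))      # domain as input
print(in_scope(link, "www.example.com"))  # subdomain as input
```

With example.com as the input, the link on abc.example.com is in scope; with www.example.com as the input, it is not.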
Crawl External Folders
By default, Webbee is set to crawl only internal folders, i.e. only the pages within the specified directory or domain will be crawled. E.g.
This default behavior can be changed with this option. When it is checked, Webbee will crawl the entire website, as far as it is internally linked, starting from the given input, whether that input is a directory or a domain.
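A rough Python sketch of this behavior follows. It assumes that the input is a directory URL, that "internal" means the link's path lies under the input's path on the same host, and that checking the option simply drops the path restriction. The name `should_crawl` and its `crawl_external_folders` parameter are hypothetical.

```python
from urllib.parse import urlsplit


def should_crawl(url, seed, crawl_external_folders=False):
    """Decide whether `url` is crawled when starting from `seed`."""
    target, start = urlsplit(url), urlsplit(seed)
    if (target.hostname or "").lower() != (start.hostname or "").lower():
        return False  # links to other hosts are never followed here
    if crawl_external_folders:
        return True   # option checked: the whole website is eligible
    # Default: stay inside the input directory.
    seed_dir = start.path or "/"
    if not seed_dir.endswith("/"):
        seed_dir += "/"
    return ((target.path or "/") + "/").startswith(seed_dir)


seed = "http://www.example.com/blog/"
print(should_crawl("http://www.example.com/blog/post-1", seed))
print(should_crawl("http://www.example.com/about", seed))
print(should_crawl("http://www.example.com/about", seed,
                   crawl_external_folders=True))
```

Starting from the /blog/ directory, a page under /blog/ is crawled by default, /about is skipped, and /about becomes eligible once the option is enabled.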
Other Ignore Options
Each of the other ignore options ignores the page type it refers to.
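As an illustration only, a page-type filter of this kind could look like the sketch below. The type names and file extensions in `PAGE_TYPES` are invented for the example and are not Webbee's actual option list, and a real crawler might classify pages by content type rather than by extension.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Hypothetical page types and the extensions assumed to belong to them.
PAGE_TYPES = {
    "images": {".png", ".jpg", ".jpeg", ".gif"},
    "css": {".css"},
    "pdf": {".pdf"},
}


def is_ignored(url, ignored_types):
    """Return True if the URL's extension falls under an ignored type."""
    suffix = PurePosixPath(urlsplit(url).path).suffix.lower()
    return any(suffix in PAGE_TYPES[name] for name in ignored_types)


ignored = {"images", "pdf"}
print(is_ignored("http://www.example.com/logo.PNG", ignored))
print(is_ignored("http://www.example.com/about.html", ignored))
```

Here the image URL is filtered out while the HTML page is kept, because only the checked types are matched.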