Crawler IP Whitelisting in VergeCloud

Overview

Automated bots, often referred to as crawlers or spiders, are programs that systematically browse the web. Search engines, analytics platforms, AI services, and other online tools rely on these bots to index content, collect website performance metrics, and power data-driven services. While most crawlers are legitimate and beneficial, managing them carefully is essential to keep your website secure, maintain strong SEO performance, and avoid disruption from abusive or rogue traffic.

VergeCloud provides robust mechanisms to automatically handle known and trusted crawlers, enabling seamless integration with search engines and other services while giving you fine-grained control over crawler traffic.

How VergeCloud Handles Crawlers

By default, VergeCloud automatically whitelists IP addresses and user-agents of well-known, legitimate crawlers such as:
  1. Googlebot
  2. Bingbot
  3. Applebot
  4. Meta (Facebook) crawler
  5. OpenAI bots
  6. Yandex
This ensures that trusted bots can crawl and index your website without being blocked by security rules, avoiding interruptions to SEO performance and site analytics. VergeCloud verifies these bots using official sources wherever available. For crawlers without official IP lists, VergeCloud references the IP2Location database to maintain a reliable whitelist.
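
Independent of VergeCloud's automatic verification, you can spot-check a visitor that claims to be a known crawler yourself. The usual method, which Google documents for Googlebot, is a reverse DNS lookup on the visiting IP followed by a forward lookup on the returned hostname; the IP and hostname below are illustrative only.

# Reverse lookup: a genuine Googlebot IP resolves to a googlebot.com or google.com hostname
host 66.249.66.1

# Forward lookup: the hostname must resolve back to the same IP to confirm authenticity
host crawl-66-249-66-1.googlebot.com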

Key Advantages of Automatic Whitelisting:

  1. Validating crawler authenticity: Only genuine crawlers gain access, protecting your site from impersonation attacks.
  2. Enhancing security: Blocks suspicious or unknown bots while allowing legitimate traffic.
  3. Improving site visibility: Trusted search engine bots can index your content efficiently.
  4. Staying updated: Automatic updates of known bot IP ranges from official sources.

Why Managing Crawlers Matters

Even though crawlers are beneficial, unchecked or unknown bots can impact your website negatively:
  1. Security threats: Malicious bots may attempt brute force attacks, SQL injections, or data scraping.
  2. SEO risks: Fake or spam bots can distort analytics data and affect search engine rankings.
  3. Performance degradation: High-volume bot traffic can strain server resources, slowing down legitimate user access.
Managing crawler access allows you to balance security, performance, and SEO effectively.

Controlling Crawler Access Manually

For advanced users who want full control, VergeCloud allows the global whitelist to be disabled using the API. This enables administrators to manage all crawler traffic manually via firewall rules.

Warning
Disabling the global whitelist may prevent legitimate search engine bots from crawling your website, which could reduce visibility in search results. If you choose to do this, you must maintain an updated list of allowed crawlers and configure firewall rules accordingly.

Example API Call to Disable Global Whitelist

curl --location --request PATCH 'https://api.vergecloud.com/cdn/4.0/domains/example.com/firewall' \
--header 'Authorization: API_KEY' \
--header 'Content-Type: application/json' \
--data '{"skip_global_firewall": true}'

Creating Custom Rules for Crawlers

After disabling the global whitelist, you become responsible for explicitly allowing or blocking crawler traffic yourself, so that only trusted bots reach your content.
VergeCloud’s firewall lets you define custom rules for crawler management, selectively allowing or blocking crawler IP addresses or user agents based on your website’s needs.

Practical use cases include:
  1. Allowing only Googlebot to access certain pages while blocking other crawlers.
  2. Temporarily blocking all crawlers during maintenance windows.
  3. Restricting specific bots from sensitive endpoints while letting general traffic continue.
Firewall rules for crawlers can be built from IP addresses, ASNs, user agents, or a combination of these attributes. This granular control ensures that legitimate bots are not disrupted while rogue bot traffic is kept out.
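
As a sketch of what such a rule might look like via the API, the request below reuses the /firewall endpoint pattern shown earlier to allow only traffic matching Googlebot. The /rules path segment and the field names (action, match, asn, user_agent, note) are hypothetical placeholders, not taken from VergeCloud's documentation, so consult the firewall API reference for the actual schema; AS15169 is Google's autonomous system number.

# Hypothetical rule payload: allow requests whose ASN and user agent both indicate Googlebot
curl --location --request POST 'https://api.vergecloud.com/cdn/4.0/domains/example.com/firewall/rules' \
--header 'Authorization: API_KEY' \
--header 'Content-Type: application/json' \
--data '{"action": "allow", "match": {"asn": [15169], "user_agent": "Googlebot"}, "note": "Allow verified Googlebot traffic"}'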

Below are the supported crawler bots with their official IP verification sources:

Best Practices for Crawler Management

  1. Enable automatic whitelisting unless you have a specific reason to manage crawlers manually.
  2. Regularly review crawler traffic logs to detect anomalies or suspicious patterns.
  3. Combine IP, ASN, and user-agent checks for stronger validation.
  4. Use temporary blocks instead of permanent rules during maintenance or testing periods.
  5. Update firewall rules whenever official crawler IP ranges change to prevent accidental blocking (see the example after this list for pulling Google's published ranges).
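
For the last point, several crawler operators publish their IP ranges in machine-readable form. As one example, Google publishes Googlebot's ranges as JSON at the URL below; the quick check here simply lists the IPv4 prefixes so you can compare them against your firewall rules.

# Fetch Google's published Googlebot IP ranges and print the IPv4 prefixes
curl -s 'https://developers.google.com/search/apis/ipranges/googlebot.json' \
  | grep -o '"ipv4Prefix": *"[^"]*"'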

