Gist Crawler Bot

The Gist Crawler Bot is a web crawler designed to systematically access and extract both licensed and public content from the web. The collected data is used to power a Retrieval-Augmented Generation (RAG) model that serves content on the Gist.ai website.

Key Features of Gist Crawler Bot:

  • Identification
    The Gist Crawler identifies itself with a dedicated user-agent string in all HTTP requests, enabling servers to recognize and manage its activity. The user-agent string that ProRata.ai uses is ProRataInc.
  • Scope and Compliance
    The bot is configured to crawl only content that is either public (freely accessible without credentials) or explicitly licensed for use. This ensures compliance with legal and ethical standards for web crawling.
  • Crawling Behavior
    The crawler follows internal and external links found on pages to discover new content, updating its index as it encounters new or changed information.
  • Respect for Site Controls
    The Gist Crawler respects robots.txt directives and other exclusion protocols, only accessing content that site owners permit for crawling. It also honors additional access controls or restrictions as required by licensing agreements.
  • Crawl Rate and Server Impact
    To minimize impact on server performance, ProRataIncBot spaces out its requests and avoids overwhelming any single site, adhering to best practices for responsible crawling.
  • Content Handling
    The crawler may download HTML and associated resources (such as CSS or JavaScript) needed for accurate content extraction, subject to reasonable file size limits for efficiency.
  • Verification
    Requests from ProRataIncBot can be verified by IP address or reverse DNS lookup to confirm authenticity and detect spoofing. The crawler uses the following IP addresses: 172.190.46.235 (production) and 172.171.95.51 (testing).
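
As an illustration of the IP-based verification described above, a site operator could compare each request's source address and user-agent against the values listed on this page. The following is a minimal sketch, not an official tool: the function name classify_request and its return labels are hypothetical, and matching the user-agent by case-insensitive substring is an assumption. Reverse DNS verification is not shown because this page does not state the expected hostname.

```python
import ipaddress

# Addresses and user-agent token as listed on this page.
# Re-check the page before relying on them; they may change.
PRODUCTION_IPS = {ipaddress.ip_address("172.190.46.235")}
TESTING_IPS = {ipaddress.ip_address("172.171.95.51")}
USER_AGENT_TOKEN = "ProRataInc"


def classify_request(remote_addr: str, user_agent: str) -> str:
    """Classify a request as 'production', 'testing', 'spoofed', or 'other'.

    'spoofed' means the user-agent claims to be the crawler but the
    source address is not one of the listed addresses.
    """
    # Assumption: the token may appear anywhere in the header, in any case.
    claims_bot = USER_AGENT_TOKEN.lower() in user_agent.lower()
    if not claims_bot:
        return "other"
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        # An unparseable address cannot match a listed one.
        return "spoofed"
    if addr in PRODUCTION_IPS:
        return "production"
    if addr in TESTING_IPS:
        return "testing"
    return "spoofed"
```

For example, classify_request("172.190.46.235", "ProRataInc") returns "production", while the same user-agent arriving from any unlisted address is reported as "spoofed".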

By following these principles, ProRataIncBot efficiently and responsibly collects licensed and public content for use in Gist.ai’s RAG-powered web services, while respecting both legal requirements and the preferences of content owners.
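
Site owners who want to check how a robots.txt rule would apply to the crawler can evaluate it locally with Python's standard urllib.robotparser module. This is a sketch under assumptions: the robots.txt content and the example.com URLs are made up for illustration, and it assumes the crawler matches the robots.txt token ProRataInc (the user-agent string given above). It shows how a standard parser reads the rules, not the crawler's own implementation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep the crawler out of /private/,
# leave every other path and every other bot unrestricted.
ROBOTS_TXT = """\
User-agent: ProRataInc
Disallow: /private/

User-agent: *
Disallow:
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Blocked for the crawler's token, allowed elsewhere on the site.
blocked = parser.can_fetch("ProRataInc", "https://example.com/private/report")
allowed = parser.can_fetch("ProRataInc", "https://example.com/articles/1")
```

Under this file, blocked is False and allowed is True; other user-agents fall through to the wildcard group, which permits everything.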