Gist Crawler Bot
The Gist Crawler Bot is a web crawler designed to systematically access and extract both licensed and public content from the web. The collected data is used to power a Retrieval-Augmented Generation (RAG) model that serves content on the Gist.ai website.
Key Features of Gist Crawler Bot:
- Identification
The Gist Crawler identifies itself with a dedicated user-agent string in all HTTP requests, enabling servers to recognize and manage its activity. The user-agent string that Prorata.ai uses is ProRataInc.
- Scope and Compliance
The bot is configured to crawl only content that is either public (freely accessible without credentials) or explicitly licensed for use. This ensures compliance with legal and ethical standards for web crawling.
- Crawling Behavior
Our bot follows internal and external links found on pages to discover new content, updating its index as it encounters new or changed information.
- Respect for Site Controls
The Gist Crawler respects robots.txt directives and other exclusion protocols, only accessing content that site owners permit for crawling. It also honors additional access controls or restrictions as required by licensing agreements.
- Crawl Rate and Server Impact
To minimize impact on server performance, ProRataIncBot spaces out its requests and avoids overwhelming any single site, adhering to best practices for responsible crawling.
- Content Handling
The crawler may download HTML and associated resources (such as CSS or JavaScript) needed for accurate content extraction, subject to reasonable file size limits for efficiency.
- Verification
Requests from ProRataIncBot can be verified through IP address or reverse DNS lookup to ensure authenticity and prevent spoofing. Our crawler uses the following IP addresses: 172.190.46.235 (production) and 172.171.95.51 (testing).
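For site operators who want to apply the verification step in their own request handling, the check can be sketched roughly as follows. This is a minimal illustration, not ProRata's own tooling: the function names (`classify_ip`, `looks_like_gist_crawler`, `reverse_dns`) are hypothetical, the substring match on the user-agent header is an assumption, and only the two IP addresses come from this page. Because no expected hostname is published here, the reverse DNS helper simply returns whatever name resolves, for logging or manual review.

```python
import ipaddress
import socket
from typing import Optional

# IP addresses published on this page for the Gist Crawler.
PUBLISHED_IPS = {
    "production": ipaddress.ip_address("172.190.46.235"),
    "testing": ipaddress.ip_address("172.171.95.51"),
}


def classify_ip(remote_addr: str) -> Optional[str]:
    """Return "production" or "testing" if the address is published, else None."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Not a syntactically valid IP address.
        return None
    for environment, known in PUBLISHED_IPS.items():
        if addr == known:
            return environment
    return None


def looks_like_gist_crawler(user_agent: str, remote_addr: str) -> bool:
    """Require both a matching user-agent and a published source IP.

    Assumption: the user-agent header contains "ProRataInc" (case-insensitive).
    The header alone is easy to spoof, so the IP check is what carries weight.
    """
    return "proratainc" in user_agent.lower() and classify_ip(remote_addr) is not None


def reverse_dns(remote_addr: str) -> Optional[str]:
    """Best-effort reverse DNS lookup; returns the hostname or None on failure."""
    try:
        return socket.gethostbyaddr(remote_addr)[0]
    except OSError:
        return None
```

A request that claims the ProRataInc user-agent but arrives from an unpublished address would fail `looks_like_gist_crawler` and can be treated like any other unverified client.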
By following these principles, ProRataIncBot efficiently and responsibly collects licensed and public content for use in Gist.ai’s RAG-powered web services, while respecting both legal requirements and the preferences of content owners.
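Site owners who want to preview how a robots.txt rule would apply to the crawler can test it locally before publishing. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt content and the example.com URLs are hypothetical, and it assumes the crawler matches rules written for the `ProRataInc` user-agent token named above. It shows how a standards-following parser reads the rules, not the crawler's internal implementation.

```python
from urllib.robotparser import RobotFileParser

# Assumption: rules addressed to this token are the ones the crawler obeys.
CRAWLER_TOKEN = "ProRataInc"

# Hypothetical robots.txt: keep the crawler out of /private/, allow the rest.
ROBOTS_TXT = """\
User-agent: ProRataInc
Disallow: /private/

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network fetch is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/articles/post.html", "/private/report.html"):
        url = "https://example.com" + path
        print(path, is_allowed(ROBOTS_TXT, CRAWLER_TOKEN, url))
```

With these rules, the article URL is reported as allowed and the URL under /private/ as disallowed for the crawler's token.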