# Scrapfly

Scrapfly is a web scraping API that enables developers to extract data from websites efficiently, offering features like JavaScript rendering, anti-bot protection bypass, and proxy rotation.

- **Category:** ai web scraping
- **Auth:** API_KEY
- **Composio Managed App Available?** N/A
- **Tools:** 14
- **Triggers:** 0
- **Slug:** `SCRAPFLY`
- **Version:** 20260316_00

## Tools

### Capture Website Screenshot

**Slug:** `SCRAPFLY_CAPTURE_SCREENSHOT`

Tool to capture a full-page or viewport screenshot of a website. Use when you need to take a screenshot with options like JS rendering, custom resolution, or accessibility testing. Returns the screenshot image directly. Supports vision deficiency simulations and dark mode.

#### Input Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `js` | string | No | JavaScript code to execute on the page before capturing screenshot. |
| `url` | string | Yes | Target URL to capture a screenshot of. |
| `cache` | boolean | No | Enable caching of the screenshot result. |
| `format` | string ("jpg" \| "png" \| "webp" \| "gif") | No | Screenshot image format (jpg, png, webp, or gif). Default is jpg. |
| `capture` | string | No | Area to capture: 'viewport' for visible area, 'fullpage' for entire page, or a CSS/XPath selector for a specific element. Default is viewport. |
| `country` | string | No | Proxy geolocation country code (e.g., 'US', 'FR', 'DE') to capture screenshot from that region. |
| `options` | string | No | Comma-separated screenshot options: 'dark_mode' for dark theme, 'block_banners' to hide cookie/privacy banners, 'print_media_format' for print styles. |
| `timeout` | integer | No | Request timeout in milliseconds. Default is 30000 (30 seconds). |
| `cache_ttl` | integer | No | Cache time-to-live in seconds. Only applies when cache is enabled. |
| `resolution` | string | No | Screen resolution in WIDTHxHEIGHT format (e.g., '1920x1080'). Default is 1920x1080. |
| `auto_scroll` | boolean | No | Automatically scroll down the page before capturing. Useful for lazy-loaded content. |
| `cache_clear` | boolean | No | Clear and refresh cache for this request. |
| `rendering_wait` | integer | No | Time to wait after page load before capturing screenshot, in milliseconds. Useful for waiting for dynamic content. |
| `wait_for_selector` | string | No | CSS selector or XPath to wait for before capturing. Screenshot is taken after this element appears. |

#### Output

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `data` | string | Yes | Data from the action execution |
| `error` | string | No | Error, if any, that occurred during execution |
| `successful` | boolean | Yes | Whether the action execution was successful |
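
The `resolution` and `options` parameters follow simple string conventions (WIDTHxHEIGHT and a comma-separated flag list). A minimal sketch of assembling a valid payload; the helper name is hypothetical, only the field names come from the parameter table above:

```python
def build_screenshot_params(url, width=1920, height=1080,
                            capture="viewport", options=()):
    """Build a SCRAPFLY_CAPTURE_SCREENSHOT input payload.

    `options` is an iterable of flags (e.g. 'dark_mode') joined into
    the comma-separated string the tool expects.
    """
    params = {
        "url": url,
        "format": "png",
        "capture": capture,
        "resolution": f"{width}x{height}",  # WIDTHxHEIGHT format
    }
    if options:
        params["options"] = ",".join(options)
    return params


params = build_screenshot_params(
    "https://example.com",
    capture="fullpage",
    options=["dark_mode", "block_banners"],
)
```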

### Capture Screenshot Metadata (HEAD)

**Slug:** `SCRAPFLY_CAPTURE_SCREENSHOT_HEAD`

Tool to capture screenshot metadata without downloading the image body. Use this for async screenshot workflows where you need the URL to retrieve the image later. Returns the screenshot URL in response, saving bandwidth compared to full screenshot retrieval.

#### Input Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `js` | string | No | Base64-encoded JavaScript to execute on the page before screenshot (max 16KB). |
| `url` | string | Yes | Target URL to capture screenshot of. |
| `cache` | boolean | No | Enable caching for repeated requests. Default is false. |
| `format` | string ("jpg" \| "png" \| "webp" \| "gif") | No | Image format options for screenshot. |
| `capture` | string | No | Capture area: 'viewport' for visible area, 'fullpage' for entire page, or a CSS selector/XPath for specific element. |
| `country` | string | No | Proxy location using ISO 3166 alpha-2 country code (e.g., 'US', 'FR', 'DE'). Default is 'us'. |
| `options` | string | No | Comma-separated flags: 'dark_mode' for dark theme, 'block_banners' to hide ads/popups, 'print_media_format' for print styles. |
| `timeout` | integer | No | Maximum request time in milliseconds (range: 60000-120000). Default is 60000. |
| `cache_ttl` | integer | No | Cache duration in seconds. Default is 86400 (24 hours). |
| `resolution` | string | No | Screen dimensions as widthxheight (e.g., '1920x1080'). Default is 1920x1080. |
| `auto_scroll` | boolean | No | Auto-scroll to bottom of page to trigger lazy-loaded content. Default is false. |
| `cache_clear` | boolean | No | Force cache refresh by clearing existing cached result. Default is false. |
| `rendering_wait` | integer | No | Delay after page load before capturing screenshot, in milliseconds. Default is 1000. |
| `vision_deficiency` | string ("none" \| "deuteranopia" \| "protanopia" \| "tritanopia" \| "achromatopsia" \| "blurredVision") | No | Accessibility simulation options for vision deficiency. |
| `wait_for_selector` | string | No | CSS selector or XPath to wait for before capturing screenshot. |

#### Output

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `data` | string | Yes | Data from the action execution |
| `error` | string | No | Error, if any, that occurred during execution |
| `successful` | boolean | Yes | Whether the action execution was successful |

### Create Scrapfly Crawler

**Slug:** `SCRAPFLY_CREATE_CRAWLER`

Tool to create a new web crawler to recursively crawl an entire website. Returns a crawler UUID for tracking progress. Use when you need to crawl multiple pages from a website with configurable limits and extraction rules.

#### Input Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `asp` | boolean | No | Enable Anti Scraping Protection bypass |
| `url` | string | Yes | Starting URL for the crawl. Must be a valid HTTP/HTTPS URL |
| `delay` | integer | No | Delay between requests in milliseconds. Range: 0-15000ms |
| `max_depth` | integer | No | Maximum crawl depth - controls how many levels deep the crawler will follow links |
| `render_js` | boolean | No | Enable JavaScript rendering |
| `page_limit` | integer | No | Maximum number of pages to crawl. Set to 0 for unlimited (subject to subscription limits) |
| `concurrency` | integer | No | Maximum concurrent scrape requests. Set 0 for account default |
| `cost_budget` | number | No | Automatically stop the crawl when reaching a credit limit |
| `exclude_paths` | array | No | Exclude URLs matching these patterns. Mutually exclusive with include_only_paths |
| `content_formats` | array | No | Array of desired output formats (markdown, extracted_data, page_metadata, etc.) |
| `include_only_paths` | array | No | Only crawl URLs matching these patterns. Supports wildcards (*). Max 100 paths |
| `follow_external_links` | boolean | No | Allow crawler to follow links to external domains |

#### Output

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `data` | string | Yes | Data from the action execution |
| `error` | string | No | Error, if any, that occurred during execution |
| `successful` | boolean | Yes | Whether the action execution was successful |
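
Because `include_only_paths` and `exclude_paths` are mutually exclusive and `include_only_paths` is capped at 100 patterns, it is worth validating the payload client-side before submitting. A sketch, with a hypothetical helper name:

```python
def build_crawler_params(url, page_limit=100, max_depth=3,
                         include_only_paths=None, exclude_paths=None):
    """Build a SCRAPFLY_CREATE_CRAWLER payload with client-side checks
    mirroring the constraints in the parameter table above."""
    if include_only_paths and exclude_paths:
        raise ValueError(
            "include_only_paths and exclude_paths are mutually exclusive"
        )
    params = {"url": url, "page_limit": page_limit, "max_depth": max_depth}
    if include_only_paths:
        if len(include_only_paths) > 100:
            raise ValueError("include_only_paths supports at most 100 patterns")
        params["include_only_paths"] = list(include_only_paths)
    if exclude_paths:
        params["exclude_paths"] = list(exclude_paths)
    return params
```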

### Extract Structured Data

**Slug:** `SCRAPFLY_EXTRACT_DATA`

Tool to extract structured data from HTML or other content using AI models, LLM prompts, or custom templates. Use when you need to parse web pages or documents into structured JSON data. Supports predefined extraction models for common types (articles, products, events) or custom extraction via prompts/templates.

#### Input Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `url` | string | No | Base URL for converting relative URLs to absolute URLs in the extracted data. Useful when extracting links and images. |
| `charset` | string | No | Document charset encoding. Use 'auto' for automatic detection or specify a charset like 'utf-8', 'iso-8859-1', etc. |
| `content` | string | Yes | HTML, text, or structured content to extract data from. This will be sent as the request body. |
| `content_type` | string ("text/html" \| "text/plain" \| "text/markdown" \| "text/csv" \| "application/json" \| "application/ld+json" \| "application/xml" \| "application/xhtml+xml") | Yes | Content type of the document body. Required to properly parse the input content. |
| `webhook_name` | string | No | Webhook name for asynchronous processing. If provided, the extraction will be processed asynchronously and results sent to the webhook. |
| `extraction_model` | string ("article" \| "event" \| "food_recipe" \| "hotel" \| "hotel_listing" \| "job_listing" \| "job_posting" \| "organization" \| "product" \| "product_listing" \| "real_estate_property" \| "real_estate_property_listing" \| "review_list" \| "search_engine_results" \| "social_media_post" \| "software" \| "stock" \| "vehicle_ad" \| "vehicle_ad_listing") | No | AI extraction models for structured data extraction. |
| `extraction_prompt` | string | No | Custom LLM prompt for extraction. Use this when you need custom extraction logic beyond predefined models. Cannot be used together with extraction_model or extraction_template. |
| `extraction_template` | string | No | JSON extraction template defining custom extraction rules. Cannot be used together with extraction_model or extraction_prompt. |

#### Output

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `data` | string | Yes | Data from the action execution |
| `error` | string | No | Error, if any, that occurred during execution |
| `successful` | boolean | Yes | Whether the action execution was successful |
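
Since `extraction_model`, `extraction_prompt`, and `extraction_template` are mutually exclusive, a small payload builder can enforce that at most one strategy is supplied. A sketch; the helper name is hypothetical:

```python
def build_extract_params(content, content_type, *, extraction_model=None,
                         extraction_prompt=None, extraction_template=None):
    """Build a SCRAPFLY_EXTRACT_DATA payload, enforcing that at most one
    of the three extraction strategies is supplied (they are mutually
    exclusive per the parameter table above)."""
    strategies = {
        "extraction_model": extraction_model,
        "extraction_prompt": extraction_prompt,
        "extraction_template": extraction_template,
    }
    chosen = {k: v for k, v in strategies.items() if v is not None}
    if len(chosen) > 1:
        raise ValueError("extraction_model, extraction_prompt, and "
                         "extraction_template are mutually exclusive")
    return {"content": content, "content_type": content_type, **chosen}
```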

### Get Scrapfly Account Information

**Slug:** `SCRAPFLY_GET_ACCOUNT_INFO`

Tool to retrieve Scrapfly account information. Use after authenticating to get API credit balance and usage stats. Returns comprehensive account data including subscription plan, usage statistics, billing info, and project settings.

#### Output

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `data` | string | Yes | Data from the action execution |
| `error` | string | No | Error, if any, that occurred during execution |
| `successful` | boolean | Yes | Whether the action execution was successful |

### Get Crawler Artifact

**Slug:** `SCRAPFLY_GET_CRAWLER_ARTIFACT`

Tool to download crawler artifact files in WARC or HAR format. Use when you need to retrieve the complete crawl results as an archive file. WARC format is recommended for large crawls as it includes gzip compression.

#### Input Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `type` | string ("warc" \| "har") | Yes | The artifact format type. WARC format is recommended for large crawls with gzip compression. |
| `crawler_uuid` | string | Yes | The unique identifier (UUID) of the crawler whose artifact you want to download. |

#### Output

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `data` | string | Yes | Data from the action execution |
| `error` | string | No | Error, if any, that occurred during execution |
| `successful` | boolean | Yes | Whether the action execution was successful |

### Get Crawler Contents

**Slug:** `SCRAPFLY_GET_CRAWLER_CONTENTS`

Tool to retrieve extracted content from crawled pages. Supports multiple output formats including markdown, text, HTML, and JSON. Use when you need to access the actual content extracted during a crawl, with optional filtering by URL and format selection.

#### Input Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `url` | string | No | Specific URL to query for single-URL retrieval. When specified, returns content only for this URL. |
| `limit` | integer | No | Number of items per page (default: 10, max: 50). |
| `plain` | boolean | No | Return raw content without JSON wrapper. Only works with single URL queries (when 'url' parameter is specified). |
| `offset` | integer | No | Number of results to skip for pagination (default: 0). |
| `formats` | string | No | Comma-separated list of content formats to retrieve. Available formats: html (raw HTML), clean_html (HTML with boilerplate removed), markdown (LLM-optimized markdown), text (plain text only), json (structured JSON), extracted_data (AI-extracted structured data), page_metadata (page metadata like title, description). |
| `crawler_uuid` | string | Yes | The unique identifier (UUID) of the crawler whose contents you want to retrieve. |

#### Output

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `data` | string | Yes | Data from the action execution |
| `error` | string | No | Error, if any, that occurred during execution |
| `successful` | boolean | Yes | Whether the action execution was successful |

### Get Crawler Status

**Slug:** `SCRAPFLY_GET_CRAWLER_STATUS`

Tool to get the current status of a crawler, including progress, pages crawled, and completion state. Use in a polling workflow to monitor crawl progress.

#### Input Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `crawler_uuid` | string | Yes | The unique identifier (UUID) of the crawler job to check status for. |

#### Output

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `data` | string | Yes | Data from the action execution |
| `error` | string | No | Error, if any, that occurred during execution |
| `successful` | boolean | Yes | Whether the action execution was successful |
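
A polling workflow can be sketched as follows. `fetch_status` stands in for whatever client wraps `SCRAPFLY_GET_CRAWLER_STATUS`; the `status` field name and the terminal state values are assumptions, so adjust them to the actual response shape:

```python
import time


def wait_for_crawler(fetch_status, crawler_uuid,
                     poll_interval=5.0, max_polls=120,
                     terminal_states=("done", "failed", "cancelled")):
    """Poll a crawler until it reaches a terminal state.

    `fetch_status` is any callable taking the crawler UUID and returning
    the status payload as a dict. NOTE: the 'status' key and the default
    terminal_states are assumptions, not documented values.
    """
    for _ in range(max_polls):
        status = fetch_status(crawler_uuid)
        if status.get("status") in terminal_states:
            return status
        time.sleep(poll_interval)
    raise TimeoutError(
        f"crawler {crawler_uuid} did not finish after {max_polls} polls"
    )
```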

### Get Crawler URLs

**Slug:** `SCRAPFLY_GET_CRAWLER_URLS`

Tool to retrieve the list of discovered and crawled URLs from a crawler. Use when you need to get all URLs found during a crawl or filter by status to analyze failed URLs with error codes. Supports pagination for large result sets.

#### Input Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `page` | integer | No | Page number for pagination (default: 1). |
| `status` | string | No | Filter results by status. Use 'visited' for successfully crawled URLs or 'failed' to get URLs that failed with error codes. |
| `per_page` | integer | No | Number of URLs to return per page (default: 100, max: 1000). |
| `crawler_uuid` | string | Yes | The unique identifier (UUID) of the crawler whose URLs you want to retrieve. |

#### Output

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `data` | string | Yes | Data from the action execution |
| `error` | string | No | Error, if any, that occurred during execution |
| `successful` | boolean | Yes | Whether the action execution was successful |
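
For large crawls, the paginated results can be walked with a simple generator. `fetch_page` stands in for whatever client wraps `SCRAPFLY_GET_CRAWLER_URLS` and is assumed to return one page of URL records as a list:

```python
def iter_crawler_urls(fetch_page, crawler_uuid, per_page=100):
    """Yield every URL record for a crawler, page by page.

    Pagination stops when a page comes back shorter than `per_page`
    (including an empty page).
    """
    page = 1
    while True:
        batch = fetch_page(crawler_uuid, page=page, per_page=per_page)
        yield from batch
        if len(batch) < per_page:
            return
        page += 1
```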

### Scrapfly Scrape

**Slug:** `SCRAPFLY_SCRAPE`

Tool to perform a web scraping request. Use when you need to fetch a page with custom configuration like JS rendering, proxies, and extraction.

#### Input Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `asp` | boolean | No | Enable anti-scraping protection bypass. |
| `url` | string | Yes | The URL to scrape. |
| `body` | string | No | Request body payload for POST/PUT requests. |
| `tags` | array | No | Custom labels for analytics. |
| `cache` | boolean | No | Enable result caching. |
| `retry` | integer | No | Number of retries on failure (default is 2). |
| `method` | string ("GET" \| "POST" \| "PUT" \| "PATCH" \| "DELETE" \| "HEAD" \| "OPTIONS") | No | HTTP method to use for the request. Default is GET. |
| `country` | string | No | Target country for the proxy (e.g., 'US', 'FR', 'DE'). Matching the target site's region can improve success rates against geo-restricted or anti-bot measures. |
| `headers` | object | No | Custom HTTP request headers to send. |
| `session` | string | No | Proxy session ID for sticky sessions. |
| `timeout` | integer | No | Timeout for the scrape request in seconds (default is 60). |
| `render_js` | boolean | No | Enable JavaScript rendering for sites requiring it. Default returns static HTML only; dynamic/SPA content requires this set to true. |
| `extract_rules` | object | No | Extraction rules (JMESPath, CSS selector, or JSONPath). |

#### Output

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `data` | string | Yes | Data from the action execution |
| `error` | string | No | Error, if any, that occurred during execution |
| `successful` | boolean | Yes | Whether the action execution was successful |
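
A payload builder for `SCRAPFLY_SCRAPE` can validate the method and attach optional fields only when needed. The helper is hypothetical, and rejecting a body on GET/HEAD is a local sanity check, not a documented API restriction:

```python
ALLOWED_METHODS = {"GET", "POST", "PUT", "PATCH", "DELETE", "HEAD", "OPTIONS"}


def build_scrape_params(url, method="GET", body=None,
                        render_js=False, country=None):
    """Build a SCRAPFLY_SCRAPE payload with basic client-side checks."""
    method = method.upper()
    if method not in ALLOWED_METHODS:
        raise ValueError(f"unsupported method: {method}")
    params = {"url": url, "method": method}
    if body is not None:
        if method in {"GET", "HEAD"}:
            raise ValueError("request body is not meaningful for GET/HEAD")
        params["body"] = body
    if render_js:
        params["render_js"] = True  # required for SPA/dynamic content
    if country:
        params["country"] = country
    return params
```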

### Scrapfly Scrape POST

**Slug:** `SCRAPFLY_SCRAPE_POST`

Tool to scrape web pages using the POST method to send data in the request body. Use when you need to scrape endpoints that require POST requests, such as form submissions or APIs that expect a data payload.

#### Input Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `asp` | boolean | No | Enable anti-scraping protection bypass. |
| `url` | string | Yes | The target URL to scrape with POST request. |
| `body` | object | Yes | POST request body payload as JSON object. This data will be sent in the request body to the target URL. |
| `tags` | array | No | Custom labels for analytics and monitoring. |
| `cache` | boolean | No | Enable result caching to avoid repeated requests. |
| `retry` | integer | No | Number of retries on failure (default is 2). |
| `country` | string | No | Target country for the proxy using ISO 3166-1 alpha-2 codes (e.g., 'US', 'FR', 'DE'). |
| `headers` | object | No | Custom HTTP request headers to send to the target URL. Note: Content-Type defaults to application/json for POST requests. |
| `session` | string | No | Proxy session ID for sticky sessions and persistent cookies. |
| `timeout` | integer | No | Timeout for the scrape request in seconds (default is 60). |
| `render_js` | boolean | No | Enable JavaScript rendering for sites requiring browser execution. |
| `extract_rules` | object | No | Extraction rules using JMESPath, CSS selector, or JSONPath for structured data extraction. |

#### Output

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `data` | string | Yes | Data from the action execution |
| `error` | string | No | Error, if any, that occurred during execution |
| `successful` | boolean | Yes | Whether the action execution was successful |

### Scrape With PUT

**Slug:** `SCRAPFLY_SCRAPE_WITH_PUT`

Tool to scrape web pages using the PUT method with a body payload. Use when the target API requires PUT requests with data in the request body. Forwards the PUT request with a custom body to the target URL. If no Content-Type header is specified, it defaults to application/x-www-form-urlencoded.

#### Input Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `asp` | boolean | No | Enable anti-scraping protection bypass. |
| `url` | string | Yes | The target URL to scrape with PUT request. |
| `body` | string | Yes | Request body payload to send with the PUT request. Should be JSON string for JSON payloads. |
| `tags` | array | No | Custom labels for analytics. |
| `cache` | boolean | No | Enable result caching. |
| `retry` | integer | No | Number of retries on failure (default is 2). |
| `country` | string | No | Target country for the proxy (e.g., 'US', 'FR', 'DE'). |
| `headers` | object | No | Custom HTTP request headers to send. For JSON body, include 'content-type': 'application/json'. |
| `session` | string | No | Proxy session ID for sticky sessions. |
| `timeout` | integer | No | Timeout for the scrape request in seconds (default is 60). |
| `render_js` | boolean | No | Enable JavaScript rendering for sites requiring it. |
| `extract_rules` | object | No | Extraction rules (JMESPath, CSS selector, or JSONPath). |

#### Output

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `data` | string | Yes | Data from the action execution |
| `error` | string | No | Error, if any, that occurred during execution |
| `successful` | boolean | Yes | Whether the action execution was successful |
