# WebScraping.AI

WebScraping.AI provides an API for web scraping with features like Chrome JS rendering, rotating proxies, and HTML parsing.

- **Category:** ai web scraping
- **Auth:** API_KEY
- **Composio Managed App Available?** N/A
- **Tools:** 7
- **Triggers:** 0
- **Slug:** `WEBSCRAPING_AI`
- **Version:** 20260316_00

## Tools

### Get account usage and quota

**Slug:** `WEBSCRAPING_AI_ACCOUNT_INFO`

Tool to retrieve account API call quota and usage. Use when checking remaining requests and subscription details.

#### Output

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `data` | string | Yes | Data from the action execution |
| `error` | string | No | Error if any occurred during the execution of the action |
| `successful` | boolean | Yes | Whether the action execution was successful |
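
Every tool on this page returns the same three output fields, so one small helper can handle them all. The sketch below is illustrative only: `unwrap_result` and `ToolExecutionError` are hypothetical names, the sample result and its `example_field` key are made up, and decoding `data` as JSON is an assumption based on `data` being documented as a string.

```python
import json
from typing import Any


class ToolExecutionError(RuntimeError):
    """Raised when a tool result reports successful == False."""


def unwrap_result(result: dict[str, Any]) -> Any:
    """Return the payload of a tool result, or raise if the call failed."""
    if not result.get("successful", False):
        raise ToolExecutionError(result.get("error") or "unknown error")
    data = result["data"]
    # Assumption: `data` may carry JSON text. Try to decode it and fall
    # back to the raw value when it is not valid JSON.
    if isinstance(data, str):
        try:
            return json.loads(data)
        except json.JSONDecodeError:
            return data
    return data


# Made-up result used only to exercise the helper.
sample = {"successful": True, "data": '{"example_field": 1}', "error": None}
payload = unwrap_result(sample)
```

Raising on `successful == False` keeps error handling in one place instead of checking the flag after every call.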

### Ask Question About Web Page

**Slug:** `WEBSCRAPING_AI_ASK_QUESTION`

Tool to get an answer to a question about a given web page using an LLM. Use when you need AI-powered analysis or extraction from a web page. The page is retrieved through proxies with Chromium JavaScript rendering.

#### Input Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `js` | boolean | No | Execute on-page JavaScript using a headless browser. Default: true. |
| `url` | string | Yes | URL of the target page to analyze. |
| `proxy` | string ("datacenter" \| "residential") | No | Proxy type to use for the request. |
| `device` | string ("desktop" \| "mobile" \| "tablet") | No | Device emulation type. Default: desktop. |
| `format` | string ("json" \| "text") | No | Response format. Default: json. |
| `country` | string | No | Proxy country code. Default: us. |
| `headers` | object | No | HTTP headers to pass to the target page. |
| `timeout` | integer | No | Maximum web page retrieval time in milliseconds. Default: 10000, Max: 30000. |
| `question` | string | Yes | Question or instructions for the LLM about the target page. |
| `wait_for` | string | No | CSS selector to wait for before returning page content. |
| `js_script` | string | No | Custom JavaScript code to execute on the target page. |
| `js_timeout` | integer | No | Maximum JavaScript rendering time in milliseconds. Default: 2000, Max: 20000. |
| `custom_proxy` | string | No | Custom proxy URL in http://user:password@host:port format. |
| `error_on_404` | boolean | No | Return error on 404 HTTP status on the target page. Default: false. |
| `error_on_redirect` | boolean | No | Return error on redirect on the target page. Default: false. |
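
The documented limits on `timeout` and `js_timeout` can be checked before a request is sent. The following is a minimal sketch, not part of any SDK: `build_ask_question_args` is a hypothetical helper, rejecting non-positive values is an assumption, and how the resulting dictionary is passed to `WEBSCRAPING_AI_ASK_QUESTION` depends on the client in use.

```python
from typing import Any, Optional

MAX_TIMEOUT_MS = 30000     # documented maximum for `timeout`
MAX_JS_TIMEOUT_MS = 20000  # documented maximum for `js_timeout`


def build_ask_question_args(
    url: str,
    question: str,
    *,
    timeout: Optional[int] = None,
    js_timeout: Optional[int] = None,
    **optional: Any,
) -> dict[str, Any]:
    """Hypothetical helper: assemble arguments for WEBSCRAPING_AI_ASK_QUESTION."""
    if not url or not question:
        raise ValueError("`url` and `question` are required")
    if timeout is not None and not 0 < timeout <= MAX_TIMEOUT_MS:
        raise ValueError(f"timeout must be in 1..{MAX_TIMEOUT_MS} ms")
    if js_timeout is not None and not 0 < js_timeout <= MAX_JS_TIMEOUT_MS:
        raise ValueError(f"js_timeout must be in 1..{MAX_JS_TIMEOUT_MS} ms")

    args: dict[str, Any] = {"url": url, "question": question}
    if timeout is not None:
        args["timeout"] = timeout
    if js_timeout is not None:
        args["js_timeout"] = js_timeout
    # Leave out unset options so the documented defaults apply.
    args.update({k: v for k, v in optional.items() if v is not None})
    return args


args = build_ask_question_args(
    "https://example.com",
    "What is the main heading of this page?",
    timeout=15000,
    format="text",
)
```

Omitting unset options, rather than sending explicit nulls, lets the service apply its own defaults.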

#### Output

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `data` | string | Yes | Data from the action execution |
| `error` | string | No | Error if any occurred during the execution of the action |
| `successful` | boolean | Yes | Whether the action execution was successful |

### Extract Fields with AI

**Slug:** `WEBSCRAPING_AI_EXTRACT_FIELDS`

Tool to extract structured data fields from a web page using AI. Returns extracted fields as JSON. Uses proxies and Chromium JavaScript rendering for page retrieval and processing.

#### Input Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `url` | string | Yes | The target website URL to scrape and extract structured data from. |
| `fields` | object | Yes | Object describing fields to extract from the page. Keys are field names, values are descriptions of what to extract. Example: {'title': 'Main product title', 'price': 'Current product price'} |
| `unknown` | string | No | Optional parameter for handling unknown or additional configuration. |
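
The `fields` object maps each field name to a plain-language description of what to extract. As a rough sketch, the hypothetical helper below builds the arguments from the example in the table; the client-side checks (non-empty mapping, string keys, non-blank descriptions) are assumptions, not documented requirements.

```python
from typing import Any, Mapping


def build_extract_fields_args(url: str, fields: Mapping[str, str]) -> dict[str, Any]:
    """Hypothetical helper: assemble arguments for WEBSCRAPING_AI_EXTRACT_FIELDS."""
    if not url:
        raise ValueError("`url` is required")
    if not fields:
        raise ValueError("`fields` must describe at least one field")
    for name, description in fields.items():
        if not isinstance(name, str) or not name.strip():
            raise ValueError("field names must be non-empty strings")
        if not isinstance(description, str) or not description.strip():
            raise ValueError(f"field {name!r} needs a non-empty description")
    # Copy into a plain dict so the caller's mapping is not shared.
    return {"url": url, "fields": dict(fields)}


args = build_extract_fields_args(
    "https://example.com/product",
    {"title": "Main product title", "price": "Current product price"},
)
```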

#### Output

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `data` | string | Yes | Data from the action execution |
| `error` | string | No | Error if any occurred during the execution of the action |
| `successful` | boolean | Yes | Whether the action execution was successful |

### Get Rendered HTML

**Slug:** `WEBSCRAPING_AI_GET_RENDERED_HTML`

Tool to retrieve fully rendered HTML of a webpage. Use when JS-generated content must be included.

#### Input Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `js` | string | No | Base64-encoded JavaScript to execute after rendering. Must be Base64-encoded; plain script strings are ignored. |
| `url` | string | Yes | The target URL to render and fetch HTML. |
| `wait` | integer | No | Wait time before capture, in milliseconds. Increase for deferred JS execution or slow-loading pages to ensure complete content capture. |
| `device` | string ("desktop" \| "mobile") | No | Browser device mode to simulate. |
| `locale` | string | No | Browser locale (RFC 5646 code). |
| `cookies` | string | No | Cookies in 'key1=value1; key2=value2;' format. |
| `headers` | object | No | Extra HTTP headers as JSON object. |
| `referer` | string | No | Referer header value. |
| `timeout` | integer | No | Request timeout, in milliseconds. |
| `proxy_type` | string ("datacenter" \| "residential") | No | Proxy type to use for the request. |
| `user_agent` | string | No | Custom User-Agent string. |
| `disable_images` | boolean | No | Whether to disable image loading. |
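
Because `js` must be Base64-encoded for this tool, a script has to be encoded before it is placed in the arguments. A minimal sketch using Python's standard `base64` module follows; `encode_js` is a hypothetical helper name, and the script and `wait` value are arbitrary examples.

```python
import base64


def encode_js(script: str) -> str:
    """Base64-encode a JavaScript snippet for the `js` parameter."""
    return base64.b64encode(script.encode("utf-8")).decode("ascii")


script = "window.scrollTo(0, document.body.scrollHeight);"
args = {
    "url": "https://example.com",
    "js": encode_js(script),  # plain script strings are ignored by this tool
    "wait": 2000,             # milliseconds to wait before capture
}
```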

#### Output

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `data` | string | Yes | Data from the action execution |
| `error` | string | No | Error if any occurred during the execution of the action |
| `successful` | boolean | Yes | Whether the action execution was successful |

### Get Selected HTML

**Slug:** `WEBSCRAPING_AI_GET_SELECTED_HTML`

Tool to extract HTML from specific page elements using CSS selectors. Use when you need HTML from a particular section rather than the entire page.

#### Input Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `js` | boolean | No | Execute JavaScript via headless browser. Default: true. |
| `url` | string | Yes | URL of the target page to scrape. |
| `proxy` | string ("datacenter" \| "residential") | No | Proxy type options. |
| `device` | string ("desktop" \| "mobile" \| "tablet") | No | Device type options. |
| `country` | string | No | Proxy country code (2-letter ISO code). Default: us. |
| `headers` | object | No | HTTP headers to pass to the target page as a JSON object. |
| `timeout` | integer | No | Maximum retrieval time in milliseconds. Default: 10000, max: 30000. |
| `selector` | string | No | CSS selector to extract specific page elements. If not provided, returns the entire page HTML. |
| `wait_for` | string | No | CSS selector to wait for before returning content. |
| `js_script` | string | No | Custom JavaScript code to execute on the page before extraction. |
| `js_timeout` | integer | No | JavaScript rendering time in milliseconds. Default: 2000, max: 20000. |
| `custom_proxy` | string | No | Custom proxy URL to use instead of built-in proxies. |
| `error_on_404` | boolean | No | Return error when target page responds with 404 status. Default: false. |
| `error_on_redirect` | boolean | No | Return error when target page redirects. Default: false. |
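
The table above does not state the expected `custom_proxy` format. The sketch below assumes it matches the `http://user:password@host:port` form documented for `WEBSCRAPING_AI_ASK_QUESTION`. `build_custom_proxy` is a hypothetical helper, all values are placeholders, and percent-encoding the credentials is an assumption about what the service accepts.

```python
from urllib.parse import quote


def build_custom_proxy(user: str, password: str, host: str, port: int) -> str:
    """Hypothetical helper: format a proxy URL as http://user:password@host:port."""
    if not 0 < port < 65536:
        raise ValueError("port must be in 1..65535")
    # Percent-encode credentials so characters such as '@' or ':' cannot
    # be mistaken for URL delimiters.
    return f"http://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}"


args = {
    "url": "https://example.com",
    "selector": "main article",
    "custom_proxy": build_custom_proxy("user", "p@ss", "proxy.example.com", 8080),
}
```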

#### Output

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `data` | string | Yes | Data from the action execution |
| `error` | string | No | Error if any occurred during the execution of the action |
| `successful` | boolean | Yes | Whether the action execution was successful |

### Get Selected Multiple Elements

**Slug:** `WEBSCRAPING_AI_GET_SELECTED_MULTIPLE`

Tool to extract HTML of multiple page areas by URL and CSS selectors. Use when you need to extract multiple elements without HTML parsing on your side.

#### Input Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `js` | boolean | No | Execute on-page JavaScript using a headless browser. Default: true. |
| `url` | string | Yes | The target URL to scrape. |
| `proxy` | string ("datacenter" \| "residential") | No | Proxy type options. |
| `country` | string | No | Country of the proxy to use. Default: us. |
| `timeout` | integer | No | Maximum web page retrieval time in milliseconds. Default: 15000, max: 30000. |
| `selectors` | array | No | Multiple CSS selectors to extract. If null, returns the whole page HTML. |
| `js_timeout` | integer | No | Maximum JavaScript rendering time in milliseconds. Default: 2000. |
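
As an illustration of preparing the `selectors` array, the hypothetical helper below trims each selector, rejects blank entries, and removes duplicates while preserving order. These clean-up steps are assumptions made for the example, not documented requirements of the tool.

```python
from typing import Any, Iterable


def build_selected_multiple_args(url: str, selectors: Iterable[str]) -> dict[str, Any]:
    """Hypothetical helper: assemble arguments for WEBSCRAPING_AI_GET_SELECTED_MULTIPLE."""
    if not url:
        raise ValueError("`url` is required")
    cleaned = [s.strip() for s in selectors]
    if any(not s for s in cleaned):
        raise ValueError("selectors must be non-empty strings")
    # dict.fromkeys keeps the first occurrence of each selector, in order.
    unique = list(dict.fromkeys(cleaned))
    args: dict[str, Any] = {"url": url}
    if unique:
        # Omit the key entirely when no selectors are given, so the tool
        # falls back to returning the whole page HTML.
        args["selectors"] = unique
    return args


args = build_selected_multiple_args("https://example.com", ["h1", " .price ", "h1"])
```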

#### Output

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `data` | string | Yes | Data from the action execution |
| `error` | string | No | Error if any occurred during the execution of the action |
| `successful` | boolean | Yes | Whether the action execution was successful |

### Get Text

**Slug:** `WEBSCRAPING_AI_GET_TEXT`

Tool to retrieve raw text content from a specified web page. Returns unstructured plain text; Markdown formatting (code fences, lists, headings) is not preserved. Use when you need plain text extraction from a URL. Use `FIRECRAWL_EXTRACT` instead when formatted Markdown output is required.

#### Input Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `url` | string | Yes | The target URL to scrape text from. |
| `proxy` | string ("us" \| "eu") | No | Proxy region to use for the request (e.g., 'us' or 'eu'). |
| `locale` | string | No | Browser locale/language (e.g., 'en-US'). |
| `session` | string | No | Session ID for preserving cookies across multiple calls. |
| `timeout` | integer | No | Request timeout in seconds (must be >= 1). |
| `render_js` | boolean | No | Whether to render JavaScript on the page before extracting text. |

#### Output

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `data` | string | Yes | Data from the action execution |
| `error` | string | No | Error if any occurred during the execution of the action |
| `successful` | boolean | Yes | Whether the action execution was successful |
