# Diffbot

Diffbot provides AI-powered tools to extract and structure data from web pages, transforming unstructured web content into structured, linked data.

- **Category:** artificial intelligence
- **Auth:** API_KEY
- **Composio Managed App Available?** N/A
- **Tools:** 35
- **Triggers:** 0
- **Slug:** `DIFFBOT`
- **Version:** 20260312_00

## Tools

### Combine Entity Profiles

**Slug:** `DIFFBOT_COMBINE_ENTITY_PROFILES`

Combine multiple entity profiles into a unified view using the Diffbot Knowledge Graph. Returns enhanced person or organization data by matching on identifying attributes like name, email, employer, or URL. Use this to enrich partial entity data, merge duplicate profiles, or verify entity identity.

#### Input Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `ip` | string | No | IP address of the entity to enhance. Can be used with types Person and Organization. |
| `url` | array | No | Origin or homepage URI(s) of entity to enhance. Can be used with types Person and Organization. |
| `name` | array | No | Name(s) of the entity to enhance. Can be used with types Person and Organization. Multiple names can be provided for better matching. |
| `type` | string ("Person") | No | Valid entity types for the combine API. |
| `email` | string | No | Email address of the entity to enhance. Can be used only with type Person. |
| `phone` | string | No | Phone number of the entity to enhance. Can be used with types Person and Organization. |
| `title` | string | No | Job title of the entity to enhance. Can be used only with type Person. |
| `filter` | string | No | Semi-colon separated path filter to include specific fields in response JSON. Use dot notation (e.g., 'skills.name') or JsonPath expression (e.g., '$.name;$.locations.country.name'). |
| `school` | string | No | School or educational institution of the entity to enhance. Can be used only with type Person. |
| `search` | boolean | No | When true, Diffbot will attempt to search the web for origins and merge relevant results with what's found in the Knowledge Graph. |
| `refresh` | boolean | No | When true, Diffbot will attempt to recrawl all origins of the identified entity and reconstruct the entity from refreshed data. |
| `customId` | string | No | User-defined ID for correlation and tracking purposes. |
| `employer` | string | No | Employer of the entity to enhance. Can be used only with type Person. |
| `jsonmode` | string (" " \| "extended") | No | JSON mode options for response formatting. |
| `location` | string | No | Location of the entity to enhance. Can be used with types Person and Organization. |
| `threshold` | number | No | Enhance similarity threshold score (0.0 to 1.0). Higher values require stronger match confidence. |
| `filterExclude` | string | No | Semi-colon separated path filter to exclude specific fields from response JSON. Use dot notation or JsonPath expression. |
| `nonCanonicalFacts` | boolean | No | When true, returns non-canonical facts in addition to canonical ones. |

#### Output

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `data` | string | Yes | Data from the action execution |
| `error` | string | No | Error if any occurred during the execution of the action |
| `successful` | boolean | Yes | Whether the action execution was successful |
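
The `filter` and `filterExclude` parameters take several field paths joined by semicolons. A minimal sketch of assembling such a string, and a full argument set, might look like the following. The helper `build_path_filter` and the person details are hypothetical and used only for illustration; how the arguments are submitted depends on your client and is not shown.

```python
def build_path_filter(paths):
    """Join field paths into the semicolon-separated string used by `filter`."""
    cleaned = [p.strip() for p in paths if p and p.strip()]
    return ";".join(cleaned)


# Hypothetical arguments for DIFFBOT_COMBINE_ENTITY_PROFILES; the person and
# employer below are made up for illustration.
combine_args = {
    "type": "Person",
    "name": ["Jane Doe", "J. Doe"],  # array parameter: several name variants
    "employer": "Example Corp",      # Person-only parameter
    "threshold": 0.8,                # must lie between 0.0 and 1.0
    "filter": build_path_filter(["$.name", "$.locations.country.name"]),
}

print(combine_args["filter"])
```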

### Create Bulk Extract Job

**Slug:** `DIFFBOT_CREATE_BULK`

Tool to submit a bulk extract job to process multiple URLs with Extract APIs. Use when you need to process many URLs asynchronously using any Extract API. The job will process URLs in the background and provide downloadable results.

#### Input Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `name` | string | Yes | Job name for identification. Must be unique. |
| `urls` | string | Yes | URLs to process. Can be a single URL or comma-separated list of URLs. |
| `apiUrl` | string | Yes | Extract API endpoint URL to use for processing (e.g., 'https://api.diffbot.com/v3/article'). The token will be added automatically. |
| `notifyEmail` | string | No | Email address to notify when the job completes. |
| `notifyWebhook` | string | No | Webhook URL to POST to when the job completes. |

#### Output

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `data` | string | Yes | Data from the action execution |
| `error` | string | No | Error if any occurred during the execution of the action |
| `successful` | boolean | Yes | Whether the action execution was successful |
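
Because `urls` is a single string rather than an array, a list of URLs has to be joined with commas before submission. The sketch below shows one way to do that; `build_bulk_extract_args`, the job name, and the example URLs are hypothetical.

```python
def build_bulk_extract_args(name, urls, api_url, notify_email=None):
    """Assemble arguments for DIFFBOT_CREATE_BULK from a list of URLs."""
    if not urls:
        raise ValueError("at least one URL is required")
    args = {
        "name": name,            # must be unique per the table above
        "urls": ",".join(urls),  # the tool expects one comma-separated string
        "apiUrl": api_url,       # the token is added automatically
    }
    if notify_email:
        args["notifyEmail"] = notify_email
    return args


bulk_args = build_bulk_extract_args(
    name="articles-batch-001",
    urls=["https://example.com/a", "https://example.com/b"],
    api_url="https://api.diffbot.com/v3/article",
)
print(bulk_args["urls"])
```

URLs that themselves contain commas would be ambiguous in this format; this sketch does not attempt to handle them.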

### Create or Update Custom API

**Slug:** `DIFFBOT_CREATE_CUSTOM_API`

Tool to create or update the parameters and ruleset of a Custom API. Use this when you need to define custom extraction rules for specific websites that require tailored parsing logic beyond standard Diffbot APIs. Allows defining URL patterns, CSS selectors, extraction rules, and preprocessing filters to extract structured data from websites with unique layouts.

#### Input Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `api` | string | Yes | The specific API being targeted. Always precede the API name with "/api/" as in "/api/article" (except for "all") |
| `notes` | array | No | An array of strings that can be added manually. The API automatically adds a note specifying when the API was last updated |
| `rules` | array | No | An array of objects that defines a set of rules for the specific urlPattern-api combination |
| `testUrl` | string | No | A URL that can be used to check that the rule still works as intended. This is the page that will load automatically when editing the ruleset in the Dashboard UI |
| `useProxy` | string | No | Used to disable proxies (when they have been set globally), by applying the value "none" |
| `prefilters` | array | No | An array of CSS selector strings that should be omitted from the DOM before extraction occurs |
| `urlPattern` | string | Yes | A regex pattern that defines the URLs for which the ruleset will be applied |
| `renderOptions` | string | No | Rendering options for the page (e.g., 'mobile', 'desktop') |
| `xForwardHeaders` | object | No | X-Forward headers configuration. |

#### Output

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `data` | string | Yes | Data from the action execution |
| `error` | string | No | Error if any occurred during the execution of the action |
| `successful` | boolean | Yes | Whether the action execution was successful |
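
Two of the inputs above have format constraints that are easy to check before submitting: `api` must start with "/api/" (unless it is "all"), and `urlPattern` must be a regular expression. The following sketch validates both locally. The helper name, the example domain, and the CSS selectors are hypothetical, and Python's regex dialect may not match the one Diffbot applies, so the compile step is only a sanity check.

```python
import re


def build_custom_api_args(api, url_pattern, prefilters=None, test_url=None):
    """Assemble arguments for DIFFBOT_CREATE_CUSTOM_API with basic checks."""
    # The table above says the API name must start with "/api/", except "all".
    if api != "all" and not api.startswith("/api/"):
        raise ValueError(f"api must start with '/api/' or be 'all', got {api!r}")
    # Catch obvious regex typos locally; raises re.error on invalid patterns.
    re.compile(url_pattern)

    args = {"api": api, "urlPattern": url_pattern}
    if prefilters:
        args["prefilters"] = list(prefilters)
    if test_url:
        args["testUrl"] = test_url
    return args


custom_api_args = build_custom_api_args(
    api="/api/article",
    url_pattern=r"https?://blog\.example\.com/.*",
    prefilters=[".cookie-banner", "#newsletter-popup"],
    test_url="https://blog.example.com/post-1",
)
print(custom_api_args)
```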

### Create Bulk Enhance Job

**Slug:** `DIFFBOT_CREATE_KG_BULK_ENHANCE`

Tool to submit a bulk enhance job to enrich multiple entities asynchronously. Use when you need to process many Person or Organization records in batch. The API accepts entity descriptions and returns enriched data from the Diffbot Knowledge Graph.

#### Input Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `name` | string | No | Human-readable name for the bulk job to help identify it later. |
| `size` | integer | No | Maximum number of results to return per entity. Default is 1. |
| `search` | boolean | No | If true, Diffbot will search the web for additional origins and merge results. Default is false. |
| `refresh` | boolean | No | If true, Diffbot will recrawl all origins and reconstruct entities from fresh data. Default is false. |
| `entities` | array | Yes | List of entities to enhance. Each entity can be a Person, Organization, or a reference by ID. Minimum 1 entity required. |
| `jsonmode` | string (" " \| "extended") | No | JSON mode options for enhance results. |
| `threshold` | number | No | Similarity threshold for matching entities (0.0 to 1.0). Higher values require closer matches. |
| `webhookurl` | string | No | Webhook URL to receive notifications when the job completes. |
| `nonCanonicalFacts` | boolean | No | If true, returns non-canonical facts in results. Default is false. |

#### Output

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `data` | string | Yes | Data from the action execution |
| `error` | string | No | Error if any occurred during the execution of the action |
| `successful` | boolean | Yes | Whether the action execution was successful |

### Delete Custom API

**Slug:** `DIFFBOT_DELETE_CUSTOM_API`

Tool to delete custom API definitions for a given URL pattern. Removes custom extraction rules from your account. Use when you need to remove previously configured custom APIs.

#### Input Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `api` | string | Yes | Base API of the custom API to delete. Always precede the API name with '/api/' (e.g., '/api/article') |
| `urlPattern` | string | Yes | URL pattern (regex) of the custom API to delete. This defines which URLs the custom API applies to. |

#### Output

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `data` | string | Yes | Data from the action execution |
| `error` | string | No | Error if any occurred during the execution of the action |
| `successful` | boolean | Yes | Whether the action execution was successful |

### Delete KG Enhance Bulkjob

**Slug:** `DIFFBOT_DELETE_KG_ENHANCE_BULKJOB`

Tool to delete an Enhance Bulkjob. Removes the bulk job and its results from the system. Use when cleaning up completed or failed jobs.

#### Input Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `bulkjobId` | string | Yes | Enhance Bulkjob ID to delete (e.g., 'B-a6a72339-3af7') |

#### Output

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `data` | string | Yes | Data from the action execution |
| `error` | string | No | Error if any occurred during the execution of the action |
| `successful` | boolean | Yes | Whether the action execution was successful |

### Download Bulk Job Results

**Slug:** `DIFFBOT_DOWNLOAD_BULK_RESULTS`

Tool to download results of a bulk enhance job with filtering options via POST request. Use this to retrieve processed results from a completed or running bulk job. Supports multiple export formats (json, jsonl, csv, xls, xlsx) and various filtering options to customize the output. HTTP 200 indicates that results are ready; HTTP 201 means the job is still executing.

#### Input Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `from` | integer | No | Starting index for pagination (offset). Should be used together with 'size' parameter. |
| `head` | integer | No | Return first n results. Use this to limit the number of results returned. |
| `size` | integer | No | Maximum number of results to return. Should be specified with 'from' parameter for pagination. |
| `wait` | integer | No | Seconds to wait for bulkjob results to export. Results will continue to export in the background. Use 0 to only trigger an export without waiting. |
| `filter` | string | No | Semi-colon separated path filter to filter response json. You can use simple dot notation like 'skills.name' or JsonPath expressions like '$.name;$.locations.country.name'. |
| `format` | string ("json" \| "jsonl" \| "csv" \| "xls" \| "xlsx") | No | Export format options for bulk results. |
| `bulkjobId` | string | Yes | Enhance Bulkjob ID (e.g., 'B-89cfc3b2-e744'). This is the unique identifier of the bulk job whose results you want to download. |
| `exportfile` | string | No | File name of the export file. Specify a custom filename for the exported results. |
| `exportspec` | string | No | The spec defines the columns to export. This is applicable for csv, xls and xlsx formats. Simple spec looks like 'name;summary'. For complex specs including list handling, see Diffbot documentation. |
| `exportquery` | boolean | No | Prefixes the enhance query parameters to the CSV export result. Only applicable for CSV exports. |
| `onlyMatches` | boolean | No | Return only records that have a match. Use this to filter out records without matches. |
| `filterExclude` | string | No | Semi-colon separated path filter to exclude data from response json. You can use simple dot notation or JsonPath expressions. |
| `exportseparator` | string | No | Separator for multi-value fields when exporting columnar results (csv, xls, xlsx formats). |

#### Output

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `data` | string | Yes | Data from the action execution |
| `error` | string | No | Error if any occurred during the execution of the action |
| `successful` | boolean | Yes | Whether the action execution was successful |
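
The `from` and `size` parameters are meant to be used together for pagination. As a rough sketch, the windows for a result set can be generated as below. It assumes that `from` is a zero-based offset and that the total number of records is known in advance (for example, the number of entities submitted); neither is confirmed by the table above. `page_windows` and the bulk job ID are hypothetical.

```python
def page_windows(total, page_size):
    """Yield `from`/`size` pairs that cover `total` records page by page."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    for start in range(0, total, page_size):
        # The last page may be shorter than page_size.
        yield {"from": start, "size": min(page_size, total - start)}


# Hypothetical: 250 records fetched 100 at a time as JSON Lines.
requests_to_make = [
    {"bulkjobId": "B-example", "format": "jsonl", **window}
    for window in page_windows(total=250, page_size=100)
]
for args in requests_to_make:
    print(args)
```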

### Enhance Entity with Knowledge Graph

**Slug:** `DIFFBOT_ENHANCE_ENTITY`

Enrich a person or organization with comprehensive data from the Diffbot Knowledge Graph. Provide identifiers like name, email, employer, or URL and receive detailed entity information including employment history, education, location, skills, and more. Use when you need to gather all publicly available knowledge about a specific person or organization from billions of web pages.

#### Input Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `id` | string | No | DiffbotId of entity to enhance. Can be used with types Person and Organization. If you know the exact entity ID, this is the most precise identifier |
| `ip` | string | No | IP address of the entity to enhance. Can be used with types Person and Organization |
| `url` | array | No | Origin or homepage URI of entity to enhance. Can be used with types Person and Organization. Provide multiple URLs associated with the entity |
| `name` | array | No | Name of the entity to enhance. Can be used with types Person and Organization. Provide multiple name variations to increase match accuracy |
| `size` | integer | No | Maximum number of results to return (default=1). Set higher to get multiple potential matches |
| `type` | string ("Person" \| "Organization") | No | Type of entity to enhance. |
| `email` | array | No | Email address(es) of the entity to enhance. Can be used only with type Person. Provide multiple emails if available |
| `phone` | string | No | Phone number of the entity to enhance. Can be used with types Person and Organization |
| `title` | string | No | Job title of the entity to enhance. Can be used only with type Person |
| `filter` | string | No | Semi-colon separated path filter to include only specific fields in response JSON. Use dot notation (e.g., 'skills.name') or JsonPath expressions (e.g., '$.name;$.locations.country.name'). Reduces response size |
| `school` | string | No | School or educational institution of the entity to enhance. Can be used only with type Person |
| `search` | boolean | No | If true, Diffbot will attempt to search the web for origins for the search query and merge relevant results with what's found in the KG (default=false). Useful when entity might not be in the KG yet |
| `refresh` | boolean | No | If true, Diffbot will attempt to recrawl all origins of the identified entity and reconstruct the entity from refreshed data (default=false). This provides the most up-to-date information but takes longer |
| `customId` | string | No | User-defined ID for correlation and tracking purposes. Will be returned in the response for request matching |
| `employer` | string | No | Employer of the entity to enhance. Can be used only with type Person. Helps identify the correct person when names are common |
| `jsonmode` | string (" " \| "extended") | No | JSON mode for response formatting. |
| `location` | string | No | Location of the entity to enhance. Can be used with types Person and Organization. Can be city, state, country, or full address |
| `threshold` | number | No | Enhance similarity threshold (0.0 to 1.0). Only return matches with similarity score above this threshold. Higher values return fewer but more confident matches |
| `description` | string | No | Description of the entity to enhance. Can be used with types Person and Organization. Helps identify the correct entity |
| `filterExclude` | string | No | Semi-colon separated path filter to exclude specific fields from response JSON. Use dot notation or JsonPath expressions. Useful to remove verbose fields |
| `nonCanonicalFacts` | boolean | No | If true, returns non-canonical facts in addition to canonical facts (default=false). Non-canonical facts are alternative values found across multiple sources |

#### Output

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `data` | string | Yes | Data from the action execution |
| `error` | string | No | Error if any occurred during the execution of the action |
| `successful` | boolean | Yes | Whether the action execution was successful |
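
Several parameters in the table above apply only to type Person (`email`, `title`, `school`, `employer`), and `threshold` is limited to the range 0.0 to 1.0. The sketch below checks an argument set against those two rules before it is submitted. `check_enhance_args` and the sample values are hypothetical; the tool itself may enforce these rules differently or not at all.

```python
# Parameters the table above marks as "Can be used only with type Person".
PERSON_ONLY = {"email", "title", "school", "employer"}


def check_enhance_args(args):
    """Return a list of problems found in DIFFBOT_ENHANCE_ENTITY arguments."""
    problems = []
    if args.get("type") == "Organization":
        misplaced = sorted(PERSON_ONLY.intersection(args))
        if misplaced:
            problems.append(
                "Person-only parameters used with Organization: "
                + ", ".join(misplaced)
            )
    threshold = args.get("threshold")
    if threshold is not None and not 0.0 <= threshold <= 1.0:
        problems.append("threshold must be between 0.0 and 1.0")
    return problems


good = {"type": "Person", "name": ["Jane Doe"], "employer": "Example Corp"}
bad = {"type": "Organization", "name": ["Example Corp"], "title": "CEO", "threshold": 1.5}

print(check_enhance_args(good))
print(check_enhance_args(bad))
```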

### Diffbot Extract Job

**Slug:** `DIFFBOT_EXTRACT_JOB`

Tool to extract structured job posting data from job listing pages. Returns job title, company, location, salary, requirements, skills, and other job-related information. Use when you need to parse and structure data from job postings.

#### Input Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `url` | string | Yes | Target URL of the job listing page to extract structured data from |
| `proxy` | string | No | IP address of a custom proxy that will be used to fetch the target page. Leave empty to use default proxy |
| `fields` | string | No | Comma-separated list of optional fields to be returned from any fully-extracted pages (e.g. 'querystring,links'). Valid values: links, extlinks, meta, querystring, breadcrumb |
| `timeout` | integer | No | Maximum time in milliseconds to wait for the retrieval/fetch of content from the requested URL. Default is 30000 (30 seconds) |
| `callback` | string | No | Use for jsonp requests. Needed for cross-domain ajax |
| `useProxy` | string | No | Set to 'default' to use Diffbot's datacenter proxy for this request. Set to 'none' to instruct Extract to not use proxies |
| `proxyAuth` | string | No | Authentication parameters for the custom proxy specified in the proxy parameter (format: username:password) |

#### Output

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `data` | string | Yes | Data from the action execution |
| `error` | string | No | Error if any occurred during the execution of the action |
| `successful` | boolean | Yes | Whether the action execution was successful |
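
The `fields` parameter is a comma-separated string limited to the values listed in the table above. A small sketch of building it from a list, rejecting anything outside that set, could look like this. `build_fields` and the example job URL are hypothetical.

```python
# Valid values taken from the `fields` row in the table above.
VALID_FIELDS = ("links", "extlinks", "meta", "querystring", "breadcrumb")


def build_fields(fields):
    """Validate optional field names and join them with commas."""
    unknown = [f for f in fields if f not in VALID_FIELDS]
    if unknown:
        raise ValueError(f"unsupported fields: {unknown}")
    return ",".join(fields)


extract_job_args = {
    "url": "https://jobs.example.com/postings/123",
    "fields": build_fields(["querystring", "links"]),
    "timeout": 30000,  # milliseconds; matches the documented default
}
print(extract_job_args["fields"])
```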

### Diffbot Extract List

**Slug:** `DIFFBOT_EXTRACT_LIST`

Tool to extract structured data from list-style pages like news indexes, product listings, and directory pages. Returns an array of items with their titles, links, and descriptions. Use when you need to extract multiple items from a page organized as a list or index.

#### Input Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `url` | string | Yes | Target URL to extract list data from. Must be a valid URL starting with http or https. |
| `proxy` | string | No | Specify an IP address of a custom proxy that will be used to fetch the target page. |
| `fields` | string | No | Comma-separated list of optional fields to be returned from any fully-extracted pages (e.g., 'links,meta,querystring'). Valid values: links, extlinks, meta, querystring, breadcrumb. |
| `timeout` | integer | No | Sets a value in milliseconds to wait for the retrieval/fetch of content from the requested URL. The default timeout for the third-party response is 30 seconds (30000). |
| `callback` | string | No | Use for jsonp requests. Needed for cross-domain ajax. |
| `useProxy` | string | No | Set to 'default' to use Diffbot's datacenter proxy for this request. 'none' will instruct Extract to not use proxies, even if proxies have been enabled for this particular URL globally. |
| `proxyAuth` | string | No | Used to specify the authentication parameters that will be used with a custom proxy specified in the &proxy parameter. |

#### Output

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `data` | string | Yes | Data from the action execution |
| `error` | string | No | Error if any occurred during the execution of the action |
| `successful` | boolean | Yes | Whether the action execution was successful |

### Get Diffbot Account Details

**Slug:** `DIFFBOT_GET_ACCOUNT`

Retrieves comprehensive Diffbot account information including subscription plan details, credit balance, usage history, and account status. Returns account holder name, email, current plan, available credits, and daily usage statistics for the past 31 days. Use this to check your account's credit balance, monitor API usage patterns, verify account status, or retrieve account metadata.

#### Output

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `data` | string | Yes | Data from the action execution |
| `error` | string | No | Error if any occurred during the execution of the action |
| `successful` | boolean | Yes | Whether the action execution was successful |

### Diffbot Analyze

**Slug:** `DIFFBOT_GET_ANALYZE`

Automatically analyzes a web page to determine its type and extract structured data. The Analyze API intelligently classifies pages into types (article, product, discussion, image, video, organization, etc.) and extracts relevant structured data. Use this when you need to process URLs of unknown type or want automatic extraction without specifying the page type in advance.

#### Input Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `url` | string | Yes | The full URL of the page to analyze, including http:// or https:// |
| `mode` | string | No | Restrict extraction to a specific page type. Options: article, product, discussion, image, video, list, event. If not specified, all types are considered. |
| `fields` | string | No | Comma-separated list of additional fields to include or limit output fields. Options include: links, extlinks, meta, querystring, breadcrumb, or specific field names to limit output. |
| `timeout` | integer | No | Maximum time (in milliseconds) to wait for API response. Default is 30000ms (30 seconds). Maximum is 300000ms (5 minutes). |
| `callback` | string | No | Optional JSONP callback function name. If set, the API returns JSONP-wrapped response. |
| `fallback` | string | No | API to use if page type cannot be determined. Options: article, product, discussion, image, video. |
| `discussion` | boolean | No | Set to false to disable automatic extraction of comments/discussions from the page. Default is true. |

#### Output

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `data` | string | Yes | Data from the action execution |
| `error` | string | No | Error if any occurred during the execution of the action |
| `successful` | boolean | Yes | Whether the action execution was successful |
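
The `timeout` value is expressed in milliseconds, with a documented default of 30000 and a maximum of 300000. If your own code works in seconds, a conversion helper along these lines keeps the value inside that limit. `timeout_ms` is a hypothetical helper, and silently capping at the maximum (rather than raising an error) is a design choice of this sketch.

```python
MAX_TIMEOUT_MS = 300_000  # documented maximum (5 minutes)


def timeout_ms(seconds):
    """Convert seconds to milliseconds, capped at the documented maximum."""
    ms = int(seconds * 1000)
    if ms <= 0:
        raise ValueError("timeout must be positive")
    return min(ms, MAX_TIMEOUT_MS)


analyze_args = {
    "url": "https://example.com/some-page",
    "mode": "article",         # optional: restrict to one page type
    "timeout": timeout_ms(45),
}
print(analyze_args["timeout"])
print(timeout_ms(600))
```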

### Get Article Data

**Slug:** `DIFFBOT_GET_ARTICLE`

Tool to extract information from articles, including authors, publication dates, and images. Use when you need structured metadata from a web article URL.

#### Input Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `url` | string | Yes | Full URL of the web page to analyze, must start with http or https |
| `mode` | string | No | Extraction mode override (defaults to 'article') |
| `stats` | boolean | No | Whether to include statistics like word count |
| `fields` | string | No | List of specific fields to include in the response. If provided, only these fields are returned. |
| `paging` | string | No | Paging token for multi-page articles (returned in previous response) |
| `timeout` | integer | No | Maximum time in milliseconds to wait for page rendering |
| `callback` | string | No | Name of the JSONP callback function (if using JSONP) |
| `discussion` | boolean | No | Whether to include discussion/comment data in the response |

#### Output

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `data` | string | Yes | Data from the action execution |
| `error` | string | No | Error if any occurred during the execution of the action |
| `successful` | boolean | Yes | Whether the action execution was successful |

### Get Bulk Job Data

**Slug:** `DIFFBOT_GET_BULK_DATA`

Tool to download extracted results from a completed bulk job. Use after a bulk job has finished processing to retrieve the data. Supports JSON and CSV formats.

#### Input Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `num` | integer | No | Limit results to the N most recently processed URLs. Useful for testing or sampling large result sets. |
| `name` | string | Yes | Name of the bulk job whose results you want to download. Must match the job name specified when the job was created. |
| `type` | string ("data" \| "urls") | No | Type of data to retrieve. |
| `format` | string ("json" \| "csv") | No | Output format for bulk job data. |

#### Output

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `data` | string | Yes | Data from the action execution |
| `error` | string | No | Error if any occurred during the execution of the action |
| `successful` | boolean | Yes | Whether the action execution was successful |

### Get Bulk Job Status

**Slug:** `DIFFBOT_GET_BULK_JOB_STATUS`

Tool to poll the status of a specific Diffbot Knowledge Graph Enhance bulk job. Use when you need to check the progress, completion status, or details of a bulk enhancement job.

#### Input Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `bulkjobId` | string | Yes | Enhance Bulkjob ID to poll status for |

#### Output

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `data` | string | Yes | Data from the action execution |
| `error` | string | No | Error if any occurred during the execution of the action |
| `successful` | boolean | Yes | Whether the action execution was successful |

### Get Bulk Job Results

**Slug:** `DIFFBOT_GET_BULK_RESULTS`

Tool to download the results of a completed Enhance Bulkjob. Returns enriched records from the bulk job. Use after a bulk enhance job has completed processing.

#### Input Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `from` | integer | No | Starting index for pagination (use with size parameter) |
| `head` | integer | No | Return first n results |
| `size` | integer | No | Maximum number of results to return. Should be specified with from parameter. Use -1 for all results. |
| `wait` | integer | No | Seconds to wait for bulkjob results to export. Results will continue to export in the background. Use 0 to only trigger an export without waiting. |
| `filter` | string | No | Semi-colon separated path filter to filter response json. Use dot notation like 'skills.name' or JsonPath expressions like '$.name;$.locations.country.name'. |
| `format` | string ("json" \| "jsonl" \| "csv" \| "xls" \| "xlsx") | No | Export format options. |
| `bulkjobId` | string | Yes | Enhance Bulkjob ID to retrieve results for |
| `exportfile` | string | No | File name of the export file |
| `exportspec` | string | No | Defines the columns to export for csv, xls and xlsx formats. Use semi-colon separated field paths like 'name;summary' or JsonPath expressions. See Diffbot documentation for advanced syntax. |
| `exportquery` | string ("true" \| "false") | No | Prefixes the enhance query parameters to the CSV export result. Only applicable for CSV exports. |
| `onlyMatches` | string ("true" \| "false") | No | Return only records that have a match. |
| `filterExclude` | string | No | Semi-colon separated path filter to exclude data from response json. Use dot notation or JsonPath expressions. |
| `exportseparator` | string | No | Separator for multi-value fields when exporting columnar results |

#### Output

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `data` | string | Yes | Data from the action execution |
| `error` | string | No | Error if any occurred during the execution of the action |
| `successful` | boolean | Yes | Whether the action execution was successful |
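
In this tool's schema, `exportquery` and `onlyMatches` are string enums ("true" or "false") rather than booleans, so Python booleans need converting before they are passed. A minimal sketch, with the hypothetical helper `to_flag` and a made-up bulk job ID:

```python
def to_flag(value):
    """Render a Python boolean as the "true"/"false" string the schema expects."""
    return "true" if value else "false"


bulk_results_args = {
    "bulkjobId": "B-example",
    "format": "csv",
    "exportspec": "name;summary",    # columns to export, semicolon-separated
    "exportquery": to_flag(True),    # CSV only: prefix the query parameters
    "onlyMatches": to_flag(False),
    "size": -1,                      # -1 requests all results per the table
}
print(bulk_results_args["exportquery"], bulk_results_args["onlyMatches"])
```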

### Get Bulk Single Result

**Slug:** `DIFFBOT_GET_BULK_SINGLE_RESULT`

Tool to download the result of a single job within a Diffbot bulk enhance job. Returns enriched entity data for a specific input record by its index. Use after a bulk enhance job has completed to retrieve individual results without downloading the entire dataset.

#### Input Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `job_idx` | integer | Yes | Job index within the bulkjob (0-based index of the input record) |
| `bulkjob_id` | string | Yes | Enhance Bulkjob ID (e.g., 'B-1ff60452-8421') |

#### Output

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `data` | string | Yes | Data from the action execution |
| `error` | string | No | Error if any occurred during the execution of the action |
| `successful` | boolean | Yes | Whether the action execution was successful |

### Get Crawl Data

**Slug:** `DIFFBOT_GET_CRAWL_DATA`

Download extracted results from a completed crawl job. Returns all structured data extracted during crawl processing (articles, products, etc.). Use after a crawl job has completed to retrieve the collected data.

#### Input Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `num` | integer | No | Maximum number of results to return. Omit to return all available results. |
| `name` | string | Yes | Name of the crawl job to retrieve data from. Must be an existing crawl job created via Start Crawl. |
| `format` | string ("json" \| "csv") | No | Output format for crawl data. |

#### Output

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `data` | string | Yes | Data from the action execution |
| `error` | string | No | Error if any occurred during the execution of the action |
| `successful` | boolean | Yes | Whether the action execution was successful |

### Get Discussion Thread

**Slug:** `DIFFBOT_GET_DISCUSSION`

Extract structured discussion threads from web pages including forums, comment sections, product reviews, Reddit discussions, and blog comments. Returns posts with author info, timestamps, content, and hierarchical relationships. Useful for analyzing conversations, gathering feedback, or monitoring discussions. Supported platforms: Native comment systems, Disqus, Facebook Comments, Reddit, forum software, and more.

Use this when you need to:

- Extract all comments/posts from a discussion thread
- Analyze user feedback or reviews
- Monitor forum discussions or social media threads
- Gather structured conversation data with metadata

#### Input Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `url` | string | Yes | The URL of the discussion page to process (e.g., forum thread, Reddit discussion, product review page, or comment section). |
| `fields` | string | No | Comma-separated list of additional fields to include in the response (e.g., 'sentiment', 'links', 'meta', 'breadcrumb'). |
| `timeout` | integer | No | Maximum time in milliseconds to wait for the response. Default is 30000 (30 seconds). |
| `maxPages` | string | No | Maximum number of pages to concatenate. Default is 1 (no concatenation). Set to 'all' to retrieve all pages in the thread. Each page counts as a separate API call. |
| `norender` | boolean | No | Whether to disable full page rendering. Set to `true` for faster responses with potentially lower extraction quality. Default is `false`. |
| `discussion` | boolean | No | Whether to extract comments/reviews. Set to `false` to disable comment extraction for faster response times. Default is `true`. |

#### Output

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `data` | string | Yes | Data from the action execution |
| `error` | string | No | Error if any occurred during the execution of the action |
| `successful` | boolean | Yes | Whether the action execution was successful |

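The sketch below shows one way to build the arguments for `DIFFBOT_GET_DISCUSSION`, paying attention to the two details that are easy to miss in the table: `maxPages` is typed as a string (so `3` becomes `"3"`, and `"all"` is accepted), and `fields` is a single comma-separated string. The helper name `discussion_args` and the example URL are hypothetical.

```python
from typing import Any, Iterable, Union


def discussion_args(url: str, fields: Iterable[str] = (),
                    max_pages: Union[int, str] = 1,
                    timeout_ms: int = 30000) -> dict[str, Any]:
    """Build input arguments for DIFFBOT_GET_DISCUSSION (hypothetical helper)."""
    if isinstance(max_pages, int):
        if max_pages < 1:
            raise ValueError("max_pages must be at least 1")
    elif max_pages != "all":
        raise ValueError("max_pages must be a positive integer or 'all'")
    args: dict[str, Any] = {
        "url": url,
        "maxPages": str(max_pages),  # the tool types this parameter as a string
        "timeout": timeout_ms,       # milliseconds
    }
    field_list = list(fields)
    if field_list:
        args["fields"] = ",".join(field_list)  # comma-separated string
    return args


print(discussion_args("https://example.com/thread/1",
                      fields=["sentiment", "links"], max_pages="all"))
```

Because each concatenated page counts as a separate API call, passing an explicit integer instead of `"all"` keeps usage predictable on long threads.
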
### Diffbot Get Event

**Slug:** `DIFFBOT_GET_EVENT`

Tool to extract event details from web pages. Use when you need structured event data such as venue, date, and description.

#### Input Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `url` | string | Yes | URL of the event page to analyze |
| `fields` | string | No | Comma-separated list of fields to return, e.g., title,date,location |
| `paging` | boolean | No | Enable automatic paging of results |
| `timeout` | integer | No | Maximum timeout in milliseconds for the API call |
| `callback` | string | No | JSONP callback function name, if JSONP output is needed |

#### Output

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `data` | string | Yes | Data from the action execution |
| `error` | string | No | Error if any occurred during the execution of the action |
| `successful` | boolean | Yes | Whether the action execution was successful |

### Diffbot Get Image

**Slug:** `DIFFBOT_GET_IMAGE`

Tool to extract detailed information about images, including dimensions and recognition data. Use after confirming the image URL is publicly accessible.

#### Input Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `url` | string | Yes | Publicly-accessible URL of the image to analyze |
| `fields` | string | No | Comma-separated list or array of specific fields to include in response, e.g., 'naturalWidth','captions' |
| `paging` | boolean | No | Whether to include paging information for multi-image responses |
| `timeout` | integer | No | Maximum time to wait for API response, in milliseconds |

#### Output

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `data` | string | Yes | Data from the action execution |
| `error` | string | No | Error if any occurred during the execution of the action |
| `successful` | boolean | Yes | Whether the action execution was successful |

### Get KG Coverage Report by ID

**Slug:** `DIFFBOT_GET_KG_COVERAGE_REPORT_BY_ID`

Download Knowledge Graph coverage report by report ID. Returns detailed CSV coverage statistics showing field presence across query results. Use this after generating a coverage report from a DQL query to retrieve the statistical breakdown of field coverage.

#### Input Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `id` | string | Yes | Report ID to retrieve. Format: `C-<hash>` (e.g., 'C-a0b39dad-68bf'). Reports may expire quickly, so use immediately after generation. |

#### Output

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `data` | string | Yes | Data from the action execution |
| `error` | string | No | Error if any occurred during the execution of the action |
| `successful` | boolean | Yes | Whether the action execution was successful |

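Since the report is described as CSV, the `data` string can probably be read with Python's standard `csv` module. The sketch below assumes `data` holds the CSV text directly; the helper name `parse_coverage_csv` is hypothetical, and the sample columns are made up for illustration and may not match a real report.

```python
import csv
import io


def parse_coverage_csv(data: str) -> list[dict[str, str]]:
    """Parse CSV text into a list of row dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(data)))


# Made-up sample; the columns of a real coverage report may differ.
sample = "field,coverage\nname,1.0\nhomepageUri,0.82\n"
rows = parse_coverage_csv(sample)
for row in rows:
    print(row["field"], row["coverage"])
```
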
### Diffbot Get Product

**Slug:** `DIFFBOT_GET_PRODUCT`

Tool to extract structured product data from a product page, including specifications, prices, availability, and reviews.

#### Input Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `url` | string | Yes | URL of the product page to analyze |
| `mode` | string | No | Extraction mode override (defaults to 'product') |
| `fields` | array | No | List of fields to return, e.g., `["title", "offerPrice", "images"]` |
| `paging` | boolean | No | Enable automatic paging of results |
| `timeout` | integer | No | Maximum timeout in milliseconds for the API call |
| `callback` | string | No | JSONP callback function name, if JSONP output is needed |
| `discussion` | boolean | No | Include discussions/comments in the response |

#### Output

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `data` | string | Yes | Data from the action execution |
| `error` | string | No | Error if any occurred during the execution of the action |
| `successful` | boolean | Yes | Whether the action execution was successful |

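The extraction tools do not all type `fields` the same way: the tables list it as an `array` for `DIFFBOT_GET_PRODUCT` and as a `string` for the discussion, event, image, and video tools. A small helper along these lines could normalize a Python list into whichever shape a given tool expects. The helper name `format_fields` is hypothetical; the mapping is read off the Input Parameters tables.

```python
from typing import Sequence, Union

# How each tool types its `fields` parameter, per the Input Parameters tables.
FIELDS_KIND = {
    "DIFFBOT_GET_DISCUSSION": "string",
    "DIFFBOT_GET_EVENT": "string",
    "DIFFBOT_GET_IMAGE": "string",
    "DIFFBOT_GET_PRODUCT": "array",
    "DIFFBOT_GET_VIDEO": "string",
}


def format_fields(slug: str, fields: Sequence[str]) -> Union[str, list[str]]:
    """Return `fields` as a list or a comma-separated string, depending on the tool."""
    kind = FIELDS_KIND[slug]  # raises KeyError for tools not listed above
    return list(fields) if kind == "array" else ",".join(fields)


print(format_fields("DIFFBOT_GET_PRODUCT", ["title", "offerPrice", "images"]))
print(format_fields("DIFFBOT_GET_EVENT", ["title", "date", "location"]))
```
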
### Get Video Data

**Slug:** `DIFFBOT_GET_VIDEO`

Tool to extract information from videos, including titles, descriptions, and embedded HTML. Use when you need structured video metadata from any web page.

#### Input Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `url` | string | Yes | Full URL of the web page to analyze for embedded videos, must start with http or https |
| `mode` | string | No | Extraction mode override (e.g., 'auto') |
| `fields` | string | No | Comma-separated list or array of optional fields to include in the response (e.g., 'links', 'meta', 'querystring', 'breadcrumb'). Standard fields are always returned. |
| `paging` | boolean | No | Whether to return all detected results in one call (may increase runtime) |
| `timeout` | integer | No | Maximum time in milliseconds to wait for extraction |
| `callback` | string | No | Name of the JSONP callback function (if using JSONP) |
| `fallback` | boolean | No | Whether to try an alternate extraction method if the primary fails |
| `discussion` | boolean | No | Include user discussion data (comments) if available |

#### Output

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `data` | string | Yes | Data from the action execution |
| `error` | string | No | Error if any occurred during the execution of the action |
| `successful` | boolean | Yes | Whether the action execution was successful |

### List Bulk Jobs

**Slug:** `DIFFBOT_LIST_BULK_JOBS`

Tool to list all Bulk jobs associated with a specific token. Use after authenticating to retrieve statuses of all jobs for the account.

#### Output

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `data` | string | Yes | Data from the action execution |
| `error` | string | No | Error if any occurred during the execution of the action |
| `successful` | boolean | Yes | Whether the action execution was successful |

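Every tool in this reference returns the same three output fields, so a single unwrapping step can serve all of them. The sketch below is one possible approach: it raises when `successful` is false and otherwise tries to decode `data` as JSON, falling back to the raw string. The helper name `unwrap` is hypothetical, and the assumption that `data` may carry JSON text is not confirmed by the tables, which only type it as a string. The sample payloads are placeholders.

```python
import json
from typing import Any


def unwrap(result: dict[str, Any]) -> Any:
    """Return the payload of a tool result, raising if the call failed."""
    if not result.get("successful"):
        raise RuntimeError(result.get("error") or "tool call failed")
    data = result.get("data")
    if isinstance(data, str):
        try:
            return json.loads(data)  # assumption: data may be JSON text
        except json.JSONDecodeError:
            return data  # keep the raw string (for example CSV output)
    return data


# Placeholder result; the real shape of `data` depends on the tool.
print(unwrap({"successful": True, "data": '{"jobs": []}'}))
```
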
### List Bulk Jobs Status For Token

**Slug:** `DIFFBOT_LIST_BULK_JOBS_STATUS_FOR_TOKEN`

Tool to get the status of all bulk enhance jobs for a token. Returns list of all bulk jobs associated with your API token. Use when you need to monitor or retrieve the status of multiple bulk jobs at once.

#### Output

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `data` | string | Yes | Data from the action execution |
| `error` | string | No | Error if any occurred during the execution of the action |
| `successful` | boolean | Yes | Whether the action execution was successful |

### List Custom APIs

**Slug:** `DIFFBOT_LIST_CUSTOM_APIS`

Tool to retrieve all Custom APIs and their extraction rules currently defined on your Diffbot token. Use when you need to list, review, or audit custom API configurations for your account.

#### Output

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `data` | string | Yes | Data from the action execution |
| `error` | string | No | Error if any occurred during the execution of the action |
| `successful` | boolean | Yes | Whether the action execution was successful |

### Manage Crawl Job

**Slug:** `DIFFBOT_MANAGE_CRAWL`

Manages Diffbot crawl jobs: pause, restart, delete, or view status. Returns list of all active crawl jobs when called without parameters. Use 'name' parameter with action flags (pause=1, restart=1, delete=1) to control specific jobs.

#### Input Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `name` | string | No | Unique identifier of the crawl job to manage. Omit to list all active crawl jobs. |
| `pause` | integer | No | Set to 1 to pause the specified crawl job. Requires 'name' parameter. |
| `delete` | integer | No | Set to 1 to delete the specified crawl job. Requires 'name' parameter. |
| `restart` | integer | No | Set to 1 to restart the specified crawl job. Requires 'name' parameter. |

#### Output

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `data` | string | Yes | Data from the action execution |
| `error` | string | No | Error if any occurred during the execution of the action |
| `successful` | boolean | Yes | Whether the action execution was successful |

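The action flags are separate integer parameters, and each one requires `name`. The sketch below wraps them in a single hypothetical helper, `manage_crawl_args`, which allows at most one action per call. That one-action restriction is a design choice of the sketch rather than something the table states.

```python
from typing import Any, Optional

ACTIONS = ("pause", "restart", "delete")  # flag names from the table above


def manage_crawl_args(name: Optional[str] = None,
                      action: Optional[str] = None) -> dict[str, Any]:
    """Build input arguments for DIFFBOT_MANAGE_CRAWL (hypothetical helper)."""
    if action is None:
        # No action: list all active jobs, or address a single job by name.
        return {"name": name} if name else {}
    if action not in ACTIONS:
        raise ValueError(f"action must be one of {ACTIONS}")
    if not name:
        raise ValueError(f"'{action}' requires the 'name' parameter")
    return {"name": name, action: 1}


print(manage_crawl_args())                    # list all active crawl jobs
print(manage_crawl_args("my-crawl", "pause"))  # pause one job
```
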
### Resolve Lost ID

**Slug:** `DIFFBOT_RESOLVE_LOST_ID`

Tool to resolve lost IDs in the Knowledge Graph. Use when you need to map a lost identifier to its canonical counterpart for data consistency.

#### Input Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `type` | string | No | The type of object (e.g., 'article', 'product'). If omitted, Diffbot will attempt to infer. |
| `lostId` | string | Yes | The lost ID which needs to be resolved to a canonical ID. |

#### Output

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `data` | string | Yes | Data from the action execution |
| `error` | string | No | Error if any occurred during the execution of the action |
| `successful` | boolean | Yes | Whether the action execution was successful |

### Diffbot Knowledge Graph Search

**Slug:** `DIFFBOT_SEARCH`

Search the Diffbot Knowledge Graph using DQL (Diffbot Query Language). Query billions of entities including organizations, people, articles, products, and more. Use structured queries to filter by type, fields, and relationships.

#### Input Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `col` | string | No | Comma-separated list of custom crawl collections to search (default="all"). Only used when query_type="crawl". |
| `size` | integer | No | Maximum number of results to return (default=50). Use -1 to return all results. Constraint: from+size ≤ 10,000 for facet queries. |
| `query` | string | Yes | DQL (Diffbot Query Language) query to search the Knowledge Graph. Use structured syntax to filter entities and documents. |
| `filter` | string | No | Semi-colon separated path filter to include specific fields in response JSON using dot notation or JsonPath. |
| `format` | string | No | Output format. Only "json" is supported by this action. Non-JSON formats (jsonl, csv, xls, xlsx) cannot be processed and will be ignored. |
| `offset` | integer | No | Starting index for pagination (API param 'from'; default=0). Constraint: from+size ≤ 10,000 for facet queries. |
| `jsonmode` | string | No | JSON mode: "extended" (includes origin info) or "id" (returns diffbotIds only). |
| `query_type` | string | No | Type of query: "query" (default, structured DQL), "text" (free-text search), "queryTextFallback" (tries DQL first, falls back to text), or "crawl" (search custom crawl collections). |
| `filter_exclude` | string | No | Semi-colon separated path filter to exclude specific fields from response JSON. |
| `non_canonical_facts` | boolean | No | Include non-canonical facts in results (default=false). |

#### Output

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `data` | string | Yes | Data from the action execution |
| `error` | string | No | Error if any occurred during the execution of the action |
| `successful` | boolean | Yes | Whether the action execution was successful |

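Pagination uses `offset` and `size`, and the table quotes a from+size ≤ 10,000 limit for facet queries. The generator below is a sketch that yields one argument dictionary per page and never lets `offset + size` exceed that window. Applying the cap to every query type is a conservative assumption of this example, and the name `search_pages` is hypothetical.

```python
from typing import Any, Iterator

MAX_WINDOW = 10_000  # from+size limit quoted in the table for facet queries


def search_pages(query: str, page_size: int = 50,
                 limit: int = MAX_WINDOW) -> Iterator[dict[str, Any]]:
    """Yield DIFFBOT_SEARCH arguments page by page, keeping offset+size <= limit."""
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    offset = 0
    while offset < limit:
        size = min(page_size, limit - offset)  # shrink the last page to fit
        yield {"query": query, "offset": offset, "size": size}
        offset += size


for args in search_pages("type:Article", page_size=4000):
    print(args["offset"], args["size"])
```

A caller would normally stop iterating as soon as a page comes back with fewer results than requested.
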
### Search Crawl Job Data

**Slug:** `DIFFBOT_SEARCH_CRAWL_DATA`

Tool to query crawl job collections using DQL (Diffbot Query Language). Use when you need to search extracted data from completed crawl or bulk jobs by collection name.

#### Input Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `col` | string | Yes | Name of the collection (Crawl or Bulk job name) to search |
| `num` | integer | No | Number of results to return per page (default varies by API) |
| `query` | string | Yes | Search query string using Diffbot Query Language (DQL). Supports operators like type:Article, sortby:date, etc. |
| `start` | integer | No | Pagination offset - number of results to skip (default=0) |

#### Output

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `data` | string | Yes | Data from the action execution |
| `error` | string | No | Error if any occurred during the execution of the action |
| `successful` | boolean | Yes | Whether the action execution was successful |

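A query for this tool might be composed from individual DQL terms such as the `type:Article` and `sortby:date` operators mentioned in the table. The sketch below assumes that terms are combined by separating them with spaces, which should be checked against the DQL documentation. The helper `crawl_search_args` and the collection name are hypothetical.

```python
from typing import Any, Optional, Sequence


def crawl_search_args(col: str, terms: Sequence[str], start: int = 0,
                      num: Optional[int] = None) -> dict[str, Any]:
    """Build input arguments for DIFFBOT_SEARCH_CRAWL_DATA (hypothetical helper)."""
    if not col:
        raise ValueError("col (the crawl or bulk job name) is required")
    if not terms:
        raise ValueError("at least one query term is required")
    args: dict[str, Any] = {
        "col": col,
        "query": " ".join(terms),  # assumption: terms are space-separated
        "start": start,            # number of results to skip
    }
    if num is not None:
        args["num"] = num  # results per page; default is left to the API
    return args


print(crawl_search_args("my-crawl", ["type:Article", "sortby:date"], num=20))
```
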
### Start Bulk Job

**Slug:** `DIFFBOT_START_BULK`

Tool to start a Bulk Extract job. Use when processing large numbers of URLs asynchronously. The Diffbot Bulk API uses GET requests with query parameters to create jobs.

#### Input Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `name` | string | Yes | Unique job name for identification. Required by the Diffbot API. |
| `urls` | array | Yes | List of page URLs to process. URLs should be separated by whitespace in the API call. |
| `apiUrl` | string | Yes | Diffbot Extract API endpoint to use (e.g., 'https://api.diffbot.com/v3/article'). Do NOT include the token - it will be added automatically. |
| `maxToCrawl` | integer | No | Maximum number of URLs to crawl. |
| `notifyEmail` | string | No | Email to notify when job completes. |
| `maxToProcess` | integer | No | Maximum number of URLs to process. |
| `notifyWebhook` | string | No | Webhook URL to POST on job completion. |

#### Output

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `data` | string | Yes | Data from the action execution |
| `error` | string | No | Error if any occurred during the execution of the action |
| `successful` | boolean | Yes | Whether the action execution was successful |

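Because the token is added automatically, one useful client-side check is to reject an `apiUrl` that already carries a `token` query parameter. The sketch below does that with the standard library; the helper name `bulk_job_args`, the job name, and the page URLs are hypothetical.

```python
from typing import Any
from urllib.parse import parse_qs, urlsplit


def bulk_job_args(name: str, urls: list[str], api_url: str) -> dict[str, Any]:
    """Build the required input arguments for DIFFBOT_START_BULK (hypothetical helper)."""
    if not name:
        raise ValueError("name is required")
    if not urls:
        raise ValueError("urls must contain at least one page URL")
    # The token is injected by the tool, so it must not appear in apiUrl.
    if "token" in parse_qs(urlsplit(api_url).query):
        raise ValueError("apiUrl must not include the token")
    return {"name": name, "urls": list(urls), "apiUrl": api_url}


print(bulk_job_args(
    "my-bulk-job",
    ["https://example.com/a", "https://example.com/b"],
    "https://api.diffbot.com/v3/article",
))
```
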
### Start Crawl Job

**Slug:** `DIFFBOT_START_CRAWL`

Initiates a Diffbot crawl job that spiders a website starting from seed URLs and processes discovered pages with a specified Extract API. The crawler follows links within the domain, collects structured data (articles, products, etc.), and stores results for download. Use this to systematically extract data from entire websites or sections. Requires Diffbot Plus plan or higher.

#### Input Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `name` | string | Yes | Unique identifier for the crawl job. Used to manage and retrieve the crawl. |
| `seeds` | array | Yes | List of seed URLs where crawling will begin. URLs will be URL-encoded automatically. |
| `apiUrl` | string | Yes | Full Diffbot Extract API endpoint URL to process crawled pages. Examples: 'https://api.diffbot.com/v3/article' for articles, 'https://api.diffbot.com/v3/product' for products, 'https://api.diffbot.com/v3/analyze' for automatic type detection. |
| `repeat` | number | No | Number of days between automatic crawl repeats. Use 7.0 for weekly, 1.0 for daily. Omit for one-time crawl. |
| `crawlDelay` | number | No | Delay in seconds between requests to the same IP address. Default is 0.25 seconds. |
| `maxToCrawl` | integer | No | Maximum number of pages to crawl/spider. Default is 100,000. Use -1 for unlimited. |
| `obeyRobots` | integer | No | Whether to respect robots.txt directives. 1 = obey (default), 0 = ignore. |
| `notifyEmail` | string | No | Email address to notify when the crawl completes. |
| `maxToProcess` | integer | No | Maximum number of pages to process with the Extract API. Default is 100,000. Use -1 for unlimited. |
| `customHeaders` | object | No | Custom HTTP headers to include in crawl requests (e.g., {'User-Agent': 'MyBot/1.0'}). |
| `notifyWebhook` | string | No | Webhook URL to POST to when the crawl completes. |

#### Output

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `data` | string | Yes | Data from the action execution |
| `error` | string | No | Error if any occurred during the execution of the action |
| `successful` | boolean | Yes | Whether the action execution was successful |

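With this many optional parameters, a small dataclass can keep a crawl configuration readable. The sketch below covers a subset of the parameters and omits unset ones so the documented defaults apply. The class name `CrawlJob`, its `to_args` method, the job name, and the seed URL are all hypothetical; the field names mirror the table above.

```python
from dataclasses import asdict, dataclass
from typing import Any, Optional


@dataclass
class CrawlJob:
    """A subset of DIFFBOT_START_CRAWL inputs (hypothetical wrapper)."""
    name: str
    seeds: list[str]
    apiUrl: str
    repeat: Optional[float] = None      # days between repeats; None = one-time crawl
    crawlDelay: Optional[float] = None  # seconds between requests to the same IP
    maxToCrawl: Optional[int] = None    # pages to spider; -1 for unlimited
    maxToProcess: Optional[int] = None  # pages to extract; -1 for unlimited
    obeyRobots: Optional[int] = None    # 1 = obey robots.txt, 0 = ignore

    def to_args(self) -> dict[str, Any]:
        if not self.seeds:
            raise ValueError("at least one seed URL is required")
        if self.obeyRobots not in (None, 0, 1):
            raise ValueError("obeyRobots must be 0 or 1")
        # Drop unset options so the tool's defaults are used.
        return {k: v for k, v in asdict(self).items() if v is not None}


job = CrawlJob("docs-weekly", ["https://example.com/docs"],
               "https://api.diffbot.com/v3/analyze", repeat=7.0, maxToCrawl=500)
print(job.to_args())
```
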
### Stop Bulk Job

**Slug:** `DIFFBOT_STOP_BULK_JOB`

Tool to pause (stop) a running Bulk job. Pausing halts further processing of URLs while preserving existing progress. To resume, use the appropriate resume action. Specify the exact job name (case-sensitive) as provided when the job was created.

#### Input Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `name` | string | Yes | Name of the Bulk job to pause/stop (as defined when creating the job) |

#### Output

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `data` | string | Yes | Data from the action execution |
| `error` | string | No | Error if any occurred during the execution of the action |
| `successful` | boolean | Yes | Whether the action execution was successful |

### Stop KG Bulk Job By ID

**Slug:** `DIFFBOT_STOP_KG_BULK_JOB_BY_ID`

Tool to stop an active Knowledge Graph Enhance bulk job by its ID. Halts processing of a running KG bulk job immediately. Use when you need to stop a specific KG bulk job using its bulkjobId.

#### Input Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `bulkjobId` | string | Yes | The unique identifier of the Knowledge Graph Enhance bulk job to stop (e.g., 'B-1ff60452-8421') |

#### Output

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `data` | string | Yes | Data from the action execution |
| `error` | string | No | Error if any occurred during the execution of the action |
| `successful` | boolean | Yes | Whether the action execution was successful |
