Diffbot

Turns any website into a structured data source by extracting articles, products, and discussions into your database without manual scrapers.

Why use Ceven?

  1. AI-native Diffbot integration

    • Describe the outcome and Ceven picks the right Diffbot calls, fills the parameters, and checks the result.
    • Structured, agent-friendly tool schemas so each call runs reliably instead of by guesswork.
    • Rich coverage for reading, writing, and querying your Diffbot data, across all of its actions.
  2. Managed auth

    • Built-in OAuth with automatic token refresh and rotation.
    • One place to manage, scope, and revoke Diffbot access.
    • Per-user and per-environment credentials instead of shared keys.
  3. Agent-optimized design

    • Actions are tuned from real success and error rates so reliability climbs over time.
    • Full execution logs so you always know what ran in Diffbot, when, and on whose behalf.
    • The agent pauses and asks when a Diffbot request is ambiguous instead of plowing ahead.
  4. Enterprise-grade security

    • Fine-grained access so you control which agents and people can reach Diffbot.
    • Least privilege by default: read scopes first, then only the writes a workflow needs.
    • A full audit trail of every Diffbot action to support review and sign-off.

Supported tools

Every action Ceven's agents can run on Diffbot, and when to use it.

Diffbot Search
Use this to query data extracted by crawl or bulk jobs using DQL queries after extraction is complete.
Get Account Details
Pull account details including plan information and usage statistics to verify daily quota status.
Diffbot Analyze
Use this when you have a URL and need Diffbot to automatically determine the content type and route it to the right extractor.
Get Article Data
Extract structured metadata from a web article URL including authors, publication dates, and images.
Get Discussion Thread
Extract structured discussion data from forums, comment sections, and review pages after identifying the URL.
Get Event Data
Use this to pull structured event details such as venue, date, and description from a web page.
Get Image Data
Extract detailed information about images including dimensions and recognition data for publicly accessible URLs.
Get Product Data
Pull structured product information including specifications, prices, availability, and reviews from a page.
Get Video Data
Extract structured video metadata including titles, descriptions, and embedded HTML from any web page.
List Bulk Jobs
Pull a list of all bulk jobs associated with a token to check the status of account jobs.
Resolve Lost ID
Map a lost identifier to its canonical counterpart in the knowledge graph for data consistency.
Start Bulk Job
Use this to process large numbers of URLs asynchronously through a bulk extract job.
Start Crawl Job
Spider a site for links and process them into a single collection using seed URLs.
Stop Bulk Job
Halt further processing of URLs in a job in progress using the specific job ID.
Get Diffbot Account Details
Retrieve account details, including plan information and usage statistics. Use after authenticating to verify subscription and daily quota status.
Diffbot Get Event
Extract event details from web pages. Use when you need structured event data such as venue, date, and description.
Diffbot Get Image
Extract detailed information about images, including dimensions and recognition data. Use after confirming the image URL is publicly accessible.
Diffbot Get Product
Extract structured product data from a page. Use when you need specifications, prices, availability, and reviews.
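
Each extraction tool above takes a page URL and returns structured data for one content type. As a rough sketch of what such a call might look like underneath, the helper below builds a request URL. The base URL, the endpoint names, and the `token` and `url` query parameters are assumptions for illustration, and `build_extract_url` is a hypothetical helper, not part of Ceven or Diffbot; check the official Diffbot documentation before relying on any of it.

```python
from urllib.parse import urlencode

# Assumed endpoint layout for illustration; verify against the official docs.
BASE_URL = "https://api.diffbot.com/v3"
EXTRACTORS = {"analyze", "article", "discussion", "event", "image", "product", "video"}


def build_extract_url(extractor: str, page_url: str, token: str) -> str:
    """Build a GET URL for one extractor (hypothetical helper)."""
    if extractor not in EXTRACTORS:
        raise ValueError(f"unknown extractor: {extractor}")
    # urlencode percent-encodes the page URL so it survives as a query value.
    query = urlencode({"token": token, "url": page_url})
    return f"{BASE_URL}/{extractor}?{query}"


if __name__ == "__main__":
    print(build_extract_url("article", "https://example.com/a", "TOKEN"))
```

Rejecting unknown extractor names up front keeps a typo from turning into a confusing remote error.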

18 actions

Frequently asked questions

How does Diffbot keep working when a website changes its design?
Diffbot does not rely on fixed HTML selectors or CSS paths, so it does not break when a website updates its design. It uses computer vision and machine learning to recognize the visual structure of a page, identifying what a headline, a price, or a product image looks like regardless of the underlying code. When Ceven calls the Diffbot API, the AI analyzes the page in real time to find the relevant data points. Your workflows therefore remain stable even if the target site undergoes a complete redesign, because the AI works from the semantic meaning of the content rather than the position of a particular div tag.
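
One practical consequence is that downstream code reads named fields from the result instead of walking the page's markup. The sketch below shows that idea on a hand-written payload. The `objects`, `type`, and `title` field names are assumptions for illustration, the payload is not a captured response, and `titles_by_type` is a hypothetical helper.

```python
import json

# Illustrative payload only; field names are assumptions, not a real response.
SAMPLE_RESPONSE = json.loads("""
{
  "objects": [
    {"type": "article", "title": "Example headline", "author": "A. Writer"},
    {"type": "product", "title": "Example widget", "offerPrice": "$19.99"}
  ]
}
""")


def titles_by_type(response: dict) -> dict:
    """Group extracted titles by the content type reported for each object."""
    grouped = {}
    for obj in response.get("objects", []):
        content_type = obj.get("type", "unknown")
        grouped.setdefault(content_type, []).append(obj.get("title"))
    return grouped


if __name__ == "__main__":
    print(titles_by_type(SAMPLE_RESPONSE))
```

Nothing in this code refers to tags, classes, or element positions, which is why a redesign of the source page would not require changing it.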
Can Diffbot extract data from any website?
Diffbot can access most public web pages, but it is subject to the robots.txt files and terms of service of the target websites. Some sites employ aggressive anti-bot protections or CAPTCHAs that can block automated extraction. While Diffbot uses advanced techniques to mimic human browsing, some highly protected sites may still return errors. If a crawl job fails for a specific domain, these server-side restrictions are the usual cause. Make sure you have the legal right to scrape the data you are targeting, and be aware that some sites specifically forbid automated access in their legal terms.
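
Before queuing a URL, a workflow can check the site's robots.txt rules itself. The sketch below uses Python's standard `urllib.robotparser` on robots.txt text that has already been fetched. The sample rules, the `ExampleBot` user agent, and the `is_allowed` helper are made up for illustration, and this does not describe how Diffbot evaluates robots.txt internally.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for candidate in ("https://example.com/blog/post", "https://example.com/private/data"):
        print(candidate, is_allowed(SAMPLE_ROBOTS_TXT, "ExampleBot", candidate))
```

A robots.txt check covers only one of the restrictions mentioned above; terms of service and anti-bot defenses still need separate attention.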
What is the difference between a Bulk Job and a Crawl Job?
Use a Bulk Job when you already have a predefined list of URLs to process asynchronously. You provide the list, and Diffbot processes each URL independently. A Crawl Job is broader: it starts from a few seed URLs and then spiders the site, following links to discover and extract other relevant pages automatically. Crawl jobs are ideal for mapping out an entire website or finding all product pages in a category. Because crawl jobs consume more resources and can put more load on target servers, they are typically governed by different rate limits and plan requirements than simple bulk extraction tasks.
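
The difference shows up in the inputs each job needs: a bulk job takes the full URL list, while a crawl job takes seeds plus a limit on how far to spider. The sketch below builds the two parameter sets side by side. The parameter names (`urls`, `seeds`, `apiUrl`, `maxToCrawl`) and the space-separated list format are assumptions to verify against the official Diffbot documentation, and both helper functions are hypothetical.

```python
def bulk_job_params(token: str, name: str, urls: list, api_url: str) -> dict:
    """Parameters for a bulk job: every URL to process is listed up front."""
    return {
        "token": token,
        "name": name,
        "urls": " ".join(urls),  # assumed format: space-separated list
        "apiUrl": api_url,       # extractor applied to each URL
    }


def crawl_job_params(token: str, name: str, seeds: list, api_url: str,
                     max_to_crawl: int) -> dict:
    """Parameters for a crawl job: only seeds are given, links are discovered."""
    if max_to_crawl <= 0:
        raise ValueError("max_to_crawl must be positive")
    return {
        "token": token,
        "name": name,
        "seeds": " ".join(seeds),    # assumed format: space-separated list
        "apiUrl": api_url,
        "maxToCrawl": max_to_crawl,  # cap on pages spidered, protects your quota
    }


if __name__ == "__main__":
    extractor = "https://api.diffbot.com/v3/product"  # assumed endpoint
    print(bulk_job_params("TOKEN", "known-pages",
                          ["https://example.com/p/1", "https://example.com/p/2"], extractor))
    print(crawl_job_params("TOKEN", "whole-site", ["https://example.com/"], extractor, 500))
```

Requiring an explicit crawl cap in the helper is a design choice: an unbounded crawl is the easiest way to spend far more than intended.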
Do Crawl Jobs require a specific Diffbot plan?
Yes, Crawl Jobs are typically gated behind specific subscription tiers such as the Plus plan. Basic accounts may have access to the individual extraction APIs for articles or products but cannot trigger the autonomous spidering functionality. When you attempt to start a crawl job via Ceven, the agent checks your account details to confirm the feature is enabled. A permission error usually means your current Diffbot tier does not support site spidering. To verify your plan and available features, use the Get Account Details action within your workflow.
How does the Knowledge Graph help keep my data consistent?
The Knowledge Graph allows Diffbot to link entities across different pages and sites. For example, if it finds a company mentioned in a news article and the same company on a LinkedIn page, it recognizes them as the same entity. Ceven builds on this by using the Resolve Lost ID tool to clean up your data. When the agent pulls data from multiple sources, it can map various identifiers to a single canonical record. This keeps your CRM or database free of duplicate entries for the same company or person, turning the messy web into structured, relational data that your business tools can actually use.
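
The deduplication step can be pictured as a merge keyed on the canonical identifier. In the sketch below, the `canonical_ids` mapping stands in for whatever an ID resolution step returns. The record shape, the identifiers, and the `dedupe_by_canonical_id` helper are all hypothetical.

```python
def dedupe_by_canonical_id(records: list, canonical_ids: dict) -> list:
    """Merge records that resolve to the same canonical entity ID.

    canonical_ids maps a stale or alternate ID to its canonical ID;
    IDs missing from the mapping are treated as already canonical.
    """
    merged = {}
    for record in records:
        entity_id = canonical_ids.get(record["id"], record["id"])
        entry = merged.setdefault(entity_id, {"id": entity_id, "sources": []})
        entry["sources"].append(record["source"])
    return list(merged.values())


if __name__ == "__main__":
    records = [
        {"id": "old-1", "source": "news"},
        {"id": "acme", "source": "profile"},
        {"id": "other", "source": "blog"},
    ]
    print(dedupe_by_canonical_id(records, {"old-1": "acme"}))
```

Keeping the list of sources on the merged record preserves provenance, so a reviewer can still see where each duplicate came from.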
Can Diffbot extract data from pages behind a login or paywall?
No, Diffbot is designed to extract data from the public web. It cannot bypass login screens, paywalls, or authentication requirements to access private data. If a page requires a user session or a cookie to be viewed, Diffbot will only see the login page or a 403 Forbidden error. For workflows that need data from within a private account, use a different tool that supports browser session injection, or the official API provided by that platform. Diffbot's strength lies in structuring the massive amount of unstructured data available on the open internet.
How are Diffbot credits consumed?
Credits are consumed based on the type of API call made. A simple Analyze call uses fewer credits than a full Product or Article extraction. Bulk and Crawl jobs consume credits for every page successfully processed. Because Ceven can trigger these jobs automatically, it is important to monitor your usage. A crawl job set to follow too many links can exhaust your monthly quota quickly. To avoid service interruptions, use the Get Account Details action to build a monitoring workflow that alerts you via email or Slack when your remaining credit balance drops below a chosen threshold.
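
The threshold check at the heart of such a monitoring workflow is small. The sketch below assumes the account details have already been fetched into a dictionary. The `credit_limit` and `credits_used` field names are hypothetical placeholders, not the actual shape of the account response, so map them to whatever the Get Account Details action returns.

```python
def remaining_credits(account: dict) -> int:
    """Credits left in the current period, never negative."""
    return max(account["credit_limit"] - account["credits_used"], 0)


def needs_alert(account: dict, threshold: int) -> bool:
    """True when the remaining balance has dropped below the threshold."""
    return remaining_credits(account) < threshold


if __name__ == "__main__":
    # Hypothetical account snapshot for illustration.
    account = {"credit_limit": 10000, "credits_used": 9500}
    if needs_alert(account, threshold=1000):
        print(f"Low balance: {remaining_credits(account)} credits left")
```

Running the check before starting a crawl job, not only on a schedule, helps catch the case where a single large job would drain the balance.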
Does Diffbot support languages other than English?
Diffbot supports multiple languages through its natural language processing models. It can identify the language of a page and extract structured data accordingly. This allows Ceven to run global competitive intelligence workflows in which the agent extracts product pricing from a German site and a Japanese site and then normalizes that data into English for your report. Extraction quality is generally highest for major global languages, but the AI can identify common entities such as dates, prices, and names across most scripts. This makes it a powerful tool for companies in international markets that need a unified view of global web data.

Alternatives to Diffbot

Other tools that solve a similar problem. Ceven supports these too, so you can switch or run more than one at once.

Try Ceven on your stack

Layer Ceven on top of the tools you already run. Connect Diffbot and the rest of your stack, describe the outcome, and Ceven's agents handle the work end to end, finishing days of work in minutes.
