Question 1

How does Diffbot handle sites that change their layout frequently?

Accepted Answer

Diffbot does not rely on fixed HTML selectors or CSS paths, so it does not break when a website updates its design. It uses computer vision and machine learning to recognize the visual structure of a page, identifying what a headline, a price, or a product image looks like regardless of the underlying code. When Ceven calls the Diffbot API, the AI analyzes the page in real time to find the relevant data points. As a result, your workflows generally remain stable even if the target site undergoes a complete redesign, because the AI relies on the semantic meaning of the content rather than the specific position of a div tag.

Question 2

Are there any limitations on which sites Diffbot can crawl?

Accepted Answer

Diffbot can access most public web pages, but it is subject to the robots.txt files and the terms of service of the target websites. Some sites employ aggressive anti-bot protections or CAPTCHAs that can block automated extraction. While Diffbot uses advanced techniques to mimic human browsing, some highly protected sites may still return errors. If a crawl job fails for a specific domain, it is usually due to these server-side restrictions. Users should ensure they have the legal right to scrape the data they are targeting and be aware that some sites specifically forbid automated access in their legal terms.
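To make the robots.txt point concrete, here is a minimal sketch of a pre-check you could run on your own side before submitting a URL. It uses Python's standard `urllib.robotparser` module. The sample robots.txt content, the `my-crawler` user agent, and the `is_allowed` helper are assumptions made for illustration, and the sketch says nothing about how Diffbot itself evaluates these rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://example.com/products/1",
        "https://example.com/private/report",
    ):
        print(url, is_allowed(SAMPLE_ROBOTS_TXT, "my-crawler", url))
```

A check like this only covers robots.txt. It does not tell you whether a site's terms of service permit automated access.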

Question 3

What is the difference between a Bulk Job and a Crawl Job in Diffbot?

Accepted Answer

A Bulk Job is used when you already have a predefined list of URLs that you want to process asynchronously. You provide the list, and Diffbot processes each one independently. A Crawl Job is more expansive because it starts with a few seed URLs and then spiders the site, following links to discover and extract other relevant pages automatically. Crawl jobs are ideal for mapping out an entire website or finding all product pages in a category. Because crawl jobs consume more resources and can put more load on target servers, they are typically governed by different rate limits and plan requirements than simple bulk extraction tasks.
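The difference can be illustrated with a small, self-contained sketch. It is a conceptual model only, not Diffbot's implementation: the `SITE_LINKS` graph is a made-up stand-in for a website, and `bulk_urls` and `crawl_urls` are hypothetical helper names. The bulk style processes exactly the URLs it is given, while the crawl style does a breadth-first traversal from the seeds and discovers further pages through links.

```python
from collections import deque

# Hypothetical link graph standing in for a real site: page -> outgoing links.
SITE_LINKS = {
    "/": ["/shoes", "/about"],
    "/shoes": ["/shoes/1", "/shoes/2", "/"],
    "/shoes/1": ["/shoes"],
    "/shoes/2": ["/shoes"],
    "/about": [],
}


def bulk_urls(urls):
    """Bulk style: process exactly the URLs supplied, nothing more."""
    return list(urls)


def crawl_urls(seeds, links, max_pages=100):
    """Crawl style: breadth-first traversal from the seeds, following links."""
    unique_seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    seen = set(unique_seeds)
    queue = deque(unique_seeds)
    visited = []
    while queue and len(visited) < max_pages:
        page = queue.popleft()
        visited.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited


if __name__ == "__main__":
    print("bulk: ", bulk_urls(["/shoes/1", "/shoes/2"]))
    print("crawl:", crawl_urls(["/"], SITE_LINKS))
```

The `max_pages` cap in the sketch mirrors why crawl jobs need limits: a single seed can fan out to far more pages than you listed explicitly.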

Question 4

Does Diffbot require a specific plan for Crawl Jobs?

Accepted Answer

Yes, Crawl Jobs are typically gated behind specific subscription tiers such as the Plus plan. Basic accounts may have access to the individual extraction APIs for articles or products but cannot trigger the autonomous spidering functionality. When you attempt to start a crawl job via Ceven, the agent will check your account details to ensure the feature is enabled. If you receive a permission error, it usually means your current Diffbot tier does not support site spidering. You can verify your current plan and available features by using the Get Account Details action within your workflow to see your active subscription level.

Question 5

How does the Knowledge Graph feature work with Ceven?

Accepted Answer

The Knowledge Graph allows Diffbot to link entities across different pages and sites. For example, if it finds a company mentioned in a news article and the same company on a LinkedIn page, it recognizes them as the same entity. Ceven leverages this by using the Resolve Lost ID tool to clean up your data. When the agent pulls data from multiple sources, it can map various identifiers to a single canonical record. This prevents your CRM or database from being cluttered with duplicate entries for the same company or person, effectively turning the messy web into a structured relational database that your business tools can actually use.
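The idea of mapping several identifiers to one canonical record can be sketched as follows. This is an illustrative model only: the record shapes, the identifiers, the `alias_to_canonical` lookup table, and the `merge_records` helper are all assumptions, and the sketch does not describe how the Knowledge Graph or the Ceven tool resolves entities internally.

```python
def merge_records(records, alias_to_canonical):
    """Group records by canonical ID and merge the fields of duplicates."""
    merged = {}
    for record in records:
        source_id = record["id"]
        # Fall back to the source ID when no canonical mapping is known.
        canonical = alias_to_canonical.get(source_id, source_id)
        entry = merged.setdefault(canonical, {"id": canonical, "sources": []})
        entry["sources"].append(source_id)
        for key, value in record.items():
            if key != "id":
                entry.setdefault(key, value)  # the first value seen for a field wins
    return list(merged.values())


if __name__ == "__main__":
    # Hypothetical records pulled from two different sources.
    records = [
        {"id": "news:acme", "name": "Acme Corp"},
        {"id": "linkedin:acme-corp", "name": "Acme Corporation", "employees": 250},
        {"id": "news:globex", "name": "Globex"},
    ]
    # Hypothetical mapping from source identifiers to one canonical entity ID.
    aliases = {"news:acme": "E1", "linkedin:acme-corp": "E1"}
    for row in merge_records(records, aliases):
        print(row)
```

Here the two Acme records collapse into one entry that keeps track of both source identifiers, which is the deduplication effect described above.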

Question 6

Can I use Diffbot to extract data from password-protected pages?

Accepted Answer

No, Diffbot is designed to extract data from the public web. It cannot bypass login screens, paywalls, or authentication requirements to access private data. If a page requires a user session or a cookie to be viewed, Diffbot will only see the login page or a 403 Forbidden error. For workflows that require data from within a private account, you would need to use a different tool that supports browser session injection or official API access provided by that specific platform. Diffbot's strength lies in its ability to structure the massive amount of unstructured data available on the open internet.

Question 7

How are Diffbot API credits consumed during a workflow?

Accepted Answer

Credits are consumed based on the type of API call made. A simple Analyze call uses fewer credits than a full Product or Article extraction. Bulk and Crawl jobs consume credits for every single page successfully processed. Because Ceven can trigger these jobs automatically, it is important to monitor your usage. If a crawl job is set to follow too many links, it can exhaust your monthly quota quickly. You can use the Get Account Details action to build a monitoring workflow that alerts you via email or Slack when your remaining credit balance drops below a certain threshold to avoid service interruptions.
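The threshold check at the heart of such a monitoring workflow might look like the sketch below. The `credit_alert` function, its parameters, and the 20% default threshold are assumptions for illustration. How you obtain the remaining balance and quota (for example, from the Get Account Details action) and how you deliver the message are left out, since the source does not specify those response fields.

```python
def credit_alert(remaining: int, monthly_quota: int, threshold: float = 0.2):
    """Return an alert message when remaining credits drop below
    threshold * monthly_quota; otherwise return None."""
    if monthly_quota <= 0:
        raise ValueError("monthly_quota must be positive")
    if remaining < monthly_quota * threshold:
        pct = 100 * remaining / monthly_quota
        return (
            f"Diffbot credits low: {remaining} of {monthly_quota} "
            f"left ({pct:.0f}%)"
        )
    return None


if __name__ == "__main__":
    # Hypothetical balances; real values would come from your account details.
    for remaining in (5000, 1500):
        message = credit_alert(remaining, monthly_quota=10000)
        print(message or "Credit balance OK")
```

Returning `None` when the balance is healthy keeps the alerting step simple: the workflow only sends an email or Slack message when the function returns text.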

Question 8

How does Diffbot handle different languages on the web?

Accepted Answer

Diffbot supports multiple languages through its natural language processing models. It can identify the language of a page and extract structured data accordingly. This allows Ceven to run global competitive intelligence workflows in which the agent extracts product pricing from a German site and a Japanese site and then normalizes that data into English for your report. The extraction quality is generally highest for major global languages, but the AI is capable of identifying common entities like dates, prices, and names across most scripts. This makes it a powerful tool for companies that operate in international markets and need a unified view of global web data.
