• MarTech Today
  • Sections
    • Ads
    • Marketing
    • Content
    • Sales
    • Analytics
    • Management
    • Resources
    • More
    • Home
  • MarTech Today
  • Ads
  • Marketing
  • Content
  • Sales
  • Analytics
  • Mgmt
  • Resources
  • More
  • Events
  • SUBSCRIBE

MarTech Today

MarTech Today
  • Ads
  • Marketing
  • Content
  • Sales
  • Analytics
  • Management
  • Resources
  • More
  • Events
  • Newsletters
  • Home
Martech: Analytics & Data

Data Extractor Diffbot Wants To Turn The Web Into The Semantic Web

Startup nabs $10 million to launch its massive structured database that “knows” about products, news, profiles, comments and other web content.

Barry Levine on February 11, 2016 at 11:00 am
  • More
Diffbot

From the Diffbot website

Many companies grab web content, analyze it and return stats on sentiment, product mentions and the like.

But startup Diffbot says it takes a different approach that can automatically sort the web into human-like categories of knowledge.

And today the Palo Alto, California-based company is announcing a new Series A round of $10 million in investment funding, which will back the expansion within a few weeks of its newest tool — a massive structured database called the Global Index — from a closed beta phase to general availability.

Founded in 2008, the company specializes in automatically extracting the unstructured content on web pages, categorizing it using artificial intelligence, computer vision and natural language processing, and then storing it by data type in a structured database.

It may be obvious to a human that, for example, an image on a retailer’s page is of a pair of shoes, this number on the page is the price and this abbreviation is the color. But unless the page has been marked up in XML or other semantic marking to identify which info is a color, a crawler and the processing engine won’t be able to store “BR” as the color for this pair of sneakers or “$100” as its price.

Semantic Web Content

Diffbot grabs the info at the URL, renders the page inside its system and employs computer vision to visually analyze the page’s structure.

Essentially, Diffbot is creating semantic Web content — that is, information that is characterized by its meaning — even though the page hasn’t been formatted that way. It can automatically detect product, article, image, video, author, date, discussion threads, pricing info, product IDs like SKU, brand, video thumbnail and other categories.

VP of product John Davi  told me it can also scan images and find, for instance, all photos of Barack Obama wearing a blue tie.

Each page element — headline, photo, SKU and so on — is stored separately and made available for searching. Here, for instance, is a Diffbot-generated breakdown of a story I posted yesterday:

Diffbot

Diffbot has been providing what Davi called a “web reading robot” in support of specific applications. Instapaper, for instance, utilizes Diffbot to capture articles, identify and store its elements (title, story, images and so on), and then make them available for offline reading later.

Similarly, Cisco has used its service to monitor forums to automatically capture, store and categorize comments about products and those of its competitors. Other customers include Microsoft’s Bing, Duck Duck Go, eBay and Adobe.

“It’s A Big Web Out There”

Davi said the company has been beta testing the Global Index since last summer. One test, for example, ranked travel brands according to the kinds of sentiments found on forums.

The idea of the Index is to build a huge structured database of sorted, web-based knowledge that developers can tap for marketing or other uses, or for applications. Eventually, he indicated, the company would like to make it available via a dashboard as a searchable knowledge base of web content for marketers and other non-technical users.

In many ways, the Global Index is comparable to Google’s Knowledge Graph, which also categorizes info on the web into usable and related knowledge. But, Davi said, the Google effort is based on Wikipedia, the database from its Metaweb acquisition, several other sources and human efforts. It’s also available only through Google’s search engine, while the Global Index will shortly be open to the public.

Diffbot says its index, which has been autonomously spidering only since the summer, already contains more than 1.2 billion objects, where an object is an assembly of data representing some useful piece of knowledge, like a product. Google’s Knowledge Graph, it says, has only recently passed a billion objects after some years.

The initial focus of the Index has been on news and information, but the company has a much bigger ambition: to categorize most of the business-valuable information on the web. That will take at least three to five years, Davi acknowledges.

“It’s a big web out there,” he pointed out.



About The Author

Barry Levine
Barry Levine covers marketing technology for Third Door Media. Previously, he covered this space as a Senior Writer for VentureBeat, and he has written about these and other tech subjects for such publications as CMSWire and NewsFactor. He founded and led the web site/unit at PBS station Thirteen/WNET; worked as an online Senior Producer/writer for Viacom; created a successful interactive game, PLAY IT BY EAR: The First CD Game; founded and led an independent film showcase, CENTER SCREEN, based at Harvard and M.I.T.; and served over five years as a consultant to the M.I.T. Media Lab. You can find him at LinkedIn, and on Twitter at xBarryLevine.

Related Topics

AnalyticsChannel: Martech: Analytics & DataMarketing Tools: Analytics

Subscribe to receive daily martech news and expert insights. See terms.


We're listening.

Have something to say about this article? Share it with us on Facebook and Twitter.

Get the daily newsletter digital marketers rely on.
See terms.

ATTEND OUR EVENTS

MarTech 2021: March 16-17

MarTech 2021: Sept. 14-15

MarTech 2020: Watch On-Demand

×

Attend MarTech - Click Here


Learn More About Our MarTech Events

White Papers

  • The Six Principles of Building a Memorable Customer Experience
  • 5 Reasons Agencies Adopt Marketing Automation
  • How to Land Higher-Paying Clients: A 7-Step Framework to Grow Your Agency
  • B2B Marketing Trends Shaping 2021
  • State of Email Marketing 2021 Report
See More Whitepapers

Webinars

  • Crawl Your Way Towards Better Search Results With Dynamic Rendering
  • The AI Revolution Is Coming to Every Stage of Your Buyer’s Journey
  • The Fundamentals of Link Building for E-Commerce & Affiliate Sites in 2021
See More Webinars

Research Reports

  • Local Marketing Solutions for Multi-Location Businesses
  • Enterprise Digital Asset Management Platforms
  • Identity Resolution Platforms
  • Customer Data Platforms
  • B2B Marketing Automation Platforms
  • Call Analytics Platforms
See More Research

Register For MarTech - Free

Receive daily martech news and analysis.

Channels

  • Advertising
  • Marketing
  • Content
  • Social
  • Commerce
  • Sales
  • Analytics
  • Management
  • Home

Our Events

  • MarTech
  • SMX

Resources

  • White Papers
  • Research
  • Webinars

About

  • About Us
  • Contact
  • Privacy
  • Marketing Opportunities
  • Staff

Follow Us

  • Facebook
  • Twitter
  • LinkedIn
  • Newsletters
  • RSS

© 2021 Third Door Media, Inc. All rights reserved.