What Happens When AI Learns by Reading the Entire Internet?

October 12, 2020 - 8 minutes read

When we read, we make connections and store those connections for future reference. Whether it’s a new way to use a word or learning an unfamiliar word or phrase, we can strengthen our knowledge by increasing the number of connections to the new information. This is exactly how an artificial intelligence (AI) application by Stanford startup Diffbot works.

It reads every single page on the Internet, including foreign-language pages, and extracts as many facts as it can from its reading. Diffbot takes what the AI read and turns it into a number of three-part data points that can help make more connections: subject, verb, and object.
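To make the idea of a three-part data point concrete, here is a minimal Python sketch. It is not Diffbot's actual data format; the `Triple` class, the `facts_about` helper, and the wording of the sample facts are assumptions made up for illustration.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One three-part data point (hypothetical representation)."""
    subject: str
    verb: str
    obj: str


# Sample facts, phrased for illustration only.
facts = [
    Triple("Mike Tung", "is CEO of", "Diffbot"),
    Triple("Diffbot", "builds", "knowledge graph"),
]


def facts_about(subject, triples):
    """Return every triple whose subject matches the given name."""
    return [t for t in triples if t.subject == subject]
```

Calling `facts_about("Diffbot", facts)` returns only the second triple, because only that one has "Diffbot" as its subject.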

Synthesizing Information

Each three-part data point gets added to the existing knowledge base, which comprises billions of three-part data points. The data points become part of an interconnected network of information, called a knowledge graph. Knowledge graphs aren’t new; they’ve been around since the early days of AI research, but they’ve mostly been built by hand.
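One rough way to picture how separate data points become a connected graph is the Python sketch below. It assumes each data point is a plain (subject, verb, object) tuple; `build_graph` and `connected` are hypothetical helpers, not part of any Diffbot product.

```python
from collections import defaultdict


def build_graph(triples):
    """Index (subject, verb, object) tuples by subject so facts can be followed."""
    graph = defaultdict(list)
    for subject, verb, obj in triples:
        graph[subject].append((verb, obj))
    return graph


def connected(graph, start):
    """Return every entity reachable from `start` by following facts outward."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        # .get avoids creating empty entries for entities with no outgoing facts
        for _verb, obj in graph.get(node, []):
            if obj not in seen:
                seen.add(obj)
                stack.append(obj)
    return seen
```

Two facts that share an entity link up automatically: if one fact ends at "Diffbot" and another starts there, a search from the first subject reaches the second object as well.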

You’ve seen knowledge graphs in Google search results. They help you make connections by bringing together information about what you’ve searched for. For example, looking up a movie shows the cast, films related to the search, box office information, a general summary of the movie, and photos from scenes in the movie. This information is scattered all over the Internet, but when it’s aggregated, it creates more value and improves mental connections for the user.

The creator of the World Wide Web, Tim Berners-Lee, wanted to create something called “the semantic web,” which would’ve ultimately contained information for humans as well as machines. This vision would have allowed bots to shop for us, book our flights, and give more knowledgeable and actionable answers to questions than search engines currently do. But the knowledge graph was too cumbersome to build by hand, so the semantic web has not yet been realized.

Google’s knowledge graph is only available for the most popular search terms. But Diffbot’s AI is a promising step towards the semantic web because it aims to generate an enormous knowledge graph for everything (not just popular search terms). Diffbot fully automates the construction of the knowledge graph, which saves the company a lot of time and manual effort while enabling the knowledge graph to grow at astonishing rates. Not only that, but Diffbot is one of only three U.S.-based companies to crawl the entire web, alongside Google and Microsoft.

Victoria Lin is a research scientist at San Francisco-based Salesforce. She works on knowledge representation and natural language processing (NLP). She says that crawling the web is an excellent way to automate generating a large knowledge base because otherwise, it would take a lot of human effort.

More Equipped than Humans

To accomplish its job, the Diffbot AI uses a super-charged version of the Google Chrome browser to view raw pixels on a webpage. It then uses an image recognition algorithm to categorize the page into one of 20 types: discussion thread, event, article, image, and video, to name a few. To begin reading the webpage itself, the Diffbot AI identifies and categorizes specific elements on the page, like paragraphs, headings, author byline, product price, description, author bio, and more, using NLP to extract facts. This is all done extremely rapidly, especially compared to a human.

When a three-part data point is generated, it gets added to the ever-growing knowledge graph. It doesn’t matter if the language doesn’t align with the user’s query; if a user asks about a specific movie, the AI will pull information from articles written in Hindi and Mandarin. The CEO of Diffbot, Mike Tung, says watching the AI read webpages is like watching someone play a video game. It must navigate around pop-ups, switch between tabs, and scroll through pages.

Diffbot’s knowledge graph is rebuilt every four or five days, but the AI reading bot is crawling the web non-stop. It adds 100 to 150 million new three-part data points to the knowledge graph every month as companies, people, and products get added to the web. It uses machine learning applications to connect new facts with old ones, which sometimes requires rewriting out-of-date facts or simply fusing the new with the old. As the knowledge graph continues expanding rapidly, Diffbot has faced intermittent challenges with maintaining enough server space in its data centers.
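The difference between rewriting an out-of-date fact and fusing a new fact with old ones can be sketched in a few lines of Python. This is a simple rule-based stand-in, not the machine learning Diffbot uses; the `SINGLE_VALUED` set, the `merge_fact` helper, and the sample company "Acme" are all assumptions for illustration.

```python
# Assumption: some relations hold only one value at a time (a company has one
# CEO), so a newer fact replaces the older one. Everything else accumulates.
SINGLE_VALUED = {"has CEO"}


def merge_fact(store, subject, verb, obj):
    """Add one fact to `store`, a dict mapping (subject, verb) to a set of objects."""
    key = (subject, verb)
    if verb in SINGLE_VALUED:
        store[key] = {obj}                      # rewrite the out-of-date fact
    else:
        store.setdefault(key, set()).add(obj)   # fuse the new fact with the old
    return store
```

With this rule, recording a second "has CEO" fact for Acme overwrites the first, while recording a second "sells" fact keeps both products.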

Limitless Industry Applications

Diffbot allows researchers access to the knowledge graph for free. But the company also boasts an extensive portfolio of 400 paying customers, including DuckDuckGo (uses Diffbot to generate Google-like knowledge graphs), Snapchat (uses Diffbot to rapidly extract highlights from news articles), NASDAQ (uses Diffbot to get fast information about the stock market for financial research), and Zola (uses Diffbot to help brides and grooms make wedding lists by pulling in prices and images).

Even Adidas and Nike use Diffbot to scour the web for counterfeit shoes on sale. While Adidas could simply search in Google for articles mentioning “Adidas trainers,” Diffbot goes the extra mile by letting the company look for sites that have products that mention “Adidas trainers,” so all effort can be put towards sites actually selling “Adidas trainers” products.
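The contrast with a plain keyword search can be shown with a small Python sketch. It assumes each crawled page has already been labeled with a page type; the dictionary fields, the `product_pages_mentioning` helper, and the example.com URLs are hypothetical and do not reflect Diffbot's real query interface.

```python
def product_pages_mentioning(pages, phrase):
    """Return URLs of pages that are typed as products and mention the phrase."""
    phrase = phrase.lower()
    return [
        page["url"]
        for page in pages
        if page["type"] == "product" and phrase in page["text"].lower()
    ]


# Hypothetical crawl results: both pages mention the phrase, one is a product page.
pages = [
    {"url": "https://example.com/news/1", "type": "article",
     "text": "A review of Adidas trainers released this year."},
    {"url": "https://example.com/shop/9", "type": "product",
     "text": "Adidas trainers, brand new, half price."},
]
```

A keyword search would match both pages, whereas `product_pages_mentioning(pages, "Adidas trainers")` keeps only the shop listing.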

Diffbot’s Next Move

Companies using Diffbot’s knowledge graph have to interact with it using code. But, eventually, Tung wants to add an NLP interface to create an application that allows users to ask almost anything and get a response from the Diffbot AI with sources attributed. Tung calls this a “universal factoid question answering system” and says it won’t be possible with NLP alone, but it becomes an option if you combine multiple technologies.

But, according to Tung, Diffbot isn’t out to define intelligence. Instead, the company is “just trying to build something useful.” What do you think of Diffbot’s AI? Would it be useful for consumers, in addition to commercial uses?
