What Happens When AI Learns by Reading the Entire Internet?

October 12, 2020 - 10 minutes read

When we read, we make connections and store them for future reference. Whether we are learning a new way to use a word or encountering an unfamiliar phrase, we strengthen our knowledge by increasing the number of connections to the new information. This is essentially how an artificial intelligence (AI) application built by Stanford startup Diffbot works.

Artificial intelligence is technology that uses computers and machines to mimic the human brain and its capabilities. An AI that learns from the Internet could answer the questions you ask it. Or, using a language model like GPT-3, AI can compose well-constructed, human-like sentences, or even write passable rhyming poetry, by drawing on the vast collection of human culture. Newer-generation models such as GPT-4 have the technical capacity to read humanity’s digitized books, our digitized scientific papers, and much of the blogosphere.

But, What Happens When AI Has Read Everything?

Diffbot’s AI reads every single web page on the Internet, including foreign-language pages. That means reading hundreds of billions, if not trillions, of words, whether from Wikipedia articles, scientific papers published by researchers with domain expertise, or any other human-created text. It then extracts as many facts as it can from its reading. By analyzing that text, Diffbot turns what the AI read into a series of three-part data points that help make more connections: subject, verb, and object.
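To make the idea of a three-part data point concrete, here is a minimal sketch in Python. The `Triple` type, the `facts_about` helper, and the example facts are illustrative assumptions, not Diffbot’s actual data format.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One three-part data point: who or what, does what, to whom or what."""
    subject: str
    verb: str
    obj: str


# Hypothetical facts a reader might pull from the sentence
# "Mike Tung is the CEO of Diffbot, a Stanford startup."
facts = [
    Triple("Mike Tung", "is CEO of", "Diffbot"),
    Triple("Diffbot", "is a startup from", "Stanford"),
]


def facts_about(subject, triples):
    """Return every triple whose subject matches exactly."""
    return [t for t in triples if t.subject == subject]


diffbot_facts = facts_about("Diffbot", facts)
```

Because every fact has the same three-part shape, facts from very different pages can be stored and searched in the same way.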

So, AI can read everything on the Internet, but the seemingly infinite supply of online information grows and changes continuously. The AI therefore has to keep pace with the ever-changing, ever-growing knowledge on the Internet and update its data accordingly.

Synthesizing Information with Artificial Intelligence

Each three-part data point gets added to the existing knowledge base, which is composed of billions of such data points. Together, the data points form an interconnected network of information called a knowledge graph. Knowledge graphs aren’t new; they’ve been around since the early days of AI research, but they’ve mostly been built by hand.
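As a rough illustration of how separate data points become an interconnected network, the sketch below stores a few made-up facts in a plain Python dictionary and follows the links between them. The function names and facts are assumptions chosen for the example; a production knowledge graph would be far more elaborate.

```python
from collections import defaultdict


def build_graph(triples):
    """Index (subject, verb, object) facts by subject."""
    graph = defaultdict(list)
    for subject, verb, obj in triples:
        graph[subject].append((verb, obj))
    return graph


def neighbors(graph, entity):
    """Everything directly connected to an entity."""
    return [obj for _, obj in graph.get(entity, [])]


def two_hops(graph, entity):
    """Everything reachable by following two facts in a row."""
    return [far for near in neighbors(graph, entity)
            for far in neighbors(graph, near)]


# Made-up example facts.
triples = [
    ("Mike Tung", "is CEO of", "Diffbot"),
    ("Diffbot", "crawls", "the web"),
    ("Diffbot", "builds", "knowledge graph"),
]
graph = build_graph(triples)
related = two_hops(graph, "Mike Tung")
```

Here `two_hops` links Mike Tung to things the data never states about him directly, which is the kind of connection a knowledge graph makes possible.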

You’ve seen knowledge graphs in Google search results. They help you make connections by bringing together information about what you’ve searched for. For example, looking up a movie shows the cast, related films, box office information, a general summary of the movie, and photos from scenes in the movie. This information is available in separate places all over the Internet, but when it’s aggregated, it creates more value and makes connections easier for the user.

The creator of the World Wide Web, Tim Berners-Lee, wanted to create something called “the semantic web,” which would ultimately have contained information for machines as well as humans. This vision would have allowed bots to shop for us, book our flights, and give more knowledgeable and actionable answers to questions than search engines currently do. But building the necessary knowledge graph by hand proved too cumbersome, so the semantic web has yet to be realized.

Google’s knowledge graph is only available for the most popular search terms. Diffbot’s AI is a promising step towards the semantic web because it aims to generate an enormous knowledge graph for everything, not just popular search terms. After reading and analyzing unstructured data such as text documents, images, videos, and social media posts, Diffbot fully automates the construction of the knowledge graph. This saves the company a lot of time and manual effort while allowing the knowledge graph to grow at an astonishing rate. Diffbot is also one of only three U.S.-based companies that crawl the entire web, alongside Google and Microsoft.

Victoria Lin is a research scientist at San Francisco-based Salesforce. She works on knowledge representation and natural language processing (NLP). She says that crawling the web is an excellent way to automate generating a large knowledge base because otherwise, it would take a lot of human effort.

More Equipped than Humans with Advanced Machine Learning

To accomplish its job, the Diffbot AI uses a super-charged version of the Google Chrome browser to view the raw pixels of a webpage. It then uses an image recognition algorithm to categorize the page into one of 20 types: discussion thread, event, article, image, and video, to name a few. To begin reading the webpage itself, the Diffbot AI identifies and categorizes key elements on the page, such as paragraphs, headings, author byline, product price, description, and author bio, and then uses NLP to extract facts. Thanks to state-of-the-art deep learning neural networks, an advanced form of machine learning, all of this happens extremely rapidly, especially compared to a human.
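The two-stage flow described above (decide what kind of page this is, then pull out the elements that matter for that kind) can be sketched in a few lines of Python. This is only a toy: the `classify_page` rule is a stand-in for the image recognition step, and the field lists and sample page are assumptions, not Diffbot’s real categories or code.

```python
# Which elements to keep for each page type (illustrative subset only).
FIELDS_BY_TYPE = {
    "article": ["heading", "author byline", "paragraphs"],
    "product": ["product price", "description"],
}


def classify_page(elements):
    """Stand-in for the image recognition step: a trivial rule, not a model."""
    return "product" if "product price" in elements else "article"


def read_page(elements):
    """Stage 1: pick the page type. Stage 2: keep only that type's key elements."""
    page_type = classify_page(elements)
    wanted = FIELDS_BY_TYPE[page_type]
    return page_type, {name: elements[name] for name in wanted if name in elements}


# A made-up page, already broken into labeled elements.
page = {
    "heading": "Trail Shoe",
    "product price": "$80",
    "description": "Lightweight trail shoe",
}
page_type, fields = read_page(page)
```

Classifying first means the second stage only has to look for the elements that make sense for that type of page.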

When a three-part data point is generated, it gets added to the ever-growing knowledge graph. It doesn’t matter if the language of a page doesn’t match the user’s query; if a user asks about a specific movie, the AI can pull information from articles written in Hindi and Mandarin. The CEO of Diffbot, Mike Tung, says watching the AI read web pages is like watching someone play a video game. It must navigate around pop-ups, switch between tabs, and scroll through pages.

Diffbot’s knowledge graph is rebuilt every four or five days, but the AI reading bot is crawling the web non-stop. It adds 100 to 150 million new three-part data points to the knowledge graph every month as companies, people, and products get added to the web. It uses machine learning applications to connect new facts with old ones, which sometimes requires rewriting out-of-date facts or simply fusing the new with the old. As the knowledge graph continues expanding rapidly, Diffbot has faced intermittent challenges with maintaining enough server space for training data and processing power in its data centers.
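The difference between rewriting an out-of-date fact and fusing a new fact with old ones can be shown with a small Python sketch. It assumes, purely for illustration, that some verbs allow only one current value (a company has one CEO) while others can accumulate many; the `SINGLE_VALUED` set, the `merge_fact` helper, and “ExampleCorp” are all hypothetical, and Diffbot’s machine learning approach is certainly more sophisticated than this rule.

```python
# Assumption: verbs listed here can hold only one current value.
SINGLE_VALUED = {"has CEO", "has headquarters in"}


def merge_fact(store, subject, verb, obj):
    """Add a new fact, either replacing the old value or joining it."""
    key = (subject, verb)
    if verb in SINGLE_VALUED:
        store[key] = {obj}                     # rewrite the out-of-date fact
    else:
        store.setdefault(key, set()).add(obj)  # fuse the new with the old
    return store


store = {}
merge_fact(store, "ExampleCorp", "has CEO", "Alice")
merge_fact(store, "ExampleCorp", "has CEO", "Bob")      # replaces Alice
merge_fact(store, "ExampleCorp", "sells", "shoes")
merge_fact(store, "ExampleCorp", "sells", "jackets")    # kept alongside shoes
```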

Limitless Industry Applications with Deep Learning

Diffbot allows researchers access to the knowledge graph for free. But the company also boasts an extensive portfolio of 400 paying customers, including DuckDuckGo (which uses Diffbot to generate Google-like knowledge graphs), Snapchat (which uses it to rapidly extract highlights from news articles), NASDAQ (which uses it to get fast information about the stock market for financial research), and Zola (which uses it to help brides and grooms make wedding lists by pulling in prices and images).

Even Adidas and Nike use Diffbot to scour the web for counterfeit shoes on sale. While Adidas could simply search Google for articles mentioning “Adidas trainers,” Diffbot goes the extra mile by letting the company search specifically for products that mention “Adidas trainers,” so all effort can be directed at sites that are actually selling them.
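The advantage of searching by type rather than by keyword alone can be illustrated with a short Python sketch. The `Entity` class, the `find` function, and the sample pages are hypothetical; they are meant only to show the idea, not how Diffbot’s query interface actually looks.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    kind: str   # what the page was categorized as, e.g. "article" or "product"
    url: str
    text: str


def find(entities, phrase, kind):
    """Match the phrase, but only within entities of the requested kind."""
    phrase = phrase.lower()
    return [e for e in entities if e.kind == kind and phrase in e.text.lower()]


# Made-up pages for illustration.
pages = [
    Entity("article", "https://news.example/review", "A review of Adidas trainers"),
    Entity("product", "https://shop.example/item/1", "Adidas trainers, size 42"),
    Entity("product", "https://shop.example/item/2", "Leather boots"),
]
hits = find(pages, "Adidas trainers", "product")
```

A plain keyword search would also return the review article; filtering on `kind` leaves only the page that is offering the shoes for sale.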

Diffbot’s Next Move

Companies using Diffbot’s knowledge graph have to interact with it using code. But, eventually, Tung wants to add an NLP interface to create an application that allows users to ask almost anything and get a response from the Diffbot AI with sources attributed. Tung calls this a “universal factoid question answering system” and says it won’t be possible with NLP alone, but it becomes an option if you combine multiple technologies.
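A heavily simplified Python sketch of what “a response with sources attributed” could look like is shown below. It skips the natural-language part entirely and takes an already structured question; the `FACTS` table, the `answer` function, and the source URL are all made up for illustration.

```python
# (subject, verb) -> list of (object, source URL). The URL is invented.
FACTS = {
    ("Diffbot", "has CEO"): [
        ("Mike Tung", "https://news.example/diffbot-profile"),
    ],
}


def answer(subject, verb):
    """Return the stored answer together with the pages it came from."""
    matches = FACTS.get((subject, verb), [])
    if not matches:
        return None  # no fact on record, so no answer is given
    obj, _ = matches[0]
    sources = [url for value, url in matches if value == obj]
    return {"answer": obj, "sources": sources}


reply = answer("Diffbot", "has CEO")
```

Keeping the source next to each fact is what lets the reply say where its answer came from instead of simply asserting it.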

But, according to Tung, Diffbot isn’t out to define intelligence. Instead, the company is “just trying to build something useful.” What do you think of Diffbot’s AI? Would it be useful for consumers, in addition to commercial uses by various businesses?