The startup trying to turn the web into a database

“The web is a collection of data, but it’s a mess,” says Exa cofounder and CEO Will Bryk. “There’s a Joe Rogan video over here, an Atlantic article over there. There’s no organization. But the dream is for the web to feel like a database.”

Websets is aimed at power users who need to look for things that other search engines aren’t great at finding, such as types of people or companies. Ask it for “startups making futuristic hardware” and you get a list of specific companies hundreds long rather than hit-or-miss links to web pages that mention those terms. Google can’t do that, says Bryk: “There’s a lot of valuable use cases for investors or recruiters or really anyone who wants any sort of data set from the web.”

Things have moved fast since MIT Technology Review broke the news in 2021 that Google researchers were exploring the use of large language models in a new kind of search engine. The idea soon attracted fierce critics. But tech companies took little notice. Three years on, giants like Google and Microsoft jostle with a raft of buzzy newcomers like Perplexity and OpenAI, which launched ChatGPT Search in October, for a piece of this hot new trend.

Exa isn’t (yet) trying to outdo any of those companies. Instead, it’s proposing something new. Most other search firms wrap large language models around existing search engines, using the models to analyze a user’s query and then summarize the results. But the search engines themselves haven’t changed much. Perplexity still directs its queries to Google Search or Bing, for example. Think of today’s AI search engines as a sandwich with fresh bread but stale filling.

More than keywords

Exa provides users with familiar lists of links but uses the tech behind large language models to reinvent how search itself is done. Here’s the basic idea: Google works by crawling the web and building a vast index of keywords that then get matched to users’ queries. Exa crawls the web and encodes the contents of web pages into a format known as embeddings, which can be processed by large language models.

Embeddings turn words into numbers in such a way that words with similar meanings become numbers with similar values. In effect, this lets Exa capture the meaning of text on web pages, not just the keywords.
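To make the idea of “similar meanings, similar numbers” concrete, here is a minimal sketch in Python. It is not Exa’s code: the three-dimensional vectors are made up by hand for illustration (real embeddings are learned by a model and have hundreds or thousands of dimensions), and cosine similarity is just one common way to measure how close two embeddings are.

```python
import math

# Hand-made toy vectors, assumed purely for illustration.
# Real embeddings are produced by a trained model, not written by hand.
TOY_EMBEDDINGS = {
    "startup": [0.9, 0.1, 0.0],
    "company": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Words with related meanings should score higher than unrelated ones.
close = cosine_similarity(TOY_EMBEDDINGS["startup"], TOY_EMBEDDINGS["company"])
far = cosine_similarity(TOY_EMBEDDINGS["startup"], TOY_EMBEDDINGS["banana"])

print(f"startup vs company: {close:.3f}")
print(f"startup vs banana:  {far:.3f}")
```

With these toy numbers, “startup” and “company” land close together while “banana” sits far away, even though none of the three words share any letters or keywords.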

A screenshot of Websets showing results for the search: “companies; startups; US-based; healthcare focus; technical co-founder”

Large language models use embeddings to predict the next words in a sentence. Exa’s search engine predicts the next link. Type “startups making futuristic hardware” and the model will come up with (real) links that might follow that phrase.
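One simple way to picture search over embeddings is as a ranking problem: turn the query into a vector, then return the links whose page vectors sit closest to it. The sketch below does exactly that. It is a stand-in, not Exa’s method: the article describes Exa’s model as predicting the next link, whereas this example uses plain nearest-neighbor ranking. The URLs, the page vectors, and the query vector are all hypothetical and assigned by hand.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical pages with hand-assigned toy embeddings (illustration only).
PAGES = [
    ("https://example.com/robotics-startup", [0.9, 0.8, 0.1]),
    ("https://example.com/banana-bread-recipe", [0.0, 0.1, 0.9]),
    ("https://example.com/chip-design-company", [0.8, 0.9, 0.0]),
]


def rank_links(query_vector, pages, top_k=2):
    """Return the URLs of the top_k pages most similar to the query vector."""
    scored = sorted(
        pages,
        key=lambda page: cosine_similarity(query_vector, page[1]),
        reverse=True,
    )
    return [url for url, _ in scored[:top_k]]


# Assumed stand-in for the embedding of "startups making futuristic hardware".
# A real system would compute this vector with an embedding model.
QUERY_VECTOR = [1.0, 0.9, 0.0]

top_links = rank_links(QUERY_VECTOR, PAGES)
for url in top_links:
    print(url)
```

The point of the toy example is that the two hardware-related pages are returned because their vectors point the same way as the query, not because they contain the words “startups” or “futuristic.”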

Exa’s approach comes at a cost, however. Encoding pages rather than indexing keywords is slow and expensive. Exa has encoded some billion web pages, says Bryk. That’s tiny next to Google, which has indexed around a trillion. But Bryk doesn’t see this as a problem: “You don’t have to embed the whole web to be useful,” he says. (Fun fact: “exa” means a 1 followed by 18 0s and “googol” means a 1 followed by 100 0s.)
