
This could lead to the next big breakthrough in common sense AI

November 6, 2020 at 9:00 am

AI models that can parse both language and visual input also have very practical uses. If we want to build robotic assistants, for example, they need computer vision to navigate the world and language to communicate about it to humans.

But combining both types of AI is easier said than done. It isn’t as simple as stapling together an existing language model with an existing object recognition system. It requires training a new model from scratch with a data set that includes text and images, otherwise known as a visual-language data set.

The most common approach for curating such a data set is to compile a collection of images with descriptive captions. A picture like the one below, for example, would be captioned “An orange cat sits in the suitcase ready to be packed.” This differs from typical image data sets, which would label the same picture with only one noun, like “cat.” A visual-language data set can therefore teach an AI model not just how to recognize objects but how they relate to and act on one another, using verbs and prepositions.

But you can see why this data curation process would take forever. This is why the visual-language data sets that exist are so puny. A popular text-only data set like English Wikipedia (which indeed includes nearly all the English-language Wikipedia entries) might contain nearly 3 billion words. A visual-language data set like Microsoft Common Objects in Context, or MS COCO, contains only 7 million words. It’s simply not enough data to train an AI model for anything useful.

“Vokenization” gets around this problem, using unsupervised learning methods to scale the tiny amount of data in MS COCO to the size of English Wikipedia. The resultant visual-language model outperforms state-of-the-art models in some of the hardest tests used to evaluate AI language comprehension today.

“You don’t beat state of the art on these tests by just trying a little bit,” says Thomas Wolf, the cofounder and chief science officer of the natural-language processing startup Hugging Face, who was not part of the research. “This is not a toy test. This is why this is super exciting.”

From tokens to vokens

Let’s first sort out some terminology. What on earth is a “voken”?

In AI speak, the words that are used to train language models are known as tokens. So the UNC researchers decided to call the image associated with each token in their visual-language model a voken. Vokenizer is what they call the algorithm that finds vokens for each token, and vokenization is what they call the whole process.

The point of this isn’t just to show how much AI researchers love making up words. (They really do.) It also helps break down the basic idea behind vokenization. Instead of starting with an image data set and manually writing sentences to serve as captions (a very slow process), the UNC researchers started with a language data set and used unsupervised learning to match each word with a relevant image (more on this later). This is a highly scalable process.

The unsupervised learning technique, here, is ultimately the contribution of the paper. How do you actually find a relevant image for each word?

Vokenization

Let’s go back for a moment to GPT-3. GPT-3 is part of a family of language models known as transformers, which represented a major breakthrough in applying unsupervised learning to natural-language processing when the first one was introduced in 2017. Transformers learn the patterns of human language by observing how words are used in context and then creating a mathematical representation of each word, known as a “word embedding,” based on that context. The embedding for the word “cat” might show, for example, that it is frequently used around the words “meow” and “orange” but less often around the words “bark” or “blue.”

This is how transformers approximate the meanings of words, and how GPT-3 can write such human-like sentences. It relies in part on these embeddings to tell it how to assemble words into sentences, and sentences into paragraphs.
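The idea that a word's representation reflects the company it keeps can be illustrated with a much simpler stand-in than a transformer. The sketch below is only an assumption-laden toy: it counts neighboring words in a tiny invented corpus instead of training a model, and the corpus, the `context_vector` helper, and the window size are all made up for illustration.

```python
from collections import Counter

# Toy corpus, invented for illustration only.
corpus = [
    "the orange cat says meow",
    "an orange cat will meow loudly",
    "the cat sleeps on the bed",
    "the dog will bark loudly",
    "a blue car drives past the dog",
]

def context_vector(target, sentences, window=2):
    """Count the words that appear within `window` positions of `target`."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            if word != target:
                continue
            start = max(0, i - window)
            end = min(len(words), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[words[j]] += 1
    return counts

# A crude count-based "embedding" for the word "cat".
cat = context_vector("cat", corpus)
print(cat["orange"], cat["meow"], cat["bark"], cat["blue"])  # prints: 2 2 0 0
```

In this toy corpus, “cat” appears near “orange” and “meow” but never near “bark” or “blue.” A real transformer learns dense vectors through training rather than raw counts, but the intuition is the same.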

There’s a parallel technique that can also be used for images. Instead of scanning text for word usage patterns, it scans images for visual patterns. It tabulates how often a cat, say, appears on a bed versus on a tree, and creates a “cat” embedding with this contextual information.

The insight of the UNC researchers was that they should use both embedding techniques on MS COCO. They converted the images into visual embeddings and the captions into word embeddings. What’s really neat about these embeddings is that they can then be projected into a three-dimensional space and graphed, and you can literally see how they are related to one another. Visual embeddings that are closely related to word embeddings will appear closer in the graph. In other words, the visual cat embedding should (in theory) overlap with the text-based cat embedding. Pretty cool.

You can see where this is going. Once the embeddings are all graphed and compared and related to one another, it’s easy to start matching images (vokens) with words (tokens). And remember, because the images and words are matched based on their embeddings, they’re also matched based on context. This is useful when one word can have totally different meanings. The technique successfully handles that by finding different vokens for each instance of the word.
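The matching step can be sketched as a nearest-neighbor lookup in a shared embedding space. This is a simplified illustration, not the researchers' vokenizer: the file names, the three-dimensional vectors, the word “bat,” and the use of plain cosine similarity are all assumptions chosen to keep the example small.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Hypothetical image embeddings (the candidate vokens), hand-picked values.
voken_embeddings = {
    "baseball_bat.jpg": [0.9, 0.1, 0.1],
    "flying_bat.jpg": [0.1, 0.9, 0.2],
}

# Hypothetical contextual embeddings for two occurrences of the same word.
# Because each vector depends on the surrounding sentence, the two differ.
token_embeddings = {
    ("bat", "he swung the bat at the ball"): [0.8, 0.2, 0.0],
    ("bat", "a bat flew out of the cave"): [0.2, 0.7, 0.3],
}

def nearest_voken(token_vec, vokens):
    """Return the name of the image whose embedding is most similar."""
    return max(vokens, key=lambda name: cosine(token_vec, vokens[name]))

matches = {
    key: nearest_voken(vec, voken_embeddings)
    for key, vec in token_embeddings.items()
}
for (word, sentence), image in matches.items():
    print(f"{word!r} in {sentence!r} -> {image}")
```

With these made-up vectors, the two uses of “bat” land on different images, which is the context-sensitive behavior described above.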

