For example, they trained 50 versions of an image recognition model on ImageNet, a dataset of images of everyday objects. The only difference between training runs was the random values assigned to the neural network at the start. Yet despite all 50 models scoring more or less the same in the training test, which suggested that they were equally accurate, their performance varied wildly in the stress test.
The stress test used ImageNet-C, a dataset of images from ImageNet that have been pixelated or had their brightness and contrast altered, and ObjectNet, a dataset of images of everyday objects in unusual poses, such as chairs on their backs, upside-down teapots, and T-shirts hanging from hooks. Some of the 50 models did well with pixelated images, some did well with the unusual poses; some did much better overall than others. But as far as the standard training process was concerned, they were all the same.
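The effect can be reproduced in miniature. The sketch below is a toy illustration, not the researchers' ImageNet setup: it assumes a two-feature logistic regression in which the second feature duplicates the first during training and becomes independent noise in the stress test. All function names and settings here are hypothetical.

```python
import math
import random


def make_data(n, rng, shifted):
    """Label depends only on x1. x2 copies x1, unless `shifted`, when it is independent noise."""
    rows = []
    for _ in range(n):
        x1 = rng.uniform(-1, 1)
        x2 = rng.uniform(-1, 1) if shifted else x1
        rows.append((x1, x2, 1 if x1 > 0 else 0))
    return rows


def train(data, seed, epochs=200, lr=0.1):
    """Logistic regression by gradient descent; only the random starting weights depend on the seed."""
    rng = random.Random(seed)
    w1, w2 = rng.gauss(0, 1), rng.gauss(0, 1)
    for _ in range(epochs):
        g1 = g2 = 0.0
        for x1, x2, y in data:
            p = 1.0 / (1.0 + math.exp(-(w1 * x1 + w2 * x2)))
            g1 += (p - y) * x1
            g2 += (p - y) * x2
        w1 -= lr * g1 / len(data)
        w2 -= lr * g2 / len(data)
    return w1, w2


def accuracy(model, data):
    """Fraction of rows where the sign of the linear score matches the label."""
    w1, w2 = model
    hits = sum((w1 * x1 + w2 * x2 > 0) == (y == 1) for x1, x2, y in data)
    return hits / len(data)


def run_experiment(n_seeds):
    """Train one model per seed on the same data; score each on an ordinary test and a stress test."""
    data_rng = random.Random(12345)
    train_set = make_data(200, data_rng, shifted=False)
    iid_set = make_data(200, data_rng, shifted=False)      # same distribution as training
    stress_set = make_data(2000, data_rng, shifted=True)   # x2 no longer tracks x1
    models = [train(train_set, seed) for seed in range(n_seeds)]
    return ([accuracy(m, iid_set) for m in models],
            [accuracy(m, stress_set) for m in models])


if __name__ == "__main__":
    iid_acc, stress_acc = run_experiment(n_seeds=50)
    print("ordinary test accuracy range:", min(iid_acc), max(iid_acc))
    print("stress test accuracy range:  ", min(stress_acc), max(stress_acc))
```

Because the two features are identical during training, they receive identical gradient updates, so the gap between the two weights stays wherever the random start put it. The ordinary test cannot see that gap; the stress test can.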
The researchers carried out similar experiments with two different NLP systems, and three medical AIs for predicting eye disease from retinal scans, cancer from skin lesions, and kidney failure from patient records. Every system had the same problem: models that should have been equally accurate performed differently when tested with real-world data, such as different retinal scans or skin types.
We might need to rethink how we evaluate neural networks, says Rohrer. “It pokes some significant holes in the fundamental assumptions we’ve been making.”
D’Amour agrees. “The biggest, immediate takeaway is that we need to be doing a lot more testing,” he says. That won’t be easy, however. The stress tests were tailored specifically to each task, using data taken from the real world or data that mimicked the real world. Such data is not always available.
Some stress tests are also at odds with each other: models that were good at recognizing pixelated images were often bad at recognizing images with high contrast, for example. It might not always be possible to train a single model that passes all stress tests.
Multiple choice
One option is to add a stage to the training and testing process in which many models are produced at once instead of just one. These competing models can then be tested again on specific real-world tasks to select the best one for the job.
That’s a lot of work. But for a company like Google, which builds and deploys big models, it could be worth it, says Yannic Kilcher, a machine-learning researcher at ETH Zurich. Google could offer 50 different versions of an NLP model and application developers could pick the one that worked best for them, he says.
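As a rough sketch of what that selection step might look like, the snippet below picks a model per stress test from a table of scores. The model names, test names, and numbers are placeholders invented for illustration, not measured results, and the helper functions are hypothetical.

```python
def pick_best(scores, task):
    """Return the candidate with the highest score on one stress test."""
    return max(scores, key=lambda name: scores[name][task])


def pick_robust(scores):
    """Return the candidate whose worst stress-test score is highest."""
    return max(scores, key=lambda name: min(scores[name].values()))


# Placeholder scores for illustration only; not results from any real model.
scores = {
    "seed_0": {"pixelated": 0.71, "high_contrast": 0.52},
    "seed_1": {"pixelated": 0.55, "high_contrast": 0.69},
    "seed_2": {"pixelated": 0.63, "high_contrast": 0.61},
}

if __name__ == "__main__":
    for task in ("pixelated", "high_contrast"):
        print(task, "->", pick_best(scores, task))
    print("most robust ->", pick_robust(scores))
```

When stress tests pull in different directions, as with pixelated and high-contrast images, the best model for one task may differ from the best for another, so a developer would choose either per task or by worst-case score.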
D’Amour and his colleagues don’t yet have a fix but are exploring ways to improve the training process. “We need to get better at specifying exactly what our requirements are for our models,” he says. “Because often what ends up happening is that we discover these requirements only after the model has failed out in the world.”
Getting a fix is vital if AI is to have as much impact outside the lab as it has inside. When AI underperforms in the real world, it makes people less willing to use it, says co-author Katherine Heller, who works at Google on AI for healthcare: “We’ve lost a lot of trust when it comes to the killer applications, that’s important trust that we want to regain.”