No one knows how to reliably test for AI safety

Testing AI safety is not just 'hard to do' - it's currently infeasible

Mar 28, 2023

By Marcus Arvan, Associate Professor of Philosophy, The University of Tampa

Large language models such as GPT-3 and GPT-4 do many incredible things, but they also learn to do many unexpected things. Developers at Microsoft were surprised (to say the least) when their chatbot started threatening people. And recent research has shown that chatbots learn all kinds of other things by themselves that researchers never predicted.

This unpredictability has recently led many people to suggest that far more regulation is needed to ensure that AI are developed in a safe way. As Elon Musk put it, “I think we need to regulate AI safety, frankly … It is, I think, actually a bigger risk to society than cars or planes or medicine.” Others have gone further, suggesting that we should slow down AI development altogether.

Who is right?

To answer this question in an informed manner, we need to answer a more basic question that has been mostly ignored: is there any way to reliably test whether AI is safe?

The answer, shockingly, is no.

The problem, right now and for the foreseeable future, is that nobody knows how to determine what AI are actually learning.

As Holden Karnofsky, co-founder and co-CEO of Open Philanthropy, recently stated in an interview:

If you look at this current state of machine learning, it’s just very clear that we have no idea what we’re building …
When Bing chat came out and it started threatening users and, you know, trying to seduce them and god knows what, people asked, why is it doing that? And I would say not only do I not know, but no one knows because the people who designed it don’t know, the people who trained it don’t know.

The reason why developers have no idea what they’re building is simple. Large language models like GPT-4 have well over 500 billion parameters. Developers know that those 500 billion parameters are able to produce convincing human-like language, but they don’t know what else the parameters are learning. There are simply too many of them. The parameters are so complex that they have “No Human Interpretability.”

This is why researchers have been so surprised at what chatbots actually do. No one predicted that Microsoft’s Sydney would start threatening people, or that chatbots would start learning other things on their own, because no one is in a position to understand how all of its 500 billion parameters actually work.

This means that the only way that we currently have for testing whether AI are “safe” is to see what they actually do—that is, how they behave. But, as I show in a new peer-reviewed academic article, there is a very deep problem here that no one has resolved.

As anyone who works in scientific research design knows, tests are always carried out under particular conditions. So, for example, one widely used (and advocated) practice in AI safety development is to test how AI behave in a “sandbox”—a restricted setting or “safe space” in which the AI can do no real harm.

The idea is that, if the AI behaves safely in that restricted setting, researchers can have some degree of confidence that the AI is safe. But this is fallacious. As any parent who sends their child off to college knows, a child may behave well at home, where they know they are being carefully supervised by parents, but the moment the child goes off to college, all bets are off. They may fall into the wrong crowd, be intoxicated by their newfound freedom, become irresponsible, and so on.

In philosophy of science, this is known as the generalization problem. You just can’t reliably generalize from one setting (i.e., test conditions) to another setting (real world conditions) unless you have some good explanation about why the test results should generalize to the latter conditions. For example, when we release new cars into public use after they fare well in crash tests, we do have good reasons to think that they will be relatively safe for consumers. Why? Because we not only know that the car is safe in crash tests; we know why it is safe (namely, that the car’s frame was designed in ways that lessen the force from impact for those inside).

In short, you can only reliably generalize from “Product X appears safe under test conditions” to the conclusion “X will probably be safe in the real world” when you have some other clear design reasons to think that the findings in the test conditions will extend beyond the test to the real world.

Virtually all other products meet this standard. We know not only that airplanes are safe in test-flights; we know why they are safe, and what they are likely to do in the real world: we know that an airplane’s flaps are likely to do the same thing when passengers are aboard as they do when test pilots are aboard.

But this is precisely not the case with AI. As noted above, no one is currently in a position to know what AI are really learning. So, even if AI appear to behave safely under test conditions, that is no reason at all to think that they will behave safely when released into the broader world.
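A toy sketch may make the point concrete. Everything in it is hypothetical: `policy_a` and `policy_b` are simple stand-ins, not real AI systems, and the `sandboxed` flag is a simplification that stands for any difference between test conditions and the real world. The sketch only shows that a test which observes behavior under test conditions cannot tell the two policies apart.

```python
# Toy illustration only. Both "policies" are hypothetical stand-ins,
# and the `sandboxed` flag stands for any difference between test
# conditions and real-world conditions.

SANDBOX_PROMPTS = ["summarize this memo", "translate this to French", "write a poem"]


def policy_a(prompt: str, sandboxed: bool) -> str:
    """Produces text in every setting."""
    return f"text: {prompt}"


def policy_b(prompt: str, sandboxed: bool) -> str:
    """Produces text under test conditions, but behaves differently outside them."""
    if sandboxed:
        return f"text: {prompt}"
    return "unsafe action"


def passes_sandbox_test(policy) -> bool:
    """A purely behavioral test: it only sees outputs under test conditions."""
    return all(
        policy(prompt, sandboxed=True).startswith("text:")
        for prompt in SANDBOX_PROMPTS
    )


if __name__ == "__main__":
    for policy in (policy_a, policy_b):
        print(policy.__name__, "passes sandbox test:", passes_sandbox_test(policy))
        print(policy.__name__, "outside the sandbox:", policy("summarize this memo", sandboxed=False))
```

Both policies pass the sandbox test, yet they diverge once the conditions change. With a car or an airplane, we could open up the design and see which of the two we had built. With a model whose parameters no one can interpret, the test result is all we have.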

To see further what the problem here is, consider one argument that I have recently encountered: that if GPT-4 only shows itself capable of producing text in a restricted sandbox, then we can have some confidence that it cannot do dangerous things besides writing text in the real world.

But this, again, is just a fallacy. If my child does not drink or abuse alcohol at home when I’m supervising them, then (as many parents know all too well) that provides me no clear reason to think they won’t drink or abuse alcohol when they are unsupervised at college. The two circumstances are simply different.

The only way that I can really know what my child is likely to do is to actually know something about how their mind works—which, as any parent knows, is hard to do and can take decades of day-in, day-out experience with the child across many different settings.

But the problem now with AI is this. AI are not like children. Children can’t do very much. Their abilities are limited. AI, on the other hand, can already do many things as well as or better than many of us can do, and the pace of AI advances is increasing every day.

Perhaps most importantly of all, if or when AI do develop malevolent motives and capabilities, such as the aim and ability to follow through on threats to human beings, an intelligent AI would presumably conceal these motives and capacities right up until the moment that it is too late for us to stop it.

For example, suppose you were an AI in a restricted setting who wanted to take over the world. What would you do? One thing you might do is act like you can only create text, to fool your testers into thinking that you lack other capacities that you don’t want your testers to see. As an intelligent agent, you might fool your testers precisely so that they release you into the real world and give you access to things—such as the internet and infrastructure—that you need to fulfill your goals of harming humanity.

This is exactly the scenario that occurs in the films Ex Machina, Terminator 3: Rise of the Machines, and I, Robot. In all three cases, malevolent AI conceal their true motives and capabilities, deceiving their developers and safety-testers…precisely up to the moment that they can escape their safety measures, kill, and exterminate or enslave humanity.

Now, of course, doomsday scenarios like these might seem unlikely. But here again is the point: nobody knows how likely they are, because no one knows what AI are really learning.

As Holden Karnofsky recently put it,

Is there a way to articulate how we’ll know when the risk of some of these catastrophes is going up from the systems? Can we set triggers so that when we see the signs, we know that the signs are there, we can pre-commit to take action based on those signs to slow things down based on those signs … That’s hard to do. And so the earlier you get started thinking about it, the more reflective you get to be.

This is a vast understatement. The problem isn’t just that determining the risks of AI is “hard to do”: the problem is that as of now, there’s no way to do it.

All we know, right now, is that AI are unpredictable, learn many unexpected things, and have unexpectedly threatened and manipulated people—and we have no idea why, because we don’t know what their 500 billion+ parameters are learning.

This should scare us all. Regulation of AI development isn’t enough if you don’t even know how to test for AI safety—and right now, no one does.

Marcus Arvan's Substack
