The {Closed} Session

AI Alignment with Brian Christian of 'The Alignment Problem'

Episode Summary

What does ‘AI alignment’ mean? Can philosophy help make AI less biased? How does reinforcement learning influence AI's unpredictability? How does AI's ‘frame problem’ affect its ability to understand objects? What role does human feedback play in machine learning and AI fine-tuning? An acclaimed author and researcher who explores the human implications of computer science, Brian Christian is best known for his bestselling series of books: "The Most Human Human" (2011), "Algorithms to Live By" (2016), and "The Alignment Problem" (2020). The latter explores the ethical issues in AI, highlighting the biases and unintended outcomes in these systems and the crucial efforts to resolve them, defining our evolving bond with technology. With his deep insights and experiences, Brian brings a unique perspective to the conversation about the ethics and safety challenges confronting the field of AI.

Episode Notes

What does ‘AI alignment’ mean? Can philosophy help make AI less biased? How does reinforcement learning influence AI's unpredictability? How does AI's ‘frame problem’ affect its ability to understand objects? What role does human feedback play in machine learning and AI fine-tuning?

An acclaimed author and researcher who explores the human implications of computer science, Brian Christian is best known for his bestselling series of books: "The Most Human Human" (2011), "Algorithms to Live By" (2016), and "The Alignment Problem" (2020). The latter explores the ethical issues in AI, highlighting the biases and unintended outcomes in these systems and the crucial efforts to resolve them, defining our evolving bond with technology. With his deep insights and experiences, Brian brings a unique perspective to the conversation about ethics and safety challenges confronting the field of AI.

Listen to the episode and read the transcript at superset.com

Guest: Brian Christian

Twitter: @supersetstudio

@ClosedSeshPod

@tommychavez

@vsvaidya

Episode Transcription

Speaker 1:

Welcome to The Closed Session, How to Get Paid in Silicon Valley, with your hosts, Tom Chavez and Vivek Vaidya.

Vivek Vaidya:

Welcome back to season four of The Closed Session Podcast. I'm Vivek, and with me I have...

Tom Chavez:

This is Tom Chavez.

Vivek Vaidya:

Today we have another special guest with us. Our guest is an accomplished and acclaimed author of many books. He's a researcher too. He's at Berkeley right now and on his way to Oxford. Welcome, Brian Christian.

Brian Christian:

Thanks for having me.

Vivek Vaidya:

Brian, we hosted you a couple of months ago actually at Super Summit, and you delivered a keynote where you delved into your most recent book, The Alignment Problem. Everybody in the Hive has a copy, and they're all raving about it.

Tom Chavez:

Everybody went bonkers. They loved it. Seriously. It's great.

Brian Christian:

Yeah.

Vivek Vaidya:

Tom was bonkers about it even before.

Tom Chavez:

I'm a big fanboy. There's that.

Vivek Vaidya:

So as we start this, can you give us an overview of the book and explain what exactly The Alignment Problem refers to?

Brian Christian:

Sure. So The Alignment Problem refers to a situation specifically in the context of machine learning systems, systems that we've trained by data, by example. There's this pattern that machine learning systems often fall into where they do the thing that we specified, but that was not the thing that we actually wanted. I think anyone who's worked in ML, whether that's on the industry side or in academia, has those stories of, "Yes, okay, it technically did do exactly what I programmed it to do or incentivized it to do, but that wasn't what I was intending. It wasn't what I was going for."

This is an idea that has a really long history back into the 1960s, if not earlier. Norbert Wiener at MIT was writing about this at a time when machine learning systems were barely playing checkers. He was already saying, "If the purpose we program into the system is not the thing we truly desire, we're setting ourselves up for disaster."

I think this was something that you started to see bubbling back up in the computer science community, initially through philosophers like Nick Bostrom and folks like that, as this scary specter as AI systems started to get more and more powerful. Then you started hearing it from computer scientists themselves, people like Stuart Russell. And then really in the second half of the 2010s, you saw AI safety and AI alignment becoming a proper academic discipline with an actual research agenda.

I would say we're now at a stage where this isn't even really just an academic research agenda anymore. This is just applied engineering work that is happening in real time at places like OpenAI that are serving hundreds of millions of customers. So we've really seen AI alignment, in, I think, a breathtakingly short span of time, go from kind of a thought experiment, to this thing on the whiteboards of places like Berkeley, Stanford, et cetera, to now being a major part of the policy conversation, but also a major part of what some of the largest companies in the world are up to.

Tom Chavez:

Yeah, and as we were plugging in here getting ready for the podcast, we were talking with you, Brian, a little bit about it, and we were reminiscing. I was reminiscing on when I was in graduate school and there was a class on this kooky stupid thing called genetic algorithms and another kooky stupid thing called neural networks. At the time, why would you waste your time? There's nothing axiomatically grounded about any of that. It's an oddity. Don't waste time on it. And oh my goodness, to your point, it's shocking how fast it's taken root and overtaken us all. Well, it hasn't yet overtaken us, but it has overtaken the industry we're in certainly-

Brian Christian:

Absolutely.

Tom Chavez:

... where Vivek and I live all day long. So can you fix ideas with a couple of examples of The Alignment Problem, like here on earth, ground level examples for our listeners to help them understand unintended consequences, and what this actually looks like in practice?

Brian Christian:

Yeah. I can give a few ranging from comic to tragic. On the comic side of the spectrum, one of my favorite examples is from Astro Teller who today runs Google X. Back in his graduate student days at Stanford, he was working on the RoboCup soccer competition and trying to develop this reinforcement learning system that was going to play soccer. In effect, if you're trying to teach a reinforcement learning system from scratch to play soccer, you have to be insanely patient for it to discover something that's going to actually score points. And so in practice, what you have to do is something called a shaping reward where you basically give this kind of intermediate incentive for going in the right direction. And so he gave it something like 1/100 of a goal for taking possession of the ball, which seemed like a very reasonable thing to teach a system to do in order to score goals.

But what happened was his system learned to basically just approach the ball from a safe distance and then vibrate its paddle as quickly as it could, taking possession of the ball like 50 times a second and just doing that forever. This was a much, much more effective strategy than actually playing the game.
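To make the arithmetic of that failure concrete, here is a minimal Python sketch of the shaping-reward setup Brian describes; the specific reward values, episode length, and function names are illustrative assumptions, not details from Teller's actual system.

```python
# Hypothetical numbers illustrating the shaping-reward exploit: possession pays
# 1/100 of a goal, so "vibrating" next to the ball can out-earn playing soccer.

GOAL_REWARD = 1.0          # reward for scoring a goal (assumed)
POSSESSION_REWARD = 0.01   # shaping reward: 1/100 of a goal per possession

def episode_return(goals_scored: int, possessions: int) -> float:
    """Total reward the learner is actually optimizing."""
    return goals_scored * GOAL_REWARD + possessions * POSSESSION_REWARD

# Intended behavior: play the game, score a goal over a 60-second episode.
intended = episode_return(goals_scored=1, possessions=5)       # 1.05

# Discovered behavior: never score, just re-take possession ~50 times a second.
exploit = episode_return(goals_scored=0, possessions=50 * 60)  # 30.0

print(intended, exploit)  # the exploit dominates, so that's what gets learned
```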

Tom Chavez:

This reminds me of a scene from Semi-Pro, if you've seen the Will Ferrell movie.

Brian Christian:

I haven't.

Tom Chavez:

Okay. Well, it's one of his best in my view. He's a semi-professional basketball player, and all he does is get on the court and have people pass the ball back and forth to him so he looks good. He doesn't ever actually score any baskets, but it looks good.

Brian Christian:

It looked good by the lights of the incentive that he had given the system, but that wasn't the behavior that he actually wanted. So I think that's a very, very classic failure mode.

Now, bringing this forward into the real world, we've seen systems like the Uber autonomous vehicle that unfortunately struck and killed a pedestrian in Arizona in 2018. This was a case of misalignment, I would argue, both with the training data and the objective function of the system. So with the training data, if you read the National Transportation Safety Board report, it seemed to heavily imply that there were no pictures of jaywalkers in the training data that the system was built on. So it really only ever expected to encounter a human being at a crosswalk. And so that's a misalignment of the distribution of the training data versus the distribution in the real world.

But then you also have this objective function misalignment, where the motion planning system for the Uber was built on top of this very classic image classification system with this cross-entropy loss of basically minimizing how often you miscategorize something. And so it's all about putting objects into the correct categories. One of these categories was bicyclist, one of these categories was pedestrian. And unfortunately, this woman, whose name was Elaine Herzberg, was walking a bicycle across the street. And so the classification system didn't know which classification to use and never gave a clear verdict to the motion planning system. I think that is a case where you have this intuitive objective, which is like, first classify the object, then determine if we need to swerve to avoid it. But that's not actually the way that it works in real life. Either of these classifications would've been plenty of reason to avoid her.

Tom Chavez:

One of the early pioneers of AI, this guy McCarthy at Stanford, had what he called the frame problem. And so he would illustrate it with this cannibals example. You have 20 people on one side of a river and there are only 10 seats in the boat, and you have to get them back and forth across the river quickly. And so people would start doing the puzzle solving, how do I get these people across the river? And then McCarthy would come in and say, "Well, why don't you just take the helicopter?" And they'd say, "Well, you didn't tell me there was a helicopter." And with a twinkle in his eye he was like, "That's the point." Machines only know what you tell them. In this case, machines only learn from the examples they're given. It's a funny kind of carrying forward of this essential AI issue that they were wrestling with in the logic-chopping, rule-based systems of the day, but it carries on today.

Brian Christian:

That's really interesting. I mean, I think humans in that similar situation would have a certain Spidey sense that the very fact that we can't identify what the object in front of us is, is already reason enough to slow down. And so you are starting to see the computer science community now trying to carry that forward and say, "Okay, can we have systems that are essentially uncertainty-aware?" They're calibrated to their own uncertainty and they can act on that basis. So hopefully we're going to be able to bring some of that forward.
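As a sketch of that uncertainty-aware idea, here is a small hypothetical planner in Python that treats low classifier confidence over obstacle-like classes as a reason to brake; the class labels, threshold, and `plan_action` function are assumptions for illustration, not how any production vehicle actually works.

```python
def plan_action(class_probs: dict, confidence_threshold: float = 0.9) -> str:
    """class_probs: hypothetical classifier output, e.g. {"pedestrian": 0.4, ...}."""
    obstacle_mass = sum(p for label, p in class_probs.items() if label != "clear_road")
    top_label, top_prob = max(class_probs.items(), key=lambda kv: kv[1])

    if obstacle_mass > 0.5 and top_prob < confidence_threshold:
        # We can't say what the object is, but the very fact that we can't
        # is already reason enough to slow down.
        return "brake_and_slow"
    if top_label != "clear_road":
        return "brake_or_swerve_for_" + top_label
    return "continue"

# An ambiguous case like a person walking a bicycle: no single confident label,
# but plenty of combined evidence that something is in the road.
print(plan_action({"pedestrian": 0.45, "bicyclist": 0.45, "clear_road": 0.10}))
# -> "brake_and_slow"
```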

Vivek Vaidya:

It's interesting. In hindsight, of course, you could change the classification algorithm, or the planning algorithm, to say: for any object you detect, as opposed to a pedestrian or a bicycle detected with this amount of certainty, you stop. But then when you're programming it, you have to anticipate all of these corner cases. Oh, it's a bicycle. It's a pedestrian. What if it's a pedestrian on a bicycle? What if a pedestrian is walking a bicycle? What if it's a bicycle just lying flat on the street? All of these things need to be taken into account. I think you're talking about misalignment that occurs not just in the training data, well, perhaps in the training data, but also in the way the rules are specified to deal with the various scenarios.

But now as we think about all the latest and greatest AI technology that's taken everything by storm, generative AI, large language models, that misalignment problem just gets exacerbated when it comes to large language models. So how does misalignment or the alignment problem manifest itself in large language models?

Brian Christian:

That's a great question. I think misalignment is, in a way, one of the most salient things that's happening right now in the large language model space. The way that these models are built is you start with essentially this pre-training stage, that's the PT of GPT, in which you have this fill-in-the-blank objective. You've got a giant corpus of text and you're just feeding it to this self-supervised system and saying, "Predict the next word, predict the next word." That alone turns out to be sufficient to train these incredibly general, incredibly powerful systems. You can frame all these different problems as essentially fill-in-the-blank problems. It's like, "Here's a paragraph of English. The corresponding French translation is colon," and let your autocomplete system essentially function as a machine translation system. And so that's already interesting from an alignment perspective, that you train something for this narrow objective, but you're using it now in this very generic, open way.
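As a toy illustration of that fill-in-the-blank objective, here is a minimal PyTorch sketch; the random tensors stand in for a real model's logits and a real tokenized corpus, and nothing here reflects any particular model's internals.

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 50_000, 8
tokens = torch.randint(0, vocab_size, (seq_len,))  # a (fake) tokenized text snippet
logits = torch.randn(seq_len - 1, vocab_size)      # a (fake) model's predictions for positions 2..seq_len

# "Predict the next word": each position's target is simply the token that follows it.
loss = F.cross_entropy(logits, tokens[1:])
print(loss)  # minimizing this, averaged over a giant corpus, is essentially all pre-training does
```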

And despite the incredible flexibility and power of just those pre-trained systems, they have a number of, at this point, quite familiar failure modes, everything from bias to toxicity. The internet is full of horrible things, and so any system trained to minimize its predictive loss on the internet is going to make these predictions about encountering that sort of speech. And so when you treat those predictions as if the system is actually writing or outputting those things, then you get all these terrible things.

I think it's also interesting to me that we use language like hallucination to talk about the fact that some of what comes out of these models is not truthful at all. But the fact that we call it hallucination is, to me, a little bit intellectually dishonest, because the system, as far as it's concerned, is just predicting a document that, as far as it knows, already exists. It doesn't think that it's writing something, but we're using it to write something.

Tom Chavez:

That's right.

Brian Christian:

And then we hold it to this standard of truthfulness, all the normative aspects of language itself.

Tom Chavez:

And as you point out in the book, and the big fancy philosophical word is, there we are anthropomorphizing the machine.

Brian Christian:

Yeah. Right.

Tom Chavez:

A hallucination is something that human beings do. Now we're saying the machine's doing it. And then it gets into this whole rich terrain. Maybe we can kick this around as well, around, "Well, what does a hallucination mean for me in my human brain?" It's a bunch of faulty crazy connections that maybe shouldn't have been made, but Van Gogh made some crazy connections and then made great art. I'm a musician. Jazz is a series of mistakes and hallucinations that people have conducted along the way to create a new possibility. So it's just a fascinating kind of topic, for me at least, around, I don't know. On the one hand, we shouldn't be anthropomorphizing the machines like that. Out of the other side of my mouth, let's not give our human brains too much credit because they are associative machines that just draw weird connections, and maybe that's the hallmark of intelligence.

In fact, let's use that to switch gears now. Let's talk a little bit about reinforcement learning from human feedback. So RLHF as it's called in Silicon Valley, by the way, I've noted recently, I think it's more prevalent than LVMH as a four letter acronym.

Brian Christian:

That's really saying something.

Tom Chavez:

Acronym, right? But the point of reinforcement learning for me, and I was wondering if you could just explain it to our listeners here, but what's interesting is that this great sort of Cambrian explosion of technology and possibility from AI was when we let these kinds of techniques unfurl and let them just kind of mimic and emulate the things that were going on in the human brain. Back to me in graduate school, like, "Oh, it's not axiomatically grounded." Who cares?

Brian Christian:

Right. Yeah.

Tom Chavez:

I mean, if you can't explain how associations work in my brain, but they kind of give us results and outcomes that we like, well maybe there's something in there. So can you explain for our listeners what reinforcement learning is all about? How it actually works?

Brian Christian:

Yeah. Reinforcement learning at the simplest level is taking sequences of actions to maximize some kind of reward. I think of reinforcement learning really as the carrot and the stick. How do you minimize punishment, maximize reward? And the fact that it's a sequence of choices is also very distinctive. That contrasts with some of the machine learning systems we've been talking about in terms of text classification, object recognition, things like that.

So there's this temporal problem of needing to do a sequence of things after which you maybe get some points or some punishment or something. And then, you have to do what's called credit assignment of, "Okay, what part of what I did was good, or what part of what I did was bad?" Reinforcement learning has a lot of connections to the sort of early, mid 20th century animal behaviorist studies. People like B.F. Skinner who were saying, "I'm going to put a rat in a maze. Can the rat figure out how to get the food pellets? Or can it figure out which lever to push?" That sort of thing.

In many ways, reinforcement learning was taking that idea and trying to create an algorithmic framework for how computer systems can learn to take these sequences of actions, whether that's moves in a chess game, or sequences of motor motions if you're in a robotics context. We might talk more about some of the connections to neuroscience, but there is a very fascinating way in which reinforcement learning has sort of rediscovered or recapitulated some of the very same mechanisms that evolution has found.
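For readers who want to see the carrot-and-stick idea in code, here is a minimal tabular Q-learning sketch on a made-up five-state chain world; the environment, the +1 reward at the goal, and all hyperparameters are illustrative assumptions. The discounted update is one simple answer to the credit-assignment problem Brian mentions: reward at the end propagates back to the earlier actions that led there.

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1   # learning rate, discount, exploration rate
ACTIONS = ("left", "right")
GOAL = 4                                 # reaching state 4 pays +1; every other step pays 0
Q = defaultdict(float)                   # Q[(state, action)] -> estimated long-run reward

def step(state, action):
    nxt = max(0, state - 1) if action == "left" else min(GOAL, state + 1)
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL

def pick(state):
    if random.random() < EPSILON:                        # occasionally explore
        return random.choice(ACTIONS)
    best = max(Q[(state, a)] for a in ACTIONS)           # otherwise exploit, breaking ties randomly
    return random.choice([a for a in ACTIONS if Q[(state, a)] == best])

for _ in range(500):                                     # episodes
    state, done = 0, False
    while not done:
        action = pick(state)
        nxt, reward, done = step(state, action)
        # Temporal-difference update: credit flows back from the rewarded final
        # step to the earlier actions that led there.
        target = reward + GAMMA * max(Q[(nxt, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (target - Q[(state, action)])
        state = nxt

print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)})  # learned policy: "right" everywhere
```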

Tom Chavez:

Right. I guess that's the point I was driving at inexpertly, Brian, but I show up in the world, free-associatively, stumbling from one decision to the next. We like to have this idea of ourselves as logic-based machines reasoning from first principles, sometimes, but usually not. What strikes me about reinforcement learning, a lot of it based on the stochastic gradient ascent algorithms I learned in graduate school in operations research, is that they're local, they're greedy algorithms. There's no uber global control. And once we let go of that idea of global control is when it seems to me it all really started to work.

Brian Christian:

I think there's also a distinction within the reinforcement learning literature between what they call model-based reinforcement learning and model-free reinforcement learning. The basic idea here is, are you explicitly thinking ahead about the effect that your actions are going to have? Are you planning? Or, in what's called a model-free context, have you just kind of internalized a certain set of habits, muscle memory? And that speaks to your point that humans are mostly not logic-based systems. Most of the work that your brain is doing in terms of regulating your body homeostatically or staying balanced as you walk around the world, and even so much of the things that we do, the habits we have, brushing our teeth or reaching for a snack or whatever it is, these are not really model-based decisions. We're not thinking ahead; we're just kind of doing it.

Tom Chavez:

That's right.

Vivek Vaidya:

It is all local optimization. Like, "I'm hungry right now. I'm going to eat this bag of chips." Globally, I know it may be bad for me, but I'm still going to eat the bag of chips right now.

Brian Christian:

Yeah. You find yourself reaching for it before you've even consciously decided.

Vivek Vaidya:

Correct.

Tom Chavez:

That's right.

Vivek Vaidya:

Correct. But we were talking about RLHF. We talked about the RL part. Where does the HF part come in?

Brian Christian:

Right. Okay. This is really the set of techniques that, as of this conversation, is considered the gold standard way to address the alignment problem in large language models. So we can create these really powerful systems that predict the missing word with an incredible level of power, and we can use them for machine translation, we can use them as our personal therapist, as our digital assistant, et cetera, et cetera. But they have these hallucination problems, bias problems, et cetera, et cetera.

So the way to go from this pure prediction system to something that's doing what we want is essentially to create a formal model of what people want. That sounds very underspecified, but there's a process for doing this. It starts with basically contract workers who are given pairs of outputs and are asked, "Which one do you like better?" And that could be in the most generic sense. It's like, "Well, this one has better grammar." Or, "This one has a more professional tone." Or, "This one says a true statement instead of a false statement." But all you do is you just ask people, "Which of these two things do you like better?" And behind the scenes, a machine learning system, a second machine learning system, which is called the reward model, has been given the objective of minimizing the prediction error of which of those two things you will prefer.

And so under the hood, it is actually creating an Elo score, a numerical score for each output of the model that is literally an Elo score, the same way that chess grandmasters have an Elo rating that predicts who's going to win in a game of chess. Every text completion is given this rating, and generally the higher rating is going to be preferred over the lower rating.

After some tens of thousands of these data points, there is this kind of magic of generalization that kicks in, and you have this reward model that can reliably, with certain footnotes, predict which model outputs will be preferred by this pool of labelers that you have. And so the RLHF part is that now that you have this formal model of human feedback, you can retrain your original system to this new objective: rather than producing outputs that are likely to have been found on the internet, you produce outputs that are likely to be preferred by your focus group.
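As a hedged sketch of that reward-model step, here is a minimal PyTorch example of the pairwise (Bradley-Terry style) loss typically described in the RLHF literature: the labeler-preferred completion should receive the higher scalar score. The tiny `reward_model` network and the random embeddings are stand-ins, not any lab's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A stand-in reward model: maps a completion embedding to a single scalar score.
reward_model = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

def preference_loss(preferred_emb, rejected_emb):
    """Maximize the log-probability that the labeler-preferred output 'wins the match'."""
    score_preferred = reward_model(preferred_emb)
    score_rejected = reward_model(rejected_emb)
    return -F.logsigmoid(score_preferred - score_rejected).mean()

# One training step on a hypothetical batch of 32 embedded completion pairs.
preferred, rejected = torch.randn(32, 768), torch.randn(32, 768)
loss = preference_loss(preferred, rejected)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# The fine-tuning stage then optimizes the language model against these learned
# scores (e.g. with an RL algorithm such as PPO) instead of against "what was
# likely to appear on the internet."
```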

Vivek Vaidya:

But then doesn't that create another sort of alignment problem where now you're aligning to your focus group?

Brian Christian:

Oh yeah. Yes, indeed.

Vivek Vaidya:

Meta alignment.

Brian Christian:

Yeah. So some of these original OpenAI papers on RLHF had what resembles a nutrition facts label that says, "These are the demographics of the people that we used to provide these labels." They're mostly, if I remember correctly, males between the ages of 25 and 34 from Bangladesh and the Philippines. And there's kind of this disclaimer that says, "We can't vouch for whether these people's preferences are your preferences." And in some ways, I mean, this is gesturing towards the entire project of politics. It's like, to what degree do certain people's preferences apply to other people?

Vivek Vaidya:

On the one hand, it's fascinating that the preferences of 25-to-34-year-old males in Bangladesh and the Philippines have been used to train AI that can produce content that is being used by the rest of the world. And on the other, it's damn scary.

Tom Chavez:

Be afraid.

Brian Christian:

Yeah. Right.

Tom Chavez:

Be very afraid.

Brian Christian:

Yeah. There's a hegemonic power to these systems, that they take a set of, in this case, values or norms.

Vivek Vaidya:

Yeah. Exactly.

Brian Christian:

And there is kind of a values sandwich. There is an effect that the Silicon Valley product managers have. So in some of these language models, they use a slightly different system called constitutional AI, where there's a set of values that are articulated as like, "We believe in fairness, we believe in this, that, and the other thing." And then the labelers are simply asked, "Which of these continuations upholds those values better or worse?" And so you have kind of a mishmash of the Silicon Valley product managers' values, the people that wrote the constitution, and then the values of the labelers who decided whether it upheld the constitution or not. So it can get pretty layered and pretty complex. But as you say, there is this fundamental philosophical and political question of whose values get enforced at scale, and then constrain what all these other groups of people can do.

Tom Chavez:

And for folks listening, look, there are the shiny advocates out there who like to swat aside these issues. Move on, nothing to see here, we're good. There are a couple of little glitches in the data and a little de-biasing to do, but we're good. We're not good. You really can't brush this stuff under the rug. These are tectonic, really core issues. And what I love about how you talk about these elements, Brian, is that it's in recognizing: is it a computer science problem, or a political science problem, or a human problem, or a philosophical, epistemological problem?

Brian Christian:

That's right.

Tom Chavez:

It's all of the above. I remember at the summit I asked you about this. As we were assembling today, I didn't know that you were a computer science and philosophy major undergrad, as was I. So it makes total sense that you're attacking these questions now from that posture. But I take a more worried view when I say you can't swat it aside. And by the way, don't leave this just to the computer scientists. I appreciate that more computer scientists are slowing their roll and being a lot more thoughtful about this, but how do we close this gap? Because you really need a renaissance engineer, or a number of renaissance engineers, to embrace the whole enchilada and think broadly about these topics. I worry that our curricula and the students who are coming out aren't as broadminded as they need to be, but maybe I'm too skeptical. How do you look at this, and how do we close the gap?

Brian Christian:

I see this as one of the fundamental human projects of the next decade. I mean, really, I think it's an all hands on deck kind of situation. And there's a question of, how do we instill a sufficient ethical or philosophical grounding into the computer scientists that are being trained? There's also the question of, it may be the case that you simply can't put all of the relevant expertise into a single person, but rather you need to think about how do we assemble diverse teams to work across those levels of expertise? One of the things I find reasonably encouraging is that AI companies at this point are actually one of the biggest hirers of philosophy PhD students.

Tom Chavez:

I can't say I knew that. That's encouraging.

Brian Christian:

Yeah. There is this kind of career path, at least at the moment, from academic philosophy departments to the ethics group at DeepMind, for example, or at Anthropic, places like that.

Tom Chavez:

Those are lucky philosophers. If they weren't going to get jobs before all this happened, that's wonderful.

Brian Christian:

Yeah, so a good friend of mine named Iason Gabriel is in the ethics group at DeepMind, and he has a PhD in philosophy.

Tom Chavez:

I've heard that name.

Brian Christian:

He had worked doing global humanitarian work, ended up ultimately finding a place at DeepMind. I'm thinking, "Okay, actually there is a sort of humanitarian perspective to thinking about the ethical issues in something like AI." So yeah, I'm encouraged that there is a role for people who have spent their career doing global development work at the UN and things like that to come on board and be part of that conversation, and literally work elbow to elbow with the engineers.

Tom Chavez:

You're much closer to this than just about anybody. Are the philosophers kept up in the attic? Or are they down on the floor with the team?

Brian Christian:

We're seeing them on the same papers. So this constitutional AI which I had mentioned, that came out of Anthropic. Amanda Askell is an ethical philosopher. She's on the paper. Iason is on the Sparrow paper, which is DeepMind's sort of ethical rules-based language model. I think that's meaningful. There is this question, I think, in any organization, AI or otherwise, of just as the organization grows, how do you prevent folks from getting siloed into different departments? I think that's just a universal human resources kind of problem. But at least for now, we are seeing literal co-authorship between the engineers, the research scientists, the ethicists, which to me is really encouraging.

Vivek Vaidya:

I think that is very encouraging. I can see a parallel with things like design and user experience and user research. That became a big thing 10, 15 years ago. Right now these philosophers are co-authors of papers. In a few years they may very well be on the product management teams, helping product managers and designers design user interfaces, figure out the right kind of text to put in when people are providing input, how to present the output, and all of those things. So that's very encouraging actually.

The other side of this governance question, if you will, is one of regulation. What do you think about that? Do you think about government coming in, tech companies regulating themselves? Where are you on that spectrum?

Brian Christian:

I wish I had a clearer view of exactly what regulation felt like it would do the job. I'm not clear on that myself. I think there is a sense that some kind of regulation is appropriate. It's much less clear to me and to most of the people in the community what shape that will actually take.

I've been very interested recently to talk with folks from the aviation industry, because I think that offers us a model of, at least to my way of thinking, a pretty successful way to deal with an extremely complex system that has lives on the line in a way that's incredibly safe. And so there is a mixture there. I'm not an expert. I've hung out with a couple of experts and tried to learn what I can. There's a mixture of regulatory groups like the FAA, but also kind of industry groups, and there's a role for the insurance sector to play here in enforcing safety norms.

There are also kind of industry partnership groups where, for example, to do code sharing. If two airlines get into a code sharing relationship, well now they have a stake in how safe each other's planes are. And so there are these third parties that will come in and audit everyone such that, "Okay, we all feel comfortable. You've been audited by the same auditor that audited me, so now we're cool to fly each other's passengers."

I'm expecting that we're going to see something like this in AI where there's going to really be... People like to use the Swiss cheese analogy of no single layer is sufficient to do everything that we want, but through some combination of insurance, government regulation, these kind of peer-to-peer incentives, industry groups, I think it will take a little bit of everything. I wish I had a clearer view of what those pieces were going to be, but it's coming.

Vivek Vaidya:

No, I think nobody does right now. We're all trying to figure it out. It's like the seven blind people with the elephant, right? We're seeing different parts and trying to say, "It's this, it's that." But I think that by putting these multidisciplinary people together, philosophers and designers and computer scientists and all of that, we can find a solution.

Brian Christian:

Yeah.

Tom Chavez:

Absolutely. Absolutely. On that forward-looking note, Brian, I really want to thank you for being with us today. This was a fun conversation.

Vivek Vaidya:

It really was. Thank you, Brian. Thank you for joining us.

Tom Chavez:

It covered all the bases.

Brian Christian:

It's been my pleasure. Thanks for having me.

Tom Chavez:

Really thought-provoking. Thanks, everyone, for joining us. Please don't forget to sign up for our newsletter. Stay up to date on the latest episodes and news at Superset.com. Thanks for listening. We'll see you all soon.

Vivek Vaidya:

Thank you.