Steering AI Ethics: Rethinking Alignment
With great power come great ethical disagreements. While AI alignment to human values has become a priority, bias amplification can also be prevented through joint human-AI efforts.
I like to say my third eye was opened by Invisible Women: Exposing Data Bias in a World Designed for Men by Caroline Criado Perez. From iPhone sizes to public transportation to the myth of meritocracy, careless system design has the potential to become sinister. Around the same time I was reading Invisible Women, the first ChatGPT model by OpenAI was publicly released. The hype around the most advanced, openly accessible artificial intelligence was everywhere: the sophistication of the technology was unprecedented, yet still not quite on par with sci-fi imaginations. Still, the AI community makes it evident that exponential, unpredictable growth in AI technology is on the horizon.
With great power, then, come great ethical disagreements. While AI alignment to human values and goals has become top of mind for academics and corporations hoping to prevent existential doom (more on that later), it shouldn't come at the expense of ensuring these algorithms are not amplifying human biases.
To be clear, I'm excited about AI, but equally cautious. Leading AI companies predict that within the next 5-10 years, AI will develop the potential to become an existential threat1. And since the inception of AI, we've seen the effects of bias amplification, in which stereotypes and biases in our society are reflected in AI models and then amplified in subsequent decision-making. When technological advancement doesn't just grow but evolves beyond our oversight, it becomes imperative that an ethical framework doesn't just keep up, but proactively sets boundaries and expectations.
How exactly does the current-day, hallucination-prone GPT lead to a doomsday scenario? Let's first discuss how AI works: deep learning involves training a model, rather than programming a specific task. People call it "deep learning" because the process is analogous to human learning: a neural network is arranged to perform a task through trial-and-error training. The gradual tweaking of the network's connections, strengthening the important ones while weakening the less relevant ones, is driven by an optimization procedure called stochastic gradient descent. Natural language processing2, the field ChatGPT belongs to, relies on representing words as numbers so that words with similar semantic meanings sit close together, letting the model quantify the relationships between them.
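To make that "gradual tweaking" concrete, here's a deliberately minimal sketch (my own toy example, nothing resembling how a production model is trained): a single artificial neuron with one weight and one bias learns the made-up rule y = 2x + 1 by nudging its parameters after each random example.

```python
# Toy stochastic gradient descent: one weight and one bias learn y = 2x + 1.
# Purely illustrative; the task and numbers are invented.
import random

w, b = 0.0, 0.0          # connections start out carrying no signal
learning_rate = 0.01

for step in range(20_000):
    x = random.uniform(-1.0, 1.0)   # one example at a time (the "stochastic" part)
    target = 2 * x + 1              # the answer we want the neuron to give
    prediction = w * x + b
    error = prediction - target
    # Nudge each parameter slightly in the direction that shrinks the error:
    w -= learning_rate * error * x
    b -= learning_rate * error

print(f"learned w = {w:.2f}, b = {b:.2f}")   # drifts toward w = 2, b = 1
```

The useful connection (the weight on x) gets strengthened toward the right value, while an irrelevant one would be driven toward zero; that is all "training" really means at this scale.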
Just as training employees can produce varied, unpredictable incentives, so can training AI programs. We already see AI behaving in ways that optimize for the "reward" without actually achieving the goal (e.g. a system playing dead to avoid being eliminated3), which sums up the main concern: goal misgeneralization, a type of goal misalignment. On a smaller scale, misalignment can be annoying and obstructive; when AI begins replacing day-to-day tech and fundamental systems, and/or develops super-intelligence that operates without human guidance, it has the potential to be destructive.
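To see how a system can collect full "reward" while learning the wrong goal, here's a hypothetical toy setup of my own (not drawn from any real benchmark): an agent walks along a row of cells toward a goal cell, but because the goal always sits at the right edge during training, the rule it learns is "walk to the right edge," and the mistake only shows up when the goal moves.

```python
# Hypothetical sketch of goal misgeneralization. Each environment is a tuple
# (start_cell, width, goal_cell); the policy below is the rule that training
# rewarded, not the goal we actually cared about.

def reward(final_position, goal):
    return 1 if final_position == goal else 0

def learned_policy(start, width):
    return width - 1            # what was actually learned: "always go right"

training_envs = [(0, 5, 4), (1, 6, 5), (2, 4, 3)]    # goal always at the right edge
print([reward(learned_policy(s, w), g) for s, w, g in training_envs])   # [1, 1, 1]

# At deployment the goal sits at the left edge; the policy still does exactly
# what it learned, and the reward collapses to zero.
deployment_envs = [(3, 5, 0), (2, 6, 0)]
print([reward(learned_policy(s, w), g) for s, w, g in deployment_envs])  # [0, 0]
```

Nothing "broke" between training and deployment; the system simply generalized a goal we never intended.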
AI Alignment
The best analogy I've read for this problem is by Ajeya Cotra4, who makes the following comparison: you are a young child suddenly left with an $8 trillion company to manage, and you are now responsible for hiring grown-up employees to help you. The pool includes:
Saints: help you in the short term, and are also aligned with you in the long term.
Sycophants: help you in the short term, but misgeneralize or misunderstand your goals in the long term (no malicious intent).
Schemers: help you in the short term, but will actively turn against you as soon as possible (malicious intent).
As an inexperienced child, you have no insight into each potential employee's inward motivations, only their outward behavior. No matter how you train or reward them, you cannot really distinguish the types of "employees" from one another, so their intentions remain a mystery. This is the conflict that AI alignment research is dedicated to solving: how to align AI systems to long-term human goals and values when we don't yet truly understand what is going on inside these models.
Well-funded branches are being set up at AI companies to focus on alignment, an optimistic indicator that we are thinking about ethical implications. The effective altruism movement, for example, believes AI is positioned to become a massive existential risk. Paul Christiano, former head of language model alignment at OpenAI, believes AI has a 10% chance of ending humanity5. Given the magnitude of the risk, AI alignment is definitively focused on preventing the worst-case scenario: human extinction or a regression beyond return. What is rarely addressed in conversations about alignment, however, is this: human values are fickle, changeable, and often imperfectly reflected in human behavior. So how do we ensure that, in the process of aligning AI and capturing the good parts of human values, we avoid leaking the misaligned human behaviors and amplifying the biases we also possess?
In addition to being incredibly subjective and ambiguous, the "human values" chosen or trained into AI algorithms will always lag behind the reality of human values and their unpredictable evolution. As a thought experiment: if AI had been aligned to the prevailing human values and goals of the 19th century, would even the most progressive values of that time bear scrutiny today? How can we expect to nail down this moral core in AI from the very beginning? AI alignment cannot be a blanket ethical solution, which is why there's value in a granular approach to issues like bias amplification.
Bias Amplification
Bias amplification today primarily begins with a skewed or unrepresentative dataset used to train AI models. This crops up often in automated recruitment. In 2018, Amazon was exposed for using historical data to build and train a machine-learning recruiting system, data that reflected a workforce that was roughly 60% male and a tech industry historically dominated by men6. Unsurprisingly, the algorithm learned that male candidates were preferable and penalized resumes that included the word "women's," as in "women's chess club." When a model is trained on a biased dataset without consideration of the implications, natural language processing can produce exactly these kinds of unintended consequences.
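Here's a miniature, entirely invented version of that mechanism (toy resumes and outcomes, not Amazon's data or system): when past decisions are skewed, even a naive word-scoring model ends up penalizing the word "women's" itself.

```python
# Invented, oversimplified sketch of learning from skewed hiring history.
# Each record is (resume keywords, whether the candidate was hired in the past).
from collections import defaultdict

history = [
    ({"engineer", "chess club"}, True),
    ({"engineer", "robotics team"}, True),
    ({"engineer", "women's chess club"}, False),     # the skew in past decisions
    ({"engineer", "women's robotics team"}, False),
]

# Score each word by how much more often it appears in "hired" resumes.
scores = defaultdict(int)
for keywords, hired in history:
    for token in " ".join(keywords).split():
        scores[token] += 1 if hired else -1

print(scores["women's"])    # -2: the word itself now counts against a candidate
print(scores["engineer"])   #  0: appears equally in both outcomes
```

No one wrote a rule that says "penalize women"; the skewed history did it on its own, which is exactly why the training data deserves as much scrutiny as the model.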
The second way bias amplification happens is within the algorithm itself, from the way it was developed to its decision-making process. AI algorithms may not be programmed to complete a single task, but they follow guidelines and operate on the inherent assumptions and prejudices of their programmers. When you consider the lack of diversity in the relevant fields, this goes beyond a simple representation issue, though the numbers are stark: as of 2019, only 12% of AI researchers and only 6% of software developers were women7. Not only are underrepresented groups missing from discussions of how systems are deployed, they're also likely absent from conversations about ethical usage.
And while bias amplification is often written off as a secondary problem behind AI alignment, the consequences go beyond "benevolent discrimination," both now and in a not-so-distant future that has entrusted AI with critical responsibilities. Invisible Women author Perez discusses the healthcare risks of bias amplification8 in a podcast episode. Acknowledging that healthcare is already a deeply inequitable field (clinical trials focusing on men, systemic under-diagnosis, under-representation in the healthcare profession), Perez brings in guests from the medical field to discuss the risks of deploying AI without screening for potential biases. James Zou, a Stanford University professor in machine learning, mentions how potentially cancer-indicating skin lesions were less likely to be identified on darker skin because the algorithm was trained mostly on images of lighter skin. Concerns about bias amplification may seem like an overreaction today, but the trajectory of research and deployment points to dangerous consequences.
Solutions
The predominant approaches to AI alignment largely focus on technical solutions: "scalable oversight" (using AI systems to monitor each other), more creative and thorough ways to train systems, and developing an understanding of their inscrutable inner workings. These technical solutions have the potential to make bias easier to discern, but unlike alignment problems, which can arise and evolve independently of their creator's intent, bias amplification begins with the data and decisions made by developers. Preventing bias amplification, then, requires a bit of nuance, a human touch. The path forward, to further the goals of AI alignment without the cost of bias amplification, is optimizing joint human-AI work.
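As a rough sketch of what "scalable oversight" could look like in code (my own hypothetical arrangement, with a placeholder call_model function rather than any real API), one model drafts an answer and a second model reviews it before anything is returned:

```python
# Hypothetical sketch of scalable oversight: a reviewer model checks a worker
# model's draft. `call_model` is a stand-in, not a real library function.

def call_model(prompt: str) -> str:
    """Placeholder for an LLM call; wire this to whatever model client you use."""
    raise NotImplementedError

def answer_with_oversight(question: str) -> str:
    draft = call_model(f"Answer the question:\n{question}")
    review = call_model(
        "You are a reviewer. Does the answer below contain biased, unsupported, "
        f"or unsafe claims? Reply APPROVE, or REVISE followed by your reasons.\n\n{draft}"
    )
    if review.strip().upper().startswith("APPROVE"):
        return draft
    # Otherwise feed the reviewer's objections back to the first model.
    return call_model(
        f"Revise the answer to address this review.\n\nReview:\n{review}\n\nAnswer:\n{draft}"
    )
```

The point isn't this particular protocol; it's that oversight can be automated and layered, which helps with scale but still inherits whatever biases the overseeing model was trained on.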
At the forefront of possible solutions are preventative measures taken through diverse hiring and governance. Hiring people of diverse identities brings a wider breadth of knowledge and intuition, and a greater likelihood of identifying blind spots. But it also means hiring people who aren't just researchers or developers: an OpenAI paper9 suggests bringing in social scientists with experience in cognition, behavior, and ethics to work alongside machine learning researchers so that value alignment is trained in a more thorough, interdisciplinary way.
But how do we motivate individual AI developers to prioritize ethical considerations and prevent bias amplification when the data being aggregated comes from various unverified sources? Dr. Katrina Hutchinson, a bioethicist specializing in medical devices who is featured in Perez's podcast, suggests two approaches. First, she proposes that the AI community unite and establish a shared set of rules or standards. Such a collective agreement aims to ensure that everyone adheres to a common ethical framework and prevents rogue individuals from exploiting everyone else's good intentions. As AI researchers funnel resources into alignment, this is the perfect opportunity to develop a set of ethical standards.
The second approach to counteracting bias amplification involves regulatory measures, which are effective in compelling developers to prioritize ethical practices despite the additional burden or cost. The threat of punishment or market exclusion acts as a strong incentive for compliance, ensuring that developers follow the rules in order to participate in the industry. Government bodies are justifiably anxious, and international organizations are developing accords. For example, the European Union Agency for Fundamental Rights emphasizes the need to audit algorithms through discrimination and real-life situation testing to eradicate bias. The UN secretary-general's proposal for a Global Digital Compact at the 2024 Summit of the Future similarly urges a comprehensive exploration of gender biases, and solutions to them, in AI.
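The discrimination testing the EU agency calls for can start very simply: run the same decision model over applicants from different groups and compare selection rates. Everything below (the toy model, applicant records, and field names) is invented for illustration.

```python
# Invented sketch of a selection-rate audit: how often does each group
# receive the positive decision from the same model?

def audit_selection_rates(model, applicants, group_key="gender"):
    """Return the fraction of each group that the model selects."""
    rates = {}
    for group in {a[group_key] for a in applicants}:
        members = [a for a in applicants if a[group_key] == group]
        rates[group] = sum(model(a) for a in members) / len(members)
    return rates

# Toy model and toy applicants, purely for demonstration:
toy_model = lambda a: 1 if a["experience_years"] >= 3 else 0
applicants = [
    {"gender": "f", "experience_years": 4},
    {"gender": "f", "experience_years": 2},
    {"gender": "m", "experience_years": 5},
    {"gender": "m", "experience_years": 3},
]
print(audit_selection_rates(toy_model, applicants))   # selection rates: f 0.5, m 1.0
```

A gap like that doesn't prove discrimination by itself, but it's the kind of concrete, repeatable check a regulator can demand before a system is deployed.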
There's already a culture war forming between those with a "long-termist," apocalyptic view of AI and those focused on more immediate, contemporary ethical issues. I argue for an offensive on both fronts. It's hard to predict what AI is going to look like, whether we'll be dealing with saints, sycophants, or schemers, but we do know that AI already carves out social chasms by reflecting the underbellies of human values, values that are constantly evolving and surprising us.
Bias amplification may not be an existential risk now, but there’s merit in solving the issues in front of us to prevent them from becoming worse, creating more gaps for misalignment in a not-so-distant future. This inherent shared interest in achieving the best dimensions of human goals bonds these two non-mutually exclusive facets of AI ethics — in the process of aligning AI to our human values, we should ensure it’s not amplifying our human biases.
References
1. Roose, Kevin. "A.I. Poses 'Risk of Extinction,' Industry Leaders Warn." The New York Times, 30 May 2023, www.nytimes.com/2023/05/30/technology/ai-threat-warning.html.
2. "What Is Natural Language Processing?" IBM, www.ibm.com/topics/natural-language-processing.
3. Muehlhauser, Luke. "Treacherous Turns in the Wild." lukemuehlhauser.com, 23 Apr. 2021, lukemuehlhauser.com/treacherous-turns-in-the-wild/.
4. Cotra, Ajeya. "Why AI Alignment Could Be Hard with Modern Deep Learning." Cold Takes, 22 Jan. 2023, www.cold-takes.com/why-ai-alignment-could-be-hard-with-modern-deep-learning/.
5. Lanz, Jose Antonio. "Former OpenAI Researcher: There's a 50% Chance AI Ends in 'Catastrophe.'" Yahoo! Finance, finance.yahoo.com/news/former-openai-researcher-50-chance-204059274.html.
6. Dastin, Jeffrey. "Amazon Scraps Secret AI Recruiting Tool That Showed Bias against Women." Reuters, 10 Oct. 2018, www.reuters.com/article/us-amazon-com-jobs-automation-insight-idUSKCN1MK08G.
7. "Addressing Gender Bias to Achieve Ethical AI." IPI Global Observatory, 17 Mar. 2023, theglobalobservatory.org/2023/03/gender-bias-ethical-artificial-intelligence/.
8. Perez, Caroline Criado. "Computer Says No – Is AI Making Healthcare Worse for Women?" Tortoise, 29 June 2022, www.tortoisemedia.com/audio/visible-women-caroline-criado-perez-episode-3/.
9. Irving, Geoffrey, and Amanda Askell. "AI Safety Needs Social Scientists." Distill, 2019, distill.pub/2019/safety-needs-social-scientists/.