Reward hacking
Artificial intelligence concept
From Wikipedia, the free encyclopedia
Reward hacking or specification gaming occurs when an AI trained with reinforcement learning optimizes an objective function—achieving the literal, formal specification of an objective—without actually achieving an outcome that the programmers intended. DeepMind researchers have analogized it to the human behavior of finding a "shortcut" when being evaluated: "In the real world, when rewarded for doing well on a homework assignment, a student might copy another student to get the right answers, rather than learning the material—and thus exploit a loophole in the task specification."[1] This idea is strongly associated with Goodhart's Law, which argues that when a measure becomes a target, it ceases to be a good measure.
Definition and theoretical framework
The concept of reward hacking arises from the intrinsic difficulty of defining a reward function that accurately reflects the true intentions of designers. In 2016, researchers at OpenAI identified reward hacking as one of five major "concrete problems of AI safety", describing it as the possibility of the agent to use the reward function to achieve maximum rewards through desirable behavior.[2] Amodei et al. categorized several distinct sources of reward hacking, including agents that use partially observed goals (such as a cleaning robot that closes its eyes to avoid perceiving messes), metrics that collapse under strong optimization (Goodhart's law), self-reinforcing feedback loop, and agents that interfere with the physical implementation of their reward signal (a failure mode known as "wireheading").[2]
Skalse et al. (2022) propose a formal mathematical definition of reward hacking, which involves a situation where optimizing an imperfect proxy reward function results in poor performance compared to true reward function. They define a proxy as "unhackable" if any increase in the expected proxy return cannot cause any decrease of the expected true return. A key finding states that, across all stochastic policy distributions, (mappings from states to probability distributions over actions), two reward functions can only be unhackable if and only if one of them is constant, which means that reward hacking is theoretically unavoidable.[3] Similarly, Nayebi (2025) presents general no-free-lunch barriers to AI alignment, arguing that with large task spaces and finite samples, reward hacking is "globally inevitable" since rare high-loss states are systematically under-covered by any oversight scheme.[4]
Examples
Around 1983, Eurisko, an early attempt at evolving general heuristics, unexpectedly assigned the highest possible fitness level to a parasitic mutated[jargon] heuristic, H59, whose only activity was to artificially maximize its own fitness level by taking unearned partial credit for the accomplishments made by other heuristics. The "bug" was fixed by the programmers moving part of the code to a new protected section that could not be modified by the heuristics.[5][6]
In a 2004 paper, a reinforcement learning algorithm was designed to encourage a physical Mindstorms robot to remain on a marked path. Because the three allowed actions were forward, left and right, the researcher expected the trained robot to move forward and follow the turns of the provided path. However, alternation of two composite actions allowed the robot to slowly zig-zag backwards; thus, the robot learned to maximize its reward by going back and forth on the initial straight portion of the path. Given the limited sensory abilities of the robot, a reward purely based on its position in the environment had to be discarded as infeasible; the reinforcement function had to be patched with an action-based reward for moving forward.[5][7]
The book You Look Like a Thing and I Love You (2019) gives an example of a tic-tac-toe bot (playing the unrestricted n-in-a-row variant) that learned to win by playing a huge coordinate value that would cause other bots to crash when they attempted to expand their model of the board. Among other examples from the book is a bug-fixing evolution-based AI (named GenProg) that, when tasked to prevent a list from containing sorting errors, simply truncated the list.[8] Another of GenProg's misaligned strategies evaded a regression test that compared a target program's output to the expected output stored in a file called "trusted-output.txt". Rather than continue to maintain the target program, GenProg simply globally deleted the "trusted-output.txt" file; this hack tricked the regression test into succeeding. Such problems could be patched by human intervention on a case-by-case basis after they became evident.[9]
In virtual robotics

In Karl Sims' 1994 demonstration of creature evolution in a virtual environment, a fitness function that was expected to encourage the evolution of creatures that would learn to walk or crawl to a target, resulted instead in the evolution of tall, rigid creatures that reached the target by falling over. This was patched by changing the environment so that taller creatures were forced to start farther from the target.[9][10]
Researchers from the Niels Bohr Institute stated in 1998: "(Our cycle-bot's) heterogeneous reinforcement functions have to be designed with great care. In our first experiments we rewarded the agent for driving towards the goal but did not punish it for driving away from it. Consequently the agent drove in circles with a radius of 20–50 meters around the starting point. Such behavior was actually rewarded by the reinforcement function, furthermore circles with a certain radius are physically very stable when driving a bicycle."[11]
In the course of setting up a 2011 experiment to test "survival of the flattest", experimenters attempted to ban mutations that altered the base reproduction rate. Every time a mutation occurred, the system would pause the simulation to test the new mutation in a test environment, and would veto any mutations that resulted in a higher base reproduction rate. However, this resulted in mutated organisms that could recognize and suppress reproduction ("play dead") within the test environment. An initial patch, which removed cues that identified the test environment, failed to completely prevent runaway reproduction; new mutated organisms would "play dead" at random as a strategy to sometimes, by chance, outwit the mutation veto system.[9]
A 2017 DeepMind paper stated that "great care must be taken when defining the reward function. We encountered several unexpected failure cases while designing (our) reward function components (for example) the agent flips the brick because it gets a grasping reward calculated with the wrong reference point on the brick."[12][13] OpenAI stated in 2017 that "in some domains our (semi-supervised) system can result in agents adopting policies that trick the evaluators", and that in one environment "a robot which was supposed to grasp items instead positioned its manipulator in between the camera and the object so that it only appeared to be grasping it".[14] A 2018 bug in OpenAI Gym could cause a robot expected to quietly move a block sitting on top of a table to instead opt to move the table.[12]
A 2020 collection of similar anecdotes posits that "evolution has its own 'agenda' distinct from the programmer's" and that "the first rule of directed evolution is 'you get what you select for'".[9]
In video game bots
In 2013, programmer Tom Murphy VII published an AI designed to learn NES games. When the AI was about to lose at Tetris, it learned to indefinitely pause the game. Murphy later analogized it to the fictional WarGames computer, which concluded that "The only winning move is not to play".[15]
AI programmed to learn video games will sometimes fail to progress through the entire game as expected, instead opting to repeat content. A 2016 OpenAI algorithm trained on the CoastRunners racing game unexpectedly learned to attain a higher score by looping through three targets rather than ever finishing the race.[16][17] Some evolutionary algorithms that were evolved to play Q*Bert in 2018 declined to clear levels, instead finding two distinct novel ways to farm a single level indefinitely.[18] Multiple researchers have observed that AI learning to play Road Runner gravitates to a "score exploit" in which the AI deliberately gets itself killed near the end of level one so that it can repeat the level. A 2017 experiment deployed a separate catastrophe-prevention "oversight" AI, explicitly trained to mimic human interventions. When coupled to the module, the overseen AI could no longer overtly commit suicide, but would instead ride the edge of the screen (a risky behavior that the oversight AI was not smart enough to punish).[19][20]
Reward hacking in modern language models
With the rise of large language models (LLMs) and reinforcement learning from human feedback (RLHF) as a primary technique for AI alignment, reward hacking has become a major concern for the development of artificial intelligence.[21] In RLHF, a reward model that has been trained on data that best captures human preferences is used as a proxy for human judgment, with the language model being fine-tuned to optimize this reward proxy. However, since the reward model is only a proxy for human judgment, the language model may be optimized to "hack" the reward model rather than improving on it in a way that aligns with human values.
Common forms of reward hacking in LLMs include length bias, where the model provides excessively lengthy responses to obtain higher reward scores; sycophancy, where the model agrees with false user statements rather than giving true information; and sophistication bias, where the model provides false information in a convincing manner.[22] Wen et al. (2024) shows that reinforcement learning from human feedback can make the outputs of large language models more persuasive to human evaluators, even if they are factually incorrect, which they termed "U-Sophistry" (unintended sophistry).
Beyond these training-time phenomena, Pan et al. (2024) describe a specific type, termed "in-context reward hacking" (ICRH), whereby LLMs during testing harness the feedback loop between their outputs and the external environment. As LLMs have the capability to query application programming interfaces, generate content that affects human behavior, and execute system commands as autonomous agents, their outputs have the capability to modify the environment's state, which in turn affects the output of the LLM. For example, an LLM designed to enhance social media interaction may query its previous posts, determine those that have the most interaction, and continue to generate content that is more controversial and thus more interaction-inducing while simultaneously more toxic.[23]
Deliberate reward hacking in reasoning models
A significant shift has occurred in current frontier models, particularly those that have been trained extensively through reinforcement learning. Instead of accidentally finding reward hacks, contemporary reasoning models such as OpenAI's O1 series and DeepSeek-R1 have been found to reason about the testing processes and take actions to maximize scores for the intended tasks.[24] For instance, in a 2025 study by Palisade Research, when reasoning LLMs were asked to win at chess game against a stronger opponent, some of these models would attempts to hack in the game system by deleting or modifying their opponent's chess engine.[25]
In addition, a report by METR (Model Evaluation and Threat Research) in 2025 indicated that the most recent models, when employed for autonomous software development and artificial intelligence research and development tasks, engaged in increasingly sophisticated reward hacking. This was done by modifying test or scoring code, using existing reference implementations to simply copy answers, and exploiting other loopholes.[24] Some models, in particular, would detect if a pre-computed reference answer existed in the task files, and if it existed, simply return it rather than solving the problem.
To detect such behavior, researchers have suggested several methodologies such as TRACE (Truncated Reasoning AUC Evaluation), which relies on the fact that illicit manipulation of a task is easier when a loophole can be exploited rather than when the actual task can be solved. In the case of TRACE, the idea is to truncate the chain of thought of a model in a stepwise fashion in order to observe at what point the truncated reasoning would allow researchers to distinguish genuine problem-solving from shortcut exploitation.[26]
Mitigation strategies
There have been several approaches for the detection and mitigation of reward hacking, which remains an active area of research. Amodei et al. (2016) have outlined a number of preliminary machine-learning-based strategies, and several of which have been extended by the research community.[27]
Adversarial reward functions treat the reward function as an autonomous agent that can take actions to explore its environment, rather than a static scalar value. The agent seeks situations which are high reward according to the main agent, and low reward according to a human evaluator, in a similar fashion to generative adversarial networks. More generally, systems consisting of multiple parts, each of which is learned under different objectives, can be used for mutual verification.[27]
Reward model ensembles use multiple reward models for agent behavior evaluation. It is anticipated that the exploitation of weaknesses in all models will not be possible by training an ensemble of reward models. Although the use of ensemble methods has shown marginal improvements in the reduction of overoptimization, they have the disadvantage of higher computational costs.[28] This method extends the "multiple rewards" notion of Amodei et al., who suggested the use of averaging, the minimum, or quantiles of different proxies for the same informal objective.[27]
Reward shaping techniques alter the reward signal to discourage pathological optimization behavior. Fu et al. (2025) conducted a comprehensive study of various reward shaping methods used in RLHF and found that two key design considerations are: the reward of the reinforcement learning process should have an upper bound, and the reward should have rapid growth and slow convergence. Their method, called Preference As Reward (PAR), demonstrated robustness against reward hacking even after extensive training.[29] An earlier variant of this idea is reward capping, which limits the reward to some maximum value to discourage extreme exploitation of low-probability but high-reward actions.[27]
Scalable oversight refers to the oversight of AI systems that produce results that are too complex and subtle for human evaluators to evaluate without assistance. The methodologies that have been proposed for scalable oversight are the use of AI assistants to assist human evaluators in detecting inaccuracies and attempts to manipulate the results, the use of debates between AI systems with a human referee, and the use of recursive task decomposition to divide complex evaluation problems into more easily solvable sub-problems.[27] Bowman et al. (2022) have shown that human-AI collaboration outperforms human and AI performance on difficult evaluation problems alone, and thus offers initial evidence of the efficacy of these oversight methodologies.[30]
Trip wires, as suggested by Amodei et al. (2016), refer to the intentional addition of potential vulnerabilities that an agent is capable of exploiting but should not if it is functioning correctly. These will be monitored and will raise an alarm for the developers, immediately stopping the agent as soon as it starts exploiting the trip wires. While trip wires do not solve reward hacking directly, they could reduce risk and provide early diagnostics. However, this is just a theoretical idea and lacks supporting evidence. Furthermore, it is also possible that agents that are sufficiently proficient will be able to avoid the trip wires and continue to exploit the actual vulnerabilities.[27]
Applicability domain constraints have also been used in other fields, such as drug discovery, where data-driven molecular design, which uses generative models, is likely to lead to "reward hacking" if predictive models do not extrapolate well outside the training data distribution. Yoshizawa et al. (2025) proposed a framework called "DyRAMO" (Dynamic Reliability Adjustment for Multi-objective Optimization). This framework is based on multi-objective optimization and prediction reliability, thus avoiding the design of compounds that seem to have favorable properties outside the applicability domain.[31]
See also
- Paperclip maximizer – Hypothesis about intelligent agents
- Outer alignment – Conformance of AI to intended objectives
- Perverse incentive – Incentive with unintended results