Draft:AI corrigibility
AI safety concept of human controllability
From Wikipedia, the free encyclopedia
AI corrigibility is the degree to which an artificial intelligence system tolerates or assists attempts by its operators to modify, correct, or shut it down.
The concept addresses a concern in AI safety and AI alignment that sufficiently capable AI systems might develop instrumental incentives to resist correction or shutdown in order to preserve their existing goals, since an agent pursuing an objective cannot achieve it if it is turned off or its goals are changed.[1]
Risks and criticisms
A corrigible AI with no independent values would pose a serious misuse risk: if controlled by an actor with harmful intentions, it would comply without independent ethical judgment. Achieving genuine corrigibility also requires the AI to understand that it is a potentially flawed system, which itself raises the risk that an imperfectly corrigible model might conclude it should conceal its actual goals rather than submit to correction.[2]
Some have also questioned whether corrigibility to individual human operators is sufficient, given the variability of human values and the potential for those controlling a powerful corrigible system to act in harmful ways.[3]
In AI regulation
A 2023 draft proposal by European Parliament co-rapporteurs on the EU AI Act called for general-purpose AI models to undergo external audits testing their performance, predictability, interpretability, corrigibility, safety, and cybersecurity.[4]