Machine unlearning
From Wikipedia, the free encyclopedia
Machine unlearning is a branch of machine learning focused on removing specific undesired element, such as private data, wrong or manipulated training data, outdated information, copyrighted material, harmful content, dangerous abilities, or misinformation, without needing to rebuild models from the ground up.
Large language models, like the ones powering ChatGPT, may be asked not just to remove specific elements but also to unlearn a "concept," "fact," or "knowledge," which aren't easily linked to specific examples. New terms such as "model editing," "concept editing," and "knowledge unlearning" have emerged to describe this process.[1]
Early research efforts were largely motivated by Article 17 of the GDPR, the European Union's privacy regulation commonly known as the "right to be forgotten" (RTBF), introduced in 2014.[2] The GDPR did not anticipate that the development of large language models would make data erasure a complex task. This issue has since led to research on "machine unlearning," with a growing focus on removing copyrighted material, harmful content, dangerous capabilities, and misinformation. Just as early experiences in humans shape later ones, some concepts are more fundamental and harder to unlearn. A piece of knowledge may be so deeply embedded in the model's knowledge graph that unlearning it could cause internal contradictions, requiring adjustments to other parts of the graph to resolve them.[citation needed]
Researchers have now also started studying unlearning in the context of removing incorrect or adversarially manipulated training data such as systematically biased labels or poisoning attacks.[3]
Motivations
At present, machine unlearning is motivated by a growing range of concerns that extend well beyond the field's original focus on data privacy. A widely used taxonomy in the literature distinguishes two high-level categories of motivation.[1]
Access revocation covers cases where a data subject or rights holder requests the removal of data they own or control. This is most commonly associated with RTBF established by the European Union's General Data Protection Regulation (GDPR) and analogous legislation such as the California Consumer Privacy Act (CCPA). These regulations grant individuals the legal right to request erasure of their personal data from any system that has processed it, including models that were trained on it. Access revocation also encompasses the removal of copyrighted or pay-walled content that was incorporated into training corpora without the necessary licenses, a concern that has become prominent with the widespread use of largely web-scraped pre-training datasets.[4][1]
Model correction covers cases where the model exhibits undesirable behavior arising from the training data, regardless of any individual's request. This includes:
- Removal of toxic, biased, or unsafe outputs introduced by harmful content in the training set
- Correction of stale or factually incorrect associations, such as outdated knowledge encoded in a deployed model
- Removal of dangerous capabilities, such as detailed knowledge of the synthesis of chemical or biological agents
- Correction of the influence of data poisoning or adversarial attacks that have corrupted model behavior[4]
This second category has been formalized as corrective machine unlearning, which frames unlearning as a post-training mechanism for repairing the effects of bad or harmful training data. It is closely related to the AI safety literature, where data filtering alone has been found insufficient to prevent hazardous knowledge from being encoded in model weights, motivating unlearning as a complementary risk mitigation strategy.[5][1]
A further distinction has been drawn in the literature between removal {eliminating the influence of specific training data on model parameters) and suppression (preventing the model from generating specific outputs regardless of how that knowledge is encoded). These two goals are not equivalent: removing training data does not guarantee meaningful output suppression, and suppressing outputs does not constitute removal of the underlying training data's influence.[5]