Machine unlearning
Machine unlearning is a branch of machine learning focused on removing specific undesired elements from a trained model, such as private data, incorrect or manipulated training data, outdated information, copyrighted material, harmful content, dangerous capabilities, or misinformation, without needing to rebuild the model from the ground up.
Large language models, like the ones powering ChatGPT, may be asked not just to remove specific elements but also to unlearn a "concept," "fact," or piece of "knowledge," which are not easily linked to specific training examples. New terms such as "model editing," "concept editing," and "knowledge unlearning" have emerged to describe this process.[1]
History
Early research efforts were largely motivated by Article 17 of the European Union's General Data Protection Regulation (GDPR), which codified the "right to be forgotten" (RTBF), a right first recognized in a 2014 ruling of the Court of Justice of the European Union.[2] The GDPR did not anticipate that the development of large language models would make data erasure a complex task. This issue has since led to research on "machine unlearning," with a growing focus on removing copyrighted material, harmful content, dangerous capabilities, and misinformation. Just as early experiences in humans shape later ones, some concepts are more fundamental and harder to unlearn. A piece of knowledge may be so deeply embedded in the model's knowledge graph that unlearning it could cause internal contradictions, requiring adjustments to other parts of the graph to resolve them.[citation needed]
Researchers have now also started studying unlearning in the context of removing incorrect or adversarially manipulated training data such as systematically biased labels or poisoning attacks.[3]
Motivations
At present, machine unlearning is motivated by a growing range of concerns that extend well beyond the field’s original focus on data privacy. A widely used taxonomy in the literature distinguishes two high-level categories of motivation.[1]
Access revocation covers cases where a data subject or rights holder requests the removal of data they own or control. This is most commonly associated with the RTBF established by the European Union's General Data Protection Regulation (GDPR) and analogous legislation such as the California Consumer Privacy Act (CCPA). These regulations grant individuals the legal right to request erasure of their personal data from any system that has processed it, including models that were trained on it. Access revocation also encompasses the removal of copyrighted or pay-walled content that was incorporated into training corpora without the necessary licenses, a concern that has become prominent with the widespread use of largely web-scraped pre-training datasets.[4][1]
Model correction covers cases where the model exhibits undesirable behavior arising from the training data, regardless of any individual’s request. This includes:
- Removal of toxic, biased, or unsafe outputs introduced by harmful content in the training set
- Correction of stale or factually incorrect associations, such as outdated knowledge encoded in a deployed model
- Removal of dangerous capabilities, such as detailed knowledge of the synthesis of chemical or biological agents
- Correction of the influence of data poisoning or adversarial attacks that have corrupted model behavior[4]
This second category has been formalized as corrective machine unlearning, which frames unlearning as a post-training mechanism for repairing the effects of bad or harmful training data. It is closely related to the AI safety literature, where data filtering alone has been found insufficient to prevent hazardous knowledge from being encoded in model weights, motivating unlearning as a complementary risk mitigation strategy.[5][1]
A further distinction has been drawn in the literature between removal (eliminating the influence of specific training data on model parameters) and suppression (preventing the model from generating specific outputs regardless of how that knowledge is encoded). These two goals are not equivalent: removing training data does not guarantee meaningful output suppression, and suppressing outputs does not constitute removal of the underlying training data’s influence.[5]
SISA Training
SISA (Sharded, Isolated, Sliced, and Aggregated) is a training strategy consisting of four mechanisms designed to make machine unlearning more efficient by structuring how models are trained and updated. Its goal is to allow a system to remove the influence of specific data points without retraining an entire model from scratch. By reorganizing training data and workflows, SISA reduces the computational burden of unlearning requests.[6]
Sharding divides the training dataset into multiple disjoint subsets, or shards. Each shard is used to train a separate model instance. This ensures that a single data point affects only one shard, so unlearning it requires updating only the corresponding shard rather than the full model.
Isolation refers to training each shard independently, with nothing shared across shards during the training process. This separation prevents cross-contamination between shards, ensuring that forgetting data in one shard does not require adjustments to any others.
Slicing breaks the data within each shard into sequential slices and stores model states after each slice is trained on. When an unlearning request targets a piece of data, the system can roll back to the checkpoint before the point was seen and retrain only from that slice forward. This reduces retraining time even within a shard.
Aggregation occurs at inference time, when the model is queried: the outputs of the individual shard models are combined, typically through majority voting or averaging, to produce the overall prediction. This allows SISA-trained systems to behave like a single model despite being composed of multiple shard-level models.
Together, these mechanisms enable machine learning systems to forget specific data points at far lower computational cost than full retraining. The trade-off is that sharding and slicing can reduce model accuracy, worsen generalization, and increase storage requirements for the intermediate checkpoints. These costs may be acceptable for individuals or organizations that need to comply with the "right to be forgotten" or to recover efficiently from backdoor attacks.
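The four mechanisms above can be sketched end to end. The toy implementation below (the `ShardModel` class, the nearest-centroid learner, and all sizes are illustrative assumptions, not details from the SISA paper) shows sharding, slicing with per-slice checkpoints, rollback-based unlearning, and majority-vote aggregation:

```python
import numpy as np

class ShardModel:
    """Toy incremental learner (nearest-centroid) with per-slice checkpoints."""
    def __init__(self, n_classes, n_features):
        self.sums = np.zeros((n_classes, n_features))
        self.counts = np.zeros(n_classes)
        self.checkpoints = []  # model state saved after each slice (Slicing)

    def train_slice(self, X, y):
        for xi, yi in zip(X, y):
            self.sums[yi] += xi
            self.counts[yi] += 1
        self.checkpoints.append((self.sums.copy(), self.counts.copy()))

    def rollback(self, slice_idx):
        """Restore the state saved just before `slice_idx` was trained on."""
        self.checkpoints = self.checkpoints[:slice_idx]
        if self.checkpoints:
            self.sums, self.counts = (c.copy() for c in self.checkpoints[-1])
        else:
            self.sums = np.zeros_like(self.sums)
            self.counts = np.zeros_like(self.counts)

    def predict(self, x):
        centroids = self.sums / np.maximum(self.counts, 1)[:, None]
        return int(np.argmin(((centroids - x) ** 2).sum(axis=1)))

def aggregate_predict(models, x):
    votes = [m.predict(x) for m in models]   # Aggregation across shards
    return max(set(votes), key=votes.count)  # majority vote

def sisa_unlearn(models, slices, shard, slice_idx, row):
    """Delete one point; retrain only the affected shard from its checkpoint."""
    X, y = slices[shard][slice_idx]
    slices[shard][slice_idx] = (np.delete(X, row, axis=0), np.delete(y, row))
    m = models[shard]
    m.rollback(slice_idx)                     # discard states that saw the point
    for Xs, ys in slices[shard][slice_idx:]:  # retrain forward, this shard only
        m.train_slice(Xs, ys)

# Two well-separated Gaussian classes; 2 shards x 2 slices (Sharding + Slicing).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(5, 1, (40, 2))])
y = np.array([0] * 40 + [1] * 40)
perm = rng.permutation(80)
X, y = X[perm], y[perm]
slices = [[(X[i:i + 20], y[i:i + 20]) for i in range(s, s + 40, 20)] for s in (0, 40)]
models = []
for shard_slices in slices:        # Isolation: each shard is trained separately
    m = ShardModel(2, 2)
    for Xs, ys in shard_slices:
        m.train_slice(Xs, ys)
    models.append(m)

sisa_unlearn(models, slices, shard=0, slice_idx=1, row=3)  # forget one point
```

Note the storage trade-off the article mentions: every slice adds a full checkpoint of model state, which is what makes rollback cheap but inflates storage.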
Algorithms
Machine unlearning algorithms are broadly categorized into exact and approximate methods, reflecting a fundamental trade-off between formal guarantees and computational tractability.[1]
Exact Unlearning
Exact unlearning methods produce a model that is statistically indistinguishable from one retrained from scratch on the dataset with the forget data removed. The canonical framework for exact unlearning is SISA Training (Sharded, Isolated, Sliced, and Aggregated), introduced by Bourtoule et al. (2021). SISA partitions the training dataset into disjoint shards and trains a separate sub-model on each. At inference time, predictions are aggregated across sub-models. When an unlearning request is received, only the sub-model corresponding to the shard containing the target data requires retraining, reducing computational overhead proportionally to the number of shards. Exact methods provide the strongest guarantees but become prohibitively expensive for large pre-trained neural networks and are generally limited to settings where training can be structured in advance.[4]
Approximate Unlearning
Approximate unlearning methods seek to produce a model whose behavior is sufficiently close to an exactly unlearned model without the cost of full retraining. These methods dominate practical applications. Common approaches include:[7]
- Gradient Ascent (GA): The model is fine-tuned by maximizing the loss on the forget set, directly degrading its performance on targeted data. This is the most direct approach but risks destabilizing performance on retained data.
- Random Labelling: The model is fine-tuned on the forget set using randomly shuffled labels, confusing its associations with the targeted data while producing a less aggressive weight shift than pure gradient ascent.[1]
- Gradient Difference (GradDiff): Combines gradient ascent on the forget set with simultaneous gradient descent on the retain set, using the retain objective as a regularizer to preserve general model utility.[8]
- KL Divergence Regularization: Minimizes the KL divergence between the outputs of the unlearned model and the original model on the retain set, anchoring behavior on data the model should remember.[1]
- Weight Pruning and Fine-tuning: Parameters with the smallest L1-norm are pruned, targeting weights that are weakly associated with general knowledge and potentially most associated with the forget set, followed by fine-tuning on the retain set to restore utility.[1]
- Layer Reset and Fine-tuning: The first or last k layers are re-initialized to random weights and the model is subsequently fine-tuned on the retain set. This is a coarse but computationally simple approach.[1]
- Selective Synaptic Dampening (SSD): Uses influence functions to estimate the effect of individual training examples on model parameters via first- and second-order derivatives, then applies a targeted weight update to reduce their influence without iterative optimization.[4]
- Representation Misdirection for Unlearning (RMU): Neurons causally implicated in encoding forget-set knowledge are made to produce random representations when presented with forget-set inputs, while retain-set knowledge is simultaneously reinforced. RMU has demonstrated strong results on capability-removal benchmarks.[7]
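Several of the fine-tuning approaches above can be illustrated on a toy model. The sketch below implements the Gradient Difference idea (which subsumes plain gradient ascent) on a hand-rolled logistic regression; the model, data, and hyperparameters such as `lr` and `lam` are illustrative assumptions, and published methods apply this objective to deep networks:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad(w, X, y):
    """Gradient of the mean cross-entropy loss of a logistic model."""
    return X.T @ (sigmoid(X @ w) - y) / len(y)

def loss(w, X, y):
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

def graddiff_unlearn(w, X_f, y_f, X_r, y_r, lr=0.1, lam=1.0, steps=200):
    """Ascend the loss on the forget set while descending it on the retain set."""
    w = w.copy()
    for _ in range(steps):
        w += lr * (grad(w, X_f, y_f) - lam * grad(w, X_r, y_r))
    return w

# Toy setup: the forget set is mislabelled ("poisoned") data the model trained on.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
w_true = np.array([1.5, -2.0, 0.5])
y_clean = (X @ w_true > 0).astype(float)
y_train = y_clean.copy()
y_train[:20] = 1 - y_train[:20]          # flip 20 labels (the "bad" data)

w = np.zeros(3)
for _ in range(500):                     # ordinary training on poisoned data
    w -= 0.5 * grad(w, X, y_train)

X_f, y_f = X[:20], y_train[:20]          # forget set: the poisoned points
X_r, y_r = X[20:], y_clean[20:]          # retain set: the clean points
w_un = graddiff_unlearn(w, X_f, y_f, X_r, y_r)

forget_loss_before = loss(w, X_f, y_f)
forget_loss_after = loss(w_un, X_f, y_f)
retain_acc = float(((sigmoid(X_r @ w_un) > 0.5) == y_r).mean())
```

The retain-set descent term (`lam * grad(...)`) acts as the regularizer described above: without it, pure gradient ascent would degrade performance on retained data as well.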
Noise and Weight Perturbation
A further family of methods adds calibrated noise to model weights or intermediate activations to obscure the statistical influence of the forget set. These approaches draw on differential privacy theory, where the addition of Laplace or Gaussian noise provides formal indistinguishability guarantees. Noise-based methods are typically combined with a fine-tuning step on the retain set to recover utility lost through the perturbation.[9]
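A minimal sketch of this recipe, assuming a toy logistic model (the noise scale `sigma`, learning rate, and step counts are illustrative choices; methods with formal guarantees calibrate the noise to a differential-privacy budget rather than picking it by hand):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad(w, X, y):
    """Gradient of the mean cross-entropy loss of a logistic model."""
    return X.T @ (sigmoid(X @ w) - y) / len(y)

def noise_then_finetune(w, X_r, y_r, sigma=2.0, lr=0.5, steps=300, seed=0):
    """Perturb weights with Gaussian noise, then fine-tune on the retain set."""
    rng = np.random.default_rng(seed)
    w_noisy = w + rng.normal(0.0, sigma, size=w.shape)  # obscure forget-set influence
    for _ in range(steps):
        w_noisy -= lr * grad(w_noisy, X_r, y_r)         # recover lost utility
    return w_noisy

# Toy demonstration on a linear model.
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 4))
w_true = np.array([2.0, -1.0, 0.5, 1.0])
y = (X @ w_true > 0).astype(float)
X_r, y_r = X[50:], y[50:]          # retain set; the first 50 points are "forgotten"

w = np.zeros(4)
for _ in range(500):
    w -= 0.5 * grad(w, X, y)       # original model, trained on everything

w_un = noise_then_finetune(w, X_r, y_r)
retain_acc = float(((sigmoid(X_r @ w_un) > 0.5) == y_r).mean())
```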
Adversarial Approaches
More recent work has introduced adversarially motivated unlearning frameworks, such as AMUN (Adversarial Machine UNlearning), which formulates the unlearning objective in terms of similarity to a retrained model's predictions on both the forget and retain sets, verified through membership inference attacks rather than loss-based proxies alone. Similarly, the Attack-and-Reset for Unlearning (ARU) algorithm uses adversarial noise to generate a parameter mask that selectively resets parameters associated with the forget set. These approaches address a key weakness of gradient-based methods: apparent unlearning may reflect surface-level output suppression rather than true parameter-level removal, a limitation exposed by adversarial probing.[10][11][5]
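The membership-inference probes used for verification can be illustrated with a simple loss-threshold attack, a deliberate simplification of the attacks used in the cited work (the toy over-parameterised model and the pairwise scoring rule are assumptions for the sketch):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def per_example_loss(w, X, y):
    """Per-example cross-entropy loss of a logistic model."""
    p = sigmoid(X @ w)
    return -(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

def mia_advantage(w, X_mem, y_mem, X_non, y_non):
    """Fraction of (member, non-member) pairs the attacker orders correctly,
    guessing that training members have lower loss. 0.5 means the attacker
    cannot tell; values well above 0.5 mean membership still leaks."""
    lm = per_example_loss(w, X_mem, y_mem)
    ln = per_example_loss(w, X_non, y_non)
    return float((lm[:, None] < ln[None, :]).mean())

# An over-parameterised toy model (15 points, 20 features) memorises its
# training set, so the loss-threshold probe separates members from non-members.
rng = np.random.default_rng(2)
d = 20
w_true = rng.normal(size=d)
X_mem = rng.normal(size=(15, d))
y_mem = (X_mem @ w_true > 0).astype(float)
X_non = rng.normal(size=(200, d))
y_non = (X_non @ w_true > 0).astype(float)

w = np.zeros(d)
for _ in range(2000):  # train on members only, long enough to memorise them
    w -= 0.5 * X_mem.T @ (sigmoid(X_mem @ w) - y_mem) / len(y_mem)

adv = mia_advantage(w, X_mem, y_mem, X_non, y_non)  # expected well above 0.5
```

An unlearning method verified this way would be expected to drive `adv` back toward 0.5 for the forget set, which is a stricter check than merely observing that the model's outputs changed.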