Membership Inference at the Threshold: Likelihood Ratios and What They Expose
The attack is geometrically simple. The implications are not.
The canonical framing of the membership inference attack is a binary classification problem: given a model $f_\theta$ and a sample $x$, does $x$ belong to the training set $\mathcal{D}_\text{train}$? The attack was formalised by Shokri et al. in 2017 [1], but the underlying geometry had been visible for longer: models tend to assign higher confidence to their training data than to held-out data, and that gap is exploitable.
What is less often stated clearly is that membership inference is not a single attack — it is a family of attacks arranged along a spectrum from shadow-model methods (expensive, high accuracy on well-separated populations) to simple threshold attacks (cheap, surprisingly competitive on overfit models). This post works through the likelihood ratio formulation, which sits in the middle of that spectrum and is useful precisely because it is auditable: you can inspect what it is actually measuring.
The Basic Geometry
Consider a classifier trained with cross-entropy loss. After training, the model assigns a predicted probability $\hat{p}_\theta(y \mid x)$ to each class $y$ given input $x$. For a sample that was in the training set, the model has had the opportunity to fit that sample — in the limit of overfitting, the model has memorised it. The predicted probability for the correct label will therefore be high, and the loss will be low.
The converse is not guaranteed but is empirically common: for samples not in the training set, the model has not had the opportunity to overfit, so the loss is typically higher. This creates a distributional separation between the per-sample loss values of members and non-members. Formally, if $\ell(x, y; \theta)$ is the per-sample loss, then:
$$ \mathbb{E}[\ell(x, y; \theta) \mid x \in \mathcal{D}_\text{train}] \;\leq\; \mathbb{E}[\ell(x, y; \theta) \mid x \notin \mathcal{D}_\text{train}] $$
This is the gap the attacker exploits. The question is how to exploit it optimally, and this is where the likelihood ratio comes in.
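As a minimal sketch of how this gap can be measured, assume you already have the model's predicted class probabilities as NumPy arrays. The function names and the toy numbers below are illustrative only and do not come from any particular library.

```python
import numpy as np


def per_sample_cross_entropy(probs: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Cross-entropy loss -log p(y | x) for each sample.

    probs:  predicted class probabilities, shape (N, C), rows sum to 1.
    labels: integer class labels, shape (N,).
    """
    eps = 1e-12  # avoid log(0) for saturated predictions
    p_true = probs[np.arange(len(labels)), labels]
    return -np.log(np.clip(p_true, eps, 1.0))


def mean_loss_gap(member_losses: np.ndarray, nonmember_losses: np.ndarray) -> float:
    """Mean non-member loss minus mean member loss (positive means an exploitable gap)."""
    return float(np.mean(nonmember_losses) - np.mean(member_losses))


# Toy example: confident predictions on members, less confident on non-members.
labels = np.array([0, 1])
member_probs = np.array([[0.95, 0.05], [0.10, 0.90]])
nonmember_probs = np.array([[0.60, 0.40], [0.45, 0.55]])

member_losses = per_sample_cross_entropy(member_probs, labels)
nonmember_losses = per_sample_cross_entropy(nonmember_probs, labels)
gap = mean_loss_gap(member_losses, nonmember_losses)
print(f"mean loss gap: {gap:.3f}")
```

A gap near zero does not prove the absence of leakage, since the means can coincide while the tails differ, which is why the rest of this post works with the full score distribution.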
The Likelihood Ratio Test
The Neyman-Pearson lemma tells us that the most powerful test for distinguishing two simple hypotheses at a fixed false-positive rate is the likelihood ratio test. Here, our two hypotheses are:
$$ H_0: x \notin \mathcal{D}_\text{train} \qquad H_1: x \in \mathcal{D}_\text{train} $$
The likelihood ratio is:
$$ \Lambda(x) = \frac{p(\ell(x, y; \theta) \mid H_1)}{p(\ell(x, y; \theta) \mid H_0)} $$
We reject $H_0$ (i.e., classify as member) when $\Lambda(x) > \tau$ for some threshold $\tau$. The practical difficulty is that we do not have access to the true densities $p(\ell \mid H_0)$ and $p(\ell \mid H_1)$. The Carlini et al. LiRA attack [2] estimates these by training many shadow models, some on subsets that include $x$ and the rest on subsets that exclude it, and using the empirical loss distributions from those runs.
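The shadow-model estimate can be sketched for a single sample as follows. This is a simplified illustration, not the published implementation: it fits Gaussians to raw losses to match the notation used here, whereas the LiRA paper works with a logit-scaled confidence, and it assumes the shadow-model losses have already been collected. The function names are hypothetical.

```python
import numpy as np


def gaussian_logpdf(x: float, mean: float, std: float) -> float:
    """Log density of N(mean, std^2) at x."""
    return float(-0.5 * np.log(2.0 * np.pi * std**2) - (x - mean) ** 2 / (2.0 * std**2))


def shadow_log_likelihood_ratio(
    target_loss: float,
    in_losses: np.ndarray,
    out_losses: np.ndarray,
    min_std: float = 1e-6,
) -> float:
    """log Lambda(x) for one fixed sample x, estimated from shadow models.

    in_losses:  losses of x under shadow models trained WITH x.
    out_losses: losses of x under shadow models trained WITHOUT x.
    """
    mu_in, mu_out = float(np.mean(in_losses)), float(np.mean(out_losses))
    # Floor the std so a degenerate shadow set cannot cause division by zero.
    sd_in = max(float(np.std(in_losses, ddof=1)), min_std)
    sd_out = max(float(np.std(out_losses, ddof=1)), min_std)
    return gaussian_logpdf(target_loss, mu_in, sd_in) - gaussian_logpdf(
        target_loss, mu_out, sd_out
    )


# Hypothetical shadow-model losses for a single sample.
in_losses = np.array([0.05, 0.08, 0.04, 0.07])
out_losses = np.array([0.90, 1.20, 0.75, 1.05])

score_low = shadow_log_likelihood_ratio(0.06, in_losses, out_losses)   # looks like a member
score_high = shadow_log_likelihood_ratio(1.00, in_losses, out_losses)  # looks like a non-member
```

The cost is in producing `in_losses` and `out_losses`: each entry requires a separately trained shadow model, which is what the approximation in the next section avoids.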
A simpler approximation
A useful baseline that avoids shadow models assumes the loss distributions are Gaussian. Under this approximation:
$$ \Lambda(x) = \frac{\mathcal{N}(\ell(x); \mu_{\text{in}}, \sigma_{\text{in}}^2)} {\mathcal{N}(\ell(x); \mu_{\text{out}}, \sigma_{\text{out}}^2)} $$
where $\mu_\text{in}$ and $\mu_\text{out}$ are the mean losses on the training and validation sets respectively, and $\sigma^2$ can be pooled or estimated separately. This is computationally trivial: it requires only one forward pass per sample and knowledge of aggregate training and validation loss statistics.
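The simplest version of this attack, thresholding on raw loss, is the special case of a pooled variance: when $\sigma_\text{in} = \sigma_\text{out}$ and $\mu_\text{in} < \mu_\text{out}$, $\log \Lambda(x)$ is a decreasing linear function of the loss, so both attacks rank samples identically. With separately estimated variances the score becomes quadratic in the loss and the ranking can differ, which helps only to the extent that the Gaussian fit matches the true loss distributions. Neither requires shadow models.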
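To make the relationship to raw-loss thresholding explicit, here is the pooled-variance case worked out, writing $\ell = \ell(x)$ and assuming $\sigma_\text{in} = \sigma_\text{out} = \sigma$ and $\mu_\text{in} < \mu_\text{out}$:

$$ \log \Lambda(x) = \frac{(\ell - \mu_\text{out})^2 - (\ell - \mu_\text{in})^2}{2\sigma^2} = \frac{(\mu_\text{out} - \mu_\text{in})(\mu_\text{in} + \mu_\text{out} - 2\ell)}{2\sigma^2} $$

$$ \Lambda(x) > \tau \iff \ell < \frac{\mu_\text{in} + \mu_\text{out}}{2} - \frac{\sigma^2 \log \tau}{\mu_\text{out} - \mu_\text{in}} $$

Under these assumptions the test is a plain loss threshold, and $\tau = 1$ places it at the midpoint of the two means. Only when the two variances differ does a quadratic term in $\ell$ survive and change the ordering of samples.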
Implementation
The following snippet computes the Gaussian likelihood ratio score for a batch of evaluation samples. It assumes you have already computed aggregate loss statistics from training and validation runs.
import numpy as np
from scipy import stats


def gaussian_lira_score(
    sample_losses: np.ndarray,
    mu_in: float,
    sigma_in: float,
    mu_out: float,
    sigma_out: float,
) -> np.ndarray:
    """
    Compute the log likelihood-ratio score for membership inference.

    A positive score suggests the sample is more likely a member;
    a negative score suggests non-member.

    Args:
        sample_losses: Per-sample cross-entropy losses, shape (N,).
        mu_in: Mean training loss (from training run).
        sigma_in: Std of training losses.
        mu_out: Mean validation loss (held-out reference set).
        sigma_out: Std of validation losses.

    Returns:
        Log-likelihood ratios, shape (N,). Higher → more likely member.
    """
    log_p_in = stats.norm.logpdf(sample_losses, loc=mu_in, scale=sigma_in)
    log_p_out = stats.norm.logpdf(sample_losses, loc=mu_out, scale=sigma_out)
    return log_p_in - log_p_out


def threshold_predict(scores: np.ndarray, tau: float = 0.0) -> np.ndarray:
    """Classify as member (1) when score exceeds threshold tau."""
    return (scores > tau).astype(int)
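The four statistics that `gaussian_lira_score` expects can be obtained along the following lines. This is a sketch under assumptions: `fit_loss_stats` is a hypothetical helper, and the gamma-distributed arrays are synthetic stand-ins for real per-sample losses.

```python
import numpy as np


def fit_loss_stats(losses: np.ndarray, min_std: float = 1e-6) -> tuple[float, float]:
    """Return (mean, std) of a 1-D array of per-sample losses.

    The std is floored at min_std so it is always a valid Gaussian scale.
    """
    losses = np.asarray(losses, dtype=float)
    mean = float(losses.mean())
    std = max(float(losses.std(ddof=1)), min_std)
    return mean, std


# Synthetic stand-ins (assumption): cross-entropy losses are non-negative and
# right-skewed, so a gamma distribution is a rough placeholder.
rng = np.random.default_rng(0)
train_losses = rng.gamma(shape=2.0, scale=0.05, size=5000)  # mean around 0.10
val_losses = rng.gamma(shape=2.0, scale=0.30, size=5000)    # mean around 0.60

mu_in, sigma_in = fit_loss_stats(train_losses)
mu_out, sigma_out = fit_loss_stats(val_losses)
print(f"in:  mean={mu_in:.3f} std={sigma_in:.3f}")
print(f"out: mean={mu_out:.3f} std={sigma_out:.3f}")
```

Real loss distributions are often far from Gaussian, so it is worth plotting both histograms before trusting the fit.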
To evaluate the attack, you need a ground-truth membership set. The standard evaluation protocol fixes the split before training begins: a random subset of the training set is designated as evaluation members, and a disjoint random set is held out entirely as evaluation non-members, so the model never sees the non-member evaluation samples:
python train.py \
    --dataset cifar10 \
    --train-size 25000 \
    --eval-members 2500 \
    --eval-nonmembers 2500 \
    --epochs 100 \
    --save-losses member_losses.npy validation_losses.npy
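What such a split might look like in code is sketched below. The internals of `train.py` are not shown in this post, so `make_membership_split` is a hypothetical helper that only illustrates the protocol: indices are partitioned once, before any training happens.

```python
import numpy as np


def make_membership_split(
    n_total: int,
    train_size: int,
    n_eval_members: int,
    n_eval_nonmembers: int,
    seed: int = 0,
) -> dict[str, np.ndarray]:
    """Partition sample indices before training.

    Returns index arrays:
      train:           indices the model is trained on
      eval_members:    subset of train used as ground-truth members
      eval_nonmembers: held-out indices the model never sees
    """
    if train_size + n_eval_nonmembers > n_total:
        raise ValueError("not enough samples for the requested split")
    if n_eval_members > train_size:
        raise ValueError("eval members must be a subset of the training set")

    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_total)
    train = perm[:train_size]
    eval_nonmembers = perm[train_size:train_size + n_eval_nonmembers]
    eval_members = rng.choice(train, size=n_eval_members, replace=False)
    return {
        "train": train,
        "eval_members": eval_members,
        "eval_nonmembers": eval_nonmembers,
    }


# Sizes mirror the command above; 50,000 is the size of the CIFAR-10 training set.
split = make_membership_split(50_000, 25_000, 2_500, 2_500)
```

Saving the index arrays alongside the checkpoint keeps the audit reproducible, because membership labels can then be reconstructed exactly.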
Computing the ROC curve
Once you have scores for members and non-members, the ROC curve follows directly. The key metric for membership inference is the TPR@0.1%FPR: the fraction of members correctly identified when the false positive rate is held to 0.1%. This matters because an attacker who wants to make confident claims needs a very low false positive rate; classifying thousands of non-members as members is not useful.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score


def compute_mia_roc(
    member_scores: np.ndarray,
    nonmember_scores: np.ndarray,
) -> dict:
    """
    Compute ROC curve and key membership inference metrics.

    Returns dict with keys "fpr", "tpr", "thresholds", "auc",
    and "tpr_at_0.1%_fpr".
    """
    # Label members 1 and non-members 0; higher score means "more likely member".
    y_true = np.concatenate([np.ones(len(member_scores)),
                             np.zeros(len(nonmember_scores))])
    y_score = np.concatenate([member_scores, nonmember_scores])
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    auc = roc_auc_score(y_true, y_score)

    # TPR at FPR ≤ 0.001 (0.1%), the primary privacy metric
    low_fpr_mask = fpr <= 0.001
    tpr_at_01_fpr = float(tpr[low_fpr_mask].max()) if low_fpr_mask.any() else 0.0

    return {
        "fpr": fpr,
        "tpr": tpr,
        "thresholds": thresholds,
        "auc": auc,
        "tpr_at_0.1%_fpr": tpr_at_01_fpr,
    }
What the ROC Curve Actually Tells You
AUC is a useful aggregate but it hides the privacy-relevant part of the curve. An AUC of 0.60 can be consistent with TPR@0.1%FPR of nearly zero (if the separation is concentrated at high FPR) or with TPR@0.1%FPR of 15%+ (if the separation happens at the low-FPR end). These have very different privacy implications.
The reason to care about the low-FPR region specifically is that it corresponds to the attacker's operational regime. If an adversary wants to confidently assert that a specific individual was in the training set, they need precision, and cannot tolerate high false positive rates. The TPR at 0.1% FPR is asking: what fraction of true members does the attacker identify while wrongly flagging only one in every thousand non-members?
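To make the low-FPR reading concrete, here is a NumPy-only sketch that compares two synthetic attacks. Everything in it is an assumption made for illustration: the helper names, the Gaussian score distributions, and the sample sizes. The scores are constructed so that both attacks should land at a similar AUC while differing sharply at 0.1% FPR.

```python
import numpy as np


def tpr_at_fpr(member_scores, nonmember_scores, target_fpr: float) -> float:
    """TPR when the threshold allows at most target_fpr of non-members above it."""
    nonmember_sorted = np.sort(np.asarray(nonmember_scores))[::-1]  # descending
    k = int(np.floor(target_fpr * len(nonmember_sorted)))  # allowed false positives
    # Threshold is the (k+1)-th largest non-member score; predict member if score > threshold.
    threshold = nonmember_sorted[k] if k < len(nonmember_sorted) else -np.inf
    return float(np.mean(np.asarray(member_scores) > threshold))


def auc(member_scores, nonmember_scores) -> float:
    """Probability that a random member outscores a random non-member (ties count half)."""
    m = np.asarray(member_scores)[:, None]
    n = np.asarray(nonmember_scores)[None, :]
    return float(np.mean(m > n) + 0.5 * np.mean(m == n))


rng = np.random.default_rng(0)
nonmembers = rng.normal(0.0, 1.0, size=10_000)

# Attack A: 10% of members are strongly exposed, the rest look like non-members.
members_a = np.concatenate([rng.normal(6.0, 1.0, size=100),
                            rng.normal(0.0, 1.0, size=900)])
# Attack B: every member is shifted slightly.
members_b = rng.normal(0.18, 1.0, size=1_000)

auc_a, auc_b = auc(members_a, nonmembers), auc(members_b, nonmembers)
tpr_a = tpr_at_fpr(members_a, nonmembers, 0.001)
tpr_b = tpr_at_fpr(members_b, nonmembers, 0.001)
print(f"A: AUC={auc_a:.3f}  TPR@0.1%FPR={tpr_a:.3f}")
print(f"B: AUC={auc_b:.3f}  TPR@0.1%FPR={tpr_b:.3f}")
```

Attack A corresponds to a model that has memorised a small subset of its training data, which is the situation AUC alone tends to hide.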
Configuration example
# mia_config.yaml: membership inference audit configuration
model:
  architecture: resnet18
  checkpoint: checkpoints/resnet18_epoch100.pt
  num_classes: 10
attack:
  type: gaussian_lira
  fpr_targets: [0.001, 0.01, 0.1]  # evaluate TPR at these FPR values
  n_bootstrap: 1000                # bootstrap samples for confidence intervals
evaluation:
  member_losses_path: data/member_losses.npy
  nonmember_losses_path: data/nonmember_losses.npy
  output_dir: results/mia_audit/
reporting:
  primary_metric: tpr_at_fpr_0.001  # 0.1% FPR
  threshold_for_concern: 0.05       # flag if TPR@0.1%FPR exceeds 5%
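The `n_bootstrap` setting implies confidence intervals on the reported TPR values. How the audit tooling computes them is not specified here, so the following is one plausible reading: a percentile bootstrap that resamples members and non-members independently. The function names and the synthetic scores are assumptions.

```python
import numpy as np


def tpr_at_fpr(member_scores, nonmember_scores, target_fpr: float) -> float:
    """TPR when the threshold allows at most target_fpr of non-members above it."""
    nonmember_sorted = np.sort(np.asarray(nonmember_scores))[::-1]
    k = int(np.floor(target_fpr * len(nonmember_sorted)))
    threshold = nonmember_sorted[k] if k < len(nonmember_sorted) else -np.inf
    return float(np.mean(np.asarray(member_scores) > threshold))


def bootstrap_tpr_ci(
    member_scores,
    nonmember_scores,
    target_fpr: float = 0.001,
    n_bootstrap: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile bootstrap (1 - alpha) confidence interval for TPR at target_fpr."""
    rng = np.random.default_rng(seed)
    members = np.asarray(member_scores)
    nonmembers = np.asarray(nonmember_scores)
    estimates = np.empty(n_bootstrap)
    for b in range(n_bootstrap):
        # Resample each population with replacement, then recompute the metric.
        m = rng.choice(members, size=len(members), replace=True)
        n = rng.choice(nonmembers, size=len(nonmembers), replace=True)
        estimates[b] = tpr_at_fpr(m, n, target_fpr)
    lo, hi = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)


# Synthetic scores (assumption), with a smaller n_bootstrap to keep the demo fast.
rng = np.random.default_rng(1)
member_scores = rng.normal(1.5, 1.0, size=2500)
nonmember_scores = rng.normal(0.0, 1.0, size=2500)
lo, hi = bootstrap_tpr_ci(member_scores, nonmember_scores, target_fpr=0.01, n_bootstrap=200)
print(f"TPR@1%FPR 95% CI: [{lo:.3f}, {hi:.3f}]")
```

With 2,500 non-members, a 0.1% FPR threshold rests on only two or three samples, so intervals at the lowest FPR target will be wide; that is a property of the evaluation set size, not of the attack.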
Interpreting Results and Setting Thresholds
There is no universal threshold that cleanly separates "private" from "not private". The right reference points are:
- Baseline TPR@0.1%FPR ≈ 0.1%. A perfectly non-leaking model yields a ROC curve on the diagonal (TPR = FPR); at 0.1% FPR, the attacker finds 0.1% of members. Any value substantially above this indicates leakage.
- 2–5% TPR@0.1%FPR: moderate leakage. Common on well-regularised models. Usually acceptable if the training data is not particularly sensitive.
- >10% TPR@0.1%FPR: material leakage. At this level, an attacker can reliably identify a meaningful fraction of training members under realistic constraints. Investigate overfitting and consider differential privacy.
These thresholds are heuristic. If the training data contains medical records or other high-sensitivity material, even 2% may be unacceptable. If the training data is public web text, 10% may be tolerable. The right answer is always context-specific.
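One way to read these numbers is as a multiple of the chance baseline, since a non-leaking model gives TPR equal to FPR. The helper below is a hypothetical convenience for that arithmetic, not a standard metric.

```python
def lift_over_chance(tpr: float, fpr: float) -> float:
    """How many times more members the attacker finds than random guessing
    would at the same false positive rate (chance level is TPR == FPR)."""
    if fpr <= 0:
        raise ValueError("fpr must be positive")
    return tpr / fpr


for tpr in (0.001, 0.02, 0.05, 0.10):
    lift = lift_over_chance(tpr, 0.001)
    print(f"TPR {tpr:.1%} at FPR 0.1% is {lift:.0f}x chance")
```

Seen this way, even the "moderate" 2% band is twenty times what a non-leaking model would allow, which is why the acceptable level depends so heavily on what the training data contains.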
Mitigations
Regularisation (cheapest, imperfect)
L2 weight decay and dropout reduce overfitting, which is the proximate cause of the loss gap. Empirically, heavy regularisation reduces TPR@0.1%FPR by 2–5× on common architectures without significant accuracy cost. It does not provide formal guarantees — it makes the attack harder, not impossible.
Early stopping
Stopping before the model fully converges on the training set prevents the deep memorisation that drives the loss gap. The cost is a higher final training loss and potentially some accuracy left on the table. The tradeoff depends on how much of your accuracy is coming from memorised samples versus generalised features, which requires ablation to determine.
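A common way to implement this is patience-based stopping on validation loss. The sketch below is framework-agnostic and assumes you log one validation loss per epoch; the function name and the example curve are hypothetical.

```python
def early_stopping_epoch(
    val_losses: list[float],
    patience: int = 3,
    min_delta: float = 0.0,
) -> int:
    """Return the 0-based epoch whose checkpoint would be kept.

    Training stops once validation loss has failed to improve by more than
    min_delta for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_epoch


# Hypothetical validation-loss curve: improves, then rises as the model overfits.
curve = [1.20, 0.90, 0.70, 0.62, 0.60, 0.61, 0.63, 0.66, 0.70]
best = early_stopping_epoch(curve, patience=3)
print(f"keep checkpoint from epoch {best}")
```

For a privacy audit it can be informative to rerun the membership inference evaluation on the checkpoints either side of the chosen epoch, to see how quickly leakage grows with further training.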
Differential privacy (formal guarantee, meaningful cost)
$(\varepsilon, \delta)$-differential privacy provides a formal bound on the advantage any membership inference attacker can achieve. DP-SGD [3] clips each per-sample gradient to a fixed norm and adds calibrated Gaussian noise to the clipped sum before averaging, making the output distribution of the training algorithm insensitive to any individual training sample:
$$ \Pr[\mathcal{M}(\mathcal{D}) \in S] \;\leq\; e^\varepsilon \cdot \Pr[\mathcal{M}(\mathcal{D}') \in S] + \delta $$
for any adjacent datasets $\mathcal{D}$ and $\mathcal{D}'$ differing in one record, and any measurable set $S$ of model parameters. The privacy cost $\varepsilon$ is the core parameter: smaller is more private, but the utility cost at $\varepsilon \leq 3$ is substantial (typically 5–15% accuracy degradation on image classifiers).
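The core gradient step of DP-SGD can be sketched in NumPy as follows. This is an illustration of the clip-then-noise mechanics only: it assumes per-sample gradients are already available as a `(B, D)` array, the function name is hypothetical, and it does not compute $\varepsilon$, which requires a separate privacy accountant.

```python
import numpy as np


def dp_sgd_noisy_gradient(
    per_sample_grads: np.ndarray,
    clip_norm: float,
    noise_multiplier: float,
    rng: np.random.Generator,
) -> np.ndarray:
    """One DP-SGD gradient estimate from per-sample gradients of shape (B, D).

    1. Clip each per-sample gradient to L2 norm <= clip_norm.
    2. Sum the clipped gradients.
    3. Add Gaussian noise with std = noise_multiplier * clip_norm.
    4. Divide by the batch size.
    """
    grads = np.asarray(per_sample_grads, dtype=float)
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    # Scale factor is 1 for small gradients and clip_norm / norm for large ones.
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = grads * scale
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=grads.shape[1])
    return (clipped.sum(axis=0) + noise) / grads.shape[0]


rng = np.random.default_rng(0)
grads = np.array([[3.0, 4.0],    # norm 5.0, will be clipped to norm 1.0
                  [0.3, 0.4]])   # norm 0.5, left unchanged
noisy = dp_sgd_noisy_gradient(grads, clip_norm=1.0, noise_multiplier=1.1, rng=rng)
noiseless = dp_sgd_noisy_gradient(grads, clip_norm=1.0, noise_multiplier=0.0, rng=rng)
```

Clipping is what bounds any single sample's influence on the update; the noise scale is then set relative to that bound, which is why the two parameters are tuned together.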
DP-SGD is not always the answer. For large language models fine-tuned on small private corpora, the utility degradation at meaningful $\varepsilon$ values can be severe enough to make the model useless. The decision to use DP requires quantifying the sensitivity of the training data and the cost of the accuracy hit — both of which are project-specific.
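One way to quantify what a given $\varepsilon$ buys is to apply the DP inequality to the attacker's decision region: taking $S$ to be the set of models on which the attacker says "member" gives TPR $\leq e^\varepsilon \cdot$ FPR $+ \delta$. The helper below, a hypothetical name, evaluates that worst-case bound at the FPR used throughout this post.

```python
import math


def max_tpr_under_dp(epsilon: float, delta: float, fpr: float) -> float:
    """Upper bound on attack TPR at a given FPR implied by (epsilon, delta)-DP."""
    return min(1.0, math.exp(epsilon) * fpr + delta)


for eps in (0.5, 1.0, 3.0, 8.0):
    bound = max_tpr_under_dp(eps, delta=1e-5, fpr=0.001)
    print(f"epsilon={eps}: TPR@0.1%FPR <= {bound:.4f}")
```

This is a worst-case guarantee, so measured attack performance is usually well below it; conversely, at large $\varepsilon$ the bound reaches 1 and says nothing, which is one reason empirical audits remain useful even for DP-trained models.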
Conclusion
Membership inference is best understood as a measurement tool, not just an attack. Running the Gaussian LiRA evaluation as a routine audit step — compute per-sample losses, fit the approximation, plot the ROC curve — takes less than an hour on any trained model and tells you something specific and actionable about where you sit on the privacy-utility curve.
The metric that matters is TPR at low FPR, not AUC. A model with AUC of 0.65 and TPR@0.1%FPR of 1.2% is in a different situation from a model with AUC of 0.65 and TPR@0.1%FPR of 18%. The ROC curve is the right instrument; read it in the right region.