Multi-Task Learning as Inductive Bias Under Distribution Shift

A model trained on one domain and deployed in another will lose performance. The question is whether the architecture itself can act as a structural buffer.

14 April 2026

A model trained on one domain and deployed in another will lose performance. This is well known. What is less well understood is why — because "distribution shift" is not one thing. The input vocabulary changes (lexical shift). The label frequencies change (label shift). The label semantics themselves may shift. These effects compound, and most evaluations do not disentangle them.

The standard response is more data, larger models, or domain adaptation. These work, but they treat the symptom. The question this thesis asks is whether the architecture itself — specifically, multi-task learning — can act as a structural buffer against degradation, by forcing the encoder to learn representations that serve more than one objective.

Experimental Setting

Two NLP classification tasks drawn from the SemEval 2025 shared task on persuasion detection: entity framing (assigning roles to named entities in news text) and narrative classification (identifying overarching discourse strategies in articles). Both tasks are multi-label and hierarchical — the challenging end of the classification spectrum.

Two domains — Ukraine conflict and climate change — with verified distributional divergence: Jensen-Shannon divergence of 0.41 on token distributions, cosine distance of 0.64 on sentence embeddings, and non-overlapping narrative label sets across domains.

PCA and t-SNE plots of sentence embeddings for Ukraine and climate change domains. PCA shows moderate overlap with a noticeable central shift. t-SNE shows clearly separated clusters, with most climate change and Ukraine points occupying different regions of the embedding space. — Figure 2. PCA (left) captures linear variation in the embeddings and shows overall structure; there is moderate overlap but a noticeable central shift between domains. t-SNE (right) captures non-linear relationships; the domains are clearly separated, with most climate change (CC) and Ukraine (UA) points occupying different regions of the space.

Three architectures compared: single-task learning (STL), multi-task learning with hard parameter sharing (MTL), and a hybrid variant with task-specific adapter modules (MTL-A). Each configuration evaluated over 20 random seeds with Wilcoxon signed-rank tests for significance.

Diagram showing the three architectures: single-task learning with one task head, multi-task learning with a shared encoder and two task heads, and MTL with adapters adding task-specific modules between shared encoder layers. — Figure 3. The multi-task learning setup (centre) and MTL with task-specific adapters (right, shown in blue). Both fine-tune the shared encoder by optimising a summed loss across tasks. The single-task setup (left) includes only one task head and no adapter modules.

What I Tested

The thesis is structured around four controlled ablations. Each one isolates a different mechanism — to answer not just whether MTL helps, but under what conditions and why.

1. Cross-domain training

One task trained on domain A, the other on domain B. Does the auxiliary task's domain signal transfer to the primary task?

Finding: training entity framing on climate change data while training narrative classification on Ukraine data recovered entity framing performance on Ukraine evaluation metrics to the level of a model trained directly on Ukraine. The auxiliary task acted as an indirect source of domain-relevant signal — a form of cross-task, cross-domain data augmentation.

2. Data duplication

The most important control. MTL models see roughly twice the data of STL models (two tasks' worth of labelled examples passing through a shared encoder). Is the MTL gain just a data volume effect? To test this, I duplicated the STL training data to match the MTL token count.

Finding: duplication does improve performance, particularly on rare labels. But MTL configurations still outperform the duplicated baseline on the tasks that benefit most from multi-task training — confirming that task interaction contributes something beyond additional supervision volume.

3. Label granularity

The label taxonomy was varied across four configurations: both tasks fine-grained, one coarse / one fine, and both coarse.

Finding: when both tasks use coarse labels and the classification problem becomes more balanced, STL outperforms MTL. Hard parameter sharing is most beneficial precisely when the problem is hardest — high label imbalance, data scarcity, and large distributional shift. Under easier conditions, the specialisation of a single-task model wins.

4. Cross-domain evaluation

All models were evaluated on held-out test sets from both domains, regardless of training domain. This surfaces asymmetric transfer: models trained on one domain generalised to the other more effectively than the reverse, suggesting that domain informativeness — not just domain similarity — drives transfer quality.

Key Findings

MTL helps most when things are hardest. Under data scarcity, extreme class imbalance, and large distributional shift, multi-task models recovered performance that single-task models lost. The clearest example: narrative classification trained on Ukraine data alone scored 0.00 micro-F1 on climate change test data under STL — complete failure. Under MTL, the same encoder produced 0.24. Still low, but a meaningful recovery from zero.

MTL is not universally better. In stable, data-rich settings with well-covered label distributions, single-task models performed competitively or better. The negative transfer observed in entity framing (2–3% micro-F1 degradation under MTL) was consistent and statistically significant in several configurations. MTL is a specific tool for specific failure modes, not a universal upgrade.

The data duplication ablation matters. Without it, the MTL results would be ambiguous — the gains could be explained by exposure to more labelled data, not by task interaction. Isolating these two effects is what allows the thesis to make a causal claim about the role of multi-task structure, rather than just reporting correlations.

Label granularity mediates the effect. When the auxiliary task uses fine-grained labels (more imbalanced, harder), MTL provides a stronger benefit. When labels are coarsened (more balanced, easier), the MTL advantage diminishes or reverses. This suggests that MTL's regularisation effect is most valuable when the model is at greatest risk of overfitting to dominant classes.

Methodology Notes

Evaluation over 20 seeds (10 per ablation) with mean ± standard deviation reported for all metrics. Statistical significance assessed via two-sided Wilcoxon signed-rank tests. Three complementary metrics: micro-F1 (overall accuracy, sensitive to frequent classes), macro-F1 (per-class average, sensitive to rare classes), and exact match ratio (strictest — every label correct for a given instance).

The full multi-label, hierarchical label taxonomy was preserved rather than simplified to multi-class. This is a deliberate methodological choice: it makes the problem harder but the results more representative of real-world classification tasks where label spaces are large, imbalanced, and structured.

Post-hoc interpretability via SHAP token attributions was included for completeness but treated with appropriate caution given the known limitations of post-hoc explanations for transformer models.

Relevance to Applied ML

The project addresses a problem that recurs in every production ML system: your training data does not perfectly represent the data you will encounter at deployment. The findings have direct implications for practitioners deciding between single-task and multi-task architectures.

Reach for MTL when you have limited labelled data, severe class imbalance, or reason to expect domain shift at deployment. The inductive bias from a related auxiliary task can act as a structural regulariser that simple data augmentation cannot fully replicate.

Stay with STL when you have sufficient, well-distributed training data and a stable deployment domain. The specialisation advantage of a single-task model outweighs the regularisation benefit of MTL in these conditions.

Design your evaluation to isolate mechanisms. An MTL result without a data-volume control is incomplete. An out-of-domain evaluation without disentangling input shift from label shift conflates two different failure modes. The ablation methodology used here — varying one factor while holding others constant — is transferable to any model evaluation pipeline.

This post is a condensed version of my MSc thesis at the University of Amsterdam (2025). The full thesis is available on request. An earlier version was published on Medium.