Self-supervised Vision Transformer are Scalable Generative Models for Domain Generalization

Based on: Doerrich, S., Salvo, F. D., & Ledig, C. (2024). Self-supervised Vision Transformer are Scalable Generative Models for Domain Generalization (No. arXiv:2407.02900). arXiv. https://doi.org/10.48550/arXiv.2407.02900

Introduction: The Challenge of Generalization

Fig. 1: Toy Example – Real dogs vs. drawn cats

Imagine you’re trying to teach a model to distinguish between cats and dogs. You train a simple classifier using the images shown in Fig. 1: real dogs and drawn cats. While this might work well for correctly classifying a real dog (like the first image), it can easily fail when encountering a new dog image with slightly different characteristics, such as the second dog image. Why? Because the model has never seen enough variation. It has learned to associate realism with dogs and sketches with cats. This is a classic problem of poor generalization, and it closely mirrors the challenges of training classifiers for histopathology images [1].

Now consider a similar scenario in the context of medical imaging. Suppose we collect histopathology images of the same tissue type from three different hospitals. Although the underlying biological structures are comparable, the resulting images can differ substantially due to variations in scanner hardware, staining protocols, image resolution, or post-processing pipelines. These subtle but systematic differences – often referred to as domain shifts – can degrade the performance of a tumor detection model. For example, a model trained exclusively on data from Hospital A may fail to detect tumors in images from Hospital B, not because the tumor morphology is different, but because differences in color distribution, contrast, or texture statistics cause the model’s learned features to generalize poorly.

Fig. 2: Visual example of how systematic variations between hospitals can challenge a model’s ability to generalize

Overcoming Domain Shifts with Synthetic Images

To address this, we need more diverse and representative training data. One effective strategy is to generate synthetic images that enrich the variability in our datasets. Several methods already exist for this purpose [4][5], such as augmenting the dataset using HSV color space manipulation (see [8]) or using stain color normalization techniques to match the color distribution of the test data to that of the training set (see [2]).
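To make the HSV idea concrete, here is a minimal sketch using only Python's standard `colorsys` module. The function names, the jitter ranges (`max_hue`, `max_scale`) and the two example pixels are assumptions made up for illustration; they are not taken from the paper or from [8].

```python
import colorsys
import random

def hsv_jitter(pixels, hue_shift, sat_scale, val_scale):
    """Apply one HSV perturbation to a list of (r, g, b) pixels in [0, 1]."""
    out = []
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r, g, b)
        h = (h + hue_shift) % 1.0               # hue is cyclic
        s = min(max(s * sat_scale, 0.0), 1.0)   # clamp saturation to [0, 1]
        v = min(max(v * val_scale, 0.0), 1.0)   # clamp brightness to [0, 1]
        out.append(colorsys.hsv_to_rgb(h, s, v))
    return out

def random_hsv_jitter(pixels, rng, max_hue=0.05, max_scale=0.2):
    """Draw one random perturbation and apply it to every pixel (ranges are illustrative)."""
    hue_shift = rng.uniform(-max_hue, max_hue)
    sat_scale = 1.0 + rng.uniform(-max_scale, max_scale)
    val_scale = 1.0 + rng.uniform(-max_scale, max_scale)
    return hsv_jitter(pixels, hue_shift, sat_scale, val_scale)

# Two made-up pink/purple pixels, loosely in the color range of stained tissue.
patch = [(0.80, 0.45, 0.65), (0.45, 0.25, 0.60)]
augmented = random_hsv_jitter(patch, random.Random(0))
```

Every pixel receives the same shift, so the tissue structure is untouched and only the overall color impression changes. That is also the limitation: the variation is restricted to whatever the hand-picked ranges allow.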

However, these techniques have their limitations – particularly when the goal is to generalize to target domains with distribution shifts not captured by the available training data or augmentations. This is where the approach presented in the paper discussed in this post comes in [3].

Rather than manually adjusting colors or applying handcrafted augmentations, the authors take a different approach. They introduce a self-supervised method that learns to separate an image into two distinct types of information:

Anatomical features: These capture the underlying biological structure in the image – things like the shape and arrangement of cells, tissue architecture, and other medically relevant content. This is the part of the image that a pathologist actually cares about.
Characteristic features: These represent everything else that isn’t directly tied to the anatomy. Think scanner settings, staining variations, lighting conditions, resolution differences, or other artifacts introduced during image acquisition and processing.

By disentangling these two components, the model can recombine them in new ways. That means you can take the anatomy from one image and mix it with the characteristics from another, generating entirely new, synthetic images that preserve diagnostic content while introducing natural variation in style.

This opens up a powerful avenue for generating more diverse training data, without needing access to target domain images or hand-crafted augmentation rules.
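To get a feeling for how quickly recombination multiplies the available data, the sketch below enumerates every ordered pairing of an anatomy source with a different characteristic source. The image identifiers are hypothetical, and whether the authors use all pairings or sample them at random is not stated here, so treat the exhaustive enumeration as an assumption.

```python
from itertools import permutations

def cross_pairings(image_ids):
    """All ordered (anatomy_source, characteristic_source) pairs of distinct images."""
    return list(permutations(image_ids, 2))

# Hypothetical identifiers for three training images from three hospitals.
pairs = cross_pairings(["hospital_A_01", "hospital_B_07", "hospital_C_03"])
```

With n images there are n * (n - 1) such pairings, so three images already yield six candidate synthetic images.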

But now you might be wondering: How exactly is this separation achieved? What kind of representations does the model learn for anatomy and characteristics? And how does the model turn it back into a synthetic image? Let’s break these ideas down step by step.

Fig. 3: Key idea – Split information contained in images into anatomy and style components

Core Idea: Separating Anatomy and Style

Let’s walk through a simple example to understand how this works. Imagine we have just two histopathology images and we want to extract a representation from each that separates anatomical and characteristic features.

To achieve this, each image is first fed into an encoder, in this case, a Vision Transformer (ViT), which we’ll explain shortly. The encoder extracts a feature representation that’s specifically structured: the first half encodes the anatomical information, while the second half encodes the characteristic information. How this separation is enforced is explained in the next sections.

Now comes the clever part: we can mix and match these representations.

Take the anatomical features from Image A and combine them with the characteristic features from Image B
Or vice versa: anatomical features from Image B combined with characteristics from Image A

Then, using a fixed decoder – which can be roughly imagined as multiplying the anatomical and characteristic features – these combined features are transformed back into full images. The resulting synthetic images preserve the original biological structures but appear as if they were captured under different scanner settings or staining protocols. In other words, the anatomy stays the same, but the “style” changes, which helps the model become robust to those domain shifts we mentioned earlier.
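The swap can be sketched in a few lines. Here the feature vectors are made-up numbers, and `toy_decoder` is only a stand-in that takes the "roughly multiplying" picture literally; the real decoder produces a full image and is not reproduced here.

```python
def split_features(z):
    """Split an embedding into (anatomy, characteristics) halves."""
    half = len(z) // 2
    return z[:half], z[half:]

def toy_decoder(anatomy, characteristics):
    """Hypothetical stand-in for the decoder: element-wise product of the halves."""
    return [a * c for a, c in zip(anatomy, characteristics)]

# Made-up encoder outputs for Image A and Image B (first half anatomy, second half style).
z_a = [1.0, 2.0, 3.0, 0.5, 0.5, 0.5]
z_b = [4.0, 5.0, 6.0, 2.0, 2.0, 2.0]

anat_a, char_a = split_features(z_a)
anat_b, char_b = split_features(z_b)

synthetic_ab = toy_decoder(anat_a, char_b)  # anatomy of A, characteristics of B -> [2.0, 4.0, 6.0]
synthetic_ba = toy_decoder(anat_b, char_a)  # anatomy of B, characteristics of A -> [2.0, 2.5, 3.0]
```

In this toy setting the relative pattern of the anatomy values (1 : 2 : 3 for A) survives the swap, while the overall scale is dictated by the other image's characteristics.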

Fig. 4: Encoder-decoder pipeline for swapping anatomy and style to boost diversity

The key to making this all work lies in the encoder, which is built using a Vision Transformer (ViT). So let’s quickly shine some light on how a ViT operates.

Vision Transformer: A Quick Overview

Unlike convolutional neural networks (CNNs), which analyze images using sliding filters and build up local feature hierarchies, a Vision Transformer takes a more holistic view. It begins by chopping the input image into small square patches (say, 16×16 pixels each). These patches are flattened into vectors – kind of like turning a tiny image square into a long list of numbers. Each of these vectors is then embedded into a new space using a learnable linear projection, and a positional embedding is added so the model doesn’t lose track of where each patch came from in the original image.
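The patch pipeline described above can be sketched in plain Python. This is a simplified illustration under several assumptions: a tiny 4×4 grayscale image with 2×2 patches instead of 16×16, randomly initialized (untrained) weights, and no class token. The names `patchify` and `embed` are chosen for this example only.

```python
import random

def patchify(image, patch):
    """Cut a 2-D grayscale image (list of rows) into flattened patches, in row-major order."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([image[top + i][left + j]
                            for i in range(patch) for j in range(patch)])
    return patches

def embed(patches, weight, pos):
    """Project each flattened patch linearly and add the embedding of its position."""
    tokens = []
    for k, p in enumerate(patches):
        projected = [sum(x * w for x, w in zip(p, row)) for row in weight]
        tokens.append([v + pos[k][d] for d, v in enumerate(projected)])
    return tokens

image = [[0, 1, 2, 3],
         [4, 5, 6, 7],
         [8, 9, 10, 11],
         [12, 13, 14, 15]]
patches = patchify(image, patch=2)   # 4 patches, each a vector of 4 numbers

rng = random.Random(0)
dim = 3                              # embedding size, chosen arbitrarily
weight = [[rng.gauss(0, 0.1) for _ in range(4)] for _ in range(dim)]
pos = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(len(patches))]
tokens = embed(patches, weight, pos)  # one 3-dimensional token per patch
```

In a trained model both `weight` and `pos` are learned parameters; here they are random and only show the shapes involved.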

Now comes the core idea: all these patch embeddings are passed as a sequence into a standard Transformer encoder, similar to what’s used in natural language processing, where self-attention mechanisms learn relationships between every patch and every other patch. This allows the model to capture both fine details (like cell shape) and global context (like tissue structure or staining gradients).
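To see what "every patch attends to every other patch" means numerically, here is a stripped-down sketch of scaled dot-product self-attention. It is a simplification: a single head, and the tokens are used directly as queries, keys and values, whereas a real Transformer applies learned projections first. The three tokens are made-up values.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head scaled dot-product attention with Q = K = V = tokens."""
    d = len(tokens[0])
    weights = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights.append(softmax(scores))
    outputs = [[sum(w * v[j] for w, v in zip(row, tokens)) for j in range(d)]
               for row in weights]
    return outputs, weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
```

Each row of `weights` sums to 1 and says how strongly one patch looks at each of the others. The first token, for example, attends more to the third token (which shares its first component) than to the second.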

In our case, after the Vision Transformer processes the image, the resulting feature representation should be split in half: the first half encodes the anatomical information, and the second half captures the characteristic information – this separation is the key idea behind the approach.

For a deeper technical breakdown of Vision Transformers, you can refer to the detailed post in [7], and if you’re curious about how transformers work more generally, the 3Blue1Brown video in [6] is an excellent starting point.

Fig. 5: Overview of a Vision Transformer used for feature extraction

Guiding the Model: The Three Losses

But we still haven’t answered the crucial question: How does the Vision Transformer know what counts as anatomical and what counts as characteristic features?

Clearly, the model needs some kind of guidance – and if you’re thinking “Loss functions!”, you’re absolutely right. To teach the Vision Transformer how to separate these two types of features, the authors introduce three specific loss functions:

Fig. 6: Three loss functions that guide the model to keep anatomy and style separate during training

Anatomical Consistency

The first is the Anatomical Consistency loss. The idea here is that the anatomical information in an input image should be preserved in the synthetic output image, even if the style or appearance changes. However, we can’t compare the input and output images directly at the pixel level – since things like brightness or color may have changed. Instead, we compare their internal feature representations.

Specifically, after generating a synthetic image by combining anatomy from one image and characteristics from another, the synthetic image is passed back through the Vision Transformer encoder. This gives a new representation where the anatomical and characteristic features should still be separated. To enforce consistency, we apply a mean squared error (MSE) loss between the anatomical part of the original image’s features and the anatomical part of the synthetic image’s features. This encourages the model to keep the anatomical content stable, even when other parts of the image vary.
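As a minimal sketch, the loss boils down to an MSE over the first half of two feature vectors. The vectors below are made-up numbers standing in for encoder outputs, and the function names are chosen for this example.

```python
def mse(u, v):
    """Mean squared error between two equally long vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)

def anatomy_half(z):
    """First half of the embedding, which is meant to hold the anatomical features."""
    return z[: len(z) // 2]

def anatomical_consistency_loss(z_source, z_synthetic):
    """Compare the anatomy of the source image with the anatomy of the synthetic image."""
    return mse(anatomy_half(z_source), anatomy_half(z_synthetic))

z_a = [1.0, 2.0, 0.3, 0.4]      # encoder output for Image A (made-up)
z_synth = [1.0, 3.0, 0.9, 0.8]  # encoder output for the re-encoded synthetic image (made-up)
loss = anatomical_consistency_loss(z_a, z_synth)  # ((1-1)^2 + (2-3)^2) / 2 = 0.5
```

Note that the second half of both vectors is ignored here: the characteristics are allowed, and in fact expected, to differ.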

Fig. 7: Anatomical Consistency – The synthetic image is passed back through the ViT to align anatomy with the source

Characteristic Consistency

The second loss is Characteristic Consistency. This works in a similar way but focuses on the characteristic features instead. It once again utilizes the synthetic image after it has been processed by the Vision Transformer. This time, we compare the characteristic features of the synthetic image to the ones that were originally taken from a different image during the mixing process. Again, this comparison is done using MSE loss. The goal is to ensure that the style or appearance injected into the synthetic image actually matches what was intended, helping the model to better learn and control image characteristics separately from anatomy.
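In code, this is an MSE over the second half of the feature vectors, where the reference is the image that donated its characteristics. Again, the numbers and function names are illustrative assumptions rather than the authors' implementation.

```python
def mse(u, v):
    """Mean squared error between two equally long vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)

def characteristic_half(z):
    """Second half of the embedding, which is meant to hold the characteristic features."""
    return z[len(z) // 2:]

def characteristic_consistency_loss(z_style_source, z_synthetic):
    """Compare the injected characteristics with those recovered from the synthetic image."""
    return mse(characteristic_half(z_style_source), characteristic_half(z_synthetic))

z_b = [5.0, 6.0, 0.5, 1.0]      # encoder output for Image B, the style donor (made-up)
z_synth = [1.0, 3.0, 0.5, 2.0]  # encoder output for the re-encoded synthetic image (made-up)
loss = characteristic_consistency_loss(z_b, z_synth)  # ((0.5-0.5)^2 + (1-2)^2) / 2 = 0.5
```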

Self-Reconstruction

The third and final loss is Self-Reconstruction. This ensures that the encoder and decoder together can accurately recreate the original image if no mixing happens. For this, the anatomical and characteristic features from the same image are passed through the decoder without any modification. Since no swapping occurs, the decoder should be able to fully reconstruct the original input image. We measure how close the output is to the input using MSE loss again. This helps guide the Vision Transformer to produce feature representations that are actually usable for image reconstruction and not just abstract embeddings.
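The following sketch shows the round trip on a tiny 2×2 image. The `toy_encode` and `toy_decode` functions are hypothetical placeholders (they simply store the pixels as "anatomy" and use all-ones "characteristics"), chosen so that a perfect reconstruction is possible and the loss can be checked by hand.

```python
import math

def pixel_mse(image, reconstruction):
    """Pixel-wise mean squared error between two images given as lists of rows."""
    flat_x = [p for row in image for p in row]
    flat_y = [p for row in reconstruction for p in row]
    return sum((a - b) ** 2 for a, b in zip(flat_x, flat_y)) / len(flat_x)

def self_reconstruction_loss(image, encode, decode):
    """Encode, split into halves, decode without swapping, and compare with the input."""
    z = encode(image)
    half = len(z) // 2
    return pixel_mse(image, decode(z[:half], z[half:]))

def toy_encode(image):
    """Hypothetical encoder: pixels as 'anatomy', all-ones as 'characteristics'."""
    flat = [p for row in image for p in row]
    return flat + [1.0] * len(flat)

def toy_decode(anatomy, characteristics):
    """Hypothetical decoder: element-wise product, reshaped to a square image."""
    values = [a * c for a, c in zip(anatomy, characteristics)]
    side = math.isqrt(len(values))
    return [values[i * side:(i + 1) * side] for i in range(side)]

image = [[0.2, 0.4], [0.6, 0.8]]
loss = self_reconstruction_loss(image, toy_encode, toy_decode)  # 0.0: perfect round trip
```

A decoder that ignores its input would be penalized: returning an all-zero image for the same input gives a loss of about 0.3.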

With these three losses combined, the Vision Transformer learns to structure its internal feature space in a way that clearly separates anatomy from characteristics, laying the foundation for generating diverse and realistic synthetic images that help deep learning models generalize across domains.
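Written compactly, the three terms could look as follows. The notation is an assumption introduced for this post: E_a and E_c denote the anatomical and characteristic halves of the encoder output, D is the decoder, x_A and x_B are two input images, and the weights λ are hypothetical, since how the terms are balanced is not specified here.

```latex
\begin{aligned}
\operatorname{MSE}(u, v) &= \frac{1}{n} \sum_{i=1}^{n} (u_i - v_i)^2 \\
\hat{x}_{AB} &= D\big(E_a(x_A),\, E_c(x_B)\big) \\
\mathcal{L}_{\text{anat}} &= \operatorname{MSE}\big(E_a(\hat{x}_{AB}),\, E_a(x_A)\big) \\
\mathcal{L}_{\text{char}} &= \operatorname{MSE}\big(E_c(\hat{x}_{AB}),\, E_c(x_B)\big) \\
\mathcal{L}_{\text{rec}} &= \operatorname{MSE}\big(D(E_a(x_A),\, E_c(x_A)),\, x_A\big) \\
\mathcal{L} &= \lambda_{\text{anat}}\, \mathcal{L}_{\text{anat}}
             + \lambda_{\text{char}}\, \mathcal{L}_{\text{char}}
             + \lambda_{\text{rec}}\, \mathcal{L}_{\text{rec}}
\end{aligned}
```

The first two terms operate in feature space on the re-encoded synthetic image, while the third operates in pixel space on the unswapped reconstruction.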

Final Thoughts

Limitations

While the method shows promising results and introduces a clever way to generate more diverse training data, there are still a few open questions worth mentioning. First, although the model outperforms previous approaches on key benchmarks, the improvement is sometimes only slight. This raises questions about how cleanly the model is able to separate anatomical and characteristic features. The assumption that these two components can be evenly split – half anatomy, half style – may be too simplistic for real-world medical images.

Second, there’s a potential bias issue. Some anatomical structures may mostly appear under specific scanner settings or at certain hospitals. This could lead the model to associate anatomy with particular characteristics, making it harder to disentangle the two and potentially limiting generalization.

Conclusion

The challenge of domain generalization in medical imaging remains a difficult one, but this paper offers a creative and scalable approach to tackling it. By using the three loss functions – anatomical consistency, characteristic consistency, and self-reconstruction – the model learns to produce a structured representation of each image. The first half of this representation contains the anatomical features, while the second half holds the characteristic features, such as scanner-specific variations. By recombining these halves from different images, the method can generate new synthetic images that retain real biological structure but exhibit varied styles. These synthetic samples can then be added to the training set, which helps downstream models, such as classifiers, generalize better to previously unseen domains.

List of Abbreviations

Abbreviation	Meaning
ViT	Vision Transformer
CNN	Convolutional Neural Network
HSV	Hue, Saturation, Value (color space)
MSE	Mean Squared Error

References

[1] P. Bándi, et al., “From detection of individual metastases to classification of lymph node status at the patient level: The CAMELYON17 challenge,” IEEE Transactions on Medical Imaging, vol. 38, no. 2, pp. 550–560, Feb. 2019. doi: 10.1109/TMI.2018.2867350.

[2] M.-K. Le, “Harmonizing colors across institutions: An introduction to Vahadane stain normalization in digital pathology,” Medium, Jul. 17, 2025. [Online]. Available: https://medium.com/@minhkhangle.phd/harmonizing-colors-across-institutions-an-introduction-to-vahadane-stain-normalization-in-digital-26dc97f8f31c.

[3] S. Doerrich, F. D. Salvo, and C. Ledig, “Self-supervised Vision Transformer are scalable generative models for domain generalization,” arXiv preprint, arXiv:2407.02900, Jul. 3, 2024. doi: 10.48550/arXiv.2407.02900.

[4] J.-R. Chang, et al., “Stain Mix-Up: Unsupervised domain generalization for histopathology images,” in Medical Image Computing and Computer Assisted Intervention – MICCAI 2021, M. de Bruijne, P. C. Cattin, S. Cotin, N. Padoy, S. Speidel, Y. Zheng, and C. Essert, Eds., Cham: Springer International Publishing, 2021, pp. 117–126. doi: 10.1007/978-3-030-87199-4_11.

[5] A. Vahadane, et al., “Structure-preserving color normalization and sparse stain separation for histological images,” IEEE Transactions on Medical Imaging, vol. 35, no. 8, pp. 1962–1971, Aug. 2016. doi: 10.1109/TMI.2016.2529665.

[6] 3Blue1Brown, “Transformers, the tech behind LLMs | Deep Learning Chapter 5,” YouTube, Apr. 1, 2024. Accessed: Jul. 17, 2025. [Online Video]. Available: https://www.youtube.com/watch?v=wjZofJX0v4M.

[7] G. Boesch, “Vision Transformers (ViT) in image recognition,” viso.ai, Jul. 17, 2025. [Online]. Available: https://viso.ai/deep-learning/vision-transformer-vit/.

[8] R. (MLfast.co), “YOLO data augmentation explained: Turbocharge your object detection model,” Medium, Jul. 17, 2025. [Online]. Available: https://rumn.medium.com/yolo-data-augmentation-explained-turbocharge-your-object-detection-model-94c33278303a.