Transformers have been successfully applied to a wide variety of modalities: natural language, vision, protein modeling, music, robotics, and more. A common trend with using large models is to train a transformer on a large amount of training data, and then finetune it on a downstream task. This enables the models to utilize generalizable high-level embeddings trained on a large dataset to avoid overfitting to a small task-relevant dataset.
We investigate a new setting where instead of transferring the high-level embeddings, we instead transfer the intermediate computation modules – instead of pretraining on a large image dataset and finetuning on a small image dataset, we might instead pretrain on a large language dataset and finetune on a small image dataset. Unlike conventional ideas that suggest the attention mechanism is specific to the training modality, we find that the self-attention layers can generalize to other modalities without finetuning.
To illustrate this, we take a pretrained transformer language model and finetune it on various classification tasks: numerical computation, vision, and protein fold prediction. Then, we freeze all the self-attention blocks except for the layer norm parameters. Finally, we add a new linear input layer to read in the new type of input, and reinitialize a linear output layer to perform classification on the new task. We refer to this as “Frozen Pretrained Transformer”.
Across the tasks, a token fed to the model represents a small amount of information: for example, it could be a single bit, or a 4×4 image patch. In particular, the tokens can only communicate with each other via the self-attention mechanism, which is not being trained at all on the downstream task. We investigate if these mechanisms – learned exclusively from natural language data – can be used for another modality in zero shot.
We show test accuracies for a variety of tasks below. We FPT can match or improve the performance of training
This article is purposely trimmed, please visit the source to read the full article.
The post Pretrained Transformers as Universal Computation Engines appeared first on The Berkeley Artificial Intelligence Research Blog.