Posted by Sneha Kudugunta, Research Software Engineer and Orhan Firat, Research Scientist, Google Research
Scaling large language models has resulted in significant quality improvements natural language understanding (T5), generation (GPT-3) and multilingual neural machine translation (M4). One common approach to building a larger model is to increase the depth (number of layers) and width (layer dimensionality), simply enlarging existing dimensions of the network. Such dense models take an input sequence (divided into smaller components, called tokens) and pass every token through the full network, activating every layer and parameter. While these large, dense models have achieved state-of-the-art results on multiple natural language processing (NLP) tasks, their training cost increases linearly with model size.
An alternative, and increasingly popular, approach is to build sparsely activated models based on a mixture of experts (MoE) (e.g., GShard-M4 or GLaM), where each token passed to the network follows a separate subnetwork by skipping some of the model parameters. The choice of how to distribute the input tokens to each subnetwork (the “experts”) is determined by small router networks that are trained together with the rest of the network. This allows researchers to increase model size (and hence, performance) without a proportional increase in training cost.
While this is an effective strategy at training time, sending tokens of a long sequence to multiple experts, again makes inference computationally expensive because the experts have to be distributed among a large number of accelerators. For example, serving the 1.2T parameter GLaM model requires 256 TPU-v3 chips. Much like dense models, the number of processors needed to serve an MoE model still scales linearly with respect to the model size, increasing compute requirements while also resulting in significant communication overhead and added engineering complexity.
In “Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference”, we introduce a method called Task-level Mixture-of-Experts (TaskMoE), that takes advantage of the quality gains
This article is purposely trimmed, please visit the source to read the full article.