Understanding Content: Paraphrasing wav2vec 2.0

Introduction

Before criticizing a publication, it is crucial to fully understand its content; otherwise the criticism is likely to be unjust. The language used for criticism is another critical element and is discussed in a separate, dedicated blog post. The interesting point about language is that it is the sole means of knowledge transfer, a kind of Transfer Learning. Without such transfer, knowledge has to be reinvented over and over, as was the case before the invention of languages.

In neural networks, Transfer Learning refers to reusing a pre-trained model for solving other tasks, since training a new model from scratch requires a vast amount of computational resources. Traditionally, the pre-training is done in a supervised fashion, where the training data are labeled.

In speech recognition, labeled data refers to audio files accompanied by transcriptions, which serve as ground truth. Once transcriptions are available, training a neural network is straightforward. However, transcribed audio is not available for the majority of languages spoken worldwide. As an example, consider the many languages spoken in India, many of which lack transcribed audio for speech recognition applications. Other ways of training speech recognition models are therefore necessary to compensate for the lack of labeled data, which is where self-supervised training comes in.

In contrast to supervised training, self-supervised training has emerged as a desirable training method because unlabeled data are more accessible than labeled data. Self-supervised learning first learns data representations from unlabeled examples; in a second step, the model is fine-tuned through supervised learning with a limited amount of labeled data.

The framework of wav2vec 2.0 is based on self-supervised learning of representations from raw audio data. Figure 1 illustrates the wav2vec 2.0 framework. The raw audio waveform (X) passes through the multi-layer convolutional feature encoder, resulting in latent speech representations (Z). These representations are randomly masked before being fed to a Transformer network, which builds the contextualized representations (C). In parallel, product quantization is applied to the unmasked Z (the output of the feature encoder), resulting in target representations (Q). The model is trained by minimizing a contrastive loss: at each masked time step, the contextualized representation in C has to identify the true quantized target in Q among a set of distractors, and solving this task results in learning representations of speech audio.

One of the unique properties of the wav2vec 2.0 model is that the Transformer and the quantization module operate in parallel. This leads to an end-to-end approach where context representations are built directly from the masked latent speech representations fed into the Transformer. Previous models, such as vq-wav2vec, relied on a two-step pipeline: the quantized representations were first aggregated into context representations and, in a second step, BERT was applied to the discretized sequence.

Figure 1: The architecture of the wav2vec 2.0 framework, depicting its most important components. Figure courtesy of the wav2vec 2.0 paper.

Model

The model architecture is summarized as follows:

Feature encoder (f: X → Z): consists of several blocks, each containing a temporal convolution followed by layer normalization and a GELU activation. The feature encoder takes the raw audio as input and embeds it into latent speech representations z₁, …, z_T for T time steps, with an encoder output frequency of 49 Hz and a stride of around 20 ms between consecutive representations.
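The 49 Hz and 20 ms figures for the feature encoder can be checked with a little arithmetic. The sketch below assumes the layer configuration described in the wav2vec 2.0 paper (seven convolutions with kernel widths 10, 3, 3, 3, 3, 2, 2 and strides 5, 2, 2, 2, 2, 2, 2, no padding) and 16 kHz input audio. These values are not stated in this post, and the function names are made up for illustration.

```python
# Assumed layer configuration (from the wav2vec 2.0 paper, not from this post).
KERNELS = (10, 3, 3, 3, 3, 2, 2)
STRIDES = (5, 2, 2, 2, 2, 2, 2)
SAMPLE_RATE = 16_000  # Hz, assumed sampling rate of the raw audio


def num_output_frames(num_samples, kernels=KERNELS, strides=STRIDES):
    """Number of latent vectors z_t produced for a waveform (no padding)."""
    length = num_samples
    for kernel, stride in zip(kernels, strides):
        length = (length - kernel) // stride + 1
    return length


def total_stride(strides=STRIDES):
    """Input samples between two consecutive latent vectors."""
    total = 1
    for stride in strides:
        total *= stride
    return total


def receptive_field(kernels=KERNELS, strides=STRIDES):
    """Input samples seen by a single latent vector."""
    field, jump = 1, 1
    for kernel, stride in zip(kernels, strides):
        field += (kernel - 1) * jump
        jump *= stride
    return field


frames_per_second = num_output_frames(SAMPLE_RATE)     # one second of audio
stride_ms = 1000 * total_stride() / SAMPLE_RATE
field_ms = 1000 * receptive_field() / SAMPLE_RATE
print(frames_per_second, stride_ms, field_ms)  # 49 20.0 25.0
```

Under these assumptions, one second of audio yields 49 latent vectors, spaced 320 samples (20 ms) apart, each covering 400 samples (25 ms) of the waveform.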

Contextualized representations with Transformer (g: Z → C): the context network takes the output of the feature encoder as input, partially masked before being fed in. Instead of fixed positional embeddings, it uses a convolutional layer that acts as a relative positional embedding.

Quantization module (Z → Q): the output of the feature encoder (which is fed to the Transformer in parallel) is discretized to a finite set of speech representations using product quantization. In product quantization, one entry is chosen from each of multiple codebooks and the chosen entries are then concatenated. Considering Figure 2 with only one codebook of V entries, a single latent speech representation is mapped to a logit vector of size V. Subsequently, Gumbel softmax is applied to calculate the probability of each codebook entry. Finally, the codeword is chosen by taking the argmax of the probability vector. This process is performed for all latent speech representations z₁, …, z_T.

Figure 2: The principle of discretizing the feature encoder output z to a finite set of quantized speech representations, as depicted in the vq-wav2vec paper.
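To make the quantization step more concrete, here is a minimal sketch in plain Python of how one latent vector could be turned into a quantized vector: logits per codebook, Gumbel softmax, argmax, concatenation. The toy codebooks, the logits and the function names are all hypothetical. A real model would compute the logits from z with a learned linear layer and train the codebooks, which this sketch leaves out.

```python
import math
import random


def gumbel_softmax(logits, temperature, rng):
    """Turn logits into probabilities after adding Gumbel noise."""
    noisy = []
    for logit in logits:
        u = rng.random()
        while u == 0.0:  # avoid log(0)
            u = rng.random()
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    peak = max(noisy)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def quantize(logits_per_codebook, codebooks, temperature=1.0, rng=None):
    """Pick one entry per codebook via argmax and concatenate the entries."""
    rng = rng or random.Random()
    indices, quantized = [], []
    for logits, codebook in zip(logits_per_codebook, codebooks):
        probs = gumbel_softmax(logits, temperature, rng)
        index = max(range(len(probs)), key=probs.__getitem__)
        indices.append(index)
        quantized.extend(codebook[index])
    return indices, quantized


# Toy setup: 2 codebooks with V = 3 entries of dimension 2 each (made up).
codebooks = [
    [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1]],
    [[5.0, 5.1], [6.0, 6.1], [7.0, 7.1]],
]
# Hypothetical logits for one latent vector z_t, one logit vector per codebook.
logits = [[0.2, 3.0, -1.0], [1.5, 0.0, 0.3]]
indices, q_t = quantize(logits, codebooks, rng=random.Random(0))
print(indices, q_t)
```

Because of the Gumbel noise, the chosen entry is not always the one with the largest logit, which is what lets the model explore different codebook entries during training. The argmax itself is not differentiable; the paper handles this with a straight-through estimator, which is beyond the scope of this sketch.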

Training

The latent speech representations (Z, the feature encoder outputs) are partially masked before being fed to the Transformer network. This is similar to masked language modeling in BERT. The masking is performed by randomly sampling, without replacement, a proportion p of all time steps as starting indices and then masking the subsequent M consecutive time steps from every sampled index. Because the starting indices are chosen at random, spans may overlap.
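The masking procedure can be sketched in a few lines of Python. The defaults below (a fraction of 0.065 of all time steps used as span starts, spans of 10 steps) are the values reported in the wav2vec 2.0 paper and are an assumption as far as this post is concerned; the function name and the rounding of the number of starts are my own choices.

```python
import random


def sample_mask(num_steps, p=0.065, span=10, rng=None):
    """Return a boolean mask over time steps plus the sampled span starts.

    p    -- fraction of time steps used as span starts (assumed paper value)
    span -- number of consecutive steps masked per start, M (assumed paper value)
    """
    rng = rng or random.Random()
    num_starts = int(round(p * num_steps))
    # Sampling without replacement: every start index is distinct.
    starts = rng.sample(range(num_steps), num_starts)
    mask = [False] * num_steps
    for start in starts:
        # Spans are clipped at the end of the sequence and may overlap.
        for t in range(start, min(start + span, num_steps)):
            mask[t] = True
    return mask, sorted(starts)


mask, starts = sample_mask(200, rng=random.Random(1))
print(len(starts), "span starts,", sum(mask), "of", len(mask), "steps masked")
```

Since spans can overlap or run past the end of the sequence, the number of masked steps is at most the number of starts times the span length, and usually noticeably less.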

The objective of the pre-training is to learn representations of speech audio by solving a contrastive task with loss L_m: for each masked time step, the model has to identify the true quantized latent speech representation among a set of distractors. The distractors are obtained by uniformly sampling other masked time steps of the same utterance. In addition, a diversity loss L_d encourages the model to make more equal use of the quantized codebook entries.
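As a rough illustration of the two losses, the following sketch computes them for toy vectors in plain Python. It assumes, following the paper, that similarity is measured by cosine similarity scaled by a temperature (0.1 here) and that the diversity loss is the negative entropy of the average codebook usage. The vectors and function names are invented for the example.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(context, target, distractors, temperature=0.1):
    """L_m for one masked step: -log softmax score of the true target."""
    candidates = [target] + list(distractors)
    scores = [cosine_similarity(context, q) / temperature for q in candidates]
    peak = max(scores)  # log-sum-exp with the maximum subtracted for stability
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return log_norm - scores[0]


def diversity_loss(avg_probs_per_codebook):
    """L_d: negative entropy of average codebook usage, scaled by 1 / (G * V)."""
    num_codebooks = len(avg_probs_per_codebook)
    num_entries = len(avg_probs_per_codebook[0])
    total = 0.0
    for probs in avg_probs_per_codebook:
        total += sum(p * math.log(p) for p in probs if p > 0.0)
    return total / (num_codebooks * num_entries)


# Toy example: context vector c_t, true target q_t and two distractors.
c_t = [1.0, 0.0]
q_t = [0.9, 0.1]
distractors = [[0.0, 1.0], [-1.0, 0.2]]
loss_good = contrastive_loss(c_t, q_t, distractors)
loss_bad = contrastive_loss(c_t, distractors[0], [q_t, distractors[1]])
print(loss_good < loss_bad)  # True

uniform_usage = diversity_loss([[0.25, 0.25, 0.25, 0.25]])
peaked_usage = diversity_loss([[0.97, 0.01, 0.01, 0.01]])
print(uniform_usage < peaked_usage)  # True
```

The contrastive loss is small when the context vector is closest to the true target, and the diversity loss is smallest when all codebook entries are used equally often. In the paper the two are combined as a weighted sum.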

Fine-tuning

Once the model is pre-trained, fine-tuning is performed on labeled data using a Connectionist Temporal Classification (CTC) loss. For this purpose, a randomly initialized linear layer is added on top of the Transformer network, with one output per token in the vocabulary of the task. For Librispeech, there are 29 tokens for character targets plus a word boundary token. CTC also relies on a special "blank token", which is a different symbol from the word boundary token and is used to correctly align the input and the output. The model is then optimized by minimizing the CTC loss. For an in-depth, step-by-step walkthrough of fine-tuning wav2vec 2.0, the well-written blog post by Patrick von Platen is recommended.
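To see what the blank token does in practice, consider how a CTC output is usually read off after fine-tuning: take the most likely token per frame, collapse repeated tokens, and then drop the blanks. The sketch below shows this greedy decoding on a hand-written frame sequence. The symbols chosen for the blank and the word boundary are hypothetical, and this is only the decoding rule, not the CTC loss itself.

```python
BLANK = "_"          # hypothetical symbol for the CTC blank token
WORD_BOUNDARY = "|"  # hypothetical symbol for the word boundary token


def ctc_greedy_decode(frame_tokens):
    """Collapse repeated tokens, drop blanks, turn word boundaries into spaces."""
    collapsed = []
    previous = None
    for token in frame_tokens:
        if token != previous:
            collapsed.append(token)
        previous = token
    chars = [token for token in collapsed if token != BLANK]
    return "".join(chars).replace(WORD_BOUNDARY, " ").strip()


# One predicted token per frame (written by hand for this example).
frames = list("hh_e_ll_l_oo|_w_oo_r_ld")
print(ctc_greedy_decode(frames))  # hello world
```

Note how the blank between the two runs of "l" is what keeps the double letter in "hello" from being collapsed into one, while the word boundary token ends up as the space between the words.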

Results and Conclusion

wav2vec 2.0 demonstrates that pre-training on unlabeled data, followed by fine-tuning on a limited amount of labeled data, achieves results competitive with state-of-the-art Automatic Speech Recognition (ASR). Wav2vec 2.0 yields a word error rate (WER) of 7.6 % using just 1 hour of labeled data, outperforming the previous state-of-the-art model, which reached a WER of 8.6 % with a 100-hour labeled dataset.
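For readers unfamiliar with the metric: WER is the word-level edit distance (substitutions, deletions and insertions) between the recognized text and the reference transcription, divided by the number of words in the reference. A minimal sketch, with a made-up sentence pair and no text normalization:

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] = edit distance between the ref words seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(round(100 * wer, 1))  # 33.3
```

Here one word is substituted and one is deleted, giving 2 errors over 6 reference words. A WER of 7.6 % therefore corresponds to fewer than 8 word errors per 100 reference words.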

Ultimately, wav2vec 2.0 demonstrates that self-supervised learning on unlabeled data makes ultra-low resource speech recognition possible, which matters because labeled data is next to nonexistent for the vast majority of languages spoken worldwide.

Comments

Pascal Sager, 30 April 2021, 02:46
Dear Mohsen,

Thank you very much for this post. I'm currently working on my first project thesis, which deals with the prediction of frames of speech. That's why I'm really interested in your work and your thoughts. Most state-of-the-art approaches in speaker recognition as well as ASR use self-supervised pre-training, often on the raw audio data. By doing so, better features can be extracted than, for example, if a Mel spectrogram or MFCCs are used. However, this does not really work for my project thesis, since the raw audio contains a lot of unnecessary information and it is therefore easier to predict spectrograms.

You did a really good job explaining the paper and how wav2vec 2.0 works. I think this model is really helpful to fine-tune on a specific problem. However, it suffers from the same issue most self-supervised deep learning approaches do: it needs a really long pre-training with a lot of resources.

As you know, I'm focusing on Transformer networks in the AI seminar and I have a related question. Do you know if the encoder and the decoder of the Transformer were used? Based on your description, the decoder could be sufficient. If I remember correctly, this was also done for the pretraining of BERT.

Keep up the good work, see you on Monday
Pascal