Understanding Content: Paraphrasing wav2vec 2.0
Introduction
Before criticizing a publication, it is crucial to fully understand its content; otherwise the critique is likely to be unjust. The choice of language for criticism is another critical element, which is discussed in a separate, dedicated blogpost. The interesting point about language is that it is the sole means of knowledge transfer, the so-called Transfer Learning. Without such transfer, knowledge has to be reinvented again and again, as was the case before the invention of languages.
In neural networks, Transfer Learning refers to pre-training a model on one task, traditionally in a supervised fashion where the training data are labeled, and then reusing it to solve other tasks. Reusing a pre-trained model is attractive because training a new model from scratch requires a vast amount of computational resources.
In speech recognition, labeled data refers to audio files accompanied by transcriptions, which serve as ground truth. Once transcriptions are available, the training of neural networks is straightforward. However, transcriptions of audio files are not available for the majority of languages spoken worldwide. As an example, consider the many languages spoken in India, many of which lack the transcribed audio needed for speech recognition applications. Therefore, other ways of training speech recognition models are necessary to compensate for the lack of labeled data. This is where self-supervised training comes in.
In contrast to supervised training, self-supervised training has emerged as a desirable training method because unlabeled data are far more accessible than labeled data. Self-supervised learning first learns data representations from unlabeled examples; in a second step, the model is fine-tuned through supervised learning on a limited amount of labeled data.
The framework of wav2vec 2.0 is based on self-supervised learning of representations from raw audio data. Figure 1 illustrates the wav2vec 2.0 framework. The raw audio waveform (X) passes through the multi-layer convolutional feature encoder, resulting in latent speech representations (Z). These representations are then randomly masked before being fed to a Transformer network, which builds the contextualized representations (C). In parallel, product quantization is applied to Z (the output of the feature encoder), resulting in target representations (Q). The model is trained by minimizing a contrastive loss: for each masked time step, the context representation in C has to identify the true quantized target in Q among a set of distractors. Put more simply, comparing C and Q over the masked parts is what drives the model to learn representations of speech audio.
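To make the comparison of C and Q more concrete, here is a minimal sketch of a contrastive loss for a single masked time step. It is not the authors' implementation: the cosine similarity, the temperature of 0.1 and the use of distractors follow my reading of the wav2vec 2.0 paper, and the function names and toy vectors are hypothetical, chosen only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(c_t, q_true, distractors, temperature=0.1):
    """Negative log-probability of the true target among all candidates.

    c_t         -- context vector at one masked time step (from C)
    q_true      -- quantized target for the same time step (from Q)
    distractors -- quantized vectors taken from other time steps
    temperature -- assumed value; scales the similarities before the softmax
    """
    candidates = [q_true] + list(distractors)
    logits = [cosine_similarity(c_t, q) / temperature for q in candidates]
    # Numerically stable log-sum-exp over all candidates.
    top = max(logits)
    log_denominator = top + math.log(sum(math.exp(v - top) for v in logits))
    return log_denominator - logits[0]


# Hand-picked toy vectors (hypothetical values, not real model outputs).
q_true = [1.0, 0.0, 0.0, 0.0]
distractors = [
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
good_context = [0.9, 0.1, 0.0, 0.0]  # points towards the true target
bad_context = [0.0, 0.0, 1.0, 0.0]   # points towards a distractor

loss_good = contrastive_loss(good_context, q_true, distractors)
loss_bad = contrastive_loss(bad_context, q_true, distractors)
print(loss_good, loss_bad)
```

The loss is small when the context vector is close to its own quantized target and large when it is closer to a distractor, which is the behaviour the training objective rewards.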
One of the distinguishing properties of the wav2vec 2.0 model is the use of the Transformer architecture and the quantization module in parallel. This leads to an end-to-end approach where context representations are created directly from the masked latent speech representations fed into the Transformer. Previous models, such as vq-wav2vec, relied on a two-step pipeline where the quantized representations are first aggregated into context representations and, in a second step, BERT is applied to the discretized sequence.
Figure 1: The architecture of the wav2vec 2.0 framework, depicting its most important components. Figure courtesy of the wav2vec 2.0 paper.
Model
The model architecture is summarized as follows:
- Feature encoder (f: X → Z): consists of several blocks, each containing a temporal convolution followed by layer normalization and a GELU activation. The feature encoder takes the raw audio as input and embeds it into latent speech representations z1, ..., zT for T time-steps, with an encoder output frequency of 49 Hz and a stride of around 20 ms.
- Contextualized representations with Transformer (g: Z → C): the context network takes the output of the feature encoder as input, which is partially masked before being fed to the network. Instead of fixed positional embeddings, it uses a convolutional layer that acts as a relative positional embedding.
- Quantization module (Z → Q): the output of the feature encoder (which is fed to the Transformer in parallel) is discretized to a finite set of speech representations using product quantization. In product quantization, multiple codebooks are used: one quantized representation is chosen from each codebook, and the chosen entries are then concatenated. Considering Figure 2, given only one codebook with V entries, a single latent speech representation is mapped to a logit vector of size V. Subsequently, Gumbel softmax is applied to calculate the probability of each codebook entry. Finally, the codeword is chosen by taking the argmax of the probability vector. This process is performed for all latent speech representations z1, ..., zT.
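The 49 Hz and 20 ms figures quoted for the feature encoder can be checked with a little arithmetic. The sketch below assumes the kernel widths and strides of the seven convolutional blocks as I recall them from the wav2vec 2.0 paper, a 16 kHz input and no padding; none of these values are stated in this post, so treat them as assumptions.

```python
# Kernel widths and strides of the seven convolutional blocks
# (assumed from the wav2vec 2.0 paper, not stated in this post).
KERNELS = (10, 3, 3, 3, 3, 2, 2)
STRIDES = (5, 2, 2, 2, 2, 2, 2)
SAMPLE_RATE = 16_000  # Hz, assumed sampling rate of the raw audio


def num_frames(num_samples):
    """Number of latent vectors z_t produced for a waveform (no padding)."""
    length = num_samples
    for kernel, stride in zip(KERNELS, STRIDES):
        # Standard output-length formula for an unpadded 1-D convolution.
        length = (length - kernel) // stride + 1
    return length


def total_stride():
    """Hop between two consecutive latent vectors, in input samples."""
    hop = 1
    for stride in STRIDES:
        hop *= stride
    return hop


frames_per_second = num_frames(SAMPLE_RATE)
stride_ms = 1000 * total_stride() / SAMPLE_RATE
print(frames_per_second, stride_ms)
```

Under these assumptions, one second of audio yields 49 latent vectors, and consecutive vectors are 320 samples (20 ms) apart, which matches the numbers given for the feature encoder.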

Figure 2: The principle of discretizing the feature encoder output z to a finite set of quantized speech representations, as depicted in the vq-wav2vec paper.
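The quantization procedure can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the authors' code: the logits are given directly instead of being produced by a learned mapping from z, the codebooks are tiny hand-written lists, and all function names are hypothetical.

```python
import math
import random


def gumbel_softmax(logits, temperature, rng):
    """Softmax over logits perturbed with Gumbel(0, 1) noise."""
    noisy = []
    for logit in logits:
        # Clamp the uniform sample so the double logarithm stays finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel_noise = -math.log(-math.log(u))
        noisy.append((logit + gumbel_noise) / temperature)
    top = max(noisy)
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def quantize(logits, codebook, temperature=1.0, rng=None):
    """Pick one codeword from a single codebook with V entries."""
    rng = rng or random.Random()
    probs = gumbel_softmax(logits, temperature, rng)
    index = max(range(len(probs)), key=probs.__getitem__)  # argmax
    return index, codebook[index]


def product_quantize(logits_per_group, codebooks, temperature=1.0, rng=None):
    """Choose one entry per codebook and concatenate the chosen entries."""
    rng = rng or random.Random()
    quantized = []
    for logits, codebook in zip(logits_per_group, codebooks):
        _, codeword = quantize(logits, codebook, temperature, rng)
        quantized.extend(codeword)
    return quantized


# Two toy codebooks with V = 3 entries of dimension 2 (hypothetical values).
codebooks = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    [[1.0, 1.1], [1.2, 1.3], [1.4, 1.5]],
]
# Logits for one latent vector z_t, one logit vector per codebook.
logits_per_group = [[0.2, 2.5, -1.0], [1.5, 0.1, 0.3]]

q_t = product_quantize(logits_per_group, codebooks, rng=random.Random(0))
print(q_t)
```

Because of the Gumbel noise, the chosen entry is random but favours codebook entries with larger logits. During training, a differentiable variant is needed so that gradients can flow through the selection; the sketch covers only the forward selection step.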
Training
Fine-tuning

Dear Mohsen,
Thank you very much for this post. I'm currently working on my first project thesis, which deals with the prediction of frames of speech. That's why I'm really interested in your work and your thoughts. Most state-of-the-art approaches in speaker recognition as well as ASR use self-supervised pre-training, often on the raw audio data. By doing so, better features can be extracted than if, for example, a Mel-spectrogram or MFCCs are used. However, this does not really work for my project thesis, since the raw audio contains a lot of unnecessary information and therefore it is easier to predict spectrograms.
You did a really good job explaining the paper and how wav2vec 2.0 works. I think this model is really helpful to fine-tune on a specific problem. However, it suffers from the same issue most self-supervised deep learning approaches do: it needs a really long pre-training with a lot of resources.
As you know, I'm focusing on Transformer networks in the AI seminar and I have a related question. Do you know if the encoder and the decoder of the Transformer were used? Based on your description, the decoder could be sufficient. If I remember correctly, this was also done for the pretraining of BERT.
Keep up the good work, see you on Monday
Pascal
Dear Pascal, Thank you for reading my blogpost. It's cool that we have similar projects outside this course as well :) Indeed, it needs lots of resources for the pretraining, but the good point is that it only needs unlabeled data for pre-training, so that makes everything easier.
Best
Mohsen