wav2vec 2.0: Reviewing and Criticizing content
Introduction
The main prerequisite for reviewing and criticizing an article is a complete understanding of its content and of the intent of the author(s) in publishing it. The following are the review criteria set by IJCAI (International Joint Conference on Artificial Intelligence):
- Originality,
- Significance,
- Relevance,
- Technical quality,
- Clarity and Quality of writing,
- Scholarship (scientific context).
When criticizing, the reviewer needs to present the strengths and weaknesses of the publication accurately and to highlight the knowledge of the authors and their contribution to the field of interest. Furthermore, any gaps and contradictions found in the article should be pointed out. All standpoints of the reviewer must be supported by facts pertinent to that area of knowledge. It is also worth noting that it is the duty of the author(s) to provide the audience with an interpretation and analysis that demonstrate the value of the publication.
Originality (10)
The paper under review, wav2vec 2.0, clearly states the aim of the research and concisely conveys its importance to the research field, which is speech recognition. The authors clearly describe the weaknesses of current speech recognition systems and put forward a new framework based on self-supervised learning. The introduction of the article provides sufficient background information to give the audience a better understanding of the underlying problem. The originality of the research topic is unquestionable, considering that it is a sequel and an enhancement to previously published work by the same authors.
Significance (10)
As claimed by the authors and supported by their results, wav2vec 2.0 demonstrates state-of-the-art speech recognition through "learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech ...". In addition, this novel method is conceptually simpler than previously established models and outperforms the best semi-supervised methods.
Relevance (9)
As the paper was presented at the 34th Conference on Neural Information Processing Systems (NeurIPS 2020), it is considered to be relevant and in line with the content of the conference. It is also relevant to the interests of researchers in the fields of speech recognition and neural networks.
Technical Quality (7)
The authors present their framework concisely, in a couple of paragraphs. This may be considered too dense, as many details are left for the reader to work out through the many citations in the text. The results are tabulated in several tables, following an extensive section on the experimental setup. This section is hard to follow, however, because the information is very detailed, and replicating the results is therefore not straightforward. Consequently, the reader needs to do further research in order to replicate the tabulated results.
The strength of the text lies in the relevant literature it cites, although having to consult that literature in order to understand the model is a weakness. The details presented in the paper are heavily based on previous research. Consequently, a lack of understanding of the previous works will undermine the comprehension of the ideas provided in the current work. This is considered a weakness in the quality of the paper. An improvement to the paper would be to provide an in-text discussion of the details.
Clarity and quality of writing (6)
Moreover, the authors briefly discuss the theoretical background of training their model before outlining the experimental setup. The former, in contrast to the latter, is very concise, and once again the relevant literature has to be consulted. The experimental setup is dense with technical details but offers no clear methodology for reproducing the experiments. However, this weakness is common to all "papers with code", which are impractical without the corresponding code. Furthermore, the results are tabulated in several tables with sufficient detail to reproduce them once the code is available.
Scholarship (8)
The concept of wav2vec 2.0 is briefly compared to previous speech recognition models. Additionally, the obtained results are clearly presented and compared to those of previous state-of-the-art models. The relevant literature is also cited. However, a clear discussion of the differences between the models is missing, and one has to look up the relevant citations for more insight.
Overall Score (8)
Finally, the main body of the paper concludes with the previously presented data, as well as an outlook on further improvements of the model. The paper presents a balanced survey of the literature through many citations, though without in-text discussion. The data presented are put into context, as is their relevance to the previous models.
Dear Mohsen,
Thanks for the review of the paper. I was wondering how you would explain the originality of the previous research work? Maybe I missed it, but what was the initial original idea that led to this follow-up paper?
Best,
Sebastian
Dear Sebastian,
Thanks for the comment. Basically, the idea is to use raw unlabeled data for training and labeled data for fine-tuning. The motivation is the lack of labeled data, or of a sufficient amount of it, for most of the languages in the world, since training requires a lot of data.
The idea has stayed the same, but the model has changed a couple of times, from wav2vec to vq-wav2vec and now wav2vec 2.0.
I hope this answers your question.
Best
Mohsen
Dear Mohsen,
Thank you for your review of the paper wav2vec 2.0. I find your introduction, in which you describe what is important in a review, very interesting.
I can only agree with your criticism: I read this paper a few weeks ago during the project thesis I'm currently writing. In order to understand it properly, I often missed the details. I then read the previous papers (wav2vec and vq-wav2vec), which helped me to understand the idea.
Anyway, I completely agree with your review and think you did a good job.
Best, Pascal
Dear Pascal,
Thanks for your comment. Yes, the previous papers are actually necessary in order to understand this one, as a lot of details about the model are not properly explained in the wav2vec 2.0 paper.
Best,
Mohsen
Dear Mohsen
Thank you for your detailed review of the paper. Even though I haven't read the paper yet, by reading your post I now have a good overview of its strengths and weaknesses. Great job.
All the best,
Afrooz