PhD: Exploration of speech processing and speech recognition methods for subtitling and closed captioning audiovisual documents

Supervisors: Irina Illina and Dominique Fohr, Multispeech team, LORIA-INRIA, Nancy

Duration: 3 years

Starting time: As soon as possible

Required skills: Knowledge of statistics, programming (Python), and deep learning (TensorFlow or PyTorch).

Motivation and context

More and more audiovisual documents are available on TV and on the Internet. If these documents are in a language that the viewer cannot understand, they must be subtitled or dubbed. In addition, closed captioning is essential for hearing-impaired viewers to follow an audiovisual program. French law requires TV channels with a market share greater than 2.5% to caption all their programs.

Subtitling and closed captioning require several steps:

  • Speech segmentation: this step consists of detecting the segments during which people speak, and for which subtitles will need to be created and displayed.
  • Dialogue transcription: this step consists of transcribing the spoken words.
  • Translation: if the audiovisual document is in a language other than the language desired for the subtitling, a translation must be carried out.
  • Adaptation: the text obtained in the previous steps must be modified to meet the constraints of subtitling: there is a maximum number of characters that a viewer is able to read comfortably (approximately 12 characters per second). Rephrasing may be necessary in order to shorten the text to achieve an acceptable caption while retaining the original meaning.
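The reading-speed constraint in the adaptation step can be illustrated with a small calculation. The sketch below is only an illustration, not part of the proposed thesis work: it assumes the limit is exactly 12 characters per second, that spaces and punctuation count as characters, and that a caption is displayed for the whole duration of its speech segment. The function names are hypothetical.

```python
# Assumed limit, taken from the "approximately 12 characters per second" figure.
MAX_CHARS_PER_SECOND = 12.0


def max_caption_length(start_s: float, end_s: float,
                       rate: float = MAX_CHARS_PER_SECOND) -> int:
    """Return the largest number of characters a viewer can comfortably
    read while a caption is displayed from start_s to end_s (seconds)."""
    duration = end_s - start_s
    if duration <= 0:
        raise ValueError("a caption must end after it starts")
    return int(duration * rate)


def needs_rephrasing(text: str, start_s: float, end_s: float) -> bool:
    """Return True if the text is too long for its display interval
    (assumption: every character, including spaces, counts)."""
    return len(text) > max_caption_length(start_s, end_s)


if __name__ == "__main__":
    original = "I really do not think that we should go there tonight."
    shortened = "We should not go tonight."
    # Both candidate captions are shown during the same 2.5-second segment.
    for caption in (original, shortened):
        print(caption, "->", needs_rephrasing(caption, 10.0, 12.5))
```

With these assumptions, a 2.5-second segment allows at most 30 characters, so the first sentence would need to be shortened while the second would fit.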

In the case of dubbing, a constraint is added: the words chosen during the translation / adaptation phase must correspond to the movements of the actors’ lips. In addition, since a character from the show or film must be dubbed by the same person, all the sentences spoken by a character must be grouped together (diarization).


Subtitling, closed captioning, and dubbing are very expensive because each step is done manually and takes a lot of time. The goal of this thesis is to automate certain steps of subtitling in order to produce subtitles more quickly and at a lower cost. These steps pose many research problems due to the specific characteristics of audiovisual documents: various background noises, music, songs, accents, etc. In addition, audiovisual documents can be of many different kinds: TV shows, documentaries, news, sports, political debates, movies, cartoons, etc. The proposed methods must be robust to these different conditions.

Recently developed methodologies for speech processing will be at the center of this thesis. In recent years, methods based on deep learning have shown remarkable performance in many tasks of automatic speech processing: speech recognition, segmentation, etc. In this thesis, the aim is to develop new methodologies based on neural networks to improve the performance of dubbing or subtitling systems. The developed methods will be evaluated on real data: TV shows or movies for which reference subtitles are available. A large corpus of various movies and TV shows will be available to perform supervised learning of the developed models.


The CIFRE PhD thesis will take place in the Multispeech team of the LORIA-INRIA laboratory in Nancy and in a company located in Montpellier.

