It is now possible to take a talking-head style video, and add, delete or edit the speaker’s words as simply as you’d edit text in a word processor. A new deepfake algorithm can process the audio and video into a new file in which the speaker says more or less whatever you want them to. New Atlas reports:
It’s the work of a collaborative team from Stanford University, Max Planck Institute for Informatics, Princeton University and Adobe Research, who say that in a perfect world the technology would be used to cut down on expensive re-shoots when an actor gets something wrong, or a script needs to be changed. In order to learn the face movements of a speaker, the algorithm requires about 40 minutes of training video, and a transcript of what’s being said, so it’s not something that can be thrown onto a short video snippet and run if you want good results. That 40 minutes of video gives the algorithm the chance to work out exactly what face shapes the subject is making for each phonetic syllable in the original script.
From there, once you edit the script, the algorithm can then create a 3D model of the face making the new shapes required. And from there, a machine learning technique called Neural Rendering can paint the 3D model over with photo-realistic textures to make it look basically indistinguishable from the real thing. Other software such as VoCo can be used if you wish to generate the speaker’s audio as well as video, and it takes the same approach, by breaking down a heap of training audio into phonemes and then using that dataset to generate new words in a familiar voice.