Strumming to the beat: audio-conditioned contrastive video textures

Using representation learning methods, researchers have been able to generate images "from scratch". Even so, video generation remains a difficult task. A recent paper combines video textures, a video synthesis method originally designed for simple, repetitive videos, with recent advances in self-supervised learning.

The approach synthesizes textures by resampling frames from a single input video. A deep model is trained to learn features that capture the spatial and temporal structure of that video. To synthesize the texture, the video is represented as a graph in which the individual frames are nodes and the edges carry transition probabilities. Output videos are created by randomly traversing edges with high transition probabilities.
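The graph traversal described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `synthesize`, the `top_k` cutoff, and the toy matrix `P` are all assumptions made for the example, and in the paper the transition probabilities come from a learned model rather than being written by hand.

```python
import random

def synthesize(P, start, length, top_k=2, seed=0):
    """Random walk over a frame graph.

    P[i][j] is the probability of cutting from frame i to frame j
    (assumed to be given here). At each step only the top_k most
    probable successors are kept, and one is sampled in proportion
    to its probability.
    """
    rng = random.Random(seed)
    order = [start]
    for _ in range(length - 1):
        row = P[order[-1]]
        # Keep only the edges with the highest transition probabilities.
        candidates = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:top_k]
        weights = [row[j] for j in candidates]
        order.append(rng.choices(candidates, weights=weights, k=1)[0])
    return order

# Hypothetical 4-frame video: each frame most likely continues to the
# next one, with a small chance of jumping elsewhere.
P = [
    [0.0, 0.9, 0.1, 0.0],
    [0.0, 0.0, 0.9, 0.1],
    [0.1, 0.0, 0.0, 0.9],
    [0.9, 0.1, 0.0, 0.0],
]
frames = synthesize(P, start=0, length=10)
```

Because the walk can run for any number of steps, the output sequence can be made arbitrarily long, which is what makes the texture "infinite". Restricting sampling to the top few edges trades diversity for smoother transitions.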

In one application, a new video is generated from a source video with its associated audio and a new conditioning audio track. The approach outperforms previous methods in perceptual studies.

We introduce a non-parametric approach to infinite video texture synthesis using a representation learned through contrastive learning. We take inspiration from Video Textures, which showed that plausible new videos can be generated from a single video by stitching its frames together in a novel yet consistent order. However, this classic work was constrained by its use of hand-designed distance metrics, which limited it to simple, repetitive videos. We draw on recent self-supervised learning techniques to learn this distance metric. This allows us to compare frames in a way that scales to more challenging dynamics and to condition on other data, such as audio. We learn representations for video frames and frame-to-frame transition probabilities by fitting a video-specific model trained with contrastive learning. To synthesize a texture, we randomly sample frames with high transition probabilities, producing diverse, temporally smooth videos with novel sequences and transitions. The model naturally extends to an audio-conditioned setting without any fine-tuning. Our model outperforms baselines in human perceptual studies, can handle a wide variety of input videos, and can combine semantic and audio-visual cues to synthesize videos that synchronize well with an audio signal.
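To make the link between contrastive learning and transition probabilities concrete, here is a small Python sketch of a softmax over embedding similarities and the corresponding contrastive (InfoNCE-style) loss. It is an illustration under stated assumptions, not the paper's model: the function names, the dot-product similarity, the temperature value, and the toy two-dimensional embeddings are all hypothetical, and a real system would compute embeddings with a trained network.

```python
import math

def transition_probs(query, candidates, temperature=0.1):
    """Softmax over dot-product similarities between one query
    embedding and a list of candidate embeddings."""
    sims = [sum(q * c for q, c in zip(query, cand)) / temperature
            for cand in candidates]
    m = max(sims)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]

def contrastive_loss(query, candidates, positive_index, temperature=0.1):
    """Negative log-probability assigned to the positive candidate.
    Minimizing it pulls the query toward the positive and pushes it
    away from the other candidates."""
    probs = transition_probs(query, candidates, temperature)
    return -math.log(probs[positive_index])

# Toy embeddings (assumed): the first candidate is close to the query,
# the others are not.
query = [1.0, 0.0]
candidates = [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]

probs = transition_probs(query, candidates)
loss_good = contrastive_loss(query, candidates, positive_index=0)
loss_bad = contrastive_loss(query, candidates, positive_index=1)
```

In this sketch the same softmax that defines the training loss also yields a probability distribution over candidate next frames, so a model trained this way can supply transition probabilities for sampling directly.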

Research paper: Narasimhan, M., Ginosar, S., Owens, A., Efros, A. A. and Darrell, T., "Strumming to the Beat: Audio-Conditioned Contrastive Video Textures", 2021. Link:
