Taming Visually Guided Sound Generation

Tampere University
British Machine Vision Conference (BMVC), 2021 – Oral Presentation
Overview
Overview of the Visually Guided Sound Generation Task

The generation of visually relevant, high-quality sounds is a longstanding challenge in deep learning. Solving it would allow sound designers to spend less time searching large foley databases for the sound that fits a specific video scene. Despite the promising results shown in [1, 2, 3, 4], generating long (10+ seconds), high-quality audio samples for a large variety of visual scenes remains a challenge. The goal of this work is to bridge this gap.

To achieve this, we propose to tame visually guided sound generation by shrinking a training dataset of audio spectrograms to a set of representative vectors, i.e. a codebook. Similar to word tokens in language modeling, these codebook vectors can be used by a transformer to sample a representation that can be easily decoded into a spectrogram. To employ visual information during sampling, we represent video frames as tokens and initialize the target sequence with them. This allows us to sample the next codebook token given the visual information and the previously generated codebook tokens. Once autoregressive sampling is done, we remove the visual tokens from the sequence and reuse the pre-trained decoder part of the codebook autoencoder to decode the sampled sequence of codebook tokens into a sound spectrogram.

Model

The natural format of audio is a waveform, a 1D signal sampled at tens of thousands of Hz. One option is to train a codebook directly on such waveforms and then sample codes from it. Using this approach, OpenAI Jukebox could generate high-quality musical performances conditioned on lyrics, style, and artist. Despite truly wonderful results, the sampling speed is rather slow (1 hour per 10 seconds of generated audio).

Therefore, for efficiency, we operate on spectrograms, a condensed 2D representation of an audio signal which can be easily obtained from a raw waveform and inverted back to it. This also allows us to treat sound like images and draw on architectural elements from conditional image generation. In this work, we train a codebook on spectrograms similarly to VQGAN (a variant of VQVAE), which proved to possess strong reconstruction power from smaller-scale codebook representations when applied to RGB images.

Once we can reliably reconstruct a spectrogram from a small-scale codebook representation, we can train a model to sample a novel representation from the pre-trained codebook given visual conditioning. This representation can then be decoded into a new spectrogram relevant to the visual input.

Next, we describe both the spectrogram codebook and the codebook sampler in detail.

Spectrogram Codebook

The Spectrogram Codebook \(\mathcal{Z}\) is trained in an autoencoder with a quantized bottleneck:

Spectrogram VQGAN

The goal of the autoencoder is to reliably reconstruct the input spectrogram \(x\) from a small-scale codebook (quantized) representation \(z_\mathbf{q}.\) The codebook representation \(z_\mathbf{q}\) is obtained from the encoded representation \(\hat{z}\) just by looking up the closest elements from the codebook \(\mathcal{Z}.\) Both the Codebook Encoder (\(E\)) and Codebook Decoder (\(G\)) are generic 2D Conv stacks.
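The lookup step described above can be sketched in a few lines of NumPy. This is only an illustration under assumed toy sizes: the function name `quantize`, the array shapes, and the use of squared Euclidean distance are assumptions made for the example, not details taken from the released code.

```python
import numpy as np

def quantize(z_hat, codebook):
    """Replace each encoded vector with its nearest codebook entry.

    z_hat:    (H, W, D) encoder output
    codebook: (N, D) codebook vectors
    Returns (z_q, indices) with shapes (H, W, D) and (H, W).
    """
    flat = z_hat.reshape(-1, z_hat.shape[-1])                 # (H*W, D)
    # Squared Euclidean distance between every vector and every code
    # (the distance measure is an assumption of this sketch)
    dists = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)                            # (H*W,)
    z_q = codebook[indices].reshape(z_hat.shape)
    return z_q, indices.reshape(z_hat.shape[:-1])

# Toy sizes for illustration only
rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))
z_hat = rng.normal(size=(2, 3, 4))
z_q, indices = quantize(z_hat, codebook)
```

Every vector in `z_q` is an exact copy of a codebook row, so the decoder only ever sees combinations of codebook entries.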

The training of the codebook is guided by four losses. Two of them are the traditional VQVAE losses: the codebook and reconstruction losses. The other two are inherited from the VQGAN architecture: the patch-based adversarial and perceptual (LPIPS) losses. The latter two were shown to allow reconstruction from smaller-scale codebook representations (see VQGAN).

Since the LPIPS loss relies on features of an ImageNet-pretrained classifier (VGG-16), it is not reasonable to expect that it would help to guide the generation of spectrograms. The closest relative of VGG-16 in audio classification is VGGish. However, we cannot make use of it because a) it operates on spectrograms 10 times shorter than those in our application, and b) its lack of depth and, thus, of downsampling operations prevents the extraction of large-scale features that could be useful in separating real and fake spectrograms. We, therefore, train from scratch a variant of the VGGish architecture, referred to as VGGish-ish, on the VGGSound dataset. The new perceptual loss based on VGGish-ish is called LPAPS.

Vision-based Conditional Cross-modal Transformer

Transformers [1, 2, 3] have shown incredible success in sequence modeling. Once the codebook representation is reformulated as a sequence, sampling from the codebook might benefit from the expressivity of a transformer. Such an approach has been shown in VQGAN to produce high-quality RGB images conditioned on a depth map, a semantic map, or a lower-resolution image. In this work, we propose to train such a transformer to bridge two modalities, namely vision and audio.

We train a variant of GPT-2 to predict the next codebook index given a visual condition in the form of visual tokens (features). The loss (cross-entropy) is calculated by comparing the sampled codebook indices \(\hat{s} = \{\hat{s}_j\}_{j=1}^{K}\) to the indices corresponding to the ground-truth spectrogram codebook representation \(z_\mathbf{q}\) (see the codebook figure above).
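For readers who prefer code, the cross-entropy over codebook indices can be written as follows. This is a generic NumPy sketch of next-token cross-entropy, not the training code of the model; the function name and the shapes (`K` positions, `N` codebook entries) are chosen for the example.

```python
import numpy as np

def next_token_cross_entropy(logits, targets):
    """Mean cross-entropy between predicted logits and ground-truth indices.

    logits:  (K, N) unnormalized scores over N codebook entries
    targets: (K,)   ground-truth codebook indices
    """
    # Numerically stable log-softmax
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    picked = log_probs[np.arange(len(targets)), targets]
    return float(-picked.mean())

# A model that is uniform over 4 codes pays log(4) per position
uniform_loss = next_token_cross_entropy(np.zeros((3, 4)), np.array([0, 1, 2]))
```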

At test time, the transformer \(M\) autoregressively samples a sequence token by token, primed with visual conditioning (frame-wise video features \(\hat{\mathcal{F}}\)), as follows:

Vision-based Conditional Transformer up to the Decoder
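The primed sampling loop can be illustrated with a short sketch. Here `next_logits` is a hypothetical stand-in for the transformer \(M\), the visual tokens are treated as opaque items, and `toy_model` exists only to make the example runnable; none of these names come from the released code.

```python
import numpy as np

def sample_codes(next_logits, visual_tokens, num_steps, rng):
    """Autoregressively sample `num_steps` codebook indices.

    next_logits:   callable(sequence) -> 1D array of logits over the codebook
                   (hypothetical stand-in for the transformer M)
    visual_tokens: conditioning tokens that prime the sequence
    """
    sequence = list(visual_tokens)
    for _ in range(num_steps):
        logits = np.asarray(next_logits(sequence), dtype=float)
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        sequence.append(int(rng.choice(len(probs), p=probs)))
    # Cut out the visual prefix: only codebook indices are decoded later
    return sequence[len(visual_tokens):]

def toy_model(sequence):
    """Toy stand-in: favors a code that depends on the sequence length."""
    logits = np.zeros(16)
    logits[len(sequence) % 16] = 5.0
    return logits

codes = sample_codes(toy_model, ["v1", "v2"], 10, np.random.default_rng(0))
```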

When sampling of tokens is done, we cut out the visual tokens from the generated sequence and replace the predicted codebook indices \(\hat{s}\) with vectors from the codebook \(\mathcal{Z}\) to form the codebook representation \(\hat{z}_\mathbf{q}.\) We reshape the sequence \(\hat{s}\) into the 2D representation \(\hat{z}_\mathbf{q}\) in a column-major way. Then, we can decode this representation into a spectrogram by the pre-trained codebook decoder \(G.\) The generated spectrogram \(\hat{x}_\mathcal{F}\) can finally be transformed to a waveform \(\hat{w}\) with a pre-trained spectrogram vocoder \(V:\)

From Generated Tokens to a Waveform
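The index-to-representation step (codebook lookup followed by a column-major reshape) can be sketched as below. The function name and the toy grid size are assumptions for illustration; the decoder \(G\) and vocoder \(V\) are not modeled here.

```python
import numpy as np

def indices_to_grid(indices, codebook, height, width):
    """Look up codebook vectors and arrange them column by column.

    indices:  (height * width,) sampled codebook indices
    codebook: (N, D) codebook vectors
    Returns an array of shape (height, width, D).
    """
    vectors = codebook[np.asarray(indices)]            # (H*W, D)
    # Column-major: consecutive tokens fill one column before the next starts
    return vectors.reshape(width, height, -1).transpose(1, 0, 2)

# Toy codebook whose i-th vector is simply [i], so the layout is easy to read
toy_codebook = np.arange(6, dtype=float)[:, None]
grid = indices_to_grid([0, 1, 2, 3, 4, 5], toy_codebook, height=2, width=3)
```

With this toy input, tokens 0 and 1 form the first column, tokens 2 and 3 the second, and tokens 4 and 5 the third.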

Overall, the Vision-based Conditional Cross-modal Sampler also includes the pretrained decoder from the codebook autoencoder \(G\) and a pretrained Spectrogram Vocoder \(V:\)

Vision-based Conditional Autoregressive Sampler

The most popular methods for vocoding a spectrogram are the Griffin-Lim algorithm and WaveNet. The Griffin-Lim algorithm is fast and can be applied to an open-domain dataset. However, the quality of the reconstructed waveform is unsatisfactory. WaveNet, in turn, generates high-quality results but at the cost of sampling speed (20+ minutes for a 10-second sample on a GPU). Therefore, we train MelGAN from scratch on an open-domain dataset (VGGSound). This allows us to transform spectrograms into waveforms in a fraction of a second on a CPU.

Automatic Quality Assessment of Spectrogram Generation

Human evaluation of content generation models is an expensive and tedious procedure. In the image generation field, this problem is bypassed with the automatic evaluation of fidelity using a family of metrics based on an ImageNet-pretrained Inception model. The most popular Inception-based metrics are Inception Score, Fréchet- and Kernel Inception Distance (FID & KID for short). However, automatic evaluation of a sound generation model remains an open question. In addition, our application requires the model to produce not only high-quality but also visually relevant samples. To this end, we propose a family of metrics for fidelity (quality) and relevance evaluation based on a novel architecture called Melception, a variant of Inception, trained as a classifier on a large-scale open-domain dataset (VGGSound).

For Fidelity evaluation of generated spectrograms, Melception can be directly adapted to calculate the whole range of Inception-based metrics including Inception Score as well as FID and KID. For automatic Visual Relevance evaluation, however, there are no well-defined metrics in the existing literature. To design one, we hypothesize that the class distribution of a generated spectrogram given a condition should be close to the class distribution of the original spectrogram for this condition. The class distributions for fake and real spectrograms can be formed by a pre-trained spectrogram classifier. To this end, we rely on Melception-based Kullback–Leibler divergence (MKL) as a measure of "distance" between two class distributions and average it among all samples in the dataset.
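A minimal sketch of the averaged KL divergence is given below. It assumes that the class distributions have already been produced by a classifier and are stored row-wise; the direction of the divergence (real relative to generated) and the clipping constant `eps` are assumptions of this example, since the text above does not specify them.

```python
import numpy as np

def mean_kl(p_real, p_fake, eps=1e-8):
    """Average KL(p_real || p_fake) over paired class distributions.

    p_real, p_fake: (num_samples, num_classes), each row sums to 1.
    """
    p = np.clip(np.asarray(p_real, dtype=float), eps, 1.0)
    q = np.clip(np.asarray(p_fake, dtype=float), eps, 1.0)
    kl_per_sample = (p * (np.log(p) - np.log(q))).sum(axis=1)
    return float(kl_per_sample.mean())

# Identical distributions give zero; mismatched ones give a positive value
same = mean_kl([[0.5, 0.5]], [[0.5, 0.5]])
different = mean_kl([[0.5, 0.5]], [[0.9, 0.1]])
```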

Datasets

We demonstrate the generation capabilities of the proposed approach on two datasets with strong audio-visual correspondence: VAS and VGGSound. VAS is a relatively small-scale but manually curated dataset that consists of ~12.5k clips from YouTube. The clips span 8 classes: Dog, Fireworks, Drum, Baby, Gun, Sneeze, Cough, and Hammer.

VGGSound is a large-scale dataset with >190k video clips from YouTube spanning 300+ classes. The classes can be grouped as people, sports, nature, home, tools, vehicles, music, etc. VGGSound is 15 times larger than VAS but it is less curated due to the automatic collecting procedure. To the best of our knowledge, we are the first to apply VGGSound for sound generation.

Results: Spectrogram Reconstruction

Reliable reconstruction of input spectrograms is a necessary condition for high-quality spectrogram generation. We evaluate the reconstruction ability of the spectrogram autoencoder with the proposed set of metrics measuring fidelity (FID) and relevance (average MKL) of reconstructions. Moreover, such an experiment might give us a rough upper bound on the performance of the transformer. When compared to ground truth spectrograms, the reconstructions are expected to have high fidelity (low FID) and to be relevant (low mean MKL):

Trained on    Evaluated on    FID ↓    MKL ↓
VGGSound      VGGSound        1.0      0.8
VGGSound      VAS             3.2      0.7
VAS           VAS             6.0      1.0

The results imply high fidelity and relevance on both the VGGSound (test) and VAS (validation) datasets. Notably, the VGGSound-pretrained codebook performs better than the VAS-pretrained codebook even on the VAS validation set, owing to the larger and more diverse data seen during training.

Next, we show qualitative results by drawing samples randomly from the hold-out sets of VGGSound and VAS and plotting the reconstructions along with the ground-truth spectrograms.

Qualitative Results (VGGSound dataset):

Each sample pairs the ground truth from VGGSound with its reconstruction with the VGGSound codebook. The samples cover the following classes: Playing Jembe, Ambulance Siren, Canary Calling, Child Speech Kid Speaking, Chimpanzee Pant-Hooting (silent video), Chopping Food, Civil Defense Siren, Dog Whimpering, Door Slamming, Eating With Cutlery, Electric Grinder Grinding, Eletric Blender Running, Golf Driving, Ice Cracking, Magpie Calling, Mouse Pattering, Opening Or Closing Drawers, People Belly Laughing, People Sneezing, Playing Bagpipes, Playing Castanets, Playing Clarinet, Playing Flute, Playing Mandolin, Playing Snare Drum, Playing Synthesizer, Playing Tympani, Playing Vibraphone, Police Car (Siren), Roller Coaster Running, Shot Football, Slot Machine, Spraying Water, Tap Dancing, Whale Calling, Woodpecker Pecking Tree.

Qualitative Results (VAS dataset):

Each sample shows the ground truth from VAS, its reconstruction with the VGGSound codebook, and its reconstruction with the VAS codebook. The samples cover the following classes: Dog, Baby, Baby, Baby, Cough, Dog, Drum, Drum, Fireworks, Gun, Hammer, Sneeze.

Results: Visually Guided Sound Generation

We benchmark our visually guided generative model using three different settings:
a) the transformer is trained on VGGSound to sample from the VGGSound-pretrained codebook,
b) the transformer is trained on VAS to sample from the VGGSound codebook,
c) the transformer is trained on VAS to sample from the VAS codebook.
We additionally compare different ImageNet-pretrained features: BN-Inception (RGB + flow) and ResNet-50 (RGB). As in the spectrogram reconstruction evaluation, we rely on FID and average MKL as our metrics:

Setting                 (a)             (b)             (c)
Codebook                VGGSound        VGGSound        VAS
Sampling for            VGGSound        VAS             VAS
Condition               FID ↓  MKL ↓    FID ↓  MKL ↓    FID ↓  MKL ↓    Time (s) ↓
No Features             13.5   9.7      33.7   9.6      28.7   9.2      7.7
ResNet, 1 Feature       11.5   7.3      26.5   6.7      25.1   6.3      7.7
ResNet, 5 Features      11.3   7.0      22.3   6.5      20.9   6.1      7.9
ResNet, 212 Features    10.5   6.9      20.8   6.2      22.6   5.8      11.8
Inception, 1 Feature    8.6    7.7      38.6   7.3      25.1   6.6      7.7
Inception, 5 Features   9.4    7.0      29.1   6.9      24.8   6.2      7.9
Inception, 212 Features 9.6    6.8      20.5   6.0      25.4   5.9      11.8

We observe that:
1) In general, the more features from a corresponding video are used, the better the result in terms of relevance. However, there is a trade-off imposed by the sampling speed which decreases with the size of the conditioning.
2) A large gap (log-scale) in mean MKL between visual and “empty” conditioning suggests the importance of visual conditioning in producing relevant samples.
3) When the sampler and codebook are trained on the same dataset—settings (a) and (c)—the fidelity remains on a similar level if visual conditioning is used. This suggests that it is easier for the model to learn “features-codebook” (visual-audio) correspondence even from just a few features. However, if trained on different datasets (b), the sampler benefits from more visual information.
4) Both BN-Inception and ResNet-50 features achieve comparable performance, with BN-Inception being slightly better on VGGSound and with longer conditioning in each setting. Notably, the ResNet-50 features are RGB-only which significantly eases practical applications. We attribute the small difference between the RGB + flow features and RGB-only features to the fact that ResNet-50 is a stronger architecture than BN-Inception on the ImageNet benchmark.

Generated Samples

Next, we show samples from all three different settings:

(a) – Trained on VGGSound to sample from VGGSound codebook

Input video (VGGSound): female speech, woman speaking
Generated Sample Class:
female speech, woman speaking 0.99
child speech, kid speaking 0.00
people whispering 0.00
eating with cutlery 0.00

Input video (VGGSound): chainsawing trees
Generated Sample Class:
chainsawing trees 0.98
hedge trimmer running 0.01
driving motorcycle 0.00
chopping wood 0.00

Input video (VGGSound): baby crying
Generated Sample Class:
baby crying 0.66
people sobbing 0.23
baby babbling 0.08
people babbling 0.03

Input video (VGGSound): crowd cheering
Generated Sample Class:
people cheering 0.46
people crowd 0.14
people booing 0.13
singing choir 0.13

Input video (VGGSound): playing bongo
Generated Sample Class:
playing bongo 0.85
playing congas 0.12
underwater bubbling 0.01
playing banjo 0.01

Input video (VGGSound): playing bongo
Generated Sample Class:
ripping paper 0.91
squishing water 0.06
lighting firecrackers 0.01
forging swords 0.00

Input video (VGGSound): playing bongo
Generated Sample Class:
wind chime 0.84
playing glockenspiel 0.11
singing bowl 0.02
playing marimba, xylophone 0.01

Use our pre-trained model to generate samples for a custom video: Open In Colab

(b) – Trained on VAS to sample from VGGSound codebook

Input video (VAS): gun
Generated Sample Class:
cap gun shooting 0.74
machine gun shooting 0.26
fireworks banging 0.00
chopping wood 0.00

Input video (VAS): dog
Generated Sample Class:
dog barking 0.85
dog bow-wow 0.15
dog baying 0.00
coyote howling 0.00

Input video (VAS): cough
Generated Sample Class:
people coughing 0.64
people sneezing 0.28
baby laughter 0.06
people belly laughing 0.01

Note: the cough class in the VAS dataset has as few as 314 training videos

(c) – Trained on VAS to sample from VAS codebook

Input video (VAS): drum
Generated Sample Class:
playing cymbal 0.57
playing bass drum 0.28
playing drum kit 0.10
playing double bass 0.04

Input video (VAS): fireworks
Generated Sample Class:
fireworks banging 1.00
machine gun shooting 0.00
lighting firecrackers 0.00
playing hockey 0.00

Input video (VAS): hammer
Generated Sample Class:
hammering nails 0.63
arc welding 0.23
forging swords 0.04
playing guiro 0.03

Note: the hammer class in the VAS dataset has as few as 318 training videos

Results: Comparison with the State-of-the-art

We compare our model with RegNet, which is currently the strongest baseline in generating relevant sounds for a visual sequence. Since the RegNet approach requires training one model per class in a dataset, it explicitly shrinks the sampling space. In contrast, our model learns to sample for all classes in a dataset at the same time, which is a harder task. For a fair comparison, we also pass a class label (cls) along with the visual features into the model conditioning:

Model         Parameters    FID ↓    MKL ↓    Time (s) ↓
Ours          379M          20.5     6.0      12
Ours          377M          25.4     5.9      12
RegNet        105M          78.8     5.7      1500
Ours + cls    379M          20.2     5.7      12
Ours + cls    377M          24.9     5.5      12

According to the results, our model significantly outperforms the baseline in terms of fidelity (FID) while being on par or better in generating relevant samples. Moreover, the models without the cls token are competitive with RegNet. This suggests that our model is able to learn the mapping between the visual features and the class distribution through the "features-codebook" correspondence. We highlight that our model is trained on all dataset classes at once, which is a much harder setting than that of the baseline.

Comparison of Generated Samples
baby
– VAS
Generated Sample Class (VGGSound):
RegNet:
people sobbing 0.67
baby crying 0.27
cat growling 0.01
Ours:
people sobbing 0.84
baby crying 0.15
people babbling 0.00
RegNet (Generation Time: 25 minutes on a )
Ours (Generation Time: 10 seconds on a )
drum
– VAS
Generated Sample Class (VGGSound):
RegNet:
playing cymbal 0.31
playing snare drum 0.29
playing bass drum 0.14
Ours:
playing bass drum 0.46
playing drum kit 0.34
playing cymbal 0.12
RegNet (Generation Time: 25 minutes on a )
Ours (Generation Time: 10 seconds on a )
dog
– VAS
Generated Sample Class (VGGSound):
RegNet:
dog bow-wow 0.66
dog baying 0.20
dog barking 0.03
Ours:
dog barking 0.72
dog bow-wow 0.27
fox barking 0.00
RegNet (Generation Time: 25 minutes on a )
Ours (Generation Time: 10 seconds on a )
fireworks
– VAS
Generated Sample Class (VGGSound):
RegNet:
fireworks banging 0.94
lighting firecrackers 0.06
footsteps on snow 0.00
Ours:
fireworks banging 0.94
lighting firecrackers 0.03
machine gun shooting 0.02
RegNet (Generation Time: 25 minutes on a )
Ours (Generation Time: 10 seconds on a )
Results: Sampling without Visual Conditioning

The following results show the capability of a model to generate samples without visual condition which, essentially, depicts how well the model captures the distribution of the training set. Both the codebook and the transformer are trained on the same dataset and the samples are not cherry-picked.

Random Set of Generated Samples for VGGSound
Generated Sample Class (VGGSound):
playing bass guitar 0.50
playing electric guitar 0.39
female singing 0.04
male singing 0.01
railroad car, train wagon 0.37
train wheels squealing 0.28
train horning 0.17
train whistling 0.15
horse clip-clop 0.57
golf driving 0.05
people running 0.04
playing tennis 0.01
playing trombone 0.92
playing cello 0.04
playing trumpet 0.01
playing double bass 0.00
playing violin, fiddle 0.32
playing cello 0.17
people humming 0.16
playing theremin 0.15
playing drum kit 0.26
people marching 0.21
playing cymbal 0.10
horse clip-clop 0.08
wind noise 0.35
wind rustling leaves 0.16
rowboat, canoe, ..., rowing 0.08
horse clip-clop 0.06
train whistling 0.51
lathe spinning 0.17
people crowd 0.12
railroad car, train wagon 0.06
⚠️ WARNING: it might be a bit loud ⚠️
playing lacrosse 0.45
roller coaster running 0.09
people sniggering 0.09
playing volleyball 0.08
penguins braying 0.18
elk bugling 0.18
baby crying 0.07
horse neighing 0.06
pheasant crowing 0.21
sliding door 0.10
woodpecker pecking tree 0.10
sharpen knife 0.06
people eating apple 0.45
baby babbling 0.12
people eating 0.03
people babbling 0.03
people marching 0.27
playing washboard 0.22
playing trombone 0.07
orchestra 0.06
using sewing machines 0.22
stream burbling 0.12
hail 0.12
pigeon, dove cooing 0.11
bowling impact 0.52
playing steelpan 0.06
people shuffling 0.06
people slapping 0.05
⚠️ WARNING: it might be a bit loud ⚠️
roller coaster running 0.58
train whistling 0.15
fireworks banging 0.05
machine gun shooting 0.04
⚠️ WARNING: it might be a bit loud ⚠️
eletric blender running 0.49
engine accelerating ... 0.17
car engine starting 0.06
vacuum cleaner cleaning 0.06
⚠️ WARNING: it might be a bit loud ⚠️
sharpen knife 0.56
pig oinking 0.10
rowboat, canoe, ... rowing 0.03
hair dryer drying 0.03
Random Set of Generated Samples for VAS
Generated Sample Class (VGGSound):
dog bow-wow 0.54
dog barking 0.35
dog baying 0.03
dog whimpering 0.03
people marching 0.97
playing bugle 0.01
people battle cry 0.00
playing trombone 0.00
⚠️ WARNING: it might be a bit loud ⚠️
magpie calling 0.32
francolin calling 0.25
turkey gobbling 0.08
goose honking 0.08
skateboarding 0.78
striking bowling 0.11
bowling impact 0.04
fireworks banging 0.03
baby laughter 0.39
people belly laughing 0.31
people giggling 0.12
francolin calling 0.06
⚠️ WARNING: it might be a bit loud ⚠️
playing drum kit 0.41
playing bass drum 0.29
roller coaster running 0.11
playing bass guitar 0.02
chicken clucking 0.46
pheasant crowing 0.22
zebra braying 0.06
people belly laughing 0.04
pheasant crowing 0.85
francolin calling 0.05
people belly laughing 0.04
people giggling 0.01
fireworks banging 0.99
lighting firecrackers 0.00
machine gun shooting 0.00
firing cannon 0.00
volcano explosion 0.67
missile launch 0.07
engine accelerating, ... 0.04
train wheels squealing 0.04
squishing water 0.34
chopping wood 0.22
ice cracking 0.12
cap gun shooting 0.11
playing bass drum 0.23
lighting firecrackers 0.13
basketball bounce 0.07
firing muskets 0.06
roller coaster running 0.48
missile launch 0.19
dinosaurs bellowing 0.19
volcano explosion 0.03
roller coaster running 0.27
skiing 0.20
dinosaurs bellowing 0.18
otter growling 0.12
dog barking 0.59
dog bow-wow 0.19
dog growling 0.03
pig oinking 0.02
playing drum kit 0.65
playing timbales 0.20
playing bass drum 0.10
playing cymbal 0.00
lathe spinning 0.32
printer printing 0.29
playing cymbal 0.16
popping popcorn 0.04
machine gun shooting 0.46
car engine starting 0.34
lighting firecrackers 0.04
dinosaurs bellowing 0.03
⚠️ WARNING: it might be a bit loud ⚠️
Results: Priming Sampling with a Ground Truth Part

The transformer, given the previously generated part of the sequence, samples the next token. Here, we show the ability of the model to seamlessly continue a sequence of codebook indices from the original audio. The samples are drawn randomly from the hold-out sets and are not cherry-picked.

Random Set of Samples for VGGSound
codebook codes from the ground truth audio
generated codebook codes
baby crying
bouncing on trampoline
cat meowing
francolin calling
owl hooting
penguins braying
pheasant crowing
playing accordion
playing erhu
playing zither
slot machine
spraying water
Random Set of Samples for VAS
codebook codes from the ground truth audio
generated codebook codes
baby
dog
drum
baby
dog
cough
cough

Note: the cough class in the VAS dataset has as few as 314 training videos

drum
fireworks
fireworks
gun
gun
Results: Controlling for Sample Diversity

Temporal diversity is an important factor of high-fidelity audio. At the same time, it is challenging to generate a sample that is both diverse and relevant, which imposes a trade-off. The straightforward approach to sampling the next item in the sequence is to always pick the codebook item predicted with the highest probability. This approach, however, results in relevant but unpleasant, unnatural, and low-diversity samples.

To avoid it, we can use the distribution for the whole vocabulary (codebook) instead of the top one alone. These distributions can form weights for a multinomial distribution. This simply means that the larger the probability for the item, the more likely it will be picked as the next one.

However, we found that allowing the transformer to sample from all available codebook items indeed improves diversity but, at the same time, deteriorates relevance. Therefore, we end up with a "relevance-diversity" trade-off. To somewhat mitigate this issue, we control the trade-off by clipping the distribution. In particular, instead of either picking only the top one or using all available codebook items, we clip the set of available items to the Top-\(X\) according to their predicted probabilities.
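The clipping described above can be sketched as follows. The function name `sample_top_x` and the renormalization of the kept probabilities are assumptions made for this example rather than details confirmed by the text.

```python
import numpy as np

def sample_top_x(probs, x, rng):
    """Sample an index from the `x` most probable entries of `probs`."""
    probs = np.asarray(probs, dtype=float)
    top = np.argsort(probs)[-x:]          # indices of the x largest values
    clipped = np.zeros_like(probs)
    clipped[top] = probs[top]
    clipped /= clipped.sum()              # renormalize over the kept codes
    return int(rng.choice(len(probs), p=clipped))

rng = np.random.default_rng(0)
probs = [0.1, 0.6, 0.3]
greedy = sample_top_x(probs, 1, rng)      # x = 1: always the top code
draws = {sample_top_x(probs, 2, rng) for _ in range(200)}
```

Setting `x` to 1 reduces to always picking the most probable code, while setting it to the codebook size recovers sampling from the full multinomial distribution.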

Next, we show how this observation can be used to control the sample diversity:

The next codebook index is sampled from the Top-\(X\) codes
playing accordion
– VGGSound
Generated Sample Class (VGGSound):
Ground Truth
playing accordion 1.00
playing clarinet 0.00
playing saxophone 0.00
playing harmonica 0.00
\(X=1024\) (all codes)
playing accordion 1.00
playing harmonica 0.00
playing saxophone 0.00
playing clarinet 0.00
512
playing accordion 1.00
playing harmonica 0.00
playing violin, fiddle 0.00
playing bagpipes 0.00
256
playing accordion 1.00
playing harmonica 0.00
playing violin, fiddle 0.00
vehicle horn, ... 0.00
128
playing accordion 1.00
playing harmonica 0.00
playing trumpet 0.00
vehicle horn, ... 0.00
64
playing saxophone 0.47
playing accordion 0.17
female singing 0.09
playing violin, fiddle 0.07
32
playing accordion 1.00
playing trumpet 0.00
playing trombone 0.00
vehicle horn, ... 0.00
16
playing accordion 0.88
playing clarinet 0.11
playing violin, fiddle 0.00
playing saxophone 0.00
8
playing oboe 0.96
playing electronic organ 0.01
playing erhu 0.01
singing bowl 0.01
4
playing trumpet 0.95
playing cornet 0.04
playing clarinet 0.00
playing saxophone 0.00
2
civil defense siren 0.60
vehicle horn, ... 0.18
train horning 0.08
playing accordion 0.04
1 (always the top one)
playing trumpet 0.85
donkey, ass braying 0.10
playing accordion 0.02
playing cornet 0.02

or quantitatively relying on Melception-based FID and average MKL (the lower the better):

Fidelity-Relevance Trade-off
Results: Relevance per Class

An ideal conditional generative model is expected to produce relevant examples for every class in a dataset. Here we show how relevant the generated samples are across every class in the VGGSound dataset:

Fidelity-Relevance Trade-off

Although the MKL of the model falls within the 7 ± 0.7 interval for a majority of the classes, there is still room for improvement in the ability of a model to handle multiple classes, which we hope to see in future research.

Results: Variability of Samples

Since the generation of a relevant sound given a set of visual features is an ill-posed problem, a model is expected to produce a variety of relevant samples for the same condition. We show that our model is capable of generating a variety of relevant samples for the same visual condition:

male speech,
man speaking

– VGGSound
Generated Sample Class (VGGSound):
male speech, man speaking 0.67
..., woman speaking 0.14
people booing 0.01
opening/closing drawers 0.01
people booing 0.46
male speech, man speaking 0.12
people marching 0.09
playing volleyball 0.05
people cheering 0.45
people crowd 0.34
playing lacrosse 0.08
people shuffling 0.00
people booing 0.45
people cheering 0.11
playing hockey 0.11
people crowd 0.10
child singing 0.81
female singing 0.07
..., woman speaking 0.03
children shouting 0.02
..., woman speaking 0.96
male speech, man speaking 0.02
playing ukulele 0.00
child speech, kid speaking 0.00

Given the provided visual sequence ("a person is talking with a crowd in the background"), it is difficult to guess why the person turned back to the crowd, e.g. because they were "cheering" or "booing". At the same time, we also notice a limitation of the model, i.e. it sometimes confuses the gender or age of a person. However, we believe these are reasonable mistakes considering the difficulty of the scene.

Results: Spectrogram VQGAN as a Neural Audio Codec

A recent (July 2021) arXiv submission showcased SoundStream, a VQVAE with an adversarial loss, for lossy compression of waveforms and reported state-of-the-art results at a 3 kbps bitrate. SoundStream is designed for music and speech datasets. Since our approach includes sampling from a pre-trained codebook, we can employ our Spectrogram VQGAN pre-trained on an open-domain dataset as a neural audio codec without any changes. Our approach allows encoding at approximately 0.27 kbps with the VGGSound codebook and 0.19 kbps with the VAS codebook.
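As a rough sanity check on such bitrates, the cost of a stream of codebook indices can be computed from the number of indices per second and the codebook size. The helper below is a hypothetical illustration; the rate of 27 indices per second is inferred here from the 0.27 kbps figure and a 1024-code codebook, and is not stated on this page.

```python
import math

def bitrate_kbps(tokens_per_second, codebook_size):
    """Bitrate of a stream of codebook indices, in kilobits per second."""
    bits_per_token = math.log2(codebook_size)   # 10 bits for 1024 codes
    return tokens_per_second * bits_per_token / 1000.0

# Assumed rate: about 27 indices per second with a 1024-code codebook
example = bitrate_kbps(27, 1024)
```

A smaller codebook needs fewer bits per index, which is consistent with the VAS codebook reaching a lower bitrate than the VGGSound one.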

We provide a small qualitative comparison of reconstructions by Lyra (speech only), SoundStream (speech and music only), and Spectrogram VQGAN (open-domain). We use the same 3-second samples as provided on the SoundStream project page since the source code for SoundStream has not been released to the public (checked on 20 October 2021; the authors promised to release it as a part of the Lyra toolbox).

Reference
Spectrogram VQGAN (0.27 kbps)
SoundStream (3 kbps)
Lyra (3 kbps)
Music
Speech

As a result, despite having a bitrate budget one order of magnitude smaller, Spectrogram VQGAN achieves reconstruction quality comparable to SoundStream on music data and produces significantly better reconstructions than Lyra. However, as we observed before (Section 4.2 in the paper), Spectrogram VQGAN struggles with the fine details of human speech due to the audio preprocessing (mel-scale spectrogram) and the absence of narrow-domain pre-training as in Lyra and SoundStream. We highlight that Spectrogram VQGAN is trained on an open-domain dataset with hundreds of classes (VGGSound), while SoundStream is trained on music and speech datasets separately.

Try our model on a custom audio on Google Colab: Open In Colab

Or with an even simpler interface without any exposed code: Open In Hugging Face Spaces