voice

1. GENERATION

  • FoundationTTS: Text-to-Speech for ASR Customization with Generative Language Model (automatic phonemes, coarse and fine composition)
  • artificial tongue-throat
  • Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale (up to 20 times faster than VALL-E)
  • Open sourcing AudioCraft: Generative AI for audio made simple and available to all
    • MusicGen, AudioGen, and EnCodec

1.1. MAGNET

  • MAGNeT: Masked Audio Generation using a Single Non-Autoregressive Transformer =best=
    • =comparison of them all=
    • trained to predict spans of masked tokens
    • single non-autoregressive model, for text-to-music and text-to-sound generation
    • comparable to SOTA models, while being 7x faster than the autoregressive baseline
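
The span-masking objective noted above can be sketched as follows. This is a minimal illustration, not MAGNeT's actual training code: the `MASK` id, the `mask_spans` helper, the span length and the mask ratio are all hypothetical choices made for the example.

```python
import random

MASK = -1  # hypothetical id standing in for the mask token


def mask_spans(tokens, span_len=3, mask_ratio=0.5, seed=0):
    """Replace whole spans of tokens with MASK until ~mask_ratio is covered."""
    rng = random.Random(seed)
    n = len(tokens)
    is_masked = [False] * n
    target = int(round(mask_ratio * n))
    # Visit candidate span starts in random order.
    starts = list(range(0, max(n - span_len, 0) + 1))
    rng.shuffle(starts)
    for s in starts:
        if sum(is_masked) >= target:
            break
        for i in range(s, min(s + span_len, n)):
            is_masked[i] = True
    masked = [MASK if m else t for t, m in zip(tokens, is_masked)]
    return masked, is_masked


tokens = list(range(100, 120))          # toy audio-token sequence
masked, is_masked = mask_spans(tokens)
# The model would be trained to predict these from the unmasked context.
targets = [t for t, m in zip(tokens, is_masked) if m]
```

Masking contiguous spans rather than isolated tokens removes the easy shortcut of copying a neighbouring frame, which is presumably why a span-based objective suits audio tokens.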

1.2. AUDIO DIFFUSION

  • AUDIO DIFFUSION (SOUND MUSIC VOICE)
  • parent: diffusion
  • NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers
  • From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion
    • multi-band diffusion, generates any type of audio
  • music diffusion https://www.arxiv-vanity.com/papers/2301.11757/
    • JEN-1: Text-Guided Universal Music Generation with Omnidirectional Diffusion Models
      • text-guided music generation, music inpainting, and continuation
  • Re-AudioLDM: Retrieval-Augmented Text-to-Audio Generation (CLAP, audio clip), complex scenes
  • Stable Audio Tools: audio training =by Stability AI=
  • Controllable Music Production with Diffusion Models and Guidance Gradients
    • continuation, inpainting and regeneration; style transfer
  • StyleTTS2: ElevenLabs quality =best=
    • E3 TTS: Easy End-to-End Diffusion-based Text to Speech
  • Music ControlNet: Multiple Time-varying Controls for Music Generation
    • melody, dynamics, and rhythm controls, 35x fewer parameters, 11x less data
  • Mustango: Toward Controllable Text-to-Music Generation
    • conditioned on prompts and various musical features
  • Fast Timing-Conditioned Latent Audio Diffusion
    • conditioned on text prompts as well as timing embeddings, can generate music with structure and stereo sound

1.2.1. OUTPERFORMED?

  • Schrodinger Bridges Beat Diffusion Models on Text-to-Speech Synthesis
    • issue: diffusion starts from a noisy representation (little information about the generation target)
      • solution: Bridge-TTS: start from a prior with strong structural information about the target
        • Schrodinger bridge between the latent from the text input and the ground-truth mel-spectrogram
    • better synthesis quality and sampling efficiency
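
To make the bridge idea concrete, here is a minimal sketch of a Brownian bridge, the simplest process pinned at both a text-derived latent and a target frame. It is an assumption-laden toy: the `bridge_sample` helper, the constant `sigma` and the example vectors are hypothetical, and Bridge-TTS itself uses its own noise schedules rather than this one.

```python
import math
import random


def bridge_sample(x0, x1, t, sigma=1.0, rng=None):
    """Sample x_t from a Brownian bridge pinned at x0 (t=0) and x1 (t=1)."""
    rng = rng or random.Random(0)
    # The noise scale vanishes at both endpoints, so both ends stay informative.
    std = sigma * math.sqrt(t * (1.0 - t))
    return [(1.0 - t) * a + t * b + std * rng.gauss(0.0, 1.0)
            for a, b in zip(x0, x1)]


text_latent = [0.0, 0.5, -0.2]   # hypothetical prior from the text encoder
mel_target = [1.0, -1.0, 0.3]    # hypothetical ground-truth mel frame
midpoint = bridge_sample(text_latent, mel_target, 0.5)
```

Unlike a standard diffusion forward process, which ends in pure noise, every intermediate state here stays between two meaningful endpoints.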

1.3. TTS GPT

  • AudioLM: a Language Modeling Approach to Audio Generation
    • actually uses w2v-BERT for semantic tokens and SoundStream for acoustic tokens
    • also used for TTS, and extended to VALL-E
    • SoundStorm: Efficient Parallel Audio Generation
      • two orders of magnitude faster than AudioLM, 50 fps, 30 seconds of speech continuation within 2 seconds
  • bark =best so far= not just voices
  • Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias
    • decomposed, uses spectrograms, wild-big dataset, phase reconstructed, best zero shot
  • UniAudio: An Audio Foundation Model Toward Universal Audio Generation
    • transformer, LMs techniques, simple fine-tuning =best=

1.3.1. TORTOISE LIKE

1.4. ALIGNMENT

  • Improving Joint Speech-Text Representations Without Alignment
    • the sequence-length mismatch is handled naturally by simply assuming the best alignment
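
One way to picture "assuming the best alignment" between a long speech sequence and a short text sequence is classic dynamic time warping. The sketch below is only an illustration under that assumption: whether the paper computes its consistency loss exactly this way is not confirmed here, and the function name and toy sequences are hypothetical.

```python
def best_alignment_cost(a, b):
    """Minimum total |a[i] - b[j]| over monotonic alignments (classic DTW)."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: repeat b[j], repeat a[i], or advance both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


speech = [0.0, 0.1, 1.0, 1.0, 2.0, 2.1]  # hypothetical long speech-side sequence
text = [0.0, 1.0, 2.0]                   # hypothetical short text-side sequence
cost = best_alignment_cost(speech, text)
```

The two sequences never need the same length: each speech frame is matched to whichever text position makes the total mismatch smallest.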

2. INSTRUMENT LIKE

  • WavJourney: Compositional Audio Creation with Large Language Models
    • script compiler: encompassing speech, music, effects, guided by instructions; creative control
  • Audio Style Transfer (using a dsp - a daw plugin)
    • gradient estimation instead of having to replace the plugin with a proxy network
  • SpeechX: Neural Codec Language Model as a Versatile Speech Transformer
    • phoneme intrinsics; choose-task voice transform (like voice transfer)
  • Text-to-Sing: melody, then with your own lyrics
  • ChatMusician: Understanding and Generating Music Intrinsically with LLM
    • music-notation is treated as a second language
    • also an excellent compressor for music
  • MusicLang: Llama 2 based Music generation model
    • trained from scratch; runs on cpu
    • using chords
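
The gradient-estimation idea from the Audio Style Transfer entry above can be sketched with central finite differences on a black-box effect. Everything here is a stand-in: `plugin` is a hypothetical one-parameter gain stage rather than a real DAW plugin, and the original work may use a different (for example stochastic) estimator.

```python
def plugin(signal, gain):
    """Stand-in for a non-differentiable DAW plugin: a plain gain stage."""
    return [gain * s for s in signal]


def loss(gain, signal, target):
    """Mean squared error between the plugin output and a reference render."""
    out = plugin(signal, gain)
    return sum((o - t) ** 2 for o, t in zip(out, target)) / len(out)


def estimate_grad(f, x, eps=1e-4):
    """Central finite-difference estimate of df/dx for a black-box f."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)


signal = [0.1, -0.4, 0.3, 0.8]
target = plugin(signal, 2.0)     # reference rendered with the "true" gain
gain = 0.5                       # initial guess for the plugin parameter
for _ in range(200):
    g = estimate_grad(lambda x: loss(x, signal, target), gain)
    gain -= 0.5 * g              # plain gradient descent on the estimate
```

Because only calls to `plugin` are needed, the real effect can stay in the loop instead of being replaced by a proxy network.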

3. AUDIO CODEC

3.1. LANGUAGE ENCODING

  • Natural Language Supervision for General-Purpose Audio Representations
    • audio representations trained on 22 tasks, instead of the standard training on sound event classification
    • autoregressive decoder-only language model, then joined with the audio encoder through Contrastive Learning
  • CLARA: Multilingual Contrastive Learning for Audio Representation Acquisition
    • contrastive audio-text model, with understanding of implicit aspects of speech: emotions
  • HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis
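
The contrastive audio-text training mentioned in the entries above can be illustrated with a symmetric cross-entropy loss over cosine similarities. This is a generic sketch of that family of losses, not the exact objective of either paper; the function name, temperature and toy embeddings are assumptions.

```python
import math


def contrastive_loss(audio_emb, text_emb, temperature=0.1):
    """Symmetric cross-entropy over cosine similarities of paired embeddings."""
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    a = [normalize(v) for v in audio_emb]
    t = [normalize(v) for v in text_emb]
    n = len(a)
    logits = [[sum(x * y for x, y in zip(a[i], t[j])) / temperature
               for j in range(n)] for i in range(n)]

    def row_ce(rows):
        # Cross-entropy where the correct class of row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    cols = [list(c) for c in zip(*logits)]  # transpose: text-to-audio direction
    return 0.5 * (row_ce(logits) + row_ce(cols))


audio = [[1.0, 0.0], [0.0, 1.0]]           # toy audio embeddings
text_matched = [[0.9, 0.1], [0.1, 0.9]]    # captions aligned with their clips
text_swapped = [[0.1, 0.9], [0.9, 0.1]]    # captions paired with the wrong clips
```

The loss is small when each clip is most similar to its own caption and large when the pairing is swapped, which is what pulls matching audio and text together in the shared space.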

3.2. SUPER-RESOLUTION

  • AudioSR: Versatile Audio Super-resolution at Scale (upsample, enhance)

4. VOICE CONVERSION

  • end2endvc: End-to-End Voice Conversion with Information Perturbation (=better MOS than NVC=) (better MOS than FreeVC)
  • QuickVC (5000 kHz inference speed, fastest) =vits=
  • TriAAN-VC: Triple Adaptive Attention Normalization for Any-to-Any Voice Conversion
    • best similarity and close naturalness, speaker encoding
  • 1 (best similarity, semantic tokens)

4.1. FEW SHOT

  • FreeVC: Towards High-Quality Text-Free One-Shot Voice Conversion (=vits=)
  • ControlVC: Zero-Shot Voice Conversion with Time-Varying Controls on Pitch and Speed
  • StyleTTS-VC: One-Shot Voice Conversion by Knowledge Transfer from Style-Based TTS Models (phonemes)
  • HierVST: Hierarchical Adaptive Zero-shot Voice Style Transfer
    • end-to-end zero-shot VST model (better than DiffVC)

4.2. ZERO SHOT

  • VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild =best=
    • to clone or edit an unseen voice, VoiceCraft needs only a few seconds of reference

5. STYLE CONVERSION

5.1. ATOMIC TRANSFERS

Author: Tekakutli

Created: 2024-04-07 Sun 13:58