voice
Table of Contents
- parent: domain
- AUDIO VISION
- models: https://rentry.org/AIVoiceStuff
- tortoise dvae
- MusicHiFi: Fast High-Fidelity Stereo Vocoding
- from mel-spectrogram to higher quality mono and stereo
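Vocoders like MusicHiFi start from a mel-spectrogram, i.e. a spectrogram whose frequency axis is warped to the perceptual mel scale. A stdlib-only sketch of the HTK mel conversion and the spacing of the filter centers (function names are my own, not from the paper):

```python
import math

def hz_to_mel(f_hz: float) -> float:
    # HTK-style mel scale: perceptually uniform spacing of frequencies
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m: float) -> float:
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_band_centers(n_mels: int, f_min: float, f_max: float) -> list[float]:
    # centers of the triangular mel filters, evenly spaced on the mel axis
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    return [mel_to_hz(lo + (hi - lo) * i / (n_mels + 1)) for i in range(1, n_mels + 1)]

centers = mel_band_centers(80, 0.0, 8000.0)  # typical 80-band setup for TTS
```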
1. GENERATION
- FoundationTTS: Text-to-Speech for ASR Customization with Generative Language Model (automatic phonemes, coarse and fine composition)
- artificial tongue-throat
- Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale (20 times faster than valle)
- Open sourcing AudioCraft: Generative AI for audio made simple and available to all
- MusicGen, AudioGen, and EnCodec
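MusicGen handles EnCodec's multiple codebook streams with a delay pattern: codebook k is shifted by k steps, so one transformer step predicts all K codebooks in parallel while keeping their dependency order. A toy sketch of that interleaving (PAD is a hypothetical placeholder token):

```python
PAD = -1  # hypothetical padding token for shifted-out positions

def delay_interleave(codes: list[list[int]]) -> list[list[int]]:
    # codes[k][t]: token of codebook k at frame t.
    # Codebook k is delayed by k steps, so at sequence step s the model
    # predicts codebook k's token for frame s - k, all codebooks at once.
    K, T = len(codes), len(codes[0])
    out = [[PAD] * (T + K - 1) for _ in range(K)]
    for k in range(K):
        for t in range(T):
            out[k][t + k] = codes[k][t]
    return out

pattern = delay_interleave([[1, 2, 3], [4, 5, 6]])
# → [[1, 2, 3, -1], [-1, 4, 5, 6]]
```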
1.1. MAGNET
1.2. AUDIO DIFFUSION
- AUDIO DIFFUSION (SOUND MUSIC VOICE)
- parent: diffusion
- NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers
- From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion
- multi-band diffusion, generates any type of audio
- music diffusion https://www.arxiv-vanity.com/papers/2301.11757/
- JEN-1: Text-Guided Universal Music Generation with Omnidirectional Diffusion Models
- text-guided music generation, music inpainting, and continuation
- Re-AudioLDM: Retrieval-Augmented Text-to-Audio Generation (CLAP, audio clip), complex scenes
- Stable Audio Tools: audio training
=by Stability AI=
- Controllable Music Production with Diffusion Models and Guidance Gradients
- continuation, inpainting and regeneration; style transfer
- StyleTTS2: ElevenLabs quality
=best=
- E3 TTS: Easy End-to-End Diffusion-based Text to Speech
- Music ControlNet: Multiple Time-varying Controls for Music Generation
- melody, dynamics, and rhythm controls, 35x fewer parameters, 11x less data
- Mustango: Toward Controllable Text-to-Music Generation
- conditioned on prompts and various musical features
- Fast Timing-Conditioned Latent Audio Diffusion
- conditioned on text prompts as well as timing embeddings, can generate structure and stereo sounds
1.2.1. OUTPERFORMED?
- Schrödinger Bridges Beat Diffusion Models on Text-to-Speech Synthesis
- issue: diffusion starts from a noisy representation carrying little information about the generation target
- solution: Bridge-TTS starts from strong structural information about the target
- Schrödinger bridge between the latent from the text input and the ground-truth mel-spectrogram
- better synthesis quality and sampling efficiency
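One way to see the contrast, as a Brownian-bridge special case (a sketch, not necessarily the paper's exact parameterization): diffusion runs a reverse SDE from an uninformative Gaussian prior, while the bridge pins both endpoints, so the prior already carries the target's structure.

```latex
% Diffusion TTS: prior is pure noise,            x_1 \sim \mathcal{N}(0, I)
% Bridge-TTS:    prior is the text-encoder latent, x_1 = \mathcal{E}(y)
\mathrm{d}x_t = \frac{x_1 - x_t}{1 - t}\,\mathrm{d}t + g(t)\,\mathrm{d}W_t,
\qquad x_0 = \text{mel-spectrogram}, \quad x_1 = \mathcal{E}(y)
```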
1.3. TTS GPT
- AudioLM: a Language Modeling Approach to Audio Generation
- uses w2v-BERT semantic tokens and SoundStream acoustic codes
- also does TTS; later extended into VALL-E
- SoundStorm: Efficient Parallel Audio Generation
- two orders of magnitude faster than AudioLM, 50 fps, 30 seconds of speech continuation within 2 seconds
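SoundStorm's speed comes from MaskGIT-style confidence-based parallel decoding: start fully masked, and at each round commit only the most confident predictions. A toy stdlib sketch with a stand-in scorer (the real model scores RVQ tokens level by level):

```python
import random

MASK = None  # stand-in mask token

def parallel_decode(length, steps, predict):
    # Start fully masked; each round, score every masked slot and commit
    # only the most confident predictions, all in parallel.
    seq = [MASK] * length
    for step in range(steps):
        masked = [i for i, tok in enumerate(seq) if tok is MASK]
        if not masked:
            break
        # predict(seq, i) -> (confidence, token); sort descending by confidence
        scored = sorted(((*predict(seq, i), i) for i in masked), reverse=True)
        keep = -(-len(masked) // (steps - step))  # ceil, so decoding finishes on time
        for conf, tok, i in scored[:keep]:
            seq[i] = tok
    return seq

random.seed(0)
# hypothetical "model": confidence is random, token is just position mod 4
demo = parallel_decode(8, 3, lambda seq, i: (random.random(), i % 4))
# → [0, 1, 2, 3, 0, 1, 2, 3]
```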
- bark
=best so far=
- Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias (not just voices)
- decomposed design, uses spectrograms, large in-the-wild dataset, phase reconstruction, best zero-shot
- UniAudio: An Audio Foundation Model Toward Universal Audio Generation
- transformer, LMs techniques, simple fine-tuning
=best=
1.3.1. TORTOISE LIKE
- tortoise finetuning
- OpenAI’s Text to Speech TTS
- EmotiVoice: a Multi-Voice and Prompt-Controlled TTS Engine
- Piper: a fast, local neural text-to-speech system
=best=
1.4. ALIGNMENT
- Improving Joint Speech-Text Representations Without Alignment
- fixes the sequence-length mismatch naturally by simply assuming the best alignment
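The idea can be sketched as a best-case upsampler: stretch the text sequence uniformly to the speech frame count so the two modalities line up position-by-position, with no learned aligner (an illustrative stand-in, not the paper's exact mechanism):

```python
def best_case_upsample(text_tokens: list, n_frames: int) -> list:
    # Assume the best alignment: stretch the text sequence uniformly to the
    # speech frame count, so speech and text representations can be compared
    # frame-by-frame without an explicit alignment model.
    n = len(text_tokens)
    return [text_tokens[min(i * n // n_frames, n - 1)] for i in range(n_frames)]

stretched = best_case_upsample(["h", "i"], 5)
# → ['h', 'h', 'h', 'i', 'i']
```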
2. INSTRUMENT LIKE
- WavJourney: Compositional Audio Creation with Large Language Models
- script compiler: encompassing speech, music, effects, guided by instructions; creative control
- Audio Style Transfer (using a dsp - a daw plugin)
- gradient estimation instead of having to replace the plugin with a proxy network
- SpeechX: Neural Codec Language Model as a Versatile Speech Transformer
- phoneme intrinsics; choose-task voice transform (like voice transfer)
- Text-to-Sing: melody, then with your own lyrics
- ChatMusician: Understanding and Generating Music Intrinsically with LLM
- music-notation is treated as a second language
- also excellent compressor for music
- MusicLang: Llama 2 based Music generation model
- trained from scratch; runs on cpu
- using chords
3. AUDIO CODEC
- Disen: Disentangled Feature Learning for Real-Time Neural Speech Coding
- voice conversion in real-time communications
=Codec=
- a separate codebook each for speaker and content
- valle concept: modeling (building up decoder)
- VALL-E X: Multilingual Text-to-Speech Synthesis and Voice Cloning
- clone with only 3 seconds of reference audio
- High-Fidelity Audio Compression with Improved RVQGAN (8kbps)
- 3 times better than Facebook's EnCodec
=best codec=
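The residual vector quantization (RVQ) in RVQGAN/EnCodec-style codecs works in stages: each codebook encodes the residual the previous stages missed, so quality and bitrate scale with the number of codebooks. A scalar toy sketch (real codecs quantize vectors, not single floats):

```python
def quantize(x: float, codebook: list[float]) -> float:
    # pick the nearest codeword
    return min(codebook, key=lambda c: abs(x - c))

def rvq_encode(x: float, codebooks: list[list[float]]) -> list[float]:
    # Residual VQ: stage k quantizes what stages 0..k-1 failed to capture,
    # so each later codebook refines the same sample at a finer scale.
    residual, codes = x, []
    for cb in codebooks:
        q = quantize(residual, cb)
        codes.append(q)
        residual -= q
    return codes

books = [[-1.0, 0.0, 1.0], [-0.25, 0.0, 0.25], [-0.05, 0.0, 0.05]]  # coarse → fine
codes = rvq_encode(0.7, books)
approx = sum(codes)  # decoding is just summing the selected codewords, ≈ 0.7
```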
- LMCodec: A Low Bitrate Speech Codec With Causal Transformer Models
- MusicGen: vs Google MusicLM; interweaving of discrete sound tokens conditioned on text input
- Accelerating Transducers through Adjacent Token Merging
- reduces tokens by 57%, improves speed by 70%
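Adjacent token merging exploits how redundant speech frames are: if a frame is nearly identical to the running merged frame, fold it in instead of emitting a new token. A stdlib sketch with cosine similarity (the threshold value is illustrative):

```python
import math

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def merge_adjacent(frames, threshold=0.98):
    # If the next frame is nearly identical to the current merged frame,
    # average it in; otherwise start a new token. Output is shorter than input.
    merged, counts = [list(frames[0])], [1]
    for f in frames[1:]:
        if cos_sim(merged[-1], f) >= threshold:
            n = counts[-1]
            merged[-1] = [(m * n + x) / (n + 1) for m, x in zip(merged[-1], f)]
            counts[-1] = n + 1
        else:
            merged.append(list(f))
            counts.append(1)
    return merged

frames = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.01, 1.0]]
out = merge_adjacent(frames)  # four frames collapse to two merged tokens
```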
3.1. LANGUAGE ENCODING
- Natural Language Supervision for General-Purpose Audio Representations
- audio representations trained on 22 tasks, instead of only sound event classification
- autoregressive decoder-only language model, then jointly trained with contrastive learning
- CLARA: Multilingual Contrastive Learning for Audio Representation Acquisition
- contrastive audio-text model, with understanding of implicit aspects of speech: emotions
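CLAP/CLARA-style training pairs an audio encoder and a text encoder with a symmetric contrastive (InfoNCE) objective: matched pairs sit on the diagonal of the similarity matrix, and cross-entropy pulls them together in both directions. A pure-Python toy version:

```python
import math

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    # sims[i][j]: similarity of audio i vs text j; diagonal = matched pairs.
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    n = len(audio_emb)
    sims = [[dot(a, t) / temperature for t in text_emb] for a in audio_emb]

    def xent(rows):  # cross-entropy with the diagonal as the target class
        loss = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            logz = m + math.log(sum(math.exp(s - m) for s in row))
            loss += logz - row[i]
        return loss / n

    cols = [list(col) for col in zip(*sims)]  # text → audio direction
    return 0.5 * (xent(sims) + xent(cols))

aligned = [[1.0, 0.0], [0.0, 1.0]]
loss_good = contrastive_loss(aligned, aligned)        # matched pairs → low loss
loss_bad = contrastive_loss(aligned, aligned[::-1])   # shuffled pairs → high loss
```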
- HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis
3.2. SUPER-RESOLUTION
- AudioSR: Versatile Audio Super-resolution at Scale (upsample, enhance)
4. VOICE CONVERSION
- end2endvc: End-to-End Voice Conversion with Information Perturbation
=better mos than nvc=
- QuickVC (better MOS than FreeVC, 5000 kHz fastest)
=vits=
- TriAAN-VC: Triple Adaptive Attention Normalization for Any-to-Any Voice Conversion
- best similarity and near-best naturalness; speaker encoding
- 1 (best similarity, semantic tokens)
4.1. FEW SHOT
- FREEVC: TOWARDS HIGH-QUALITY TEXT-FREE ONE-SHOT VOICE CONVERSION
=vits=
- CONTROLVC: ZERO-SHOT VOICE CONVERSION WITH TIME-VARYING CONTROLS ON PITCH AND SPEED
- StyleTTS-VC: One-Shot Voice Conversion by Knowledge Transfer from Style-Based TTS Models (phonemes)
- HierVST: Hierarchical Adaptive Zero-shot Voice Style Transfer
- end-to-end zero-shot VST model (better than DiffVC)
5. STYLE CONVERSION
- LVC-VC: Voice Conversion with Location-Variable Convolutions
- simultaneously performing voice conversion while generating audio
- smaller than NVC-Net
=has charts=
- NVC-Net: End-to-End Adversarial Voice Conversion
=SONY=
=best one=
- voice conversion directly on the raw audio waveform
- 3600 kHz fastest
- https://github.com/sony/ai-research-code nvc voices
5.1. ATOMIC TRANSFERS
- Speech Representation Extractor (Nvidia): divides speech into voice, pitch, and context; zero-shot
- Pits: vits with pitch control (monotonic alignment)
- SpeechSplit: disentangling speech into content, timbre, rhythm and pitch.
- AutoVC implementation
- speaker embeddings: https://github.com/yistLin/dvector
- FragmentVC: timbre transfer (better than AutoVC), keeps frequency
- RGSM: better than Fragment
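A d-vector is just the length-normalized mean of per-utterance speaker embeddings; speakers are then compared by cosine similarity. A toy sketch (the embeddings here are made up; real ones come from a trained speaker encoder such as the dvector repo above):

```python
import math

def d_vector(utterance_embs):
    # a speaker's d-vector: length-normalized mean of utterance embeddings
    dim = len(utterance_embs[0])
    mean = [sum(e[i] for e in utterance_embs) / len(utterance_embs) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))  # inputs are already unit-norm

alice = d_vector([[0.9, 0.1], [1.0, 0.0]])
bob = d_vector([[0.1, 0.9], [0.0, 1.0]])
probe = d_vector([[0.95, 0.05]])
same = cosine(probe, alice) > cosine(probe, bob)  # → True: probe matches alice
```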
- MFC-StyleVC: DELIVERING SPEAKING STYLE IN LOW-RESOURCE CONVERSION WITH MULTI-FACTOR CONSTRAINTS
- repeat the utterance; different training objective for adaptation, normalizing
- content, speaker, style on/off
- Nonparallel Emotional Voice Conversion For Unseen Speaker-Emotion Pairs Using Dual Domain Adversarial Network & Virtual Domain Pairing
=SONY=
- emotion transfer