voice
Table of Contents
- parent: domain
- AUDIO VISION
- models: https://rentry.org/AIVoiceStuff
- tortoise dvae
- MusicHiFi: Fast High-Fidelity Stereo Vocoding
- from mel-spectrogram to higher quality mono and stereo
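Vocoders like MusicHiFi start from a mel-spectrogram, i.e. a spectrogram whose frequency axis is warped to the perceptual mel scale. A stdlib-only sketch of the HTK mel conversion and the spacing of the filter centers (function names are my own, not from the paper):

```python
import math

def hz_to_mel(f_hz: float) -> float:
    # HTK-style mel scale: perceptually uniform spacing of frequencies
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m: float) -> float:
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_band_centers(n_mels: int, f_min: float, f_max: float) -> list[float]:
    # centers of the triangular mel filters, evenly spaced on the mel axis
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    return [mel_to_hz(lo + (hi - lo) * i / (n_mels + 1)) for i in range(1, n_mels + 1)]

centers = mel_band_centers(80, 0.0, 8000.0)  # typical 80-band setup for TTS
```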
1. GENERATION
- FoundationTTS: Text-to-Speech for ASR Customization with Generative Language Model (automatic phonemes, coarse and fine composition)
- artificial tongue-throat
- Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale (20 times faster than valle)
- Open sourcing AudioCraft: Generative AI for audio made simple and available to all
- MusicGen, AudioGen, and EnCodec
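MusicGen handles EnCodec's multiple codebook streams with a delay pattern: codebook k is shifted by k steps, so one transformer step predicts all K codebooks in parallel while keeping their dependency order. A toy sketch of that interleaving (PAD is a hypothetical placeholder token):

```python
PAD = -1  # hypothetical padding token for shifted-out positions

def delay_interleave(codes: list[list[int]]) -> list[list[int]]:
    # codes[k][t]: token of codebook k at frame t.
    # Codebook k is delayed by k steps, so at sequence step s the model
    # predicts codebook k's token for frame s - k, all codebooks at once.
    K, T = len(codes), len(codes[0])
    out = [[PAD] * (T + K - 1) for _ in range(K)]
    for k in range(K):
        for t in range(T):
            out[k][t + k] = codes[k][t]
    return out

pattern = delay_interleave([[1, 2, 3], [4, 5, 6]])
# → [[1, 2, 3, -1], [-1, 4, 5, 6]]
```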
1.1. MAGNET
1.2. AUDIO DIFFUSION
- AUDIO DIFFUSION (SOUND MUSIC VOICE)
- parent: diffusion
- NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers
- From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion
- multi-band diffusion, generates any type of audio
- music diffusion https://www.arxiv-vanity.com/papers/2301.11757/
- JEN-1: Text-Guided Universal Music Generation with Omnidirectional Diffusion Models
- text-guided music generation, music inpainting, and continuation
- Re-AudioLDM: Retrieval-Augmented Text-to-Audio Generation (CLAP, audio clip), complex scenes
- Stable Audio Tools: audio training
=by Stability AI=
- Controllable Music Production with Diffusion Models and Guidance Gradients
- continuation, inpainting and regeneration; style transfer
- StyleTTS2: ElevenLabs quality
=best=
- E3 TTS: Easy End-to-End Diffusion-based Text to Speech
- Music ControlNet: Multiple Time-varying Controls for Music Generation
- melody, dynamics, and rhythm controls, 35x fewer parameters, 11x less data
- Mustango: Toward Controllable Text-to-Music Generation
- conditioned on prompts and various musical features
- Fast Timing-Conditioned Latent Audio Diffusion
- conditioned on text prompts as well as timing embeddings, can generate structure and stereo sounds
1.2.1. OUTPERFORMED?
- Schrödinger Bridges Beat Diffusion Models on Text-to-Speech Synthesis
- issue: diffusion starts from a noisy representation carrying little information about the generation target
- solution: Bridge-TTS starts from strong structural information about the target
- Schrödinger bridge between the latent from the text input and the ground-truth mel-spectrogram
- better synthesis quality and sampling efficiency
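One way to see the contrast, as a Brownian-bridge special case (a sketch, not necessarily the paper's exact parameterization): diffusion runs a reverse SDE from an uninformative Gaussian prior, while the bridge pins both endpoints, so the prior already carries the target's structure.

```latex
% Diffusion TTS: prior is pure noise,            x_1 \sim \mathcal{N}(0, I)
% Bridge-TTS:    prior is the text-encoder latent, x_1 = \mathcal{E}(y)
\mathrm{d}x_t = \frac{x_1 - x_t}{1 - t}\,\mathrm{d}t + g(t)\,\mathrm{d}W_t,
\qquad x_0 = \text{mel-spectrogram}, \quad x_1 = \mathcal{E}(y)
```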
1.3. TTS GPT
- AudioLM: a Language Modeling Approach to Audio Generation
- uses w2v-BERT semantic tokens and SoundStream acoustic codes
- also does TTS; later extended into VALL-E
- SoundStorm: Efficient Parallel Audio Generation
- two orders of magnitude faster than AudioLM, 50 fps, 30 seconds of speech continuation within 2 seconds
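SoundStorm's speed comes from MaskGIT-style confidence-based parallel decoding: start fully masked, and at each round commit only the most confident predictions. A toy stdlib sketch with a stand-in scorer (the real model scores RVQ tokens level by level):

```python
import random

MASK = None  # stand-in mask token

def parallel_decode(length, steps, predict):
    # Start fully masked; each round, score every masked slot and commit
    # only the most confident predictions, all in parallel.
    seq = [MASK] * length
    for step in range(steps):
        masked = [i for i, tok in enumerate(seq) if tok is MASK]
        if not masked:
            break
        # predict(seq, i) -> (confidence, token); sort descending by confidence
        scored = sorted(((*predict(seq, i), i) for i in masked), reverse=True)
        keep = -(-len(masked) // (steps - step))  # ceil, so decoding finishes on time
        for conf, tok, i in scored[:keep]:
            seq[i] = tok
    return seq

random.seed(0)
# hypothetical "model": confidence is random, token is just position mod 4
demo = parallel_decode(8, 3, lambda seq, i: (random.random(), i % 4))
# → [0, 1, 2, 3, 0, 1, 2, 3]
```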
- bark
=best so far=
- Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias (not just voices)
- decomposed design, uses spectrograms, large in-the-wild dataset, phase reconstruction, best zero-shot
- UniAudio: An Audio Foundation Model Toward Universal Audio Generation
- transformer, LMs techniques, simple fine-tuning
=best=
1.3.1. TORTOISE LIKE
- tortoise finetuning
- OpenAI’s Text to Speech TTS
- EmotiVoice: a Multi-Voice and Prompt-Controlled TTS Engine
- Piper: a fast, local neural text-to-speech system
=best=
1.4. ALIGNMENT
- Improving Joint Speech-Text Representations Without Alignment
- fixes the sequence-length mismatch naturally by simply assuming the best alignment
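The idea can be sketched as a best-case upsampler: stretch the text sequence uniformly to the speech frame count so the two modalities line up position-by-position, with no learned aligner (an illustrative stand-in, not the paper's exact mechanism):

```python
def best_case_upsample(text_tokens: list, n_frames: int) -> list:
    # Assume the best alignment: stretch the text sequence uniformly to the
    # speech frame count, so speech and text representations can be compared
    # frame-by-frame without an explicit alignment model.
    n = len(text_tokens)
    return [text_tokens[min(i * n // n_frames, n - 1)] for i in range(n_frames)]

stretched = best_case_upsample(["h", "i"], 5)
# → ['h', 'h', 'h', 'i', 'i']
```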
2. INSTRUMENT LIKE
- WavJourney: Compositional Audio Creation with Large Language Models
- script compiler: encompassing speech, music, effects, guided by instructions; creative control
- Audio Style Transfer (using a dsp - a daw plugin)
- gradient estimation instead of having to replace the plugin with a proxy network
- SpeechX: Neural Codec Language Model as a Versatile Speech Transformer
- phoneme intrinsics; choose-task voice transform (like voice transfer)
- Text-to-Sing: melody, then with your own lyrics
- ChatMusician: Understanding and Generating Music Intrinsically with LLM
- music-notation is treated as a second language
- also excellent compressor for music
- MusicLang: Llama 2 based Music generation model
- trained from scratch; runs on cpu
- using chords
3. AUDIO CODEC
- Disen: Disentangled Feature Learning for Real-Time Neural Speech Coding
- voice conversion in real-time communications
=Codec=
- a separate codebook each for speaker and content
- valle concept: modeling (building up decoder)
- VALL-E X: Multilingual Text-to-Speech Synthesis and Voice Cloning
- clone with only 3 seconds of reference audio
- High-Fidelity Audio Compression with Improved RVQGAN (8kbps)
- 3 times better than Facebook's EnCodec
=best codec=
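The residual vector quantization (RVQ) in RVQGAN/EnCodec-style codecs works in stages: each codebook encodes the residual the previous stages missed, so quality and bitrate scale with the number of codebooks. A scalar toy sketch (real codecs quantize vectors, not single floats):

```python
def quantize(x: float, codebook: list[float]) -> float:
    # pick the nearest codeword
    return min(codebook, key=lambda c: abs(x - c))

def rvq_encode(x: float, codebooks: list[list[float]]) -> list[float]:
    # Residual VQ: stage k quantizes what stages 0..k-1 failed to capture,
    # so each later codebook refines the same sample at a finer scale.
    residual, codes = x, []
    for cb in codebooks:
        q = quantize(residual, cb)
        codes.append(q)
        residual -= q
    return codes

books = [[-1.0, 0.0, 1.0], [-0.25, 0.0, 0.25], [-0.05, 0.0, 0.05]]  # coarse → fine
codes = rvq_encode(0.7, books)
approx = sum(codes)  # decoding is just summing the selected codewords, ≈ 0.7
```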
- LMCodec: A Low Bitrate Speech Codec With Causal Transformer Models
- MusicGen: vs Google MusicLM; interweaving of discrete sound tokens conditioned on text input
- Accelerating Transducers through Adjacent Token Merging
- reduces tokens by 57%, improves speed by 70%
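Adjacent token merging exploits how redundant speech frames are: if a frame is nearly identical to the running merged frame, fold it in instead of emitting a new token. A stdlib sketch with cosine similarity (the threshold value is illustrative):

```python
import math

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def merge_adjacent(frames, threshold=0.98):
    # If the next frame is nearly identical to the current merged frame,
    # average it in; otherwise start a new token. Output is shorter than input.
    merged, counts = [list(frames[0])], [1]
    for f in frames[1:]:
        if cos_sim(merged[-1], f) >= threshold:
            n = counts[-1]
            merged[-1] = [(m * n + x) / (n + 1) for m, x in zip(merged[-1], f)]
            counts[-1] = n + 1
        else:
            merged.append(list(f))
            counts.append(1)
    return merged

frames = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.01, 1.0]]
out = merge_adjacent(frames)  # four frames collapse to two merged tokens
```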
3.1. LANGUAGE ENCODING
- Natural Language Supervision for General-Purpose Audio Representations
- audio representations trained on 22 tasks, instead of only sound event classification
- autoregressive decoder-only language model, then jointly trained with contrastive learning
- CLARA: Multilingual Contrastive Learning for Audio Representation Acquisition
- contrastive audio-text model, with understanding of implicit aspects of speech: emotions
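CLAP/CLARA-style training pairs an audio encoder and a text encoder with a symmetric contrastive (InfoNCE) objective: matched pairs sit on the diagonal of the similarity matrix, and cross-entropy pulls them together in both directions. A pure-Python toy version:

```python
import math

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    # sims[i][j]: similarity of audio i vs text j; diagonal = matched pairs.
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    n = len(audio_emb)
    sims = [[dot(a, t) / temperature for t in text_emb] for a in audio_emb]

    def xent(rows):  # cross-entropy with the diagonal as the target class
        loss = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            logz = m + math.log(sum(math.exp(s - m) for s in row))
            loss += logz - row[i]
        return loss / n

    cols = [list(col) for col in zip(*sims)]  # text → audio direction
    return 0.5 * (xent(sims) + xent(cols))

aligned = [[1.0, 0.0], [0.0, 1.0]]
loss_good = contrastive_loss(aligned, aligned)        # matched pairs → low loss
loss_bad = contrastive_loss(aligned, aligned[::-1])   # shuffled pairs → high loss
```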
- HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis
3.2. SUPER-RESOLUTION
- AudioSR: Versatile Audio Super-resolution at Scale (upsample, enhance)
4. VOICE CONVERSION
- end2endvc: End-to-End Voice Conversion with Information Perturbation
=better mos than nvc=
- QuickVC (better MOS than FreeVC, 5000 kHz fastest)
=vits=
- TriAAN-VC: Triple Adaptive Attention Normalization for Any-to-Any Voice Conversion
- best similarity and near-best naturalness; speaker encoding
- 1 (best similarity, semantic tokens)
4.1. FEW SHOT
- FREEVC: TOWARDS HIGH-QUALITY TEXT-FREE ONE-SHOT VOICE CONVERSION
=vits=
- CONTROLVC: ZERO-SHOT VOICE CONVERSION WITH TIME-VARYING CONTROLS ON PITCH AND SPEED
- StyleTTS-VC: One-Shot Voice Conversion by Knowledge Transfer from Style-Based TTS Models (phonemes)
- HierVST: Hierarchical Adaptive Zero-shot Voice Style Transfer
- end-to-end zero-shot VST model (better than DiffVC)
5. STYLE CONVERSION
- LVC-VC: Voice Conversion with Location-Variable Convolutions
- simultaneously performing voice conversion while generating audio
- smaller than NVC-Net
=has charts=
- NVC-Net: End-to-End Adversarial Voice Conversion
=SONY=
=best one=
- voice conversion directly on the raw audio waveform
- 3600 kHz fastest
- https://github.com/sony/ai-research-code nvc voices
5.1. ATOMIC TRANSFERS
- Speech Representation Extractor (Nvidia): divides speech into voice, pitch, and context; zero-shot
- Pits: vits with pitch control (monotonic alignment)
- SpeechSplit: disentangling speech into content, timbre, rhythm and pitch.
- AutoVC implementation
- speaker embeddings: https://github.com/yistLin/dvector
- FragmentVC: timbre transfer (better than AutoVC), keeps frequency
- RGSM: better than Fragment
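A d-vector is just the length-normalized mean of per-utterance speaker embeddings; speakers are then compared by cosine similarity. A toy sketch (the embeddings here are made up; real ones come from a trained speaker encoder such as the dvector repo above):

```python
import math

def d_vector(utterance_embs):
    # a speaker's d-vector: length-normalized mean of utterance embeddings
    dim = len(utterance_embs[0])
    mean = [sum(e[i] for e in utterance_embs) / len(utterance_embs) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))  # inputs are already unit-norm

alice = d_vector([[0.9, 0.1], [1.0, 0.0]])
bob = d_vector([[0.1, 0.9], [0.0, 1.0]])
probe = d_vector([[0.95, 0.05]])
same = cosine(probe, alice) > cosine(probe, bob)  # → True: probe matches alice
```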
- MFC-StyleVC: DELIVERING SPEAKING STYLE IN LOW-RESOURCE CONVERSION WITH MULTI-FACTOR CONSTRAINTS
- repeat the utterance; different training objective for adaptation, normalizing
- content, speaker, style on/off
- Nonparallel Emotional Voice Conversion For Unseen Speaker-Emotion Pairs Using Dual Domain Adversarial Network & Virtual Domain Pairing
=SONY=
- emotion transfer