    Add Seamless M4T model (#25693) · cb45f71c
    Yoach Lacombe authored
    
    
    * first raw commit
    
    * still POC
    
    * tentative convert script
    
    * almost working speech encoder conversion scripts
    
    * intermediate code for encoder/decoders
    
    * add modeling code
    
    * first version of speech encoder
    
    * make style
    
    * add new adapter layer architecture
    
    * add adapter block
    
    * add first tentative config
    
    * add working speech encoder conversion
    
    * base model convert works now
    
    * make style
    
    * remove unnecessary classes
    
    * remove unnecessary functions
    
    * add modeling code speech encoder
    
    * rework logic
    
    * forward pass of sub-components works
    
    * add modeling code
    
    * some config modifs and modeling code modifs
    
    * save WIP
    
    * new edits
    
    * same output speech encoder
    
    * correct attention mask
    
    * correct attention mask
    
    * fix generation
    
    * new generation logic
    
    * erase comments
    
    * make style
    
    * fix typo
    
    * add some descriptions
    
    * new state
    
    * clean imports
    
    * add tests
    
    * make style
    
    * make beam search and num_return_sequences>1 work
    
    * correct edge case issue
    
    * correct SeamlessM4TConformerSamePadLayer copied from
    
    * replace ACT2FN relu by nn.relu
    
    * remove unnecessary return variable
    
    * move back a class
    
    * change name conformer_attention_mask -> conv_attention_mask
    
    * better nit code
    
    * add some Copied from statements
    
    * small nits
    
    * small nit in dict.get
    
    * rename t2u model -> conditionalgeneration
    
    * ongoing refactoring of structure
    
    * update models architecture
    
    * remove SeamlessM4TMultiModal classes
    
    * add tests
    
    * adapt tests
    
    * some non-working code for vocoder
    
    * add seamlessM4T vocoder
    
    * remove buggy line
    
    * fix some hifigan related bugs
    
    * remove hifigan-specific config
    
    * change
    
    * add WIP tokenization
    
    * add seamlessM4T working tokenizer
    
    * update tokenization
    
    * add tentative feature extractor
    
    * Update converting script
    
    * update working FE
    
    * refactor input_values -> input_features
    
    * update FE
    
    * changes in generation, tokenizer and modeling
    
    * make style and add t2u_decoder_input_ids
    
    * add intermediate outputs for ToSpeech models
    
    * add vocoder to speech models
    
    * update valueerror
    
    * update FE with languages
    
    * add vocoder convert
    
    * update config docstrings and names
    
    * update generation code and configuration
    
    * remove todos and update config.pad_token_id to generation_config.pad_token_id
    
    * move block vocoder
    
    * remove unnecessary code and uniformize tospeech code
    
    * add feature extractor import
    
    * make style and fix some copies from
    
    * correct consistency + make fix-copies
    
    * add processor code
    
    * remove comments
    
    * add fast tokenizer support
    
    * correct pad_token_id in M4TModel
    
    * correct config
    
    * update tests and code + make style
    
    * make some suggested corrections - correct comments and change naming
    
    * rename some attributes
    
    * rename some attributes
    
    * remove unnecessary sequential
    
    * remove option to use dur predictor
    
    * nit
    
    * refactor hifigan
    
    * replace normalize_mean and normalize_var with do_normalize + save lang ids to generation config
    
    * add tests
    
    * change tgt_lang logic
    
    * update generation ToSpeech
    
    * add support import SeamlessM4TProcessor
    
    * fix generate
    
    * make tests
    
    * update integration tests, add option to only return text and update tokenizer fast
    
    * fix wrong function call
    
    * update import and convert script
    
    * update integration tests + update repo id
    
    * correct paths and add first test
    
    * update how new attention masks are computed
    
    * update tests
    
    * first pass at batching in vocoder code
    
    * add batching with the vocoder
    
    * add waveform lengths to model outputs
    
    * make style
    
    * add generate kwargs + forward kwargs of M4TModel
    
    * add docstrings forward methods
    
    * reformat docstrings
    
    * add docstrings t2u model
    
    * add another round of modeling docstrings + reformat speaker_id -> spkr_id
    
    * make style
    
    * fix check_repo
    
    * make style
    
    * add seamlessm4t to toctree
    
    * correct check_config_attributes
    
    * write config docstrings + some modifs
    
    * make style
    
    * add docstrings tokenizer
    
    * add docstrings to processor, fe and tokenizers
    
    * make style
    
    * write first version of model docs
    
    * fix FE + correct FE test
    
    * fix tokenizer + add correct integration tests
    
    * fix most tokenization tests
    
    * make style
    
    * correct most processor tests
    
    * add generation tests and fix num_return_sequences > 1
    
    * correct integration tests - still one left
    
    * make style
    
    * correct position embedding
    
    * change numbeams to 1
    
    * refactor some modeling code and correct one test
    
    * make style
    
    * correct typo
    
    * refactor intermediate fnn
    
    * refactor feedforward conformer
    
    * make style
    
    * remove comments
    
    * make style
    
    * fix tokenizer tests
    
    * make style
    
    * correct processor tests
    
    * make style
    
    * correct S2TT integration
    
    * Apply suggestions from Sanchit's code review
    
    Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
    
    * correct typo
    
    * replace torch.nn->nn + make style
    
    * change Output naming (waveforms -> waveform) and ordering
    
    * nit renaming and formatting
    
    * remove return None when not necessary
    
    * refactor SeamlessM4TConformerFeedForward
    
    * nit typo
    
    * remove almost copied from comments
    
    * add a copied from comment and remove an unnecessary dropout
    
    * remove inputs_embeds from speechencoder
    
    * remove backward compatibility function
    
    * reformat class docstrings for a few components
    
    * remove unnecessary methods
    
    * split something hard to read over 2 lines
    
    * make style
    
    * replace two steps offset by one step as suggested
    
    * nit typo
    
    * move warnings
    
    * remove useless lines from processor
    
    * make generation non-standard test more robust
    
    * remove torch.inference_mode from tests
    
    * split integration tests
    
    * enrich md
    
    * rename control_symbol_vocoder_offset->vocoder_offset
    
    * clean convert file
    
    * remove tgt_lang and src_lang from FE
    
    * change generate docstring of ToText models
    
    * update generate docstring of tospeech models
    
    * unify how to deal with text_decoder_input_ids
    
    * add default spkr_id
    
    * unify tgt_lang for t2u_model
    
    * simplify tgt_lang verification
    
    * remove a todo
    
    * change config docstring
    
    * make style
    
    * simplify t2u_tgt_lang_id
    
    * make style
    
    * enrich/correct comments
    
    * enrich .md
    
    * correct typo in docstrings
    
    * add torchaudio dependency
    
    * update tokenizer
    
    * make style and fix copies
    
    * modify SeamlessM4TConverter with new tokenizer behaviour
    
    * make style
    
    * correct small typo docs
    
    * fix import
    
    * update docs and add requirement to tests
    
    * add convert_fairseq2_to_hf in utils/not_doctested.txt
    
    * update FE
    
    * fix imports and make style
    
    * remove torchaudio in FE test
    
    * add seamless_m4t.md to utils/not_doctested.txt
    
    * nits and change the way docstring dataset is loaded
    
    * move checkpoints from ylacombe/ to facebook/ orga
    
    * refactor warning/error to fit within the 119-character line width limit
    
    * round overly precise floats
    
    * add stereo audio behaviour
    
    * refactor .md and make style
    
    * enrich docs with more precise architecture description
    
    * re-add undocumented models
    
    * make fix-copies
    
    * apply some suggestions
    
    * Apply suggestions from code review
    
    Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
    Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
    
    * correct bug from previous commit
    
    * refactor a parameter to clean up the code + some small nits
    
    * clean tokenizer
    
    * make style and fix
    
    * make style
    
    * clean tokenizers arguments
    
    * add precisions for some tests
    
    * move docs from not_tested to slow
    
    * modify tokenizer according to last comments
    
    * add copied from statements in tests
    
    * correct convert script
    
    * correct parameter docstring style
    
    * correct tokenization
    
    * correct multi gpus
    
    * make style
    
    * clean modeling code
    
    * make style
    
    * add copied from statements
    
    * add copied statements
    
    * add support with ASR pipeline
    
    * remove file added inadvertently
    
    * fix docstrings seamlessM4TModel
    
    * add seamlessM4TConfig to OBJECTS_TO_IGNORE due to unconventional markdown
    
    * add seamlessm4t to assisted generation ignored models
    
    ---------
    
    Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
    Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>