mirror of
https://github.com/ace-step/ACE-Step-1.5.git
synced 2026-07-02 16:37:04 +00:00
* refactor(flow-edit): redesign as cover-overlay sampler, drop "edit" task (#1156) The previous design (PRs #1162–#1167) shipped a standalone ``task_type="edit"`` that fed the user's source audio into ``prepare_condition`` for both branches. In ACE-Step 1.5 ``prepare_condition`` builds ``context_latents`` from ``src_latents`` (line 1701 of the base modeling), which becomes the dominant audio-self-conditioning at the decoder. Both branches receiving the same source latents meant V_src ≈ V_tar regardless of how we tweaked the text/lyric encoder inputs — the overlay had no signal to integrate and produced near-identical output to the baseline at every (n_min, n_max, n_avg) we tried. Architecturally correct shape (from user feedback): flow-edit is a sampler-level technique that layers on top of an existing task, not a new task. The cover/cover-nofsq dispatch already pairs ref-audio context with new caption/lyrics; the overlay simply adds a paired *source* branch (encoded from new ``flow_edit_source_caption`` / ``flow_edit_source_lyrics``) so V_delta = V_tar - V_src has meaning. Backend changes --------------- * Drop ``task_type == "edit"`` everywhere (constants, TASK_INSTRUCTIONS, inference skip_lm_tasks, generate_music_request _src_audio_required_tasks, generate_music guard rails, task_utils generate_instruction). * Replace ``edit_target_*`` GenerationParams with ``flow_edit_morph`` + ``flow_edit_source_caption/lyrics`` + ``flow_edit_n_min/max/avg`` so the user's ``caption``/``lyrics`` keep cover's existing target semantics. * ``service_generate`` builds ``flow_edit_ctx`` only when ``flow_edit_morph=True and task_type in (cover, cover-nofsq)``. * ``_execute_service_generate_diffusion`` dispatches to the new ``dispatch_flow_edit_overlay`` (renamed from ``dispatch_flow_edit``) when the overlay is active; otherwise the regular cover dispatch runs unchanged. 
* ``service_generate_flow_edit_target.py`` renamed to ``service_generate_flow_edit_source.py`` — same helpers, but now they encode the *source* side (the existing payload already carries the user's caption/lyrics as the target). * ``service_generate_flow_edit.py`` rewritten to drive ``flow_edit_pipeline.flowedit_generate_audio`` with the freshly encoded source side + payload's target side. UI changes ---------- * ``build_flow_edit_morph_controls()`` adds a Smooth-morph checkbox + source caption/lyrics + n_min/n_max/n_avg sliders inside a group that ``mode_ui`` toggles visible only on Remix (cover) mode. v1 leaves the controls visible for inspection but does NOT thread them through ``generation_run_wiring`` / ``batch_management_*`` yet — the end-to-end UI run path lands in the follow-up PR. Smoke testing goes through the Python API for now (see ``scripts/flow_edit_overlay_smoke.py``). Tests ----- * Rewrite ``service_generate_flow_edit_test.py`` for the overlay shape: 9 tests now exercise dispatch source-side tokenization, target-side pass-through, missing-method guard, device alignment, default-window fallbacks, dict-meta parsing, and retake_seed forwarding. * Update ``_flow_edit_dispatch_test_support.py`` fakes accordingly. * All 36 flow-edit / pipeline / helper tests pass locally. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor(flow-edit): re-target overlay onto text2music + 1.0 default hyperparams (#1156) User feedback after listening to the cover-overlay v1 outputs: * drop cover task as the underlying carrier — use text2music instead so ``prepare_condition`` produces silence-derived context for both src and tar branches (the cleanest condition shape, identical between branches, so V_delta is purely text-driven); * fall back to ACE-Step 1.0's flow-edit defaults: ``n_min=0, n_max=1.0, n_avg=1, infer_steps=60``. Implementation -------------- * ``flow_edit_pipeline.flowedit_generate_audio`` gains a ``ctx_src_latents`` param. 
When the caller wants text2music-style context but still needs the real ``src_latents`` for ``zt_src``/ ``zt_tar`` formation in the sampling loop, it passes ``ctx_src_latents=silence``. Both ``prepare_condition`` calls then feed silence as both ``hidden_states`` and ``src_latents`` while the loop continues to use the real audio for trajectory formation. * ``service_generate_flow_edit.dispatch_flow_edit_overlay`` rewritten: builds the silence tile, passes ``is_covers=zeros`` (text2music), and forwards real ``src_latents`` for the loop. * ``service_generate.flow_edit_ctx`` now activates only on ``task_type == "text2music"`` (was cover/cover-nofsq). * ``generate_music_request._prepare_reference_and_source_audio`` accepts ``flow_edit_morph`` so text2music + morph encodes ``src_audio`` instead of silently ignoring it. * ``generate_music`` locks ``audio_duration`` to source audio for the morph case and warns if morph is enabled on a non-text2music task. * ``inference.generate_music`` no longer forces ``src_audio=None`` for text2music when ``flow_edit_morph=True``. * ``flow_edit_overlay_smoke.py`` now drives a single 1.0-default run (``n_min=0, n_max=1.0, n_avg=1, infer_steps=60``) on the SFT model. Tests ----- * Update fakes/fixtures so ``task_type='text2music'`` and ``is_covers=0``; all 36 flow-edit / pipeline / helper tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(flow-edit): slice silence_latent properly when tiling for prepare_condition The previous expand() call assumed silence_latent was (1, 1, C) but it's actually (1, available, C) — for example (1, 15000, C) — so expanding to (bsz, 4000, C) blew up with a shape mismatch. Mirror ``conditioning_target._get_silence_latent_slice``: slice the first ``seq`` frames if available, otherwise tile and slice. 
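The slice-or-tile rule described in that fix can be sketched as follows. This is a minimal illustration, not the repository's code: it uses NumPy arrays in place of the torch tensors the pipeline works with, and the function name `silence_slice` is hypothetical. It assumes only what the message states, namely that the silence latent has shape `(1, available, C)` and the caller needs `(bsz, seq, C)`.

```python
import numpy as np


def silence_slice(silence_latent: np.ndarray, batch_size: int, seq_len: int) -> np.ndarray:
    """Build a (batch_size, seq_len, C) silence context from a (1, available, C) latent.

    Hypothetical helper: slice the first ``seq_len`` frames when enough are
    available, otherwise tile along the time axis and then slice.
    """
    _, available, channels = silence_latent.shape
    if available >= seq_len:
        frames = silence_latent[:, :seq_len, :]
    else:
        repeats = -(-seq_len // available)  # ceiling division
        frames = np.tile(silence_latent, (1, repeats, 1))[:, :seq_len, :]
    # Broadcast the single silence sequence across the batch dimension.
    return np.broadcast_to(frames, (batch_size, seq_len, channels)).copy()


if __name__ == "__main__":
    silence = np.zeros((1, 15000, 8), dtype=np.float32)
    print(silence_slice(silence, batch_size=2, seq_len=4000).shape)
```

Slicing before broadcasting is what avoids the shape mismatch: only the batch axis has size 1, so only that axis can be expanded.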
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(flow-edit): add shift=3.0 vs shift=1.0 A/B for overlay smoke ACE-Step 1.0's ``FlowMatchEulerDiscreteScheduler`` defaults to shift=3.0, which front-loads the schedule near t=0 (more dense steps at the clean end). Our smoke wasn't setting shift explicitly so it ran at shift=1.0 (uniform). Add a paired run so we can A/B the two on the same source and confirm whether the shift mismatch explains the v1 distortion the user reported. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(flow-edit): wire morph UI to text2music + run-handler chain (#1156) User found that the Gradio demo didn't surface a working morph control because the UI pieces weren't actually connected end-to-end: * ``mode_ui`` made ``flow_edit_morph_group`` visible only on Remix mode, but the backend dispatches morph for ``task_type == text2music`` (Custom mode). Net: morph never fired. Fix: gate visibility on ``is_custom`` and add Custom to ``show_src_audio`` so users can upload the source audio for the overlay. * ``generation_run_wiring`` didn't pass the 6 ``flow_edit_*`` UI components to the handler. Add them to the click-event ``inputs``. * ``generate_with_batch_management`` and ``generate_with_progress`` signatures gain the same 6 params and forward them into the ``GenerationParams`` constructor. * ``generate_music_request`` now hard-errors when ``flow_edit_morph`` is True without a ``src_audio`` (was a silent no-op). * ``flow_edit_pipeline._warn_about_disabled_v1_tricks`` log no longer references the removed ``task_type='edit'`` — overlay is the right noun now. * Drop the "API preview" caveat from the morph checkbox info text. Hygiene: * Add ``*.pkf`` and ``flow_edit_test_outputs/`` / ``retake_test_outputs/`` to ``.gitignore``; remove the 28 .pkf binaries that a previous ``git add -A`` accidentally pushed. 
Limitations (follow-up PRs): * save/restore for the 6 morph params isn't threaded through batch_queue / metadata yet — restoring a session resets morph to defaults. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(ui): combine retake + smooth-morph into one accordion (#1155, #1156) Per user feedback after first manual demo round: * The two variation knobs (Retake = #1155 noise blend, Smooth morph = #1156 V_delta overlay) are conceptually paired — both produce a controlled deviation from the seeded baseline. Stack them in one ``gr.Accordion("Variation & Smooth Morph")`` with two top-line checkboxes; checking a box reveals only that subsystem's inputs. * Compact retake — variance + seed sit on one row inside their panel. * Morph controls (source caption / source lyrics / n_min / n_max / n_avg) get their own subpanel with explanatory help text below. * Both subsystems are now visible in Custom / Remix / Repaint modes; ``mode_ui`` gates the outer accordion via ``is_custom or is_cover or is_repaint``. Backend dispatch still honours morph only in Custom (text2music) for v1; the morph panel info text flags this. * Send-to-Remix / Send-to-Repaint pre-fills ``flow_edit_source_caption`` + ``flow_edit_source_lyrics`` with the previous run's prompt so the user can flip on morph and edit the top-level caption / lyrics as the target without re-typing the source description. Implementation -------------- * New ``generation_tab_variation_morph_controls.py`` (renamed from ``generation_tab_retake_controls.py``) houses the combined builder. * ``build_flow_edit_morph_controls`` removed from ``generation_tab_secondary_controls.py`` (back under the 200 LOC cap). * ``_MODE_UI_OUTPUT_KEYS`` swaps ``flow_edit_morph_group`` → ``variation_accordion``; ``mode_ui`` outputs gr.update for the new key. Existing UI tests (``context_test.py``) updated. 
* ``send_audio_to_remix`` / ``send_audio_to_repaint`` return shape grows by 2 (caption + lyrics for ``flow_edit_source_*``); ``results_aux_wiring`` adds them to the click outputs list. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(ui): retake/edit panels side-by-side; rename to plain Retake / Edit (#1155, #1156) Three issues reported after the manual demo round: 1. Panels were stacked vertically — when both Retake and Edit were checked the layout collapsed top-to-bottom. Fix: two ``gr.Column``s side-by-side inside a single ``gr.Row``; each checkbox sits on top of its own panel, expanded independently. 2. Verbose checkbox labels. Drop the parentheticals: "Retake (variation)" → "Retake"; "Smooth morph (flow-edit)" → "Edit". Accordion title becomes "Retake & Edit". 3. The trailing ``gr.Markdown`` with retake explainer was wider than its column and got clipped at the right edge. Replace with a slim ``gr.HTML`` line under the Edit panel only (the morph-only caveat), and shorten the per-input ``info=`` blurbs so they fit inside the slider/textbox without overlap. Also: place the accordion right under "LM codes Hints" in the layout so the variation knobs are next to the source-audio inputs they apply to. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(ui): help modal + Copy-from-current button for Retake & Edit (#1155, #1156) User reported two issues from manual demo: 1. The inline ``gr.HTML`` help block under the Edit panel rendered as a black band (z-index / overflow conflict with the surrounding column / accordion). Replace with the project-standard ``create_help_button`` pattern — a (?) button next to each checkbox that opens a modal with full markdown help. Both Retake and Edit now have their own modals. 2. There was no quick way to bootstrap the Edit ``source caption / lyrics`` from the user-level fields. 
Add a ``📋 Copy from current`` button inside the Edit panel that copies ``captions`` → ``flow_edit_source_caption`` and ``lyrics`` → ``flow_edit_source_lyrics``. Wired in ``generation_text_format_wiring`` next to the existing Format buttons. Help modal content (en.json): * ``help.generation_retake`` — explains the variance-preserving sin/cos blend (``mixed = cos(v·π/2)·base + sin(v·π/2)·retake``), variance-band recommendations, retake_seed semantics, and links issue #1155. * ``help.generation_edit`` — explains the V_delta integration math (``z_edit_{t-Δt} = z_edit_t + (V_tar − V_src)·Δt``), the paired-CFG method, recommended hyperparams (Custom mode, shift=3, n_min=0, n_max=1, n_avg=1), the Copy-from-current workflow, and cites the FlowEdit paper (Kulikov et al., CVPR 2025, arXiv:2412.08629) plus issue #1156. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(ui): compact Copy button + side-by-side source caption/lyrics in Edit panel After the previous round the Copy-from-current button was rendering as a full-width dark grey bar (Gradio expanded the button to fill the column, so the label was pushed to one side and the empty fill looked like a broken help block). Wrap the button in a ``gr.Row`` with ``scale=0`` + ``min_width=180`` so it claims only the space its label needs. Per follow-up feedback, ``source caption`` and ``source lyrics`` now sit side-by-side inside the Edit panel — the previous top-bottom stack wasted vertical space and made the panel scroll for no reason. Both textboxes now share a 4-line minimum on a single row and split the width 50/50 via ``scale=1`` each. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(ui): drop checkbox ``info`` blurbs; help (?) button is enough The Gradio-auto ``ⓘ`` icon next to the checkbox label and the (?) help button were both rendered next to each other, looking like two redundant info indicators (and one of them was a no-op tooltip). 
Drop the short ``info=`` strings on the Retake / Edit checkboxes — the (?) modal carries the full explanation already. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(ui): make issue references clickable in Retake/Edit help modals The ``_md_to_html`` renderer already converts ``[text](url)`` markdown links to anchor tags, but the issue references in the help text were plain ``Issue #1156`` strings. Wrap them as proper markdown links pointing at the GitHub issues so the modal renders them as ``<a href=...>Issue #1156</a>`` (matching the existing FlowEdit paper arXiv link). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(ui): drop the outer "Retake & Edit" accordion The accordion added an unnecessary collapse layer — its title is self-evident from the two checkboxes underneath, and one extra click to expand was friction. Replace with a plain ``gr.Group()`` so the two checkboxes sit directly in the layout. Mode-UI visibility still hides/shows the whole block via the renamed ``variation_group`` key. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(ui): warn user when Retake + Think collide (#1155) Retake's noise-blend variation is only meaningful when every other condition matches the baseline run. In particular, with Think on the LM regenerates audio codes per call, so Retake's variance gets layered on top of an already-different starting point and the result mixes "LM drift" with "noise drift" — confusing and rarely what the user wants. Two surfaces, one message: * ``help.generation_retake`` modal gains a "⚠️ Consistency requirement" section explaining the seed / Think / knob-locking caveats and the recommended A/B-comparison workflow. * A live ``gr.Markdown`` warning sits inside the Retake panel and becomes visible only when both ``retake_enabled`` and ``think_checkbox`` are simultaneously on. 
Wired in ``generation_text_format_wiring.py`` next to the existing Copy-from-current handler. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(ui): document Think→Retake workflow via LM-codes pinning (#1155) User feedback: when the baseline you want to retake was generated with Think on, the cleanest way to lock the LM-side starting point is to copy the result's LM Codes and paste them into the LM Codes Hints field, then uncheck Think. Add this workflow: * ``help.generation_retake`` modal gains a "Recommended workflow for retaking a Think-mode result" section walking through the 5 steps (open Score & LRC & LM Codes accordion → copy LM Codes → paste into LM Codes Hints → uncheck Think → enable Retake / lock seed / set retake_seed / adjust variance). * The inline Retake×Think warning becomes actionable: it now points the user at the exact panels and fields they need to use, with the full step-by-step left to the (?) modal. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(flow-edit): extend overlay to cover / cover-nofsq tasks (#1156) User reported Edit didn't fire when generating with LM Codes (cover scenario) on Gradio. Root cause: the v1 backend gate restricted morph to ``task_type == "text2music"`` and silently ignored everything else. Extend the dispatch: * ``service_generate.flow_edit_ctx`` now activates on ``task_type in (text2music, cover, cover-nofsq)``. * ``dispatch_flow_edit_overlay`` branches on task: - ``text2music`` keeps the silence-context behaviour (the verified clean text-driven V_delta path). - ``cover`` / ``cover-nofsq`` pass the payload's real ``src_latents`` and ``is_covers`` straight through, so cover's natural LM-codes context flows into both branches via ``prepare_condition``'s ``is_covers > 0`` lm_hints branch. Both branches share the same codes, so V_delta is still text-driven. 
* ``generate_music`` warning relaxed to flag only repaint / extract / lego (which need paired-CFG derivation per task shape). * ``mode_ui`` row 35 comment updated to note Edit's mode coverage. Help text refreshed: * ``help.generation_edit`` now describes both text2music and Remix paths, and the "ignored on Repaint/Extract/Lego" caveat replaces the old "Custom-only" note. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(flow-edit): add flowedit_generate_audio delegate to turbo / xl-turbo (#1156) User hit ``RuntimeError: Flow-edit overlay requires a base DiT variant`` when toggling Edit on a turbo model. Add the same thin delegate the 4 base variants carry, with one turbo-specific guard: * Force ``diffusion_guidance_scale=1.0`` because turbo / xl-turbo are CFG-distilled — paired-CFG over a delta that the model wasn't trained to produce just amplifies noise. Log an INFO when overriding so the user sees why their guidance_scale slider didn't take effect. Both variants share the exact same delegate body; resisted the urge to factor it out because the 4 base variants don't have the gs override and a single shared helper would muddle that. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(flow-edit): drop silence-context override; use task-natural context (#1156) User reported turbo + Custom + Edit produced pure noise on Gradio. Root cause: ``dispatch_flow_edit_overlay`` forced ``is_covers=zeros`` and a silence-tiled context for text2music, but turbo's velocity head was trained on real audio / LM-codes context. 8-step turbo on silence is OOD — V predictions are unstable and the V_delta integration over a short schedule accumulates into garbage latents. Fix: stop overriding the audio context. Pass the payload's real ``src_latents`` and ``is_covers`` straight through to ``prepare_condition`` for every task type. 
Both branches still share the SAME context (whatever the task naturally built: LM-codes hints when Think is on / cover task, src-latents auto-tokenization otherwise), so V_delta is still purely text-driven, but the velocity head stays in distribution. Verified by inspection of the user's run log: * ``Using precomputed LM hints`` printed 3× (LM Phase 1 ran with Think on; both flow-edit prepare_condition calls + the downstream one for auto-LRC saw the precomputed tensor) * ``infer_steps=8, guidance_scale=1.00`` → turbo dispatch path * The forced ``is_covers=zeros`` in our dispatch was discarding the precomputed LM-codes hints and falling back to silence-context. Help text updated to drop the silence-context language and add a note that turbo's 8-step budget is sufficient for flow-edit (the V_delta integration uses the same paired-forward count regardless of variant). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(flow-edit): restore silence-context for text2music; keep cover's natural Empirical sweep (sft60 / tb60 / tb8w05 / tb8s1, all on the user's provided audio + LM codes) showed every variant collapsed to peak ≈ 0.007 the moment the dispatch passed real ``src_latents`` / LM-codes context to ``prepare_condition``. The previous "transparent payload" fix was wrong: text2music's training distribution is (clean target, silence audio context), so feeding it either real source latents or one-sided LM-codes hints pushes the velocity head OOD: V_delta accumulates noise, the latent collapses, and VAE decode + auto-normalisation amplifies the residual to full scale. That's the "pure noise" the user reports. Branch dispatch: * text2music: force silence-context (proven working: peak=1.0 on the earlier jieyue sft+60 run that used silence-context). * cover / cover-nofsq: keep the payload's real ``src_latents`` and ``is_covers``.
The cover task IS trained on LM-codes-derived audio context shared by both branches, so V_delta stays clean while staying in distribution. Documented the diagnostic and the design choice inline so the next edit doesn't re-do the same loop. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(flow-edit): drop precomputed LM hints in text2music branch The empirical sweep showed silence-context worked when ``audio_codes`` was absent (peak=0.92) but collapsed when codes were present (peak=0.007), even with ``is_covers=0`` forced. Even though ``prepare_condition``'s ``where(is_covers > 0, lm_hints, src_latents)`` keeps the silence src_latents unmodified, the codes-derived ``lm_hints_25Hz`` tensor itself lingers in downstream tensor paths and empirically collapses the latent. Force ``precomputed_lm_hints_25Hz=None`` for the text2music silence-context branch so ``prepare_condition`` tokenizes silence afresh, matching the working no-codes path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(flow-edit): strip audio_codes for text2music morph (#1156) Root cause of the user's "pure noise" repro: when ``audio_codes`` was present, ``conditioning_target._prepare_target_latents_and_wavs`` replaced ``target_wavs`` with zeros and put ``_decode_audio_codes_to_latents(codes)`` into ``target_latents``. That decoded-from-codes tensor sits at a different distribution than VAE-encoded audio, so flow-edit's ``zt_edit = src_latents.clone()`` started OOD and the V_delta integration collapsed to a near-silent latent (peak ≈ 0.007), which auto-normalisation then amplified to full scale as audible noise. Verified by toggling ``USE_CODES`` in the repro script: * with codes → peak 0.0076 (broken) * no codes → peak 0.9258 (clean output) Strip ``audio_code_string`` for ``text2music`` + ``flow_edit_morph`` specifically.
The downstream pipeline then VAE-encodes the uploaded mp3 normally, and zt_edit starts at a real audio latent, so flow-edit's math behaves as intended. Cover / cover-nofsq + morph keeps codes intact (cover IS trained on codes-derived context, both branches share it, no OOD). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(flow-edit): skip LM Phase 1 for text2music + morph (#1156) The earlier fix only zeroed ``audio_code_string_to_use`` at the top of ``generate_music``. But when Think is on (UI default), LM Phase 1 runs anyway and overwrites ``audio_code_string_to_use`` with freshly-generated codes; the same codes path then bites ``conditioning_target`` and zt_edit starts OOD again. Add ``text2music + flow_edit_morph`` to the LM-skip path alongside cover / cover-nofsq / repaint / extract. Think / CoT both silently no-op for the morph case so the downstream pipeline VAE-encodes the src_audio cleanly regardless of whether Think was checked in the UI. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(ui): rewrite Retake/Edit help to match current behaviour + i18n (#1155, #1156) The previous help text for Edit was written when the dispatch still used silence-context unconditionally for text2music. After the codes-context fix (commits 4028a00 + 80f8c9c) the Custom path now silently drops Think / LM Codes Hints / LM Phase 1 and VAE-encodes the user's Source Audio directly. The old help didn't mention this. Update help.generation_edit (en): * Add an explicit "Workflow" section with concrete UI steps (which mode, where to upload, which fields to fill, recommended shift=3.0, what button does what). * Add a "Mode behaviour" section that documents the silent-drop semantics for Custom (drops Think + LM codes), the cover-natural context for Remix, and the unsupported / fall-through behaviour for Repaint / Extract / Lego. * Trim the recommended-settings stanza so the per-variant step guidance is one line instead of three.
Trim help.generation_retake (en): * Drop redundant tip lines that duplicated the consistency requirement; keep the workflow + reference. * Tighten the Think-collision warning so it points straight at the full workflow. Translate both entries into zh / ja / pt / he so the (?) modal shows native-language help everywhere instead of falling back to en. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * debug(flow-edit): log src_audio shape at the morph guard User reports the "Flow-edit morph requires a source audio" guard fires even when they uploaded an mp3 to the Source Audio component. Need to see what gradio is actually handing the backend (None / empty str / list / dict?) before adding the right normalisation. Also widen the missing-check to cover empty strings and empty list/tuple in case gradio hands ``""`` instead of ``None`` for cleared components. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(ui): rename LM-Codes-Hints' inner audio upload label to disambiguate User report: there were two ``Source Audio`` upload boxes on the generation page — the real ``src_audio`` (in ``src_audio_row``, used by the Generate Music handler) and the small one inside the LM Codes Hints accordion (used only by the ``Convert to Codes`` utility button). Both shared the same ``generation.source_audio`` i18n key, so users dropped their mp3 into the wrong one and the Generate handler saw ``src_audio=None`` → "Flow-edit morph requires a source audio" hard-error. Give the inner upload its own label key ``generation.lm_codes_audio_upload_label`` ("Audio → Codes (utility)" in en, equivalent in zh / ja / pt / he). No layout change — only the label disambiguates intent. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(ui): stop validate_uploaded_audio_file from silently clearing valid mp3 User repro: they uploaded an mp3 to ``Source Audio`` and saw the 3:18 waveform render, but the Generate handler got ``src_audio=None``. Root cause: ``validate_uploaded_audio_file`` runs ``soundfile.info()`` on every upload and returns ``gr.update(value=None)`` (silently clearing the component) on ``OSError / RuntimeError / ValueError``. ``libsndfile``'s mp3 support is spotty across platforms — on jieyue it refuses files that torchaudio / ``process_src_audio`` decode cleanly. The cleared value left the waveform visible (Gradio's player keeps its display cache) so the user had no signal that the component state had been zeroed. Drop the auto-clear path. The validator now returns ``gr.skip()`` on soundfile errors too. If the file is genuinely unreadable, the backend's own decode path raises a clearer error from the dispatch layer (the morph guard already covers the empty case). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * debug(ui): log key inputs at the wrapper-call level User keeps hitting ``src_audio=None`` despite the waveform rendering in the UI. Static analysis says the inputs list and the wrapper signature are aligned (78=78, src_audio at position 14 in both). Need to see what gradio actually hands the wrapper at click time to differentiate between (a) wrong slot = some other component bleeding into src_audio or (b) gradio component state genuinely None despite the visible waveform. Log only at the wrapper entry — backend already logs at the morph guard, so we get a comparison from both sides of the chain. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(ui): preserve src_audio for text2music + flow_edit_morph (#1156) Root cause of the user's "Flow-edit morph requires a source audio" error chain. 
In ``generation_progress.generate_with_progress`` (the layer right above ``GenerationParams``): ``if task_type == "text2music": src_audio = None``. This pre-overlay defensive guard zeroed ``src_audio`` for every Custom-mode run. When flow_edit_morph was enabled the backend then saw ``src_audio=None`` and bailed via the morph guard ("Flow-edit morph requires a source audio"). Earlier debug rounds (soundfile / torchaudio / ffmpeg checks, label disambiguation, validation non-clearing, inference's own ``src_audio`` ternary) were all chasing downstream symptoms. This wrapper-level zeroing was the real source. Gate the zeroing on ``not flow_edit_morph`` so the morph path keeps ``src_audio`` and Custom-mode without morph still drops it (no behaviour change for that case). Also remove the now-stale wrapper-level debug log added while hunting; the src_audio guard log in the request layer is enough. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(ui): correct Edit workflow ordering (original first, then target) The previous workflow had the user fill the top-level caption/lyrics with the target before clicking Copy current → source, which would just snapshot the target into the source fields, defeating the whole point of the button. Reorder so the user: 1. Fills top-level fields with the **original** description (V_src). 2. Clicks Copy current → source to snapshot it. 3. Then edits top-level fields to define the **target** (V_tar). Updated en + zh + ja + pt + he in lockstep so every language describes the same correct sequence. Also added a tip noting that Send-to-Remix / Send-to-Repaint pre-fills source automatically, so steps 3 and 5 are skipped in that path.
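The gating described in the ``src_audio`` fix, together with the widened missing-value check from the earlier debug commit, might look roughly like the sketch below. The helper names `is_missing_src_audio` and `resolve_src_audio` are hypothetical, and the two checks live in different layers of the real code (wrapper vs. request); they are combined here only for illustration. The error string is the one quoted in the commit message.

```python
from typing import Any, Optional


def is_missing_src_audio(src_audio: Any) -> bool:
    """Treat None, an empty string, and an empty list/tuple as 'no upload'."""
    if src_audio is None:
        return True
    return isinstance(src_audio, (str, list, tuple)) and len(src_audio) == 0


def resolve_src_audio(task_type: str, src_audio: Any, flow_edit_morph: bool) -> Optional[Any]:
    """Hypothetical helper combining the wrapper-level gate and the morph guard."""
    # Plain text2music ignores source audio; only drop it when morph is off.
    if task_type == "text2music" and not flow_edit_morph:
        return None
    # With morph enabled, a missing source is a hard error, not a silent no-op.
    if flow_edit_morph and is_missing_src_audio(src_audio):
        raise ValueError("Flow-edit morph requires a source audio")
    return src_audio


if __name__ == "__main__":
    print(resolve_src_audio("text2music", "song.mp3", flow_edit_morph=False))
    print(resolve_src_audio("text2music", "song.mp3", flow_edit_morph=True))
```

Keeping the drop conditional on `not flow_edit_morph` leaves the non-morph Custom path unchanged, which is the "no behaviour change" property the message calls out.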
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: remove debug logs added during the src_audio=None investigation The wrapper-level "[wrapper] inputs at click time" log and the morph-guard "[generate_music] flow_edit_morph guard:" log were both added while hunting the issue fixed in commit 1b0a95c. Now that the root cause (``generation_progress`` zeroing src_audio for text2music unconditionally) is fixed, drop the verbose logging. The empty-string + empty-list normalization in the guard stays; that's a real defensive check, not debug. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: remove .claude worktrees + scheduled_tasks.lock from tracking These got into the branch via an earlier ``git add -A`` and lingered through history. They're Claude Code's local dev artefacts (per-agent git worktrees + scheduling lock) that should never be in repo history. Untrack them via ``git rm --cached`` and add explicit .gitignore lines to prevent re-introduction. After the squash-merge to main these files won't be in main's tree. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
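The two formulas quoted from the help modals in the commit message above, the Retake blend ``mixed = cos(v·π/2)·base + sin(v·π/2)·retake`` and the Edit update ``z_edit_{t-Δt} = z_edit_t + (V_tar − V_src)·Δt``, can be written out as a small sketch. The function names are hypothetical, NumPy stands in for the torch tensors used by the pipeline, and ``v`` is assumed to lie in [0, 1]; the real sampler additionally involves the model forward passes, CFG, and the n_min / n_max / n_avg window, none of which is modelled here.

```python
import numpy as np


def retake_blend(base: np.ndarray, retake: np.ndarray, v: float) -> np.ndarray:
    """Sin/cos blend of two noise tensors (assumes 0 <= v <= 1).

    Because cos^2 + sin^2 = 1, two independent unit-variance noises
    blend into a result that is still unit-variance.
    """
    angle = v * np.pi / 2.0
    return np.cos(angle) * base + np.sin(angle) * retake


def flow_edit_step(z_edit: np.ndarray, v_tar: np.ndarray, v_src: np.ndarray, dt: float) -> np.ndarray:
    """One Euler step of the V_delta integration: z + (V_tar - V_src) * dt."""
    return z_edit + (v_tar - v_src) * dt


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base, retake = rng.standard_normal(100_000), rng.standard_normal(100_000)
    # Variance stays close to 1 for any blend weight.
    print(round(float(retake_blend(base, retake, 0.3).var()), 2))
    # Identical velocities leave the latent untouched.
    z = np.ones(4)
    print(flow_edit_step(z, np.full(4, 2.0), np.full(4, 2.0), dt=0.1))
```

The second function also makes the first commit's diagnosis concrete: when both branches produce nearly the same velocity, the delta is close to zero and the edited latent barely moves from the baseline.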
This commit is contained in:
parent
612835bd58
commit
e34ab47eed
36 changed files with 834 additions and 380 deletions
5
.gitignore
vendored
@@ -8,6 +8,9 @@ Folder.jpg
 data/
 *.mp3
 *.wav
+*.pkf
+flow_edit_test_outputs/
+retake_test_outputs/

 # Byte-compiled / optimized / DLL files
 __pycache__/
@@ -246,6 +249,8 @@ lokr_output/
 .claude/skills/acestep/scripts/config.json
 .claude/skills/acestep/scripts/.first_gen_done
 .claude/skills/acestep-simplemv/scripts/public/
+.claude/scheduled_tasks.lock
+.claude/worktrees/
 config.json

 acestep_output/
|
|
|||
|
|
@@ -70,8 +70,10 @@ VALID_TIME_SIGNATURES = [2, 3, 4, 6]
 # Task Type Constants
 # ==============================================================================

-# All supported generation tasks across different model variants
-TASK_TYPES = ["text2music", "repaint", "cover", "cover-nofsq", "extract", "lego", "complete", "edit"]
+# All supported generation tasks across different model variants.
+# Flow-edit is NOT a task — it's a sampler overlay that can be enabled
+# on top of cover/cover-nofsq via ``GenerationParams.flow_edit_morph``.
+TASK_TYPES = ["text2music", "repaint", "cover", "cover-nofsq", "extract", "lego", "complete"]

 # Task types available for turbo models (optimized subset for speed)
 # - text2music: Generate from text descriptions
@@ -84,15 +86,6 @@ TASK_TYPES_TURBO = ["text2music", "repaint", "cover", "cover-nofsq"]
 # - extract: Separate individual tracks/stems from audio
 # - lego: Multi-track generation (add layers)
 # - complete: Automatic completion of partial audio
-# Note: ``edit`` is a base-only task (#1156) but intentionally NOT
-# advertised here — the HTTP /release_task surface and Gradio dropdowns
-# both read TASK_TYPES_BASE. Until PR-C wires the API/UI fields
-# (edit_target_caption / lyrics / window) through, keep edit out of
-# discovery so clients aren't told "edit is supported" while still
-# unable to pass the required target params. Python callers using
-# ``GenerationParams(task_type="edit")`` directly are unaffected — the
-# dispatch only checks TASK_TYPES (which does include edit) for
-# validation.
 TASK_TYPES_BASE = ["text2music", "repaint", "cover", "cover-nofsq", "extract", "lego", "complete"]


@@ -143,7 +136,6 @@ TASK_INSTRUCTIONS = {
     "lego_default": "Generate the track based on the audio context:",
     "complete": "Complete the input track with {TRACK_CLASSES}:",
     "complete_default": "Complete the input track:",
-    "edit": "Edit the audio toward the target conditions:",
 }


@@ -1,4 +1,4 @@
-"""Shared test fixtures for flow-edit dispatch tests (#1156 PR-B).
+"""Shared test fixtures for flow-edit overlay dispatch tests (#1156).

 Underscored module name keeps it out of unittest discovery.
 """
@@ -10,7 +10,7 @@ import torch


 class FakeHandler:
-    """Minimal handler stand-in exposing the surface ``dispatch_flow_edit`` needs."""
+    """Minimal handler stand-in exposing the surface the overlay needs."""

     def __init__(self, model_has_flowedit: bool = True):
         self.device = torch.device("cpu")
@@ -23,7 +23,7 @@ class FakeHandler:
             "time_costs": {},
         }
         # Sentinel tensors so tests can verify the dispatch returns
-        # ``prepare_condition`` outputs (not the raw embeddings).
+        # ``prepare_condition`` outputs (not raw embeddings).
         self.prepared_enc_hs = torch.full((1, 4, 16), 7.0)
         self.prepared_enc_am = torch.full((1, 4), 7.0)
         self.prepared_ctx = torch.full((1, 8, 32), 7.0)
@@ -39,8 +39,9 @@ class FakeHandler:
     def _prepare_text_conditioning_inputs(self, *, batch_size, instructions,
                                           captions, lyrics, parsed_metas,
                                           vocal_languages, audio_cover_strength):
-        # Capture parsed_metas so tests can verify _parse_metas ran first.
         self.captured_parsed_metas = parsed_metas
+        self.captured_captions = captions
+        self.captured_lyrics = lyrics
         seq = 4
         return (
             ["fake-text-input"] * batch_size,
@@ -52,12 +53,9 @@ class FakeHandler:
         )

     def _parse_metas(self, metas):
-        """Stub mirroring the real handler's contract: dict -> formatted string."""
         out = []
         for m in metas:
             if isinstance(m, dict):
-                # Real handler emits "- bpm: 120\n- key: ..." style; stub uses
-                # a sentinel prefix so tests can detect parse-vs-raw-dict.
                 out.append("PARSED:" + ",".join(f"{k}={v}" for k, v in m.items()))
             else:
                 out.append(str(m))
@@ -73,6 +71,7 @@ class FakeHandler:
 def make_payload(bsz: int = 1, seq: int = 8, ch: int = 16):
     return {
         "src_latents": torch.randn(bsz, seq, ch),
+        # Target side (the user's caption/lyrics already encoded).
         "text_hidden_states": torch.randn(bsz, 4, ch),
         "text_attention_mask": torch.ones(bsz, 4),
         "lyric_hidden_states": torch.randn(bsz, 4, ch),
@@ -80,20 +79,21 @@ def make_payload(bsz: int = 1, seq: int = 8, ch: int = 16):
         "refer_audio_acoustic_hidden_states_packed": torch.zeros(0, 4, ch),
         "refer_audio_order_mask": torch.zeros(0, dtype=torch.long),
         "chunk_mask": torch.ones(bsz, seq, ch),
-        "is_covers": torch.zeros(bsz, dtype=torch.long),
+        "is_covers": torch.zeros(bsz, dtype=torch.long),  # text2music
         "precomputed_lm_hints_25Hz": None,
     }


-def make_edit_ctx(target_caption="my new caption", target_lyrics="new lyrics"):
+def make_flow_edit_ctx(source_caption="anime pop", source_lyrics="original"):
     return {
-        "task_type": "edit",
-        "edit_target_caption": target_caption,
-        "edit_target_lyrics": target_lyrics,
+        "morph": True,
+        "task_type": "text2music",
+        "source_caption": source_caption,
+        "source_lyrics": source_lyrics,
         "vocal_languages": ["en"],
         "metas": [""],
         "instructions": ["Fill the audio semantic mask based on the given conditions:"],
-        "edit_n_min": 0.2,
-        "edit_n_max": 0.8,
-        "edit_n_avg": 1,
+        "n_min": 0.2,
+        "n_max": 0.8,
+        "n_avg": 1,
     }

@@ -225,11 +225,12 @@ class GenerateMusicMixin:
         repaint_strength: float = 0.5,
         retake_seed: Optional[Union[str, float, int]] = None,
         retake_variance: float = 0.0,
-        edit_target_caption: str = "",
-        edit_target_lyrics: str = "",
-        edit_n_min: float = 0.0,
-        edit_n_max: float = 1.0,
-        edit_n_avg: int = 1,
+        flow_edit_morph: bool = False,
+        flow_edit_source_caption: str = "",
+        flow_edit_source_lyrics: str = "",
+        flow_edit_n_min: float = 0.0,
+        flow_edit_n_max: float = 1.0,
+        flow_edit_n_avg: int = 1,
         progress=None,
     ) -> Dict[str, Any]:
         """Generate audio from text/reference inputs and return response payload.
@@ -326,30 +327,28 @@ class GenerateMusicMixin:
             audio_code_string=audio_code_string,
             actual_batch_size=actual_batch_size,
             task_type=task_type,
+            flow_edit_morph=flow_edit_morph,
         )
         if audio_error is not None:
             return audio_error

-        # Cover/repaint/lego/extract/edit: lock duration to source audio.
-        if processed_src_audio is not None and task_type in (
-            "cover", "cover-nofsq", "repaint", "lego", "extract", "edit",
+        # Cover/repaint/lego/extract + text2music+morph: lock duration to source audio.
+        if processed_src_audio is not None and (
+            task_type in ("cover", "cover-nofsq", "repaint", "lego", "extract")
+            or (task_type == "text2music" and flow_edit_morph)
         ):
             audio_duration = processed_src_audio.shape[-1] / self.sample_rate

-        # Flow-edit guards: paired CFG ≈ 4× decoder forwards/step, so
-        # warn when a user combines edit with knobs that don't apply.
-        if task_type == "edit":
-            if repainting_start or (repainting_end is not None and repainting_end >= 0):
-                logger.info(
-                    "[generate_music] task_type='edit' is whole-song; ignoring "
-                    "repainting_start={} / repainting_end={}.",
-                    repainting_start, repainting_end,
-                )
-            if not edit_target_caption and not edit_target_lyrics:
-                logger.warning(
-                    "[generate_music] task_type='edit' with empty edit_target_caption "
-                    "and edit_target_lyrics — output will closely match the source.",
-                )
+        # Flow-edit overlay v1: text2music (silence-context) and
+        # cover / cover-nofsq (shared LM-codes context). Repaint /
+        # extract / lego need paired-CFG derivation per task shape
+        # and are left for follow-up.
+        if flow_edit_morph and task_type not in ("text2music", "cover", "cover-nofsq"):
+            logger.warning(
+                "[generate_music] flow_edit_morph=True but task_type={!r}; "
+                "v1 overlay only applies to text2music / cover / cover-nofsq, ignoring.",
+                task_type,
+            )

         service_inputs = self._prepare_generate_music_service_inputs(
             actual_batch_size=actual_batch_size,
@@ -411,11 +410,12 @@ class GenerateMusicMixin:
             task_type=task_type,
             actual_retake_seed_list=actual_retake_seed_list,
             retake_variance=retake_variance,
-            edit_target_caption=edit_target_caption,
-            edit_target_lyrics=edit_target_lyrics,
-            edit_n_min=edit_n_min,
-            edit_n_max=edit_n_max,
-            edit_n_avg=edit_n_avg,
+            flow_edit_morph=flow_edit_morph,
+            flow_edit_source_caption=flow_edit_source_caption,
+            flow_edit_source_lyrics=flow_edit_source_lyrics,
+            flow_edit_n_min=flow_edit_n_min,
+            flow_edit_n_max=flow_edit_n_max,
+            flow_edit_n_avg=flow_edit_n_avg,
         )
         outputs = service_run["outputs"]
         infer_steps_for_progress = service_run["infer_steps_for_progress"]

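The duration-lock condition in the hunk above can be read as a small pure function. The sketch below restates it outside the mixin; ``resolve_audio_duration`` and ``SRC_LOCKED_TASKS`` are hypothetical names introduced for illustration, and the sample count stands in for ``processed_src_audio.shape[-1]``.

```python
from typing import Optional

# Tasks whose output length always follows the source audio (from the hunk above).
SRC_LOCKED_TASKS = ("cover", "cover-nofsq", "repaint", "lego", "extract")


def resolve_audio_duration(
    requested: Optional[float],
    num_samples: Optional[int],
    sample_rate: int,
    task_type: str,
    flow_edit_morph: bool,
) -> Optional[float]:
    """Hypothetical helper: lock duration to the source audio when required.

    ``num_samples`` is None when no source audio was processed; in that
    case, or for tasks that do not lock, the requested duration is kept.
    """
    locks = task_type in SRC_LOCKED_TASKS or (
        task_type == "text2music" and flow_edit_morph
    )
    if num_samples is not None and locks:
        return num_samples / sample_rate
    return requested
```

For example, a 480000-sample source at an assumed 48000 Hz sample rate locks a text2music + morph run to 10 seconds, whatever duration was requested.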
@@ -46,11 +46,12 @@ class GenerateMusicExecuteMixin:
         task_type: str = "",
         actual_retake_seed_list: Optional[List[int]] = None,
         retake_variance: float = 0.0,
-        edit_target_caption: str = "",
-        edit_target_lyrics: str = "",
-        edit_n_min: float = 0.0,
-        edit_n_max: float = 1.0,
-        edit_n_avg: int = 1,
+        flow_edit_morph: bool = False,
+        flow_edit_source_caption: str = "",
+        flow_edit_source_lyrics: str = "",
+        flow_edit_n_min: float = 0.0,
+        flow_edit_n_max: float = 1.0,
+        flow_edit_n_avg: int = 1,
     ) -> Dict[str, Any]:
         """Invoke ``service_generate`` while maintaining background progress estimation.

@@ -112,11 +113,12 @@ class GenerateMusicExecuteMixin:
                     task_type=task_type,
                     retake_seed=actual_retake_seed_list,
                     retake_variance=retake_variance,
-                    edit_target_caption=edit_target_caption,
-                    edit_target_lyrics=edit_target_lyrics,
-                    edit_n_min=edit_n_min,
-                    edit_n_max=edit_n_max,
-                    edit_n_avg=edit_n_avg,
+                    flow_edit_morph=flow_edit_morph,
+                    flow_edit_source_caption=flow_edit_source_caption,
+                    flow_edit_source_lyrics=flow_edit_source_lyrics,
+                    flow_edit_n_min=flow_edit_n_min,
+                    flow_edit_n_max=flow_edit_n_max,
+                    flow_edit_n_avg=flow_edit_n_avg,
                 )
             except Exception as exc:
                 _error["exc"] = exc

@@ -105,6 +105,7 @@ class GenerateMusicRequestMixin:
         audio_code_string: Union[str, List[str]],
         actual_batch_size: int,
         task_type: str,
+        flow_edit_morph: bool = False,
     ) -> Tuple[Optional[List[List[torch.Tensor]]], Optional[torch.Tensor], Optional[Dict[str, Any]]]:
         """Prepare reference/source audio tensors and return early error payload when invalid."""
         if reference_audio is not None:
@@ -126,11 +127,35 @@ class GenerateMusicRequestMixin:
             refer_audios = [[torch.zeros(2, 30 * self.sample_rate)] for _ in range(actual_batch_size)]

         processed_src_audio = None
-        # Flow-edit (#1156) needs the source audio just like cover/repaint/etc.
-        _src_audio_required_tasks = {"cover", "cover-nofsq", "repaint", "lego", "extract", "edit"}
-        if task_type == "text2music":
+        _src_audio_required_tasks = {"cover", "cover-nofsq", "repaint", "lego", "extract"}
+        if task_type == "text2music" and not flow_edit_morph:
             if src_audio is not None:
                 logger.info("[generate_music] text2music task does not use src_audio, ignoring")
+        elif task_type == "text2music" and flow_edit_morph:
+            # Treat empty string / empty list as missing too — gradio
+            # occasionally hands ``""`` instead of ``None`` for cleared
+            # components.
+            src_audio_missing = (
+                src_audio is None
+                or (isinstance(src_audio, str) and not src_audio.strip())
+                or (isinstance(src_audio, (list, tuple)) and not src_audio)
+            )
+            if src_audio_missing:
+                return None, None, {
+                    "audios": [],
+                    "status_message": "Flow-edit morph requires a source audio. Please upload one or disable Smooth morph.",
+                    "extra_outputs": {}, "success": False,
+                    "error": "flow_edit_morph=True requires src_audio",
+                }
+            logger.info("[generate_music] text2music + flow_edit_morph: encoding src_audio for V_delta integration")
+            processed_src_audio = self.process_src_audio(src_audio)
+            if processed_src_audio is None:
+                return None, None, {
+                    "audios": [],
+                    "status_message": "Flow-edit morph: source audio is invalid, unreadable, or silent.",
+                    "extra_outputs": {}, "success": False,
+                    "error": "Invalid source audio for flow_edit_morph",
+                }
         elif src_audio is not None:
             if self._has_non_empty_audio_codes(audio_code_string):
                 logger.info("[generate_music] Audio codes provided, ignoring src_audio and using codes instead")

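The "missing source audio" normalization in the hunk above is easy to check in isolation. The sketch below lifts the same three conditions into a standalone predicate; ``is_src_audio_missing`` is a hypothetical name, the diff itself inlines the expression as ``src_audio_missing``.

```python
from typing import Any


def is_src_audio_missing(src_audio: Any) -> bool:
    """Hypothetical helper mirroring the guard's three 'missing' cases.

    None, a blank/whitespace-only string, and an empty list or tuple all
    count as missing; any other value is treated as present.
    """
    if src_audio is None:
        return True
    if isinstance(src_audio, str):
        return not src_audio.strip()
    if isinstance(src_audio, (list, tuple)):
        return len(src_audio) == 0
    return False
```

Keeping the check as a predicate makes the cleared-component cases straightforward to unit-test without constructing a handler.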
@@ -58,11 +58,12 @@ class ServiceGenerateMixin:
         task_type: str = "",
         retake_seed: Optional[Union[int, List[int]]] = None,
         retake_variance: float = 0.0,
-        edit_target_caption: str = "",
-        edit_target_lyrics: str = "",
-        edit_n_min: float = 0.0,
-        edit_n_max: float = 1.0,
-        edit_n_avg: int = 1,
+        flow_edit_morph: bool = False,
+        flow_edit_source_caption: str = "",
+        flow_edit_source_lyrics: str = "",
+        flow_edit_n_min: float = 0.0,
+        flow_edit_n_max: float = 1.0,
+        flow_edit_n_avg: int = 1,
     ) -> Dict[str, Any]:
         """Generate music latents and metadata from text/audio conditioning inputs.

@@ -70,9 +71,8 @@ class ServiceGenerateMixin:
         the contract on each input. Notable groups:
         ``captions``/``lyrics``/``metas``/``vocal_languages`` are per-sample
         conditioning; ``cfg_interval_*`` / ``sampler_mode`` /
-        ``velocity_*`` / ``dcw_*`` are sampler tweaks; ``task_type`` selects
-        the generation branch (``"edit"`` activates the flow-edit dispatch
-        via ``edit_ctx`` in :func:`_execute_service_generate_diffusion`).
+        ``velocity_*`` / ``dcw_*`` are sampler tweaks; ``flow_edit_morph``
+        layers the V_delta overlay on top of cover/cover-nofsq dispatch.

         Returns:
             Dict[str, Any]: Service output payload containing generated latents,
@@ -143,18 +143,31 @@ class ServiceGenerateMixin:
             retake_seed=retake_seed,
             retake_variance=retake_variance,
         )
-        # edit_ctx activates the flow-edit branch when task_type=="edit".
-        edit_ctx = {
-            "task_type": task_type, "edit_target_caption": edit_target_caption,
-            "edit_target_lyrics": edit_target_lyrics, "vocal_languages": normalized.get("vocal_languages"),
-            "metas": normalized.get("metas"), "instructions": normalized.get("instructions"),
-            "edit_n_min": edit_n_min, "edit_n_max": edit_n_max, "edit_n_avg": edit_n_avg,
+        # flow_edit_ctx activates the V_delta overlay. Supported on
+        # text2music (silence-derived context, clean text-driven V_delta)
+        # and cover / cover-nofsq (cover's LM-codes context shared by both
+        # branches, V_delta still text-driven because the codes are the
+        # same on both sides). Repaint / extract / lego have task-shape-
+        # specific conditioning that needs paired-CFG derivation — left
+        # for follow-up.
+        flow_edit_ctx = {
+            "morph": flow_edit_morph and task_type in ("text2music", "cover", "cover-nofsq"),
+            "task_type": task_type,
+            "source_caption": flow_edit_source_caption,
+            "source_lyrics": flow_edit_source_lyrics,
+            "n_min": flow_edit_n_min,
+            "n_max": flow_edit_n_max,
+            "n_avg": flow_edit_n_avg,
+            "vocal_languages": normalized.get("vocal_languages"),
+            "metas": normalized.get("metas"),
+            "instructions": normalized.get("instructions"),
         }
         outputs, encoder_hidden_states, encoder_attention_mask, context_latents = (
             self._execute_service_generate_diffusion(
                 payload=payload, generate_kwargs=generate_kwargs, seed_param=seed_param,
                 infer_method=infer_method, shift=shift, audio_cover_strength=audio_cover_strength,
-                retake_seed=retake_seed, retake_variance=retake_variance, edit_ctx=edit_ctx,
+                retake_seed=retake_seed, retake_variance=retake_variance,
+                flow_edit_ctx=flow_edit_ctx,
             )
         )
         return self._attach_service_generate_outputs(

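How the ``morph`` flag in ``flow_edit_ctx`` gates the overlay can be shown with a reduced version of the dict built above. This is a sketch under assumptions: ``build_flow_edit_ctx`` and ``OVERLAY_TASKS`` are hypothetical names, and the ``vocal_languages`` / ``metas`` / ``instructions`` entries from the real dict are left out for brevity.

```python
from typing import Any, Dict

# Task types the v1 overlay applies to, as listed in the hunk above.
OVERLAY_TASKS = ("text2music", "cover", "cover-nofsq")


def build_flow_edit_ctx(
    task_type: str,
    flow_edit_morph: bool,
    source_caption: str = "",
    source_lyrics: str = "",
    n_min: float = 0.0,
    n_max: float = 1.0,
    n_avg: int = 1,
) -> Dict[str, Any]:
    """Hypothetical reduced builder for the overlay context dict.

    ``morph`` is only True when the caller asked for the overlay AND the
    task type is one the overlay supports; the dispatcher keys on it.
    """
    return {
        "morph": bool(flow_edit_morph) and task_type in OVERLAY_TASKS,
        "task_type": task_type,
        "source_caption": source_caption,
        "source_lyrics": source_lyrics,
        "n_min": n_min,
        "n_max": n_max,
        "n_avg": n_avg,
    }
```

Because the dispatcher only checks ``flow_edit_ctx.get("morph")``, a request such as repaint with ``flow_edit_morph=True`` falls through to the regular path instead of raising.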
@@ -145,15 +145,15 @@ class ServiceGenerateExecuteMixin:
         audio_cover_strength: float,
         retake_seed: Any = None,
         retake_variance: float = 0.0,
-        edit_ctx: Optional[Dict[str, Any]] = None,
+        flow_edit_ctx: Optional[Dict[str, Any]] = None,
     ) -> Tuple[Dict[str, Any], torch.Tensor, torch.Tensor, torch.Tensor]:
         """Execute condition preparation and diffusion using MLX or PyTorch backend."""
-        if edit_ctx is not None and edit_ctx.get("task_type") == "edit":
-            from .service_generate_flow_edit import dispatch_flow_edit
+        if flow_edit_ctx is not None and flow_edit_ctx.get("morph"):
+            from .service_generate_flow_edit import dispatch_flow_edit_overlay

-            return dispatch_flow_edit(
+            return dispatch_flow_edit_overlay(
                 self, payload=payload, generate_kwargs=generate_kwargs,
-                seed_param=seed_param, edit_ctx=edit_ctx,
+                seed_param=seed_param, flow_edit_ctx=flow_edit_ctx,
             )
         dit_backend = (
             "MLX (native)" if (self.use_mlx_dit and self.mlx_decoder is not None) else f"PyTorch ({self.device})"

@@ -1,14 +1,20 @@
-"""Flow-edit dispatch path for ``task_type == "edit"`` (issue #1156).
+"""Flow-edit overlay dispatch on text2music task (issue #1156).

-The regular generation path (``_execute_service_generate_diffusion``)
-calls ``model.generate_audio`` with a single set of text + lyric
-embeddings. Flow-edit needs *paired* conditioning (source + target),
-so we build a fresh set of target text/lyric embeddings here using the
-handler's tokenizer + encoder, then call
-``model.flowedit_generate_audio`` with both sets.
+The user's ``caption`` / ``lyrics`` already flowed through the regular
+text2music preprocessing pipeline and ended up in ``payload`` as the
+*target* text + lyric embeddings. Flow-edit overlay adds:

-Target tokenization + embedding helpers live in
-``service_generate_flow_edit_target.py`` (split per the 200 LOC cap).
+* a *source* branch encoded from ``flow_edit_source_caption`` /
+  ``flow_edit_source_lyrics`` — describes the original audio;
+* paired ``prepare_condition`` calls fed ``silence_latent`` for the
+  audio context (the text2music shape, identical for both branches),
+  so V_src and V_tar differ only in encoder text/lyric;
+* the user's encoded ``src_audio`` (already in ``payload['src_latents']``
+  because we let it through for ``flow_edit_morph=True``) drives the
+  sampling-loop ``zt_src`` / ``zt_tar`` formation.
+
+Source tokenization + embedding helpers live in
+``service_generate_flow_edit_source.py`` (split per the 200 LOC cap).
 """

 from __future__ import annotations
@@ -18,99 +24,127 @@ from typing import Any, Dict, Tuple
 import torch
 from loguru import logger

-from .service_generate_flow_edit_target import embed_target, tokenize_target
+from .service_generate_flow_edit_source import embed_source, tokenize_source


-def dispatch_flow_edit(
+def dispatch_flow_edit_overlay(
     handler,
     *,
     payload: Dict[str, Any],
     generate_kwargs: Dict[str, Any],
     seed_param: Any,
-    edit_ctx: Dict[str, Any],
+    flow_edit_ctx: Dict[str, Any],
 ) -> Tuple[Dict[str, Any], torch.Tensor, torch.Tensor, torch.Tensor]:
-    """Run the flow-edit branch and return the same 4-tuple as the regular path.
+    """Run paired flow-edit on top of text2music dispatch.

-    Builds target text/lyric embeddings via the handler's tokenizer +
-    encoder, runs ``prepare_condition`` on the source side so the
-    returned context has the right shape for downstream scoring/LRC,
-    then calls ``model.flowedit_generate_audio``.
+    Builds source text/lyric embeddings from
+    ``flow_edit_ctx['source_caption']`` / ``['source_lyrics']``, then
+    calls :func:`flow_edit_pipeline.flowedit_generate_audio` which
+    handles the two ``prepare_condition`` calls (one per branch, both
+    fed silence-context like text2music) and the sampling-loop V_delta
+    integration.
     """
     if not hasattr(handler.model, "flowedit_generate_audio"):
         raise RuntimeError(
-            "Flow-edit (task_type='edit') requires a base DiT variant — "
-            "the loaded model does not expose flowedit_generate_audio. "
-            "Supported variants: xl_base, xl_sft, sft, base."
+            "Flow-edit overlay requires a base DiT variant — the loaded "
+            "model does not expose flowedit_generate_audio. Supported "
+            "variants: xl_base, xl_sft, sft, base."
+        )
+    real_src_latents = payload["src_latents"]
+    bsz, seq, ch = real_src_latents.shape
+    task_type = flow_edit_ctx.get("task_type") or "text2music"
+    n_min = float(flow_edit_ctx.get("n_min", 0.0))
+    n_max = float(flow_edit_ctx.get("n_max", 1.0))
+    n_avg = int(flow_edit_ctx.get("n_avg", 1))
+    src_caption = flow_edit_ctx.get("source_caption") or ""
+    src_lyrics = flow_edit_ctx.get("source_lyrics") or ""
+    if not src_caption and not src_lyrics:
+        logger.warning(
+            "[flow_edit_overlay] both source_caption and source_lyrics are "
+            "empty — V_src ≈ V_tar so the overlay will be a near no-op.",
         )
-    src_latents = payload["src_latents"]
-    bsz = src_latents.shape[0]
-    edit_n_min = float(edit_ctx.get("edit_n_min", 0.0))
-    edit_n_max = float(edit_ctx.get("edit_n_max", 1.0))
-    edit_n_avg = int(edit_ctx.get("edit_n_avg", 1))
     logger.info(
-        "[flow_edit] dispatch — task=edit, bsz={}, n_min={}, n_max={}, n_avg={}",
-        bsz, edit_n_min, edit_n_max, edit_n_avg,
+        "[flow_edit_overlay] dispatch — task={}, bsz={}, n_min={}, n_max={}, n_avg={}",
+        task_type, bsz, n_min, n_max, n_avg,
     )

-    # Target tokens + embeddings (handler's tokenizer / encoder).
-    tar_text_ids, tar_text_am, tar_lyric_ids, tar_lyric_am = tokenize_target(
+    # Source side text/lyric embeddings (target side is already in payload).
+    src_text_ids, src_text_am, src_lyric_ids, src_lyric_am = tokenize_source(
         handler,
-        target_caption=edit_ctx.get("edit_target_caption") or "",
-        target_lyrics=edit_ctx.get("edit_target_lyrics") or "",
-        vocal_languages=edit_ctx.get("vocal_languages"),
-        metas=edit_ctx.get("metas"),
-        instructions=edit_ctx.get("instructions"),
+        source_caption=src_caption,
+        source_lyrics=src_lyrics,
+        vocal_languages=flow_edit_ctx.get("vocal_languages"),
+        metas=flow_edit_ctx.get("metas"),
+        instructions=flow_edit_ctx.get("instructions"),
         batch_size=bsz,
     )
-    tar_text_hs, tar_lyric_hs = embed_target(handler, tar_text_ids, tar_lyric_ids)
+    src_text_hs, src_lyric_hs = embed_source(handler, src_text_ids, src_lyric_ids)

-    device, dtype = src_latents.device, src_latents.dtype
-    tar_text_am = tar_text_am.to(device=device, dtype=dtype)
-    tar_lyric_am = tar_lyric_am.to(device=device, dtype=dtype)
+    device, dtype = real_src_latents.device, real_src_latents.dtype
+    src_text_am = src_text_am.to(device=device, dtype=dtype)
+    src_lyric_am = src_lyric_am.to(device=device, dtype=dtype)

+    # Audio context for ``prepare_condition``: choose by task.
+    #
+    # text2music — force silence-context. text2music's training
+    # distribution is (clean target, silence audio context). Sharing
+    # the source audio's encoded latents (or LM codes derived from one
+    # prompt) as the context is OOD: the velocity head produces unstable
+    # predictions, V_delta integration accumulates into a near-zero
+    # latent, and VAE decode + auto-normalization amplifies the residual
+    # noise to full scale. Verified empirically by a 4-way sweep
+    # (sft60, tb60, tb8w05, tb8s1) — every variant collapsed to peak
+    # ≈ 0.007 the moment the natural context was used.
+    #
+    # cover / cover-nofsq — keep payload's natural context. The cover
+    # task IS trained on (clean target, LM-codes-derived audio context),
+    # so both branches share an in-distribution context and V_delta
+    # captures the text-only delta cleanly.
+    if task_type == "text2music":
+        sil = handler.silence_latent.to(device=device, dtype=dtype)
+        available = sil.shape[1]
+        if seq <= available:
+            sil_slice = sil[0, :seq, :]
+        else:
+            repeats = (seq + available - 1) // available
+            sil_slice = sil[0].repeat(repeats, 1)[:seq, :]
+        ctx_input = sil_slice.unsqueeze(0).expand(bsz, seq, ch).contiguous()
+        is_covers_arg = torch.zeros(bsz, dtype=torch.long, device=device)
+        # Drop the precomputed LM hints — they were generated from the
+        # user's caption/lyrics, not silence; even though is_covers=0
+        # leaves them unused in the where-clause, the tensor lingers in
+        # downstream paths and empirically collapses the latent (peak
+        # 0.007 in the user's repro). Force ``prepare_condition`` to
+        # tokenize silence afresh, matching the no-codes case that
+        # produced peak 0.92.
+        precomputed_lm_hints_arg = None
+    else:
+        ctx_input = real_src_latents
+        is_covers_arg = payload["is_covers"]
+        precomputed_lm_hints_arg = payload.get("precomputed_lm_hints_25Hz")
+
     with torch.inference_mode():
         with handler._load_model_context("model"):
-            # prepare_condition on the source so the 4-tuple's encoder /
-            # context outputs have the post-condition shape downstream
-            # auto-LRC / DiT alignment scoring expects.
-            attn = torch.ones(
-                src_latents.shape[0], src_latents.shape[1],
-                device=device, dtype=dtype,
-            )
-            src_enc_hs, src_enc_am, src_ctx = handler.model.prepare_condition(
-                text_hidden_states=payload["text_hidden_states"],
-                text_attention_mask=payload["text_attention_mask"],
-                lyric_hidden_states=payload["lyric_hidden_states"],
-                lyric_attention_mask=payload["lyric_attention_mask"],
-                refer_audio_acoustic_hidden_states_packed=payload["refer_audio_acoustic_hidden_states_packed"],
-                refer_audio_order_mask=payload["refer_audio_order_mask"],
-                hidden_states=src_latents,
-                attention_mask=attn,
-                silence_latent=handler.silence_latent,
-                src_latents=src_latents,
-                chunk_masks=payload["chunk_mask"],
-                is_covers=payload["is_covers"],
-                precomputed_lm_hints_25Hz=payload.get("precomputed_lm_hints_25Hz"),
-            )
             outputs = handler.model.flowedit_generate_audio(
-                text_hidden_states=payload["text_hidden_states"],
-                text_attention_mask=payload["text_attention_mask"],
-                lyric_hidden_states=payload["lyric_hidden_states"],
-                lyric_attention_mask=payload["lyric_attention_mask"],
+                # Target = the user's caption/lyrics already in payload.
+                target_text_hidden_states=payload["text_hidden_states"],
+                target_text_attention_mask=payload["text_attention_mask"],
+                target_lyric_hidden_states=payload["lyric_hidden_states"],
+                target_lyric_attention_mask=payload["lyric_attention_mask"],
+                # Source = the freshly encoded original-prompt side.
+                text_hidden_states=src_text_hs,
+                text_attention_mask=src_text_am,
+                lyric_hidden_states=src_lyric_hs,
+                lyric_attention_mask=src_lyric_am,
+                # Audio context: silence for both branches (text2music shape).
                 refer_audio_acoustic_hidden_states_packed=payload["refer_audio_acoustic_hidden_states_packed"],
                 refer_audio_order_mask=payload["refer_audio_order_mask"],
-                src_latents=src_latents,
+                src_latents=real_src_latents,  # for zt_src/zt_tar formation
+                ctx_src_latents=ctx_input,  # silence (text2music) or real (cover)
                 chunk_masks=payload["chunk_mask"],
-                is_covers=payload["is_covers"],
+                is_covers=is_covers_arg,  # 0 (text2music) or payload (cover)
                 silence_latent=handler.silence_latent,
-                target_text_hidden_states=tar_text_hs,
-                target_text_attention_mask=tar_text_am,
-                target_lyric_hidden_states=tar_lyric_hs,
-                target_lyric_attention_mask=tar_lyric_am,
                 seed=seed_param,
-                # Retake seed (#1157) drives flowedit's per-step
-                # forward-noise generators so variation/reproducibility
-                # works the same way it does in regular generation.
                 retake_seed=generate_kwargs.get("retake_seed"),
                 infer_steps=generate_kwargs.get("infer_steps"),
                 timesteps=generate_kwargs.get("timesteps"),
@@ -120,14 +154,31 @@ def dispatch_flow_edit(
                 shift=generate_kwargs.get("shift", 1.0),
                 velocity_norm_threshold=generate_kwargs.get("velocity_norm_threshold", 0.0),
                 velocity_ema_factor=generate_kwargs.get("velocity_ema_factor", 0.0),
-                edit_n_min=edit_n_min,
-                edit_n_max=edit_n_max,
-                edit_n_avg=edit_n_avg,
-                precomputed_lm_hints_25Hz=payload.get("precomputed_lm_hints_25Hz"),
+                edit_n_min=n_min,
+                edit_n_max=n_max,
+                edit_n_avg=n_avg,
+                precomputed_lm_hints_25Hz=precomputed_lm_hints_arg,
                 # v1-disabled tricks — pipeline logs them and bypasses.
                 sampler_mode=generate_kwargs.get("sampler_mode", "euler"),
                 use_adg=generate_kwargs.get("use_adg", False),
                 dcw_enabled=generate_kwargs.get("dcw_enabled", False),
             )

-    return outputs, src_enc_hs, src_enc_am, src_ctx
+            # Return target-side encoder/context for downstream auto-LRC + scoring.
+            attn = torch.ones(bsz, seq, device=device, dtype=dtype)
+            enc_hs, enc_am, ctx = handler.model.prepare_condition(
+                text_hidden_states=payload["text_hidden_states"],
+                text_attention_mask=payload["text_attention_mask"],
+                lyric_hidden_states=payload["lyric_hidden_states"],
+                lyric_attention_mask=payload["lyric_attention_mask"],
+                refer_audio_acoustic_hidden_states_packed=payload["refer_audio_acoustic_hidden_states_packed"],
+                refer_audio_order_mask=payload["refer_audio_order_mask"],
+                hidden_states=ctx_input,
+                attention_mask=attn,
+                silence_latent=handler.silence_latent,
+                src_latents=ctx_input,
+                chunk_masks=payload["chunk_mask"],
+                is_covers=is_covers_arg,
+                precomputed_lm_hints_25Hz=precomputed_lm_hints_arg,
+            )
+            return outputs, enc_hs, enc_am, ctx

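The silence-context construction in the dispatch above (slice the silence latent when it is long enough, otherwise tile it with a ceiling-division repeat count and trim) can be illustrated without tensors. The sketch below uses plain Python lists in place of ``torch`` tensors; ``tile_silence`` and ``Frame`` are hypothetical names, and the per-batch ``expand`` step is omitted.

```python
from typing import List

# One latent frame; stands in for a row of the (seq, ch) silence tensor.
Frame = List[float]


def tile_silence(silence: List[Frame], seq: int) -> List[Frame]:
    """Hypothetical list-based analogue of the silence slice/tile logic.

    Returns exactly ``seq`` frames: a prefix of ``silence`` when it is
    long enough, otherwise ``silence`` repeated and trimmed to length.
    """
    if not silence:
        raise ValueError("silence must contain at least one frame")
    available = len(silence)
    if seq <= available:
        return silence[:seq]
    repeats = (seq + available - 1) // available  # ceiling division
    return (silence * repeats)[:seq]
```

The ceiling division guarantees that the repeated sequence is at least ``seq`` frames long before trimming, which is the same reason the dispatch computes ``repeats`` that way.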
@ -1,13 +1,13 @@
|
|||
"""Build the *target* text/lyric conditioning for flow-edit (#1156 PR-B).
|
||||
"""Build the *source* text/lyric conditioning for flow-edit overlay (#1156).
|
||||
|
||||
Source-side conditioning is already in the payload built by
|
||||
``preprocess_batch``. Flow-edit needs a paired *target* condition
|
||||
(``edit_target_caption`` / ``edit_target_lyrics``); we tokenize and
|
||||
encode it here using the handler's existing helpers so SFT prompt
|
||||
formatting, lyric language handling, and padding stay consistent with
|
||||
the source side.
|
||||
|
||||
Split out from ``service_generate_flow_edit.py`` per the 200 LOC cap.
|
||||
The user's ``caption`` / ``lyrics`` go through the regular cover dispatch
|
||||
and become the *target* condition. Flow-edit overlay also needs a
|
||||
*source* condition (``flow_edit_source_caption`` /
|
||||
``flow_edit_source_lyrics``) describing the original audio so we can
|
||||
compute V_delta = V_tar - V_src. We tokenize and encode that source
|
||||
side here using the handler's existing helpers so SFT prompt formatting,
|
||||
lyric-language handling, and padding stay consistent with the target
|
||||
side.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
|
@@ -27,32 +27,25 @@ def _pad_to_batch(values: Optional[List[Any]], default: Any, batch_size: int) ->
|
|||
return out
|
||||
|
||||
|
||||
def tokenize_target(
|
||||
def tokenize_source(
|
||||
handler,
|
||||
*,
|
||||
target_caption: str,
|
||||
target_lyrics: str,
|
||||
source_caption: str,
|
||||
source_lyrics: str,
|
||||
vocal_languages: Optional[List[str]],
|
||||
metas: Optional[List[Any]],
|
||||
instructions: Optional[List[str]],
|
||||
batch_size: int,
|
||||
):
|
||||
"""Build padded target text/lyric token tensors via handler helpers.
|
||||
"""Build padded source text/lyric token tensors via handler helpers.
|
||||
|
||||
Reuses ``_prepare_text_conditioning_inputs`` so the SFT prompt format,
|
||||
lyric language formatting, and padding stay consistent with the
|
||||
source-side preparation.
|
||||
lyric-language formatting, and padding stay consistent with the
|
||||
target-side preparation already done by the regular cover dispatch.
|
||||
"""
|
||||
captions = [target_caption] * batch_size
|
||||
lyrics = [target_lyrics] * batch_size
|
||||
captions = [source_caption] * batch_size
|
||||
lyrics = [source_lyrics] * batch_size
|
||||
langs = _pad_to_batch(vocal_languages, "unknown", batch_size)
|
||||
# ``metas`` arrives here in whatever shape ``service_generate``'s
|
||||
# ``_normalize_inputs`` left it — usually a list of dicts when the
|
||||
# request flows through ``generate_music`` ->
|
||||
# ``prepare_batch_data``. The source path normalises with
|
||||
# ``_parse_metas`` before tokenizing; if we skip it the target
|
||||
# prompt gets a raw ``{'bpm': ...}`` repr instead of the proper
|
||||
# ``- bpm: ...`` block, so conditioning silently drifts.
|
||||
raw_metas = _pad_to_batch(metas, "", batch_size)
|
||||
parsed_metas_list = handler._parse_metas(raw_metas)
|
||||
instr_list = _pad_to_batch(instructions, DEFAULT_DIT_INSTRUCTION, batch_size)
|
||||
|
|
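The comment in this hunk says that skipping ``_parse_metas`` leaves a raw ``{'bpm': ...}`` repr in the prompt instead of a ``- bpm: ...`` block. A minimal sketch of that difference follows. ``format_meta_block`` is a hypothetical stand-in, not the handler's real ``_parse_metas``, whose exact output format is not shown in this diff.

```python
def format_meta_block(meta):
    """Hypothetical stand-in for handler._parse_metas on a single item.

    Assumption: dict metas become one "- key: value" line per entry and
    string metas pass through unchanged. The real helper may differ.
    """
    if isinstance(meta, str):
        return meta
    return "\n".join(f"- {key}: {value}" for key, value in meta.items())


raw = {"bpm": 120, "key": "C minor"}
unparsed_prompt_part = str(raw)                # what leaks in without parsing
parsed_prompt_part = format_meta_block(raw)    # the "- bpm: ..." block
```

Without the parsing step the prompt text differs between the two branches, so the conditioning drifts even when the metadata is identical.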
@@ -72,22 +65,21 @@ def tokenize_target(
|
|||
lyrics=lyrics,
|
||||
parsed_metas=parsed_metas_list,
|
||||
vocal_languages=langs,
|
||||
audio_cover_strength=1.0, # disable non-cover branch — not needed for edit
|
||||
audio_cover_strength=1.0,
|
||||
)
|
||||
return text_token_idss, text_attention_masks, lyric_token_idss, lyric_attention_masks
|
||||
|
||||
|
||||
def embed_target(
|
||||
def embed_source(
|
||||
handler,
|
||||
text_token_idss: torch.Tensor,
|
||||
lyric_token_idss: torch.Tensor,
|
||||
) -> Tuple[torch.Tensor, torch.Tensor]:
|
||||
"""Run text + lyric encoders on the target tokens, return embedding tensors.
|
||||
"""Run text + lyric encoders on the source tokens.
|
||||
|
||||
Tokens come back from the tokenizer on CPU; the regular batch path
|
||||
moves them to ``handler.device`` before encoding (see
|
||||
``preprocess_batch``), so we mirror that here. Without the move
|
||||
text-encoder runs on CUDA / MPS / XPU would hit a device mismatch.
|
||||
``preprocess_batch``), so we mirror that here.
|
||||
"""
|
||||
device = handler.device
|
||||
text_token_idss = text_token_idss.to(device=device)
|
||||
|
|
@@ -1,10 +1,10 @@
|
|||
"""Unit tests for the flow-edit dispatch helper (#1156 PR-B).
|
||||
"""Unit tests for the flow-edit overlay dispatch (#1156).
|
||||
|
||||
Regression coverage for codex round-1 P1 + 2×P2 findings:
|
||||
* device-mismatch on CUDA / MPS / XPU (token IDs not moved off CPU)
|
||||
* dispatch returning raw embeddings instead of ``prepare_condition``
|
||||
outputs (breaks downstream auto-LRC / DiT alignment scoring)
|
||||
* LM Phase 1 still ran in edit mode (replaced src_audio with codes)
|
||||
The overlay is a sampler-level technique layered on cover/cover-nofsq:
|
||||
the user's ``caption`` / ``lyrics`` are the *target*; the overlay adds a
|
||||
*source* branch encoded from ``flow_edit_source_caption`` /
|
||||
``flow_edit_source_lyrics``. V_delta = V_tar - V_src is integrated by
|
||||
:mod:`flow_edit_pipeline`.
|
||||
"""
|
||||
|
||||
import unittest
|
||||
|
|
@@ -12,47 +12,56 @@ import unittest
|
|||
import torch
|
||||
|
||||
from acestep.core.generation.handler.service_generate_flow_edit import (
|
||||
dispatch_flow_edit,
|
||||
dispatch_flow_edit_overlay,
|
||||
)
|
||||
from acestep.core.generation.handler._flow_edit_dispatch_test_support import (
|
||||
FakeHandler,
|
||||
make_edit_ctx,
|
||||
make_flow_edit_ctx,
|
||||
make_payload,
|
||||
)
|
||||
|
||||
|
||||
class DispatchFlowEditTests(unittest.TestCase):
|
||||
class FlowEditOverlayDispatchTests(unittest.TestCase):
|
||||
|
||||
def test_calls_flowedit_with_paired_conditions(self):
|
||||
handler = FakeHandler()
|
||||
payload = make_payload()
|
||||
dispatch_flow_edit(
|
||||
dispatch_flow_edit_overlay(
|
||||
handler, payload=payload, generate_kwargs={"infer_steps": 4},
|
||||
seed_param=42, edit_ctx=make_edit_ctx(),
|
||||
seed_param=42, flow_edit_ctx=make_flow_edit_ctx(),
|
||||
)
|
||||
self.assertEqual(handler.model.flowedit_generate_audio.call_count, 1)
|
||||
call_kwargs = handler.model.flowedit_generate_audio.call_args.kwargs
|
||||
# Source side comes from payload.
|
||||
self.assertIs(call_kwargs["text_hidden_states"], payload["text_hidden_states"])
|
||||
self.assertIs(call_kwargs["src_latents"], payload["src_latents"])
|
||||
# Target side was built fresh.
|
||||
self.assertIsNotNone(call_kwargs["target_text_hidden_states"])
|
||||
self.assertIsNotNone(call_kwargs["target_lyric_hidden_states"])
|
||||
# Window params propagated.
|
||||
# Target side comes from payload (the user's caption/lyrics).
|
||||
self.assertIs(call_kwargs["target_text_hidden_states"], payload["text_hidden_states"])
|
||||
self.assertIs(call_kwargs["target_lyric_hidden_states"], payload["lyric_hidden_states"])
|
||||
# Source side was built fresh from flow_edit_source_caption/lyrics.
|
||||
self.assertIsNotNone(call_kwargs["text_hidden_states"])
|
||||
self.assertIsNotNone(call_kwargs["lyric_hidden_states"])
|
||||
# Window params propagated under the model's edit_n_* kwargs.
|
||||
self.assertEqual(call_kwargs["edit_n_min"], 0.2)
|
||||
self.assertEqual(call_kwargs["edit_n_max"], 0.8)
|
||||
self.assertEqual(call_kwargs["edit_n_avg"], 1)
|
||||
self.assertEqual(call_kwargs["seed"], 42)
|
||||
|
||||
def test_returns_prepare_condition_tensors_not_raw_embeddings(self):
|
||||
"""Regression for codex P2 round-1 finding."""
|
||||
def test_source_caption_lyrics_threaded_into_tokenizer(self):
|
||||
handler = FakeHandler()
|
||||
outputs, enc_hs, enc_am, ctx = dispatch_flow_edit(
|
||||
ctx = make_flow_edit_ctx(source_caption="orig anime pop",
|
||||
source_lyrics="orig lyrics text")
|
||||
dispatch_flow_edit_overlay(
|
||||
handler, payload=make_payload(), generate_kwargs={"infer_steps": 4},
|
||||
seed_param=None, edit_ctx=make_edit_ctx(),
|
||||
seed_param=None, flow_edit_ctx=ctx,
|
||||
)
|
||||
self.assertIn("orig anime pop", handler.captured_captions)
|
||||
self.assertIn("orig lyrics text", handler.captured_lyrics)
|
||||
|
||||
def test_returns_prepare_condition_tensors_for_downstream(self):
|
||||
handler = FakeHandler()
|
||||
outputs, enc_hs, enc_am, ctx = dispatch_flow_edit_overlay(
|
||||
handler, payload=make_payload(), generate_kwargs={"infer_steps": 4},
|
||||
seed_param=None, flow_edit_ctx=make_flow_edit_ctx(),
|
||||
)
|
||||
self.assertIn("target_latents", outputs)
|
||||
# Sentinel-7 tensors come from the mocked prepare_condition.
|
||||
self.assertIs(enc_hs, handler.prepared_enc_hs)
|
||||
self.assertIs(enc_am, handler.prepared_enc_am)
|
||||
self.assertIs(ctx, handler.prepared_ctx)
|
||||
|
|
@@ -60,15 +69,14 @@ class DispatchFlowEditTests(unittest.TestCase):
|
|||
def test_missing_flowedit_method_raises(self):
|
||||
handler = FakeHandler(model_has_flowedit=False)
|
||||
with self.assertRaises(RuntimeError) as cm:
|
||||
dispatch_flow_edit(
|
||||
dispatch_flow_edit_overlay(
|
||||
handler, payload=make_payload(),
|
||||
generate_kwargs={"infer_steps": 4}, seed_param=None,
|
||||
edit_ctx=make_edit_ctx(),
|
||||
flow_edit_ctx=make_flow_edit_ctx(),
|
||||
)
|
||||
self.assertIn("flowedit_generate_audio", str(cm.exception))
|
||||
|
||||
def test_token_ids_moved_to_handler_device(self):
|
||||
"""Regression for codex P1 round-1 finding."""
|
||||
handler = FakeHandler()
|
||||
captured_device = []
|
||||
|
||||
|
|
@@ -78,9 +86,9 @@ class DispatchFlowEditTests(unittest.TestCase):
|
|||
|
||||
handler.infer_text_embeddings = _capture
|
||||
handler.infer_lyric_embeddings = _capture
|
||||
dispatch_flow_edit(
|
||||
dispatch_flow_edit_overlay(
|
||||
handler, payload=make_payload(), generate_kwargs={"infer_steps": 4},
|
||||
seed_param=None, edit_ctx=make_edit_ctx(),
|
||||
seed_param=None, flow_edit_ctx=make_flow_edit_ctx(),
|
||||
)
|
||||
self.assertEqual(len(captured_device), 2)
|
||||
for d in captured_device:
|
||||
|
|
@@ -88,9 +96,9 @@ class DispatchFlowEditTests(unittest.TestCase):
|
|||
|
||||
def test_default_window_params_when_missing(self):
|
||||
handler = FakeHandler()
|
||||
dispatch_flow_edit(
|
||||
dispatch_flow_edit_overlay(
|
||||
handler, payload=make_payload(), generate_kwargs={"infer_steps": 4},
|
||||
seed_param=None, edit_ctx={"task_type": "edit"},
|
||||
seed_param=None, flow_edit_ctx={"morph": True, "task_type": "text2music"},
|
||||
)
|
||||
kwargs = handler.model.flowedit_generate_audio.call_args.kwargs
|
||||
self.assertEqual(kwargs["edit_n_min"], 0.0)
|
||||
|
|
@@ -98,58 +106,43 @@ class DispatchFlowEditTests(unittest.TestCase):
|
|||
self.assertEqual(kwargs["edit_n_avg"], 1)
|
||||
|
||||
def test_dict_metas_parsed_before_tokenization(self):
|
||||
"""Regression for codex P2 round-3 finding.
|
||||
|
||||
Pre-fix the dispatch passed raw metas (often dicts from
|
||||
``prepare_batch_data``) straight into
|
||||
``_prepare_text_conditioning_inputs``, so target prompts got a
|
||||
``{'bpm': 120}`` repr instead of the parsed ``- bpm: 120`` block
|
||||
the source path produced. Post-fix the dispatch calls
|
||||
``handler._parse_metas`` first, matching the source pipeline.
|
||||
"""
|
||||
handler = FakeHandler()
|
||||
edit_ctx = make_edit_ctx()
|
||||
# Real handler hands us a list of dicts here.
|
||||
edit_ctx["metas"] = [{"bpm": 120, "key": "C minor"}]
|
||||
dispatch_flow_edit(
|
||||
ctx = make_flow_edit_ctx()
|
||||
ctx["metas"] = [{"bpm": 120, "key": "C minor"}]
|
||||
dispatch_flow_edit_overlay(
|
||||
handler, payload=make_payload(), generate_kwargs={"infer_steps": 4},
|
||||
seed_param=None, edit_ctx=edit_ctx,
|
||||
seed_param=None, flow_edit_ctx=ctx,
|
||||
)
|
||||
self.assertTrue(any(
|
||||
isinstance(m, str) and m.startswith("PARSED:")
|
||||
for m in handler.captured_parsed_metas
|
||||
), f"target metas were not parsed: {handler.captured_parsed_metas}")
|
||||
), f"source metas were not parsed: {handler.captured_parsed_metas}")
|
||||
|
||||
def test_retake_seed_forwarded_from_generate_kwargs(self):
|
||||
"""Regression for codex P2 round-2 finding.
|
||||
|
||||
Pre-fix the dispatch only forwarded the main ``seed`` and dropped
|
||||
``retake_seed`` from ``generate_kwargs``, so retake variation /
|
||||
reproducibility silently fell back to the main seed under
|
||||
``task_type="edit"``.
|
||||
"""
|
||||
handler = FakeHandler()
|
||||
dispatch_flow_edit(
|
||||
dispatch_flow_edit_overlay(
|
||||
handler, payload=make_payload(),
|
||||
generate_kwargs={"infer_steps": 4, "retake_seed": [99, 100]},
|
||||
seed_param=42, edit_ctx=make_edit_ctx(),
|
||||
seed_param=42, flow_edit_ctx=make_flow_edit_ctx(),
|
||||
)
|
||||
kwargs = handler.model.flowedit_generate_audio.call_args.kwargs
|
||||
self.assertEqual(kwargs["seed"], 42)
|
||||
self.assertEqual(kwargs["retake_seed"], [99, 100])
|
||||
|
||||
|
||||
class EditSkipsLmPhaseTests(unittest.TestCase):
|
||||
"""Verify ``inference.generate_music`` skips LM Phase 1 for ``edit``."""
|
||||
class CoverStillRunsLmTests(unittest.TestCase):
|
||||
"""The overlay layers on cover; cover stays in skip_lm_tasks (LM Phase 1
|
||||
skipped because cover already extracts codes from the ref audio)."""
|
||||
|
||||
def test_edit_in_skip_lm_tasks_set(self):
|
||||
def test_no_edit_in_skip_lm_tasks(self):
|
||||
from pathlib import Path
|
||||
src = (Path(__file__).resolve().parents[3] / "inference.py").read_text()
|
||||
self.assertIn(
|
||||
'skip_lm_tasks = {"cover", "cover-nofsq", "repaint", "extract", "edit"}',
|
||||
'skip_lm_tasks = {"cover", "cover-nofsq", "repaint", "extract"}',
|
||||
src,
|
||||
"edit must be in skip_lm_tasks so LM Phase 1 doesn't replace src_audio",
|
||||
"edit task is removed; skip_lm_tasks should no longer mention it",
|
||||
)
|
||||
self.assertNotIn('"edit"', src.split("skip_lm_tasks")[1].split("\n")[0])
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
|
|
|
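The tests above inspect ``call_args.kwargs`` on a mocked ``flowedit_generate_audio`` to confirm that window parameters and the seed are forwarded. The sketch below shows the same pattern with the standard library only. ``dispatch_overlay_sketch`` and the ``n_min`` / ``n_max`` / ``n_avg`` context keys are hypothetical; the real ``dispatch_flow_edit_overlay`` takes more arguments and its context key names are not visible in this diff. The defaults (0.0, 1.0, 1) are assumed from the ``GenerationParams`` defaults.

```python
from unittest import mock


def dispatch_overlay_sketch(model, flow_edit_ctx, seed):
    """Hypothetical stand-in for the overlay dispatch (assumed key names)."""
    if not hasattr(model, "flowedit_generate_audio"):
        raise RuntimeError("model lacks flowedit_generate_audio")
    return model.flowedit_generate_audio(
        edit_n_min=flow_edit_ctx.get("n_min", 0.0),
        edit_n_max=flow_edit_ctx.get("n_max", 1.0),
        edit_n_avg=flow_edit_ctx.get("n_avg", 1),
        seed=seed,
    )


def forwarded_kwargs(flow_edit_ctx, seed=42):
    """Run the sketch against a MagicMock and return the kwargs it received."""
    model = mock.MagicMock()
    dispatch_overlay_sketch(model, flow_edit_ctx, seed)
    return model.flowedit_generate_audio.call_args.kwargs
```

``mock.Mock(spec=[])`` exposes no attributes, so passing it as the model is one way to exercise the missing-method guard without writing a custom fake class.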
|||
|
|
@@ -96,8 +96,6 @@ class TaskUtilsMixin:
                     TRACK_CLASSES=" | ".join(track_classes_upper)
                 )
             return TASK_INSTRUCTIONS["complete_default"]
-        if task_type == "edit":
-            return TASK_INSTRUCTIONS["edit"]
         return TASK_INSTRUCTIONS["text2music"]
 
     def determine_task_type(self, task_type, audio_code_string):
|
||||
|
|
|
|||
|
|
@@ -165,16 +165,22 @@ class GenerationParams:
|
|||
# retake_variance=0 is a no-op; the retake_seed is only consumed when variance>0.
|
||||
retake_seed: Optional[Union[str, int]] = None
|
||||
retake_variance: float = 0.0
|
||||
# Flow-edit (issue #1156): morph the source toward a new prompt/lyrics by
|
||||
# integrating V_delta = V_tar - V_src over [edit_n_min, edit_n_max].
|
||||
# Activated only when ``task_type == "edit"``; supported on the four
|
||||
# base DiT variants (xl_base / xl_sft / sft / base). v1 disables
|
||||
# DCW / heun / ADG inside the loop — see #1156 for the follow-up plan.
|
||||
edit_target_caption: str = ""
|
||||
edit_target_lyrics: str = ""
|
||||
edit_n_min: float = 0.0
|
||||
edit_n_max: float = 1.0
|
||||
edit_n_avg: int = 1
|
||||
# Flow-edit overlay (issue #1156): when True on a cover/cover-nofsq
|
||||
# task, paint the source audio toward the user's caption/lyrics by
|
||||
# integrating V_delta = V_tar(caption, lyrics) - V_src(source_caption,
|
||||
# source_lyrics) over [n_min, n_max]. The overlay layers on top of
|
||||
# the existing cover dispatch — there is no standalone "edit" task
|
||||
# type. ``flow_edit_source_caption`` / ``flow_edit_source_lyrics``
|
||||
# describe the *original* audio (what V_src is conditioned on);
|
||||
# ``caption`` / ``lyrics`` are the *target* (what V_tar morphs toward).
|
||||
# v1 disables DCW / heun / ADG inside the loop; see #1156 for the
|
||||
# follow-up plan.
|
||||
flow_edit_morph: bool = False
|
||||
flow_edit_source_caption: str = ""
|
||||
flow_edit_source_lyrics: str = ""
|
||||
flow_edit_n_min: float = 0.0
|
||||
flow_edit_n_max: float = 1.0
|
||||
flow_edit_n_avg: int = 1
|
||||
audio_cover_strength: float = 1.0
|
||||
cover_noise_strength: float = 0.0 # 0=pure noise (no cover), 1=closest to src audio
|
||||
|
||||
|
|
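The ``GenerationParams`` comment describes integrating V_delta = V_tar - V_src over [n_min, n_max]. The toy sketch below illustrates that loop on plain Python lists, using the same interpolation as the model code (``xt = t * noise + (1 - t) * x``). It follows the published FlowEdit formulation as best understood here and is not the repository's ``flow_edit_pipeline``. Assumptions: ``v_src`` and ``v_tar`` are hypothetical callables standing in for the DiT under source and target conditioning, and the window is read as a range of normalized time ``t``, which may not match how the real pipeline maps ``n_min`` / ``n_max`` to steps.

```python
import random


def flow_edit_integrate(x_src, v_src, v_tar, steps=60,
                        n_min=0.0, n_max=1.0, n_avg=1, seed=0):
    """Toy Euler integration of V_delta; time runs from t=1 (noise) to t=0."""
    rng = random.Random(seed)
    ts = [1.0 - i / steps for i in range(steps + 1)]
    zt_edit = list(x_src)  # mirrors ``zt_edit = src_latents.clone()``
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        if not (n_min < t_cur <= n_max):  # assumed window semantics
            continue
        delta = [0.0] * len(x_src)
        for _ in range(n_avg):
            noise = [rng.gauss(0.0, 1.0) for _ in x_src]
            # Noised source, and the target state that shares the same noise.
            zt_src = [t_cur * n + (1.0 - t_cur) * x for n, x in zip(noise, x_src)]
            zt_tar = [ze + zs - x for ze, zs, x in zip(zt_edit, zt_src, x_src)]
            vs, vt = v_src(zt_src, t_cur), v_tar(zt_tar, t_cur)
            delta = [d + (b - a) / n_avg for d, a, b in zip(delta, vs, vt)]
        zt_edit = [z + (t_next - t_cur) * d for z, d in zip(zt_edit, delta)]
    return zt_edit


def toy_velocity(center):
    """Toy 'model': exact velocity of the straight path ending at ``center``."""
    return lambda z, t: [(zi - c) / t for zi, c in zip(z, center)]
```

With identical source and target velocities the delta vanishes and the output stays at ``x_src``, which is the failure mode the first commit message describes (V_src ≈ V_tar gives the overlay nothing to integrate). With distinct toy centers and the full window, the result moves from the source center to the target center.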
@@ -379,7 +385,19 @@ def generate_music(
|
|||
"""
|
||||
try:
|
||||
# Phase 1: LM-based metadata and code generation (if enabled)
|
||||
audio_code_string_to_use = params.audio_codes
|
||||
# Flow-edit overlay on text2music must use the *VAE encoding* of
|
||||
# ``src_audio`` as the V_delta integration's starting latent, not
|
||||
# codes-decoded latents. ``conditioning_target._prepare_target_latents_and_wavs``
|
||||
# otherwise replaces target_wavs with zeros and drops in
|
||||
# ``_decode_audio_codes_to_latents(codes)`` whose output sits at a
|
||||
# different distribution than the VAE encoder produces — zt_edit
|
||||
# starts OOD and the integration collapses to a near-silent latent
|
||||
# (peak ~0.007 in the user's repro). Drop the codes here so the
|
||||
# downstream pipeline VAE-encodes the user's mp3 cleanly.
|
||||
if params.task_type == "text2music" and params.flow_edit_morph:
|
||||
audio_code_string_to_use = ""
|
||||
else:
|
||||
audio_code_string_to_use = params.audio_codes
|
||||
lm_generated_metadata = None
|
||||
lm_generated_audio_codes_list = []
|
||||
lm_total_time_costs = {
|
||||
|
|
@@ -439,12 +457,20 @@ def generate_music(
|
|||
# and don't need LM to generate audio codes or metadata.
|
||||
# For extract tasks, LLM-generated captions can conflict with the extract instruction
|
||||
# and cause the DiT model to reconstruct input audio instead of extracting stems.
|
||||
# Flow-edit (#1156) needs the user's source audio verbatim; running
|
||||
# LM Phase 1 would replace ``audio_code_string_to_use`` and the
|
||||
# handler would feed the LM-generated codes as source instead of
|
||||
# ``src_audio``. Add ``edit`` here to keep the source path clean.
|
||||
skip_lm_tasks = {"cover", "cover-nofsq", "repaint", "extract", "edit"}
|
||||
|
||||
skip_lm_tasks = {"cover", "cover-nofsq", "repaint", "extract"}
|
||||
# Flow-edit overlay on text2music must NOT trigger LM Phase 1.
|
||||
# Even if Think is on, the LM-generated codes would be routed
|
||||
# into ``conditioning_target`` which replaces target_wavs with
|
||||
# zeros and uses ``_decode_audio_codes_to_latents(codes)`` for
|
||||
# target_latents — flow-edit's ``zt_edit = src_latents.clone()``
|
||||
# then starts at a codes-decoded latent (different distribution
|
||||
# than VAE encode) and the V_delta integration collapses to a
|
||||
# near-silent latent. Treat morph-on-text2music like a skip
|
||||
# task so Think / CoT both no-op.
|
||||
morph_on_text2music = (
|
||||
params.task_type == "text2music" and params.flow_edit_morph
|
||||
)
|
||||
|
||||
# Determine if we should use LLM
|
||||
# LLM is needed for:
|
||||
# 1. thinking=True: generate audio codes via LM
|
||||
|
|
@@ -452,11 +478,13 @@ def generate_music(
|
|||
# 3. use_cot_language=True: detect vocal language via CoT
|
||||
# 4. use_cot_metas=True: fill missing metadata via CoT
|
||||
need_lm_for_cot = params.use_cot_caption or params.use_cot_language or params.use_cot_metas
|
||||
use_lm = (params.thinking or need_lm_for_cot) and llm_handler is not None and llm_handler.llm_initialized and params.task_type not in skip_lm_tasks
|
||||
skip_lm = params.task_type in skip_lm_tasks or morph_on_text2music
|
||||
use_lm = (params.thinking or need_lm_for_cot) and llm_handler is not None and llm_handler.llm_initialized and not skip_lm
|
||||
lm_status = []
|
||||
|
||||
if params.task_type in skip_lm_tasks:
|
||||
logger.info(f"Skipping LM for task_type='{params.task_type}' - using DiT directly")
|
||||
|
||||
if skip_lm:
|
||||
reason = params.task_type if params.task_type in skip_lm_tasks else f"{params.task_type}+flow_edit_morph"
|
||||
logger.info(f"Skipping LM for task_type='{reason}' - using DiT directly")
|
||||
|
||||
logger.info(f"[generate_music] LLM usage decision: thinking={params.thinking}, "
|
||||
f"use_cot_caption={params.use_cot_caption}, use_cot_language={params.use_cot_language}, "
|
||||
|
|
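The LM gating in this hunk reduces to a small predicate: the LM runs only when Think or any CoT option asks for it, the LLM is initialized, and neither the task type nor the morph overlay on text2music forces a skip. A minimal sketch of that decision is below; ``should_use_lm`` and its flattened arguments are illustrative names, not functions in ``inference.py``.

```python
SKIP_LM_TASKS = {"cover", "cover-nofsq", "repaint", "extract"}


def should_use_lm(task_type, flow_edit_morph, thinking, need_lm_for_cot, llm_ready):
    """Mirror of the use_lm expression, with the handler checks folded into llm_ready."""
    morph_on_text2music = task_type == "text2music" and flow_edit_morph
    skip_lm = task_type in SKIP_LM_TASKS or morph_on_text2music
    return bool((thinking or need_lm_for_cot) and llm_ready and not skip_lm)
```

Turning on ``flow_edit_morph`` for a text2music request therefore disables Think and CoT for that request, even if the user enabled them.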
@@ -656,9 +684,14 @@ def generate_music(
|
|||
"reference_audio": params.reference_audio,
|
||||
"audio_duration": audio_duration,
|
||||
"batch_size": config.batch_size if config.batch_size is not None else 1,
|
||||
# text2music (Custom mode) never uses src_audio; force None to
|
||||
# prevent stale UI values from leaking into generation.
|
||||
"src_audio": None if params.task_type == "text2music" else params.src_audio,
|
||||
# text2music (Custom mode) never uses src_audio EXCEPT when
|
||||
# flow_edit_morph=True — the overlay needs ``src_audio`` for
|
||||
# zt_src/zt_tar formation in the V_delta integration.
|
||||
"src_audio": (
|
||||
params.src_audio
|
||||
if params.task_type != "text2music" or params.flow_edit_morph
|
||||
else None
|
||||
),
|
||||
"audio_code_string": audio_code_string_to_use,
|
||||
"repainting_start": params.repainting_start,
|
||||
"repainting_end": params.repainting_end,
|
||||
|
|
@@ -669,11 +702,12 @@ def generate_music(
|
|||
"repaint_strength": params.repaint_strength,
|
||||
"retake_seed": params.retake_seed,
|
||||
"retake_variance": params.retake_variance,
|
||||
"edit_target_caption": params.edit_target_caption,
|
||||
"edit_target_lyrics": params.edit_target_lyrics,
|
||||
"edit_n_min": params.edit_n_min,
|
||||
"edit_n_max": params.edit_n_max,
|
||||
"edit_n_avg": params.edit_n_avg,
|
||||
"flow_edit_morph": params.flow_edit_morph,
|
||||
"flow_edit_source_caption": params.flow_edit_source_caption,
|
||||
"flow_edit_source_lyrics": params.flow_edit_source_lyrics,
|
||||
"flow_edit_n_min": params.flow_edit_n_min,
|
||||
"flow_edit_n_max": params.flow_edit_n_max,
|
||||
"flow_edit_n_avg": params.flow_edit_n_avg,
|
||||
"instruction": params.instruction,
|
||||
"audio_cover_strength": params.audio_cover_strength,
|
||||
"cover_noise_strength": params.cover_noise_strength,
|
||||
|
|
|
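Two input rules from the ``generate_music`` hunks interact: text2music keeps ``src_audio`` only when the morph overlay is on, and in that same case the audio codes are dropped so that the source is VAE-encoded rather than decoded from codes. The sketch below combines both rules in one hypothetical helper (``resolve_generation_inputs`` is not a function in the repository).

```python
def resolve_generation_inputs(task_type, flow_edit_morph, src_audio, audio_codes):
    """Return the src_audio / audio_code_string pair the handler would receive."""
    morph_on_text2music = task_type == "text2music" and flow_edit_morph
    keep_src_audio = task_type != "text2music" or flow_edit_morph
    return {
        # Plain text2music ignores stale uploads; the overlay needs the upload.
        "src_audio": src_audio if keep_src_audio else None,
        # The overlay on text2music must start from VAE latents, not code latents.
        "audio_code_string": "" if morph_on_text2music else audio_codes,
    }
```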
|||
|
|
@@ -66,7 +66,7 @@ def _warn_about_disabled_v1_tricks(
         disabled.append("dcw_enabled=True")
     if disabled:
         logger.info(
-            "[flowedit] task_type='edit' v1 ignores {}; forcing euler + plain "
+            "[flowedit] overlay v1 ignores {}; forcing euler + plain "
             "CFG/APG. See issue #1156 for the per-feature follow-up plan.",
             ", ".join(disabled),
         )
|
||||
|
|
@@ -113,6 +113,10 @@ def flowedit_generate_audio(
|
|||
# describe the *source audio*, which is shared across src and tar).
|
||||
precomputed_lm_hints_25Hz: Optional[torch.Tensor] = None,
|
||||
audio_codes: Optional[torch.Tensor] = None,
|
||||
# Override for the audio-context input to ``prepare_condition``.
|
||||
# text2music callers pass ``silence`` here so V_delta is purely
|
||||
# text-driven; defaults to ``src_latents`` for back-compat.
|
||||
ctx_src_latents: Optional[torch.Tensor] = None,
|
||||
# Accepted-but-disabled v1 sampler tricks (logged + bypassed)
|
||||
sampler_mode: str = "euler",
|
||||
use_adg: bool = False,
|
||||
|
|
@@ -133,37 +137,26 @@ def flowedit_generate_audio(
|
|||
device=src_latents.device, dtype=src_latents.dtype,
|
||||
)
|
||||
|
||||
src_enc_hs, src_enc_am, src_ctx = model.prepare_condition(
|
||||
text_hidden_states=text_hidden_states,
|
||||
text_attention_mask=text_attention_mask,
|
||||
lyric_hidden_states=lyric_hidden_states,
|
||||
lyric_attention_mask=lyric_attention_mask,
|
||||
refer_audio_acoustic_hidden_states_packed=refer_audio_acoustic_hidden_states_packed,
|
||||
refer_audio_order_mask=refer_audio_order_mask,
|
||||
hidden_states=src_latents,
|
||||
attention_mask=attention_mask,
|
||||
silence_latent=silence_latent,
|
||||
src_latents=src_latents,
|
||||
chunk_masks=chunk_masks,
|
||||
is_covers=is_covers,
|
||||
precomputed_lm_hints_25Hz=precomputed_lm_hints_25Hz,
|
||||
audio_codes=audio_codes,
|
||||
ctx_input = ctx_src_latents if ctx_src_latents is not None else src_latents
|
||||
def _prep(text_hs, text_am, lyric_hs, lyric_am):
|
||||
return model.prepare_condition(
|
||||
text_hidden_states=text_hs, text_attention_mask=text_am,
|
||||
lyric_hidden_states=lyric_hs, lyric_attention_mask=lyric_am,
|
||||
refer_audio_acoustic_hidden_states_packed=refer_audio_acoustic_hidden_states_packed,
|
||||
refer_audio_order_mask=refer_audio_order_mask,
|
||||
hidden_states=ctx_input, attention_mask=attention_mask,
|
||||
silence_latent=silence_latent, src_latents=ctx_input,
|
||||
chunk_masks=chunk_masks, is_covers=is_covers,
|
||||
precomputed_lm_hints_25Hz=precomputed_lm_hints_25Hz,
|
||||
audio_codes=audio_codes,
|
||||
)
|
||||
src_enc_hs, src_enc_am, src_ctx = _prep(
|
||||
text_hidden_states, text_attention_mask,
|
||||
lyric_hidden_states, lyric_attention_mask,
|
||||
)
|
||||
tar_enc_hs, tar_enc_am, tar_ctx = model.prepare_condition(
|
||||
text_hidden_states=target_text_hidden_states,
|
||||
text_attention_mask=target_text_attention_mask,
|
||||
lyric_hidden_states=target_lyric_hidden_states,
|
||||
lyric_attention_mask=target_lyric_attention_mask,
|
||||
refer_audio_acoustic_hidden_states_packed=refer_audio_acoustic_hidden_states_packed,
|
||||
refer_audio_order_mask=refer_audio_order_mask,
|
||||
hidden_states=src_latents,
|
||||
attention_mask=attention_mask,
|
||||
silence_latent=silence_latent,
|
||||
src_latents=src_latents,
|
||||
chunk_masks=chunk_masks,
|
||||
is_covers=is_covers,
|
||||
precomputed_lm_hints_25Hz=precomputed_lm_hints_25Hz,
|
||||
audio_codes=audio_codes,
|
||||
tar_enc_hs, tar_enc_am, tar_ctx = _prep(
|
||||
target_text_hidden_states, target_text_attention_mask,
|
||||
target_lyric_hidden_states, target_lyric_attention_mask,
|
||||
)
|
||||
|
||||
# Forward-noise generators: prefer retake_seed (independent draws), else
|
||||
|
|
|
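The ``_prep`` refactor above makes both branches share one audio-context input, falling back to ``src_latents`` when ``ctx_src_latents`` is not given, so that only the text and lyric conditioning differs between V_src and V_tar. A stripped-down sketch of that pattern is below. ``build_paired_conditions`` and the dict-shaped conditions are illustrative; the real call passes many more keyword arguments to ``model.prepare_condition``.

```python
def build_paired_conditions(prepare_condition, source_cond, target_cond,
                            src_latents, ctx_src_latents=None, **shared):
    """Call prepare_condition twice with an identical audio context."""
    ctx_input = ctx_src_latents if ctx_src_latents is not None else src_latents

    def _prep(cond):
        # Same context and shared kwargs for both branches; only ``cond`` varies.
        return prepare_condition(
            hidden_states=ctx_input, src_latents=ctx_input, **cond, **shared
        )

    return _prep(source_cond), _prep(target_cond)
```

Passing silence as ``ctx_src_latents`` corresponds to the text2music case in the commit message, where V_delta is meant to be purely text-driven.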
|||
|
|
@@ -1826,7 +1826,29 @@ class AceStepConditionGenerationModel(AceStepPreTrainedModel):
|
|||
t = t.unsqueeze(-1).unsqueeze(-1)
|
||||
xt = t * noise + (1 - t) * x
|
||||
return xt
|
||||
|
||||
|
||||
def flowedit_generate_audio(self, **kwargs):
|
||||
"""Flow-edit overlay (#1156) on the CFG-distilled turbo variant.
|
||||
|
||||
Force ``diffusion_guidance_scale=1.0`` because turbo bakes CFG
|
||||
into the velocity head — paired-CFG would amplify a delta the
|
||||
model wasn't trained to produce. Otherwise delegate to the
|
||||
shared pipeline.
|
||||
"""
|
||||
from acestep.models.common.flow_edit_pipeline import (
|
||||
flowedit_generate_audio as _flowedit_impl,
|
||||
)
|
||||
|
||||
gs = kwargs.get("diffusion_guidance_scale", 1.0)
|
||||
if gs != 1.0:
|
||||
from loguru import logger
|
||||
logger.info(
|
||||
"[turbo flowedit] turbo is CFG-distilled; forcing "
|
||||
"diffusion_guidance_scale=1.0 (was {:.2f}).", gs,
|
||||
)
|
||||
kwargs["diffusion_guidance_scale"] = 1.0
|
||||
return _flowedit_impl(self, **kwargs)
|
||||
|
||||
def generate_audio(
|
||||
self,
|
||||
text_hidden_states: torch.FloatTensor,
|
||||
|
|
|
|||
|
|
@@ -1838,7 +1838,29 @@ class AceStepConditionGenerationModel(AceStepPreTrainedModel):
|
|||
t = t.unsqueeze(-1).unsqueeze(-1)
|
||||
xt = t * noise + (1 - t) * x
|
||||
return xt
|
||||
|
||||
|
||||
def flowedit_generate_audio(self, **kwargs):
|
||||
"""Flow-edit overlay (#1156) on the CFG-distilled XL-turbo variant.
|
||||
|
||||
Force ``diffusion_guidance_scale=1.0`` because XL-turbo bakes
|
||||
CFG into the velocity head — paired-CFG would amplify a delta
|
||||
the model wasn't trained to produce. Otherwise delegate to
|
||||
the shared pipeline.
|
||||
"""
|
||||
from acestep.models.common.flow_edit_pipeline import (
|
||||
flowedit_generate_audio as _flowedit_impl,
|
||||
)
|
||||
|
||||
gs = kwargs.get("diffusion_guidance_scale", 1.0)
|
||||
if gs != 1.0:
|
||||
from loguru import logger
|
||||
logger.info(
|
||||
"[xl-turbo flowedit] xl-turbo is CFG-distilled; forcing "
|
||||
"diffusion_guidance_scale=1.0 (was {:.2f}).", gs,
|
||||
)
|
||||
kwargs["diffusion_guidance_scale"] = 1.0
|
||||
return _flowedit_impl(self, **kwargs)
|
||||
|
||||
def generate_audio(
|
||||
self,
|
||||
text_hidden_states: torch.FloatTensor,
|
||||
|
|
|
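Both turbo variants add the same wrapper: log if the caller asked for a guidance scale other than 1.0, force it to 1.0, then delegate to the shared pipeline. The override step alone can be sketched as below. ``force_unit_guidance`` is a hypothetical helper; it uses the standard ``logging`` module where the repository uses ``loguru``, and it copies ``kwargs`` where the methods above mutate them in place.

```python
import logging

logger = logging.getLogger("flowedit_sketch")


def force_unit_guidance(kwargs, variant="turbo"):
    """Return a copy of kwargs with diffusion_guidance_scale pinned to 1.0."""
    out = dict(kwargs)
    gs = out.get("diffusion_guidance_scale", 1.0)
    if gs != 1.0:
        logger.info(
            "[%s flowedit] CFG-distilled; forcing diffusion_guidance_scale=1.0 (was %.2f).",
            variant, gs,
        )
    out["diffusion_guidance_scale"] = 1.0
    return out
```

Because the two model classes carry identical logic apart from the log prefix, a shared helper of this kind is one way the duplication could be reduced in a follow-up.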
|||
|
|
@@ -32,7 +32,9 @@ def compute_mode_ui_updates(mode: str, llm_handler=None, previous_mode: str = "C
     show_custom_group = not_simple and not is_extract
     show_generate_row = not_simple
     generate_interactive = not_simple
-    show_src_audio = is_cover or is_repaint or is_extract or is_lego or is_complete
+    # Custom mode shows src_audio so the flow-edit morph overlay can use
+    # it; the row is harmless when morph is off (just an unused upload).
+    show_src_audio = is_cover or is_repaint or is_extract or is_lego or is_complete or is_custom
     show_optional = not_simple and not is_extract and not is_lego
     show_repainting = is_repaint or is_lego
     show_audio_codes = is_custom
|
||||
|
|
@@ -174,7 +176,8 @@ def compute_mode_ui_updates(mode: str, llm_handler=None, previous_mode: str = "C
|
|||
gr.skip(), # 36: retake_seed
|
||||
mode, # 37: previous_generation_mode
|
||||
gr.update(visible=is_cover), # 34: remix_help_group
|
||||
gr.update(visible=(is_extract or is_lego)), # 35: extract_help_group
|
||||
gr.update(visible=(is_custom or is_cover or is_repaint)), # 35: variation_group (Retake all 3; Edit honoured in Custom/Remix)
|
||||
gr.update(visible=(is_extract or is_lego)), # 36: extract_help_group
|
||||
gr.update(visible=is_complete), # 36: complete_help_group
|
||||
auto_bpm_update, # 37: bpm_auto
|
||||
auto_key_update, # 38: key_auto
|
||||
|
|
|
|||
|
|
@@ -123,13 +123,15 @@ def validate_uploaded_audio_file(audio_value: Any, audio_role: str = "reference"
|
|||
soundfile.info(audio_path)
|
||||
return gr.skip()
|
||||
except (OSError, RuntimeError, ValueError):
|
||||
role_label = (
|
||||
t("generation.reference_audio")
|
||||
if audio_role == "reference"
|
||||
else t("generation.source_audio")
|
||||
)
|
||||
gr.Warning(t("messages.audio_format_invalid", role=role_label))
|
||||
return gr.update(value=None)
|
||||
# ``soundfile`` (libsndfile) has spotty mp3 support and refuses
|
||||
# files torchaudio / process_src_audio handle fine. Issuing a
|
||||
# silent ``gr.update(value=None)`` here previously cleared the
|
||||
# user's upload while leaving the waveform visible (cached by
|
||||
# the player) — the user clicked Generate and got
|
||||
# "src_audio=None" with no obvious cause. Preserve the value;
|
||||
# the backend's own decode path will surface a clearer error
|
||||
# if the file is genuinely unreadable.
|
||||
return gr.skip()
|
||||
|
||||
|
||||
def _contains_audio_code_tokens(codes_string: str) -> bool:
|
||||
|
|
|
|||
|
|
@@ -77,14 +77,21 @@ def send_audio_to_remix(audio_file, lm_metadata, current_lyrics, current_caption
|
|||
"""
|
||||
if audio_file is None:
|
||||
mode_updates = compute_mode_ui_updates("Remix", llm_handler, previous_mode=current_mode)
|
||||
return (gr.skip(), gr.skip(), gr.skip(), gr.skip()) + (gr.skip(),) * len(mode_updates)
|
||||
return (gr.skip(),) * 6 + (gr.skip(),) * len(mode_updates)
|
||||
|
||||
lyrics, caption = _extract_metadata_for_editing(lm_metadata, current_lyrics, current_caption)
|
||||
mode_updates = list(compute_mode_ui_updates("Remix", llm_handler, previous_mode=current_mode))
|
||||
mode_updates[19] = gr.update(value=caption, visible=True, interactive=True)
|
||||
mode_updates[20] = gr.update(value=lyrics, visible=True, interactive=True)
|
||||
|
||||
return (audio_file, gr.update(value="Remix"), lyrics, caption, *mode_updates)
|
||||
# Pre-fill flow-edit source fields with the prior conditions so the
|
||||
# user can use the morph overlay against the previous prompt as V_src
|
||||
# and edit the top-level caption / lyrics as V_tar.
|
||||
return (
|
||||
audio_file, gr.update(value="Remix"), lyrics, caption,
|
||||
gr.update(value=caption), gr.update(value=lyrics),
|
||||
*mode_updates,
|
||||
)
|
||||
|
||||
|
||||
def send_audio_to_repaint(audio_file, lm_metadata, current_lyrics, current_caption,
|
||||
|
|
@@ -107,14 +114,18 @@ def send_audio_to_repaint(audio_file, lm_metadata, current_lyrics, current_capti
|
|||
"""
|
||||
if audio_file is None:
|
||||
mode_updates = compute_mode_ui_updates("Repaint", llm_handler, previous_mode=current_mode)
|
||||
return (gr.skip(), gr.skip(), gr.skip(), gr.skip()) + (gr.skip(),) * len(mode_updates)
|
||||
return (gr.skip(),) * 6 + (gr.skip(),) * len(mode_updates)
|
||||
|
||||
lyrics, caption = _extract_metadata_for_editing(lm_metadata, current_lyrics, current_caption)
|
||||
mode_updates = list(compute_mode_ui_updates("Repaint", llm_handler, previous_mode=current_mode))
|
||||
mode_updates[19] = gr.update(value=caption, visible=True, interactive=True)
|
||||
mode_updates[20] = gr.update(value=lyrics, visible=True, interactive=True)
|
||||
|
||||
return (audio_file, gr.update(value="Repaint"), lyrics, caption, *mode_updates)
|
||||
return (
|
||||
audio_file, gr.update(value="Repaint"), lyrics, caption,
|
||||
gr.update(value=caption), gr.update(value=lyrics),
|
||||
*mode_updates,
|
||||
)
|
||||
|
||||
|
||||
def convert_result_audio_to_codes(dit_handler, generated_audio):
|
||||
|
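The early-exit change from four to six `gr.skip()` values in both handlers follows from the two new source-field outputs: the returned tuple has to stay the same length on every path. A minimal sketch of that invariant, assuming a plain sentinel in place of `gr.skip()` and hypothetical output names (the first output name in particular is not visible in this diff):

```python
# Stand-in for gr.skip(); a plain sentinel keeps the sketch free of Gradio.
SKIP = object()

# Hypothetical names for the six fixed outputs; the last two are the
# flow-edit source fields that this commit adds to the handlers.
FIXED_OUTPUTS = (
    "src_audio",
    "generation_mode",
    "lyrics",
    "captions",
    "flow_edit_source_caption",
    "flow_edit_source_lyrics",
)


def build_send_result(audio_file, lyrics, caption, mode_updates, mode="Remix"):
    """Return one value per output, on both the early-exit and the normal path."""
    if audio_file is None:
        # Nothing to send: skip every fixed output and every mode-UI output.
        return (SKIP,) * len(FIXED_OUTPUTS) + (SKIP,) * len(mode_updates)
    # Source fields are pre-filled with the prior caption / lyrics (V_src side).
    return (audio_file, mode, lyrics, caption, caption, lyrics, *mode_updates)
```

Deriving the skip count from the fixed-output tuple, rather than hard-coding `6`, is one way to keep the two paths from drifting apart when outputs are added later.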
|
|
|||
|
|
@@ -53,6 +53,12 @@ def generate_with_batch_management(
     repaint_strength,
     retake_variance,
     retake_seed,
+    flow_edit_morph,
+    flow_edit_source_caption,
+    flow_edit_source_lyrics,
+    flow_edit_n_min,
+    flow_edit_n_max,
+    flow_edit_n_avg,
     autogen_checkbox,
     current_batch_index,
     total_batches,
@@ -88,6 +94,8 @@ def generate_with_batch_management(
         latent_shift, latent_rescale,
         repaint_mode, repaint_strength,
         retake_variance, retake_seed,
+        flow_edit_morph, flow_edit_source_caption, flow_edit_source_lyrics,
+        flow_edit_n_min, flow_edit_n_max, flow_edit_n_avg,
         progress,
     )
 
|
|
|
|||
|
|
@@ -63,6 +63,12 @@ def generate_with_progress(
     repaint_strength,
     retake_variance=0.0,
     retake_seed="",
+    flow_edit_morph=False,
+    flow_edit_source_caption="",
+    flow_edit_source_lyrics="",
+    flow_edit_n_min=0.0,
+    flow_edit_n_max=1.0,
+    flow_edit_n_avg=1,
     progress=gr.Progress(track_tqdm=True),
 ):
     """Generate audio with progress tracking.
@@ -110,7 +116,12 @@ def generate_with_progress(
 
     task_type = resolve_no_fsq_task_type(task_type, bool(no_fsq))
 
-    if task_type == "text2music":
+    # text2music never uses src_audio EXCEPT when flow_edit_morph is on:
+    # the morph overlay needs the source audio for ``zt_src``/``zt_tar``
+    # formation in the V_delta integration. Without this guard the UI
+    # silently zeroed src_audio for Custom mode and the backend's morph
+    # check then errored with "Flow-edit morph requires a source audio".
+    if task_type == "text2music" and not flow_edit_morph:
         src_audio = None
 
     # Defensive guard: cover/repaint/extract/lego tasks should never use
|
|
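The guard in the hunk above can be read as a small pure function. The sketch below restates it under that assumption; `resolve_src_audio` is a hypothetical helper name, not something defined in the repository.

```python
from typing import Optional


def resolve_src_audio(
    task_type: str, src_audio: Optional[str], flow_edit_morph: bool
) -> Optional[str]:
    """Drop src_audio for plain text2music, keep it when the morph overlay is on."""
    if task_type == "text2music" and not flow_edit_morph:
        # Plain text2music never consumes source audio.
        return None
    # Either a task that uses source audio, or text2music with the overlay
    # enabled, which needs the source latent for the V_delta integration.
    return src_audio
```

For example, `resolve_src_audio("text2music", "song.wav", True)` keeps the path, while the same call with `False` drops it.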
@@ -174,6 +185,12 @@ def generate_with_progress(
         retake_variance=float(retake_variance) if retake_variance is not None else 0.0,
         # Empty textbox -> None; otherwise a string is fine (handler.prepare_seeds parses it).
         retake_seed=(retake_seed.strip() or None) if isinstance(retake_seed, str) else retake_seed,
+        flow_edit_morph=bool(flow_edit_morph),
+        flow_edit_source_caption=flow_edit_source_caption or "",
+        flow_edit_source_lyrics=flow_edit_source_lyrics or "",
+        flow_edit_n_min=float(flow_edit_n_min) if flow_edit_n_min is not None else 0.0,
+        flow_edit_n_max=float(flow_edit_n_max) if flow_edit_n_max is not None else 1.0,
+        flow_edit_n_avg=int(flow_edit_n_avg) if flow_edit_n_avg is not None else 1,
     )
 
     if isinstance(seed, str) and seed.strip():
|
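The six added keyword arguments coerce raw UI values into typed parameters with fixed fallbacks (`0.0`, `1.0`, `1`). A minimal sketch of the same coercion as a standalone function, assuming a plain dict instead of the real parameter object; `normalize_flow_edit_inputs` is a hypothetical name used only for illustration.

```python
def normalize_flow_edit_inputs(
    morph, source_caption, source_lyrics, n_min, n_max, n_avg
):
    """Coerce raw UI values (possibly None) into typed flow-edit parameters."""
    return {
        "flow_edit_morph": bool(morph),
        # An empty or missing textbox becomes an empty string.
        "flow_edit_source_caption": source_caption or "",
        "flow_edit_source_lyrics": source_lyrics or "",
        # Window bounds default to the full schedule (0.0 .. 1.0).
        "flow_edit_n_min": float(n_min) if n_min is not None else 0.0,
        "flow_edit_n_max": float(n_max) if n_max is not None else 1.0,
        # Sliders may deliver floats; the sample count must be an int.
        "flow_edit_n_avg": int(n_avg) if n_avg is not None else 1,
    }
```

Checking `is not None` rather than truthiness matters for `n_min`: a legitimate value of `0` must not be replaced by the default.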
|
|
|||
|
|
@@ -72,6 +72,7 @@ _MODE_UI_OUTPUT_KEYS = (
     "retake_seed",
     "previous_generation_mode",
     "remix_help_group",
+    "variation_group",
     "extract_help_group",
     "complete_help_group",
     "bpm_auto",
|
|
|
|||
|
|
@@ -66,6 +66,7 @@ MODE_OUTPUT_EXPECTED = [
     "retake_seed",
     "previous_generation_mode",
     "remix_help_group",
+    "variation_group",
     "extract_help_group",
     "complete_help_group",
     "bpm_auto",
|
|
|
|||
|
|
@@ -127,6 +127,12 @@ def register_generation_run_handlers(context: GenerationWiringContext) -> None:
             generation_section["repaint_strength"],
             generation_section["retake_variance"],
             generation_section["retake_seed"],
+            generation_section["flow_edit_morph"],
+            generation_section["flow_edit_source_caption"],
+            generation_section["flow_edit_source_lyrics"],
+            generation_section["flow_edit_n_min"],
+            generation_section["flow_edit_n_max"],
+            generation_section["flow_edit_n_avg"],
             generation_section["autogen_checkbox"],
             results_section["current_batch_index"],
             results_section["total_batches"],
|
|
|
|||
|
|
@@ -53,6 +53,44 @@ def register_generation_text_format_handlers(
         outputs=list(auto_checkbox_outputs),
     )
 
+    # ========== Edit: Copy Current Caption/Lyrics into Source Fields ==========
+    # Quick way to bootstrap V_src with the user-level prompt before they
+    # edit the top-level fields to define V_tar.
+    generation_section["flow_edit_copy_from_current_btn"].click(
+        fn=lambda caption, lyrics: (caption or "", lyrics or ""),
+        inputs=[
+            generation_section["captions"],
+            generation_section["lyrics"],
+        ],
+        outputs=[
+            generation_section["flow_edit_source_caption"],
+            generation_section["flow_edit_source_lyrics"],
+        ],
+    )
+
+    # ========== Retake × Think interaction warning ==========
+    # Retake's variation is only meaningful if every other condition (LM
+    # codes included) matches the baseline. When Think is on the LM
+    # regenerates codes per call, so Retake's noise blend layers on top
+    # of an already-different starting point. Surface a one-line warning
+    # in the Retake panel whenever both checkboxes are simultaneously on.
+    def _retake_think_warn(retake_on: bool, think_on: bool):
+        import gradio as _gr  # local import — module top reserved for typing
+        return _gr.update(visible=bool(retake_on and think_on))
+
+    for trigger in (
+        generation_section["retake_enabled"],
+        generation_section["think_checkbox"],
+    ):
+        trigger.change(
+            fn=_retake_think_warn,
+            inputs=[
+                generation_section["retake_enabled"],
+                generation_section["think_checkbox"],
+            ],
+            outputs=[generation_section["retake_think_warning"]],
+        )
+
     # ========== Format Lyrics Button ==========
     generation_section["format_lyrics_btn"].click(
         fn=lambda caption, lyrics, bpm, duration, key_scale, time_sig, temp, top_k, top_p, debug: gen_h.handle_format_lyrics(
|
|
|
|||
|
|
@@ -56,6 +56,8 @@ def register_results_aux_handlers(
                 generation_section["generation_mode"],
                 generation_section["lyrics"],
                 generation_section["captions"],
+                generation_section["flow_edit_source_caption"],
+                generation_section["flow_edit_source_lyrics"],
             ]
             + list(mode_ui_outputs),
         )
@@ -75,6 +77,8 @@ def register_results_aux_handlers(
                 generation_section["generation_mode"],
                 generation_section["lyrics"],
                 generation_section["captions"],
+                generation_section["flow_edit_source_caption"],
+                generation_section["flow_edit_source_lyrics"],
             ]
             + list(mode_ui_outputs),
         )
|
|
|
|||
|
|
@@ -92,6 +92,7 @@
     "analyze_btn": "🔍 Analyze",
     "sample_btn": "🎲 Click Me",
     "lm_codes_hints": "🎼 LM Codes Hints",
+    "lm_codes_audio_upload_label": "Audio → Codes (utility)",
     "lm_codes_label": "LM Codes Hints",
     "lm_codes_placeholder": "<|audio_code_10695|><|audio_code_54246|>.",
     "lm_codes_info": "Paste LM codes hints for text2music generation.",
@ -521,6 +522,8 @@
|
|||
"generation_caption": "## Writing Good Captions\n\n### Structure\nA good caption includes:\n- **Genre/Style**: pop, rock, jazz, electronic, classical…\n- **Instruments**: guitar, piano, synth, drums, strings…\n- **Mood**: upbeat, melancholic, energetic, dreamy…\n- **Vocal style**: whispered, powerful, falsetto, rap…\n- **Tempo feel**: fast, slow, moderate, driving…\n\n### Examples\n- *\"Energetic pop-punk with distorted guitars, fast drums, and shouted vocals\"*\n- *\"Smooth jazz trio with walking bass, brushed drums, and mellow piano\"*\n- *\"Ambient electronic with layered synth pads and no vocals\"*\n\n### Tips\n- More detail = better results\n- Use the **Format** button to let AI enhance your caption\n- Check 🎲 for example captions",
|
||||
"generation_lyrics": "## Writing Lyrics\n\n### Structure Tags\nUse section tags to structure your song:\n```\n[Verse 1]\nYour verse lyrics here\n\n[Chorus]\nYour chorus lyrics here\n\n[Verse 2]\nSecond verse here\n\n[Bridge]\nBridge section\n\n[Outro]\nEnding lyrics\n```\n\n### Tips\n- Keep verses 4–8 lines\n- Choruses should be memorable and repetitive\n- Use `[Instrumental]` or `[Interlude]` for non-vocal sections\n- Check **Instrumental** checkbox for pure instrumental music\n- Select **Vocal Language** to match your lyrics language",
|
||||
"generation_advanced": "## Advanced Settings\n\n### Key Parameters\n- **Inference Steps**: Turbo=8 (default), Base=up to 200. More steps ≠ always better for turbo\n- **Guidance Scale**: Base model only. Higher = follows prompt more strictly\n- **Shift**: Timestep shift (1.0–5.0). 3.0 recommended for turbo\n- **Seed**: Set a specific seed for reproducible results\n\n### LM Parameters\n- **Temperature** (0.0–2.0): Higher = more creative/random\n- **CFG Scale** (1.0–3.0): Higher = follows prompt more\n- **Top-K / Top-P**: Sampling strategies for diversity\n\n### Think Mode\nEnable **Think** to use 5Hz LM for smarter generation:\n- Generates semantic codes and metadata\n- Requires LM to be initialized\n- **ParallelThinking**: Process batches in parallel (faster)",
|
||||
"generation_retake": "## Retake (Variation Generation)\n\n**Best for:** producing controlled, smooth variations of a seeded baseline without changing prompts.\n\n### Principle\nRetake mixes a fresh independent noise draw into the seeded initial noise via a variance-preserving sin/cos blend:\n\n```\nmixed = cos(v · π/2) · base_noise + sin(v · π/2) · retake_noise\n```\n\nBecause `cos² + sin² = 1`, the total noise variance is preserved exactly. `v=0` is a no-op (reproduces the baseline); `v=1` swaps the noise entirely.\n\n### Inputs\n- **variance** (0–1): blend factor.\n - `0` = baseline (identical to no retake)\n - `0.05–0.15` = subtle variation, same melody/structure with minor differences\n - `0.3–0.5` = moderate drift\n - `0.5+` = strong drift, may diverge significantly\n- **seed** (integer, optional): independent reproducibility seed for the retake noise. Empty = random per call.\n\n### ⚠️ Consistency requirement\nRetake variation is **only meaningful** if every other condition matches the baseline run. In particular:\n\n- The main **seed** must be the same — leave the random-seed checkbox off and reuse the seed from the baseline batch.\n- **Think** must be **off**, OR the LM-generated codes from the baseline must be reused. With Think on the LM regenerates audio codes from scratch each call, so the input to the diffusion model is different — Retake's noise blend gets layered on top of an already-different starting point and the result mixes \"LM variation\" with \"noise variation\".\n- All other knobs (caption, lyrics, BPM, key, duration, guidance_scale, shift, sampler, DCW, etc.) should match the baseline.\n\n### Workflow for retaking a Think-mode result\n1. On the result you like, expand the **📊 Score & LRC & LM Codes** accordion.\n2. Copy the **LM Codes** text out of that result.\n3. Paste it into the **LM Codes Hints** textbox.\n4. **Uncheck Think** — the diffusion model will now use the pasted LM codes verbatim.\n5. 
Lock the main seed to the baseline's seed, enable **Retake**, set `retake_seed`, adjust `variance`.\n\n### Reference\nSee [issue #1155](https://github.com/ace-step/ACE-Step-1.5/issues/1155).",
|
||||
"generation_edit": "## Edit (Flow-Edit Overlay)\n\n**Best for:** morphing an existing audio toward new lyrics or a new style while keeping the source's structure. Smoothly blends source → target via paired velocity-field integration along the diffusion schedule.\n\n### Principle\nFlow-Edit integrates a velocity difference:\n\n```\nz_edit_{t-Δt} = z_edit_t + (V_tar(z_tar_t, c_tar) − V_src(z_src_t, c_src)) · Δt\n```\n\nwhere `c_src` and `c_tar` are paired text/lyric conditions. Both branches share the same audio context, so V_delta isolates the **text-only** delta and accumulates it into `z_edit` over `[n_min, n_max]` of the schedule.\n\n### Workflow\n1. Pick **Custom** (recommended for new edits) or **Remix** (when you want to keep cover-style scaffolding from a reference audio).\n2. Upload the audio to edit into the **Source Audio** field at the top.\n3. Fill the top-level **Music Caption** / **Lyrics** with the **original** description of the source audio (this will be V_src in the next step).\n4. Tick **Edit** to expand the panel.\n5. Click **Copy current → source** — this snapshots the *original* into the source fields below as V_src.\n6. Now edit the top-level **Music Caption** / **Lyrics** to define the **target** (what you want the morphed result to be) — V_tar.\n7. Leave `n_min=0`, `n_max=1`, `n_avg=1` for the first try.\n8. In **Optional Parameters** set `shift=3.0` (recommended for all variants).\n9. Click **Generate Music**.\n\n> Tip: if you arrived here via **Send to Remix** / **Send to Repaint**, the source fields are already pre-filled from the previous run — skip steps 3 and 5 and go straight to editing the top-level fields as the target.\n\n### Mode behaviour\n- **Custom** (text2music): the backend silently **ignores Think and `LM Codes Hints`** — it always VAE-encodes your `Source Audio` directly so flow-edit's `z_edit` starts on a clean audio latent. V_delta is purely text-driven. 
Most reliable mode for v1.\n- **Remix** (cover / cover-nofsq): backend uses cover's natural LM-codes context, shared by both branches. Good when you want the morphed result to retain the cover task's reference scaffolding.\n- **Repaint / Extract / Lego**: not supported in v1 — the backend logs a warning and falls back to plain task behaviour (paired-CFG for those task shapes is the follow-up PR).\n\n### Inputs\n- **source caption / lyrics**: describe the *original* audio (V_src).\n- top-level **caption / lyrics**: the *target* you want (V_tar).\n- **n_min / n_max**: diffusion-schedule window where V_delta is integrated. `0 / 1` = full schedule (recommended).\n- **n_avg**: Monte-Carlo samples per step (1 = fast; higher = more stable, slower).\n\n### Models\nAll six DiT variants are supported. Turbo / XL-turbo are CFG-distilled, so the backend automatically forces `guidance_scale=1.0` for them. 8 inference steps is enough for turbo; bump to ≥60 for `base / sft / xl_base / xl_sft`.\n\n### Method (deeper)\n1. Encode `flow_edit_source_caption / lyrics` → V_src condition.\n2. Encode top-level `caption / lyrics` → V_tar condition.\n3. Run paired CFG/APG forward at each step in `[n_min, n_max]`.\n4. Integrate `(V_tar − V_src)` into the running `z_edit`.\n5. After `n_max`, Euler-step toward the clean latent on the target branch only.\n\n### Tips\n- `Send to Remix` / `Send to Repaint` already auto-fills source fields from the previous run's prompts so you can iterate quickly.\n- The recommended `shift=3.0` matches ACE-Step 1.0's default flow-edit schedule and is empirically stable across all six variants.\n\n### Reference\n- Kulikov, V. et al. *FlowEdit: Inversion-Free Text-Based Editing Using Pre-Trained Flow Models.* CVPR 2025. [arXiv:2412.08629](https://arxiv.org/abs/2412.08629)\n- [Issue #1156](https://github.com/ace-step/ACE-Step-1.5/issues/1156).",
|
||||
"results": "## Results Section\n\n### Per-Sample Controls\n- **Audio Player**: Play, pause, download\n- **Send To Remix/Repaint**: Use this result as source for further editing\n- **Save**: Export audio + metadata as JSON\n- **Score**: Calculate quality score (perplexity-based)\n- **LRC**: Generate lyrics timestamps\n\n### Batch Navigation\n- Use **◀ Previous** / **Next ▶** to browse batches\n- Enable **AutoGen** to auto-generate next batch\n- Click **Apply These Settings to UI** to reuse parameters from a good result\n\n### Tips\n- Generate 2–4 variations (batch size) and pick the best\n- Use Score to objectively compare results\n- Save good results for reference",
|
||||
"training_dataset": "## Dataset Builder Tutorial\n\n### Step 1: Load or Scan\n- **Load**: Enter path to existing dataset JSON → Click Load\n- **Scan**: Enter audio folder path → Click Scan\n - Supported: wav, mp3, flac, ogg, opus\n\n### Step 2: Configure\n- Set **Dataset Name**\n- Check **All Instrumental** if no vocals\n- Set **Custom Activation Tag** (unique trigger word for your LoRA)\n- Choose **Tag Position**: Prepend, Append, or Replace\n\n### Step 3: Auto-Label\n- Click **Auto-Label All** to generate captions, BPM, key, time sig\n- Use **Skip Metas** to skip BPM/Key/TimeSig (faster)\n\n### Step 4: Preview & Edit\n- Use slider to browse samples\n- Edit caption, lyrics, BPM, key manually\n- Click **Save Changes** per sample\n\n### Step 5: Save\n- Enter save path → Click **Save Dataset**\n\n### Step 6: Preprocess\n- Set tensor output directory → Click **Preprocess**\n- This encodes audio/text to tensors for training\n\n### 📖 Documentation\n- [LoRA Training Tutorial](https://github.com/ACE-Step/ACE-Step-1.5/blob/main/docs/en/LoRA_Training_Tutorial.md) — Full step-by-step guide\n- [Side-Step Advanced Training](https://github.com/ACE-Step/ACE-Step-1.5/blob/main/docs/sidestep/Getting%20Started.md) — CLI-based training with advanced features",
|
||||
"training_train": "## LoRA Training Tutorial\n\n### Setup\n1. Enter **Preprocessed Tensors Directory** → Click **Load Dataset**\n2. Configure LoRA:\n - **Rank** (r): 64 default. Higher = more capacity\n - **Alpha**: Usually 2× rank (128)\n - **Dropout**: 0.1 for regularization\n\n### Training\n3. Set **Learning Rate** (start with 1e-4)\n4. Set **Max Epochs** (500 default)\n5. Click **Start Training**\n6. Monitor loss curve — it should decrease over time\n7. Click **Stop Training** when satisfied\n\n### Export\n8. Enter export path → Click **Export LoRA**\n9. Load in Settings: set LoRA Path → Load LoRA → Enable Use LoRA\n\n### 🚀 Try LoKr for Faster Training\nLoKr has greatly improved training efficiency. What used to take an hour now only takes 5 minutes — **over 10× faster**. This is crucial for training on consumer-grade GPUs. Switch to the **Train LoKr** tab to get started.\n\n### Tips\n- Use small batch size (1) if VRAM is limited\n- Gradient accumulation increases effective batch size\n- Save checkpoints frequently (every 200 epochs)\n\n### 📖 Documentation\n- [LoRA Training Tutorial](https://github.com/ACE-Step/ACE-Step-1.5/blob/main/docs/en/LoRA_Training_Tutorial.md) — Full step-by-step guide\n- [Side-Step Advanced Training](https://github.com/ACE-Step/ACE-Step-1.5/blob/main/docs/sidestep/Getting%20Started.md) — CLI training with corrected timesteps, LoKR, VRAM optimization",
|
||||
|
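The `generation_retake` help text above describes a variance-preserving sin/cos blend of two noise draws. A minimal sketch of that formula on plain Python lists, assuming independent unit-variance inputs; `blend_weights` and `retake_blend` are hypothetical names, and the real code presumably operates on tensors rather than lists.

```python
import math


def blend_weights(v: float) -> tuple[float, float]:
    """Return (cos, sin) weights for a blend factor v in [0, 1]."""
    if not 0.0 <= v <= 1.0:
        raise ValueError("variance must lie in [0, 1]")
    angle = v * math.pi / 2
    return math.cos(angle), math.sin(angle)


def retake_blend(base_noise, retake_noise, v):
    """mixed = cos(v * pi/2) * base_noise + sin(v * pi/2) * retake_noise."""
    c, s = blend_weights(v)
    return [c * b + s * r for b, r in zip(base_noise, retake_noise)]
```

Because the squared weights sum to one, mixing two independent unit-variance draws yields another unit-variance draw, which is why `v` can be swept from 0 to 1 without changing the overall noise scale.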
|
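The `generation_edit` help text above gives the Flow-Edit update as an integration of `V_tar − V_src` over the `[n_min, n_max]` window, averaged over `n_avg` samples. The toy sketch below illustrates that loop on plain lists. It is not the repository's `flowedit_generate_audio`: the `velocity` callable, the signed step `dt = t_next - t_cur`, and the way `z_src` / `z_tar` are formed (following the cited FlowEdit paper) are assumptions, and the target-only Euler steps outside the window are omitted.

```python
import random


def flow_edit_sketch(x_src, velocity, c_src, c_tar, timesteps,
                     n_min=0.0, n_max=1.0, n_avg=1, seed=0):
    """Accumulate (V_tar - V_src) * dt into z_edit, starting from the source latent."""
    rng = random.Random(seed)
    z_edit = list(x_src)
    for t_cur, t_next in zip(timesteps[:-1], timesteps[1:]):
        if not (n_min <= t_cur <= n_max):
            continue  # outside the integration window (simplification)
        dt = t_next - t_cur  # negative when the schedule runs from 1 down to 0
        delta = [0.0] * len(x_src)
        for _ in range(n_avg):
            noise = [rng.gauss(0.0, 1.0) for _ in x_src]
            # Noisy source latent at time t, and the matching target latent.
            z_src = [(1 - t_cur) * x + t_cur * n for x, n in zip(x_src, noise)]
            z_tar = [ze + zs - x for ze, zs, x in zip(z_edit, z_src, x_src)]
            v_src = velocity(z_src, t_cur, c_src)
            v_tar = velocity(z_tar, t_cur, c_tar)
            # Monte-Carlo average of the velocity difference.
            delta = [d + (vt - vs) / n_avg for d, vt, vs in zip(delta, v_tar, v_src)]
        z_edit = [ze + d * dt for ze, d in zip(z_edit, delta)]
    return z_edit
```

With identical source and target conditions the two branches see the same latent and the difference cancels, so `z_edit` stays at the source. That matches the observation in the commit message that the overlay has nothing to integrate when `V_src ≈ V_tar`.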
|
|||
|
|
@@ -90,6 +90,7 @@
     "analyze_btn": "🔍 ניתוח",
     "sample_btn": "🎲 לחץ עליי",
     "lm_codes_hints": "🎼 רמזי קודי LM",
+    "lm_codes_audio_upload_label": "אודיו → קודים (כלי עזר)",
     "lm_codes_label": "רמזי קודי LM",
     "lm_codes_placeholder": "<|audio_code_10695|><|audio_code_54246|>.",
     "lm_codes_info": "הדבק רמזי קודי LM עבור יצירת text2music.",
@ -496,6 +497,8 @@
|
|||
"generation_caption": "## כתיבת תיאורים טובים\n\n### מבנה\nתיאור טוב כולל:\n- **ז'אנר/סגנון**: פופ, רוק, ג'אז, אלקטרוני, קלאסי…\n- **כלים**: גיטרה, פסנתר, סינתיסייזר, תופים, כלי מיתר…\n- **אווירה**: שמח, מלנכולי, אנרגטי, חלומי…\n- **סגנון שירה**: לחישה, עוצמתי, פלסטו, ראפ…\n- **תחושת קצב**: מהיר, איטי, מתון…\n\n### דוגמאות\n- *\"פופ-פאנק אנרגטי עם גיטרות מעוותות, תופים מהירים ושירה צעקנית\"*\n- *\"שלישיית ג'אז רכה עם 'ווקינג בס', תופים מוברשים ופסנתר רגוע\"*",
|
||||
"generation_lyrics": "## כתיבת מילים\n\n### תגיות מבנה\nהשתמש בתגיות כדי לבנות את השיר שלך:\n```\n[Verse 1]\nמילות הבית כאן\n\n[Chorus]\nמילות הפזמון כאן\n\n[Verse 2]\nבית שני\n\n[Bridge]\nקטע מעבר (גשר)\n\n[Outro]\nמילות סיום\n```\n\n### טיפים\n- שמור על בתים של 4–8 שורות\n- פזמונים צריכים להיות זכירים וחוזרים על עצמם\n- השתמש ב-`[Instrumental]` לקטעים ללא שירה",
|
||||
"generation_advanced": "## הגדרות מתקדמות\n\n### פרמטרים מרכזיים\n- **צעדי הסקה**: Turbo=8 (ברירת מחדל), Base=עד 200. יותר צעדים לא תמיד טוב יותר ב-Turbo\n- **Guidance Scale**: למודל Base בלבד. גבוה יותר = נצמד יותר להנחיה\n- **Shift**: הסטת צעדי זמן (1.0–5.0). 3.0 מומלץ ל-turbo\n- **Seed**: הגדר גרעין ספציפי לתוצאות הניתנות לשחזור\n\n### פרמטרי LM\n- **Temperature**: גבוה יותר = יצירתי/אקראי יותר\n- **CFG Scale**: גבוה יותר = עוקב יותר אחרי ההנחיה\n\n### מצב חשיבה (Think)\nהפעל את **Think** לשימוש ב-5Hz LM ליצירה חכמה יותר.",
|
||||
"generation_retake": "## Retake (יצירת וריאציות)\n\nליצירת וריאציות חלקות ומבוקרות מ-baseline עם seed קבוע, מבלי לשנות את ההנחיה.\n\n### עיקרון\nRetake מערבב רעש עצמאי חדש לרעש ההתחלתי בשיטת sin/cos שומרת שונות:\n\n```\nmixed = cos(v · π/2) · base_noise + sin(v · π/2) · retake_noise\n```\n\n`v=0` שווה ערך לבסיס; `v=1` מחליף את הרעש לחלוטין.\n\n### קלט\n- **variance** (0–1): מקדם הערבוב.\n- **seed** (אופציונלי): seed עצמאי לרעש retake.\n\n### ⚠️ דרישת עקביות\nRetake תקף רק אם שאר התנאים תואמים ל-baseline:\n- ה-**seed** הראשי חייב להיות זהה.\n- **Think חייב להיות כבוי** או שצריך להשתמש שוב ב-LM codes של ה-baseline. כשThink מופעל ה-LM מייצר codes חדשים בכל קריאה, אז נקודת ההתחלה של מודל הדיפוזיה משתנה.\n- כל שאר הפרמטרים (caption, lyrics, BPM, key, וכו') צריכים להיות זהים.\n\n### Workflow ל-retake של תוצאה ב-Think\n1. הרחב את **📊 Score & LRC & LM Codes** של התוצאה.\n2. העתק את הטקסט של **LM Codes**.\n3. הדבק לתיבת **LM Codes Hints**.\n4. **בטל את הסימון של Think**.\n5. נעל את ה-seed הראשי, הפעל **Retake**, הגדר `retake_seed`, התאם `variance`.\n\n### הפניה\nראה [issue #1155](https://github.com/ace-step/ACE-Step-1.5/issues/1155).",
|
||||
"generation_edit": "## Edit (Flow-Edit Overlay)\n\nליצירת מורפינג של אודיו קיים לכיוון מילים/סגנון חדשים תוך שמירת המבנה. עובד על-ידי אינטגרציה של שדה מהירויות מזווג לאורך schedule הדיפוזיה.\n\n### עיקרון\nFlow-Edit מחשב הפרש מהירויות:\n\n```\nz_edit_{t-Δt} = z_edit_t + (V_tar(z_tar_t, c_tar) − V_src(z_src_t, c_src)) · Δt\n```\n\nשני הענפים חולקים אותו הקשר אודיו, אז V_delta מבודד את ההפרש **הטקסטואלי בלבד** ומצטבר ב-`z_edit` לאורך `[n_min, n_max]`.\n\n### Workflow\n1. בחר מצב **Custom** (מומלץ לעריכה חדשה) או **Remix** (לשמירה על שלד ה-cover).\n2. העלה את האודיו לעריכה ל-**Source Audio** בראש המסך.\n3. מלא **Music Caption / Lyrics** העליונים עם התיאור **המקורי** של האודיו (יהיה ה-V_src בשלב הבא).\n4. סמן **Edit** לפתיחת הפאנל.\n5. לחץ **Copy current → source** — זה מצלם את התיאור *המקורי* לשדות source שמתחת כ-V_src.\n6. עכשיו ערוך את **Music Caption / Lyrics** העליונים כדי להגדיר את ה-**יעד** (מה שאתה רוצה בתוצאה הממורפת) — V_tar.\n7. השאר `n_min=0`, `n_max=1`, `n_avg=1` בניסיון הראשון.\n8. ב-**Optional Parameters** הגדר `shift=3.0` (מומלץ לכל הוריאציות).\n9. לחץ **Generate Music**.\n\n> טיפ: אם הגעת לכאן דרך **Send to Remix** / **Send to Repaint**, שדות ה-source כבר מולאו אוטומטית מהריצה הקודמת — דלג על שלבים 3 ו-5 ועבור ישר לעריכת השדות העליונים כיעד.\n\n### התנהגות לפי מצב\n- **Custom** (text2music): ה-backend **מתעלם בשקט מ-Think ו-LM Codes Hints** — תמיד עושה VAE-encode על Source Audio שהעלית. V_delta מונע טקסט בלבד. המצב היציב ביותר ב-v1.\n- **Remix** (cover / cover-nofsq): משתמש בהקשר ה-LM-codes הטבעי של cover, משותף לשני הענפים.\n- **Repaint / Extract / Lego**: לא נתמך ב-v1.\n\n### קלט\n- **source caption / lyrics**: מתאר את האודיו ה-*מקורי* (V_src).\n- caption / lyrics עליונים: ה-*יעד* (V_tar).\n- **n_min / n_max**: חלון ה-schedule שבו V_delta משולב.\n- **n_avg**: דגימות Monte-Carlo לצעד.\n\n### מודלים\nכל 6 וריאציות ה-DiT נתמכות. Turbo / XL-turbo הם CFG-distilled, אז ה-backend מאלץ אוטומטית `guidance_scale=1.0`. 
8 צעדים מספיקים ל-turbo; ל-`base / sft / xl_base / xl_sft` כדאי ≥60.\n\n### הפניה\n- Kulikov, V. et al. *FlowEdit: Inversion-Free Text-Based Editing Using Pre-Trained Flow Models.* CVPR 2025. [arXiv:2412.08629](https://arxiv.org/abs/2412.08629)\n- [Issue #1156](https://github.com/ace-step/ACE-Step-1.5/issues/1156).",
|
||||
"results": "## מדור תוצאות\n\n### בקרות לכל דגימה\n- **נגן אודיו**: נגן, השהה, הורד\n- **שלח לרמיקס/צביעה מחדש**: השתמש בתוצאה זו כמקור להמשך עריכה\n- **שמירה**: ייצא אודיו + מטא-דאטה כ-JSON\n- **Score**: חשב ציון איכות\n- **LRC**: ייצר חותמות זמן למילים",
|
||||
"training_dataset": "## מדריך בונה מערכי נתונים\n\n### שלב 1: טעינה או סריקה\n- **טעינה**: הזן נתיב ל-JSON קיים ← לחץ על טעינה\n- **סריקה**: הזן נתיב לתיקיית אודיו ← לחץ על סריקה\n\n### שלב 2: תצורה\n- הגדר **שם מערך נתונים**\n- סמן **הכל אינסטרומנטלי** אם אין שירה\n- הגדר **תג הפעלה מותאם**\n\n### שלב 3: תיוג אוטומטי\n- לחץ על **תיוג אוטומטי להכל** לייצור תיאורים ונתונים מוזיקליים",
|
||||
"training_train": "## מדריך אימון LoRA\n\n### הגדרה\n1. הזן **ספריית טנזורים מעובדים** ← לחץ על **טען מערך נתונים**\n2. הגדר LoRA:\n - **Rank** (r): 64 ברירת מחדל.\n - **Alpha**: בדרך כלל פי 2 מה-Rank (128)\n\n### אימון\n3. הגדר **קצב למידה** (התחל עם 1e-4)\n4. הגדר **מקסימום איטרציות** (500 ברירת מחדל)\n5. לחץ על **התחל אימון**\n\n### ייצוא\n8. הזן נתיב ייצוא ← לחץ על **ייצוא LoRA**",
|
||||
|
|
|
|||
|
|
@@ -90,6 +90,7 @@
     "analyze_btn": "🔍 分析",
     "sample_btn": "🎲 お試し",
     "lm_codes_hints": "🎼 LM コードヒント",
+    "lm_codes_audio_upload_label": "音声 → コード(ユーティリティ)",
     "lm_codes_label": "LM コードヒント",
     "lm_codes_placeholder": "<|audio_code_10695|><|audio_code_54246|>.",
     "lm_codes_info": "text2music生成用のLMコードヒントを貼り付け。",
@ -479,6 +480,8 @@
|
|||
"generation_caption": "## 良いキャプションの書き方\n\n### 構造\n良いキャプションに含めるもの:\n- **ジャンル/スタイル**:pop、rock、jazz、electronic、classical…\n- **楽器**:guitar、piano、synth、drums、strings…\n- **ムード**:upbeat、melancholic、energetic、dreamy…\n- **ボーカルスタイル**:whispered、powerful、falsetto、rap…\n- **テンポ感**:fast、slow、moderate、driving…\n\n### 例\n- *「ディストーションギター、速いドラム、叫ぶボーカルのエネルギッシュなポップパンク」*\n- *「ウォーキングベース、ブラシドラム、メロウなピアノのスムースジャズトリオ」*\n- *「レイヤードシンセパッドとボーカルなしのアンビエントエレクトロニカ」*\n\n### ヒント\n- 詳細 = より良い結果\n- **フォーマット**ボタンでAIにキャプションを強化させる\n- 🎲でキャプション例を確認",
|
||||
"generation_lyrics": "## 歌詞の書き方\n\n### 構造タグ\nセクションタグで曲を構成:\n```\n[Verse 1]\nバースの歌詞をここに\n\n[Chorus]\nコーラスの歌詞をここに\n\n[Verse 2]\n2番のバース\n\n[Bridge]\nブリッジセクション\n\n[Outro]\nエンディングの歌詞\n```\n\n### ヒント\n- バースは4–8行\n- コーラスは記憶に残りやすく反復的に\n- ボーカルなしセクションには `[Instrumental]` や `[Interlude]` を使用\n- 純粋なインストゥルメンタルには**インストゥルメンタル**チェックボックス\n- 歌詞の言語に合わせて**ボーカル言語**を選択",
|
||||
"generation_advanced": "## 高度な設定\n\n### 主要パラメータ\n- **推論ステップ**:Turbo=8(デフォルト)、Base=最大200。turboでは多いステップ≠常に良い\n- **ガイダンススケール**:Baseモデルのみ。高い = プロンプトにより忠実\n- **シフト**:タイムステップシフト(1.0–5.0)。turboには3.0推奨\n- **シード**:特定のシードで再現可能な結果\n\n### LMパラメータ\n- **温度**(0.0–2.0):高い = より創造的/ランダム\n- **CFGスケール**(1.0–3.0):高い = プロンプトにより忠実\n- **Top-K / Top-P**:多様性のサンプリング戦略\n\n### Thinkモード\n**Think**を有効にして5Hz LMでスマートな生成:\n- セマンティックコードとメタデータを生成\n- LMの初期化が必要\n- **並列思考**:バッチを並列処理(高速)",
|
||||
"generation_retake": "## Retake(バリエーション生成)\n\n**用途:** プロンプトを変えずにシード固定のベースラインから滑らかなバリエーションを作る。\n\n### 原理\nRetakeは分散保存型のsin/cosブレンドで、新しい独立ノイズをシード固定ノイズに混ぜます:\n\n```\nmixed = cos(v · π/2) · base_noise + sin(v · π/2) · retake_noise\n```\n\n`cos² + sin² = 1`なのでノイズ分散は完全に保たれます。`v=0`はノーオペ(ベースラインと同一)、`v=1`はノイズを完全に置き換え。\n\n### 入力\n- **variance**(0–1):ブレンド係数。\n - `0` = ベースライン\n - `0.05–0.15` = 微妙なバリエーション\n - `0.3–0.5` = 中程度のドリフト\n - `0.5+` = 強いドリフト\n- **seed**(整数、任意):retakeノイズの独立シード。空 = 毎回ランダム。\n\n### ⚠️ 一貫性要件\nRetakeはベースラインと**他の全条件が一致**する場合のみ意味があります:\n\n- メイン**シード**を一致させる(random-seedチェックボックスをオフ)。\n- **Thinkはオフ**にするか、ベースラインのLMコードを再利用する。Thinkがオンだと毎回LMが新たにaudio codesを生成し、拡散モデルの入力起点が変わるため、Retakeのノイズ混合が「LMドリフト + ノイズドリフト」の混合になる。\n- 他のすべてのパラメータ(caption、lyrics、BPM、key、duration、guidance_scale、shift、sampler、DCWなど)もベースラインと一致させる。\n\n### Thinkモード結果でRetakeするワークフロー\n1. 対象結果の **📊 Score & LRC & LM Codes** アコーディオンを展開。\n2. **LM Codes** テキストをコピー。\n3. **LM Codes Hints** テキストボックスに貼り付け。\n4. **Thinkのチェックを外す** — 拡散モデルは貼り付けたコードをそのまま使う。\n5. メインシードをロックし、**Retake**を有効化、`retake_seed`を設定、`variance`を調整。\n\n### 参考\n[issue #1155](https://github.com/ace-step/ACE-Step-1.5/issues/1155)を参照。",
|
||||
"generation_edit": "## Edit(Flow-Editオーバーレイ)\n\n**用途:** 既存音声を新しい歌詞/スタイルへ滑らかに変形させつつ、元の構造を保つ。拡散scheduleに沿ったペア化速度場積分により実現。\n\n### 原理\nFlow-Editは速度差を積分します:\n\n```\nz_edit_{t-Δt} = z_edit_t + (V_tar(z_tar_t, c_tar) − V_src(z_src_t, c_src)) · Δt\n```\n\n`c_src`と`c_tar`はペアのテキスト/歌詞条件です。両分岐は同じ音声コンテキストを共有するため、V_deltaは**テキストのみ**の差を抽出し、`[n_min, n_max]`で`z_edit`に蓄積します。\n\n### 操作手順\n1. **Custom**(新編集に推奨)または**Remix**(cover の参考骨格を保ちたい場合)モードを選択。\n2. 編集対象音声を上部の **Source Audio** にアップロード。\n3. 上部の **Music Caption / Lyrics** に**元の**説明(source の状態を描写するもの、次のステップで V_src になる)を入力。\n4. **Edit** をチェックしてパネルを展開。\n5. **Copy current → source** ボタンで、上で入力した**元の**条件を下の source 欄にスナップショット(V_src 用)。\n6. 改めて上部の **Music Caption / Lyrics** を**ターゲット**(morph 後に欲しい結果)に書き換える — V_tar。\n7. 初回は `n_min=0`、`n_max=1`、`n_avg=1` のままで OK。\n8. **Optional Parameters** で `shift` を **3.0** に設定(全モデル推奨)。\n9. **Generate Music** をクリック。\n\n> ヒント:**Send to Remix** / **Send to Repaint** から遷移してきた場合、source 欄は前回の実行から自動入力済みなので、ステップ 3 と 5 はスキップして上部をターゲットに編集するだけで OK。\n\n### モード別の挙動\n- **Custom**(text2music):バックエンドは**Thinkと LM Codes Hintsを自動的に無視**し、アップロードした Source Audio をVAE エンコードしてflow-editの起点とする。V_deltaは純粋にテキスト駆動。v1で最も安定。\n- **Remix**(cover / cover-nofsq):cover タスク本来のLM-codesコンテキストを両分岐で共有。cover の参考骨格を保ちたい時に。\n- **Repaint / Extract / Lego**:v1では非対応 — バックエンドは警告ログを出して通常タスクとして実行(これらのtask shape向けペアCFGは後続PR)。\n\n### 入力\n- **source caption / lyrics**:**元の**音声を表現(V_src)。\n- 上部の **caption / lyrics**:望む**ターゲット**(V_tar)。\n- **n_min / n_max**:V_deltaを積分する拡散scheduleウィンドウ。`0 / 1` = 全schedule(推奨)。\n- **n_avg**:1ステップあたりのモンテカルロサンプル数(1 = 高速、高い = 安定だが遅い)。\n\n### モデル\n6つのDiTバリアント全てに対応。Turbo / XL-turboはCFG-distilledなのでバックエンドが自動的に`guidance_scale=1.0`に強制。Turboは8推論ステップで十分、`base / sft / xl_base / xl_sft`は≥60ステップ推奨。\n\n### ヒント\n- `Send to Remix` / `Send to Repaint` は前回の実行のプロンプトを source欄に自動入力するため、繰り返し試行しやすい。\n- 推奨の `shift=3.0` はACE-Step 1.0デフォルトのflow-edit scheduleと一致し、経験上6つのバリアント全てで安定。\n\n### 参考\n- Kulikov, V. et al. 
*FlowEdit: Inversion-Free Text-Based Editing Using Pre-Trained Flow Models.* CVPR 2025. [arXiv:2412.08629](https://arxiv.org/abs/2412.08629)\n- [Issue #1156](https://github.com/ace-step/ACE-Step-1.5/issues/1156)。",
|
||||
"results": "## 結果セクション\n\n### サンプルごとのコントロール\n- **オーディオプレーヤー**:再生、一時停止、ダウンロード\n- **リミックス/リペイントに送信**:この結果をソースとして更に編集\n- **保存**:オーディオ + メタデータをJSONでエクスポート\n- **スコア**:品質スコアを計算(パープレキシティベース)\n- **LRC**:歌詞タイムスタンプを生成\n\n### バッチナビゲーション\n- **◀ 前へ** / **次へ ▶** でバッチを閲覧\n- **自動生成**を有効にして次のバッチを自動生成\n- **これらの設定をUIに適用**をクリックして良い結果のパラメータを再利用\n\n### ヒント\n- 2–4個のバリエーション(バッチサイズ)を生成して最良を選択\n- スコアで客観的に結果を比較\n- 良い結果を参考用に保存",
|
||||
"training_dataset": "## データセットビルダーチュートリアル\n\n### ステップ1:読み込みまたはスキャン\n- **読み込み**:既存のデータセットJSONパスを入力 → 読み込みをクリック\n- **スキャン**:オーディオフォルダパスを入力 → スキャンをクリック\n - 対応:wav、mp3、flac、ogg、opus\n\n### ステップ2:設定\n- **データセット名**を設定\n- ボーカルなしなら**すべてインストゥルメンタル**をチェック\n- **カスタムアクティベーションタグ**を設定(LoRAのユニークなトリガーワード)\n- **タグの位置**を選択:前置、後置、または置換\n\n### ステップ3:自動ラベル\n- **一括自動ラベル**をクリックしてキャプション、BPM、キー、拍子を生成\n- **メタをスキップ**でBPM/キー/拍子をスキップ(高速)\n\n### ステップ4:プレビューと編集\n- スライダーでサンプルを閲覧\n- キャプション、歌詞、BPM、キーを手動編集\n- サンプルごとに**変更を保存**をクリック\n\n### ステップ5:保存\n- 保存パスを入力 → **データセットを保存**をクリック\n\n### ステップ6:前処理\n- テンソル出力ディレクトリを設定 → **前処理**をクリック\n- オーディオ/テキストをトレーニング用テンソルにエンコード\n\n### 📖 ドキュメント\n- [LoRA トレーニングチュートリアル](https://github.com/ACE-Step/ACE-Step-1.5/blob/main/docs/ja/LoRA_Training_Tutorial.md) — 完全なステップバイステップガイド\n- [Side-Step 高度なトレーニング](https://github.com/ACE-Step/ACE-Step-1.5/blob/main/docs/sidestep/Getting%20Started.md) — CLIベースのトレーニング、高度な機能付き",
|
||||
"training_train": "## LoRAトレーニングチュートリアル\n\n### セットアップ\n1. **前処理済みテンソルディレクトリ**を入力 → **データセットを読み込み**をクリック\n2. LoRAを設定:\n - **ランク** (r):デフォルト64。高い = より大きな容量\n - **Alpha**:通常はランクの2倍(128)\n - **Dropout**:正則化に0.1\n\n### トレーニング\n3. **学習率**を設定(1e-4から開始)\n4. **最大エポック数**を設定(デフォルト500)\n5. **トレーニング開始**をクリック\n6. 損失曲線を監視 — 時間とともに減少するはず\n7. 満足したら**トレーニング停止**をクリック\n\n### エクスポート\n8. エクスポートパスを入力 → **LoRAをエクスポート**をクリック\n9. 設定で読み込み:LoRAパスを設定 → LoRAを読み込み → LoRAを使用を有効化\n\n### 🚀 LoKr で高速トレーニング\nLoKr はトレーニング効率を大幅に向上させました。以前は1時間かかっていたトレーニングが、わずか5分で完了します——**10倍以上の高速化**。コンシューマーGPUでのトレーニングに最適です。**Train LoKr** タブに切り替えて始めましょう。\n\n### ヒント\n- VRAMが限られている場合は小さいバッチサイズ(1)を使用\n- 勾配累積で実効バッチサイズを増加\n- チェックポイントを頻繁に保存(200エポックごと)\n\n### 📖 ドキュメント\n- [LoRA トレーニングチュートリアル](https://github.com/ACE-Step/ACE-Step-1.5/blob/main/docs/ja/LoRA_Training_Tutorial.md) — 完全なステップバイステップガイド\n- [Side-Step 高度なトレーニング](https://github.com/ACE-Step/ACE-Step-1.5/blob/main/docs/sidestep/Getting%20Started.md) — CLIトレーニング、修正タイムステップ、LoKR、VRAM最適化",
|
||||
|
|
|
|||
|
|
@ -90,6 +90,7 @@
|
|||
"analyze_btn": "🔍 Analisar",
|
||||
"sample_btn": "🎲 Clique Aqui",
|
||||
"lm_codes_hints": "🎼 Dicas de Códigos LM",
|
||||
"lm_codes_audio_upload_label": "Áudio → Códigos (utilitário)",
|
||||
"lm_codes_label": "Dicas de Códigos LM",
|
||||
"lm_codes_placeholder": "<|audio_code_10695|><|audio_code_54246|>.",
|
||||
"lm_codes_info": "Cole dicas de códigos LM para geração de text2music.",
|
||||
|
|
@ -508,6 +509,8 @@
|
|||
"generation_caption": "## Escrevendo Boas Legendas\n\n### Estrutura\nUma boa legenda inclui:\n- **Gênero/Estilo**: pop, rock, jazz, eletrônico, clássico…\n- **Instrumentos**: guitarra, piano, sintetizador, bateria, cordas…\n- **Humor**: animado, melancólico, enérgico, onírico…\n- **Estilo vocal**: sussurrado, poderoso, falsete, rap…\n- **Sensação de andamento**: rápido, lento, moderado, pulsante…\n\n### Exemplos\n- *\"Pop-punk energético com guitarras distorcidas, bateria rápida e vocais gritados\"*\n- *\"Trio de jazz suave com baixo walking, bateria com vassourinhas e piano tranquilo\"*\n- *\"Eletrônico ambiente com camadas de pads de sintetizador e sem vocais\"*\n\n### Dicas\n- Mais detalhes = melhores resultados\n- Use o botão **Formatar** para deixar a IA aprimorar sua legenda\n- Marque 🎲 para legendas de exemplo",
|
||||
"generation_lyrics": "## Escrevendo Letras\n\n### Tags de Estrutura\nUse tags de seção para estruturar sua música:\n```\n[Verse 1]\nSua letra do verso aqui\n\n[Chorus]\nSua letra do refrão aqui\n\n[Verse 2]\nSegundo verso aqui\n\n[Bridge]\nSeção da ponte\n\n[Outro]\nLetras de encerramento\n```\n\n### Dicas\n- Mantenha os versos com 4–8 linhas\n- Os refrões devem ser memoráveis e repetitivos\n- Use `[Instrumental]` ou `[Interlude]` para seções sem vocais\n- Marque a caixa **Instrumental** para música puramente instrumental\n- Selecione o **Idioma Vocal** para corresponder ao idioma da sua letra",
|
||||
"generation_advanced": "## Configurações Avançadas\n\n### Parâmetros Principais\n- **Etapas de Inferência**: turbo=8 (padrão), base=até 200. Mais etapas ≠ sempre melhor para turbo\n- **Escala de Orientação**: Somente modelo base. Maior = segue o prompt com mais rigor\n- **Shift**: Deslocamento de timestep (1,0–5,0). 3,0 recomendado para turbo\n- **Semente**: Defina uma semente específica para resultados reproduzíveis\n\n### Parâmetros do LM\n- **Temperatura** (0,0–2,0): Maior = mais criativo/aleatório\n- **Escala CFG** (1,0–3,0): Maior = segue mais o prompt\n- **Top-K / Top-P**: Estratégias de amostragem para diversidade\n\n### Modo Pensar\nAtive **Pensar** para usar o LM 5Hz para geração mais inteligente:\n- Gera códigos semânticos e metadados\n- Requer que o LM esteja inicializado\n- **ParallelThinking**: Processa lotes em paralelo (mais rápido)",
|
||||
"generation_retake": "## Retake (Geração de Variação)\n\n**Ideal para:** produzir variações controladas e suaves de uma baseline com seed fixa, sem alterar prompts.\n\n### Princípio\nRetake mistura uma nova amostra de ruído independente ao ruído inicial via mistura sin/cos com preservação de variância:\n\n```\nmixed = cos(v · π/2) · base_noise + sin(v · π/2) · retake_noise\n```\n\nComo `cos² + sin² = 1`, a variância total é exatamente preservada. `v=0` é noop (idêntico à baseline); `v=1` substitui o ruído por completo.\n\n### Entradas\n- **variance** (0–1): fator de mistura.\n - `0` = baseline\n - `0.05–0.15` = variação sutil\n - `0.3–0.5` = drift moderado\n - `0.5+` = drift forte\n- **seed** (inteiro, opcional): seed independente para o ruído de retake. Vazio = aleatório a cada chamada.\n\n### ⚠️ Requisito de consistência\nRetake só é significativo se as outras condições baterem com a baseline:\n\n- A **seed** principal precisa ser a mesma (desligue random-seed).\n- **Think** deve estar **off**, OU os LM codes da baseline devem ser reutilizados. Com Think on o LM regenera audio codes a cada chamada, então o ponto de partida do modelo de difusão muda — a mistura de ruído do Retake fica sobreposta a um ponto já diferente.\n- Todos os outros parâmetros (caption, lyrics, BPM, key, duration, guidance_scale, shift, sampler, DCW, etc.) devem bater com a baseline.\n\n### Workflow para retake de um resultado em modo Think\n1. Expanda **📊 Score & LRC & LM Codes** no resultado desejado.\n2. Copie o texto **LM Codes** desse resultado.\n3. Cole no campo **LM Codes Hints**.\n4. **Desmarque Think** — o modelo vai usar os codes colados literalmente.\n5. Trave a seed principal, ative **Retake**, defina `retake_seed`, ajuste `variance`.\n\n### Referência\nVeja [issue #1155](https://github.com/ace-step/ACE-Step-1.5/issues/1155).",
|
||||
"generation_edit": "## Edit (Overlay Flow-Edit)\n\n**Ideal para:** transformar um áudio existente em direção a uma nova letra ou estilo, mantendo a estrutura da fonte. Faz transição suave da fonte → alvo via integração de campos de velocidade pareados ao longo do schedule de difusão.\n\n### Princípio\nFlow-Edit integra uma diferença de velocidades:\n\n```\nz_edit_{t-Δt} = z_edit_t + (V_tar(z_tar_t, c_tar) − V_src(z_src_t, c_src)) · Δt\n```\n\nonde `c_src` e `c_tar` são condicionamentos texto/letra pareados. Ambos os ramos compartilham o mesmo contexto de áudio, então V_delta isola a diferença **somente de texto** e a acumula em `z_edit` ao longo de `[n_min, n_max]` do schedule.\n\n### Workflow\n1. Escolha **Custom** (recomendado para novas edições) ou **Remix** (quando quiser preservar o scaffolding do cover).\n2. Suba o áudio a editar no campo **Source Audio** no topo.\n3. Preencha **Music Caption** / **Lyrics** no topo com a descrição **original** do áudio de origem (será o V_src no próximo passo).\n4. Marque **Edit** para expandir o painel.\n5. Clique **Copy current → source** — isso captura o conteúdo *original* nos campos source abaixo como V_src.\n6. Agora edite os campos topo **Music Caption** / **Lyrics** para definir o **alvo** (o que você quer no resultado morphado) — V_tar.\n7. Mantenha `n_min=0`, `n_max=1`, `n_avg=1` na primeira tentativa.\n8. Em **Optional Parameters**, defina `shift=3.0` (recomendado para todas as variantes).\n9. Clique **Generate Music**.\n\n> Dica: se você chegou aqui via **Send to Remix** / **Send to Repaint**, os campos source já estão pré-preenchidos da execução anterior — pule os passos 3 e 5 e vá direto editar os campos topo como alvo.\n\n### Comportamento por modo\n- **Custom** (text2music): o backend **ignora silenciosamente Think e LM Codes Hints** — sempre faz VAE-encode do seu Source Audio para o ponto de partida do flow-edit. V_delta é puramente texto. 
Modo mais confiável da v1.\n- **Remix** (cover / cover-nofsq): o backend usa o contexto natural de LM-codes do cover, compartilhado entre os dois ramos. Bom quando você quer reter o scaffolding do cover.\n- **Repaint / Extract / Lego**: não suportado na v1 — o backend loga um aviso e roda o task normal (CFG pareado para esses task shapes é o PR seguinte).\n\n### Entradas\n- **source caption / lyrics**: descreve o áudio *original* (V_src).\n- topo **caption / lyrics**: o *alvo* desejado (V_tar).\n- **n_min / n_max**: janela do schedule onde V_delta é integrado. `0 / 1` = schedule completo (recomendado).\n- **n_avg**: amostras Monte Carlo por passo (1 = rápido; mais = estável mas lento).\n\n### Modelos\nTodas as 6 variantes DiT são suportadas. Turbo / XL-turbo são CFG-distilled, então o backend força automaticamente `guidance_scale=1.0`. 8 etapas de inferência bastam para turbo; suba para ≥60 em `base / sft / xl_base / xl_sft`.\n\n### Dicas\n- `Send to Remix` / `Send to Repaint` já preenche os campos source com o prompt da execução anterior, agilizando iteração.\n- O `shift=3.0` recomendado bate com o schedule padrão de flow-edit do ACE-Step 1.0 e é empiricamente estável nas 6 variantes.\n\n### Referência\n- Kulikov, V. et al. *FlowEdit: Inversion-Free Text-Based Editing Using Pre-Trained Flow Models.* CVPR 2025. [arXiv:2412.08629](https://arxiv.org/abs/2412.08629)\n- [Issue #1156](https://github.com/ace-step/ACE-Step-1.5/issues/1156).",
|
||||
"results": "## Seção de Resultados\n\n### Controles por Amostra\n- **Reprodutor de Áudio**: Reproduzir, pausar, baixar\n- **Enviar para Remix/Repintar**: Use este resultado como fonte para edição adicional\n- **Salvar**: Exportar áudio + metadados como JSON\n- **Pontuação**: Calcular pontuação de qualidade (baseada em perplexidade)\n- **LRC**: Gerar timestamps de letra\n\n### Navegação em Lote\n- Use **◀ Anterior** / **Próximo ▶** para navegar pelos lotes\n- Ative **AutoGen** para gerar o próximo lote automaticamente\n- Clique em **Aplicar Estas Configurações à Interface** para reutilizar parâmetros de um bom resultado\n\n### Dicas\n- Gere 2–4 variações (tamanho do lote) e escolha a melhor\n- Use Pontuação para comparar resultados objetivamente\n- Salve bons resultados para referência",
|
||||
"training_dataset": "## Tutorial do Construtor de Dataset\n\n### Passo 1: Carregar ou Escanear\n- **Carregar**: Insira o caminho para o JSON do dataset existente → Clique em Carregar\n- **Escanear**: Insira o caminho da pasta de áudio → Clique em Escanear\n - Suportados: wav, mp3, flac, ogg, opus\n\n### Passo 2: Configurar\n- Defina o **Nome do Dataset**\n- Marque **Tudo Instrumental** se não houver vocais\n- Defina a **Tag de Ativação Personalizada** (palavra-gatilho única para seu LoRA)\n- Escolha a **Posição da Tag**: Antes, Depois ou Substituir\n\n### Passo 3: Rotular Automaticamente\n- Clique em **Rotular Tudo Automaticamente** para gerar legendas, BPM, tom e fórmula de compasso\n- Use **Ignorar Metas** para pular BPM/Tom/Fórmula de Compasso (mais rápido)\n\n### Passo 4: Visualizar e Editar\n- Use o controle deslizante para navegar pelas amostras\n- Edite legenda, letra, BPM e tom manualmente\n- Clique em **Salvar Alterações** por amostra\n\n### Passo 5: Salvar\n- Insira o caminho de salvamento → Clique em **Salvar Dataset**\n\n### Passo 6: Pré-processar\n- Defina o diretório de saída dos tensores → Clique em **Pré-processar**\n- Isso codifica áudio/texto em tensores para treinamento\n\n### 📖 Documentação\n- [Tutorial de Treinamento LoRA](https://github.com/ACE-Step/ACE-Step-1.5/blob/main/docs/en/LoRA_Training_Tutorial.md) — Guia completo passo a passo\n- [Treinamento Avançado Side-Step](https://github.com/ACE-Step/ACE-Step-1.5/blob/main/docs/sidestep/Getting%20Started.md) — Treinamento via CLI com recursos avançados",
|
||||
"training_train": "## Tutorial de Treinamento LoRA\n\n### Configuração\n1. Insira o **Diretório de Tensores Pré-processados** → Clique em **Carregar Dataset**\n2. Configure o LoRA:\n - **Rank** (r): 64 padrão. Maior = mais capacidade\n - **Alpha**: Geralmente 2× o rank (128)\n - **Dropout**: 0,1 para regularização\n\n### Treinamento\n3. Defina a **Taxa de Aprendizado** (comece com 1e-4)\n4. Defina os **Máximo de Epochs** (500 padrão)\n5. Clique em **Iniciar Treinamento**\n6. Monitore a curva de perda — ela deve diminuir ao longo do tempo\n7. Clique em **Parar Treinamento** quando estiver satisfeito\n\n### Exportação\n8. Insira o caminho de exportação → Clique em **Exportar LoRA**\n9. Carregue nas Configurações: defina o Caminho do LoRA → Carregar LoRA → Ative Usar LoRA\n\n### 🚀 Experimente LoKr para Treinamento Mais Rápido\nO LoKr melhorou muito a eficiência do treinamento. O que antes levava uma hora agora leva apenas 5 minutos — **mais de 10× mais rápido**. Isso é essencial para treinar em GPUs de nível consumidor. Mude para a aba **Treinar LoKr** para começar.\n\n### Dicas\n- Use tamanho de lote pequeno (1) se o VRAM for limitado\n- O acúmulo de gradiente aumenta o tamanho efetivo do lote\n- Salve checkpoints com frequência (a cada 200 epochs)\n\n### 📖 Documentação\n- [Tutorial de Treinamento LoRA](https://github.com/ACE-Step/ACE-Step-1.5/blob/main/docs/en/LoRA_Training_Tutorial.md) — Guia completo passo a passo\n- [Treinamento Avançado Side-Step](https://github.com/ACE-Step/ACE-Step-1.5/blob/main/docs/sidestep/Getting%20Started.md) — Treinamento CLI com timesteps corrigidos, LoKR e otimização de VRAM",
|
||||
|
|
|
|||
|
|
@ -90,6 +90,7 @@
|
|||
"analyze_btn": "🔍 分析",
|
||||
"sample_btn": "🎲 试试看",
|
||||
"lm_codes_hints": "🎼 LM 代码提示",
|
||||
"lm_codes_audio_upload_label": "音频 → Codes(工具)",
|
||||
"lm_codes_label": "LM 代码提示",
|
||||
"lm_codes_placeholder": "<|audio_code_10695|><|audio_code_54246|>.",
|
||||
"lm_codes_info": "粘贴用于text2music生成的LM代码提示。",
|
||||
|
|
@ -490,6 +491,8 @@
|
|||
"generation_caption": "## 编写好的描述\n\n### 结构\n好的描述包括:\n- **流派/风格**:pop、rock、jazz、electronic、classical…\n- **乐器**:guitar、piano、synth、drums、strings…\n- **情绪**:upbeat、melancholic、energetic、dreamy…\n- **人声风格**:whispered、powerful、falsetto、rap…\n- **节奏感**:fast、slow、moderate、driving…\n\n### 示例\n- *\"充满能量的朋克流行,失真吉他,快速鼓点,嘶吼人声\"*\n- *\"流畅的爵士三重奏,行走贝斯,刷子鼓,柔和钢琴\"*\n- *\"氛围电子,层叠的合成器垫和无人声\"*\n\n### 提示\n- 越详细 = 越好的结果\n- 使用**格式化**按钮让 AI 增强你的描述\n- 点击 🎲 查看示例描述",
|
||||
"generation_lyrics": "## 编写歌词\n\n### 结构标签\n使用段落标签来组织歌曲:\n```\n[Verse 1]\n在这里写你的主歌歌词\n\n[Chorus]\n在这里写你的副歌歌词\n\n[Verse 2]\n第二段主歌\n\n[Bridge]\n过渡段\n\n[Outro]\n结尾歌词\n```\n\n### 提示\n- 主歌保持 4-8 行\n- 副歌应该朗朗上口且有重复性\n- 使用 `[Instrumental]` 或 `[Interlude]` 表示无人声段落\n- 勾选**纯音乐**复选框生成纯器乐音乐\n- 选择**人声语言**以匹配歌词语言",
|
||||
"generation_advanced": "## 高级设置\n\n### 关键参数\n- **推理步数**:Turbo=8(默认),Base=最多 200。对 turbo 来说更多步数不一定更好\n- **引导比例**:仅 Base 模型。越高 = 越严格遵循提示\n- **Shift**:时间步偏移(1.0–5.0)。turbo 推荐 3.0\n- **种子**:设置特定种子以获得可重现的结果\n\n### LM 参数\n- **温度**(0.0–2.0):越高 = 越有创意/随机\n- **CFG 比例**(1.0–3.0):越高 = 越遵循提示\n- **Top-K / Top-P**:多样性的采样策略\n\n### Think 模式\n启用 **Think** 使用 5Hz LM 进行更智能的生成:\n- 生成语义代码和元数据\n- 需要初始化 LM\n- **并行思考**:并行处理批次(更快)",
|
||||
"generation_retake": "## Retake(变体生成)\n\n**适用场景:** 在不改变提示的情况下,对已固定 seed 的 baseline 做受控的、平滑的变体。\n\n### 原理\nRetake 通过 variance-preserving 的 sin/cos 混合,把一组独立的新噪声叠加到固定 seed 噪声上:\n\n```\nmixed = cos(v · π/2) · base_noise + sin(v · π/2) · retake_noise\n```\n\n由于 `cos² + sin² = 1`,总噪声方差精确保持不变。`v=0` 完全等价于不开 Retake;`v=1` 把噪声整组替换。\n\n### 输入\n- **variance**(0–1):混合系数。\n - `0` = baseline(与不开 Retake 完全一致)\n - `0.05–0.15` = 微变,旋律/结构基本不变,细节略有差异\n - `0.3–0.5` = 中等漂移\n - `0.5+` = 强漂移,可能跑得比较远\n- **seed**(整数,可选):retake 噪声的独立种子。留空 = 每次随机。\n\n### ⚠️ 一致性要求\nRetake 只有在**其它所有条件都跟 baseline 一致**时才有意义。具体:\n\n- 主 **seed** 必须一致——关掉 random-seed 复选框,沿用 baseline 那一批的 seed。\n- **Think 必须关闭**,或者复用 baseline 的 LM codes。Think 开启时 LM 每次都重新生成 audio codes,扩散模型的输入起点就变了——Retake 的噪声混合叠加在已经不同的起点上,结果是「LM 漂移 + 噪声漂移」混在一起。\n- 其它所有参数(caption、lyrics、BPM、key、duration、guidance_scale、shift、sampler、DCW 等)都应跟 baseline 对齐。\n\n### 在 Think 模式结果上做 Retake 的工作流\n1. 在你想做变体的那个结果上,展开 **📊 Score & LRC & LM Codes** 折叠面板。\n2. 复制其中的 **LM Codes** 文本。\n3. 粘贴到生成区顶部的 **LM Codes Hints** 输入框。\n4. **取消勾选 Think** —— 现在扩散模型会直接使用粘贴的 LM codes,不再重新生成。\n5. 锁定主 seed 为 baseline 的 seed,勾选 **Retake**,给 `retake_seed` 一个固定整数,调 `variance`。\n\n### 参考\n详见 [issue #1155](https://github.com/ace-step/ACE-Step-1.5/issues/1155)。",
|
||||
"generation_edit": "## Edit(Flow-Edit 叠加)\n\n**适用场景:** 把现有音频朝新的歌词/风格平滑漂移,同时保留原音频的结构。通过沿扩散 schedule 做配对速度场积分实现。\n\n### 原理\nFlow-Edit 在扩散 schedule 上积分一个速度差:\n\n```\nz_edit_{t-Δt} = z_edit_t + (V_tar(z_tar_t, c_tar) − V_src(z_src_t, c_src)) · Δt\n```\n\n其中 `c_src` 和 `c_tar` 是配对的文本/歌词条件。两个分支共用相同的音频上下文,所以 V_delta 隔离出**纯文本**层面的差异,沿 `[n_min, n_max]` 段累积进 `z_edit`。\n\n### 操作流程\n1. 选择 **Custom**(推荐做新编辑)或 **Remix**(想保留 cover 的参考骨架)模式。\n2. 把待编辑的音频上传到顶部的 **Source Audio** 输入框。\n3. 在顶部的 **Music Caption / Lyrics** 填**原始**描述(描述源音频是什么样的,这是下一步的 V_src)。\n4. 勾选 **Edit** 展开面板。\n5. 点 **Copy current → source** —— 把刚填的**原始**条件快照到下面的 source 字段,作为 V_src。\n6. 现在再把顶层 **Music Caption / Lyrics** 改成**目标**(你想要 morph 后的结果)—— V_tar。\n7. 第一次试用建议保留 `n_min=0`、`n_max=1`、`n_avg=1`。\n8. 在 **Optional Parameters** 里把 `shift` 设为 **3.0**(所有变体都推荐)。\n9. 点 **Generate Music**。\n\n> 提示:如果你是通过 **Send to Remix** / **Send to Repaint** 跳过来的,source 字段已经从上次的运行自动填好——直接跳过第 3、5 步,去顶层改成目标就行。\n\n### 不同模式的行为\n- **Custom**(text2music):后端**自动忽略 Think 和 LM Codes Hints** —— 始终对你上传的 Source Audio 做 VAE 编码作为 flow-edit 的起点。V_delta 完全由文本驱动。v1 最稳定的模式。\n- **Remix**(cover / cover-nofsq):后端复用 cover 自然的 LM-codes 上下文,两分支共享。适合想保留 cover 任务参考骨架的场景。\n- **Repaint / Extract / Lego**:v1 不支持 —— 后端会打 warning 并按普通任务跑(这些 task shape 的配对 CFG 推导留给后续 PR)。\n\n### 输入\n- **source caption / lyrics**:描述**原始**音频(V_src)。\n- 顶层 **caption / lyrics**:你想要的**目标**(V_tar)。\n- **n_min / n_max**:V_delta 积分的扩散 schedule 窗口。`0 / 1` = 全 schedule(推荐)。\n- **n_avg**:每步的蒙特卡洛采样数(1 = 快;越高 = 越稳定但越慢)。\n\n### 模型\n6 个 DiT 变体全部支持。Turbo / XL-turbo 是 CFG-distilled 的,后端会自动强制 `guidance_scale=1.0`。Turbo 的 8 推理步够用;`base / sft / xl_base / xl_sft` 建议 ≥60 步。\n\n### 提示\n- `Send to Remix` / `Send to Repaint` 已经会把上次运行的 prompt 自动填到 source 字段,方便快速迭代。\n- 推荐的 `shift=3.0` 与 ACE-Step 1.0 默认 flow-edit schedule 一致,经验上对 6 个变体都稳定。\n\n### 参考\n- Kulikov, V. et al. *FlowEdit: Inversion-Free Text-Based Editing Using Pre-Trained Flow Models.* CVPR 2025. 
[arXiv:2412.08629](https://arxiv.org/abs/2412.08629)\n- [Issue #1156](https://github.com/ace-step/ACE-Step-1.5/issues/1156)。",
|
||||
"results": "## 结果区域\n\n### 每个样本的控制\n- **音频播放器**:播放、暂停、下载\n- **发送到混音/重绘**:将此结果作为源进行进一步编辑\n- **保存**:导出音频 + 元数据为 JSON\n- **评分**:计算质量分数(基于困惑度)\n- **LRC**:生成歌词时间戳\n\n### 批次导航\n- 使用 **◀ 上一个** / **下一个 ▶** 浏览批次\n- 启用 **自动生成** 自动生成下一批\n- 点击**将这些设置应用到 UI** 以重用好结果的参数\n\n### 提示\n- 生成 2-4 个变体(批量大小)并选择最好的\n- 使用评分客观比较结果\n- 保存好的结果以备参考",
|
||||
"training_dataset": "## 数据集构建教程\n\n### 步骤 1:加载或扫描\n- **加载**:输入现有数据集 JSON 路径 → 点击加载\n- **扫描**:输入音频文件夹路径 → 点击扫描\n - 支持:wav、mp3、flac、ogg、opus\n\n### 步骤 2:配置\n- 设置**数据集名称**\n- 勾选**全部为纯音乐**(如果没有人声)\n- 设置**自定义激活标签**(LoRA 的唯一触发词)\n- 选择**标签位置**:前置、后置或替换\n\n### 步骤 3:自动标注\n- 点击**自动标注全部**生成描述、BPM、调性、拍号\n- 使用**跳过元数据**跳过 BPM/调性/拍号(更快)\n\n### 步骤 4:预览与编辑\n- 使用滑块浏览样本\n- 手动编辑描述、歌词、BPM、调性\n- 每个样本点击**保存更改**\n\n### 步骤 5:保存\n- 输入保存路径 → 点击**保存数据集**\n\n### 步骤 6:预处理\n- 设置张量输出目录 → 点击**预处理**\n- 将音频/文本编码为张量用于训练\n\n### 📖 文档\n- [LoRA 训练教程](https://github.com/ACE-Step/ACE-Step-1.5/blob/main/docs/zh/LoRA_Training_Tutorial.md) — 完整分步指南\n- [Side-Step 高级训练](https://github.com/ACE-Step/ACE-Step-1.5/blob/main/docs/sidestep/Getting%20Started.md) — 命令行训练,支持高级功能",
|
||||
"training_train": "## LoRA 训练教程\n\n### 设置\n1. 输入**预处理张量目录** → 点击**加载数据集**\n2. 配置 LoRA:\n - **秩** (r):默认 64。越高 = 容量越大\n - **Alpha**:通常为秩的 2 倍(128)\n - **Dropout**:0.1 用于正则化\n\n### 训练\n3. 设置**学习率**(从 1e-4 开始)\n4. 设置**最大轮数**(默认 500)\n5. 点击**开始训练**\n6. 监控损失曲线 — 应该随时间下降\n7. 满意时点击**停止训练**\n\n### 导出\n8. 输入导出路径 → 点击**导出 LoRA**\n9. 在设置中加载:设置 LoRA 路径 → 加载 LoRA → 启用使用 LoRA\n\n### 🚀 推荐使用 LoKr 加速训练\nLoKr 大幅提升了训练效率,原来需要一小时的训练现在只需 5 分钟——**速度提升超过 10 倍**。这对于在消费级 GPU 上训练至关重要。切换到 **Train LoKr** 标签页即可开始。\n\n### 提示\n- 显存有限时使用小批量(1)\n- 梯度累积增加有效批量大小\n- 频繁保存检查点(每 200 轮)\n\n### 📖 文档\n- [LoRA 训练教程](https://github.com/ACE-Step/ACE-Step-1.5/blob/main/docs/zh/LoRA_Training_Tutorial.md) — 完整分步指南\n- [Side-Step 高级训练](https://github.com/ACE-Step/ACE-Step-1.5/blob/main/docs/sidestep/Getting%20Started.md) — 命令行训练,修正时间步采样、LoKR、显存优化",
|
||||
|
|
|
|||
|
|
@ -1,45 +0,0 @@
|
|||
"""Retake (variation-generation) controls for the generation tab.

Split out from ``generation_tab_secondary_controls.py`` to keep that
module under the 200 LOC cap defined in AGENTS.md.
"""

from typing import Any

import gradio as gr


def build_retake_controls() -> dict[str, Any]:
    """Create retake controls — controllable variation generation (issue #1155).

    Retake mixes a fresh independent noise draw into the main initial noise
    via a variance-preserving sin/cos blend. ``variance=0`` is a no-op;
    higher values drift the output further from the seeded baseline.

    Args:
        None.

    Returns:
        Component map with the retake accordion and its inputs.
    """

    with gr.Accordion("Retake (variation generation)", open=False) as retake_accordion:
        retake_variance = gr.Slider(
            label="Retake Variance",
            minimum=0.0,
            maximum=1.0,
            step=0.01,
            value=0.0,
            info="0=no retake (baseline). Low (0.05–0.15) = subtle variation; high (0.5+) = stronger drift.",
        )
        retake_seed = gr.Textbox(
            label="Retake Seed",
            value="",
            placeholder="Leave empty for random; integer to reproduce a variation",
            info="Independent seed for the retake noise. Recorded in metadata when variance > 0.",
        )
    return {
        "retake_accordion": retake_accordion,
        "retake_variance": retake_variance,
        "retake_seed": retake_seed,
    }
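The variance-preserving blend that the retake docstring describes can be sketched in a few lines. This is a minimal pure-Python illustration under stated assumptions, not the repository's implementation: `mix_retake_noise` and its arguments are hypothetical names, and plain lists of floats stand in for the real noise tensors.

```python
import math
import random
import statistics


def mix_retake_noise(base_noise, retake_noise, variance):
    """Blend two independent noise draws with a sin/cos mix (hypothetical helper).

    Because cos^2 + sin^2 = 1, two independent unit-variance inputs give a
    unit-variance output for every variance value in [0, 1].
    """
    if not 0.0 <= variance <= 1.0:
        raise ValueError("variance must be in [0, 1]")
    angle = variance * math.pi / 2.0
    c, s = math.cos(angle), math.sin(angle)
    return [c * b + s * r for b, r in zip(base_noise, retake_noise)]


# Two independent standard-normal draws stand in for the seeded base noise
# and the retake noise.
rng = random.Random(0)
base = [rng.gauss(0.0, 1.0) for _ in range(20000)]
retake = [rng.gauss(0.0, 1.0) for _ in range(20000)]

mixed = mix_retake_noise(base, retake, 0.3)
print(statistics.pvariance(mixed))  # stays close to 1 for any variance setting
```

With `variance=0.0` the function returns the base noise unchanged, which matches the "no-op" behaviour stated in the docstring; with `variance=1.0` the base noise is replaced by the retake draw up to floating-point rounding.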
|
||||
|
|
@ -28,7 +28,7 @@ from .generation_tab_secondary_controls import (
|
|||
build_custom_mode_controls,
|
||||
build_repainting_controls,
|
||||
)
|
||||
from .generation_tab_retake_controls import build_retake_controls
|
||||
from .generation_tab_variation_morph_controls import build_variation_morph_controls
|
||||
|
||||
|
||||
def create_generation_tab_section(
|
||||
|
|
@ -70,10 +70,12 @@ def create_generation_tab_section(
|
|||
hidden_state_controls = build_hidden_generation_state()
|
||||
simple_mode_controls = build_simple_mode_controls()
|
||||
source_track_code_controls = build_source_track_and_code_controls()
|
||||
# Retake + Edit accordion sits right under LM codes Hints so the
|
||||
# variation knobs are next to the source-audio inputs they apply to.
|
||||
variation_morph_controls = build_variation_morph_controls()
|
||||
cover_controls = build_cover_strength_controls()
|
||||
custom_mode_controls = build_custom_mode_controls()
|
||||
repainting_controls = build_repainting_controls()
|
||||
retake_controls = build_retake_controls()
|
||||
optional_controls = build_optional_parameter_controls(
|
||||
max_duration=max_duration,
|
||||
max_batch_size=max_batch_size,
|
||||
|
|
@ -92,10 +94,10 @@ def create_generation_tab_section(
|
|||
result.update(hidden_state_controls)
|
||||
result.update(simple_mode_controls)
|
||||
result.update(source_track_code_controls)
|
||||
result.update(variation_morph_controls)
|
||||
result.update(cover_controls)
|
||||
result.update(custom_mode_controls)
|
||||
result.update(repainting_controls)
|
||||
result.update(retake_controls)
|
||||
result.update(optional_controls)
|
||||
result.update(generate_controls)
|
||||
result.update(
|
||||
|
|
|
|||
|
|
@ -89,7 +89,10 @@ def build_lm_code_hint_controls() -> dict[str, Any]:
|
|||
elem_classes=["has-info-container"],
|
||||
) as text2music_audio_codes_group:
|
||||
with gr.Row(equal_height=True):
|
||||
lm_codes_audio_upload = gr.Audio(label=t("generation.source_audio"), type="filepath", scale=3)
|
||||
lm_codes_audio_upload = gr.Audio(
|
||||
label=t("generation.lm_codes_audio_upload_label"),
|
||||
type="filepath", scale=3,
|
||||
)
|
||||
text2music_audio_code_string = gr.Textbox(
|
||||
label=t("generation.lm_codes_label"),
|
||||
placeholder=t("generation.lm_codes_placeholder"),
|
||||
|
|
|
|||
|
|
@ -0,0 +1,126 @@
|
|||
"""Retake + Edit (flow-edit overlay) controls (#1155, #1156).
|
||||
|
||||
Two columns inside one ``gr.Accordion`` so each subsystem's panel sits
|
||||
directly under its own checkbox. Both are available in Custom / Remix /
|
||||
Repaint modes; outer accordion visibility is controlled by ``mode_ui``.
|
||||
The ``Copy from current`` button click handler is wired in
|
||||
``generation_run_wiring.py`` (kept out of this builder so the captions /
|
||||
lyrics components only exist in the wiring scope).
|
||||
"""
|
||||
|
||||
from typing import Any
|
||||
|
||||
import gradio as gr
|
||||
|
||||
from acestep.ui.gradio.help_content import create_help_button
|
||||
|
||||
|
||||
def build_variation_morph_controls() -> dict[str, Any]:
|
||||
"""Build the Retake + Edit accordion.
|
||||
|
||||
Layout (collapsed by default)::
|
||||
|
||||
> Retake & Edit
|
||||
| [ ] Retake (?) | [ ] Edit (?) |
|
||||
| variance ── seed ── | [Copy from current] |
|
||||
| | source caption ── |
|
||||
| | source lyrics ── |
|
||||
| | n_min n_max n_avg |
|
||||
|
||||
Each checkbox toggles the visibility of the panel directly below it,
|
||||
so the two columns can be expanded independently. The (?) buttons
|
||||
open modal tutorials for each subsystem.
|
||||
"""
|
||||
|
||||
with gr.Group() as variation_group:
|
||||
with gr.Row(equal_height=False):
|
||||
# ---- LEFT column: Retake ----
|
||||
with gr.Column(scale=1, min_width=200):
|
||||
with gr.Row():
|
||||
retake_enabled = gr.Checkbox(
|
||||
label="Retake", value=False, scale=8,
|
||||
)
|
||||
create_help_button("generation_retake")
|
||||
with gr.Group(visible=False) as retake_panel:
|
||||
with gr.Row():
|
||||
retake_variance = gr.Slider(
|
||||
minimum=0.0, maximum=1.0, step=0.01, value=0.0,
|
||||
label="variance", scale=2,
|
||||
info="0=baseline; 0.05–0.15 subtle; 0.5+ strong.",
|
||||
)
|
||||
retake_seed = gr.Textbox(
|
||||
label="seed", value="", scale=1,
|
||||
placeholder="empty=random",
|
||||
)
|
||||
retake_think_warning = gr.Markdown(
|
||||
"⚠️ **Think is on — Retake will mix LM drift with "
|
||||
"noise drift.** To retake a Think-mode result "
|
||||
"cleanly: open the result's 📊 Score & LRC & LM "
|
||||
"Codes panel, copy its **LM Codes** into the "
|
||||
"**LM Codes Hints** field above, then uncheck Think "
|
||||
"before adjusting variance. See the (?) help for "
|
||||
"the full workflow.",
|
||||
visible=False,
|
||||
)
|
||||
# ---- RIGHT column: Edit ----
|
||||
with gr.Column(scale=1, min_width=200):
|
||||
with gr.Row():
|
||||
flow_edit_morph = gr.Checkbox(
|
||||
label="Edit", value=False, scale=8,
|
||||
)
|
||||
create_help_button("generation_edit")
|
||||
with gr.Group(visible=False) as morph_panel:
|
||||
with gr.Row():
|
||||
flow_edit_copy_from_current_btn = gr.Button(
|
||||
"Copy current → source",
|
||||
size="sm", scale=0, min_width=180,
|
||||
)
|
||||
with gr.Row(equal_height=False):
|
||||
flow_edit_source_caption = gr.Textbox(
|
||||
label="source caption",
|
||||
placeholder="Describe the ORIGINAL audio.",
|
||||
lines=4, max_lines=8, scale=1,
|
||||
)
|
||||
flow_edit_source_lyrics = gr.Textbox(
|
||||
label="source lyrics",
|
||||
placeholder="Original lyrics; top-level lyrics is the target.",
|
||||
lines=4, max_lines=8, scale=1,
|
||||
)
|
||||
with gr.Row():
|
||||
flow_edit_n_min = gr.Slider(
|
||||
minimum=0.0, maximum=1.0, value=0.0, step=0.05,
|
||||
label="n_min",
|
||||
)
|
||||
flow_edit_n_max = gr.Slider(
|
||||
minimum=0.0, maximum=1.0, value=1.0, step=0.05,
|
||||
label="n_max",
|
||||
)
|
||||
flow_edit_n_avg = gr.Slider(
|
||||
minimum=1, maximum=8, value=1, step=1,
|
||||
label="n_avg",
|
||||
)
|
||||
# Visibility chains.
|
||||
retake_enabled.change(
|
||||
lambda v: gr.update(visible=bool(v)),
|
||||
inputs=[retake_enabled], outputs=[retake_panel],
|
||||
)
|
||||
flow_edit_morph.change(
|
||||
lambda v: gr.update(visible=bool(v)),
|
||||
inputs=[flow_edit_morph], outputs=[morph_panel],
|
||||
)
|
||||
return {
|
||||
"variation_group": variation_group,
|
||||
"retake_enabled": retake_enabled,
|
||||
"retake_panel": retake_panel,
|
||||
"retake_variance": retake_variance,
|
||||
"retake_seed": retake_seed,
|
||||
"retake_think_warning": retake_think_warning,
|
||||
"flow_edit_morph": flow_edit_morph,
|
||||
"morph_panel": morph_panel,
|
||||
"flow_edit_copy_from_current_btn": flow_edit_copy_from_current_btn,
|
||||
"flow_edit_source_caption": flow_edit_source_caption,
|
||||
"flow_edit_source_lyrics": flow_edit_source_lyrics,
|
||||
"flow_edit_n_min": flow_edit_n_min,
|
||||
"flow_edit_n_max": flow_edit_n_max,
|
||||
"flow_edit_n_avg": flow_edit_n_avg,
|
||||
}
|
||||
120
scripts/flow_edit_overlay_smoke.py
Normal file
|
|
@ -0,0 +1,120 @@
|
|||
"""Flow-edit overlay smoke test (#1156, post-redesign).
|
||||
|
||||
Drives ``inference.generate_music`` with ``task_type='text2music'`` +
|
||||
``flow_edit_morph=True`` so the V_delta overlay layers on top of the
|
||||
existing text2music dispatch. This is the architecturally correct shape:
|
||||
the user's caption/lyrics are the *target*; the overlay's
|
||||
``flow_edit_source_caption`` / ``flow_edit_source_lyrics`` describe
|
||||
the original audio.
|
||||
|
||||
Run on jieyue (or any GPU host):
|
||||
|
||||
cd /root/data/repo/gongjunmin/workspace/ACE-Step-1.5
|
||||
conda activate acestep_v15_train
|
||||
CUDA_VISIBLE_DEVICES=1 python scripts/flow_edit_overlay_smoke.py
|
||||
|
||||
Outputs go to ``flow_edit_test_outputs/overlay_*.wav``.
|
||||
"""
|
||||
|
||||
import json
|
||||
import os
|
||||
import sys
|
||||
import time
|
||||
from pathlib import Path
|
||||
|
||||
REPO = Path(__file__).resolve().parent.parent
|
||||
sys.path.insert(0, str(REPO))
|
||||
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "1")
|
||||
|
||||
import torch
|
||||
import torchaudio
|
||||
from loguru import logger
|
||||
|
||||
from acestep.handler import AceStepHandler
|
||||
from acestep.inference import GenerationConfig, GenerationParams, generate_music
|
||||
from acestep.llm_inference import LLMHandler
|
||||
|
||||
OUT = REPO / "flow_edit_test_outputs"
|
||||
OUT.mkdir(exist_ok=True)
|
||||
SEED = 42
|
||||
DIT_CONFIG = "acestep-v15-sft"
|
||||
|
||||
ex = json.loads((REPO / "examples/text2music/example_01.json").read_text())
|
||||
src_path = OUT / f"baseline_01_seed{SEED}.wav"
|
||||
assert src_path.exists(), f"missing baseline wav at {src_path}; run flow_edit_smoke_test.py first"
|
||||
|
||||
NEW_VERSE1 = (
|
||||
"[Verse 1]\n清晨阳光洒在花园里\n鸟儿欢唱迎接早晨\n"
|
||||
"露珠闪耀在叶尖上\n微风轻拂带来安宁\n"
|
||||
"远方传来悠扬钢琴声\n阳光穿过透明的窗户\n"
|
||||
"花香弥漫在空气中\n心情舒畅自由飞翔\n"
|
||||
)
|
||||
ORIG_VERSE1 = (
|
||||
"[Verse 1]\n黑夜里的风吹过耳畔\n甜蜜时光转瞬即万\n"
|
||||
"脚步飘摇在星光上\n心追节奏心跳狂乱\n"
|
||||
"耳边传来电吉他呼唤\n手指轻触碰点流点燃\n"
|
||||
"梦在云端任它蔓延\n疯狂跳跃自由无间\n"
|
||||
)
|
||||
TGT_LYRICS = ex["lyrics"].replace(ORIG_VERSE1, NEW_VERSE1)
|
||||
assert TGT_LYRICS != ex["lyrics"], "verse-1 replace marker not found"
|
||||
|
||||
logger.info(f"Initializing DiT ({DIT_CONFIG})")
|
||||
dit = AceStepHandler()
|
||||
_, ok = dit.initialize_service(
|
||||
project_root=str(REPO), config_path=DIT_CONFIG, device="cuda",
|
||||
use_flash_attention=False, compile_model=False,
|
||||
offload_to_cpu=False, offload_dit_to_cpu=False, quantization=None,
|
||||
use_mlx_dit=False,
|
||||
)
|
||||
assert ok
|
||||
llm = LLMHandler()
|
||||
|
||||
|
||||
def run(label, n_min, n_max, n_avg, infer_steps=60, shift=3.0, guidance_scale=15.0):
|
||||
p = GenerationParams(
|
||||
# text2music task — silence-derived context for prepare_condition,
|
||||
# real src_latents only used for zt formation in the sampling loop.
|
||||
task_type="text2music",
|
||||
src_audio=str(src_path),
|
||||
caption=ex["caption"], # target = original style (keep melody)
|
||||
lyrics=TGT_LYRICS, # target = NEW lyrics
|
||||
# Flow-edit overlay — describes the source side for V_src.
|
||||
flow_edit_morph=True,
|
||||
flow_edit_source_caption=ex["caption"],
|
||||
flow_edit_source_lyrics=ex["lyrics"],
|
||||
flow_edit_n_min=n_min,
|
||||
flow_edit_n_max=n_max,
|
||||
flow_edit_n_avg=n_avg,
|
||||
instrumental=False,
|
||||
vocal_language=ex.get("language", "en"),
|
||||
bpm=ex.get("bpm"),
|
||||
keyscale=ex.get("keyscale", ""),
|
||||
timesignature=str(ex.get("timesignature", "")),
|
||||
duration=float(ex.get("duration", 120)),
|
||||
inference_steps=infer_steps,
|
||||
seed=SEED,
|
||||
guidance_scale=guidance_scale,
|
||||
shift=shift,
|
||||
thinking=False,
|
||||
)
|
||||
cfg = GenerationConfig(batch_size=1, use_random_seed=False, seeds=[SEED])
|
||||
logger.info(f"[{label}] n_min={n_min} n_max={n_max} n_avg={n_avg}")
|
||||
t0 = time.time()
|
||||
r = generate_music(dit, llm, p, cfg, save_dir=str(OUT))
|
||||
dt = time.time() - t0
|
||||
if not r.success:
|
||||
logger.error(f"[{label}] FAILED: {r.error}")
|
||||
return
|
||||
audio = r.audios[0]
|
||||
out_path = OUT / f"overlay_{label}.wav"
|
||||
if isinstance(audio, dict) and "tensor" in audio:
|
||||
torchaudio.save(str(out_path), audio["tensor"].to(torch.float32),
|
||||
audio.get("sample_rate", 48000))
|
||||
logger.info(f"[{label}] {dt:.1f}s -> {out_path}")
|
||||
|
||||
|
||||
# Match ACE-Step 1.0 defaults exactly: shift=3.0 (FlowMatchEulerDiscreteScheduler
|
||||
# default), guidance=15, infer=60, n_min=0, n_max=1, n_avg=1.
|
||||
run("v10_shift3", 0.0, 1.0, 1, infer_steps=60, shift=3.0)
|
||||
# Sanity: shift=1.0 (uniform schedule) for comparison.
|
||||
run("v10_shift1", 0.0, 1.0, 1, infer_steps=60, shift=1.0)
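For readers who want to see what the overlay integrates, the update rule quoted in the help text (`z_edit` advanced by `V_tar − V_src` at each step) can be sketched on a scalar toy problem. Everything here is an assumption made for illustration: `toy_velocity` and `flow_edit_toy` are hypothetical names, the coupling `z_tar = z_edit + z_src − x_src` follows the FlowEdit paper cited in the help text rather than the code in `flow_edit_pipeline`, the data distribution is a single point per condition, and the `n_min` / `n_max` window is omitted.

```python
import random


def toy_velocity(z, t, data_point):
    """Velocity of a toy rectified flow whose data distribution is one point.

    With z_t = (1 - t) * x + t * noise, the velocity is noise - x = (z_t - x) / t.
    """
    return (z - data_point) / t


def flow_edit_toy(x_src, c_src, c_tar, steps=8, n_avg=1, seed=0):
    """Integrate the velocity difference from t=1 down to t=0 (hypothetical sketch)."""
    rng = random.Random(seed)
    z_edit = x_src  # the edit path starts at the source sample
    for i in range(steps, 0, -1):
        t_cur, t_next = i / steps, (i - 1) / steps
        v_delta = 0.0
        for _ in range(n_avg):  # Monte Carlo average over fresh noise draws
            noise = rng.gauss(0.0, 1.0)
            z_src = (1.0 - t_cur) * x_src + t_cur * noise  # noised source
            z_tar = z_edit + z_src - x_src                 # paired target point
            v_delta += toy_velocity(z_tar, t_cur, c_tar) - toy_velocity(z_src, t_cur, c_src)
        v_delta /= n_avg
        # The signed step (t_next - t_cur) plays the role of the step size.
        z_edit = z_edit + (t_next - t_cur) * v_delta
    return z_edit


# Source sample sits at the source condition; the edit should land on the target.
edited = flow_edit_toy(x_src=2.0, c_src=2.0, c_tar=-1.0, steps=8, n_avg=2)
print(edited)
```

In this toy setting the noise term cancels inside the velocity difference, so the path moves deterministically from the source point to the target point. When both conditions are identical the difference is zero and the source is returned unchanged, which mirrors the V_src ≈ V_tar failure mode described in the commit message.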
|
||||