Mirror of https://github.com/ace-step/ACE-Step-1.5.git (synced 2026-07-01 16:27:05 +00:00)

* refactor(flow-edit): redesign as cover-overlay sampler, drop "edit" task (#1156)

The previous design (PRs #1162–#1167) shipped a standalone ``task_type="edit"`` that fed the user's source audio into ``prepare_condition`` for both branches. In ACE-Step 1.5 ``prepare_condition`` builds ``context_latents`` from ``src_latents`` (line 1701 of the base modeling), which becomes the dominant audio-self-conditioning at the decoder. Both branches receiving the same source latents meant V_src ≈ V_tar regardless of how we tweaked the text/lyric encoder inputs, so the overlay had no signal to integrate and produced near-identical output to the baseline at every (n_min, n_max, n_avg) we tried.

Architecturally correct shape (from user feedback): flow-edit is a sampler-level technique that layers on top of an existing task, not a new task. The cover/cover-nofsq dispatch already pairs ref-audio context with new caption/lyrics; the overlay simply adds a paired *source* branch (encoded from new ``flow_edit_source_caption`` / ``flow_edit_source_lyrics``) so V_delta = V_tar - V_src has meaning.

Backend changes
---------------

* Drop ``task_type == "edit"`` everywhere (constants, TASK_INSTRUCTIONS, inference skip_lm_tasks, generate_music_request _src_audio_required_tasks, generate_music guard rails, task_utils generate_instruction).
* Replace ``edit_target_*`` GenerationParams with ``flow_edit_morph`` + ``flow_edit_source_caption/lyrics`` + ``flow_edit_n_min/max/avg`` so the user's ``caption``/``lyrics`` keep cover's existing target semantics.
* ``service_generate`` builds ``flow_edit_ctx`` only when ``flow_edit_morph=True and task_type in (cover, cover-nofsq)``.
* ``_execute_service_generate_diffusion`` dispatches to the new ``dispatch_flow_edit_overlay`` (renamed from ``dispatch_flow_edit``) when the overlay is active; otherwise the regular cover dispatch runs unchanged.
* ``service_generate_flow_edit_target.py`` renamed to ``service_generate_flow_edit_source.py``: same helpers, but now they encode the *source* side (the existing payload already carries the user's caption/lyrics as the target).
* ``service_generate_flow_edit.py`` rewritten to drive ``flow_edit_pipeline.flowedit_generate_audio`` with the freshly encoded source side + payload's target side.

UI changes
----------

* ``build_flow_edit_morph_controls()`` adds a Smooth-morph checkbox + source caption/lyrics + n_min/n_max/n_avg sliders inside a group that ``mode_ui`` toggles visible only on Remix (cover) mode. v1 leaves the controls visible for inspection but does NOT thread them through ``generation_run_wiring`` / ``batch_management_*`` yet; the end-to-end UI run path lands in the follow-up PR. Smoke testing goes through the Python API for now (see ``scripts/flow_edit_overlay_smoke.py``).

Tests
-----

* Rewrite ``service_generate_flow_edit_test.py`` for the overlay shape: 9 tests now exercise dispatch source-side tokenization, target-side pass-through, missing-method guard, device alignment, default-window fallbacks, dict-meta parsing, and retake_seed forwarding.
* Update ``_flow_edit_dispatch_test_support.py`` fakes accordingly.
* All 36 flow-edit / pipeline / helper tests pass locally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(flow-edit): re-target overlay onto text2music + 1.0 default hyperparams (#1156)

User feedback after listening to the cover-overlay v1 outputs:

* drop cover task as the underlying carrier; use text2music instead so ``prepare_condition`` produces silence-derived context for both src and tar branches (the cleanest condition shape, identical between branches, so V_delta is purely text-driven);
* fall back to ACE-Step 1.0's flow-edit defaults: ``n_min=0, n_max=1.0, n_avg=1, infer_steps=60``.

Implementation
--------------

* ``flow_edit_pipeline.flowedit_generate_audio`` gains a ``ctx_src_latents`` param.
When the caller wants text2music-style context but still needs the real ``src_latents`` for ``zt_src`` / ``zt_tar`` formation in the sampling loop, it passes ``ctx_src_latents=silence``. Both ``prepare_condition`` calls then feed silence as both ``hidden_states`` and ``src_latents`` while the loop continues to use the real audio for trajectory formation.
* ``service_generate_flow_edit.dispatch_flow_edit_overlay`` rewritten: builds the silence tile, passes ``is_covers=zeros`` (text2music), and forwards real ``src_latents`` for the loop.
* ``service_generate.flow_edit_ctx`` now activates only on ``task_type == "text2music"`` (was cover/cover-nofsq).
* ``generate_music_request._prepare_reference_and_source_audio`` accepts ``flow_edit_morph`` so text2music + morph encodes ``src_audio`` instead of silently ignoring it.
* ``generate_music`` locks ``audio_duration`` to source audio for the morph case and warns if morph is enabled on a non-text2music task.
* ``inference.generate_music`` no longer forces ``src_audio=None`` for text2music when ``flow_edit_morph=True``.
* ``flow_edit_overlay_smoke.py`` now drives a single 1.0-default run (``n_min=0, n_max=1.0, n_avg=1, infer_steps=60``) on the SFT model.

Tests
-----

* Update fakes/fixtures so ``task_type='text2music'`` and ``is_covers=0``; all 36 flow-edit / pipeline / helper tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(flow-edit): slice silence_latent properly when tiling for prepare_condition

The previous expand() call assumed silence_latent was (1, 1, C) but it's actually (1, available, C), for example (1, 15000, C), so expanding to (bsz, 4000, C) blew up with a shape mismatch. Mirror ``conditioning_target._get_silence_latent_slice``: slice the first ``seq`` frames if available, otherwise tile and slice.
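
The slice-or-tile rule described in the fix above can be sketched without any tensor library. This is a minimal illustration only: plain nested lists stand in for the real latent tensors, and the helper names ``silence_slice`` and ``batch_silence`` are hypothetical, not the repository's functions.

```python
from typing import List

Frame = List[float]  # one latent frame with C channels (stand-in for a tensor row)


def silence_slice(silence: List[Frame], seq: int) -> List[Frame]:
    """Return exactly `seq` frames of silence.

    Slices the first `seq` frames when enough are available; otherwise
    repeats (tiles) the available frames and trims the result to `seq`.
    """
    if seq < 0:
        raise ValueError("seq must be non-negative")
    available = len(silence)
    if available == 0:
        raise ValueError("silence latent has no frames")
    if available >= seq:
        return silence[:seq]
    repeats = -(-seq // available)  # ceiling division
    return (silence * repeats)[:seq]


def batch_silence(silence: List[Frame], bsz: int, seq: int) -> List[List[Frame]]:
    """Build a (bsz, seq, C) nested list by copying the sliced silence per item."""
    sliced = silence_slice(silence, seq)
    return [[list(frame) for frame in sliced] for _ in range(bsz)]
```

With a (1, 15000, C) silence latent and ``seq=4000`` this takes the plain slice path; the tiling path only matters when the requested length exceeds the stored silence.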

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(flow-edit): add shift=3.0 vs shift=1.0 A/B for overlay smoke

ACE-Step 1.0's ``FlowMatchEulerDiscreteScheduler`` defaults to shift=3.0, which front-loads the schedule near t=0 (more dense steps at the clean end). Our smoke wasn't setting shift explicitly so it ran at shift=1.0 (uniform). Add a paired run so we can A/B the two on the same source and confirm whether the shift mismatch explains the v1 distortion the user reported.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(flow-edit): wire morph UI to text2music + run-handler chain (#1156)

User found that the Gradio demo didn't surface a working morph control because the UI pieces weren't actually connected end-to-end:

* ``mode_ui`` made ``flow_edit_morph_group`` visible only on Remix mode, but the backend dispatches morph for ``task_type == text2music`` (Custom mode). Net: morph never fired. Fix: gate visibility on ``is_custom`` and add Custom to ``show_src_audio`` so users can upload the source audio for the overlay.
* ``generation_run_wiring`` didn't pass the 6 ``flow_edit_*`` UI components to the handler. Add them to the click-event ``inputs``.
* ``generate_with_batch_management`` and ``generate_with_progress`` signatures gain the same 6 params and forward them into the ``GenerationParams`` constructor.
* ``generate_music_request`` now hard-errors when ``flow_edit_morph`` is True without a ``src_audio`` (was a silent no-op).
* ``flow_edit_pipeline._warn_about_disabled_v1_tricks`` log no longer references the removed ``task_type='edit'``; overlay is the right noun now.
* Drop the "API preview" caveat from the morph checkbox info text.

Hygiene:

* Add ``*.pkf`` and ``flow_edit_test_outputs/`` / ``retake_test_outputs/`` to ``.gitignore``; remove the 28 .pkf binaries that a previous ``git add -A`` accidentally pushed.
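
The switch from a silent no-op to a hard error for morph without source audio, noted in the bullets above, might look roughly like the following. This is a sketch under assumptions: ``check_morph_request`` and ``MorphRequestError`` are hypothetical names for illustration, and the real ``generate_music_request`` may validate differently.

```python
from typing import Optional


class MorphRequestError(ValueError):
    """Hypothetical error type; the real code may raise something else."""


def check_morph_request(flow_edit_morph: bool, src_audio: Optional[str]) -> None:
    """Fail fast when morph is requested but no source audio was supplied."""
    if flow_edit_morph and not src_audio:
        raise MorphRequestError(
            "flow_edit_morph=True requires src_audio; upload a source clip "
            "or disable morph."
        )
```

Raising early keeps the failure next to its cause instead of letting the request fall through to an ordinary generation that ignores the morph settings.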

Limitations (follow-up PRs):

* save/restore for the 6 morph params isn't threaded through batch_queue / metadata yet; restoring a session resets morph to defaults.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(ui): combine retake + smooth-morph into one accordion (#1155, #1156)

Per user feedback after first manual demo round:

* The two variation knobs (Retake = #1155 noise blend, Smooth morph = #1156 V_delta overlay) are conceptually paired: both produce a controlled deviation from the seeded baseline. Stack them in one ``gr.Accordion("Variation & Smooth Morph")`` with two top-line checkboxes; checking a box reveals only that subsystem's inputs.
* Compact retake: variance + seed sit on one row inside their panel.
* Morph controls (source caption / source lyrics / n_min / n_max / n_avg) get their own subpanel with explanatory help text below.
* Both subsystems are now visible in Custom / Remix / Repaint modes; ``mode_ui`` gates the outer accordion via ``is_custom or is_cover or is_repaint``. Backend dispatch still honours morph only in Custom (text2music) for v1; the morph panel info text flags this.
* Send-to-Remix / Send-to-Repaint pre-fills ``flow_edit_source_caption`` + ``flow_edit_source_lyrics`` with the previous run's prompt so the user can flip on morph and edit the top-level caption / lyrics as the target without re-typing the source description.

Implementation
--------------

* New ``generation_tab_variation_morph_controls.py`` (renamed from ``generation_tab_retake_controls.py``) houses the combined builder.
* ``build_flow_edit_morph_controls`` removed from ``generation_tab_secondary_controls.py`` (back under the 200 LOC cap).
* ``_MODE_UI_OUTPUT_KEYS`` swaps ``flow_edit_morph_group`` → ``variation_accordion``; ``mode_ui`` outputs gr.update for the new key. Existing UI tests (``context_test.py``) updated.
* ``send_audio_to_remix`` / ``send_audio_to_repaint`` return shape grows by 2 (caption + lyrics for ``flow_edit_source_*``); ``results_aux_wiring`` adds them to the click outputs list.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(ui): retake/edit panels side-by-side; rename to plain Retake / Edit (#1155, #1156)

Three issues reported after the manual demo round:

1. Panels were stacked vertically: when both Retake and Edit were checked the layout collapsed top-to-bottom. Fix: two ``gr.Column``s side-by-side inside a single ``gr.Row``; each checkbox sits on top of its own panel, expanded independently.
2. Verbose checkbox labels. Drop the parentheticals: "Retake (variation)" → "Retake"; "Smooth morph (flow-edit)" → "Edit". Accordion title becomes "Retake & Edit".
3. The trailing ``gr.Markdown`` with retake explainer was wider than its column and got clipped at the right edge. Replace with a slim ``gr.HTML`` line under the Edit panel only (the morph-only caveat), and shorten the per-input ``info=`` blurbs so they fit inside the slider/textbox without overlap.

Also: place the accordion right under "LM codes Hints" in the layout so the variation knobs are next to the source-audio inputs they apply to.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(ui): help modal + Copy-from-current button for Retake & Edit (#1155, #1156)

User reported two issues from manual demo:

1. The inline ``gr.HTML`` help block under the Edit panel rendered as a black band (z-index / overflow conflict with the surrounding column / accordion). Replace with the project-standard ``create_help_button`` pattern: a (?) button next to each checkbox that opens a modal with full markdown help. Both Retake and Edit now have their own modals.
2. There was no quick way to bootstrap the Edit ``source caption / lyrics`` from the user-level fields.
Add a ``📋 Copy from current`` button inside the Edit panel that copies ``captions`` → ``flow_edit_source_caption`` and ``lyrics`` → ``flow_edit_source_lyrics``. Wired in ``generation_text_format_wiring`` next to the existing Format buttons.

Help modal content (en.json):

* ``help.generation_retake``: explains the variance-preserving sin/cos blend (``mixed = cos(v·π/2)·base + sin(v·π/2)·retake``), variance-band recommendations, retake_seed semantics, and links issue #1155.
* ``help.generation_edit``: explains the V_delta integration math (``z_edit_{t-Δt} = z_edit_t + (V_tar − V_src)·Δt``), the paired-CFG method, recommended hyperparams (Custom mode, shift=3, n_min=0, n_max=1, n_avg=1), the Copy-from-current workflow, and cites the FlowEdit paper (Kulikov et al., CVPR 2025, arXiv:2412.08629) plus issue #1156.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(ui): compact Copy button + side-by-side source caption/lyrics in Edit panel

After the previous round the Copy-from-current button was rendering as a full-width dark grey bar (Gradio expanded the button to fill the column, so the label was pushed to one side and the empty fill looked like a broken help block). Wrap the button in a ``gr.Row`` with ``scale=0`` + ``min_width=180`` so it claims only the space its label needs.

Per follow-up feedback, ``source caption`` and ``source lyrics`` now sit side-by-side inside the Edit panel; the previous top-bottom stack wasted vertical space and made the panel scroll for no reason. Both textboxes now share a 4-line minimum on a single row and split the width 50/50 via ``scale=1`` each.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(ui): drop checkbox ``info`` blurbs; help (?) button is enough

The Gradio-auto ``ⓘ`` icon next to the checkbox label and the (?) help button were both rendered next to each other, looking like two redundant info indicators (and one of them was a no-op tooltip).
Drop the short ``info=`` strings on the Retake / Edit checkboxes; the (?) modal carries the full explanation already.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(ui): make issue references clickable in Retake/Edit help modals

The ``_md_to_html`` renderer already converts ``[text](url)`` markdown links to anchor tags, but the issue references in the help text were plain ``Issue #1156`` strings. Wrap them as proper markdown links pointing at the GitHub issues so the modal renders them as ``<a href=...>Issue #1156</a>`` (matching the existing FlowEdit paper arXiv link).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(ui): drop the outer "Retake & Edit" accordion

The accordion added an unnecessary collapse layer: its title is self-evident from the two checkboxes underneath, and one extra click to expand was friction. Replace with a plain ``gr.Group()`` so the two checkboxes sit directly in the layout. Mode-UI visibility still hides/shows the whole block via the renamed ``variation_group`` key.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(ui): warn user when Retake + Think collide (#1155)

Retake's noise-blend variation is only meaningful when every other condition matches the baseline run. In particular, with Think on the LM regenerates audio codes per call, so Retake's variance gets layered on top of an already-different starting point and the result mixes "LM drift" with "noise drift", which is confusing and rarely what the user wants.

Two surfaces, one message:

* ``help.generation_retake`` modal gains a "⚠️ Consistency requirement" section explaining the seed / Think / knob-locking caveats and the recommended A/B-comparison workflow.
* A live ``gr.Markdown`` warning sits inside the Retake panel and becomes visible only when both ``retake_enabled`` and ``think_checkbox`` are simultaneously on.
Wired in ``generation_text_format_wiring.py`` next to the existing Copy-from-current handler.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(ui): document Think→Retake workflow via LM-codes pinning (#1155)

User feedback: when the baseline you want to retake was generated with Think on, the cleanest way to lock the LM-side starting point is to copy the result's LM Codes and paste them into the LM Codes Hints field, then uncheck Think. Add this workflow:

* ``help.generation_retake`` modal gains a "Recommended workflow for retaking a Think-mode result" section walking through the 5 steps (open Score & LRC & LM Codes accordion → copy LM Codes → paste into LM Codes Hints → uncheck Think → enable Retake / lock seed / set retake_seed / adjust variance).
* The inline Retake×Think warning becomes actionable: it now points the user at the exact panels and fields they need to use, with the full step-by-step left to the (?) modal.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(flow-edit): extend overlay to cover / cover-nofsq tasks (#1156)

User reported Edit didn't fire when generating with LM Codes (cover scenario) on Gradio. Root cause: the v1 backend gate restricted morph to ``task_type == "text2music"`` and silently ignored everything else.

Extend the dispatch:

* ``service_generate.flow_edit_ctx`` now activates on ``task_type in (text2music, cover, cover-nofsq)``.
* ``dispatch_flow_edit_overlay`` branches on task:
  - ``text2music`` keeps the silence-context behaviour (the verified clean text-driven V_delta path).
  - ``cover`` / ``cover-nofsq`` pass the payload's real ``src_latents`` and ``is_covers`` straight through, so cover's natural LM-codes context flows into both branches via ``prepare_condition``'s ``is_covers > 0`` lm_hints branch. Both branches share the same codes, so V_delta is still text-driven.
* ``generate_music`` warning relaxed to flag only repaint / extract / lego (which need paired-CFG derivation per task shape).
* ``mode_ui`` row 35 comment updated to note Edit's mode coverage.

Help text refreshed:

* ``help.generation_edit`` now describes both text2music and Remix paths, and the "ignored on Repaint/Extract/Lego" caveat replaces the old "Custom-only" note.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(flow-edit): add flowedit_generate_audio delegate to turbo / xl-turbo (#1156)

User hit ``RuntimeError: Flow-edit overlay requires a base DiT variant`` when toggling Edit on a turbo model. Add the same thin delegate the 4 base variants carry, with one turbo-specific guard:

* Force ``diffusion_guidance_scale=1.0`` because turbo / xl-turbo are CFG-distilled: paired-CFG over a delta that the model wasn't trained to produce just amplifies noise. Log an INFO when overriding so the user sees why their guidance_scale slider didn't take effect.

Both variants share the exact same delegate body; resisted the urge to factor it out because the 4 base variants don't have the gs override and a single shared helper would muddle that.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(flow-edit): drop silence-context override; use task-natural context (#1156)

User reported turbo + Custom + Edit produced pure noise on Gradio. Root cause: ``dispatch_flow_edit_overlay`` forced ``is_covers=zeros`` and a silence-tiled context for text2music, but turbo's velocity head was trained on real audio / LM-codes context. 8-step turbo on silence is OOD: V predictions are unstable and the V_delta integration over a short schedule accumulates into garbage latents.

Fix: stop overriding the audio context. Pass the payload's real ``src_latents`` and ``is_covers`` straight through to ``prepare_condition`` for every task type.
Both branches still share the SAME context (whatever the task naturally built: LM-codes hints when Think is on / cover task, src-latents auto-tokenization otherwise), so V_delta is still purely text-driven, but the velocity head stays in distribution.

Verified by inspection of the user's run log:

* ``Using precomputed LM hints`` printed 3× (LM Phase 1 ran with Think on; both flow-edit prepare_condition calls + the downstream one for auto-LRC saw the precomputed tensor)
* ``infer_steps=8, guidance_scale=1.00`` → turbo dispatch path
* The forced ``is_covers=zeros`` in our dispatch was discarding the precomputed LM-codes hints and falling back to silence-context.

Help text updated to drop the silence-context language and add a note that turbo's 8-step budget is sufficient for flow-edit (the V_delta integration uses the same paired-forward count regardless of variant).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(flow-edit): restore silence-context for text2music; keep cover's natural

Empirical sweep (sft60 / tb60 / tb8w05 / tb8s1, all on the user's provided audio + LM codes) showed every variant collapsed to peak ≈ 0.007 the moment the dispatch passed real ``src_latents`` / LM-codes context to ``prepare_condition``. The previous "transparent payload" fix was wrong: text2music's training distribution is (clean target, silence audio context), so feeding it either real source latents or one-sided LM-codes hints pushes the velocity head OOD: V_delta accumulates noise, the latent collapses, and VAE decode + auto-normalisation amplifies the residual to full scale. That's the "pure noise" the user reports.

Branch dispatch:

* text2music: force silence-context (proven working: peak=1.0 on the earlier jieyue sft+60 run that used silence-context).
* cover / cover-nofsq: keep the payload's real ``src_latents`` and ``is_covers``.
The cover task IS trained on LM-codes-derived audio context shared by both branches, so V_delta stays clean while staying in distribution.

Documented the diagnostic and the design choice inline so the next edit doesn't re-do the same loop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(flow-edit): drop precomputed LM hints in text2music branch

The empirical sweep showed silence-context worked when ``audio_codes`` was absent (peak=0.92) but collapsed when codes were present (peak=0.007), even with ``is_covers=0`` forced. Even though ``prepare_condition``'s ``where(is_covers > 0, lm_hints, src_latents)`` keeps the silence src_latents unmodified, the codes-derived ``lm_hints_25Hz`` tensor itself lingers in downstream tensor paths and empirically collapses the latent.

Force ``precomputed_lm_hints_25Hz=None`` for the text2music silence-context branch so ``prepare_condition`` tokenizes silence afresh, matching the working no-codes path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(flow-edit): strip audio_codes for text2music morph (#1156)

Root cause of the user's "pure noise" repro: when ``audio_codes`` was present, ``conditioning_target._prepare_target_latents_and_wavs`` replaced ``target_wavs`` with zeros and put ``_decode_audio_codes_to_latents(codes)`` into ``target_latents``. That decoded-from-codes tensor sits at a different distribution than VAE-encoded audio, so flow-edit's ``zt_edit = src_latents.clone()`` started OOD and the V_delta integration collapsed to a near-silent latent (peak ≈ 0.007), which auto-normalisation then amplified to full scale as audible noise.

Verified by toggling ``USE_CODES`` in the repro script:

* with codes → peak 0.0076 (broken)
* no codes → peak 0.9258 (clean output)

Strip ``audio_code_string`` for ``text2music`` + ``flow_edit_morph`` specifically.
The downstream pipeline then VAE-encodes the uploaded mp3 normally, and zt_edit starts at a real audio latent; flow-edit's math behaves as intended. Cover / cover-nofsq + morph keeps codes intact (cover IS trained on codes-derived context, both branches share it, no OOD).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(flow-edit): skip LM Phase 1 for text2music + morph (#1156)

Earlier fix only zeroed ``audio_code_string_to_use`` at the top of ``generate_music``. But when Think is on (UI default), LM Phase 1 runs anyway and overwrites ``audio_code_string_to_use`` with freshly-generated codes; the same codes path then bites ``conditioning_target`` and zt_edit starts OOD again.

Add ``text2music + flow_edit_morph`` to the LM-skip path alongside cover / cover-nofsq / repaint / extract. Think / CoT both silently no-op for the morph case so the downstream pipeline VAE-encodes the src_audio cleanly regardless of whether Think was checked in the UI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(ui): rewrite Retake/Edit help to match current behaviour + i18n (#1155, #1156)

The previous help text for Edit was written when the dispatch still used silence-context unconditionally for text2music. After the codes-context fix (commits

| Name |
|---|
| lora_data_prepare |
| check_gpu.py |
| fetch-awesome.mjs |
| flow_edit_overlay_smoke.py |
| new_pr_branch.ps1 |
| prepare_vae_calibration_data.py |
| profile_vram.py |