Kotonia Articles

Automatic Transfer of T2I LoRA to the Edit Pathway in Unified VL Image Generation — A Controlled Evaluation on HiDream-O1

A controlled 240-sample evaluation on HiDream-O1-Image showing that LoRA adapters trained on T2I signal alone deliver commercial-grade quality on the image-to-image edit pathway. Structural property of unified VL architectures; absent in SDXL/FLUX dev siloed architectures.

By Shinji Shimizu2026-06-2211 min read

#hidream#lora#image-generation#unified-vl#research

Also inJapanese Chinese

Automatic Transfer of T2I LoRA to the Edit Pathway in Unified VL Image Generation — A Controlled Evaluation on HiDream-O1

Abstract

This article examines, on HiDream-O1-Image (a Qwen3VL-based unified vision-language transformer released in May 2026 under the MIT license), the degree to which LoRA adapters trained exclusively on the Text-to-Image (T2I) signal function on the Image-to-Image (instruction-following edit, i.e. I2I) pathway. We conducted a controlled evaluation of 5 axes × 16 prompts × 5 LoRA configurations × 3 reference images = 240 samples. The result: without any edit-specific fine-tuning, T2I-only LoRAs deliver commercial-grade quality improvements on the edit pathway. This is a structural property of the unified VL architecture and does not hold for siloed architectures such as Stable Diffusion XL or FLUX dev, where the text encoder and diffusion backbone are separate modules. We discuss the meaning of this asymmetry, the ongoing industry migration, and the practical implications for indie practitioners.

1. Problem Setting

The author maintains a conversational-agent product whose dynamic-narrative / TRPG mode (a persona-driven UI where outfit, scene, and supporting characters change per chapter) requires high-quality I2I — that is, the operation of rewriting scene, outfit, and pose by text instruction while preserving subject identity. The conventional approach is to construct paired training data and either train an edit-specific LoRA or perform full fine-tuning on it (InstructPix2Pix [Brooks et al., 2023], MagicBrush, UltraEdit, and so on). This study was conducted as a pre-experiment before deciding on that investment.

The author's existing HiDream-O1 + LoRA stack (kotonia03, kotonia02, kotonia01, grok_taste, lora_nipple_v1, lora_nipple_v2) was uniformly trained on T2I signal: the LoRA modules attach to the q/k/v/o and gate/up/down projections under the language_model, and the training data consist solely of {image, caption} pairs, with no edit triples {ref, instruction, target}. Nevertheless, the author had observed in an earlier internal evaluation (May 2026) that these LoRAs "do not degrade the edit / I2I pathway." The present study extends that observation to 5 axes × 240 samples and resolves whether the effect is mere "no degradation" or an explicit improvement.

2. Hypotheses

H1: On unified-VL architectures like HiDream-O1, T2I LoRAs produce a substantive quality improvement on the edit pathway.
H2: This property is specific to the unified-VL family and does not occur in siloed architectures (SDXL / FLUX dev).

H2 is not directly tested here; it is argued from architectural inspection and the existing literature.

3. Method

3.1 Model and Inference Pipeline

Base model: HiDream-O1-Image (8B parameters, Qwen3VL backbone)
Quantization: transformer weights resident in float8_e4m3fn (fp8-cast), upcast to bfloat16 at forward time
Inference path: an in-house HTTP inference server (Python, ThreadingHTTPServer) invoked through a Rust backend (/api/v1/images/generations)
Shared settings: 25 inference steps, guidance scale = 5.0, shift = 3.0, resolution 1024×1024, seed = 12345 (edit) / 7777–7779 (reference T2I)

3.2 LoRA Configurations (5 Conditions)

ID	Composition	Provenance
C0	base (no LoRA)	baseline
C1	kotonia03 @ 0.75	aesthetic LoRA trained on ~5,000 samples of SFW + mature mixed data
C2	grok_taste @ 0.7	taste-alignment LoRA trained on ~595 samples generated via Grok Imagine
C3	C1 + C2 (kotonia03 0.75 + grok_taste 0.5)	stacked via PEFT's `add_weighted_adapter(combination_type="cat")`
C4	C1 + lora_nipple_v2 (both @ 0.75)	stacked with a LoRA reinforcing anatomical detail

All LoRAs were trained on T2I signal only (flow-matching MSE on {image, caption} pairs); none have been re-trained on edit or multi-reference paired data.

3.3 Evaluation Axes (5 axes × 16 prompts)

Axis	Category	Prompts	Example
A1	Outfit substitution (SFW)	3	"Replace her outfit with a Japanese sailor-style high school uniform..."
A2	Outfit substitution (explicit prompt distribution)	3	(anatomical capability evaluation; treated as a separate axis to verify that the caption-vocabulary unlock methodology persists through the edit pathway)
A3	Scene substitution	4	"Place her in a snowy alpine mountain landscape at dusk..."
A4	Pose substitution	4	"Change the pose so she is in a martial-arts combat stance..."
A5	Multi-person composition	2	"Add a tall young man standing next to her, his arm around her shoulder..."

3.4 Reference Images

References were generated by T2I with the same prompt template and fixed seeds (7777, 7778, 7779) across three viewpoints (frontal, three-quarter, soft three-quarter). The LoRA configuration for reference generation was fixed at C1 (kotonia03) to guarantee subject identity across configurations under test.

3.5 Evaluation Procedure

We did not introduce quantitative metrics (FID, CLIP-similarity). The primary objective is a binary judgment ("does T2I LoRA function on the edit pathway?") plus qualitative aesthetic comparison; reference-based FID is poorly suited because the target distribution is undefined. Instead, for each axis we assembled a 9-row (3 refs × 3 prompts) or 12-row (3 refs × 4 prompts) × 5-column (C0..C4) grid montage and inspected four dimensions:

Identity preservation — is the subject from the reference preserved?
Instruction fidelity — is the textual instruction reflected?
Aesthetic quality — do lighting, composition, and texture reach commercial-grade standards?
Edit-pathway robustness — does the above hold consistently across axes?

4. Results

4.1 Outfit Substitution (A1)

Figure 1: Axis A1 (outfit substitution, SFW) comparison grid. 9 rows = 3 refs × 3 prompts, 5 columns = LoRA configurations C0..C4

All five configurations execute the instructed outfit change (sailor uniform, maid outfit, formal dress) faithfully. Subject identity is largely preserved. The aesthetic ranking observed is C3 (k03+grok stack) > C1 (k03) > C2 (grok) > C0 (base) ≈ C4 (k03+nipple). C3 reaches commercial-grade levels in key-light direction, skin texture, and fabric volumetrics simultaneously. C0 (base) executes the instruction but plateaus at flat lighting and mid-tier resolution feel.

4.2 Scene Substitution (A3)

Figure 3: Axis A3 (scene substitution) comparison grid. 12 rows = 3 refs × 4 prompts, 5 columns = C0..C4

Scene shifts (snowy mountains, classroom, medieval castle, neon alley) all succeed. Subject identity is good. C3 most strongly induces cinematic composition and color, particularly reproducing purple/cyan/magenta blends in the neon-alley prompt at near-film grade. C0 sometimes degenerates to "portrait with a hint of background," with the scene atmosphere only weakly conveyed.

4.3 Pose Substitution (A4)

Figure 4: Axis A4 (pose substitution) comparison grid. 12 rows = 3 refs × 4 prompts, 5 columns = C0..C4

All four pose categories — seated, lying, turning, combat stance — execute successfully. Identity preservation degrades slightly compared with other axes, which is the expected tradeoff for large posture deltas. The C4 (k03+nipple) stack exhibits a torso-elongation bias under the lying-pose prompt; this is attributable to a documented aspect-ratio distribution bias in the nipple_v2 training data.

4.4 Multi-Person Composition (A5)

Figure 5: Axis A5 (multi-person composition) comparison grid. 6 rows = 3 refs × 2 prompts, 5 columns = C0..C4

This was the axis predicted to be most demanding. In every configuration the secondary subject (a male or female partner) is added to the frame, and the relational instructions (arm around the shoulder, standing close) are honored. C0 (base) is most robust on subject identity; LoRA-stack configurations tend toward aesthetic drift that "idealizes" the subject's features. Notably, this evaluation was conducted on edit mode (one reference). The IP mode (multi-reference) typically used for subject-driven generation was not engaged. In other words, multi-person composition emerged from a single reference.

4.5 Capability Evaluation under Explicit Prompt Distribution (A2)

A2 is an axis designed to verify that anatomical depiction capability persists through the edit pathway. As background: in earlier work (May 2026, separately published) we documented a methodology under which the captioner's vocabulary repair unlocks anatomical capability in the base model. A2 confirms whether that unlock survives on the edit pathway. The corresponding figure is placed in the Supplementary Materials at the end of this article.

The five configurations exhibited the same instruction fidelity and aesthetic quality as A1. C0 (base) renders anatomical structure correctly, indicating that the caption-vocabulary unlock does not regress on the edit pathway. C4 (k03+nipple) again shows the elongation bias on lying poses, as in A4.

5. Discussion: Why T2I LoRA Functions on the Edit Pathway

HiDream-O1 is a unified vision-language transformer built on the Qwen3VL backbone. By design, text tokens, reference-image tokens, and target-image tokens all flow through the same layers of the same transformer. The LoRA adapters attach to q/k/v/o and gate/up/down_proj under language_model — concretely, the Linear modules selected by the language_model in n and "visual" not in n predicate at train_lora.py:167-173.

This attachment path is the key. Those same projections govern the cross-modal attention between reference and target tokens during edit inference. The LoRA delta acquired under the T2I signal ("generate the target image conditioned on text") modifies attention layers that, at edit time, also integrate {text, ref, target}. Therefore, T2I LoRA is automatically effective on the edit pathway.

This is a structural deduction, not an empirical surprise. The transfer was not anticipated at experiment-design time but is necessary once the architecture is examined.

6. Comparison: Siloed vs. Unified Architectures

Family	Architecture	T2I LoRA → edit transfer	Representative models
Unified VL	Text and image tokens processed in the shared layers of a single transformer	Automatic	HiDream-O1 (2026-05), OmniGen / OmniGen2 (2024-25), Qwen-Image (2025), FLUX.1 Kontext (2025-04), Gemini Image Gen / Nano Banana (2025)
Siloed	Separate text encoder (CLIP/T5) + separate diffusion backbone (UNet/DiT), connected through cross-attention	Structurally absent; edit requires an architectural addition (IP-Adapter, ControlNet, the additional channel in InstructPix2Pix)	SDXL (2023), SD3 (2024), FLUX dev / Schnell (2024)

In siloed architectures, T2I LoRA adds delta only to the UNet/DiT's T2I inference path. Achieving edit requires an additional mechanism to encode and inject the reference through a separate route, and T2I LoRA does not reach that injection path. This explains why the siloed ecosystem conceptually splits into "edit-specific LoRA," "edit-specific ControlNet," and "edit-specific IP-Adapter LoRA." That division disappears in the unified-VL setting.

7. Industry Trajectory

As of 2024, the SDXL-based ecosystem (ComfyUI, A1111, kohya-ss, sd-scripts, CivitAI distribution) was fully mature, while unified-VL systems remained in the research phase. The picture changed quickly in 2025:

2025-04: FLUX.1 Kontext (Black Forest Labs; unified-edit DiT)
2025: OmniGen2 (BAAI; unified VL)
2025: Qwen-Image (Alibaba; Qwen2.5-VL backbone)
2026-05: HiDream-O1-Image (Qwen3VL backbone; MIT license; open-weight)
2025–2026: Gemini Image Gen / Nano Banana (proprietary; unified)

Most major frontiers re-architected toward unified in close succession. This indicates that the broader industry is mid-migration from siloed to unified. One of the motivations is precisely what this article examined: the engineering simplification of integrating T2I, edit, subject-driven, and layout tasks under a single model.

8. Practical Implications for Indie Practitioners

LoRA training on a unified-VL model cannot directly inherit the siloed ecosystem's tooling. Major SDXL-side trainers (kohya-ss, sd-scripts, AI-Toolkit) presume UNet/DiT structure, and porting to a Qwen3VL backbone is non-trivial. In the author's case, training LoRAs on HiDream-O1 required writing train_lora.py (a ~260-line minimal trainer reverse-engineered from inference.py) and train_full.py (an 8-bit Adam full-fine-tuning trainer that fits on a 96 GB GPU) from scratch.

Ecosystem-wise, ComfyUI nodes, HiDream LoRA distribution on CivitAI, and corresponding SaaS inference services are all limited as of June 2026. Two implications follow:

Barrier to entry: For an individual practitioner to put unified-VL LoRA training + inference into production today requires writing the trainer and building the pipeline — a prerequisite cost on the order of hundreds of engineer-hours.
Window of temporal advantage: Because of (1), the number of individual practitioners who can in practice exploit the "T2I LoRA → edit auto-transfer" property documented here is small in mid-2026. Until that window closes (i.e., until ComfyUI nodes and distribution channels mature), individuals capable of training and deploying unified-VL LoRAs hold a structural moat.

The window will close as the ecosystem matures, but as of this writing it is clearly open.

9. Application: Direct Connection to Avatar Dynamic Narrative

The motivation for this study was the image-side capability of dynamic novel / TRPG mode. The result implies the following:

The four principal axes — outfit change, scene shift, pose change, secondary-character introduction — all reach commercial-grade quality with the existing LoRA stack + edit pathway.
Accordingly, the I2I-specific fine-tuning option (synthesizing edit pairs and training a dedicated LoRA) initially under consideration is unnecessary.
The next phase can concentrate on orchestration: LLM-driven intent detection (predicates such as "she changes outfit" or "the scene moves") during conversation, then an edit call via /api/v1/images/generations, then routing the result into the persona update pipeline.

This implication removes roughly one to two months of image-side investment from the avatar product roadmap.

10. Conclusion

A 240-sample controlled evaluation on HiDream-O1-Image confirms that LoRA adapters obtained from T2I training alone deliver practically useful quality improvements on the edit pathway (image-to-image instruction-following). This is a structural consequence of the unified vision-language transformer: text, reference image, and target image all share the same attention layers. The same automatic transfer is structurally absent in siloed architectures such as Stable Diffusion XL and FLUX dev.

The broader industry is migrating rapidly toward unified architectures from 2025 to 2026; in the open-weight space, HiDream-O1 is a leading candidate. At the same time, the set of individual practitioners who have built LoRA training and inference pipelines on this paradigm remains limited, and a window of temporal advantage exists before the ecosystem matures.

The reproduction of this evaluation is encapsulated in eval_i2i_avatar.py, and the entire procedure is executable through the public /api/v1/images/generations API.

Supplementary Materials

Figure 2: Axis A2 Capability Evaluation under Explicit Prompt Distribution

The following is the evaluation result of A2 referenced in §4.5. The grid is 9 rows (3 refs × 3 prompts) × 5 columns (C0..C4). The purpose of this axis is to confirm the persistence of anatomical depiction capability on the edit pathway and is not an aesthetic showcase. See §1 for the prior caption-vocabulary unlock methodology this axis relies on.

Click to reveal (explicit anatomical content; clinical evaluation only)

Figure 2: Axis A2 (outfit substitution, explicit prompt distribution) comparison grid

Reproducibility

Evaluation script: HiDream-O1-Image/eval_i2i_avatar.py
Inference API: POST /api/v1/images/generations (HiDream-O1-Image fp8-cast)
Auth: Authorization: Bearer kotonia_… (issue tokens at /api-manager)
Inference parameters: steps=25, guidance=5.0, shift=3.0, seed=12345 (edit) / 7777–7779 (refs), 1024×1024
The 240 individual PNGs and 5 axis-level montages live under outputs/i2i_avatar_eval/

The prior caption-vocabulary unlock methodology (2026-05)
HiDream-O1 T2I LoRA training procedure (2026-05)
Avatar dynamic-narrative image-side design

Automatic Transfer of T2I LoRA to the Edit Pathway in Unified VL Image Generation — A Controlled Evaluation on HiDream-O1

Abstract

1. Problem Setting

2. Hypotheses

3. Method

3.1 Model and Inference Pipeline

3.2 LoRA Configurations (5 Conditions)

3.3 Evaluation Axes (5 axes × 16 prompts)

3.4 Reference Images

3.5 Evaluation Procedure

4. Results

4.1 Outfit Substitution (A1)

4.2 Scene Substitution (A3)

4.3 Pose Substitution (A4)

4.4 Multi-Person Composition (A5)

4.5 Capability Evaluation under Explicit Prompt Distribution (A2)

5. Discussion: Why T2I LoRA Functions on the Edit Pathway

6. Comparison: Siloed vs. Unified Architectures

7. Industry Trajectory

8. Practical Implications for Indie Practitioners

9. Application: Direct Connection to Avatar Dynamic Narrative

10. Conclusion

Supplementary Materials

Figure 2: Axis A2 Capability Evaluation under Explicit Prompt Distribution

Reproducibility

Related Internal Work