Kotonia Articles

I built a terminal-native creative agent in a day — for verbal thinkers who keep getting beaten up by visual-first tools

kotonia-cli, written in a day: bash as the only tool, image generation that returns in 3 seconds, native ffmpeg orchestration. The category gap that Claude Code and Codex never filled. The only place a solo dev can beat the frontier labs isn't model quality — it's lossless UX for verbal thinkers.

By Shinji Shimizu | 2026-06-21 | 10 min read

#agent #cli #indie-dev #creative-coding #rust #llm


Three lines into a terminal. Three seconds of waiting. The image hit my disk.

Not once did I open a GUI. No Midjourney visit, no ComfyUI node graph, no ChatGPT browser tab with a download button to hunt for. I exported a KOTONIA_API_KEY, launched kotonia-cli, and said in Japanese: "generate a cute anime girl, save as ./bishoujo.png".

Getting there took a day. I wrote a CLI agent from scratch in Rust, wired it into my own /api/v1 image-generation endpoint, taught the agent to issue curl through bash, and let the generated PNG land in ./bishoujo.png. Eight commits, 31 unit tests, roughly 2,000 lines of Rust.

The moment the setup actually worked, I became convinced that this fills a category gap nobody else is filling. The rest of this article dissects why.

1. The empty intersection — code agents and image gen never meet

As of June 2026, developer-facing agent products cluster into three groups:

Cluster	Representatives	Can do	Can't do
Code agent (CLI / IDE)	Claude Code, OpenAI Codex CLI, Cursor, Aider	Read / write / run code, shell tools, git	Generate images / audio / video. Article thumbnails. Storyboards.
Image / video gen (GUI)	Midjourney, Stable Diffusion WebUI, ComfyUI, Runway	Visual artifact generation, deep control, high quality	No shell, no git, no code. Not conversational agents.
Chat with image (App)	ChatGPT desktop, Claude desktop, Gemini app	Chat + image gen, GUI-centric	Not CLI, can't reach into your local shell, can't touch files directly

Among the GPT-based options, the macOS Codex variant offers image generation through its GUI, and its absolute model quality is (probably) better than mine. But a single binary that combines "everything completes in the terminal", image / audio / video generation bundled by default, and full shell access is something I have never come across.

That's the category gap.

2. Center of the Venn diagram — where kotonia-cli lives

A rough enumeration of what kotonia-cli ships with:

One tool: bash. The agent can issue any shell command.
Three approval modes (all / allowlist / auto) for adjustable safety margin
Git worktree sandbox by default, --in-place for direct edits
Bidirectional REPL with conversation history persisted across turns
Session JSONL written to ~/.kotonia/sessions/<id>.jsonl, full resume via --resume <id>
Multi-provider (V4-Flash local / Gemma 4 26B local / DeepSeek API / more on deck)
When KOTONIA_API_KEY is exported, the kotonia /api/v1 image/audio/video endpoints are auto-published into the agent's system prompt

That last point is what puts kotonia-cli outside all three clusters above.

As a code agent it has full bash + git + cargo + pytest + npm reach. But it also has the door open to image / audio / video generation, which code agents normally don't. There's no node graph to assemble like in image-gen GUIs — natural-language instructions get translated by the agent into curl + JSON body.

Aside: "Why doesn't Claude Code generate images?" — for Anthropic, the answer is sound business. Their compute is GPU-heavy and bundling image generation distorts the pricing model. The territory they refuse to enter is exactly where a solo developer gets a defensible niche.

3. Verbal-first vs visual-first — the cognitive tax of tools

Existing video editing tools, image editing tools, and design tools are — obviously — optimized for visual thinkers. Grab a timeline, layer up, draw a Bezier curve, push pixels with the mouse. For people who think visually, that's a natural interaction model.

But there's a non-trivial population of verbal-first players. They write, explain, argue, design, code. Input and output are both fundamentally text. Engineers, writers, PMs, researchers, editors, founders — a huge chunk of knowledge workers live here.

These people occasionally have to produce a visual artifact. The article needs a thumbnail. The generated video needs a storyboard. The deck needs a diagram. The product needs a logo prototype.

This is the moment verbal-first players hit a wall.

In Premiere, they have to track a three-layer timeline by eye, add tracks, drop two keyframes for the fade-in, then tweak the Bezier curve. In Unreal Engine's Blueprint or Unity's Visual Scripting, logic is built from lines and nodes. That is natural for the visually tuned, but for the verbal mind it means "handing the brain over to a hostile UI": torture-by-tool that burns hours.

What's actually happening is a mode-switch tax. From the language mode you use to write code, you're forced to switch into the visual mode the tool demands. Over a single task, you switch many times. The brain's resources melt into "conforming to the tool" instead of doing creative work.

kotonia-cli, this thing I built today, drives that tax close to zero.

Input is natural language: "Make it a girl, not a cat — try again." "Cut to 30 seconds vertical, add a fade-in, burn the caption '試したけどダメだった' in yellow."
Output is a file in the workspace: ./bishoujo.png, ./short.mp4, ./narration.wav
No mode switching: you never leave the terminal, never open a GUI, never visit a browser

Living inside the local shell, you conceive in language, instruct in language, and only the result arrives as a visual artifact. This structurally lowers the creator's cognitive load. I think the social significance is real: people who used to give up because they couldn't switch into visual mode (researchers wanting a diagram for the spec, bloggers needing a thumbnail, non-engineers wanting a voice file attached to email) can now produce these artifacts purely from text.

4. ffmpeg as a native multiplier

If kotonia-cli only added image generation, others could catch up. The real differentiator is that the agent can drive ffmpeg in the shell as naturally as any other command.

For example, "make it a 30-second vertical video, fade in at the start, burn the caption '試したけどダメだった' in yellow" expands into a chain like this:

POST /api/v1/videos/generations (LTX-2, 768×512, async job)
Poll until complete
Download
ffmpeg -i in.mp4 -vf "fade=in:0:30,drawtext=...:fontcolor=yellow" out.mp4
Save ./final.mp4 to the workspace

The agent assembles the entire chain in one turn. The user's keystroke load: one prompt. The agent writes the curl, picks the ffmpeg filter syntax, manages the temp files. Unhappy with the result? "Make the fade quicker." Next turn, the agent substitutes only the ffmpeg step and re-renders.
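To make the "substitutes only the ffmpeg step" idea concrete, here is a minimal Rust sketch of how such a filter chain could be composed. It is not kotonia-cli's code: the agent itself writes the equivalent shell command. The EditSpec struct, the ffmpeg_args and run_ffmpeg functions, the crop expression, the font size, and the caption position are all assumptions for illustration; only the fade-in, the yellow caption, and the 30-second limit come from the example prompt.

```rust
use std::process::{Command, ExitStatus};

/// Hypothetical edit parameters for one render.
#[derive(Clone)]
struct EditSpec {
    fade_in_secs: f64,
    caption: String,
    font_color: String,
    max_secs: u32,
}

/// Build the ffmpeg argument list for one edit pass.
/// Captions containing characters that need ffmpeg-level escaping are
/// rejected instead of being escaped, to keep the sketch honest.
fn ffmpeg_args(input: &str, output: &str, spec: &EditSpec) -> Result<Vec<String>, String> {
    let needs_escaping = |c: char| matches!(c, '\'' | ':' | '\\' | '%' | ',');
    if spec.caption.chars().any(needs_escaping) {
        return Err(format!("caption needs ffmpeg escaping: {}", spec.caption));
    }
    // crop: center-crop to 9:16 (assumed meaning of "vertical").
    // fade: fade in from t=0 over `fade_in_secs` seconds.
    // drawtext: caption centered near the bottom; font selection is omitted,
    // so a font with Japanese glyphs must be available to ffmpeg.
    let filter = format!(
        "crop=ih*9/16:ih,fade=t=in:st=0:d={},drawtext=text='{}':fontcolor={}:fontsize=48:x=(w-text_w)/2:y=h-text_h-40",
        spec.fade_in_secs, spec.caption, spec.font_color
    );
    let max_secs = spec.max_secs.to_string();
    // -y overwrites the output, -t caps the output duration.
    let args = ["-y", "-i", input, "-t", max_secs.as_str(), "-vf", filter.as_str(), output];
    Ok(args.iter().map(|s| s.to_string()).collect())
}

/// Run ffmpeg directly (no shell involved, so no shell quoting is needed).
fn run_ffmpeg(args: &[String]) -> std::io::Result<ExitStatus> {
    Command::new("ffmpeg").args(args).status()
}
```

With this shape, "make the fade quicker" changes only fade_in_secs, so every argument except the -vf filter string stays identical and the already-downloaded in.mp4 is reused rather than regenerated.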

This makes asset generation + editing a continuum. Normally asset generation (DALL-E / Midjourney) and editing (Premiere / DaVinci) are separate tools with a download → upload → timeline-placement migration in between. kotonia-cli compresses all of that into a single conversation inside one shell.

ffmpeg is the world's media-processing Swiss army knife, its command syntax is 100% text, and it pairs extremely well with an agent. "Agent driving ffmpeg from natural language", shipped by default in a CLI — I haven't seen this packaged anywhere else.

5. How it's built

For a Rust CLI agent, the internals are simple. infrastructure/execution/host.rs handles bash execution; application/kotonia_agent/ holds the agent loop, approval policy, worktree manager, parser, history, and provider.

Agent loop: delimiter-based. The LLM emits <<<BASH>>>...<<<END_BASH>>> or <<<FINAL_ANSWER>>>...<<<END_FINAL_ANSWER>>>. Native tool calling is on the future list (waiting on V4-Flash's chat_template — the journey of running V4-Flash at home is documented in The day my 16 cores finally roared), but delimiters work against any OpenAI-compatible backend.
Approval model: three modes, all selectable. auto = unattended; allowlist = auto-pass read-only / build / test commands and gate destructive ones (rm -rf, git push --force); all = every command requires approval. Beginners run all and the operator runs auto, so the same CLI scales across both audiences.
Worktree by default: git worktree add /tmp/kotonia-agent-<uuid> for isolation. Mistakes don't touch the main checkout. --in-place for direct cwd editing when wanted.
Session JSONL: ~/.kotonia/sessions/<id>.jsonl holds metadata + every message + bash observation + turn markers. --resume <id> for full replay. When long sessions approach the context cap, you can split them across processes.
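The first two items can be sketched as one step of the loop: parse the model's reply into an action, then decide whether a bash command needs human approval. This is a minimal sketch under stated assumptions, not the contents of application/kotonia_agent/. The delimiters and the three mode names come from the list above; the Action enum, the function names, and the SAFE_PREFIXES / DESTRUCTIVE tables are hypothetical, and how allowlist mode treats a command that is neither allowlisted nor destructive is an assumption (here it asks).

```rust
/// One parsed agent action (hypothetical type).
#[derive(Debug, PartialEq)]
enum Action {
    Bash(String),
    FinalAnswer(String),
    Unparsed,
}

/// Return the trimmed text between the first `open` and the next `close`.
fn between<'a>(text: &'a str, open: &str, close: &str) -> Option<&'a str> {
    let start = text.find(open)? + open.len();
    let end = text[start..].find(close)? + start;
    Some(text[start..end].trim())
}

/// A complete BASH block wins over a FINAL_ANSWER block if both appear.
fn parse_action(reply: &str) -> Action {
    if let Some(cmd) = between(reply, "<<<BASH>>>", "<<<END_BASH>>>") {
        Action::Bash(cmd.to_string())
    } else if let Some(ans) = between(reply, "<<<FINAL_ANSWER>>>", "<<<END_FINAL_ANSWER>>>") {
        Action::FinalAnswer(ans.to_string())
    } else {
        Action::Unparsed
    }
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum ApprovalMode {
    All,
    Allowlist,
    Auto,
}

// Hypothetical tables; the article only names `rm -rf` and `git push --force`.
const DESTRUCTIVE: [&str; 2] = ["rm -rf", "git push --force"];
const SAFE_PREFIXES: [&str; 6] = ["ls", "cat", "git status", "git diff", "cargo build", "cargo test"];

/// True when `command` is exactly `prefix` or continues with a space after it.
fn starts_with_word(command: &str, prefix: &str) -> bool {
    command == prefix
        || command.strip_prefix(prefix).map_or(false, |rest| rest.starts_with(' '))
}

/// Decide whether a human must confirm the command before it runs.
fn needs_approval(mode: ApprovalMode, command: &str) -> bool {
    match mode {
        ApprovalMode::All => true,
        ApprovalMode::Auto => false,
        ApprovalMode::Allowlist => {
            let c = command.trim();
            let destructive = DESTRUCTIVE.iter().any(|m| c.contains(m));
            let safe = SAFE_PREFIXES.iter().any(|p| starts_with_word(c, p));
            destructive || !safe
        }
    }
}
```

Substring checks like these are easy to bypass and are no substitute for the worktree isolation; they only illustrate where the gate sits in the loop.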

And the headline of this article, the B1 integration with /api/v1:

# Just export the key — the agent's system prompt automatically gains
# the four endpoints. Zero code changes from the user's side.
export KOTONIA_API_KEY=kotonia_xxxx
kotonia-cli "make a cute anime girl, save as ./bishoujo.png"

The implementation is 30 lines. We append the curl shapes for kotonia's /api/v1/{images,audio,videos}/generations to the system prompt as inline curriculum. Native tool-calling (the B2 approach) would have been a 200-line refactor, but going through bash gets us the same capability density in 30 lines. The agent assembling jq | base64 -d > ./out.png is in the LLM's standard repertoire — trusting that, the integration stays light.
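A minimal sketch of that idea, assuming the gate is simply "is KOTONIA_API_KEY set?". The function names, the base_url parameter, and the prompt wording are hypothetical; only the environment variable and the three generation paths spelled out above come from the text, and the real 30 lines may look quite different.

```rust
/// Generation endpoint families named in the article.
const GENERATION_KINDS: [&str; 3] = ["images", "audio", "videos"];

/// Build the extra system-prompt section, or None when no key is available.
/// The key value itself is never copied into the prompt.
fn api_prompt_section(api_key: Option<&str>, base_url: &str) -> Option<String> {
    let key = api_key?.trim();
    if key.is_empty() {
        return None;
    }
    let mut section =
        String::from("You can call these HTTP endpoints through bash with curl:\n");
    for kind in GENERATION_KINDS.iter() {
        section.push_str(&format!("- POST {}/api/v1/{}/generations\n", base_url, kind));
    }
    section.push_str(
        "Authenticate with the KOTONIA_API_KEY environment variable; never print its value.\n",
    );
    Some(section)
}

/// Append the section to the base prompt only when the key is exported.
fn build_system_prompt(base_prompt: &str, base_url: &str) -> String {
    let key = std::env::var("KOTONIA_API_KEY").ok();
    match api_prompt_section(key.as_deref(), base_url) {
        Some(extra) => format!("{}\n\n{}", base_prompt, extra),
        None => base_prompt.to_string(),
    }
}
```

Because the capability lives in prompt text rather than in a tool schema, the agent needs no new tool: it reaches the endpoints with the same bash it already uses for everything else.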

The image-generation backbone itself (HiDream-O1-Image resident at 8-bit fp, photorealistic LoRAs + 3-stage caption pipeline that pulled quality up) is documented in Breaking the under-fit ceiling with 3-stage caption. The bishoujo.png at the top of this article rides on that foundation.

(The repository is currently closed; the CLI portion will be carved out as a standalone open-source project. I'll publish a follow-up when it's ready.)

6. Strategic implication — the only territory where a solo dev outflanks Anthropic / OpenAI

As I wrote at the start, the code-agent market is dominated by Claude Code / OpenAI Codex. Competing on model quality is a non-starter for a solo dev. Competing on infrastructure-scale pricing is also impossible (DeepSeek API runs at ~$0.27/M tokens — local self-hosting the same call costs 200× more). The trap of "I can run V4-Flash locally, so I should sell my own API" was dismantled when I made Fable 5 adversarially review my strategy and it found four holes — see The AI told me to stop writing tech articles.

So what's left to fight on? Lossless UX for verbal thinkers is a territory the frontier labs structurally won't prioritize. Their main revenue source (B2B developer APIs) is aimed at users who already live in a code IDE or GUI — Claude via Cursor, GPT via VSCode extension, image gen via the ChatGPT browser. With a GUI assumed on the outside, there's little incentive to polish the experience of a verbal-language-completes-everything CLI.

From the solo-dev side, that gap is exactly polishable as "a creative agent that completes everything in a CLI". The differentiation axes:

CLI completeness: never have to open a GUI
Generative tool bundle: image / audio / video are defaults, no additional sign-ups (one key covers everything)
ffmpeg multiplier: asset generation + editing are a continuum in the same shell
Verbal-thinker fit: never hand the brain to visual mode, completes in text
Local shell freedom: agent can run cargo build / git fetch / curl / ffmpeg / anything; no E2B-style isolated sandbox limits

Ship this set as one CLI and the verbal-first share of Claude Code users (those who chose the "never leave the terminal" workflow, which I take to be a non-trivial share of the population) no longer needs to open a separate tool for thumbnails or storyboards.

By the way, this feeling of "solo dev compounding starts kicking in" was articulated in a different context recently: The day a solo developer's accumulated assets finally compounded. That kotonia-cli was buildable in a day is the sequel to the same structure — months of memory / design decisions / the pre-existing /api/v1 finally got leveraged in full, via an agent.

I genuinely believe the social impact is large. Creativity that used to be stranded behind "can't switch into visual mode" becomes text-translatable into visual outputs.

7. What's next

Add /api/v1/chat/completions on the kotonia side — so users don't need a separate DeepSeek API key. One kotonia key handles the agent's LLM loop too. Gemma 4 26B Uncensored (fast + highly parallel; Agentic Index is a middling 11, but general-efficiency tasks are completely within its envelope) gets OpenAI-compat multiplexed.
ffmpeg examples in the system prompt — pre-load curated prompt patterns so "add a fade", "burn captions", "mix BGM" land in one-line prompts.
Distribution format — currently build-from-source only. cargo install and GitHub Releases binary distribution are next, starting with Linux x86_64, macOS / Windows after CI is built out.
Backport to /chat/studio — reflect the CLI-polished agent loop into the web side. The "lossless UX" work mirrors what I did dropping voice LLM latency from 600ms → 22ms (voice-first-local-llm). "Pick an X-first design and strip every other screen/layer to match it" is a principle that translates directly to creative-cli.

This might look like an article about generating one bishoujo.png. But over the next six months I expect "how close to zero can we drive the cost of converting between language and the visual?" to be the central product question for solo developers. The conviction crystallized in one day, and I wanted to record it.