I shipped Iris — an AI partner with a face and limbs — in one day. Meet kotonia-desktop.

A Tauri 2 desktop app where a silver-haired android sits in the corner, listens to your voice, runs bash on your machine, and reads the result back in her own voice while her face lip-syncs. Built in a day. A slightly cheeky hybrid cost model: ride competitors' loss-leader brain subscriptions, layer kotonia's near-unlimited voice/avatar on top.


TL;DR

Iris is a silver-haired android who waits in the bottom-right of your desktop. You ask her something out loud, she runs bash on your machine, and reads the result back in her voice while her face actually lip-syncs. Built in a day.

Honestly, feature-wise this is a degraded desktop clone of Claude Code. "Shell-running agent loop + prompt UI" is what Claude Code already does.

The differentiation is face and voice. Add those, and "a tool" becomes "a partner". The kotonia project's mission has always been making your daily development and your pursuit of a dream just a little more fun, just a little more positive — and Iris is the first physical implementation of that.

Public: https://github.com/zhener562/kotonia-desktop

What it does

You toggle Iris on and ask out loud: "Build me a simple shooter at output/shooting_game.html."

  1. Click the mic → speak → click again to stop
  2. Whisper / Qwen3-ASR transcribes into the prompt box
  3. Hit Enter; Iris runs bash 4-5 times in the background and writes the HTML file
  4. She replies in her voice: "I wrote shooting_game.html. WASD to move, space to fire."
  5. Right next to shooting_game.html in her reply, a ▶ play button shows up — click it and the game opens in a side pane
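Step 5 implies the UI scans the reply text for generated files. Here is a minimal sketch of that scan, assuming plain regex matching on `.html` paths; `findPlayableFiles` is a hypothetical name for illustration, not the code in main.js.

```javascript
// Hypothetical helper: find .html file references in an agent reply so the UI
// can attach a play button next to each one.
function findPlayableFiles(replyText) {
  // Path-like tokens ending in ".html", e.g. "output/shooting_game.html".
  const pattern = /[\w.\/-]+\.html\b/g;
  const seen = new Set(); // de-duplicate repeated mentions, keep first-seen order
  for (const match of replyText.matchAll(pattern)) {
    seen.add(match[0]);
  }
  return [...seen];
}
```

Each returned path would then get its own ▶ button. A real implementation would presumably also check that the file exists before offering to open it.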

Under the hood:

  • Local bash execution ← kotonia-cli (a Rust ReAct agent)
  • LLM inference ← Gemma 4 26B hosted on kotonia.ai (or Claude Code, see below)
  • TTS ← Qwen3-TTS via kotonia.ai (Ono_Anna voice, natural pronunciation for JP+EN mixed text)
  • Avatar video ← Ditto-TalkingHead via kotonia.ai (lip-sync at 25 fps)
  • ASR ← Qwen3-ASR 1.7B via kotonia.ai
  • Everything streams, first byte ~60 ms, basically zero wait

The slightly cheeky cost structure

This is the strategic core. The user picks Iris's brain (LLM):

| Use case | Choice | Cost vibe |
| --- | --- | --- |
| Heavy thinking / complex refactor | Claude Code (Pro $20/mo) | Anthropic is loss-leading hard; smartest brain on the market |
| Daily routine / fast replies | kotonia hosted Gemma 4 26B (low monthly sub, effectively unlimited) | Kotonia's own GPU, near-zero marginal cost |
| Raw API | OpenAI-compat (DeepSeek etc.) | Pay-per-token, fine for coding |

Here's the cheeky part: the big AI labs offer their frontier-model subscriptions at a deep loss as a market-grab ($20/mo for Claude Code is almost certainly red ink for Anthropic). kotonia-desktop uses that subsidized brain as its premium thinker, while serving voice / avatar / routine LLM from Kotonia's own GPUs at fixed cost, effectively unmetered.

Typical per-user setup:

  • Heavy thinking → Claude Code Pro $20/mo (smart; the user is openly riding on Anthropic's loss-leader)
  • Light execution / TTS / avatar / ASR / routine → kotonia ~$3-5/mo (Kotonia's own GPU, marginal cost ≈ 0)

Total roughly $23-25/mo ($20 + $3-5) for "always-on Iris partner who runs bash, talks back, and shows a face that moves while talking."

Kotonia alone can't ship that brain quality (it has no financial muscle to run a loss-leader). Claude Code alone has no face or voice. Stacking the two is what makes this work. "Deliberately freeloading on someone else's deep-loss market grab" is, strategically, a little cheeky, but from the user's seat it's the rational pick. (If Kotonia's voice / avatar / fast-API side ever gets popular enough to overload our servers, that's a happy problem we'll deal with when we get there.)

Engine switching is supported in kotonia-cli, the Rust crate we use as the desktop's library. Three brain backends already work — Claude Code integration, OpenAI-compat, and the Kotonia hosted path. The desktop ships with Kotonia hosted as the default; swapping to a premium brain is a config change away.

The leap from CLI to embodied partner

When kotonia-cli shipped, the "support for challengers" goal was already met, at least in form. Voice in → bash → answer worked, all inside the terminal. But as long as the agent lives inside a terminal, it can't escape being "a tool". The user-tool relation persists.

Add a face and limbs, and finally it becomes an embodied AI partner. Iris waits in the corner, replies in her voice, her mouth moves. Tiny shift in nominal feature surface, but the psychological distance changes. "I'm calling a tool" becomes "I'm asking my partner".

That's the actual pivot. Technically it's all existing pieces glued together, but the product identity flipped from "platform" to "character". kotonia.ai's web product remains a platform with many personas; kotonia-desktop ships as "Iris, the single named character". That's deliberate.

Why it shipped in a day

The first commit landed at 8:15 AM and the last at 11:18 PM: 23 commits in total, covering P0 (persona placement) → P1 (TTS playback) → P2 (mic + STT) → P3 (Ditto avatar lip-sync), all in one day.

The speed came from compound interest on accumulated assets:

  • The voice / Ditto / TTS / ASR endpoints on kotonia.ai have been baking since last year — mature, stable
  • kotonia-cli was published last month, and its agent loop is already a self-contained library that can be reused in-process
  • The Iris portrait (the bottom-right android) was generated months ago by HiDream-O1-Image for an unrelated project
  • Auth (the device_token from kotonia-cli login) already existed; one small fallback added to the backend unlocked all endpoints

What I actually wrote in one day was the Tauri shell + the wire-up to each endpoint + the UI plumbing. The logic was assembled from existing pieces. The tipping point for solo dev compounding is right here: when your accumulated assets stop being things you are still building and become things your next project simply uses. Today was that day for kotonia.

Traps worth recording (3 of them)

1. split_mixed_languages was secretly killing streaming

The Qwen3-TTS Python proxy had a feature I'd added a while back for an English-learning persona: "For JP+EN mixed text, split by language and synthesize each run with its native-language pronunciation". Correct behavior — the word shell actually gets read with an English accent.
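To make "split by language" concrete, here is a rough sketch of a run splitter. It is written in JavaScript for consistency with the desktop code, not in the proxy's Python, and it is not the proxy's actual logic: `splitMixedLanguages` here is an illustrative function that classifies characters by Unicode script and glues neutral characters (spaces, digits, punctuation) onto whichever run is currently open.

```javascript
// Japanese scripts vs. Latin letters; everything else counts as "neutral".
const JA = /[\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Han}]/u;
const LATIN = /\p{Script=Latin}/u;

// Split text into consecutive runs tagged "ja" or "en".
function splitMixedLanguages(text) {
  const runs = [];
  let lang = null; // language of the run being built (null until the first letter)
  let buf = "";
  for (const ch of text) {
    const chLang = JA.test(ch) ? "ja" : LATIN.test(ch) ? "en" : null;
    // A letter in the other language closes the current run.
    if (chLang !== null && lang !== null && chLang !== lang) {
      runs.push({ lang, text: buf });
      buf = "";
    }
    if (chLang !== null) lang = chLang;
    buf += ch;
  }
  // Text with no letters at all falls back to "en" (an arbitrary choice).
  if (buf) runs.push({ lang: lang ?? "en", text: buf });
  return runs;
}
```

Under these assumptions, "shellを実行" splits into an English run "shell" and a Japanese run "を実行", so each run can go to the synthesizer with its own language setting.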

But the implementation was "synthesize every run → concatenate PCM → yield one chunk at the end", which silently disabled streaming: time to first byte became the full synthesis time. Iris's answers are wall-to-wall technical-vocab mixed text, so this path triggered on every utterance, producing multi-second silence and then a sudden burst of speech.

First fix was a workaround in the desktop: split_mixed_languages: false, escape into the streaming path at the cost of having shell read as シェル in a Japanese accent. Then I went back and rewrote the Python side to yield each run as it lands — the proper fix. Biggest debug of the day.
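The difference between the two shapes is easy to see in a toy model. The sketch below uses synchronous JavaScript generators and a fake `synthesize` callback (both are stand-ins; the real proxy is Python and the real synthesis is asynchronous) to count how much synthesis work happens before the first chunk reaches the caller.

```javascript
// Buggy shape: synthesize every run, concatenate, yield one chunk at the end.
function* synthesizeBuffered(runs, synthesize) {
  const parts = runs.map((run) => synthesize(run));
  yield parts.flat();
}

// Fixed shape: yield each run's audio as soon as that run is ready.
function* synthesizeStreaming(runs, synthesize) {
  for (const run of runs) {
    yield synthesize(run);
  }
}

// How many synthesis calls must finish before the first chunk is available?
function callsBeforeFirstChunk(makeStream, runs) {
  let calls = 0;
  const synthesize = (run) => {
    calls += 1;
    return [run.length]; // fake "PCM": one sample per run
  };
  makeStream(runs, synthesize).next(); // pull only the first chunk
  return calls;
}
```

With three runs, the buffered version needs all three synthesis calls before anything can play, while the streaming version needs one. Both produce the same audio in total; only the latency to first sound differs.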

Lesson: feature flags you add for "the right reason" need their default's blast radius explicitly written down. Six months later you'll trip on it from a different use case.

2. Ditto avatar frames burst, then froze halfway through

The Ditto server emits frames as fast as the GPU can produce them (bursty); audio chunks meanwhile get scheduled into the AudioContext future (50 chunks arrive in 100 ms, but they're scheduled to play across 2 seconds).

We were displaying every frame the instant the bytes arrived, so 200 ms of wall time would consume all 50 frames (2 s of video at 25 fps), and then the face would sit frozen on the last frame for the remaining ~1.8 s of audio playback.

The fix is "FPS pacing anchored to wall clock": when the first frame lands, pin the wall clock; frame N then displays at pinTime + N/25 sec via setTimeout. The arrival burst is absorbed by the browser's timer queue (the pending setTimeout callbacks); display marches at audio tempo.

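const DITTO_FPS = 25; // avatar frame rate
// Frame N is due at pinTime + N/25 s on the performance.now() clock.
const targetMs = dittoStreamStartMs + (myFrameIndex * 1000) / DITTO_FPS;
const delayMs = Math.max(0, targetMs - performance.now()); // late frames show immediately
setTimeout(() => { /* show frame */ }, delayMs);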

Under 20 lines. The same fundamental bug shows up in any app doing AV stream lip-sync — generalizable lesson.
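For anyone hitting the same bug, here is a slightly fuller, self-contained sketch of the pacing idea. It is not the code from main.js: `createFramePacer` and its injectable `now` / `schedule` options are illustrative names, added so the timing logic can be exercised without a browser.

```javascript
const DITTO_FPS = 25;

// Wall-clock-anchored frame pacer: frame N is shown at startTime + N / fps.
function createFramePacer({
  fps = DITTO_FPS,
  now = () => performance.now(),
  schedule = setTimeout,
} = {}) {
  let streamStartMs = null; // pinned when the first frame arrives
  let nextIndex = 0;
  return {
    // Call once per arriving frame; returns the delay that was scheduled.
    push(showFrame) {
      if (streamStartMs === null) streamStartMs = now();
      const targetMs = streamStartMs + (nextIndex * 1000) / fps;
      nextIndex += 1;
      const delayMs = Math.max(0, targetMs - now()); // late frames show at once
      schedule(showFrame, delayMs);
      return delayMs;
    },
    // Call between utterances so the next stream pins a fresh start time.
    reset() {
      streamStartMs = null;
      nextIndex = 0;
    },
  };
}
```

If three frames arrive in the same instant, they are scheduled 0, 40, and 80 ms out, which is the 25 fps cadence regardless of how bursty the arrival was. Anchoring every frame to the pinned start time, rather than to the previous frame's timer, keeps scheduling jitter from accumulating over a long utterance.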

3. Three traps from Linux WebKitGTK in one day

Tauri 2 + Linux is a minefield. Three traps in one day:

  • getUserMedia silent deny: click mic, nothing happens. WebKitGTK silently rejects media permission requests by default (no browser-style "allow microphone?" dialog — it just dies as NO). Fix: hook the WebKitWebView::permission-request signal in Rust and explicitly .allow().
  • High-frequency textContent swap eats click events: I was updating the recording-elapsed seconds on the button's textContent every 100 ms; 7-8 out of 10 clicks on "stop" silently dropped. WebKitGTK has a hit-test race during DOM mutation. Split the dynamic text into a child span so the button element itself never gets mutated post-mount.
  • Japanese IME preedit invisible (this one stayed unsolved): Wayland session + ibus-mozc; the conversion-in-progress hiragana never appears in the textarea. Tried GDK_BACKEND=x11, WEBKIT_FORCE_SANDBOX=0, GTK_IM_MODULE=xim — not one helped.

Lesson from the trio: assume Linux WebView silently drops something by default. The same feature that just works on macOS/Windows will need Linux-specific countermeasures.
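The child-span workaround for the dropped clicks looks roughly like this. It is a sketch rather than the shipped code: `mountStopButton` is a made-up name, and the document object is passed in as a parameter only so the function can be exercised outside a WebView.

```javascript
// Build a stop button whose own node is never touched after mount.
// Only the inner "elapsed" span's text changes while recording.
function mountStopButton(doc, parent) {
  const button = doc.createElement("button");
  const label = doc.createElement("span");
  const elapsed = doc.createElement("span");
  label.textContent = "Stop ";
  button.append(label, elapsed);
  parent.append(button);
  return {
    button,
    // Safe to call every 100 ms: mutates the child span, not the button.
    setElapsed(ms) {
      elapsed.textContent = `${(ms / 1000).toFixed(1)}s`;
    },
  };
}
```

In the app this would be called as `mountStopButton(document, someContainer)`, with `setElapsed` driven from the recording timer. The click listener stays on `button`, which now keeps the same child nodes for its whole lifetime.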

What I cut

To ship in a day, I deliberately cut:

  • React + Vite + TypeScript: all vanilla JS, main.js is ~700 lines, no build step. The web app's useVoiceChat.ts is a 1587-line monolith hook; rewriting just the pieces I needed in vanilla took 200 lines.
  • Persona picker: no "choose your character" screen. Iris is fixed. This is what flips kotonia-desktop from "platform" to "character" and locks the product identity.
  • VAD (auto-detect end of speech): explicit click-to-stop, and the transcript only gets inserted into the prompt box — no auto-submit. Given the high stakes of agent bash execution, this enforces a user-review step so a misheard transcript can't fire rm -rf.
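The review step from the VAD bullet can be sketched as a small gate between ASR and the agent. This is an illustration under assumptions, not the real wiring: `createPromptGate`, `promptBox`, and `runAgent` are hypothetical names, and `promptBox` stands for anything with a `value` property, such as the prompt textarea.

```javascript
// Transcripts only ever fill the prompt box; running the agent requires an
// explicit Enter keypress from the user.
function createPromptGate(promptBox, runAgent) {
  return {
    // ASR result: place it in the box for review, never execute it.
    onTranscript(text) {
      promptBox.value = text;
    },
    // Keydown handler: submits the (possibly edited) text on Enter only.
    // Returns true if the agent was started.
    onKeyDown(event) {
      if (event.key !== "Enter") return false;
      const prompt = promptBox.value.trim();
      if (prompt === "") return false;
      runAgent(prompt);
      promptBox.value = "";
      return true;
    },
  };
}
```

The point of the design is that no code path leads from `onTranscript` to `runAgent`. A misheard command sits in the box until a human has looked at it and pressed Enter.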

The governing principle was "don't try to reuse the feature, reuse the API contract". 80% of the web app is out of scope for the desktop (vision input, multi-persona switching, wiki, server-side session resume…), and forcing them in would have made the thing larger, not smaller.

Try it

git clone git@github.com:zhener562/kotonia-cli
git clone git@github.com:zhener562/kotonia-desktop
cd kotonia-desktop/src-tauri
cargo tauri dev

(Clone kotonia-cli as a sibling — the desktop uses it as a path dependency. See the repo README.)

Prereqs:

  • Run kotonia-cli login to device-auth into kotonia.ai (one device_token unlocks LLM / TTS / Ditto / ASR)
  • On Linux: libwebkit2gtk-4.1-dev + libdbus-1-dev + GStreamer audio plugins
  • On macOS / Windows: just the Tauri 2 prereqs

What's next

  • A configurable hotkey UI (currently click-only)
  • Session-list sidebar
  • macOS / Windows code signing
  • Higher-quality Iris voice once Qwen3-TTS Base (zero-shot voice clone) is brought back online
  • Engine-switcher UI in the desktop to expose the kotonia-cli engine choice

Launch itself is done. If you want a partner who makes your daily development and your dream-chase just a little more fun, the clone commands above will land Iris in your bottom-right corner inside five minutes.
