--- name: local-piper-tts-multilang-secure description: Local offline text-to-speech via Piper TTS. Self-contained setup, automatic language detection, per-call voice selection. Extensible to any language. Writes output into the OpenClaw workspace. metadata: {"openclaw":{"emoji":"🔊","requires":{"bins":["ffmpeg","python3"]}}} --- # local-piper-tts-multilang-secure ## Description Local (offline) text-to-speech via Piper. **Purpose:** generate audio files (OGG/Opus by default) from text, fully offline. **No sending** is performed by the skill — sending is handled by the agent after the file is ready. ## Features - Fully offline (no API keys) - Self-contained setup via `setup()` — installs Piper into an isolated venv, no system-wide changes - Automatic language detection for 20+ languages with English as default - Per-call voice selection via `voice` parameter - On-demand voice download via `downloadVoices()` — no models bundled, choose what you need - Voice removal via `removeVoice()` — clean up voices you no longer want - Extensible: add any language by installing a Piper `.onnx` model - Writes outputs into OpenClaw workspace ## First-run flow — full agent procedure Follow this sequence exactly when the user asks to use TTS for the first time in a setup context. ### Step 1 — check status ```js const s = await status(); ``` ### Step 2 — install Piper if needed If `s.stage` is `not-setup` or `no-piper`: - Tell the user: *"To use local TTS I need to install piper-tts into the skill's venv (~30 seconds, one-time). OK to proceed?"* - Wait for confirmation, then call `setup()`. - If setup returns a step containing "WARNING: espeak-ng not found", relay the warning and install instructions to the user. - Call `status()` again after setup completes. ### Step 3 — offer voice download if no models present If `s.stage` is `no-model` (Piper installed but no `.onnx` files): **3a. Offer English defaults:** Explain that two English voices are available as defaults (~65 MB each): - `en_US-ryan-medium` — male, American - `en_US-amy-medium` — female, American Ask which they want, or both: *"Which English voice(s) should I download? Ryan (male), Amy (female), or both?"* **3b. Ask about other languages:** After the English choice, ask: *"Do you need any other languages? For example German, French, Spanish, Polish, Italian, Portuguese, Russian… Just tell me and I'll check what's available."* If the user names a language, look up the available models at https://github.com/rhasspy/piper/blob/master/VOICES.md and list the options. Download whatever the user picks using the same `downloadVoices()` call. **3c. Download everything at once:** ```js const result = await downloadVoices(['en_US-ryan-medium', 'en_US-amy-medium', /* + any others */]); // result.downloaded — succeeded // result.failed — [{stem, error}] if any failed ``` Each voice requires internet access. Download takes ~1–2 min per voice on a typical connection. If any downloads fail: - Check internet connectivity - Verify the stem exists at https://github.com/rhasspy/piper/blob/master/VOICES.md - Offer to retry ### Step 4 — play samples so the user can choose After downloading, generate a short audio sample for each downloaded voice and send it to the user. For each voice, use a greeting **in the voice's language**: - English: `"Hello, I'm [name]. How can I help you today?"` - German: `"Hallo, ich heiße [Name]. Wie kann ich Ihnen helfen?"` - French: `"Bonjour, je m'appelle [prénom]. Comment puis-je vous aider?"` - Spanish: `"Hola, me llamo [nombre]. ¿Cómo puedo ayudarte?"` - Polish: `"Cześć, mam na imię [imię]. Jak mogę Ci pomóc?"` - Italian: `"Ciao, mi chiamo [nome]. Come posso aiutarti?"` - Portuguese: `"Olá, meu nome é [nome]. Como posso ajudar?"` - Russian: `"Привет, меня зовут [имя]. Чем могу помочь?"` - For other languages: use an equivalent native greeting. Replace `[name]` with the voice name (e.g. *Ryan*, *Amy*, *Thorsten*). ```js const sample = await tts({ text: 'Hello, I\'m Ryan. How can I help you today?', voice: 'en_US-ryan-medium' }); // send sample.path to the user as a voice message ``` Send all samples, then ask: *"Which voice do you prefer? Or shall I download a different one?"* ### Step 5 — choose speech speed After the user picks a voice, ask: *"How fast should I speak? Normal is 100%. Some options: 125% (faster), 115% (slightly faster), 100% (normal), 80% (slower) — or tell me a percentage."* Always present speed as a percentage to the user. Never mention `lengthScale` directly. `lengthScale` is the internal duration multiplier — lower = faster. To convert: `lengthScale = 1 / (speed% / 100)`. Examples: - 125% speed → lengthScale 0.8 - 115% speed → lengthScale 0.87 - 100% speed → lengthScale 1.0 (default) - 80% speed → lengthScale 1.25 Generate a short sample at the chosen speed so the user can hear the difference: ```js const sample = await tts({ text: 'This is how I sound at this speed.', voice: 'chosen-voice', lengthScale: 0.8 }); // send sample.path to the user ``` Confirm with the user, then offer to save it permanently: *"Should I save this as your default speed? It'll be used automatically every session."* If the user agrees: ```js await saveConfig({ lengthScale: 0.8 }); ``` Once saved, `tts()` reads it from `config.json` in the skill directory automatically — no need to pass `lengthScale` on every call. ### Step 6 — note the preferred voice and speed Once confirmed, remember both `voice` and `lengthScale` for the session. Pass them to every subsequent `tts()` call unless the user asks to change them. --- ## Before first use — always call status() **Always call `status()` before the first `tts()` call in a session** to determine what is needed. | `stage` | Meaning | What to do | |---|---|---| | `ready` | Fully installed, at least one voice model present | Proceed with `tts()` | | `not-setup` | Piper not installed | Ask user for confirmation, then call `setup()` | | `no-piper` | Venv exists but piper binary missing | Ask user for confirmation, then call `setup()` | | `no-model` | Piper installed but no voice model downloaded | Follow Steps 3–5 of first-run flow above | **IMPORTANT: Always ask the user for confirmation before calling `setup()`.** It installs the `piper-tts` package from PyPI into a venv inside the skill directory. ## Usage - Input: `text`, optional `format` (`"ogg"` or `"wav"`), optional `voice` (model stem), optional `lengthScale` (speech speed, default `1.0`) - Output: path to generated file (usually `.ogg`) ## Controlling voice and language **To list installed voices**, call `listVoices()` — returns stems of all installed `.onnx` models. Never assume a fixed list; it varies per user and installation. **Auto-detection (no `voice` param):** The script detects language from the text using character and script analysis: - Non-Latin scripts: Cyrillic (Russian, Ukrainian, Bulgarian), Greek, Arabic, Persian, Chinese, Japanese, Korean, Georgian - Latin-script languages: Vietnamese, Polish, Romanian, Turkish, Czech, Slovak, Hungarian, Portuguese, Spanish, Catalan, German, Finnish, Scandinavian (Swedish, Norwegian, Danish), French, Italian - Fallback: English keywords → first English model → any installed model Auto-detection is best-effort. For reliable results with a specific language, always pass the `voice` parameter explicitly. **Explicit override:** set `PIPER_VOICE_MODEL` env var to a full `.onnx` path (overrides everything). **When the user requests a specific voice or language:** 1. Call `listVoices()` to see what is installed 2. Pass the matching stem as `voice` to `tts()`, e.g. `voice: "en_US-amy-medium"` 3. If the requested voice is not installed, offer to download it with `downloadVoices([stem])` **To switch back to auto-detect**, omit the `voice` parameter. ## Downloading additional voices The user may say things like *"I don't like this voice, use a female one"* or *"Download a German voice"*. When this happens: 1. Find the model at https://github.com/rhasspy/piper/blob/master/VOICES.md 2. Confirm the stem (e.g. `de_DE-thorsten-medium`) and call `downloadVoices([stem])` 3. Generate a sample and send it to the user 4. Confirm with `listVoices()` — the new voice is immediately usable ## Removing voices The user may say *"remove that voice"* or *"I don't need the German voice anymore"*. When this happens: 1. Call `listVoices()` to confirm which voices are installed 2. Confirm with the user which voice to remove 3. Call `removeVoice(stem)` — e.g. `removeVoice('de_DE-thorsten-medium')` 4. Returns `{ removed, filesDeleted }` on success 5. If the removed voice was the user's preferred voice, ask them to pick a new one **Never remove the last remaining voice without warning the user that TTS will stop working.** ## Changing speech speed The user may say things like *"speak faster"*, *"too slow"*, or *"speed it up"*. When this happens: 1. Ask what speed they want in %, or suggest: 125% (faster), 115%, 100% (normal), 80% (slower) 2. Convert their % to lengthScale: `lengthScale = 1 / (speed% / 100)` 3. Generate a short sample: `await tts({ text: '...', voice: 'current-voice', lengthScale: 0.8 })` 4. Send the sample and confirm 5. Offer to persist: *"Save this as default?"* — if yes, call `saveConfig({ lengthScale: 0.8 })` 6. Use the new `lengthScale` for all subsequent `tts()` calls in the session ## Where files are written - `OPENCLAW_WORKSPACE/tts/` if `OPENCLAW_WORKSPACE` env var is set - otherwise: `~/.openclaw/workspace/tts/` ## Dependencies - `python3` (3.8+) — required for `setup()` to create the venv - `ffmpeg` — for WAV → OGG/Opus conversion - `espeak-ng` — system library used by Piper internally; `setup()` checks for it and warns if missing. Install: `sudo apt install espeak-ng` (Debian/Ubuntu), `sudo dnf install espeak-ng` (Fedora), `brew install espeak` (macOS) - At least one Piper `.onnx` + `.onnx.json` voice model pair in the skill directory ## Platform support - Linux x86_64: fully supported - macOS x86_64 / arm64: fully supported - Linux ARM: may require building piper-tts from source - Windows: not supported ## Remove ```bash rm -rf ~/.openclaw/skills/local-piper-tts-multilang-secure ``` This removes everything: skill code, venv, and all voice models.