# local-piper-tts-multilang-secure

Fully offline text-to-speech via Piper TTS. Self-contained setup, automatic language detection for 20+ languages, and per-call voice selection. Writes audio files into the OpenClaw workspace for easy attachment and sending.

## Features

- Fully offline: no API keys required
- Self-contained setup: `setup()` installs Piper into an isolated venv, no system packages modified
- Automatic language detection for 20+ languages with English as default
- Per-call voice and speed selection: pass `voice: "voice-stem"` and `lengthScale: 0.85` to `tts()`
- Dynamic voice discovery: `listVoices()` returns whatever is installed, with no hardcoded assumptions
- On-demand voice download: `downloadVoices(["en_US-ryan-medium", ...])` fetches models from HuggingFace
- Voice removal: `removeVoice("en_US-ryan-medium")` deletes models you no longer need
- Extensible: add any language by dropping in a Piper `.onnx` model
- Writes outputs into the OpenClaw workspace for easy attachment
- Default output: OGG/Opus (compact, widely compatible)

## Requirements

- `python3` (3.8+), for the one-time `setup()` step
- `ffmpeg`, for WAV → OGG/Opus conversion
- `espeak-ng`, a system library used by Piper for phonemization (see note below)

No API keys. No system-wide package installation. Everything stays inside the skill directory.

## Platform support

| Platform | Status |
|---|---|
| Linux x86_64 | Fully supported |
| macOS x86_64 / arm64 | Fully supported |
| Linux ARM (Raspberry Pi, etc.) | May require building piper-tts from source |
| Windows | Not supported (bash dependency) |

## espeak-ng

Piper uses `espeak-ng` internally for text-to-phoneme conversion. On many systems it is already installed. `setup()` checks for it and warns if missing.
If needed, install via your package manager:

```bash
# Debian / Ubuntu
sudo apt install espeak-ng

# Fedora / RHEL
sudo dnf install espeak-ng

# macOS
brew install espeak-ng
```

After installing, TTS should work without re-running `setup()`.

## Installation

```bash
cp -r local-piper-tts-multilang-secure ~/.openclaw/skills/local-piper-tts-multilang-secure
```

Then ask your agent to set it up. It will call `setup()` after asking for your confirmation.

`setup()` is a one-time operation that:

1. Creates a Python venv inside the skill directory
2. Installs `piper-tts` from PyPI into that venv
3. Checks for `espeak-ng` and warns if missing

## First run

After installation, tell your agent:

> "Set up the local TTS skill"

The agent will:

1. Call `status()` and explain what needs to be done
2. Ask for confirmation, then run `setup()`
3. Offer to download English voice models (ryan-medium and/or amy-medium)
4. Ask if you need any other languages (German, French, Spanish, Polish, Italian, Russian, …)
5. Download your chosen voices, generate a short sample for each, and send them to you
6. Ask which voice you prefer
7. Ask about preferred speech speed in % (default 100% = normal, e.g. 125% = faster), then play a sample at your chosen speed

## Voice models

The skill ships with no voice models, so you choose what to install. English is recommended as a baseline.

Browse available models at: https://github.com/rhasspy/piper/blob/master/VOICES.md

### Recommended English defaults

| Stem | Voice | Size |
|---|---|---|
| `en_US-ryan-medium` | Male, American | ~65 MB |
| `en_US-amy-medium` | Female, American | ~65 MB |

Download programmatically:

```js
const { downloadVoices } = require('./index');
// Call from inside an async function (CommonJS has no top-level await)
await downloadVoices(['en_US-ryan-medium', 'en_US-amy-medium']);
```

Or just ask your agent: *"Download the English voices"*. It will handle everything, including playing samples so you can choose.

To see what is installed:

```js
require('./index').listVoices()
// ["en_US-ryan-medium", "de_DE-thorsten-medium", ...]
```

Or ask your agent: *"What voices do you have available?"*

## Changing voices

Just tell your agent:

- *"I don't like this voice, use a different one"*
- *"Download a female English voice"*
- *"Switch to British accent"*
- *"Get a German voice"*

The agent will check what is installed, download what is needed, play a sample, and use the right model.

## Removing voices

Just tell your agent:

- *"Remove the German voice"*
- *"Delete the Ryan voice, I only use Amy"*
- *"Clean up unused voices"*

The agent will confirm which voice to remove and delete the model files. Each voice takes ~65 MB, so removing unused ones can free significant disk space.

Programmatically:

```js
require('./index').removeVoice('en_US-ryan-medium')
// { removed: 'en_US-ryan-medium', filesDeleted: ['en_US-ryan-medium.onnx', 'en_US-ryan-medium.onnx.json'] }
```

## Changing speech speed

Just tell your agent:

- *"Speak faster"*
- *"Too slow, speed it up"*
- *"Use 120% speed"*
- *"Back to normal"*

The agent will suggest options in %, play a sample, and apply the change.

Speed is expressed as a percentage, where 100% is normal. `lengthScale` is the inverse: `lengthScale = 1 / (speed% / 100)`.

| Speed | lengthScale |
|---|---|
| 125% (fast) | 0.8 |
| 115% | 0.87 |
| 100% (normal) | 1.0 |
| 80% (slow) | 1.25 |

Default is 100% (lengthScale 1.0).

To persist your preferred speed across sessions, ask your agent to save it. It will call `saveConfig({ lengthScale: 0.8 })`, which writes to `config.json` inside the skill directory. The skill picks this up automatically on every subsequent call, so there is no need to repeat your preference each session.

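The speed-to-`lengthScale` conversion can be sketched as a pair of small helpers. This is only an illustration of the formula from the speed section: `speedToLengthScale` and `lengthScaleToSpeed` are hypothetical names, not functions exported by the skill, and rounding to two decimals is an assumption made to match the table.

```js
// Hypothetical helpers illustrating lengthScale = 1 / (speed% / 100).
// They are NOT part of the skill's API.

// Convert a speed percentage (100 = normal) to a Piper lengthScale.
function speedToLengthScale(speedPercent) {
  if (!Number.isFinite(speedPercent) || speedPercent <= 0) {
    throw new RangeError('speedPercent must be a positive number');
  }
  // 1 / (speed / 100) === 100 / speed; rounded to two decimals (assumption)
  return Math.round(10000 / speedPercent) / 100;
}

// Convert a lengthScale back to a whole-number speed percentage.
function lengthScaleToSpeed(lengthScale) {
  if (!Number.isFinite(lengthScale) || lengthScale <= 0) {
    throw new RangeError('lengthScale must be a positive number');
  }
  return Math.round(100 / lengthScale);
}

console.log(speedToLengthScale(125)); // 0.8
console.log(speedToLengthScale(115)); // 0.87
console.log(lengthScaleToSpeed(1.25)); // 80
```

A value computed this way could then be passed as `lengthScale` to `tts()` or to `saveConfig()`.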
## Language detection

Detection logic lives in `piper-tts.sh` and works automatically based on character and script analysis:

**Non-Latin scripts (unambiguous):**

- Cyrillic → Russian (with Ukrainian detection via і/ї/є/ґ), Bulgarian, Serbian
- Greek → Greek
- Arabic script → Arabic (with Persian detection via پ/چ/ژ/گ)
- CJK ideographs → Chinese (with Japanese detection via Hiragana/Katakana)
- Hangul → Korean
- Georgian → Georgian

**Latin-script languages (by distinctive characters):**

- Vietnamese (ăơưđ)
- Polish (ąćęłńśźż)
- Romanian (șț)
- Turkish (ğışİ)
- Czech/Slovak (ěščřžďťň, ů for Czech)
- Hungarian (őű)
- Portuguese (ãõ)
- Spanish (ñ¿¡)
- Catalan (l·l)
- German (ß, äöü)
- Finnish (äö, when no Scandinavian markers)
- Scandinavian: Norwegian/Danish (æø), Swedish (åäö)
- French (œçèêëïî)
- Italian (àèìòù)

**Fallback:** English keywords → first English model → any installed model.

No detection needed when `voice` is specified explicitly.

## Security

- `execFile` throughout: no shell interpreter, so user text cannot inject commands
- Voice path validated to stay within the skill directory (no path traversal)
- Output filename sanitised with `path.basename()` (no directory traversal)
- HTTPS-only downloads: non-HTTPS URLs and redirects are rejected
- URL path components validated against expected patterns
- Atomic downloads (write to .tmp, rename on success), so interrupted downloads leave no corrupt models
- Piper installed in isolated venv, with no system Python packages touched
- No credentials, no network calls during TTS (only during setup and voice downloads)

## Remove

```bash
rm -rf ~/.openclaw/skills/local-piper-tts-multilang-secure
```

This removes everything: skill code, venv, and all voice models.

## License

MIT