Files
openclaw-backups/skills/local-piper-tts-multilang-secure/README.md

7.2 KiB
Raw Blame History

local-piper-tts-multilang-secure

Fully offline text-to-speech via Piper TTS. Self-contained setup, automatic language detection for 20+ languages, and per-call voice selection. Writes audio files into the OpenClaw workspace for easy attachment and sending.

Features

  • Fully offline — no API keys required
  • Self-contained setup — setup() installs Piper into an isolated venv, no system packages modified
  • Automatic language detection for 20+ languages with English as default
  • Per-call voice and speed selection: pass voice: "voice-stem" and lengthScale: 0.85 to tts()
  • Dynamic voice discovery: listVoices() returns whatever is installed — no hardcoded assumptions
  • On-demand voice download: downloadVoices(["en_US-ryan-medium", ...]) fetches models from HuggingFace
  • Voice removal: removeVoice("en_US-ryan-medium") deletes models you no longer need
  • Extensible: add any language by dropping in a Piper .onnx model
  • Writes outputs into the OpenClaw workspace for easy attachment
  • Default output: OGG/Opus (compact, widely compatible)

Requirements

  • python3 (3.8+) — for the one-time setup() step
  • ffmpeg — for WAV → OGG/Opus conversion
  • espeak-ng — system library used by Piper for phonemization (see note below)

No API keys. No system-wide package installation. Everything stays inside the skill directory.

Platform support

Platform Status
Linux x86_64 Fully supported
macOS x86_64 / arm64 Fully supported
Linux ARM (Raspberry Pi, etc.) May require building piper-tts from source
Windows Not supported (bash dependency)

espeak-ng

Piper uses espeak-ng internally for text-to-phoneme conversion. On many systems it is already installed. setup() checks for it and warns if missing. If needed, install via your package manager:

# Debian / Ubuntu
sudo apt install espeak-ng

# Fedora / RHEL
sudo dnf install espeak-ng

# macOS
brew install espeak

After installing, TTS should work without re-running setup().

Installation

cp -r local-piper-tts-multilang-secure ~/.openclaw/skills/local-piper-tts-multilang-secure

Then ask your agent to set it up — it will call setup() after asking for your confirmation. setup() is a one-time operation that:

  1. Creates a Python venv inside the skill directory
  2. Installs piper-tts from PyPI into that venv
  3. Checks for espeak-ng and warns if missing

First run

After installation, tell your agent:

"Set up the local TTS skill"

The agent will:

  1. Call status() and explain what needs to be done
  2. Ask for confirmation, then run setup()
  3. Offer to download English voice models (ryan-medium and/or amy-medium)
  4. Ask if you need any other languages (German, French, Spanish, Polish, Italian, Russian, …)
  5. Download your chosen voices, generate a short sample for each, and send them to you
  6. Ask which voice you prefer
  7. Ask about preferred speech speed in % (default 100% = normal, e.g. 125% = faster), play a sample at your chosen speed

Voice models

The skill ships with no voice models — you choose what to install. English is recommended as a baseline. Browse available models at: https://github.com/rhasspy/piper/blob/master/VOICES.md

Stem Gender Size
en_US-ryan-medium Male, American ~65 MB
en_US-amy-medium Female, American ~65 MB

Download programmatically:

const { downloadVoices } = require('./index');
await downloadVoices(['en_US-ryan-medium', 'en_US-amy-medium']);

Or just ask your agent: "Download the English voices" — it will handle everything including playing samples so you can choose.

To see what is installed:

require('./index').listVoices()
// ["en_US-ryan-medium", "de_DE-thorsten-medium", ...]

Or ask your agent: "What voices do you have available?"

Changing voices

Just tell your agent:

  • "I don't like this voice, use a different one"
  • "Download a female English voice"
  • "Switch to British accent"
  • "Get a German voice"

The agent will check what is installed, download what is needed, play a sample, and use the right model.

Removing voices

Just tell your agent:

  • "Remove the German voice"
  • "Delete the Ryan voice, I only use Amy"
  • "Clean up unused voices"

The agent will confirm which voice to remove and delete the model files. Each voice takes ~65 MB, so removing unused ones can free significant disk space.

Programmatically:

require('./index').removeVoice('en_US-ryan-medium')
// { removed: 'en_US-ryan-medium', filesDeleted: ['en_US-ryan-medium.onnx', 'en_US-ryan-medium.onnx.json'] }

Changing speech speed

Just tell your agent:

  • "Speak faster"
  • "Too slow, speed it up"
  • "Use 120% speed"
  • "Back to normal"

The agent will suggest options in %, play a sample, and apply the change. Speed is expressed as a percentage — 100% is normal. lengthScale is the inverse: lengthScale = 1 / (speed% / 100).

Speed lengthScale
125% (fast) 0.8
115% 0.87
100% (normal) 1.0
80% (slow) 1.25

Default is 100% (lengthScale 1.0).

To persist your preferred speed across sessions, ask your agent to save it — it will call saveConfig({ lengthScale: 0.8 }) which writes to config.json inside the skill directory. The skill picks this up automatically on every subsequent call — no need to repeat your preference each session.

Language detection

Detection logic lives in piper-tts.sh and works automatically based on character and script analysis:

Non-Latin scripts (unambiguous):

  • Cyrillic → Russian (with Ukrainian detection via і/ї/є/ґ), Bulgarian, Serbian
  • Greek → Greek
  • Arabic script → Arabic (with Persian detection via پ/چ/ژ/گ)
  • CJK ideographs → Chinese (with Japanese detection via Hiragana/Katakana)
  • Hangul → Korean
  • Georgian → Georgian

Latin-script languages (by distinctive characters):

  • Vietnamese (ăơưđ)
  • Polish (ąćęłńśźż)
  • Romanian (șț)
  • Turkish (ğışİ)
  • Czech/Slovak (ěščřžďťň, ů for Czech)
  • Hungarian (őű)
  • Portuguese (ãõ)
  • Spanish (ñ¿¡)
  • Catalan (l·l)
  • German (ß, äöü)
  • Finnish (äö, when no Scandinavian markers)
  • Scandinavian — Norwegian/Danish (æø), Swedish (åäö)
  • French (œçèêëïî)
  • Italian (àèìòù)

Fallback: English keywords → first English model → any installed model.

No detection needed when voice is specified explicitly.

Security

  • execFile throughout — no shell interpreter, user text cannot inject commands
  • Voice path validated to stay within the skill directory — no path traversal
  • Output filename sanitised with path.basename() — no directory traversal
  • HTTPS-only downloads — non-HTTPS URLs and redirects are rejected
  • URL path components validated against expected patterns
  • Atomic downloads (write to .tmp, rename on success) — no corrupt models from interrupted downloads
  • Piper installed in isolated venv — no system Python packages touched
  • No credentials, no network calls during TTS (only during setup and voice downloads)

Remove

rm -rf ~/.openclaw/skills/local-piper-tts-multilang-secure

This removes everything: skill code, venv, and all voice models.

License

MIT