openclaw-backups/skills/local-piper-tts-multilang-secure/README.md

# local-piper-tts-multilang-secure

Fully offline text-to-speech via Piper TTS. Self-contained setup, automatic language detection for 20+ languages, and per-call voice selection. Writes audio files into the OpenClaw workspace for easy attachment and sending.

## Features

- Fully offline — no API keys required
- Self-contained setup — `setup()` installs Piper into an isolated venv, no system packages modified
- Automatic language detection for 20+ languages with English as default
- Per-call voice and speed selection: pass `voice: "voice-stem"` and `lengthScale: 0.85` to `tts()`
- Dynamic voice discovery: `listVoices()` returns whatever is installed — no hardcoded assumptions
- On-demand voice download: `downloadVoices(["en_US-ryan-medium", ...])` fetches models from HuggingFace
- Voice removal: `removeVoice("en_US-ryan-medium")` deletes models you no longer need
- Extensible: add any language by dropping in a Piper `.onnx` model
- Writes outputs into the OpenClaw workspace for easy attachment
- Default output: OGG/Opus (compact, widely compatible)

## Requirements

- `python3` (3.8+) — for the one-time `setup()` step
- `ffmpeg` — for WAV → OGG/Opus conversion
- `espeak-ng` — system library used by Piper for phonemization (see note below)

No API keys. No system-wide package installation. Everything stays inside the skill directory.

## Platform support

| Platform | Status |
|---|---|
| Linux x86_64 | Fully supported |
| macOS x86_64 / arm64 | Fully supported |
| Linux ARM (Raspberry Pi, etc.) | May require building piper-tts from source |
| Windows | Not supported (bash dependency) |

## espeak-ng

Piper uses `espeak-ng` internally for text-to-phoneme conversion. On many systems it is
already installed. `setup()` checks for it and warns if missing. If needed, install via
your package manager:

```bash
# Debian / Ubuntu
sudo apt install espeak-ng

# Fedora / RHEL
sudo dnf install espeak-ng

# macOS
brew install espeak
```

After installing, TTS should work without re-running `setup()`.

## Installation

```bash
cp -r local-piper-tts-multilang-secure ~/.openclaw/skills/local-piper-tts-multilang-secure
```

Then ask your agent to set it up — it will call `setup()` after asking for your confirmation.
`setup()` is a one-time operation that:
1. Creates a Python venv inside the skill directory
2. Installs `piper-tts` from PyPI into that venv
3. Checks for `espeak-ng` and warns if missing

## First run

After installation, tell your agent:
> "Set up the local TTS skill"

The agent will:
1. Call `status()` and explain what needs to be done
2. Ask for confirmation, then run `setup()`
3. Offer to download English voice models (ryan-medium and/or amy-medium)
4. Ask if you need any other languages (German, French, Spanish, Polish, Italian, Russian, …)
5. Download your chosen voices, generate a short sample for each, and send them to you
6. Ask which voice you prefer
7. Ask about preferred speech speed in % (default 100% = normal, e.g. 125% = faster), play a sample at your chosen speed

## Voice models

The skill ships with no voice models — you choose what to install.
English is recommended as a baseline. Browse available models at:
https://github.com/rhasspy/piper/blob/master/VOICES.md

### Recommended English defaults

| Stem | Gender | Size |
|---|---|---|
| `en_US-ryan-medium` | Male, American | ~65 MB |
| `en_US-amy-medium` | Female, American | ~65 MB |

Download programmatically:
```js
const { downloadVoices } = require('./index');
await downloadVoices(['en_US-ryan-medium', 'en_US-amy-medium']);
```

Or just ask your agent: *"Download the English voices"* — it will handle everything including
playing samples so you can choose.

To see what is installed:
```js
require('./index').listVoices()
// ["en_US-ryan-medium", "de_DE-thorsten-medium", ...]
```

Or ask your agent: *"What voices do you have available?"*

## Changing voices

Just tell your agent:
- *"I don't like this voice, use a different one"*
- *"Download a female English voice"*
- *"Switch to British accent"*
- *"Get a German voice"*

The agent will check what is installed, download what is needed, play a sample, and use the right model.

## Removing voices

Just tell your agent:
- *"Remove the German voice"*
- *"Delete the Ryan voice, I only use Amy"*
- *"Clean up unused voices"*

The agent will confirm which voice to remove and delete the model files. Each voice takes ~65 MB, so removing unused ones can free significant disk space.

Programmatically:
```js
require('./index').removeVoice('en_US-ryan-medium')
// { removed: 'en_US-ryan-medium', filesDeleted: ['en_US-ryan-medium.onnx', 'en_US-ryan-medium.onnx.json'] }
```

## Changing speech speed

Just tell your agent:
- *"Speak faster"*
- *"Too slow, speed it up"*
- *"Use 120% speed"*
- *"Back to normal"*

The agent will suggest options in %, play a sample, and apply the change. Speed is expressed as a percentage — 100% is normal. `lengthScale` is the inverse: `lengthScale = 1 / (speed% / 100)`.

| Speed | lengthScale |
|---|---|
| 125% (fast) | 0.8 |
| 115% | 0.87 |
| 100% (normal) | 1.0 |
| 80% (slow) | 1.25 |

Default is 100% (lengthScale 1.0).

To persist your preferred speed across sessions, ask your agent to save it — it will call `saveConfig({ lengthScale: 0.8 })` which writes to `config.json` inside the skill directory. The skill picks this up automatically on every subsequent call — no need to repeat your preference each session.

## Language detection

Detection logic lives in `piper-tts.sh` and works automatically based on character and script analysis:

**Non-Latin scripts (unambiguous):**
- Cyrillic → Russian (with Ukrainian detection via і/ї/є/ґ), Bulgarian, Serbian
- Greek → Greek
- Arabic script → Arabic (with Persian detection via پ/چ/ژ/گ)
- CJK ideographs → Chinese (with Japanese detection via Hiragana/Katakana)
- Hangul → Korean
- Georgian → Georgian

**Latin-script languages (by distinctive characters):**
- Vietnamese (ăơưđ)
- Polish (ąćęłńśźż)
- Romanian (șț)
- Turkish (ğışİ)
- Czech/Slovak (ěščřžďťň, ů for Czech)
- Hungarian (őű)
- Portuguese (ãõ)
- Spanish (ñ¿¡)
- Catalan (l·l)
- German (ß, äöü)
- Finnish (äö, when no Scandinavian markers)
- Scandinavian — Norwegian/Danish (æø), Swedish (åäö)
- French (œçèêëïî)
- Italian (àèìòù)

**Fallback:** English keywords → first English model → any installed model.

No detection needed when `voice` is specified explicitly.

## Security

- `execFile` throughout — no shell interpreter, user text cannot inject commands
- Voice path validated to stay within the skill directory — no path traversal
- Output filename sanitised with `path.basename()` — no directory traversal
- HTTPS-only downloads — non-HTTPS URLs and redirects are rejected
- URL path components validated against expected patterns
- Atomic downloads (write to .tmp, rename on success) — no corrupt models from interrupted downloads
- Piper installed in isolated venv — no system Python packages touched
- No credentials, no network calls during TTS (only during setup and voice downloads)

## Remove

```bash
rm -rf ~/.openclaw/skills/local-piper-tts-multilang-secure
```

This removes everything: skill code, venv, and all voice models.

## License

MIT