AWS SDK Polly - Text-to-speech synthesis with streaming audio and SSML support
Recipe
```bash
npm install @aws-sdk/client-polly
```

```ts
// lib/polly.ts
import { PollyClient } from "@aws-sdk/client-polly";

export const pollyClient = new PollyClient({
  region: process.env.AWS_REGION!,
  credentials: {
    accessKeyId: process.env.AWS_ACCESS_KEY_ID!,
    secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY!,
  },
});
```

```ts
// app/api/speech/route.ts
import { SynthesizeSpeechCommand } from "@aws-sdk/client-polly";
import { pollyClient } from "@/lib/polly";

export async function POST(req: Request) {
  const { text, voiceId = "Joanna" } = await req.json();

  const command = new SynthesizeSpeechCommand({
    Text: text,
    OutputFormat: "mp3",
    VoiceId: voiceId,
    Engine: "neural",
  });

  const response = await pollyClient.send(command);
  // In the Node runtime, AudioStream is a Node.js Readable with the SDK's
  // stream mixin; convert it to a web ReadableStream before returning it.
  const stream = response.AudioStream!.transformToWebStream();

  return new Response(stream, {
    headers: {
      "Content-Type": "audio/mpeg",
      "Cache-Control": "public, max-age=3600",
    },
  });
}
```

When to reach for this: You need to convert text to natural-sounding speech in a web app for accessibility, voice assistants, language learning, or content narration.
Working Example
```tsx
// app/components/TextToSpeech.tsx
"use client";
import { useState, useRef } from "react";

const VOICES = [
  { id: "Joanna", name: "Joanna (US English)", lang: "en-US" },
  { id: "Matthew", name: "Matthew (US English)", lang: "en-US" },
  { id: "Amy", name: "Amy (British English)", lang: "en-GB" },
  { id: "Lea", name: "Léa (French)", lang: "fr-FR" },
  { id: "Vicki", name: "Vicki (German)", lang: "de-DE" },
  { id: "Lucia", name: "Lucia (Spanish)", lang: "es-ES" },
];

export default function TextToSpeech() {
  const [text, setText] = useState("");
  const [voiceId, setVoiceId] = useState("Joanna");
  const [loading, setLoading] = useState(false);
  const audioRef = useRef<HTMLAudioElement>(null);

  async function handleSpeak() {
    if (!text.trim()) return;
    setLoading(true);
    try {
      const res = await fetch("/api/speech", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ text, voiceId }),
      });
      if (!res.ok) throw new Error("Speech synthesis failed");
      const blob = await res.blob();
      const url = URL.createObjectURL(blob);
      if (audioRef.current) {
        // Revoke the previous blob URL so repeated plays don't leak memory
        if (audioRef.current.src.startsWith("blob:")) {
          URL.revokeObjectURL(audioRef.current.src);
        }
        audioRef.current.src = url;
        audioRef.current.play();
      }
    } catch (error) {
      console.error("TTS error:", error);
    } finally {
      setLoading(false);
    }
  }

  return (
    <div className="max-w-md mx-auto p-6 space-y-4">
      <textarea
        value={text}
        onChange={(e) => setText(e.target.value)}
        placeholder="Enter text to speak..."
        rows={4}
        className="w-full border rounded px-3 py-2"
        maxLength={3000}
      />
      <div className="flex gap-3">
        <select
          value={voiceId}
          onChange={(e) => setVoiceId(e.target.value)}
          className="border rounded px-3 py-2"
        >
          {VOICES.map((v) => (
            <option key={v.id} value={v.id}>
              {v.name}
            </option>
          ))}
        </select>
        <button
          onClick={handleSpeak}
          disabled={loading || !text.trim()}
          className="bg-blue-600 text-white px-4 py-2 rounded disabled:opacity-50"
        >
          {loading ? "Generating..." : "Speak"}
        </button>
      </div>
      <audio ref={audioRef} controls className="w-full" />
      <p className="text-xs text-gray-500">{text.length}/3000 characters</p>
    </div>
  );
}
```

What this demonstrates:
- Full text-to-speech UI with voice selection
- Streaming audio from Polly through a Next.js API route
- Audio blob creation and playback with HTML5 audio element
- Character limit handling (Polly limit is 3000 chars for neural, 6000 for standard)
Deep Dive
How It Works
- Polly converts text to audio using neural or standard speech synthesis engines
- The neural engine produces more natural-sounding speech but is available for a subset of voices
- `SynthesizeSpeechCommand` returns an `AudioStream` that can be piped directly to the response
- Supported output formats: `mp3`, `ogg_vorbis`, `pcm`, and `json` (for speech marks)
- Polly supports both plain text and SSML (Speech Synthesis Markup Language) for fine-grained control
- Each voice is tied to a specific language; you cannot change a voice's language
Variations
SSML for advanced speech control:
```ts
const command = new SynthesizeSpeechCommand({
  Text: `<speak>
    Welcome to our app.
    <break time="500ms"/>
    <prosody rate="slow" pitch="+10%">
      This part is spoken slowly with a higher pitch.
    </prosody>
    <emphasis level="strong">This is important.</emphasis>
    <say-as interpret-as="date" format="mdy">12/25/2025</say-as>
  </speak>`,
  TextType: "ssml",
  OutputFormat: "mp3",
  VoiceId: "Joanna",
  Engine: "neural",
});
```

Get available voices:
```ts
import { DescribeVoicesCommand } from "@aws-sdk/client-polly";
import { pollyClient } from "@/lib/polly";

export async function GET() {
  const command = new DescribeVoicesCommand({
    Engine: "neural",
    LanguageCode: "en-US",
  });
  const response = await pollyClient.send(command);
  const voices = response.Voices?.map((v) => ({
    id: v.Id,
    name: v.Name,
    gender: v.Gender,
    languageName: v.LanguageName,
  }));
  return Response.json(voices);
}
```

Speech marks (word timing data):
```ts
const command = new SynthesizeSpeechCommand({
  Text: "Hello, how are you today?",
  OutputFormat: "json",
  VoiceId: "Joanna",
  Engine: "neural",
  SpeechMarkTypes: ["word", "sentence"],
});
const response = await pollyClient.send(command);
// Returns JSONL with timing data for each word/sentence:
// {"time":0,"type":"sentence","start":0,"end":25,"value":"Hello, how are you today?"}
// {"time":0,"type":"word","start":0,"end":5,"value":"Hello"}
```

Server Action for short phrases:
```ts
"use server";
import { SynthesizeSpeechCommand, type VoiceId } from "@aws-sdk/client-polly";
import { pollyClient } from "@/lib/polly";

export async function synthesizeSpeech(text: string, voiceId: VoiceId = "Joanna") {
  const command = new SynthesizeSpeechCommand({
    Text: text.slice(0, 3000),
    OutputFormat: "mp3",
    VoiceId: voiceId,
    Engine: "neural",
  });
  const response = await pollyClient.send(command);

  // Collect the stream into one buffer, then return it as base64
  const chunks: Uint8Array[] = [];
  const stream = response.AudioStream as AsyncIterable<Uint8Array>;
  for await (const chunk of stream) {
    chunks.push(chunk);
  }
  const buffer = Buffer.concat(chunks);
  return buffer.toString("base64");
}
```

TypeScript Notes
- `VoiceId` is a union type of all available voice IDs (e.g., `"Joanna"`, `"Matthew"`)
- `Engine` is `"neural"` or `"standard"`
- `OutputFormat` is `"mp3"`, `"ogg_vorbis"`, `"pcm"`, or `"json"`
- The `AudioStream` type varies by environment; cast to `ReadableStream` in serverless
```ts
import type {
  SynthesizeSpeechCommandInput,
  VoiceId,
  Engine,
} from "@aws-sdk/client-polly";

const voiceId: VoiceId = "Joanna";
const engine: Engine = "neural";

const input: SynthesizeSpeechCommandInput = {
  Text: "Hello",
  OutputFormat: "mp3",
  VoiceId: voiceId,
  Engine: engine,
};
```

Gotchas
- **Neural engine not available for all voices** -- Not every voice supports the neural engine. Fix: Check `DescribeVoicesCommand` with `Engine: "neural"` to get the list of supported voices, and fall back to `"standard"` if unsure.
- **Text length limits** -- The standard engine allows 6000 characters per request; neural allows 3000. Fix: Split long text into chunks at sentence boundaries and synthesize each separately.
- **SSML must be well-formed XML** -- Invalid SSML tags cause `SynthesizeSpeechCommand` to throw. Fix: Set `TextType: "ssml"`, wrap content in `<speak>` tags, and escape special characters (`&`, `<`).
- **Audio format compatibility** -- `pcm` output is raw audio data, not playable in a browser audio element. Fix: Use `mp3` or `ogg_vorbis` for browser playback; use `pcm` only for audio processing pipelines.
- **Cost at scale** -- Polly charges per character synthesized. Fix: Cache audio results in S3 or a CDN for repeated phrases, and set a `Cache-Control` header on responses.
- **Streaming body type varies** -- `AudioStream` is typed differently in Node.js vs. edge runtimes. Fix: In Node.js API routes, convert with `transformToWebStream()` or cast to `ReadableStream`; in the edge runtime, it may already be a web stream.
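The first gotcha can be handled with a small helper that chooses the engine per voice from `DescribeVoices` data. This is a sketch, not part of the recipe: `pickEngine` is a hypothetical name, and `VoiceInfo` mirrors only the two fields of Polly's voice shape needed here (`Id` and `SupportedEngines`).

```ts
// Minimal shape of the fields we need from Polly's DescribeVoices response
type VoiceInfo = { Id?: string; SupportedEngines?: string[] };

// Prefer "neural" when the voice supports it, otherwise fall back to "standard"
function pickEngine(voices: VoiceInfo[], voiceId: string): "neural" | "standard" {
  const voice = voices.find((v) => v.Id === voiceId);
  return voice?.SupportedEngines?.includes("neural") ? "neural" : "standard";
}

// Example with DescribeVoices-shaped data
const voices: VoiceInfo[] = [
  { Id: "Joanna", SupportedEngines: ["neural", "standard"] },
  { Id: "Raveena", SupportedEngines: ["standard"] },
];
console.log(pickEngine(voices, "Joanna"));  // "neural"
console.log(pickEngine(voices, "Raveena")); // "standard"
```

An unknown voice ID also falls back to `"standard"`, which keeps the request from failing outright.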
Alternatives
| Library | Best For | Trade-off |
|---|---|---|
| AWS Polly | High-quality neural voices, SSML control | AWS account required, per-character cost |
| Web Speech API | Free browser-native TTS | Inconsistent quality, limited voice control |
| Google Cloud TTS | WaveNet voices, broad language support | Different SDK, similar pricing |
| ElevenLabs | Ultra-realistic voice cloning | Higher cost, separate API |
| OpenAI TTS | Simple API, good quality | Limited voice customization |
FAQs
What is the difference between the neural and standard Polly engines?
- Neural engine produces more natural, human-like speech
- Standard engine is available for more voices but sounds more robotic
- Neural has a 3000-character limit per request; standard allows 6000
- Not all voices support the neural engine -- check with `DescribeVoicesCommand`
How do I stream Polly audio through a Next.js API route to the browser?
```ts
const response = await pollyClient.send(command);
const stream = response.AudioStream!.transformToWebStream();

return new Response(stream, {
  headers: { "Content-Type": "audio/mpeg" },
});
```

The browser receives the audio stream and can play it via an `<audio>` element.
What audio output formats does Polly support and which should I use for browser playback?
- `mp3` -- widely supported, good for browser playback
- `ogg_vorbis` -- good quality, supported in most modern browsers
- `pcm` -- raw audio data, not playable directly in a browser
- `json` -- returns speech marks (timing data), not audio
- Use `mp3` for browser playback in most cases
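An API route that accepts a format parameter needs a matching `Content-Type` header. A small lookup, as a sketch (the `pcm` entry is raw sample data and its MIME label is nominal; `application/octet-stream` would also be reasonable):

```ts
// Map a Polly OutputFormat to the Content-Type header for the response
const CONTENT_TYPES: Record<string, string> = {
  mp3: "audio/mpeg",
  ogg_vorbis: "audio/ogg",
  pcm: "audio/pcm",         // raw samples; browsers won't play this directly
  json: "application/json", // speech marks, not audio
};

console.log(CONTENT_TYPES["mp3"]); // "audio/mpeg"
```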
How do I use SSML to control speech prosody, pauses, and emphasis?
```ts
const command = new SynthesizeSpeechCommand({
  Text: `<speak>
    Hello. <break time="500ms"/>
    <prosody rate="slow">This is slow.</prosody>
    <emphasis level="strong">Important!</emphasis>
  </speak>`,
  TextType: "ssml",
  OutputFormat: "mp3",
  VoiceId: "Joanna",
  Engine: "neural",
});
```

Gotcha: Why does my SSML request throw an error?
- SSML must be well-formed XML wrapped in `<speak>` tags
- Set `TextType: "ssml"` in the command -- omitting this treats SSML tags as plain text
- Escape special XML characters: use `&amp;` for `&` and `&lt;` for `<`
- Invalid or unclosed tags cause `SynthesizeSpeechCommand` to throw
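A tiny helper for the escaping rule, as a sketch (`escapeSsmlText` is a hypothetical name; apply it to user-supplied text before interpolating it into a `<speak>` template -- `>` is escaped too for symmetry, though XML only strictly requires `&` and `<` in text content):

```ts
// Escape the XML special characters that break Polly's SSML parsing.
// "&" must be replaced first so we don't double-escape the entities we emit.
function escapeSsmlText(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

console.log(escapeSsmlText("Tom & Jerry < friends"));
// Tom &amp; Jerry &lt; friends
```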
How do I list all available neural voices for a specific language?
```ts
import { DescribeVoicesCommand } from "@aws-sdk/client-polly";

const command = new DescribeVoicesCommand({
  Engine: "neural",
  LanguageCode: "en-US",
});
const response = await pollyClient.send(command);
const voices = response.Voices?.map(v => ({
  id: v.Id,
  name: v.Name,
  gender: v.Gender,
}));
```

What are speech marks and how do I get word-level timing data?
- Speech marks provide timing information for each word, sentence, or viseme
- Set `OutputFormat: "json"` and `SpeechMarkTypes: ["word", "sentence"]`
- The response is JSONL (one JSON object per line) with `time`, `start`, `end`, and `value` fields
- Useful for karaoke-style highlighting or lip sync
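Parsing that JSONL into objects is one `JSON.parse` per line. A sketch with a hypothetical `parseSpeechMarks` helper and made-up timing values:

```ts
// One JSON object per line of Polly's speech-mark output
type SpeechMark = {
  time: number;  // offset from the start of the audio, in milliseconds
  type: string;  // "word", "sentence", "viseme", or "ssml"
  start: number; // byte offset of the token in the input text
  end: number;
  value: string;
};

function parseSpeechMarks(jsonl: string): SpeechMark[] {
  return jsonl
    .split("\n")
    .filter((line) => line.trim().length > 0) // skip blank/trailing lines
    .map((line) => JSON.parse(line) as SpeechMark);
}

// Example with illustrative timing values
const sample =
  '{"time":0,"type":"word","start":0,"end":5,"value":"Hello"}\n' +
  '{"time":320,"type":"word","start":7,"end":10,"value":"how"}\n';
const marks = parseSpeechMarks(sample);
console.log(marks[1].value, marks[1].time); // how 320
```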
Gotcha: How do I handle the AudioStream type that varies across runtimes?
- In Node.js API routes, `AudioStream` is a Node.js `Readable` stream
- In the edge runtime, it may already be a web `ReadableStream`
- Cast to `ReadableStream` when returning from a `Response` constructor
- For Server Actions, iterate with `for await` and collect into a `Buffer`
How do I type the Polly command inputs in TypeScript?
```ts
import type {
  SynthesizeSpeechCommandInput,
  VoiceId,
  Engine,
} from "@aws-sdk/client-polly";

const input: SynthesizeSpeechCommandInput = {
  Text: "Hello",
  OutputFormat: "mp3",
  VoiceId: "Joanna" satisfies VoiceId,
  Engine: "neural" satisfies Engine,
};
```

How should I handle the 3000-character limit for neural voice requests?
- Split long text into chunks at sentence boundaries
- Synthesize each chunk separately and concatenate the audio on the client
- Standard engine allows 6000 characters per request if neural is not required
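One way to implement the split, as a sketch (`chunkText` is a hypothetical helper; the regex breaks after sentence-ending punctuation, and a single sentence longer than the limit is passed through unsplit, so unbounded input would need an extra hard-split pass):

```ts
// Split text into chunks that fit Polly's per-request character limit,
// breaking at sentence boundaries.
function chunkText(text: string, maxLen = 3000): string[] {
  // Split after ".", "!", or "?" followed by whitespace
  const sentences = text.split(/(?<=[.!?])\s+/);
  const chunks: string[] = [];
  let current = "";
  for (const sentence of sentences) {
    // +1 accounts for the space re-inserted between joined sentences
    if (current && current.length + sentence.length + 1 > maxLen) {
      chunks.push(current);
      current = sentence;
    } else {
      current = current ? `${current} ${sentence}` : sentence;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

console.log(chunkText("One. Two. Three.", 10)); // [ 'One. Two.', 'Three.' ]
```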
How can I reduce Polly costs for repeated phrases?
- Cache audio results in S3 for phrases that are synthesized frequently
- Set `Cache-Control` headers on API responses for browser caching
- Use a CDN in front of your speech API route
- Only re-synthesize when the text or voice changes
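A deterministic cache key makes the S3/CDN lookup straightforward: hash everything that affects the audio. A sketch (`speechCacheKey` is a hypothetical helper; the `tts/` prefix and `.mp3` suffix are arbitrary naming choices):

```ts
import { createHash } from "node:crypto";

// Same text + voice + engine always yields the same key, so a cached
// object can be served instead of re-synthesizing.
function speechCacheKey(text: string, voiceId: string, engine: string): string {
  const digest = createHash("sha256")
    .update(`${voiceId}:${engine}:${text}`)
    .digest("hex");
  return `tts/${digest}.mp3`;
}

console.log(speechCacheKey("Hello", "Joanna", "neural"));
```

Check S3 for this key before calling Polly; on a miss, synthesize, upload, and return the audio.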
How do I return audio from a Server Action as base64 for client-side playback?
```ts
"use server";

const response = await pollyClient.send(command);
const chunks: Uint8Array[] = [];
for await (const chunk of response.AudioStream as AsyncIterable<Uint8Array>) {
  chunks.push(chunk);
}
return Buffer.concat(chunks).toString("base64");
```

On the client, convert the base64 string to a blob URL for the `<audio>` element.
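The client-side half can look like this, as a sketch (`base64ToAudioBlob` is a hypothetical helper; `atob` and `Blob` are browser globals, also available in Node 18+):

```ts
// Decode the base64 string returned by the Server Action into a Blob,
// ready to hand to URL.createObjectURL for an <audio> element.
function base64ToAudioBlob(base64: string, mimeType = "audio/mpeg"): Blob {
  const binary = atob(base64); // base64 -> binary string
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return new Blob([bytes], { type: mimeType });
}

// Usage in a client component:
// const blob = base64ToAudioBlob(await synthesizeSpeech(text, voiceId));
// audioRef.current.src = URL.createObjectURL(blob);
```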
Related
- AWS SDK S3 — Cache synthesized audio in S3
- AWS SDK Lambda — Process audio in Lambda
- Vercel AI SDK — Combine AI text generation with speech output