AWS SDK Polly - Text-to-speech synthesis with streaming audio and SSML support
Recipe
```bash
npm install @aws-sdk/client-polly
```

```ts
// lib/polly.ts
import { PollyClient } from "@aws-sdk/client-polly";

export const pollyClient = new PollyClient({
  region: process.env.AWS_REGION!,
  credentials: {
    accessKeyId: process.env.AWS_ACCESS_KEY_ID!,
    secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY!,
  },
});
```

```ts
// app/api/speech/route.ts
import { SynthesizeSpeechCommand } from "@aws-sdk/client-polly";
import { pollyClient } from "@/lib/polly";

export async function POST(req: Request) {
  const { text, voiceId = "Joanna" } = await req.json();

  const command = new SynthesizeSpeechCommand({
    Text: text,
    OutputFormat: "mp3",
    VoiceId: voiceId,
    Engine: "neural",
  });

  const response = await pollyClient.send(command);
  // In the Node runtime, AudioStream is a Node.js Readable with the SDK's
  // stream mixin; convert it to a web ReadableStream before returning it.
  const stream = response.AudioStream!.transformToWebStream();

  return new Response(stream, {
    headers: {
      "Content-Type": "audio/mpeg",
      "Cache-Control": "public, max-age=3600",
    },
  });
}
```

When to reach for this: You need to convert text to natural-sounding speech in a web app for accessibility, voice assistants, language learning, or content narration.
Working Example
```tsx
// app/components/TextToSpeech.tsx
"use client";
import { useState, useRef } from "react";

const VOICES = [
  { id: "Joanna", name: "Joanna (US English)", lang: "en-US" },
  { id: "Matthew", name: "Matthew (US English)", lang: "en-US" },
  { id: "Amy", name: "Amy (British English)", lang: "en-GB" },
  { id: "Lea", name: "Léa (French)", lang: "fr-FR" },
  { id: "Vicki", name: "Vicki (German)", lang: "de-DE" },
  { id: "Lucia", name: "Lucia (Spanish)", lang: "es-ES" },
];

export default function TextToSpeech() {
  const [text, setText] = useState("");
  const [voiceId, setVoiceId] = useState("Joanna");
  const [loading, setLoading] = useState(false);
  const audioRef = useRef<HTMLAudioElement>(null);

  async function handleSpeak() {
    if (!text.trim()) return;
    setLoading(true);
    try {
      const res = await fetch("/api/speech", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ text, voiceId }),
      });
      if (!res.ok) throw new Error("Speech synthesis failed");
      const blob = await res.blob();
      const url = URL.createObjectURL(blob);
      if (audioRef.current) {
        // Revoke the previous blob URL so repeated plays don't leak memory
        if (audioRef.current.src.startsWith("blob:")) {
          URL.revokeObjectURL(audioRef.current.src);
        }
        audioRef.current.src = url;
        audioRef.current.play();
      }
    } catch (error) {
      console.error("TTS error:", error);
    } finally {
      setLoading(false);
    }
  }

  return (
    <div className="max-w-md mx-auto p-6 space-y-4">
      <textarea
        value={text}
        onChange={(e) => setText(e.target.value)}
        placeholder="Enter text to speak..."
        rows={4}
        className="w-full border rounded px-3 py-2"
        maxLength={3000}
      />
      <div className="flex gap-3">
        <select
          value={voiceId}
          onChange={(e) => setVoiceId(e.target.value)}
          className="border rounded px-3 py-2"
        >
          {VOICES.map((v) => (
            <option key={v.id} value={v.id}>
              {v.name}
            </option>
          ))}
        </select>
        <button
          onClick={handleSpeak}
          disabled={loading || !text.trim()}
          className="bg-blue-600 text-white px-4 py-2 rounded disabled:opacity-50"
        >
          {loading ? "Generating..." : "Speak"}
        </button>
      </div>
      <audio ref={audioRef} controls className="w-full" />
      <p className="text-xs text-gray-500">{text.length}/3000 characters</p>
    </div>
  );
}
```

What this demonstrates:
- Full text-to-speech UI with voice selection
- Streaming audio from Polly through a Next.js API route
- Audio blob creation and playback with HTML5 audio element
- Character limit handling (Polly limit is 3000 chars for neural, 6000 for standard)
Deep Dive
How It Works
- Polly converts text to audio using neural or standard speech synthesis engines
- The neural engine produces more natural-sounding speech but is available for a subset of voices
- `SynthesizeSpeechCommand` returns an `AudioStream` that can be piped directly to the response
- Supported output formats: `mp3`, `ogg_vorbis`, `pcm`, and `json` (for speech marks)
- Polly supports both plain text and SSML (Speech Synthesis Markup Language) for fine-grained control
- Each voice is tied to a specific language; you cannot change a voice's language
Variations
SSML for advanced speech control:
```ts
const command = new SynthesizeSpeechCommand({
  Text: `<speak>
    Welcome to our app.
    <break time="500ms"/>
    <prosody rate="slow" pitch="+10%">
      This part is spoken slowly with a higher pitch.
    </prosody>
    <emphasis level="strong">This is important.</emphasis>
    <say-as interpret-as="date" format="mdy">12/25/2025</say-as>
  </speak>`,
  TextType: "ssml",
  OutputFormat: "mp3",
  VoiceId: "Joanna",
  Engine: "neural",
});
```

Get available voices:
```ts
import { DescribeVoicesCommand } from "@aws-sdk/client-polly";
import { pollyClient } from "@/lib/polly";

export async function GET() {
  const command = new DescribeVoicesCommand({
    Engine: "neural",
    LanguageCode: "en-US",
  });
  const response = await pollyClient.send(command);
  const voices = response.Voices?.map((v) => ({
    id: v.Id,
    name: v.Name,
    gender: v.Gender,
    languageName: v.LanguageName,
  }));
  return Response.json(voices);
}
```

Speech marks (word timing data):
```ts
const command = new SynthesizeSpeechCommand({
  Text: "Hello, how are you today?",
  OutputFormat: "json",
  VoiceId: "Joanna",
  Engine: "neural",
  SpeechMarkTypes: ["word", "sentence"],
});
const response = await pollyClient.send(command);
// Returns JSONL with timing data for each word/sentence:
// {"time":0,"type":"sentence","start":0,"end":25,"value":"Hello, how are you today?"}
// {"time":0,"type":"word","start":0,"end":5,"value":"Hello"}
```

Server Action for short phrases:
```ts
"use server";
import { SynthesizeSpeechCommand, type VoiceId } from "@aws-sdk/client-polly";
import { pollyClient } from "@/lib/polly";

export async function synthesizeSpeech(text: string, voiceId: VoiceId = "Joanna") {
  const command = new SynthesizeSpeechCommand({
    Text: text.slice(0, 3000),
    OutputFormat: "mp3",
    VoiceId: voiceId,
    Engine: "neural",
  });
  const response = await pollyClient.send(command);

  // Collect the stream into one buffer, then return it as base64
  const chunks: Uint8Array[] = [];
  const stream = response.AudioStream as AsyncIterable<Uint8Array>;
  for await (const chunk of stream) {
    chunks.push(chunk);
  }
  const buffer = Buffer.concat(chunks);
  return buffer.toString("base64");
}
```

TypeScript Notes
- `VoiceId` is a union type of all available voice IDs (e.g., `"Joanna"`, `"Matthew"`)
- `Engine` is `"neural"` or `"standard"`
- `OutputFormat` is `"mp3"`, `"ogg_vorbis"`, `"pcm"`, or `"json"`
- The `AudioStream` type varies by environment; cast to `ReadableStream` in serverless
```ts
import type {
  SynthesizeSpeechCommandInput,
  VoiceId,
  Engine,
} from "@aws-sdk/client-polly";

const voiceId: VoiceId = "Joanna";
const engine: Engine = "neural";

const input: SynthesizeSpeechCommandInput = {
  Text: "Hello",
  OutputFormat: "mp3",
  VoiceId: voiceId,
  Engine: engine,
};
```

Gotchas
- **Neural engine not available for all voices** -- Not every voice supports the neural engine. Fix: Check `DescribeVoicesCommand` with `Engine: "neural"` to get the list of supported voices, and fall back to `"standard"` if unsure.
- **Text length limits** -- The standard engine allows 6000 characters per request; neural allows 3000. Fix: Split long text into chunks at sentence boundaries and synthesize each separately.
- **SSML must be well-formed XML** -- Invalid SSML tags cause `SynthesizeSpeechCommand` to throw. Fix: Set `TextType: "ssml"`, wrap content in `<speak>` tags, and escape special characters (`&`, `<`).
- **Audio format compatibility** -- `pcm` output is raw audio data, not playable in a browser audio element. Fix: Use `mp3` or `ogg_vorbis` for browser playback; use `pcm` only for audio processing pipelines.
- **Cost at scale** -- Polly charges per character synthesized. Fix: Cache audio results in S3 or a CDN for repeated phrases, and set a `Cache-Control` header on responses.
- **Streaming body type varies** -- `AudioStream` is typed differently in Node.js vs. edge runtimes. Fix: In Node.js API routes, convert with `transformToWebStream()` or cast to `ReadableStream`; in the edge runtime, it may already be a web stream.
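The first gotcha can be handled with a small helper that chooses the engine per voice from `DescribeVoices` data. This is a sketch, not part of the recipe: `pickEngine` is a hypothetical name, and `VoiceInfo` mirrors only the two fields of Polly's voice shape needed here (`Id` and `SupportedEngines`).

```ts
// Minimal shape of the fields we need from Polly's DescribeVoices response
type VoiceInfo = { Id?: string; SupportedEngines?: string[] };

// Prefer "neural" when the voice supports it, otherwise fall back to "standard"
function pickEngine(voices: VoiceInfo[], voiceId: string): "neural" | "standard" {
  const voice = voices.find((v) => v.Id === voiceId);
  return voice?.SupportedEngines?.includes("neural") ? "neural" : "standard";
}

// Example with DescribeVoices-shaped data
const voices: VoiceInfo[] = [
  { Id: "Joanna", SupportedEngines: ["neural", "standard"] },
  { Id: "Raveena", SupportedEngines: ["standard"] },
];
console.log(pickEngine(voices, "Joanna"));  // "neural"
console.log(pickEngine(voices, "Raveena")); // "standard"
```

An unknown voice ID also falls back to `"standard"`, which keeps the request from failing outright.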
Alternatives
| Library | Best For | Trade-off |
|---|---|---|
| AWS Polly | High-quality neural voices, SSML control | AWS account required, per-character cost |
| Web Speech API | Free browser-native TTS | Inconsistent quality, limited voice control |
| Google Cloud TTS | WaveNet voices, broad language support | Different SDK, similar pricing |
| ElevenLabs | Ultra-realistic voice cloning | Higher cost, separate API |
| OpenAI TTS | Simple API, good quality | Limited voice customization |
FAQs
What is the difference between the neural and standard Polly engines?
- Neural engine produces more natural, human-like speech
- Standard engine is available for more voices but sounds more robotic
- Neural has a 3000-character limit per request; standard allows 6000
- Not all voices support the neural engine -- check with `DescribeVoicesCommand`
How do I stream Polly audio through a Next.js API route to the browser?
```ts
const response = await pollyClient.send(command);
const stream = response.AudioStream!.transformToWebStream();

return new Response(stream, {
  headers: { "Content-Type": "audio/mpeg" },
});
```

The browser receives the audio stream and can play it via an `<audio>` element.
What audio output formats does Polly support and which should I use for browser playback?
- `mp3` -- widely supported, good for browser playback
- `ogg_vorbis` -- good quality, supported in most modern browsers
- `pcm` -- raw audio data, not playable directly in a browser
- `json` -- returns speech marks (timing data), not audio
- Use `mp3` for browser playback in most cases
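An API route that accepts a format parameter needs a matching `Content-Type` header. A small lookup, as a sketch (the `pcm` entry is raw sample data and its MIME label is nominal; `application/octet-stream` would also be reasonable):

```ts
// Map a Polly OutputFormat to the Content-Type header for the response
const CONTENT_TYPES: Record<string, string> = {
  mp3: "audio/mpeg",
  ogg_vorbis: "audio/ogg",
  pcm: "audio/pcm",         // raw samples; browsers won't play this directly
  json: "application/json", // speech marks, not audio
};

console.log(CONTENT_TYPES["mp3"]); // "audio/mpeg"
```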
How do I use SSML to control speech prosody, pauses, and emphasis?
```ts
const command = new SynthesizeSpeechCommand({
  Text: `<speak>
    Hello. <break time="500ms"/>
    <prosody rate="slow">This is slow.</prosody>
    <emphasis level="strong">Important!</emphasis>
  </speak>`,
  TextType: "ssml",
  OutputFormat: "mp3",
  VoiceId: "Joanna",
  Engine: "neural",
});
```

Gotcha: Why does my SSML request throw an error?
- SSML must be well-formed XML wrapped in `<speak>` tags
- Set `TextType: "ssml"` in the command -- omitting this treats SSML tags as plain text
- Escape special XML characters: use `&amp;` for `&` and `&lt;` for `<`
- Invalid or unclosed tags cause `SynthesizeSpeechCommand` to throw
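A tiny helper for the escaping rule, as a sketch (`escapeSsmlText` is a hypothetical name; apply it to user-supplied text before interpolating it into a `<speak>` template -- `>` is escaped too for symmetry, though XML only strictly requires `&` and `<` in text content):

```ts
// Escape the XML special characters that break Polly's SSML parsing.
// "&" must be replaced first so we don't double-escape the entities we emit.
function escapeSsmlText(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

console.log(escapeSsmlText("Tom & Jerry < friends"));
// Tom &amp; Jerry &lt; friends
```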
How do I list all available neural voices for a specific language?
```ts
import { DescribeVoicesCommand } from "@aws-sdk/client-polly";

const command = new DescribeVoicesCommand({
  Engine: "neural",
  LanguageCode: "en-US",
});
const response = await pollyClient.send(command);
const voices = response.Voices?.map(v => ({
  id: v.Id,
  name: v.Name,
  gender: v.Gender,
}));
```

What are speech marks and how do I get word-level timing data?
- Speech marks provide timing information for each word, sentence, or viseme
- Set `OutputFormat: "json"` and `SpeechMarkTypes: ["word", "sentence"]`
- The response is JSONL (one JSON object per line) with `time`, `start`, `end`, and `value` fields
- Useful for karaoke-style highlighting or lip sync
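Parsing that JSONL into objects is one `JSON.parse` per line. A sketch with a hypothetical `parseSpeechMarks` helper and made-up timing values:

```ts
// One JSON object per line of Polly's speech-mark output
type SpeechMark = {
  time: number;  // offset from the start of the audio, in milliseconds
  type: string;  // "word", "sentence", "viseme", or "ssml"
  start: number; // byte offset of the token in the input text
  end: number;
  value: string;
};

function parseSpeechMarks(jsonl: string): SpeechMark[] {
  return jsonl
    .split("\n")
    .filter((line) => line.trim().length > 0) // skip blank/trailing lines
    .map((line) => JSON.parse(line) as SpeechMark);
}

// Example with illustrative timing values
const sample =
  '{"time":0,"type":"word","start":0,"end":5,"value":"Hello"}\n' +
  '{"time":320,"type":"word","start":7,"end":10,"value":"how"}\n';
const marks = parseSpeechMarks(sample);
console.log(marks[1].value, marks[1].time); // how 320
```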
Gotcha: How do I handle the AudioStream type that varies across runtimes?
- In Node.js API routes, `AudioStream` is a Node.js `Readable` stream
- In the edge runtime, it may already be a web `ReadableStream`
- Cast to `ReadableStream` when returning from a `Response` constructor
- For Server Actions, iterate with `for await` and collect into a `Buffer`
How do I type the Polly command inputs in TypeScript?
```ts
import type {
  SynthesizeSpeechCommandInput,
  VoiceId,
  Engine,
} from "@aws-sdk/client-polly";

const input: SynthesizeSpeechCommandInput = {
  Text: "Hello",
  OutputFormat: "mp3",
  VoiceId: "Joanna" satisfies VoiceId,
  Engine: "neural" satisfies Engine,
};
```

How should I handle the 3000-character limit for neural voice requests?
- Split long text into chunks at sentence boundaries
- Synthesize each chunk separately and concatenate the audio on the client
- Standard engine allows 6000 characters per request if neural is not required
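One way to implement the split, as a sketch (`chunkText` is a hypothetical helper; the regex breaks after sentence-ending punctuation, and a single sentence longer than the limit is passed through unsplit, so unbounded input would need an extra hard-split pass):

```ts
// Split text into chunks that fit Polly's per-request character limit,
// breaking at sentence boundaries.
function chunkText(text: string, maxLen = 3000): string[] {
  // Split after ".", "!", or "?" followed by whitespace
  const sentences = text.split(/(?<=[.!?])\s+/);
  const chunks: string[] = [];
  let current = "";
  for (const sentence of sentences) {
    // +1 accounts for the space re-inserted between joined sentences
    if (current && current.length + sentence.length + 1 > maxLen) {
      chunks.push(current);
      current = sentence;
    } else {
      current = current ? `${current} ${sentence}` : sentence;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

console.log(chunkText("One. Two. Three.", 10)); // [ 'One. Two.', 'Three.' ]
```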
How can I reduce Polly costs for repeated phrases?
- Cache audio results in S3 for phrases that are synthesized frequently
- Set `Cache-Control` headers on API responses for browser caching
- Use a CDN in front of your speech API route
- Only re-synthesize when the text or voice changes
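A deterministic cache key makes the S3/CDN lookup straightforward: hash everything that affects the audio. A sketch (`speechCacheKey` is a hypothetical helper; the `tts/` prefix and `.mp3` suffix are arbitrary naming choices):

```ts
import { createHash } from "node:crypto";

// Same text + voice + engine always yields the same key, so a cached
// object can be served instead of re-synthesizing.
function speechCacheKey(text: string, voiceId: string, engine: string): string {
  const digest = createHash("sha256")
    .update(`${voiceId}:${engine}:${text}`)
    .digest("hex");
  return `tts/${digest}.mp3`;
}

console.log(speechCacheKey("Hello", "Joanna", "neural"));
```

Check S3 for this key before calling Polly; on a miss, synthesize, upload, and return the audio.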
How do I return audio from a Server Action as base64 for client-side playback?
```ts
"use server";

const response = await pollyClient.send(command);
const chunks: Uint8Array[] = [];
for await (const chunk of response.AudioStream as AsyncIterable<Uint8Array>) {
  chunks.push(chunk);
}
return Buffer.concat(chunks).toString("base64");
```

On the client, convert the base64 string to a blob URL for the `<audio>` element.
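The client-side half can look like this, as a sketch (`base64ToAudioBlob` is a hypothetical helper; `atob` and `Blob` are browser globals, also available in Node 18+):

```ts
// Decode the base64 string returned by the Server Action into a Blob,
// ready to hand to URL.createObjectURL for an <audio> element.
function base64ToAudioBlob(base64: string, mimeType = "audio/mpeg"): Blob {
  const binary = atob(base64); // base64 -> binary string
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return new Blob([bytes], { type: mimeType });
}

// Usage in a client component:
// const blob = base64ToAudioBlob(await synthesizeSpeech(text, voiceId));
// audioRef.current.src = URL.createObjectURL(blob);
```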
Related
- AWS SDK S3 — Cache synthesized audio in S3
- AWS SDK Lambda — Process audio in Lambda
- Vercel AI SDK — Combine AI text generation with speech output