React SME Cookbook

Tags: aws, polly, text-to-speech, audio, tts, ssml

AWS SDK Polly - Text-to-speech synthesis with streaming audio and SSML support

Recipe

npm install @aws-sdk/client-polly
// lib/polly.ts
import { PollyClient } from "@aws-sdk/client-polly";
 
export const pollyClient = new PollyClient({
  region: process.env.AWS_REGION!,
  credentials: {
    accessKeyId: process.env.AWS_ACCESS_KEY_ID!,
    secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY!,
  },
});
// app/api/speech/route.ts
import { SynthesizeSpeechCommand } from "@aws-sdk/client-polly";
import { pollyClient } from "@/lib/polly";
 
export async function POST(req: Request) {
  const { text, voiceId = "Joanna" } = await req.json();

  if (!text?.trim()) {
    return new Response("Missing text", { status: 400 });
  }
 
  const command = new SynthesizeSpeechCommand({
    Text: text,
    OutputFormat: "mp3",
    VoiceId: voiceId,
    Engine: "neural",
  });
 
  const response = await pollyClient.send(command);

  if (!response.AudioStream) {
    return new Response("Synthesis returned no audio", { status: 502 });
  }

  const stream = response.AudioStream as ReadableStream;
 
  return new Response(stream, {
    headers: {
      "Content-Type": "audio/mpeg",
      "Cache-Control": "public, max-age=3600",
    },
  });
}

When to reach for this: You need to convert text to natural-sounding speech in a web app for accessibility, voice assistants, language learning, or content narration.

Working Example

// app/components/TextToSpeech.tsx
"use client";
import { useState, useRef } from "react";
 
const VOICES = [
  { id: "Joanna", name: "Joanna (US English)", lang: "en-US" },
  { id: "Matthew", name: "Matthew (US English)", lang: "en-US" },
  { id: "Amy", name: "Amy (British English)", lang: "en-GB" },
  { id: "Lea", name: "Léa (French)", lang: "fr-FR" },
  { id: "Vicki", name: "Vicki (German)", lang: "de-DE" },
  { id: "Lucia", name: "Lucia (Spanish)", lang: "es-ES" },
];
 
export default function TextToSpeech() {
  const [text, setText] = useState("");
  const [voiceId, setVoiceId] = useState("Joanna");
  const [loading, setLoading] = useState(false);
  const audioRef = useRef<HTMLAudioElement>(null);
 
  async function handleSpeak() {
    if (!text.trim()) return;
    setLoading(true);
 
    try {
      const res = await fetch("/api/speech", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ text, voiceId }),
      });
 
      if (!res.ok) throw new Error("Speech synthesis failed");
 
      const blob = await res.blob();
      const url = URL.createObjectURL(blob);
 
      if (audioRef.current) {
        audioRef.current.src = url;
        audioRef.current.play();
      }
    } catch (error) {
      console.error("TTS error:", error);
    } finally {
      setLoading(false);
    }
  }
 
  return (
    <div className="max-w-md mx-auto p-6 space-y-4">
      <textarea
        value={text}
        onChange={(e) => setText(e.target.value)}
        placeholder="Enter text to speak..."
        rows={4}
        className="w-full border rounded px-3 py-2"
        maxLength={3000}
      />
      <div className="flex gap-3">
        <select
          value={voiceId}
          onChange={(e) => setVoiceId(e.target.value)}
          className="border rounded px-3 py-2"
        >
          {VOICES.map((v) => (
            <option key={v.id} value={v.id}>
              {v.name}
            </option>
          ))}
        </select>
        <button
          onClick={handleSpeak}
          disabled={loading || !text.trim()}
          className="bg-blue-600 text-white px-4 py-2 rounded disabled:opacity-50"
        >
          {loading ? "Generating..." : "Speak"}
        </button>
      </div>
      <audio ref={audioRef} controls className="w-full" />
      <p className="text-xs text-gray-500">{text.length}/3000 characters</p>
    </div>
  );
}

What this demonstrates:

  • Full text-to-speech UI with voice selection
  • Streaming audio from Polly through a Next.js API route
  • Audio blob creation and playback with HTML5 audio element
  • Character limit handling (Polly's per-request limit is 3000 characters for neural voices, 6000 for standard)

Deep Dive

How It Works

  • Polly converts text to audio using neural or standard speech synthesis engines
  • The neural engine produces more natural-sounding speech but is available only for a subset of voices
  • SynthesizeSpeechCommand returns an AudioStream that can be piped directly to the response
  • Supported output formats: mp3, ogg_vorbis, pcm, and json (for speech marks)
  • Polly supports both plain text and SSML (Speech Synthesis Markup Language) for fine-grained control
  • Each voice is tied to a specific language; you cannot change a voice's language

Variations

SSML for advanced speech control:

const command = new SynthesizeSpeechCommand({
  Text: `<speak>
    Welcome to our app.
    <break time="500ms"/>
    <prosody rate="slow" pitch="+10%">
      This part is spoken slowly with a higher pitch.
    </prosody>
    <emphasis level="strong">This is important.</emphasis>
    <say-as interpret-as="date" format="mdy">12/25/2025</say-as>
  </speak>`,
  TextType: "ssml",
  OutputFormat: "mp3",
  VoiceId: "Joanna",
  Engine: "neural",
});

Get available voices:

import { DescribeVoicesCommand } from "@aws-sdk/client-polly";
 
export async function GET() {
  const command = new DescribeVoicesCommand({
    Engine: "neural",
    LanguageCode: "en-US",
  });
 
  const response = await pollyClient.send(command);
  const voices = response.Voices?.map((v) => ({
    id: v.Id,
    name: v.Name,
    gender: v.Gender,
    languageName: v.LanguageName,
  }));
 
  return Response.json(voices);
}

Speech marks (word timing data):

const command = new SynthesizeSpeechCommand({
  Text: "Hello, how are you today?",
  OutputFormat: "json",
  VoiceId: "Joanna",
  Engine: "neural",
  SpeechMarkTypes: ["word", "sentence"],
});
 
const response = await pollyClient.send(command);
// Returns JSONL with timing data for each word/sentence
// {"time":0,"type":"sentence","start":0,"end":25,"value":"Hello, how are you today?"}
// {"time":0,"type":"word","start":0,"end":5,"value":"Hello"}

Server Action for short phrases:

"use server";
import { SynthesizeSpeechCommand } from "@aws-sdk/client-polly";
import { pollyClient } from "@/lib/polly";
 
export async function synthesizeSpeech(text: string, voiceId: string = "Joanna") {
  const command = new SynthesizeSpeechCommand({
    Text: text.slice(0, 3000),
    OutputFormat: "mp3",
    VoiceId: voiceId,
    Engine: "neural",
  });
 
  const response = await pollyClient.send(command);
  const chunks: Uint8Array[] = [];
  const stream = response.AudioStream as AsyncIterable<Uint8Array>;
 
  for await (const chunk of stream) {
    chunks.push(chunk);
  }
 
  const buffer = Buffer.concat(chunks);
  return buffer.toString("base64");
}

TypeScript Notes

  • VoiceId is a union type of all available voice IDs (e.g., "Joanna", "Matthew")
  • Engine is "neural" or "standard"
  • OutputFormat is "mp3" or "ogg_vorbis" or "pcm" or "json"
  • AudioStream type varies by environment; cast to ReadableStream in serverless
import type {
  SynthesizeSpeechCommandInput,
  VoiceId,
  Engine,
} from "@aws-sdk/client-polly";
 
const voiceId: VoiceId = "Joanna";
const engine: Engine = "neural";
 
const input: SynthesizeSpeechCommandInput = {
  Text: "Hello",
  OutputFormat: "mp3",
  VoiceId: voiceId,
  Engine: engine,
};

Gotchas

  • Neural engine not available for all voices — Not every voice supports the neural engine. Fix: Check DescribeVoicesCommand with Engine: "neural" to get the list of supported voices. Fall back to "standard" if unsure.

  • Text length limits — Standard engine allows 6000 characters, neural allows 3000 per request. Fix: Split long text into chunks at sentence boundaries and synthesize each separately.

  • SSML must be well-formed XML — Invalid SSML tags cause SynthesizeSpeechCommand to throw. Fix: Set TextType: "ssml" and wrap content in <speak> tags. Escape special characters (&amp;, &lt;).

  • Audio format compatibility — pcm output is raw audio data, not playable in a browser audio element. Fix: Use mp3 or ogg_vorbis for browser playback. Use pcm only for audio processing pipelines.

  • Cost at scale — Polly charges per character synthesized. Fix: Cache audio results in S3 or a CDN for repeated phrases. Use the Cache-Control header on responses.

  • Streaming body type varies — AudioStream is typed differently in Node.js vs. edge runtimes. Fix: In Node.js API routes, cast to ReadableStream. In edge runtime, it may already be a web stream.

Alternatives

Library | Best For | Trade-off
AWS Polly | High-quality neural voices, SSML control | AWS account required, per-character cost
Web Speech API | Free browser-native TTS | Inconsistent quality, limited voice control
Google Cloud TTS | WaveNet voices, broad language support | Different SDK, similar pricing
ElevenLabs | Ultra-realistic voice cloning | Higher cost, separate API
OpenAI TTS | Simple API, good quality | Limited voice customization

FAQs

What is the difference between the neural and standard Polly engines?
  • Neural engine produces more natural, human-like speech
  • Standard engine is available for more voices but sounds more robotic
  • Neural has a 3000-character limit per request; standard allows 6000
  • Not all voices support the neural engine -- check with DescribeVoicesCommand
How do I stream Polly audio through a Next.js API route to the browser?
const response = await pollyClient.send(command);
const stream = response.AudioStream as ReadableStream;
 
return new Response(stream, {
  headers: { "Content-Type": "audio/mpeg" },
});

The browser receives the audio stream and can play it via an <audio> element.

What audio output formats does Polly support and which should I use for browser playback?
  • mp3 -- widely supported, good for browser playback
  • ogg_vorbis -- good quality, supported in most modern browsers
  • pcm -- raw audio data, not playable directly in a browser
  • json -- returns speech marks (timing data), not audio
  • Use mp3 for browser playback in most cases
How do I use SSML to control speech prosody, pauses, and emphasis?
const command = new SynthesizeSpeechCommand({
  Text: `<speak>
    Hello. <break time="500ms"/>
    <prosody rate="slow">This is slow.</prosody>
    <emphasis level="strong">Important!</emphasis>
  </speak>`,
  TextType: "ssml",
  OutputFormat: "mp3",
  VoiceId: "Joanna",
  Engine: "neural",
});
Gotcha: Why does my SSML request throw an error?
  • SSML must be well-formed XML wrapped in <speak> tags
  • Set TextType: "ssml" in the command -- omitting this treats SSML tags as plain text
  • Escape special XML characters: use &amp; for &, &lt; for <
  • Invalid or unclosed tags cause SynthesizeSpeechCommand to throw
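If you interpolate user-supplied text into an SSML template, escape it first. A small sketch (the escapeSsml helper name is ours):

```typescript
// Escape XML special characters so arbitrary text is safe inside <speak>.
// The & replacement must run first so it does not re-escape the entities below.
function escapeSsml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

const ssml = `<speak>${escapeSsml('Tom & Jerry say "hi" <loudly>')}</speak>`;
```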
How do I list all available neural voices for a specific language?
import { DescribeVoicesCommand } from "@aws-sdk/client-polly";
 
const command = new DescribeVoicesCommand({
  Engine: "neural",
  LanguageCode: "en-US",
});
const response = await pollyClient.send(command);
const voices = response.Voices?.map(v => ({
  id: v.Id,
  name: v.Name,
  gender: v.Gender,
}));
What are speech marks and how do I get word-level timing data?
  • Speech marks provide timing information for each word, sentence, or viseme
  • Set OutputFormat: "json" and SpeechMarkTypes: ["word", "sentence"]
  • The response is JSONL (one JSON object per line) with time, start, end, and value fields
  • Useful for karaoke-style highlighting or lip sync
Gotcha: How do I handle the AudioStream type that varies across runtimes?
  • In Node.js API routes, AudioStream is a Node.js Readable stream
  • In edge runtime, it may already be a web ReadableStream
  • Cast to ReadableStream when returning from a Response constructor
  • For Server Actions, iterate with for await and collect into a Buffer
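These cases can be collapsed into one helper. A sketch, assuming a recent SDK version where the Node.js stream exposes transformToWebStream() (the toWebStream name is ours):

```typescript
// Normalize Polly's AudioStream to a web ReadableStream regardless of runtime.
// In edge runtimes the body is already a web stream; in Node.js the SDK
// attaches transformToWebStream() to the stream it returns.
function toWebStream(body: unknown): ReadableStream<Uint8Array> {
  if (body instanceof ReadableStream) return body;
  const sdkStream = body as { transformToWebStream?: () => ReadableStream<Uint8Array> };
  if (typeof sdkStream?.transformToWebStream === "function") {
    return sdkStream.transformToWebStream();
  }
  throw new Error("Unsupported AudioStream type");
}
```

The result can be passed straight to the Response constructor in either runtime.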
How do I type the Polly command inputs in TypeScript?
import type {
  SynthesizeSpeechCommandInput,
  VoiceId,
  Engine,
} from "@aws-sdk/client-polly";
 
const input: SynthesizeSpeechCommandInput = {
  Text: "Hello",
  OutputFormat: "mp3",
  VoiceId: "Joanna" satisfies VoiceId,
  Engine: "neural" satisfies Engine,
};
How should I handle the 3000-character limit for neural voice requests?
  • Split long text into chunks at sentence boundaries
  • Synthesize each chunk separately and concatenate the audio on the client
  • Standard engine allows 6000 characters per request if neural is not required
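A sentence-boundary splitter can be sketched like this (the chunkText helper is ours; it uses a naive period/question/exclamation split, so abbreviations may over-split):

```typescript
// Split text into chunks no longer than maxLen, breaking at sentence
// boundaries where possible. Oversized sentences are hard-split as a fallback.
function chunkText(text: string, maxLen = 3000): string[] {
  const sentences = text.match(/[^.!?]+[.!?]*\s*/g) ?? [text];
  const chunks: string[] = [];
  let current = "";

  for (const sentence of sentences) {
    if (sentence.length > maxLen) {
      // Flush what we have, then hard-split the oversized sentence.
      if (current) { chunks.push(current); current = ""; }
      for (let i = 0; i < sentence.length; i += maxLen) {
        chunks.push(sentence.slice(i, i + maxLen));
      }
    } else if (current.length + sentence.length > maxLen) {
      chunks.push(current);
      current = sentence;
    } else {
      current += sentence;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Synthesize each chunk with its own SynthesizeSpeechCommand and play (or concatenate) the results in order.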
How can I reduce Polly costs for repeated phrases?
  • Cache audio results in S3 for phrases that are synthesized frequently
  • Set Cache-Control headers on API responses for browser caching
  • Use a CDN in front of your speech API route
  • Only re-synthesize when the text or voice changes
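A stable cache key makes this easy: hash everything that affects the audio output. A sketch (the speechCacheKey helper and the key layout are ours, not an AWS convention):

```typescript
import { createHash } from "node:crypto";

// Derive a deterministic cache key from every input that affects the audio.
// Storing synthesized audio in S3 or a CDN under this key means a repeated
// phrase is only ever billed once.
function speechCacheKey(text: string, voiceId: string, engine: string): string {
  return createHash("sha256")
    .update(`${engine}:${voiceId}:${text}`)
    .digest("hex");
}
```

For example, `audio/${speechCacheKey(text, voiceId, "neural")}.mp3` would work as a hypothetical S3 object key: check for the object before calling Polly, and upload the result after a cache miss.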
How do I return audio from a Server Action as base64 for client-side playback?
"use server";
const response = await pollyClient.send(command);
const chunks: Uint8Array[] = [];
for await (const chunk of response.AudioStream as AsyncIterable<Uint8Array>) {
  chunks.push(chunk);
}
return Buffer.concat(chunks).toString("base64");

On the client, convert base64 to a blob URL for the <audio> element.
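That client-side conversion can be sketched as follows (the base64ToBytes helper name is ours):

```typescript
// Decode the base64 string returned by the Server Action into raw bytes.
function base64ToBytes(base64: string): Uint8Array {
  const binary = atob(base64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i);
  return bytes;
}

// In the browser, wrap the bytes in a Blob and hand the object URL to <audio>:
// const url = URL.createObjectURL(new Blob([base64ToBytes(b64)], { type: "audio/mpeg" }));
// audioRef.current.src = url;
```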