Quickstart

Score a guided speech recording in under 60 seconds.

1. Send a request

# Score a recording with curl
# (base64 -w 0 prevents line wrapping; -w is a GNU coreutils flag, and on macOS plain base64 already emits a single line)
curl -X POST https://api.prosody.studio/v1/scores \
  -H "X-API-Key: $PROSODY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "audio_data": "'"$(base64 -w 0 recording.wav)"'",
    "sample_rate": 16000,
    "language": "en-US",
    "reference_text": "The quick brown fox"
  }'
// TypeScript SDK
import { readFileSync } from "node:fs";
import { ProsodyClient } from "@prosody/sdk";

const client = new ProsodyClient({ apiKey: process.env.PROSODY_API_KEY });
const result = await client.score({
  audio: readFileSync("recording.wav").toString("base64"),
  language: "en-US",
  referenceText: "The quick brown fox"
});
# Score a recording with Python
import base64
import os

import requests

with open("recording.wav", "rb") as f:
    audio = base64.b64encode(f.read()).decode()

resp = requests.post(
    "https://api.prosody.studio/v1/scores",
    headers={"X-API-Key": os.environ["PROSODY_API_KEY"]},
    json={
        "audio_data": audio,
        "sample_rate": 16000,
        "language": "en-US",
        "reference_text": "The quick brown fox",
    },
)
result = resp.json()

2. Read the result

{
  "scores": {
    "pronunciation": 72.4,
    "script_adherence": 100.0,
    "overall": 72.4
  },
  "words": [
    {
      "word": "the",
      "status": "match",
      "acoustic_match": 68.1,
      "timing": { "start": 0.12, "end": 0.24, "duration_ms": 120 },
      "phonemes": [
        { "detected": "DH", "acoustic_match": 71.2, "timing": { "start": 0.12, "end": 0.18 } },
        { "detected": "AH", "acoustic_match": 65.0, "timing": { "start": 0.18, "end": 0.24 } }
      ]
    }
  ]
}
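For downstream UI work, a small helper can surface the words that need attention. This is an illustrative sketch assuming the response shape above; the 70-point threshold is an arbitrary choice, not a recommendation from the API:

```python
def weakest_words(result, threshold=70.0):
    """Return (word, acoustic_match) pairs below `threshold`, worst first,
    from a POST /v1/scores response dict."""
    flagged = [
        (w["word"], w["acoustic_match"])
        for w in result.get("words", [])
        if w.get("acoustic_match", 100.0) < threshold
    ]
    return sorted(flagged, key=lambda pair: pair[1])

# With the example response above, "the" (68.1) falls below 70:
example = {
    "scores": {"pronunciation": 72.4, "script_adherence": 100.0, "overall": 72.4},
    "words": [{"word": "the", "status": "match", "acoustic_match": 68.1}],
}
print(weakest_words(example))  # → [('the', 68.1)]
```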

The API works with any HTTP client. Public examples currently use English (en-US). See the full POST /v1/scores reference below, or try the playground (no key required).

Postman

Import the public collection, pair it with the production environment, set api_key, and start with GET /health and POST /v1/scores.

Public collection

Prosody API (Public)

Consumer-facing endpoints only: health, scoring, alignment, languages, auth helpers, preview history/results, and beta streaming. Internal admin and debug routes are excluded.

Production environment

Cloud Run + Modal GPU

Preconfigured for https://api.prosody.studio. Set api_key and the collection script injects X-API-Key automatically.

Recommended consumer flow:

  • Use api_key for external evaluation; JWT login remains available for local/dev and user-session testing.
  • Treat WebSocket streaming as beta in Postman.
  • Stored result lookup is a manual flow that uses score_result_id, which is distinct from the scoring response request_id.
  • GET /v1/history and GET /v1/results are preview endpoints and currently return mock data.

Concepts

Alignment

Prosody aligns every phoneme in the speaker's audio against a reference text. The alignment engine runs on GPU and produces per-phoneme timing boundaries in ~20ms. This is the foundation layer — it tells the system what was said and when.

Pronunciation signals

On top of alignment, Prosody generates acoustic scores that measure how well each phoneme was pronounced. Scores use a 0–100 scale across three perspectives:

  • Acoustic match — How closely the spoken audio matches expected phoneme patterns. Scored per phoneme, per word, and overall.
  • Script adherence — How closely the speaker followed the reference text. 100 means all expected words were detected.
  • Overall — Combined score from acoustic match and script adherence.

Reference-guided scoring

All scoring requires a reference_text — the sentence the learner was asked to read. The system aligns what was spoken against what was expected, then scores the match. This is the core pattern for guided speech products: the user sees a prompt, records audio, and the product needs aligned feedback back.

Batch vs streaming

Single — Score one recording at a time. Best for feedback after a speaker finishes talking.

Batch — Score up to 100 recordings in a single request. Best for grading homework sets, running test suites, or processing guided speech sessions at scale.

Streaming — Send audio over WebSocket and receive word-by-word scores as the learner speaks. 500–1000ms tick cadence. Best for live coaching interfaces. The public product still leads with batch; streaming is available in beta.

API Reference

Base URL

https://api.prosody.studio

All API requests use HTTPS. HTTP requests are rejected.

Authentication

Authenticate requests with an API key via the X-API-Key header.

curl -X POST https://api.prosody.studio/v1/scores \
  -H "X-API-Key: your_api_key" \
  -H "Content-Type: application/json" \
  -d '...'

Trial mode is available without a key — 10 requests per day per IP. Try the playground to test without signing up.
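With requests, setting the header once on a Session keeps every call authenticated. A minimal sketch, assuming the key is stored in the PROSODY_API_KEY environment variable:

```python
import os

import requests

session = requests.Session()
session.headers.update({
    "X-API-Key": os.environ.get("PROSODY_API_KEY", ""),  # leave unset for trial mode
    "Content-Type": "application/json",
})

# Every request made through this session now carries the API key, e.g.:
# resp = session.post("https://api.prosody.studio/v1/scores", json=payload)
```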

POST /v1/scores

Score a single audio recording against a reference text.

Request body

{
  "audio_data": "<base64-encoded audio>",
  "sample_rate": 16000,
  "language": "en-US",
  "reference_text": "The quick brown fox"
}
Field Type Required Description
audio_data string Yes Base64-encoded audio data
sample_rate integer Yes Audio sample rate in Hz (e.g. 16000)
language string Yes Language code: en-US
reference_text string Yes The expected text the speaker should have read

Query parameters

Parameter Type Default Description
detail string standard Response detail level: summary, standard, or full
silence_threshold_ms integer 120 Silence gap (ms) for word boundary detection
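Query parameters are appended to the URL; with requests they can be passed via `params`. A sketch using a prepared request so the resulting URL can be inspected without sending anything (the helper name and the 80 ms value are illustrative):

```python
import requests

def build_score_request(payload, api_key, detail="summary", silence_threshold_ms=80):
    """Build (without sending) a POST /v1/scores request with query parameters set."""
    req = requests.Request(
        "POST",
        "https://api.prosody.studio/v1/scores",
        headers={"X-API-Key": api_key},
        params={"detail": detail, "silence_threshold_ms": silence_threshold_ms},
        json=payload,
    )
    return req.prepare()

prepared = build_score_request({"reference_text": "The quick brown fox"}, "demo_key")
print(prepared.url)
# https://api.prosody.studio/v1/scores?detail=summary&silence_threshold_ms=80
```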

Response

{
  "scores": {
    "pronunciation": 72.4,
    "script_adherence": 100.0,
    "overall": 72.4
  },
  "words": [
    {
      "word": "the",
      "status": "match",
      "acoustic_match": 68.1,
      "timing": { "start": 0.12, "end": 0.24, "duration_ms": 120 },
      "phonemes": [
        {
          "expected": "DH",
          "detected": "DH",
          "acoustic_match": 71.2
        },
        {
          "expected": "AH",
          "detected": "AH",
          "acoustic_match": 65.0
        }
      ]
    }
  ]
}

SDK equivalent: client.score()

POST /v1/scores/batch

Score multiple recordings in a single request (up to 100 items).

Request body

{
  "items": [
    {
      "item_id": "sentence-1",
      "audio_data": "<base64>",
      "sample_rate": 16000,
      "language": "en-US",
      "reference_text": "Hello world"
    }
  ],
  "parallel": true,
  "max_concurrency": 4
}

Response

{
  "success_count": 5,
  "failure_count": 0,
  "total_time_ms": 1240,
  "results": [
    {
      "item_id": "sentence-1",
      "success": true,
      "result": { /* same as POST /v1/scores response */ }
    }
  ]
}

SDK equivalent: client.scoreBatch()
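Building the items array from a set of recordings might look like this (a sketch; it assumes 16 kHz WAV files, and the helper name and tuple shape are illustrative):

```python
import base64
from pathlib import Path

def build_batch(recordings, language="en-US", sample_rate=16000):
    """recordings: iterable of (item_id, wav_path, reference_text) tuples.
    Returns a request body for POST /v1/scores/batch (max 100 items)."""
    items = [
        {
            "item_id": item_id,
            "audio_data": base64.b64encode(Path(path).read_bytes()).decode(),
            "sample_rate": sample_rate,
            "language": language,
            "reference_text": reference_text,
        }
        for item_id, path, reference_text in recordings
    ]
    if len(items) > 100:
        raise ValueError("batch endpoint accepts at most 100 items")
    return {"items": items, "parallel": True, "max_concurrency": 4}
```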

WebSocket streaming (beta)

Stream audio for real-time scoring over a persistent WebSocket connection. Results arrive as words are recognized, with a 500–1000ms tick cadence.

Connection

wss://api.prosody.studio/v1/stream
  ?language=en-US
  &reference_text=The+quick+brown+fox

Protocol

  • Send binary audio frames (16kHz mono PCM, 16-bit).
  • Receive JSON messages with partial word scores as they are detected.
  • Send a text message {"type":"end"} to signal end of audio.
  • The server responds with a final complete result.
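Client-side, the audio half of the protocol amounts to chunking raw PCM into binary frames and finishing with the end message. A sketch; the 100 ms frame size is an illustrative choice, not a protocol requirement:

```python
import json

SAMPLE_RATE = 16000
BYTES_PER_SAMPLE = 2  # 16-bit mono PCM

def pcm_frames(pcm: bytes, frame_ms: int = 100):
    """Yield binary frames of `frame_ms` milliseconds of 16 kHz mono PCM."""
    frame_bytes = SAMPLE_RATE * BYTES_PER_SAMPLE * frame_ms // 1000
    for offset in range(0, len(pcm), frame_bytes):
        yield pcm[offset:offset + frame_bytes]

# Sent as a text frame after the last audio frame:
END_MESSAGE = json.dumps({"type": "end"})

# One second of silence yields ten 100 ms frames of 3200 bytes each:
frames = list(pcm_frames(b"\x00" * SAMPLE_RATE * BYTES_PER_SAMPLE))
print(len(frames), len(frames[0]))  # → 10 3200
```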

SDK equivalent: client.stream(). Streaming scoring is in beta. Contact us for access and integration guidance.

Response fields

Scores object

Field Type Description
pronunciation float Goodness of Pronunciation (GOP) quality for words matched to the script. 0–100 scale.
script_adherence float How closely the speaker followed the reference text. 100 means all expected words were detected. 0–100 scale.
overall float Combined score from acoustic match and script adherence. 0–100 scale.

Word object

Field Type Description
word string The expected word from the reference text
timing.start float Start time in seconds
timing.end float End time in seconds
status string match, mismatch, or missing
acoustic_match float Per-word acoustic match score. 0–100 scale.
phonemes array Per-phoneme detail (see below)

Phoneme object

Field Type Description
expected string Expected phoneme (ARPAbet notation, e.g. DH)
detected string Detected phoneme (ARPAbet notation, e.g. DH)
acoustic_match float Per-phoneme acoustic match score. 0–100 scale.
timing.start float Start time in seconds
timing.end float End time in seconds
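For error analysis it is often convenient to flatten the nested response into rows. An illustrative sketch assuming a response that includes per-phoneme detail as documented above:

```python
def phoneme_rows(result):
    """Flatten a /v1/scores response into
    (word, expected, detected, acoustic_match) rows, one per phoneme."""
    return [
        (w["word"], p.get("expected"), p.get("detected"), p.get("acoustic_match"))
        for w in result.get("words", [])
        for p in w.get("phonemes", [])
    ]

example = {
    "words": [{
        "word": "the",
        "phonemes": [
            {"expected": "DH", "detected": "DH", "acoustic_match": 71.2},
            {"expected": "AH", "detected": "AH", "acoustic_match": 65.0},
        ],
    }]
}
print(phoneme_rows(example))
# → [('the', 'DH', 'DH', 71.2), ('the', 'AH', 'AH', 65.0)]
```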

Audio formats

Audio is sent as base64-encoded data in the audio_data field.

Format Supported Notes
WAV (PCM 16-bit) Recommended Lossless, best quality
WebM (Opus) Yes Browser recording default
MP3 Yes Auto-detected and converted
FLAC Yes Lossless compressed
OGG (Vorbis) Yes Auto-detected and converted

All audio is internally converted to 16kHz mono PCM for alignment. Providing 16kHz mono WAV avoids conversion overhead.
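Checking a WAV file's format before upload avoids the server-side conversion step. A sketch using the standard-library wave module (only PCM WAV can be inspected this way; the helper name is illustrative):

```python
import base64
import wave

def load_wav_for_scoring(path):
    """Read a PCM WAV file and return (base64_audio, sample_rate).
    Raises ValueError if the file is not 16 kHz mono 16-bit, which the
    service would otherwise convert internally."""
    with wave.open(path, "rb") as w:
        if (w.getframerate(), w.getnchannels(), w.getsampwidth()) != (16000, 1, 2):
            raise ValueError(
                f"expected 16 kHz mono 16-bit PCM, got "
                f"{w.getframerate()} Hz / {w.getnchannels()} ch / {w.getsampwidth() * 8}-bit"
            )
        sample_rate = w.getframerate()
    with open(path, "rb") as f:
        audio = base64.b64encode(f.read()).decode()
    return audio, sample_rate
```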

Errors

Errors return a JSON body with error and message fields.

{
  "error": "bad_request",
  "message": "reference_text is required"
}
Status Error Description
400 bad_request Invalid audio, missing reference_text, or bad parameters
401 unauthorized Missing or invalid API key (not applicable in trial mode)
429 rate_limit_exceeded Too many requests. Check retry-after header
500 internal Server error during scoring
503 service_unavailable Alignment engine temporarily unavailable

Rate limits

Tier Limit Auth
Trial 10 requests / day None (IP-based)
Developer 60 requests / min API key
Growth Custom API key

Rate limit status is returned in response headers:

Header Description
x-ratelimit-limit Maximum requests allowed in the current window
x-ratelimit-remaining Requests remaining in the current window
x-ratelimit-reset Unix timestamp when the window resets
retry-after Seconds to wait (only present on 429 responses)
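A retry loop that honors the retry-after header on 429 responses can be sketched as follows. The `send` callable indirection is an illustrative choice that keeps the helper testable; wrap your actual requests call in it:

```python
import time

def with_rate_limit_retry(send, max_retries=3):
    """Call `send()` (any function returning a requests-style response),
    retrying on 429. Honors retry-after when present; otherwise falls
    back to exponential backoff. Sketch only; tune max_retries as needed."""
    for attempt in range(max_retries + 1):
        resp = send()
        if resp.status_code != 429 or attempt == max_retries:
            return resp
        # retry-after is only present on 429 responses.
        time.sleep(float(resp.headers.get("retry-after", 2 ** attempt)))
    return resp

# Usage with requests (not executed here):
# resp = with_rate_limit_retry(
#     lambda: requests.post(url, headers={"X-API-Key": key}, json=payload)
# )
```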

TypeScript SDK

The official TypeScript SDK (@prosody/sdk) wraps the Prosody HTTP API with full type safety, automatic retries, Zod schema validation, and browser audio utilities.

Install it directly, or use the HTTP API if you want a language-agnostic integration.

npm install @prosody/sdk

Privacy & data handling

  • Audio is never stored. Audio data is processed in-memory and discarded after scoring completes.
  • No third-party processing. All alignment and scoring happens within Prosody infrastructure. Audio is never sent to external services.
  • No training on your data. Audio and text submitted through the API are not used to train or improve models.
  • Minimal logging. API logs include request metadata (timestamps, status codes, latency) but not audio content or reference text.

For GDPR-specific questions or a Data Processing Agreement, contact francois@prosody.studio.