Audio
The Audio API covers Gemini-native audio understanding plus OpenAI-compatible speech synthesis, transcription, and translation endpoints.
Native Gemini Format
Use Gemini-compatible generateContent requests when you need multimodal audio understanding or generation with structured parts.
/v1/models/{model}:generateContent
Path Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Target model ID, such as gemini-1.5-pro. |
Request Body
| Field | Type | Required | Description |
|---|---|---|---|
| contents | array | Yes | Content items sent to the model. |
| contents[] | object | Yes | A single content item. The sample request uses an empty object here. |
| generationConfig | object | Yes | Generation configuration object used in the sample. |
| generationConfig. | array | Yes | Requested response modalities array. |
| generationConfig. | object | Yes | Speech configuration object. |
| generationConfig. | object | Yes | Voice configuration wrapper. |
| generationConfig. | object | Yes | Prebuilt voice settings. |
| generationConfig. | string | Yes | Voice preset name used in the sample. |
Response Body
| Field | Type | Description |
|---|---|---|
| candidates | array | Candidate responses returned by the model. |
| candidates[].content | object | Generated content object. |
| candidates[].content.role | string | Role returned in the generated content block. |
| candidates[].content.parts | array | Returned parts. |
| candidates[].finishReason | string | Finish reason string returned by the sample response. |
| candidates[].safetyRatings | array | Safety evaluation results. |
| usageMetadata | object | Token accounting. |
| usageMetadata.promptTokenCount | integer | Prompt token count. |
| usageMetadata.candidatesTokenCount | integer | Output token count. |
| usageMetadata.totalTokenCount | integer | Total token count. |
| promptFeedback | object | Prompt blocking feedback when applicable. |
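As a rough illustration of the response fields above, the sketch below reads the first candidate, joins its text parts, and picks up the finish reason and total token count. The `text` key inside each part and the sample payload are assumptions made for illustration; this page does not describe the shape of individual parts.

```python
from typing import Any


def summarize_response(response: dict[str, Any]) -> dict[str, Any]:
    """Collect text, finish reason, and token count from a generateContent response."""
    candidates = response.get("candidates") or []
    usage = response.get("usageMetadata", {})
    summary = {
        "text": "",
        "finish_reason": None,
        "total_tokens": usage.get("totalTokenCount", 0),
        # promptFeedback is only present when applicable, so this may be None.
        "prompt_feedback": response.get("promptFeedback"),
    }
    if not candidates:
        # Nothing was generated; the caller can inspect prompt_feedback.
        return summary

    first = candidates[0]
    parts = first.get("content", {}).get("parts", [])
    # Assumption: text parts carry their payload under a "text" key.
    summary["text"] = "".join(
        part["text"] for part in parts if isinstance(part, dict) and "text" in part
    )
    summary["finish_reason"] = first.get("finishReason")
    return summary


# Hypothetical payload shaped after the Response Body table.
sample = {
    "candidates": [
        {
            "content": {"role": "model", "parts": [{"text": "Hello "}, {"text": "world"}]},
            "finishReason": "STOP",
        }
    ],
    "usageMetadata": {"promptTokenCount": 5, "candidatesTokenCount": 2, "totalTokenCount": 7},
}
print(summarize_response(sample))
```

Returning a flat summary keeps the caller from depending on the nested structure, which is convenient if the part layout differs from the assumption here.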
Text-to-Speech
Convert text into natural speech with the OpenAI-compatible audio speech interface.
/v1/audio/speech
Request Body
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| model | string | Yes | - | Voice model, such as tts-1 or tts-1-hd. |
| input | string | Yes | - | Text to synthesize, up to 4096 characters. |
| voice | string | Yes | - | Voice preset, such as alloy, echo, fable, onyx, nova, or shimmer. |
| response_format | string | No | mp3 | Output audio format. |
| speed | number | No | 1.0 | Speaking speed from 0.25 to 4.0. |
Response
The endpoint returns a binary audio stream. Save the response body directly to a local file or cloud storage target.
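A minimal sketch of calling this endpoint with only the Python standard library follows. It checks the documented limits (4096 input characters, speed between 0.25 and 4.0) before sending, then streams the binary response to disk. The host `api.example.com`, the Bearer-token header, the JSON body encoding, and the use of POST are assumptions not stated on this page; the helper names are hypothetical.

```python
import json
import shutil
import urllib.request

BASE_URL = "https://api.example.com"  # hypothetical host; replace with your gateway
API_KEY = "YOUR_API_KEY"  # placeholder credential


def build_speech_payload(text, voice="alloy", model="tts-1",
                         response_format="mp3", speed=1.0):
    """Validate the documented limits and return the request body as a dict."""
    if not text:
        raise ValueError("input must not be empty")
    if len(text) > 4096:
        raise ValueError("input exceeds 4096 characters")
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")
    return {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
        "speed": speed,
    }


def synthesize_to_file(text, out_path, **options):
    """Send the request and copy the binary audio stream to out_path."""
    payload = build_speech_payload(text, **options)
    request = urllib.request.Request(
        f"{BASE_URL}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # Assumption: Bearer authentication and a JSON request body.
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out_file:
        # Copy in chunks so long audio is never held fully in memory.
        shutil.copyfileobj(response, out_file)


# Example (requires a reachable endpoint and a valid key):
# synthesize_to_file("Hello from the audio API.", "hello.mp3", voice="nova")
```

Validating locally avoids a round trip for requests that would be rejected for exceeding the documented limits.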
Audio Transcriptions
Transcribe uploaded audio into text with the OpenAI-compatible Whisper-style interface.
/v1/audio/transcriptions
Form Data
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| file | file | Yes | - | Audio file up to 25 MB. |
| model | string | Yes | - | Model ID, such as whisper-1. |
| language | string | No | - | ISO-639-1 language code, such as en, zh, or ko. |
| prompt | string | No | - | Optional prompt for biasing the transcript. |
| response_format | string | No | json | json, text, srt, verbose_json, or vtt. |
| temperature | number | No | 0 | Sampling temperature from 0 to 1. |
Supported Formats
FLAC, MP3, MP4, MPEG, MPGA, M4A, OGG, WAV, and WebM.
Response Body
| Field | Type | Description |
|---|---|---|
| text | string | Transcript text. |
When response_format is set to verbose_json, the response also includes task, language, duration, and per-segment timing metadata.
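The limits listed for this endpoint can be checked before uploading. The sketch below validates a file name and size against the supported formats and the 25 MB cap, and assembles the optional form fields. Two assumptions are worth noting: "25 MB" is read here as 25 × 1024 × 1024 bytes, and the format check relies on the file extension alone, which is only a heuristic. The function names are hypothetical.

```python
from pathlib import PurePath
from typing import Optional

MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # assumption: "25 MB" interpreted as 25 MiB
SUPPORTED_EXTENSIONS = {
    ".flac", ".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".ogg", ".wav", ".webm",
}
RESPONSE_FORMATS = {"json", "text", "srt", "verbose_json", "vtt"}


def validate_upload(file_name: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    extension = PurePath(file_name).suffix.lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {extension or '(none)'}")
    if size_bytes > MAX_UPLOAD_BYTES:
        problems.append(f"file is {size_bytes} bytes; limit is {MAX_UPLOAD_BYTES}")
    return problems


def transcription_fields(model: str = "whisper-1",
                         language: Optional[str] = None,
                         prompt: Optional[str] = None,
                         response_format: str = "json",
                         temperature: float = 0.0) -> dict[str, str]:
    """Build the non-file form fields, omitting optional ones that are unset."""
    if response_format not in RESPONSE_FORMATS:
        raise ValueError(f"unknown response_format: {response_format}")
    if not 0.0 <= temperature <= 1.0:
        raise ValueError("temperature must be between 0 and 1")
    fields = {
        "model": model,
        "response_format": response_format,
        "temperature": str(temperature),
    }
    if language is not None:
        fields["language"] = language
    if prompt is not None:
        fields["prompt"] = prompt
    return fields


print(validate_upload("meeting.mp3", 3_000_000))
print(transcription_fields(language="en", response_format="verbose_json"))
```

In practice the size would come from something like `os.path.getsize(path)`; passing it in explicitly keeps the check easy to test without touching the filesystem.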
Audio Translations
Translate uploaded audio into English with the OpenAI-compatible translation endpoint.
/v1/audio/translations
Form Data
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| file | file | Yes | - | Source audio file. |
| model | string | Yes | - | Model ID, such as whisper-1. |
| prompt | string | No | - | Optional English prompt. |
| response_format | string | No | json | json, text, srt, verbose_json, or vtt. |
| temperature | number | No | 0 | Sampling temperature from 0 to 1. |
Response Body
| Field | Type | Description |
|---|---|---|
| text | string | English translation of the uploaded audio. |
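Because this endpoint takes form data rather than JSON, the request body has to be multipart-encoded. The following sketch builds a multipart/form-data body by hand with the standard library and reads the `text` field from the default json response. The host, the Bearer-token header, the POST method, and the generic `application/octet-stream` content type for the file part are assumptions; `encode_multipart` and `translate_audio` are hypothetical helper names.

```python
import json
import os
import urllib.request
import uuid

BASE_URL = "https://api.example.com"  # hypothetical host; replace with your gateway
API_KEY = "YOUR_API_KEY"  # placeholder credential


def encode_multipart(fields: dict[str, str], file_name: str, file_bytes: bytes,
                     file_field: str = "file") -> tuple[bytes, str]:
    """Return (body, content_type) for a multipart/form-data request."""
    boundary = uuid.uuid4().hex
    body = b""
    for name, value in fields.items():
        # Each plain field is its own part, terminated by CRLF.
        body += (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n"
        ).encode("utf-8")
    # The file part carries a filename and a generic binary content type.
    body += (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{file_field}"; filename="{file_name}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    body += file_bytes + b"\r\n"
    body += f"--{boundary}--\r\n".encode("utf-8")
    return body, f"multipart/form-data; boundary={boundary}"


def translate_audio(path: str, model: str = "whisper-1") -> str:
    """Upload an audio file and return the English translation text."""
    with open(path, "rb") as audio_file:
        audio = audio_file.read()
    body, content_type = encode_multipart(
        {"model": model, "response_format": "json"},
        os.path.basename(path),
        audio,
    )
    request = urllib.request.Request(
        f"{BASE_URL}/v1/audio/translations",
        data=body,
        headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["text"]


# Example (requires a reachable endpoint and a valid key):
# print(translate_audio("interview_ko.mp3"))
```

A dedicated HTTP client library would normally handle the multipart encoding; it is written out here only to show what the form-data request contains.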
