Generate content (Gemini)
Audio
Native Gemini Format
Gemini-native generateContent interface for text chat, multimodal media recognition (images, audio, video), speech synthesis, and image generation with structured parts. Use generationConfig to request specific response modalities such as speech (speechConfig) or images (imageConfig).
POST
Generate content (Gemini)
This page uses the same
generateContent operation as Generate content (Gemini), with the playground above pre-filled for plain text chat. The notes below describe the Gemini-native fields you can add to generationConfig to request audio understanding or generation with structured parts.
Set
generationConfig.responseModalities to ["AUDIO"] to request audio output, and configure generationConfig.speechConfig.voiceConfig.prebuiltVoiceConfig.voiceName to choose a prebuilt voice for generated speech.Gemini-native request fields
| Field | Type | Required | Description |
|---|---|---|---|
generationConfig.responseModalities | array | Yes | Requested response modalities, e.g. ["AUDIO"]. |
generationConfig.speechConfig | object | No | Speech configuration object. |
generationConfig.speechConfig.voiceConfig | object | No | Voice configuration wrapper. |
generationConfig.speechConfig.voiceConfig.prebuiltVoiceConfig | object | No | Prebuilt voice settings. |
generationConfig.speechConfig.voiceConfig.prebuiltVoiceConfig.voiceName | string | No | Prebuilt voice preset name, e.g. Kore. |
Example: requesting speech audio
Response fields
The response follows the standardgenerateContent shape. When audio output is requested, the returned parts contain inline audio data instead of text:
Candidate responses returned by the model.
Token accounting, including
promptTokenCount, candidatesTokenCount, and totalTokenCount.Prompt blocking feedback when applicable.
Example response
200
Authorizations
Your DGrid API key. All endpoints use Authorization: Bearer <DGRID_API_KEY>.
Path Parameters
Target model ID, such as gemini-1.5-pro.
Body
application/json

