Skip to content

Audio

The Audio API covers Gemini-native audio understanding plus OpenAI-compatible speech synthesis, transcription, and translation endpoints.

Native Gemini Format

Use Gemini-compatible generateContent requests when you need multimodal audio understanding or generation with structured parts.

POST
https://api.dgrid.ai
POST/v1/models/{model}:generateContent
Authorization
Authorization: Bearer <DGRID_API_KEY>
Request
application/json
Response
200 · application/json

Path Parameters

ParameterTypeRequiredDescription
modelstringYesTarget model ID, such as gemini-1.5-pro.

Request Body

FieldTypeRequiredDescription
contentsarrayYesContent array used in the request body.
contents[]objectYesThe sample uses an empty object item inside contents.
generationConfigobjectYesGeneration configuration object used in the sample.
generationConfig.responseModalitiesarrayYesRequested response modalities array.
generationConfig.speechConfigobjectYesSpeech configuration object.
generationConfig.speechConfig.voiceConfigobjectYesVoice configuration wrapper.
generationConfig.speechConfig.voiceConfig.prebuiltVoiceConfigobjectYesPrebuilt voice settings.
generationConfig.speechConfig.voiceConfig.prebuiltVoiceConfig.voiceNamestringYesVoice preset name used in the sample.

Response Body

FieldTypeDescription
candidatesarrayCandidate responses returned by the model.
candidates[].contentobjectGenerated content object.
candidates[].content.rolestringRole returned in the generated content block.
candidates[].content.partsarrayReturned parts.
candidates[].finishReasonstringFinish reason string returned by the sample response.
candidates[].safetyRatingsarraySafety evaluation results.
usageMetadataobjectToken accounting.
usageMetadata.promptTokenCountintegerPrompt token count.
usageMetadata.candidatesTokenCountintegerOutput token count.
usageMetadata.totalTokenCountintegerTotal token count.
promptFeedbackobjectPrompt blocking feedback when applicable.

Text-to-Speech

Convert text into natural speech with the OpenAI-compatible audio speech interface.

POST
https://api.dgrid.ai
POST/v1/audio/speech
Authorization
Authorization: Bearer <DGRID_API_KEY>
Request
application/json
Response
200 · audio/mpeg

Request Body

FieldTypeRequiredDefaultDescription
modelstringYes-Voice model, such as tts-1 or tts-1-hd.
inputstringYes-Text to synthesize, up to 4096 characters.
voicestringYes-Voice preset, such as alloy, echo, fable, onyx, nova, or shimmer.
response_formatstringNomp3Output audio format.
speednumberNo1.0Speaking speed from 0.25 to 4.0.

Response

The endpoint returns a binary audio stream. Save the response body directly to a local file or cloud storage target.

Audio Transcriptions

Transcribe uploaded audio into text with the OpenAI-compatible Whisper-style interface.

POST
https://api.dgrid.ai
POST/v1/audio/transcriptions
Authorization
Authorization: Bearer <DGRID_API_KEY>
Request
multipart/form-data
Response
200 · application/json

Form Data

FieldTypeRequiredDefaultDescription
filefileYes-Audio file up to 25 MB.
modelstringYes-Model ID, such as whisper-1.
languagestringNo-ISO-639-1 language code, such as en, zh, or ko.
promptstringNo-Optional prompt for biasing the transcript.
response_formatstringNojsonjson, text, srt, verbose_json, or vtt.
temperaturenumberNo0Sampling temperature from 0 to 1.

Supported Formats

FLAC, MP3, MP4, MPEG, MPGA, M4A, OGG, WAV, and WebM.

Response Body

FieldTypeDescription
textstringTranscript text.

When response_format is set to verbose_json, the response also includes task, language, duration, and per-segment timing metadata.

Audio Translations

Translate uploaded audio into English with the OpenAI-compatible translation endpoint.

POST
https://api.dgrid.ai
POST/v1/audio/translations
Authorization
Authorization: Bearer <DGRID_API_KEY>
Request
multipart/form-data
Response
200 · application/json

Form Data

FieldTypeRequiredDefaultDescription
filefileYes-Source audio file.
modelstringYes-Model ID, such as whisper-1.
promptstringNo-Optional English prompt.
response_formatstringNojsonjson, text, srt, verbose_json, or vtt.
temperaturenumberNo0Sampling temperature from 0 to 1.

Response Body

FieldTypeDescription
textstringEnglish translation of the uploaded audio.