Voice Input

CopilotKit provides built-in voice transcription support, allowing users to speak their messages instead of typing. Audio is recorded in the browser, sent to your backend, and transcribed using services like OpenAI Whisper.

Overview

Voice input in CopilotKit:
  • Browser Recording - Captures audio using the Web Audio API
  • Automatic Transcription - Converts speech to text via transcription services
  • Seamless Integration - Transcribed text appears in the chat input automatically
  • Multiple Providers - Use OpenAI Whisper or implement custom transcription services

Quick Start

1. Install Dependencies

npm
npm install @copilotkit/voice openai
yarn
yarn add @copilotkit/voice openai
pnpm
pnpm add @copilotkit/voice openai

2. Configure Backend

Add the transcription service to your runtime:
import { CopilotRuntime } from "@copilotkit/runtime";
import { TranscriptionServiceOpenAI } from "@copilotkit/voice";
import OpenAI from "openai";

const runtime = new CopilotRuntime({
  agents: { 
    default: myAgent 
  },
  transcriptionService: new TranscriptionServiceOpenAI({
    openai: new OpenAI({ 
      apiKey: process.env.OPENAI_API_KEY 
    })
  })
});

3. Use in Frontend

The chat component automatically shows a microphone button when transcription is configured:
import { CopilotChat, CopilotKitProvider } from "@copilotkit/react";

function MyApp() {
  return (
    <CopilotKitProvider runtimeUrl="/api/copilotkit">
      <CopilotChat />
    </CopilotKitProvider>
  );
}
That’s it! Users can now click the microphone icon to record voice messages.

OpenAI Whisper Configuration

The TranscriptionServiceOpenAI class provides full control over Whisper’s behavior:
new TranscriptionServiceOpenAI({
  openai: new OpenAI({ apiKey: "..." }),  // Required: OpenAI client
  model: "whisper-1",                     // Optional: Model selection (default: "whisper-1")
  language: "en",                         // Optional: ISO-639-1 language code
  prompt: "Technical discussion context", // Optional: Context for better accuracy
  temperature: 0                          // Optional: Sampling temperature (0-1)
})

Configuration Options

| Option | Type | Description |
| --- | --- | --- |
| openai | OpenAI | Required. OpenAI client instance with API key |
| model | string | Whisper model to use. Default: "whisper-1" |
| language | string | Audio language in ISO-639-1 format (e.g., "en", "es", "fr"). Improves accuracy and latency |
| prompt | string | Optional text to guide transcription style or provide context. Should match audio language |
| temperature | number | Sampling temperature between 0 and 1. Lower = more deterministic, higher = more creative. Default: 0 |

Language Support

Specify the language to improve accuracy:
new TranscriptionServiceOpenAI({
  openai: new OpenAI({ apiKey: process.env.OPENAI_API_KEY }),
  language: "es",  // Spanish
  prompt: "Conversación técnica sobre desarrollo de software"
})
Supported languages include:
  • en - English
  • es - Spanish
  • fr - French
  • de - German
  • ja - Japanese
  • zh - Chinese
  • And 50+ more languages
See OpenAI’s language support for the complete list.
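If your application stores a full locale tag such as "es-MX" rather than a bare language code, it needs to be reduced to the two-letter form before being passed as language. The following is a minimal sketch using the standard Intl.Locale API; the helper name languageFromLocale and the fallback behavior are assumptions for illustration, not part of CopilotKit.

```typescript
// Hypothetical helper (not part of CopilotKit): reduce a BCP 47 locale tag
// such as "es-MX" to the two-letter code the `language` option expects.
function languageFromLocale(locale: string, fallback = "en"): string {
  try {
    const { language } = new Intl.Locale(locale);
    // Only accept two-letter subtags; longer ones (e.g. "yue") fall back.
    return language.length === 2 ? language : fallback;
  } catch {
    // Intl.Locale throws a RangeError for malformed tags.
    return fallback;
  }
}

console.log(languageFromLocale("es-MX"));
```

The result can then be passed as the language option when constructing TranscriptionServiceOpenAI. Whether a given two-letter code is actually supported still has to be checked against OpenAI's list.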

Context Prompts

Use the prompt option to improve accuracy for domain-specific terminology:
// Medical transcription
new TranscriptionServiceOpenAI({
  openai: new OpenAI({ apiKey: process.env.OPENAI_API_KEY }),
  language: "en",
  prompt: "Medical consultation about cardiovascular health, medications, and treatment plans."
})

// Technical discussion
new TranscriptionServiceOpenAI({
  openai: new OpenAI({ apiKey: process.env.OPENAI_API_KEY }),
  language: "en",
  prompt: "Software development discussion about React, TypeScript, and API design."
})

Audio Recording

CopilotKit uses the browser’s Web Audio API to capture audio.

Supported Audio Formats

The following MIME types are supported:
  • audio/webm (default in most browsers)
  • audio/mp3 / audio/mpeg
  • audio/mp4
  • audio/wav
  • audio/ogg
  • audio/flac
  • audio/aac
The browser automatically selects the best available format.
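To illustrate what format selection can look like, here is a small sketch that walks a preference list and returns the first type the environment reports as supported. This is not CopilotKit's internal logic; the function name pickMimeType and the order of the candidate list are assumptions. The support check is passed in as a function so the same code works with the browser's MediaRecorder.isTypeSupported.

```typescript
// Candidate list and its order are assumptions chosen for illustration.
const PREFERRED_MIME_TYPES: readonly string[] = [
  "audio/webm",
  "audio/mp4",
  "audio/ogg",
  "audio/wav",
];

// Returns the first candidate the support check accepts, or undefined if none.
function pickMimeType(
  isSupported: (mimeType: string) => boolean,
  candidates: readonly string[] = PREFERRED_MIME_TYPES,
): string | undefined {
  return candidates.find((type) => isSupported(type));
}

// In a browser you would pass the real check:
//   pickMimeType((type) => MediaRecorder.isTypeSupported(type));
console.log(pickMimeType((type) => type === "audio/mp4"));
```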

Recording Visualization

CopilotKit includes a built-in audio visualizer that displays a waveform during recording:
import { CopilotChatAudioRecorder } from "@copilotkit/react";

function CustomInput() {
  return (
    <CopilotChatAudioRecorder 
      inputClass="cpk:h-11 cpk:w-full cpk:px-5"
      inputShowControls={false}
    />
  );
}
The visualizer renders a canvas with animated bars that respond to audio levels:
// Waveform configuration (internal)
const config = {
  barWidth: 2,
  minHeight: 2,
  maxHeight: 20,
  gap: 2,
  numSamples: Math.ceil(canvasWidth / (barWidth + gap))
};
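As a worked example of the configuration above, the sketch below computes how many bars fit on a canvas of a given width and how tall a bar might be for a given audio level. The numSamples formula is taken from the config shown; the linear mapping in barHeight is an assumption about how levels could be scaled, since the internal rendering code is not shown here.

```typescript
interface WaveformConfig {
  barWidth: number;
  gap: number;
  minHeight: number;
  maxHeight: number;
}

const waveform: WaveformConfig = { barWidth: 2, gap: 2, minHeight: 2, maxHeight: 20 };

// Same formula as the config above: one bar per (barWidth + gap) pixels.
function numSamples(canvasWidth: number, cfg: WaveformConfig): number {
  return Math.ceil(canvasWidth / (cfg.barWidth + cfg.gap));
}

// Assumed mapping: a level in [0, 1] scaled linearly between min and max height.
function barHeight(level: number, cfg: WaveformConfig): number {
  const clamped = Math.min(1, Math.max(0, level));
  return cfg.minHeight + clamped * (cfg.maxHeight - cfg.minHeight);
}

console.log(numSamples(300, waveform)); // 300 / (2 + 2) = 75 bars
console.log(barHeight(0.5, waveform)); // 2 + 0.5 * 18 = 11
```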

Custom Styling

Style the audio recorder component:
/* Override waveform color */
[data-copilotkit] .copilot-chat-audio-recorder canvas {
  color: var(--primary);  /* Inherits text color */
}

/* Custom container styles */
.copilot-chat-audio-recorder {
  border-radius: 0.5rem;
  background: var(--background);
  padding: 0.5rem;
}

Custom Transcription Service

Implement your own transcription service for alternative providers:
import { 
  TranscriptionService,
  TranscribeFileOptions 
} from "@copilotkit/runtime";

class CustomTranscriptionService extends TranscriptionService {
  private apiKey: string;
  
  constructor(apiKey: string) {
    super();
    this.apiKey = apiKey;
  }
  
  async transcribeFile(options: TranscribeFileOptions): Promise<string> {
    const { audioFile, mimeType } = options;
    
    // Convert File to buffer/blob for your API
    const arrayBuffer = await audioFile.arrayBuffer();
    
    // Call your transcription API
    const response = await fetch("https://your-api.com/transcribe", {
      method: "POST",
      headers: {
        "Authorization": `Bearer ${this.apiKey}`,
        "Content-Type": mimeType
      },
      body: arrayBuffer
    });
    
    if (!response.ok) {
      throw new Error(`Transcription failed: ${response.statusText}`);
    }
    
    const result = await response.json();
    return result.text;
  }
}

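// Use custom service (fail fast if the API key is missing)
const apiKey = process.env.CUSTOM_API_KEY;
if (!apiKey) {
  throw new Error("CUSTOM_API_KEY is not set");
}

const runtime = new CopilotRuntime({
  agents: { default: myAgent },
  transcriptionService: new CustomTranscriptionService(apiKey)
});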

TranscribeFileOptions

The transcribeFile method receives:
interface TranscribeFileOptions {
  audioFile: File;    // Audio file from the browser
  mimeType: string;   // MIME type (e.g., "audio/webm")
  size: number;       // File size in bytes
}

Request Handling

CopilotKit handles transcription requests in two modes:

REST Mode (Multipart Form Data)

POST /api/copilotkit/transcribe
Content-Type: multipart/form-data

audio: [File]

Single Endpoint Mode (JSON with Base64)

POST /api/copilotkit
Content-Type: application/json

{
  "audio": "base64-encoded-audio-data",
  "mimeType": "audio/webm",
  "filename": "recording.webm"
}
Both modes are handled automatically by the runtime.
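For clients that build the single-endpoint request by hand, the sketch below shows one way to turn recorded bytes into the three JSON fields shown above. The helper names are hypothetical, and the real request envelope may carry additional fields that this page does not document; only audio, mimeType, and filename come from the example above.

```typescript
interface TranscribePayload {
  audio: string; // base64-encoded audio data
  mimeType: string;
  filename: string;
}

// Encode raw bytes as base64 using the standard btoa function.
function bytesToBase64(bytes: Uint8Array): string {
  let binary = "";
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}

// Hypothetical helper: derive a filename extension from the MIME subtype,
// ignoring parameters such as ";codecs=opus".
function buildTranscribePayload(bytes: Uint8Array, mimeType: string): TranscribePayload {
  const extension = mimeType.split("/")[1]?.split(";")[0] ?? "bin";
  return {
    audio: bytesToBase64(bytes),
    mimeType,
    filename: `recording.${extension}`,
  };
}

const payload = buildTranscribePayload(new Uint8Array([104, 105]), "audio/webm");
console.log(JSON.stringify(payload));
```

Base64 adds roughly a third to the payload size, which is worth keeping in mind for longer recordings.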

Error Handling

CopilotKit provides detailed error responses:
| Error Code | HTTP Status | Description |
| --- | --- | --- |
| SERVICE_NOT_CONFIGURED | 503 | No transcription service configured |
| INVALID_AUDIO_FORMAT | 400 | Unsupported audio format |
| AUDIO_TOO_LONG | 400 | Audio file exceeds duration limit |
| AUDIO_TOO_SHORT | 400 | Audio file too short to transcribe |
| RATE_LIMITED | 429 | Too many transcription requests |
| AUTH_FAILED | 401 | Invalid API credentials |
| PROVIDER_ERROR | 500 | Transcription service error |
| NETWORK_ERROR | 502 | Network connectivity issue |
| INVALID_REQUEST | 400 | Malformed request |
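On the client, these codes can be translated into messages that make sense to end users. The sketch below is one possible mapping. The message wording, the choice of which codes count as retryable, and the function name describeTranscriptionError are all assumptions. How the code is extracted from the response body is not covered here, so the function simply takes the code as a string.

```typescript
// User-facing wording is an assumption; the codes come from the table above.
const USER_MESSAGES = new Map<string, string>([
  ["SERVICE_NOT_CONFIGURED", "Voice input is not available right now."],
  ["INVALID_AUDIO_FORMAT", "This audio format is not supported."],
  ["AUDIO_TOO_LONG", "That recording is too long. Try a shorter message."],
  ["AUDIO_TOO_SHORT", "That recording is too short. Try speaking a bit longer."],
  ["RATE_LIMITED", "Too many voice messages. Please wait a moment."],
  ["AUTH_FAILED", "Voice input is misconfigured. Please contact support."],
  ["PROVIDER_ERROR", "The transcription service had a problem. Please try again."],
  ["NETWORK_ERROR", "Network problem. Check your connection and try again."],
  ["INVALID_REQUEST", "The recording could not be sent. Please try again."],
]);

// Assumed classification: only transient failures are worth retrying.
const RETRYABLE = new Set<string>(["RATE_LIMITED", "PROVIDER_ERROR", "NETWORK_ERROR"]);

function describeTranscriptionError(code: string): { message: string; retryable: boolean } {
  return {
    message: USER_MESSAGES.get(code) ?? "Something went wrong with voice input.",
    retryable: RETRYABLE.has(code),
  };
}

console.log(describeTranscriptionError("RATE_LIMITED"));
```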

Custom Error Handling

Implement error handling in your custom service:
class CustomTranscriptionService extends TranscriptionService {
  async transcribeFile(options: TranscribeFileOptions): Promise<string> {
    try {
      // Validate file size
      if (options.size > 25 * 1024 * 1024) {
        throw new Error("Audio file too large (max 25MB)");
      }
      
      // Validate format
      if (!options.mimeType.startsWith("audio/")) {
        throw new Error("Invalid audio format");
      }
      
      // Transcribe (callAPI is a placeholder for your own provider call)
      const text = await this.callAPI(options);
      
      // Validate result
      if (!text || text.trim().length === 0) {
        throw new Error("Transcription returned empty result");
      }
      
      return text;
    } catch (error) {
      console.error("Transcription error:", error);
      throw error;
    }
  }
}

Advanced Configuration

Rate Limiting

Implement rate limiting for transcription requests:
import { BeforeRequestMiddlewareFn } from "@copilotkit/runtime";

const transcriptionRateLimiter = new Map<string, number[]>();

const rateLimitMiddleware: BeforeRequestMiddlewareFn = async ({ 
  request, 
  path 
}) => {
  if (path !== "/api/copilotkit/transcribe") return request;
  
  const userId = request.headers.get("X-User-Id") || "anonymous";
  const now = Date.now();
  const userRequests = transcriptionRateLimiter.get(userId) || [];
  
  // Remove requests older than 1 minute
  const recentRequests = userRequests.filter(t => now - t < 60000);
  
  // Allow max 10 transcription requests per minute
  if (recentRequests.length >= 10) {
    throw new Response("Rate limit exceeded for transcription", { 
      status: 429 
    });
  }
  
  recentRequests.push(now);
  transcriptionRateLimiter.set(userId, recentRequests);
  
  return request;
};

const runtime = new CopilotRuntime({
  agents: { default: myAgent },
  transcriptionService: transcriptionService,
  beforeRequestMiddleware: rateLimitMiddleware
});

Caching Transcriptions

Cache transcriptions to reduce API costs:
class CachedTranscriptionService extends TranscriptionService {
  private cache = new Map<string, string>();
  private baseService: TranscriptionService;
  
  constructor(baseService: TranscriptionService) {
    super();
    this.baseService = baseService;
  }
  
  async transcribeFile(options: TranscribeFileOptions): Promise<string> {
    // Generate cache key from audio file hash
    const arrayBuffer = await options.audioFile.arrayBuffer();
    const hashBuffer = await crypto.subtle.digest("SHA-256", arrayBuffer);
    const hashArray = Array.from(new Uint8Array(hashBuffer));
    const cacheKey = hashArray.map(b => b.toString(16).padStart(2, "0")).join("");
    
    // Check cache
    const cached = this.cache.get(cacheKey);
    if (cached !== undefined) {
      console.log("Returning cached transcription");
      return cached;
    }
    
    // Transcribe and cache
    const text = await this.baseService.transcribeFile(options);
    this.cache.set(cacheKey, text);
    
    return text;
  }
}

// Use cached service
const runtime = new CopilotRuntime({
  agents: { default: myAgent },
  transcriptionService: new CachedTranscriptionService(
    new TranscriptionServiceOpenAI({ 
      openai: new OpenAI({ apiKey: process.env.OPENAI_API_KEY }) 
    })
  )
});
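The plain Map used in the caching example grows without limit in a long-running server. One way to cap it is a small least-recently-used cache, sketched below. It relies on the fact that a JavaScript Map iterates keys in insertion order. The class name BoundedCache and the eviction policy are assumptions, not something CopilotKit provides.

```typescript
// Minimal LRU cache sketch: the Map's insertion order doubles as recency order.
class BoundedCache<V> {
  private entries = new Map<string, V>();
  private maxEntries: number;

  constructor(maxEntries: number) {
    this.maxEntries = maxEntries;
  }

  get(key: string): V | undefined {
    const value = this.entries.get(key);
    if (value === undefined) return undefined;
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.maxEntries) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}

const cache = new BoundedCache<string>(2);
cache.set("a", "first");
cache.set("b", "second");
cache.get("a"); // "a" is now the most recently used
cache.set("c", "third"); // evicts "b"
console.log(cache.get("b")); // undefined
```

In CachedTranscriptionService, the private cache field could be swapped for an instance of this class, since both expose get and set with the same shape.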

Logging and Analytics

Track transcription usage:
class AnalyticsTranscriptionService extends TranscriptionService {
  private baseService: TranscriptionService;
  
  constructor(baseService: TranscriptionService) {
    super();
    this.baseService = baseService;
  }
  
  async transcribeFile(options: TranscribeFileOptions): Promise<string> {
    const startTime = Date.now();
    
    try {
      const text = await this.baseService.transcribeFile(options);
      
      // Log success metrics (analytics is a placeholder for your own tracking client)
      await analytics.track({
        event: "transcription_completed",
        properties: {
          duration: Date.now() - startTime,
          audioSize: options.size,
          audioFormat: options.mimeType,
          transcriptionLength: text.length,
          success: true
        }
      });
      
      return text;
    } catch (error) {
      // Log failure metrics
      await analytics.track({
        event: "transcription_failed",
        properties: {
          duration: Date.now() - startTime,
          audioSize: options.size,
          audioFormat: options.mimeType,
          error: error instanceof Error ? error.message : String(error),
          success: false
        }
      });
      
      throw error;
    }
  }
}

Best Practices

Specify Language

Always specify the language parameter when you know the audio language; it improves both accuracy and latency.

Provide Context

Use the prompt parameter to supply domain-specific vocabulary and improve transcription quality.

Handle Errors Gracefully

Implement proper error handling and show user-friendly error messages.

Rate Limit Requests

Implement rate limiting to prevent abuse and control API costs.

Testing Voice Input

Test transcription in development:
// Mock transcription service for testing
class MockTranscriptionService extends TranscriptionService {
  async transcribeFile(options: TranscribeFileOptions): Promise<string> {
    // Return mock transcription
    return "This is a mock transcription for testing purposes.";
  }
}

// Use in development
const runtime = new CopilotRuntime({
  agents: { default: myAgent },
  transcriptionService: process.env.NODE_ENV === "production"
    ? new TranscriptionServiceOpenAI({ openai })
    : new MockTranscriptionService()
});

OpenAI Whisper API

Official OpenAI Whisper documentation

Runtime Middleware

Implement authentication and rate limiting for transcription