Building a Real-Time AI Interviewer with LangGraph and WebSockets

I'm a Senior Frontend Developer. I write about Web, Next, React, JavaScript, etc

This post details the architecture and implementation of an InterviewAI platform. Supporting low-latency voice streaming, autonomous code evaluation, and strict state persistence requires moving beyond linear LLM calls. I solved this by integrating LangGraph for stateful orchestration, a custom Node.js WebSocket gateway for real-time media processing, AssemblyAI for Speech-to-Text (STT), and EdgeTTS for Text-to-Speech (TTS).


1. System Architecture Overview

The architecture utilizes a WebSocket gateway to coordinate asynchronous streams between the client, the AI workflow (LangGraph), and external media APIs.


2. The WebSocket Gateway: Connection and Authorization

Because regular HTTP requests cannot handle bi-directional audio streams efficiently, the primary communication channel is WebSockets. A major challenge with WebSockets is securely passing authentication without relying on ambient cookies that may face cross-origin restrictions.

Ticket-Based Authentication

To secure the connection, we use a ticket-based handshake.

  1. The client makes a standard authenticated HTTP GET request to /api/v1/auth/ticket.

  2. The server generates a short-lived (e.g., 60 seconds) cryptographically signed token and returns it.

  3. The client connects to the WebSocket URL, appending the ticket as a query parameter (or passing it in the Sec-WebSocket-Protocol header): wss://api.example.com?ticket=abc123

  4. The WebSocket upgrade handler verifies the ticket and establishes the session.
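
The handshake above can be sketched roughly as follows. This is a minimal illustration rather than the platform's actual code: it assumes an HMAC-SHA256 signature (the post only says the token is "cryptographically signed"), and issueTicket, verifyTicket, TICKET_SECRET, and the ticket layout are hypothetical names chosen for the example.

```typescript
import { createHmac, randomBytes, timingSafeEqual } from "node:crypto";

// Hypothetical secret; a real deployment would load this from configuration.
const TICKET_SECRET = randomBytes(32);
const TICKET_TTL_MS = 60000; // the 60-second lifetime mentioned in step 2

function sign(payload: string): string {
  return createHmac("sha256", TICKET_SECRET).update(payload).digest("base64url");
}

// Issue a ticket shaped like "<base64url(userId:expiresAt)>.<signature>".
function issueTicket(userId: string, now: number = Date.now()): string {
  const payload = Buffer.from(`${userId}:${now + TICKET_TTL_MS}`).toString("base64url");
  return `${payload}.${sign(payload)}`;
}

// Return the user id if the ticket is authentic and unexpired, otherwise null.
function verifyTicket(ticket: string, now: number = Date.now()): string | null {
  const [payload, signature] = ticket.split(".");
  if (!payload || !signature) return null;

  const expected = Buffer.from(sign(payload));
  const received = Buffer.from(signature);
  // timingSafeEqual throws on a length mismatch, so compare lengths first.
  if (expected.length !== received.length || !timingSafeEqual(expected, received)) {
    return null;
  }

  const decoded = Buffer.from(payload, "base64url").toString("utf8");
  const separator = decoded.lastIndexOf(":");
  const userId = decoded.slice(0, separator);
  const expiresAt = Number(decoded.slice(separator + 1));
  return Number.isFinite(expiresAt) && now < expiresAt ? userId : null;
}
```

In this sketch the upgrade handler would read the ticket from the query string and call verifyTicket before accepting the socket. It does not prevent a ticket from being replayed within its lifetime; making tickets single-use would need server-side bookkeeping that is outside the scope of this example.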

Once connected, the frontend captures audio using the browser's MediaRecorder API, resamples it to 16kHz mono (the optimal format for fast STT processing), and pushes base64-encoded chunks over the socket every 250ms.


3. Real-Time STT and Silence Gating

Audio chunks are immediately proxied to AssemblyAI's streaming endpoint. The STT engine returns two types of events:

  • Partial Transcripts: Used strictly to update the UI so the user sees real-time feedback.

  • Final Transcripts: Sent when the engine detects a break in speech.

Implementing the Silence Gate

Relying solely on the STT engine's "end of turn" signal is insufficient for interviews, as candidates frequently pause to think. If the AI responds immediately after a short breath, it interrupts the candidate. I implemented a software "Silence Gate" using a debounced timer.

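// socket.ts tracking logic
// `rt` is the streaming STT client created earlier in this file
let lastNonFinalTurnAt = 0;
let pendingFinalTimer: NodeJS.Timeout | null = null;
let pendingText = ""; // final segments collected since the last hand-off
const USER_SILENCE_WINDOW_MS = 6000;

// Every time a partial transcript arrives, we record the timestamp and cancel
// any pending hand-off, because the candidate has started speaking again
rt.on("transcript.partial", () => {
   lastNonFinalTurnAt = Date.now();
   if (pendingFinalTimer) {
      clearTimeout(pendingFinalTimer);
      pendingFinalTimer = null;
   }
});

// When a final transcript arrives, we don't act immediately
rt.on("transcript.final", (payload) => {
   // Keep earlier segments so a mid-answer pause does not drop text
   pendingText = `${pendingText} ${payload.text}`.trim();

   const silenceMs = Date.now() - lastNonFinalTurnAt;
   const remainingMs = Math.max(0, USER_SILENCE_WINDOW_MS - silenceMs);

   // Clear any existing timer
   if (pendingFinalTimer) clearTimeout(pendingFinalTimer);

   // Debounce: Only invoke the graph after definitive silence
   pendingFinalTimer = setTimeout(async () => {
       const text = pendingText;
       pendingText = "";
       pendingFinalTimer = null;
       await executeLangGraphStep(text);
   }, remainingMs);
});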

This gives the candidate a full 6 seconds of silence, measured from the last partial transcript, before the system locks their microphone and hands control over to the AI.
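
To make the timer arithmetic concrete, here is a small worked example. The helper name remainingSilenceMs is hypothetical; it simply isolates the calculation that the final-transcript handler performs.

```typescript
const USER_SILENCE_WINDOW_MS = 6000;

// How much longer to wait after a final transcript, given when the last
// partial transcript was seen. The result is never negative.
function remainingSilenceMs(lastPartialAt: number, finalAt: number): number {
  const alreadySilentMs = finalAt - lastPartialAt;
  return Math.max(0, USER_SILENCE_WINDOW_MS - alreadySilentMs);
}

// 2.5 s have already passed since the last partial, so wait 3.5 s more.
const shortPause = remainingSilenceMs(10000, 12500); // 3500

// 7 s have already passed, so the window is over and the graph runs at once.
const longPause = remainingSilenceMs(10000, 17000); // 0
```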


4. LangGraph Orchestration and State Preservation

LangGraph is responsible for the system's logic routing and state retention.

Persistent State Schema

Every interview thread has a state containing message history, question counters, and phase flags. This state is backed by MongoDBSaver.

// graph.ts schema definition
import { StateSchema, MessagesValue } from "@langchain/langgraph";
import { z } from "zod";

const InterviewState = new StateSchema({
  messages: MessagesValue,
  resume: z.string().optional().default(""),
  interviewType: z.string().default("technical"),
  difficultyLevel: z.string().default("intermediate"),
  questionCount: z.number().default(0),
  maxQuestions: z.number().default(5),
  isFinished: z.boolean().default(false),
  isCodingMode: z.boolean().default(false),
  // ...other state fields
});

Because MongoDBSaver checkpoints the state automatically, a WebSocket reconnect (for example, after a network drop) can pull the correct state via the thread_id:

const state = await graphApp.getState({ configurable: { thread_id: threadId } });
// state.values contains the exact snapshot of the interview

Graph Routing

The graph defines nodes (functions that act on the state) and edges (logic that decides which node runs next).

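import { StateGraph, START, END } from "@langchain/langgraph";
import { ToolNode } from "@langchain/langgraph/prebuilt";

const workflow = new StateGraph(InterviewState)
  .addNode("technical", technicalNode)
  .addNode("tools", new ToolNode(availableTools))
  .addNode("question_counter", questionCounterNode)

  // Every turn starts at the interviewer node
  .addEdge(START, "technical")
  // shouldContinue returns "tools" or "question_counter"
  .addConditionalEdges("technical", shouldContinue, {
    tools: "tools",
    question_counter: "question_counter",
  })
  // Tool output always goes back to the LLM for interpretation
  .addEdge("tools", "technical")
  .addEdge("question_counter", END);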

Notice the loop: technical -> tools -> technical. If the LLM utilizes a tool (e.g., executing Python code), the graph moves to tools, executes the system script, and routes the output back to the technical node so the LLM can interpret the output before speaking.
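
The routing function shouldContinue is not shown in this post. A minimal sketch of what it might look like is below, assuming it only needs to check whether the latest AI message requested a tool. The ChatMessage and RoutingState types are simplified stand-ins written for this example, not the real LangChain classes (which do expose requested tools on a tool_calls array).

```typescript
// Simplified stand-ins for the message and state shapes.
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

interface ChatMessage {
  role: "human" | "ai" | "tool";
  content: string;
  tool_calls?: ToolCall[];
}

interface RoutingState {
  messages: ChatMessage[];
}

type Route = "tools" | "question_counter";

// Route to the tool node only when the latest AI message asked for a tool;
// otherwise finish the turn by updating the question counter.
function shouldContinue(state: RoutingState): Route {
  const last = state.messages[state.messages.length - 1];
  const wantsTool = last?.role === "ai" && (last.tool_calls?.length ?? 0) > 0;
  return wantsTool ? "tools" : "question_counter";
}
```

The returned strings match the keys of the mapping passed to addConditionalEdges, which is how the graph resolves them to node names.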


5. Agentic Tool Execution

When the AI interviewer requests code, the candidate submits it from the frontend editor over a dedicated WebSocket event (e.g., { type: "code_submission", code: "..." }).
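
Since this event arrives as untrusted JSON, the gateway would typically validate it before passing it to the graph. The sketch below is one way to do that; parseCodeSubmission and MAX_CODE_LENGTH are hypothetical names, and the size limit is an arbitrary value chosen for illustration.

```typescript
interface CodeSubmission {
  type: "code_submission";
  code: string;
}

// Hypothetical upper bound on submission size, in characters.
const MAX_CODE_LENGTH = 20000;

// Parse a raw WebSocket text frame and return a typed submission,
// or null if the frame is not a well-formed code_submission event.
function parseCodeSubmission(raw: string): CodeSubmission | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // not valid JSON
  }
  if (typeof data !== "object" || data === null) return null;

  const record = data as Record<string, unknown>;
  if (record.type !== "code_submission") return null;
  if (typeof record.code !== "string") return null;
  if (record.code.length > MAX_CODE_LENGTH) return null;

  return { type: "code_submission", code: record.code };
}
```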

Instead of building hard-coded evaluation logic, we defined a standard LangChain Tool:

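import { tool } from "@langchain/core/tools";

const codeEvaluatorTool = tool(
  async ({ code, language }) => {
    // Run the submission in the sandbox and report both output streams
    const result = await executeInSandbox(code, language);
    return `Output: ${result.stdout}\nErrors: ${result.stderr}`;
  },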
  },
  {
    name: "evaluate_code",
    description: "Executes the candidate's code and returns output.",
    schema: CodeEvalSchema,
  }
);

When the technicalNode sees the user's submission, it generates a tool_call. The conditional edge routing (shouldContinue) inspects the LLM's response, sees the tool call request, and hands control to the ToolNode. The LLM receives the runtime output and formats feedback automatically.


6. Safe State Classification

Extracting precise JSON commands alongside conversational text from LLMs is unreliable. Models often wrap JSON in markdown code blocks, add conversational filler, or drop brackets.

To ensure state variables (isNewQuestion, isCodingMode) are strictly typed, we utilize a two-pass classification system in the node logic.

// generate the raw conversation string
const assistantMessage = await llm.invoke([...messages]);
const assistantText = assistantMessage.content;

// run a smaller, faster structured-output request solely for flags
const meta = await invokeStructuredLLMWithFallback(
  TechnicalMetaSchema,
  [
    new SystemMessage("Analyze the following text and determine interview state..."),
    new HumanMessage(assistantText)
  ]
);

// Update the exact state safely
return {
  messages: [new AIMessage(assistantText)], // Passed to UI
  isCodingMode: meta.isCodingMode,          // Passed to Graph State
  isNewQuestion: meta.isNewQuestion,        // Passed to Graph State
};

This isolation means the typed flags in state.values come only from schema-validated output, so malformed or chatty text in the conversational response cannot corrupt them.


7. Sentence-Level TTS Pre-fetching

Minimizing latency on Text-to-Speech is the final hurdle. Generative responses take time, and TTS takes additional time. If an AI generates a multi-paragraph response, waiting for the entire block to synthesize before playing audio creates an unacceptable delay.

We solve this using a sentence-chunking pipeline on the frontend browser:

  1. Splitting text: Upon receiving the full text payload via WebSocket, a regular expression (/[^.!?]+[.!?]+(?=\s|$)|[^.!?]+$/g) divides it into an array of sentences.

  2. Concurrent Fetching: The frontend requests the audio file for Sentence[0] from the backend's TTS proxy endpoint immediately.

  3. Queue Execution: As soon as Sentence[0] begins playback in the HTMLAudioElement, a background task fetches the audio for Sentence[1].
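
The splitting step can be sketched as below. The helper name splitSentences is hypothetical; the pattern is the one from step 1, where the first alternative matches text ending in sentence punctuation and the second catches a trailing fragment with no closing punctuation.

```typescript
const SENTENCE_PATTERN = /[^.!?]+[.!?]+(?=\s|$)|[^.!?]+$/g;

// Split a response into trimmed, non-empty sentences for the TTS queue.
function splitSentences(text: string): string[] {
  const matches = text.match(SENTENCE_PATTERN) ?? [];
  return matches.map((sentence) => sentence.trim()).filter((sentence) => sentence.length > 0);
}

const sentences = splitSentences("Great answer! Now, what is a closure? Take your time");
// sentences: ["Great answer!", "Now, what is a closure?", "Take your time"]
```

A pattern this simple has limits: a period that is not followed by whitespace, as in a decimal number or a version string, can cause text to be split incorrectly or dropped, so a production splitter would likely need more care.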

// React Audio Queue Abstraction
const processSpeechQueue = async () => {
    if (isPlaying || speechQueue.length === 0) return;
    isPlaying = true;

    try {
        // Play current chunk
        const currentChunk = speechQueue.shift();
        if (currentChunk) await playAudio(currentChunk.url);
    } finally {
        // Release the lock even if playback fails, so it is never left held
        isPlaying = false;
    }

    // Once playback ends, move on to the next item (which is already pre-fetched)
    void processSpeechQueue();
};

This overlapping pipeline typically reduces "Time to First Word" to under 800ms, which makes the interaction feel close to seamless.

Important note

This is the architecture that I have used. I won't say it is the best or the most correct architecture, and I still have a lot to learn. In this post I have described what I have learned and used so far. I will keep improving the architecture as I continue building the InterviewAI platform.

Btw, do check out the platform and try to break it: InterviewAI


Conclusion

Combining LangGraph and WebSockets allows developers to build systems that better reflect real human interaction. By separating conversational generation from structural state management, the system becomes more modular and easier to scale. Incorporating dynamic, agentic tools enables more flexible and intelligent behavior during conversations. Additionally, latency optimizations such as debounced silence gates and TTS prefetching help improve responsiveness.

Together, these approaches make it possible to build voice interfaces that are highly responsive, resilient, and context-aware.