
What It Is

Instead of answering static questions, candidates are given a goal, context, and a deliverable, then scored on how they prompt and guide the AI to get there.

Where to find it: Create Interviewer → Interview Stages → LLM Assessment (Beta)

How It Works

When a candidate starts the assessment, they see:
  • A title describing the task
  • A goal explaining what they need to accomplish
  • A scenario with the context they are working within
  • A deliverable: the final output they need to produce
They then have a back-and-forth conversation with the AI to work toward that deliverable. The assessment ends when they reach the round limit or time limit. HeyMilo scores the conversation against an auto-generated rubric tailored to the task.
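The end condition can be pictured as a simple check. The snippet below is only an illustrative sketch: `SessionLimits` and `assessment_ended` are hypothetical names invented for this example, not part of any HeyMilo API, and it assumes a round is counted once the AI has replied to a candidate message.

```python
from dataclasses import dataclass


@dataclass
class SessionLimits:
    """Hypothetical container for the two limits that end an assessment."""
    max_rounds: int            # one round = one candidate message plus one AI reply
    time_limit_minutes: float  # total time allowed for the conversation


def assessment_ended(completed_rounds: int, elapsed_minutes: float,
                     limits: SessionLimits) -> bool:
    """Return True once either the round limit or the time limit is reached."""
    return (
        completed_rounds >= limits.max_rounds
        or elapsed_minutes >= limits.time_limit_minutes
    )
```

Whichever limit is hit first ends the conversation, so a candidate who spends many rounds on small clarifications can run out of rounds well before the clock does.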

Candidate Modes

Mode | What the Candidate Does
Standard | Candidate works collaboratively with AI to complete a task, prompting, iterating, and producing a deliverable
Test AI Error Detection | A weaker AI is used that may produce mistakes. Candidate is scored on identifying and correcting them. Best for AI/ML, safety, and prompt-engineering roles

Setting It Up

When adding an LLM Assessment stage, you’ll see the following setup options:
  • Task background: describe the task the candidate will work through. The more specific you are, the more targeted the assessment. HeyMilo uses this (plus the job description) to generate the scenario, deliverable, and rubric.
  • Assessment style: Standard mode has the candidate work collaboratively with the AI. Toggle on Test AI error detection to switch to a weaker AI that may make mistakes; the candidate is then scored on identifying and correcting them. Best for AI/ML, safety, and prompt-engineering roles. Note: this can’t be changed after the scenario is generated.
  • Conversation behavior:
    • AI sends a greeting first: toggle off if you want the candidate to make the first move.
    • AI invents missing details: when on, the AI fills in gaps like customer names or dates to keep the scenario flowing.
  • Time limit: recommended 20 to 30 minutes for screening.
  • Back-and-forth rounds: one round equals one candidate message plus one AI reply.
  • Allow retake on timeout: if the candidate runs out of time before sending much, they can restart once.
Once configured, click Save & generate scenario. HeyMilo auto-generates the scenario title, goal, scenario description, deliverable, and scoring rubric. You can regenerate or edit any of these before activating.
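If you plan configurations before entering them in the setup screen, it can help to write the options down as a small record. The sketch below is purely illustrative: `LLMAssessmentSetup` and `review_setup` are hypothetical names, the fields simply mirror the options listed above, and nothing here corresponds to a documented HeyMilo API or to the product's own validation rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LLMAssessmentSetup:
    """Hypothetical mirror of the setup options; not a HeyMilo API object."""
    task_background: str              # description of the task
    test_ai_error_detection: bool     # False = Standard mode
    ai_sends_greeting_first: bool     # False = candidate makes the first move
    ai_invents_missing_details: bool  # AI fills gaps such as names or dates
    time_limit_minutes: int
    max_rounds: int                   # one round = candidate message + AI reply
    allow_retake_on_timeout: bool


def review_setup(setup: LLMAssessmentSetup) -> list[str]:
    """Return plain-language warnings for a planned configuration."""
    warnings = []
    if not setup.task_background.strip():
        warnings.append("Task background is empty; a specific description "
                        "gives a more targeted assessment.")
    if setup.time_limit_minutes <= 0:
        warnings.append("Time limit must be greater than zero.")
    elif not 20 <= setup.time_limit_minutes <= 30:
        warnings.append("Time limit is outside the 20 to 30 minutes "
                        "recommended for screening.")
    if setup.max_rounds < 1:
        warnings.append("At least one back-and-forth round is needed.")
    return warnings
```

Because the assessment style can’t be changed after the scenario is generated, settling `test_ai_error_detection` in a planning record like this before clicking Save & generate scenario may save a rebuild.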

Scoring

Each completed assessment produces:
  • An overall score out of 4.0
  • Dimension scores across criteria like:
    • Task Decomposition
    • Instruction Clarity and Constraint Control
    • Error Recovery and Iteration
    • Difficulty
    • Realism
    • Safety
  • An overall narrative summarizing how the candidate performed
  • A full chat transcript with evidence quotes tied to each dimension
Scores appear on the candidate profile under the LLM Assessment tab.
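For teams that copy results into their own notes or spreadsheets, the output described above can be thought of as a record like the one below. This is a sketch under stated assumptions: `AssessmentResult` and `weakest_dimensions` are hypothetical names, HeyMilo’s actual data format is not described here, and the sketch makes no claim about how the overall score is derived from the dimension scores.

```python
from dataclasses import dataclass, field

MAX_OVERALL_SCORE = 4.0  # the overall score is reported out of 4.0


@dataclass
class AssessmentResult:
    """Hypothetical record of one completed assessment."""
    overall_score: float
    dimension_scores: dict[str, float]  # e.g. "Task Decomposition" -> score
    narrative: str                      # overall summary of the performance
    # Evidence quotes from the transcript, keyed by dimension name.
    evidence_quotes: dict[str, list[str]] = field(default_factory=dict)


def weakest_dimensions(result: AssessmentResult, count: int = 2) -> list[str]:
    """Return the names of the lowest-scoring dimensions, lowest first."""
    if result.overall_score > MAX_OVERALL_SCORE:
        raise ValueError("overall score cannot exceed 4.0")
    ranked = sorted(result.dimension_scores.items(), key=lambda item: item[1])
    return [name for name, _ in ranked[:count]]
```

Looking at the lowest dimensions first, together with their evidence quotes, is one way to focus a follow-up conversation with the candidate.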

Best Use Cases

LLM Assessment works well for any role where candidates need to use AI to produce a real deliverable:
  • Data annotation and labeling
  • Customer support and escalation protocols
  • Content and copy production
  • Code generation and technical problem-solving
  • Research and analysis tasks

See It in Action

Real scored transcripts show how candidates approach an LLM Assessment, including dimension scores, the conversation, and what a strong vs. weak run looks like.
  • Example: Labeling Assistant
  • Example: AI Data Annotator (Find AI Errors)