
New rules for evaluating AI hiring tools
October 10, 2025
It's conference season. In 2025, that means many talent leaders will be evaluating AI tools for hiring. If you're preparing to sit down with vendors, you'll probably hear the same familiar questions surface about training data, bias, and black boxes.
Those questions made sense a few years ago. They were shaped by real scandals and public missteps. But they don’t always fit the reality of how large language models (LLMs) are being used in hiring today.
The skepticism around AI in hiring didn't appear out of thin air; a series of very public failures gave people reason to be cautious.
Those failures left a lasting impression. Many buyers came away thinking of AI as unsafe, biased, and impossible to understand.
It’s worth remembering that these failures were built on machine learning approaches that were common at the time. Those models were trained on historical data, they operated like black boxes, and they often just reinforced the patterns of the past.
Large language models have brought us to a different place. Instead of ranking résumés or copying old decisions, they can understand natural language, probe reasoning, and evaluate responses to specific questions. Old machine learning tried to predict "fit." LLMs are better at analyzing individual answers to structured questions and then showing the evidence behind the score.
Models like GPT-4 or Claude were trained on a mix of licensed data, publicly available internet data, books, articles, and code. The important point is that they were not trained on a company's résumés or past hiring decisions, the data that undermined the earlier generation of tools.
The bigger issue that should concern buyers in 2025 is not the base training itself, but how each vendor uses the LLM in practice. The real questions should be: How is the model fine-tuned for this use case? How are prompts designed to evaluate skills instead of people? How are the outputs validated for fairness and consistency?
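To make those questions concrete, here is a minimal sketch of what "prompts designed to evaluate skills instead of people" and validated outputs can look like. It is purely illustrative: the names (SCORING_PROMPT, build_scoring_prompt, parse_score), the 1 to 5 scale, and the rubric format are assumptions, not any particular vendor's implementation.

```python
import json

# Hypothetical rubric-based prompt for scoring ONE response to ONE structured
# question. Everything here is illustrative, not a real vendor's pipeline.

SCORING_PROMPT = """You are scoring a single interview response.
Skill being assessed: {skill}
Question asked: {question}
Candidate response: {response}

Score the response from 1 to 5 against this rubric:
{rubric}

Return JSON only: {{"score": <1-5>, "evidence": "<direct quote from the response>",
"reasoning": "<why that quote supports the score>"}}
Base the score only on the response text. Ignore employer names, schools,
job titles, and anything else about the candidate's background."""

def build_scoring_prompt(skill, question, response, rubric):
    """Fill the template for one (skill, question, response) triple."""
    return SCORING_PROMPT.format(
        skill=skill, question=question, response=response, rubric=rubric
    )

def parse_score(raw_output):
    """Validate the model's output: reject anything missing a score or evidence."""
    result = json.loads(raw_output)
    if not 1 <= int(result["score"]) <= 5:
        raise ValueError("score out of range")
    if not result.get("evidence"):
        raise ValueError("no supporting evidence returned")
    return result
```

The key design choice in this sketch is that the prompt only ever sees the response text and the rubric, never the candidate's background, and any output without a score and a supporting quote is rejected.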
The safest and most compliant approach to hiring starts with a simple principle: assess individual skills directly. Do not infer them from résumés, job titles, or a sense of pedigree.
In practice, that means designing structured interviews: job-relevant questions mapped to specific skills, with each response scored on its own merits and backed by evidence.
(Yes, this one little section is the content marketing piece. Skip to the next section to avoid it.) Once you have this structure in place, AI interviewing makes it possible to do at scale what humans cannot. Large language models don't have to judge people or résumés. They can analyze responses, one question at a time, and provide depth that a résumé will never give you.
A résumé says, “This person held a certain title, so they must know X.” An AI interview shows, “Here is how this candidate actually responded to a question about X.”
One is an assumption. The other is evidence.
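For readers who want to picture what that evidence can look like, here is a hypothetical sketch in which each structured question is tied to one skill and each response is scored on its own, with the supporting quote preserved for review. The schema (QuestionResult, build_report) is an assumption made for illustration, not a specific product's data model.

```python
from dataclasses import dataclass

# A hypothetical per-candidate evidence report. Each structured question maps
# to one skill, each response is scored independently, and the supporting
# quote is kept so a recruiter can review it.

@dataclass
class QuestionResult:
    skill: str       # e.g. "SQL joins"
    question: str    # the structured question that was asked
    score: int       # 1-5, from a rubric-based scorer like the one sketched above
    evidence: str    # a quote from the candidate's own answer

def build_report(results):
    """Group independently scored answers by skill, keeping the evidence visible."""
    report = {}
    for r in results:
        report.setdefault(r.skill, []).append(
            {"question": r.question, "score": r.score, "evidence": r.evidence}
        )
    return report
```

Because every score carries its own quote, a recruiter can check the evidence directly instead of trusting an opaque composite number.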
Instead of asking, “What data did you train on?” ask, “How are individual question responses mapped to skills, and how is expertise scored?”
Instead of asking, “Won’t this amplify bias?” ask, “How do you make sure each skill is scored independently and transparently, with evidence I can review?”
Instead of asking, “What were LLMs trained on?” ask, “How is the model adapted to evaluate skills in a fair, explainable way for this role?”
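The "fair, explainable" answer is also something a buyer can spot-check rather than take on faith. The sketch below assumes access to a scoring function like the one above and simply re-scores the same anonymized response several times to see whether the result is stable; it is an illustrative check, not a full fairness or adverse-impact audit.

```python
import statistics

def consistency_check(score_fn, response, runs=5, max_spread=1):
    """Score the same anonymized response several times and flag instability.

    score_fn is any callable that returns an integer score (for example, a
    wrapper around the rubric-based scorer sketched earlier). A spread wider
    than max_spread suggests the rubric or prompt needs tightening before the
    tool is trusted at scale.
    """
    scores = [score_fn(response) for _ in range(runs)]
    spread = max(scores) - min(scores)
    return {
        "scores": scores,
        "mean": statistics.mean(scores),
        "spread": spread,
        "stable": spread <= max_spread,
    }
```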
The scandals of the last decade explain the skepticism that still lingers. But ranking résumés, whether by humans or by machines, is not the right path forward.
The stronger path is structured skill assessment. Large language models make that possible at scale by evaluating responses to job-relevant questions, not résumés or people. Each answer can be scored independently, the results are transparent, and recruiters have real evidence to work with.
This isn’t about guessing who is qualified. It’s about giving every qualified candidate the chance to show what they know, fairly and consistently.