Harrison Chase · 05/28/25 · @ LangChain
There are three types of evaluators: deterministic code-based checks, LLM-as-judge techniques for more complex assessments, and human annotation for real-time feedback.
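A minimal sketch of the first category, assuming nothing beyond the standard library: deterministic evaluators are plain functions over the model's output (and optionally a reference), cheap and reproducible enough to run on every example. The check names here are illustrative, not part of any LangChain API.

```python
import re

# Deterministic code-based evaluators: pure functions over (output, reference).
# Cheap, fast, and reproducible, so they can run on every example.

def exact_match(output: str, reference: str) -> bool:
    """Pass only if the output matches the reference, ignoring case and whitespace."""
    return output.strip().lower() == reference.strip().lower()

def contains_citation(output: str) -> bool:
    """Pass if the output cites at least one source like [1] or [2]."""
    return re.search(r"\[\d+\]", output) is not None

print(exact_match("Paris", "paris"))             # True
print(contains_citation("See [1] for details"))  # True
```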
Video: How to Solve the #1 Blocker for Getting AI Agents in Production | LangChain Interrupt
@ LangChain · 05/28/25
Related Takeaways
Harrison Chase · 05/28/25 · @ LangChain
Using LLMs as judges to evaluate outputs is promising, but it requires careful prompt engineering to grade responses accurately.
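A sketch of what that prompt engineering looks like in practice: the judge gets an explicit rubric and a constrained output schema rather than a bare "is this good?" question. `init_chat_model` and `with_structured_output` are real LangChain APIs, but the rubric, schema, and model choice here are illustrative assumptions, not from the talk.

```python
from pydantic import BaseModel, Field
from langchain.chat_models import init_chat_model

class Grade(BaseModel):
    """Constrained judge output: reasoning first, then a bounded score."""
    reasoning: str = Field(description="Step-by-step justification for the score")
    score: int = Field(ge=1, le=5, description="1 = unusable, 5 = fully correct")

# Hypothetical rubric: an explicit scale beats asking "is this answer good?"
JUDGE_PROMPT = """You are grading an answer to a user question.

Rubric:
5 - factually correct, complete, directly answers the question
3 - partially correct or incomplete
1 - incorrect or off-topic

Question: {question}
Answer: {answer}

Explain your reasoning first, then give a score from 1 to 5."""

judge = init_chat_model("gpt-4o-mini").with_structured_output(Grade)
grade = judge.invoke(JUDGE_PROMPT.format(question="...", answer="..."))
```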
Harrison Chase · 05/28/25 · @ LangChain
There are three types of evals: offline evals run against a curated dataset before deployment, online evals run over live production traffic, and in-the-loop evals run at runtime to check an agent's work before it responds.
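One way to make the three categories concrete, as a sketch with placeholder names (`agent`, `judge`, `dataset`, and `production_runs` are hypothetical, not a real API):

```python
import random

# Offline: score the agent against a curated dataset before shipping.
def offline_eval(agent, judge, dataset):
    return [judge(ex["input"], agent(ex["input"]), ex["reference"]) for ex in dataset]

# Online: score a sample of live production traffic after shipping
# (no references exist here, so the judge must be reference-free).
def online_eval(judge, production_runs, sample_rate=0.1):
    sampled = [r for r in production_runs if random.random() < sample_rate]
    return [judge(r["input"], r["output"], None) for r in sampled]

# In the loop: score a draft at runtime and retry before responding.
def in_the_loop(agent, judge, user_input, max_retries=2):
    draft = agent(user_input)
    for _ in range(max_retries):
        if judge(user_input, draft, None):  # draft passes the runtime check
            break
        draft = agent(user_input)           # otherwise, try again
    return draft
```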
Harrison Chase · 05/28/25 · @ LangChain
Evaluations consist of two parts: the data you evaluate over and the evaluators you score with, and both vary depending on whether you have ground-truth references or are using reference-free methods.
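A sketch of how that choice shows up in evaluator signatures (the function names are illustrative): a ground-truth evaluator needs a reference answer carried in the dataset, while a reference-free evaluator judges the output on its own.

```python
# Ground-truth evaluator: the dataset must carry a reference answer per example.
def correctness(inputs: str, outputs: str, reference_outputs: str) -> bool:
    return outputs.strip() == reference_outputs.strip()

# Reference-free evaluator: data is just inputs/outputs, e.g. production traces;
# the criterion (here, conciseness) needs no gold answer.
def conciseness(inputs: str, outputs: str) -> bool:
    return len(outputs.split()) <= 100
```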
Harrison Chase · 05/28/25 · @ LangChain
We're launching open-source evaluators for code, retrieval-augmented generation, extraction, and tool calling, making it easier for developers to get started.
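The package being announced is openevals. The snippet below follows the prebuilt LLM-as-judge pattern from its README; treat the exact names as a best-effort sketch against the version current at the time of writing, not an authoritative reference.

```python
# pip install openevals
from openevals.llm import create_llm_as_judge
from openevals.prompts import CORRECTNESS_PROMPT

# Prebuilt LLM-as-judge evaluator: grades outputs against a reference answer.
correctness_evaluator = create_llm_as_judge(
    prompt=CORRECTNESS_PROMPT,
    feedback_key="correctness",
    model="openai:o3-mini",
)

result = correctness_evaluator(
    inputs="What is 2 + 2?",
    outputs="4",
    reference_outputs="4",
)
print(result)  # a feedback dict with a score and the judge's comment
```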
Ion Stoica · 05/29/25 · @ a16z
In the context of AI evaluation, benchmarks are like supervised learning: static test sets fixed in advance. LMArena operates more like reinforcement learning, drawing on live human feedback, which allows more dynamic, real-world testing.
Ion Stoica · 05/29/25 · @ a16z
The unique aspect of LMArena is its ability to evolve over time, adapting to the changing landscape of AI evaluation.
a16z Cast · 05/29/25 · @ a16z
Expert evaluations are valuable, but they must be complemented by broader community input to avoid bias in AI assessments.
LangChain Cast · 05/30/25 · @ LangChain
Using large language models as judges, combined with human review, lets evaluation scale without overwhelming human subject-matter experts.
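A sketch of that division of labor, with placeholder names (`llm_judge` and the verdict shape are assumptions): the judge scores every run, and only the ones it is uncertain about are escalated to human experts.

```python
def triage(runs, llm_judge, human_review_queue, confidence_threshold=0.8):
    """Let the LLM judge handle the bulk; escalate only uncertain cases."""
    auto_scored = []
    for run in runs:
        verdict = llm_judge(run)  # assumed to return {"score": ..., "confidence": ...}
        if verdict["confidence"] >= confidence_threshold:
            auto_scored.append((run, verdict))  # judge is sure: accept its grade
        else:
            human_review_queue.append(run)      # judge is unsure: send to an expert
    return auto_scored
```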