Evaluation how-to guides
These guides answer “How do I...?” questions. They are goal-oriented and concrete, and are meant to help you complete a specific task. For conceptual explanations, see the Conceptual guide. For end-to-end walkthroughs, see Tutorials. For comprehensive descriptions of every class and function, see the API reference.
Offline evaluation
Evaluate and improve your application before deploying it.
Run an evaluation
- Run an evaluation
- Run an evaluation asynchronously
- Run an evaluation comparing two experiments
- Evaluate a langchain runnable
- Evaluate a langgraph graph
- Run an evaluation of an existing experiment
- Run an evaluation via the REST API
- Run an evaluation from the prompt playground
Define an evaluator
- Define a custom evaluator
- Define an LLM-as-a-judge evaluator
- Use an off-the-shelf evaluator via the SDK (Python only)
- Use an off-the-shelf evaluator via the UI
- Evaluate aggregate experiment results
- Evaluate intermediate steps
- Return multiple metrics in one evaluator
- Return categorical vs numerical metrics
- Check your evaluator setup
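As a rough illustration of the evaluator topics listed above, here is a minimal sketch in plain Python. It assumes an evaluator is an ordinary function that receives the application's outputs (and, optionally, reference outputs) as dictionaries and returns a dict with a `key` plus either a numerical `score` or a categorical `value`. The `"answer"` field and all function names are hypothetical and chosen only for this example; the linked guides describe the exact signatures and return formats the SDK accepts.

```python
from typing import Any


def exact_match(outputs: dict[str, Any], reference_outputs: dict[str, Any]) -> dict[str, Any]:
    """Numerical metric: 1 if the answer matches the reference, else 0."""
    # "answer" is a hypothetical field name for this sketch.
    predicted = str(outputs.get("answer", "")).strip().lower()
    expected = str(reference_outputs.get("answer", "")).strip().lower()
    return {"key": "exact_match", "score": int(predicted == expected)}


def length_bucket(outputs: dict[str, Any]) -> dict[str, Any]:
    """Categorical metric: label the answer by its word count."""
    n_words = len(str(outputs.get("answer", "")).split())
    if n_words <= 5:
        label = "short"
    elif n_words <= 50:
        label = "medium"
    else:
        label = "long"
    return {"key": "length_bucket", "value": label}


def combined(outputs: dict[str, Any], reference_outputs: dict[str, Any]) -> list[dict[str, Any]]:
    """Return several metrics from a single evaluator call."""
    return [exact_match(outputs, reference_outputs), length_bucket(outputs)]


if __name__ == "__main__":
    # Quick local check of the evaluators before wiring them into an experiment.
    print(combined({"answer": " Paris "}, {"answer": "paris"}))
```

Calling the functions directly on a few hand-written examples, as in the `__main__` block, is one simple way to check an evaluator's setup before running it against a full dataset.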
Configure the evaluation data
Configure an evaluation job
Unit testing
Unit test your system to identify bugs and regressions.
Online evaluation
Evaluate and monitor your system's live performance on production data.
Automatic evaluation
Set up evaluators that automatically run for all experiments against a dataset.
Analyzing experiment results
Use the UI and API to understand your experiment results.
- Compare experiments with the comparison view
- Filter experiments
- View pairwise experiments
- Fetch experiment results in the SDK
- Upload experiments run outside of LangSmith with the REST API
Dataset management
Manage the LangSmith datasets used by your evaluations.
- Manage datasets from the UI
- Manage datasets programmatically
- Version datasets
- Share or unshare a dataset publicly
- Export filtered traces from an experiment to a dataset
Annotation queues and human feedback
Collect feedback from subject matter experts and users to improve your applications.