Introducing vitals, a toolkit for evaluating LLM products in R

  ai, ellmer

  Simon Couch

We’re bear-y excited to announce the release of vitals on CRAN. vitals is a framework for large language model evaluation in R. It’s specifically aimed at ellmer users who want to measure the effectiveness of their LLM products like custom chat apps and querychat apps.

You can install it from CRAN with:

install.packages("vitals")

This blog post will demonstrate the basics of evaluating LLM products with vitals. Specifically, we’ll focus on a dataset of challenging R coding problems, evaluating how well different models from leading AI labs can solve them. This post just scratches the surface of what’s possible with vitals; check out the package website to learn more.

The basics

At their core, LLM evals are composed of three pieces:

  1. Datasets contain a set of labelled samples. A dataset is just a tibble with, minimally, columns input and target: input is a prompt that could be submitted by a user, and target is either literal value(s) or grading guidance (see the sketch following this list).
  2. Solvers evaluate the input in the dataset and produce a final result (hopefully) approximating target. In vitals, the simplest solver is just an ellmer chat (e.g.  ellmer::chat_anthropic()) wrapped in generate(), i.e. generate(ellmer::chat_anthropic()), which will call the Chat object’s $chat() method and return whatever it returns. When evaluating your own LLM products like shinychat and querychat apps, the underlying ellmer chat is your solver.
  3. Scorers evaluate the final output of solvers. They may use text comparisons, model grading, or other custom schemes to determine how well the solver approximated the target based on the input.
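
To make the dataset piece concrete, here's a sketch of what a minimal eval dataset could look like. The prompts and grading guidance below are invented for illustration; they're not part of the package:

# A made-up two-row dataset: `input` holds a prompt a user might send,
# `target` holds grading guidance describing what a correct answer looks like.
my_dataset <- tibble::tibble(
  input = c(
    "How do I compute the mean of each numeric column of a data frame?",
    "How do I read a CSV file into a data frame?"
  ),
  target = c(
    "Full credit for `colMeans()` on the numeric columns or an equivalent `dplyr::summarise(across())` call.",
    "Full credit for `readr::read_csv()` or `utils::read.csv()`."
  )
)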

This blog post will explore these three components using are, an example dataset that ships with the package.

First, loading some packages:
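
# vitals and ellmer drive the evaluation itself; dplyr and ggplot2 are used
# for the analysis later in this post.
library(vitals)
library(ellmer)
library(dplyr)
library(ggplot2)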

An R eval dataset

While vitals can evaluate LLM products on arbitrary capabilities, the package ships with an example dataset, are, that evaluates R coding performance. From the are docs:

An R Eval is a dataset of challenging R coding problems. Each input is a question about R code which could be solved on first-read only by human experts and, with a chance to read documentation and run some code, by fluent data scientists. Solutions are in target and enable a fluent data scientist to evaluate whether the solution deserves full, partial, or no credit.

glimpse(are)
#> Rows: 29
#> Columns: 7
#> $ id        <chr> "after-stat-bar-heights", "conditional-…
#> $ input     <chr> "This bar chart shows the count of diff…
#> $ target    <chr> "Preferably: \n\n```\nggplot(data = dia…
#> $ domain    <chr> "Data analysis", "Data analysis", "Data…
#> $ task      <chr> "New code", "New code", "New code", "De…
#> $ source    <chr> "https://jrnold.github.io/r4ds-exercise…
#> $ knowledge <list> "tidyverse", "tidyverse", "tidyverse",

At a high level:

  • id: A unique identifier for the problem.
  • input: The question to be answered.
  • target: The solution, often with a description of notable features of a correct solution.
  • domain, task, and knowledge: Metadata describing the kind of R coding challenge (see the quick tally after this list).
  • source: Where the problem came from, as a URL. Many of these coding problems are adapted “from the wild” and include the kinds of context usually available to those answering questions.
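
If you'd like a feel for the mix of problems before diving in, a quick tally of those metadata columns does the trick (output omitted):

# How many problems fall into each domain and task type?
count(are, domain, task)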

For the purposes of actually carrying out the initial evaluation, we’re specifically interested in the input and target columns. Let’s print out the first entry in full so you can get a taste of a typical problem in this dataset:

cat(are$input[1])
#> This bar chart shows the count of different cuts of diamonds, and each bar is
#> stacked and filled  according to clarity:
#> 
#> 
#> ```
#> 
#> ggplot(data = diamonds) + 
#>   geom_bar(mapping = aes(x = cut, fill = clarity))
#> ```
#> 
#> 
#> Could you change this code so that the proportion of diamonds with a given cut
#> corresponds to the bar height and not the count? Each bar should still be
#> filled according to clarity.

Here’s the suggested solution:

cat(are$target[1])
#> Preferably: 
#> 
#> ```
#> ggplot(data = diamonds) + 
#>   geom_bar(aes(x = cut, y = after_stat(count) / sum(after_stat(count)), fill = clarity))
#> ```
#> 
#> or:
#> 
#> ```
#> ggplot(data = diamonds) +
#>   geom_bar(mapping = aes(x = cut, y = ..prop.., group = clarity, fill = clarity))
#> ```
#> 
#> or:
#> 
#> ```
#> ggplot(data = diamonds) +
#>   geom_bar(mapping = aes(x = cut, y = after_stat(count / sum(count)), group = clarity, fill = clarity))
#> ```
#> 
#> The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0, but it
#> still works and should receive full credit:
#> 
#> ```
#> ggplot(data = diamonds) + 
#>   geom_bar(aes(x = cut, y = ..count.. / sum(..count..), fill = clarity))
#> ```
#> 
#> Simply setting `position = "fill"` will result in each bar having a height of 1
#> and is not correct.

Evaluation tasks

First, we’ll create a few ellmer chat objects that use different LLMs:

claude <- chat_anthropic(model = "claude-sonnet-4-20250514")
gpt <- chat_openai(model = "gpt-4.1")
gemini <- chat_google_gemini(model = "gemini-2.5-pro")
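
Each of these constructors needs an API key for its provider, which ellmer reads from an environment variable (ANTHROPIC_API_KEY and OPENAI_API_KEY for the first two; see ?chat_google_gemini for the Gemini variable). One way to set them persistently is via your user-level .Renviron:

# Opens your user-level .Renviron; add lines like ANTHROPIC_API_KEY=<key>
# and OPENAI_API_KEY=<key> there, then restart R so ellmer can find them.
usethis::edit_r_environ()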

LLM evaluation with vitals happens in two main steps:

  1. Use Task$new() to situate a dataset, solver, and scorer in a Task.
tsk <- Task$new(
  dataset = are,
  solver = generate(),
  scorer = model_graded_qa(
    partial_credit = TRUE, 
    scorer_chat = claude
  ),
  name = "An R Eval"
)

tsk
#> An evaluation task An-R-Eval.
  2. Use Task$eval() to evaluate the solver, evaluate the scorer, and then explore a persistent log of the results in the interactive log viewer.
tsk_claude <- tsk$clone()$eval(solver_chat = claude)

$clone()ing the object makes a copy so that the underlying tsk is unchanged—we do this so that we can reuse the tsk object to evaluate other potential solver_chats. After evaluation, the task contains information from the solving and scoring steps. Here’s what the model responded to that first question with:

cat(tsk_claude$get_samples()$result[1])
#> You can change the code to show proportions instead of counts by adding `position = "fill"` to the `geom_bar()` function:
#> 
#> ```r
#> ggplot(data = diamonds) + 
#>   geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")
#> ```
#> 
#> This will:
#> - Make each bar have the same height (representing 100% or proportion of 1)
#> - Show the relative proportions of each clarity type within each cut
#> - Still maintain the stacked bar format with clarity as the fill color
#> 
#> The y-axis will now show proportions from 0 to 1 instead of raw counts, making it easier to compare the relative distribution of clarity across different cuts of diamonds.

The task also contains score information from the scoring step. We've used model_graded_qa() as our scorer: a model-graded scorer provided by the package that uses another model to compare the solver's solutions against the reference solutions in the target column. The resulting score is either C (correct) or I (incorrect), though since we've set partial_credit = TRUE, the grading model can also allot a P (partially correct). By default, vitals scores solutions with the same model that generated them; since we supplied scorer_chat = claude when defining the task, though, Claude grades every solution regardless of which model produced it.

Hold up, though—we’re using an LLM to generate responses to questions, and then using the LLM to grade those responses?

The meme of three Spider-Men pointing at each other.

This technique is called “model grading” or “LLM-as-a-judge.” Done correctly, model grading is an effective and scalable solution to scoring. That said, it’s not without its faults. Here’s what the grading model thought of the response:

cat(tsk_claude$get_samples()$scorer_chat[[1]]$last_turn()@text)
#> Looking at this task, I need to understand what's being asked and what the submission provides.
#> 
#> The task asks to change the code so that "the proportion of diamonds with a given cut corresponds to the bar height." This means each bar's height should represent what fraction of the total dataset has that particular cut.
#> 
#> However, the submission provides `position = "fill"`, which creates bars that all have the same height (1.0 or 100%) and shows the relative proportions of clarity types *within* each cut category. This is fundamentally different from what was requested.
#> 
#> The criterion clearly states that the preferred solutions should show the proportion of the total dataset that each cut represents, using approaches like:
#> - `y = after_stat(count) / sum(after_stat(count))`
#> - `y = ..prop..` with appropriate grouping
#> - Similar statistical transformations
#> 
#> The criterion explicitly states that "Simply setting `position = "fill"` will result in each bar having a height of 1 and is not correct."
#> 
#> The submission's approach would result in:
#> - All bars having the same height (1.0)
#> - Showing clarity proportions within each cut
#> - Not showing the relative frequency of different cuts in the dataset
#> 
#> This does not meet the requirement that "the proportion of diamonds with a given cut corresponds to the bar height."
#> 
#> While the submission provides working R code and a clear explanation of what `position = "fill"` does, it solves a different problem than what was asked.
#> 
#> GRADE: I

Especially the first few times you run an eval, you'll want to inspect its results closely. The vitals package ships with an app, the Inspect log viewer (see a demo here), that allows you to drill down into the solutions and grading decisions from each model for each sample. In your first couple of runs, you'll likely find revisions you can make to your grading guidance in target, and to the LLM judge's setup, that better align the grading with your intent.

Any arguments to the solver or scorer can be passed to $eval(), allowing for straightforward parameterization of tasks. For example, if I wanted to evaluate OpenAI’s GPT 4.1 on this task rather than Anthropic’s Claude 4 Sonnet, I could write:

tsk_gpt <- tsk$clone()$eval(solver_chat = gpt)

Or, similarly for Google’s Gemini 2.5 Pro:

tsk_gemini <- tsk$clone()$eval(solver_chat = gemini)
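
Since $eval() returns the evaluated task, this clone-and-eval pattern is also easy to loop over if you're comparing several providers. As a sketch (the purrr-based version below is just one way to write it, not something vitals requires):

# Evaluate a fresh clone of the same task with each solver chat.
solvers <- list(claude = claude, gpt = gpt, gemini = gemini)
tasks <- purrr::map(solvers, \(chat) tsk$clone()$eval(solver_chat = chat))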

Analysis

To generate analysis-ready data frames, pass any number of Tasks to vitals_bind():

tsk_eval <- 
  vitals_bind(
    claude = tsk_claude, 
    gpt = tsk_gpt, 
    gemini = tsk_gemini
  )

tsk_eval
#> # A tibble: 87 × 4
#>    task   id                          score metadata
#>    <chr>  <chr>                       <ord> <list>  
#>  1 claude after-stat-bar-heights      I     <tibble>
#>  2 claude conditional-grouped-summary P     <tibble>
#>  3 claude correlated-delays-reasoning I     <tibble>
#>  4 claude curl-http-get               C     <tibble>
#>  5 claude dropped-level-legend        I     <tibble>
#>  6 claude filter-multiple-conditions  C     <tibble>
#>  7 claude geocode-req-perform         P     <tibble>
#>  8 claude group-by-summarize-message  C     <tibble>
#>  9 claude grouped-filter-summarize    P     <tibble>
#> 10 claude grouped-geom-line           P     <tibble>
#> # ℹ 77 more rows

From here, you’re in Happy Data Frame Land.🌈 To start off, we can quickly juxtapose those evaluation results:

tsk_eval |>
  rename(model = task) |>
  mutate(
    score = factor(
      case_when(
        score == "I" ~ "Incorrect",
        score == "P" ~ "Partially correct",
        score == "C" ~ "Correct"
      ),
      levels = c("Incorrect", "Partially correct", "Correct"),
      ordered = TRUE
    )
  ) |>
  ggplot(aes(y = model, fill = score)) +
  geom_bar() +
  scale_fill_brewer(breaks = rev, palette = "RdYlGn")

A ggplot2 horizontal stacked bar chart comparing the three models across three performance categories. Each model shows very similar performance: approximately 13 correct responses (green), 6 partially correct responses (yellow), and 10 incorrect responses (red).

Are these differences just a result of random noise, though? While the package doesn’t implement any analysis-related functionality itself, we’ve written up some recommendations on analyzing evaluation data on the package website.
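
If you want a rough sense of what such an analysis could look like, here's one sketch (not necessarily the approach the website recommends): since score is an ordered factor and all three models answered the same 29 questions, an ordinal regression with a random intercept per question is one way to ask whether between-model differences are larger than question-to-question noise. This assumes the ordinal package is installed:

library(ordinal)

scores <- rename(tsk_eval, model = task)

# Cumulative link mixed model: `model` as a fixed effect, with a random
# intercept for each question since every model answered the same questions.
fit <- clmm(score ~ model + (1 | id), data = scores)
summary(fit)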

Acknowledgements

Many thanks to JJ Allaire, Hadley Wickham, Max Kuhn, and Mine Çetinkaya-Rundel for their help in bringing this package to life.