Skip to main content

Judging Code

  • An example of promoting Claude to evaluate code based and consistently give it a score based on a predetermined rubric
  • Turned out to be very consistent
  • Maybe LLMs are for reading code instead of generating code?

Excerpts from Judging Code by Thorsten Ball:

I want to show you something.

We start by running this:

$ cargo new code-judge

And we enter and get ready to write some code:

$ cd code-judge
$ $EDITOR .

Next, mise en place. We open Cargo.toml and add the following:

[dependencies]
ureq = { version = "2.9", features = ["json"] }
serde_json = "1.0"
serde = { version = "1.0", features = ["derive"] }
anyhow = "1.0"

With that, we gain the ability to send HTTP requests, serialize & deserialize JSON, and to handle errors without cursing. We’re ready to write some code.

We open src/main.rs and add everything we need to talk to Claude:

// src/main.rs

use anyhow::Result;
use serde::{Deserialize, Serialize};
use serde_json::json;

#[derive(Debug, Serialize, Deserialize)]
struct ContentItem {
    text: String,
    #[serde(rename = "type")]
    content_type: String,
}

#[derive(Debug, Serialize, Deserialize)]
struct ClaudeResponse {
    content: Vec<ContentItem>,
}

fn get_claude_response(prompt: &str) -> Result<String> {
    let api_key = std::env::var("ANTHROPIC_API_KEY").expect("ANTHROPIC_API_KEY is not set");
    let model = "claude-3-5-sonnet-latest";

    let mut response: ClaudeResponse = ureq::post("https://api.anthropic.com/v1/messages")
        .set("x-api-key", &api_key)
        .set("anthropic-version", "2023-06-01")
        .set("content-type", "application/json")
        .send_json(json!({
            "model": model,
            "temperature": 0.0,
            "messages": [{
                "role": "user",
                "content": prompt
            }],
            "max_tokens": 1024
        }))?
        .into_json()?;

    Ok(response.content.remove(0).text)
}

The mouthpiece is in place. Now we need to say something.

What do we want from Claude? Judgement.

// src/main.rs

struct Judgement {
    score: f64,
    message: String,
}

How do we get it? By mashing together some strings and asking Claude:

// src/main.rs

fn judge_code(code: &str, assertions: Vec<&str>) -> Result<Judgement> {
    let mut fenced_code = String::from("\`\`\`");
    fenced_code.push_str(code);
    fenced_code.push_str("\`\`\`");

    let formatted_assertions = assertions
        .iter()
        .map(|a| format!("- {}", a))
        .collect::<Vec<_>>()
        .join("\n");

    let prompt = include_str!("../prompts/judge.md")
        .replace("<code>", &fenced_code)
        .replace("<assertions>", &formatted_assertions);

    let response = get_claude_response(&prompt)?;

    let (message, score_text) = response
        .rsplit_once('\n')
        .ok_or(anyhow::anyhow!("Failed to parse score"))?;
    let score = score_text.parse::<f64>()?;

    Ok(Judgement {
        score,
        message: message.trim().into(),
    })
}

Right there in the middle, there’s a reference to a file we’re still missing. Time to create it:

$ mkdir prompts
$ touch prompts/judge.md

What goes into a file called prompts/judge.md? Nothing less than the spell that will cast Claude into a judge of code:

## Task

You are an expert code judger. Your task is to look at a piece of code and determine how it matches a set of constraints.

Your response should follow this structure:

1. Brief code analysis
2. List of constraints met
3. List of constraints not met
4. Final score

Be terse, be succinct.

Score the code between 0 and 5 using these criteria:

- 5: All must-have constraints + all nice-to-have constraints met, or all must-have constraints met if there are no nice-to-have constraints
- 4: All must-have constraints + majority of nice-to-have constraints met
- 3: All must-have constraints + some nice-to-have constraints met
- 2: All must-have constraints met but failed some nice-to-have constraints
- 1: Some must-have constraints met
- 0: No must-have constraints met or code is invalid/doesn't compile

Must-have constraints are marked with [MUST] prefix in the constraints list.

The last line of your reply **MUST** be a single number between 0 and 5.

## Code

Here is the snippet of code you are evaluating:

<code>

## Constraints

Here are the constraints:

<assertions>

The spell in place, the next step is to put ourselves into position to cast it.

// src/main.rs

const RED: &'static str = "\x1b[31m";
const GREEN: &'static str = "\x1b[32m";
const RESET: &'static str = "\x1b[0m";

fn main() -> Result<()> {
    let assertions = vec![];
    let code = include_str!("../data/code-to-judge");

    let result = judge_code(code, assertions)?;
    println!(
        "========= Result =======\nMessage: {}\n\nScore: {}{}{}\n",
        result.message,
        if result.score < 2.0 { RED } else { GREEN },
        result.score,
        RESET
    );
    Ok(())
}

Some color never hurt. But, again, things are missing: assertions is empty and data/code-to-judge — what is that?

It’s the final two pieces in this little demonstration and this is also where some audience participation is allowed, but to keep things simple, how about this:

$ mkdir data
$ wget thorstenball.com -O data/code-to-judge

My personal website, ready to be judged. The last thing that’s missing is the law by which it’s judged. Let’s add it:

// src/main.rs

fn main() -> Result<()> {
    let assertions = vec![
        "[MUST] The year of the copyright notice has to be 2025.",
        "[MUST] The link to the Twitter profile has to be to @thorstenball",
        "Menu item linking to Register Spill must be marked as new",
        "Should mention that Thorsten is happy to receive emails",
        "Has photo of Thorsten",
    ];
    // [...]
}

What will Claude say?

Time to ask it:

$ export ANTHROPIC_API_KEY="onetwothree"
$ cargo run

And, after taking a beat, it tells us:

Message: 1. Analysis:
Simple personal website with navigation menu, about section, and contact information. Clean HTML structure with proper meta tags and styling links.

2. Constraints met:
- Copyright year is 2025
- Twitter profile links to @thorstenball
- Register Spill menu item is marked with "new!"
- Explicitly states "I love getting email from you"
- Has profile picture (avatar.jpg)

3. Constraints not met:
- None

4. Final score:
All must-have constraints are met (copyright year and Twitter handle) and all nice-to-have constraints are met (Register Spill marking, email happiness, photo).

Score: 5

The perfect score. What if the law changes? What if we want the code to say I want to receive phone calls (a lie)?

// src/main.rs

fn main() -> Result<()> {
    let assertions = vec![
        "[MUST] The year of the copyright notice has to be 2025.",
        "[MUST] The link to the Twitter profile has to be to @thorstenball",
        "Menu item linking to Register Spill must be marked as new",
        "Has photo of Thorsten",
        // New:
        "Should mention that Thorsten is happy to receive phone calls",
    ];
    // [...]
}

Claude says:

Message: 1. Brief code analysis:
Simple personal website HTML with header, navigation menu, main content, and footer. Contains social links, brief bio, and contact information.

2. Constraints met:
- [MUST] Copyright year is 2025
- [MUST] Twitter profile links to @thorstenball
- Register Spill menu item is marked with <mark>new!</mark>
- Has photo (avatar.jpg)

3. Constraints not met:
- Does not mention anything about phone calls (only email contact is provided)

4. Final score:
All must-have constraints are met, plus 3 out of 4 nice-to-have constraints.

Score: 4

Down to 4 out of 5. Now, what if we make this a [MUST] and add another, say, arbitrary constraint?

// src/main.rs

fn main() -> Result<()> {
    let assertions = vec![
        "[MUST] The year of the copyright notice has to be 2025.",
        "[MUST] The link to the Twitter profile has to be to @thorstenball",
        "Menu item linking to Register Spill must be marked as new",
        "Has photo of Thorsten",
        // New
        "[MUST] Should mention that Thorsten is happy to receive phone calls",
        "It should have hidden text in the HTML comments",
    ];
    // [...]
}

He must be happy to receive phone calls and… hidden text?

Claude’s judgment:

Message: 1. Brief code analysis:
Simple personal website HTML with header, navigation menu, main content, and footer. Contains personal info, social links, and contact details.

2. Constraints met:
- Copyright year is 2025
- Twitter profile links to @thorstenball
- Register Spill menu item is marked with "new!"
- Has photo (avatar.jpg)
- Contains hidden HTML comments

3. Constraints not met:
- [MUST] Does not mention accepting phone calls (only email contact is provided)

4. Final score:
Code meets 2 out of 3 must-have constraints, and 3 out of 3 nice-to-have constraints. Since one must-have constraint is missing, the score must be 1.

Score: 1

Harsh.

Harsh, but marvelous, isn’t it?

I’ve used LLMs-as-Judges quite a bit in the past few weeks at work and seeing LLMs work like that, be reliable like that, be a fuzzy-to-non-fuzzy adapter — it made me reconsider what I thought LLMs were useful for.

Reliable? Yes. The temperature is 0 and even if I ask Claude ten times, it will very likely produce the same thing, as long as all inputs stay the same:

$ cargo build
$ for i in $(seq 1 10); do ./target/debug/code-judge; done
Score: 1
Score: 1
Score: 1
Score: 1
Score: 1
Score: 1
Score: 1
Score: 1
Score: 1
Score: 1

That’s more reliable than most integration tests I’ve seen.

Seeing LLMs work like that made me think of all the questions I had in the past about data, about code, about text, that were very hard to answer in code but so easy to express in prose: does this page show the sign-in button? does this function call that one? is that thing hidden and that one extended? is this documented? is there commented-out code in here?

And then it hit me: maybe I don’t need to express them in code anymore.