Understanding model temperature from first principles

Nadeesha Cabral

So, what is model temperature really? Let's go through a simple experiment to probe the inner workings of a language model.

For this experiment, we'll be using the open-source llama3.2 3B model. If you want to follow along, you can do so with Ollama.

LLMs work by predicting the next token in a sequence: given the input tokens, the model assigns a probability to every token in its vocabulary and picks one as the next token.
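Generation is just that prediction step run in a loop. Here's a rough Python sketch; next_token_distribution is a hypothetical stand-in for the model itself, returning a probability for every candidate next token:

import random

def generate(prompt_tokens, next_token_distribution, max_new_tokens=20):
    """Toy generation loop: sample a next token, append it, repeat."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        # The (hypothetical) model maps the tokens so far to {token: probability}
        probs = next_token_distribution(tokens)
        next_token = random.choices(list(probs), weights=list(probs.values()), k=1)[0]
        tokens.append(next_token)
    return tokens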

What are Sundays for?

For example, you can ask a language model to complete the sentence "Sundays are for...". There's no objectively correct answer here.

If you stopped random people on the street, different people would answer the question differently. Let's say you survey 10,000 people and tally the answers they give:

  • Relaxing: 80%
  • Reading: 10%
  • Working: 8%
  • Getting chores done: 1.5%
  • Skydiving: 0.5%

Based on these answers, you can roughly say that if you ask 100 people, ~80 of them will say that Sundays are for relaxing. But to find one person who answers "Skydiving", you might have to stop about 200 people.

So, when you stop a random person on the street and ask a question, there's a probability distribution of the answers that person might give.

LLMs work similarly: based on our input ("Sundays are for...") and a few other parameters, there's a set of "choices" the model can make, and each choice has a probability of being picked.
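To make the analogy concrete, here's a small Python sketch that treats the survey percentages above as a probability distribution and draws from it. Sampling a completion from an LLM works the same way, except the "answers" are tokens:

import random

# The survey results expressed as a probability distribution
answers = ["Relaxing", "Reading", "Working", "Getting chores done", "Skydiving"]
weights = [0.80, 0.10, 0.08, 0.015, 0.005]

# "Stopping a person on the street" is one draw from this distribution;
# do it 10,000 times and count what comes back.
sample = random.choices(answers, weights=weights, k=10_000)
for answer in answers:
    print(f"{answer}: {sample.count(answer)}")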

Probing the LLM

Anyone working with LLMs has an intuition for temperature. It's a knob that controls the randomness of the LLM's output.

To see this in action, let's prompt our model with "Complete this sentence: Sundays are for ___" at a temperature of 1. That value is arbitrary, but it's high enough to illustrate the point.

We'll ask this question 10,000 times, and we'll see what the most popular answers are.

# Ask the model 10,000 times and append each completion to sunday.txt
for i in {1..10000}; do
  curl -s http://localhost:11434/api/generate -d '{
    "model": "llama3.2",
    "prompt": "Complete this sentence: Sundays are for _______.",
    "stream": false,
    "options": {
      "temperature": 1
    }
  }' | jq '.response' >> sunday.txt
done
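To see which answers come up most often, we can tally the file. Here's a small Python sketch for that; it assumes sunday.txt contains one JSON-quoted response per line, which is what the loop above produces:

from collections import Counter
import json

# Each line of sunday.txt is a JSON string like "relaxation and unwinding."
with open("sunday.txt") as f:
    answers = [json.loads(line) for line in f if line.strip()]

total = len(answers)
for answer, count in Counter(answers).most_common(10):
    print(f"{count:6d}  {100 * count / total:6.2f}%  {answer}")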

Here are the top 10 results:

Answer                                           Count      %
relaxation and unwinding.                         1069  21.38
relaxation.                                        985  19.70
relaxation and recharging.                         300   6.00
relaxation and self-care.                          260   5.20
relaxation and rejuvenation.                       233   4.66
relaxation and leisure.                            126   2.52
...relaxation and unwinding after a busy week.     107   2.14
...relaxation and unwinding after a busy week!     100   2.00
Sundays are for relaxation.                         99   1.98
Sundays are for relaxation and unwinding.           72   1.44

By setting a temperature of 1, we've asked for a lot of variability in the output. But even in this case, you can see that the model has a preference for "relaxation and unwinding" and "relaxation". In the grand scheme of things, our sample size is still too small to definitively say that the model prefers "relaxation and unwinding" over "relaxation".

What if we set the temperature to 0?

We can repeat the same experiment, but this time set the temperature to 0.

# Repeat the experiment 100 times at temperature 0
for i in {1..100}; do
  curl -s http://localhost:11434/api/generate -d '{
    "model": "llama3.2",
    "prompt": "Complete this sentence: Sundays are for _______.",
    "stream": false,
    "options": {
      "temperature": 0
    }
  }' | jq '.response' >> sunday_0.txt
done

Here's the result:

Answer                      Count    %
relaxation and unwinding.     100  100

We can see that the model is now consistently choosing "relaxation and unwinding".

So, when we set the temperature to 0, the model picks the most likely answer every single time and ignores the rest of the possible answers.

When we set the temperature to 1, the model's answers are more varied, and it starts suggesting less likely completions as well.

Temperatures > 1

Most closed-source/proprietary models expose a bounded temperature range (commonly 0-1 or 0-2). But there's nothing magical about that upper bound: you can actually set the temperature to be greater than 1.

If you set the temperature to be greater than 1, the model will become more random, and will start suggesting more and more "unlikely" answers.

# Same prompt, but with an absurdly high temperature
for i in {1..10000}; do
  curl -s http://localhost:11434/api/generate -d '{
    "model": "llama3.2",
    "prompt": "Complete this sentence: Sundays are for _______.",
    "stream": false,
    "options": {
      "temperature": 10000000
    }
  }' | jq '.response' >> results_2_sunday.txt
done

Probability distribution vs. temperature

Based on the above experiment, you might think that temperature is just the probability distribution of the LLM's output. But that's not quite right.

Temperature is a knob that controls how much the model leans into the probability distribution.

A neural network outputs a bunch of things called "logits". These are the raw, unnormalized scores for each token in the model's vocabulary. A "score" here is just a real number, positive or negative.

The logits are then:

  1. Divided by the temperature
  2. Passed through an exponential function, which amplifies the differences between the numbers
  3. Squashed between 0 and 1, and normalized so that they sum up to 1

For example, take these 3 numbers:

2, 1, 0.5

temperature=0.1 (Cold):

  1. Scaled logits: [20.0, 10.0, 5.0]
  2. After exp(): [485165195.4, 22026.5, 148.4]
  3. Probabilities: [≈1.0, 0.00005, ≈0]
Logit   [=====20=====]
        [====10====]
        [===5===]

Exp     [===========================485165195.4===========================]
        [22026.5]
        [148.4]

Probs   [============================≈1.0============================]
        [0.00005]
        [≈0]

temperature=1 (Normal):

  1. Scaled logits: [2.0, 1.0, 0.5]
  2. After exp(): [7.39, 2.72, 1.65]
  3. Probabilities: [0.63, 0.23, 0.14]
Logit   [==2.0==]
        [=1.0=]
        [0.5]

Exp     [=======7.39=======]
        [===2.72===]
        [=1.65=]

Prob    [======0.63======]
        [==0.23==]
        [=0.14=]

temperature=2 (Hot):

  1. Scaled logits: [1.0, 0.5, 0.25]
  2. After exp(): [2.72, 1.65, 1.28]
  3. Probabilities: [0.48, 0.29, 0.23]
Logit   [=1.0=]
        [0.5]
        [0.25]

Exp     [===2.72===]
        [==1.65==]
        [=1.28=]

Prob    [====0.48====]
        [===0.29===]
        [==0.23==]

As you can see, when the temperature is low, the highest logit becomes much more likely. When the temperature is high, the probabilities are more uniform (or flat).
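You can verify these numbers with a few lines of Python. Here's a minimal sketch of the softmax-with-temperature computation described above (the function name is mine; a real model applies this over a vocabulary of tens of thousands of tokens, but the mechanics are identical):

import math

def softmax_with_temperature(logits, temperature):
    """Steps 1-3 from above: scale by temperature, exponentiate, normalize."""
    scaled = [x / temperature for x in logits]   # 1. divide by the temperature
    exps = [math.exp(x) for x in scaled]         # 2. exponentiate (amplifies gaps)
    total = sum(exps)
    return [e / total for e in exps]             # 3. normalize so they sum to 1

logits = [2.0, 1.0, 0.5]
for t in (0.1, 1.0, 2.0, 10_000_000):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])

At a very high temperature (like the 10,000,000 used earlier), the probabilities come out almost exactly uniform, which is why the output becomes nearly random.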

What about temperature=0?

When the temperature is 0, the output of the model is deterministic. Since we can't divide by 0, the softmax step above breaks down, so instead of sampling from a distribution, the model simply picks the token with the highest logit every time (greedy decoding) and ignores the rest.

This is usually handled as a special case in the implementation of the LLM.
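For example, a sampler might special-case it roughly like this (a toy sketch, not tied to any particular implementation):

import math
import random

def sample_token(logits, temperature):
    """Pick an index into logits; temperature 0 means greedy decoding."""
    if temperature == 0:
        # Special case: skip the softmax entirely and take the highest logit.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    exps = [math.exp(x) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(logits)), weights=probs, k=1)[0]

print(sample_token([2.0, 1.0, 0.5], 0))  # always 0, the index of the highest logit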