Interactive Softmax Playground
How LLMs choose their words

When ChatGPT or Claude generates text, it doesn't 'think': it computes a probability for every possible next word. This playground lets you visualize that calculation and see how language models turn probabilities into concrete decisions.

The context of the exercise

Imagine asking a language model to write a story about a cat. It has started the text but hasn't finished it yet. How does it choose the continuation? That's what we're going to discover below.

Context

" The window was open and sunlight streamed into the garden where flowers bloomed. A small mouse scurried across the kitchen floor, unaware of the danger nearby. The curious cat slowly approached the ... "

Parameters

Token | Base Logit | Probability

Educational Simplifications

→ In real life, models work with tokens, not words (a token can be a complete word, part of a word, or even a character).

→ Base logits have been designed for this playground and are not real model values.

→ Parameter domains are tailored for this playground and may differ from real models.

Metrics

  • Perplexity: 0
  • Entropy: 0
  • Confidence: 0
  • Effective K: 0

Perplexity

Perplexity measures how "surprised" the model is by the probability distribution.

Interpretation:

  • Lower perplexity (closer to 1) = more confident, less uncertain
  • Higher perplexity = more uncertain, more diverse predictions
  • Perplexity of 2 means the model is as uncertain as if it were choosing between 2 equally likely options
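Perplexity is just the exponential of entropy, so it can be computed in a few lines. A minimal sketch (the distributions below are illustrative, not real model outputs):

```python
import math

def perplexity(probs):
    # H = -sum p*log(p); perplexity = e^H (zero-probability terms contribute nothing)
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return math.exp(h)

# Two equally likely options -> perplexity of exactly 2
print(perplexity([0.5, 0.5]))  # 2.0

# A confident distribution -> perplexity close to 1
print(perplexity([0.97, 0.01, 0.01, 0.01]))
```

Note how the uniform two-option case lands exactly on 2, matching the interpretation above.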

How Softmax Works

The softmax function converts raw scores (logits) into a probability distribution. It's commonly used in language models to determine which token to generate next.
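The conversion can be sketched in a few lines of Python. The example logits are arbitrary, chosen only to show that the largest logit gets the largest probability:

```python
import math

def softmax(logits):
    # Subtract the max logit for numerical stability (the result is unchanged)
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)  # sums to 1; ordering of logits is preserved
```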

Parameter Descriptions

Not all language models implement all the parameters presented here, and some have additional controls. For example, GPT-4 doesn't support top-k sampling but offers logit_bias that can be used to modify specific token logits. Different models may have varying parameter ranges, default values, or entirely different sampling strategies. Always consult the specific model's documentation for available parameters and their effects.

Temperature (T)

Controls the randomness of token selection. Lower values (0.2-0.8) make the model more deterministic and focused on high-probability tokens. Higher values (1.2-2.0) increase creativity and diversity by flattening the probability distribution.
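Temperature simply divides every logit before the softmax. A quick sketch with made-up logits, comparing a low and a high temperature:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def with_temperature(logits, T):
    # Lower T sharpens the distribution; higher T flattens it
    return softmax([z / T for z in logits])

logits = [2.0, 1.0, 0.0]
cold = with_temperature(logits, 0.5)  # top token dominates
hot = with_temperature(logits, 2.0)   # probabilities more even
print(cold[0], hot[0])
```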

Top-k

Limits selection to the top k most probable tokens. Setting k=1 makes the model always choose the most likely token (greedy decoding). Higher k values allow more variety while still filtering out very unlikely tokens.

Top-p

Nucleus sampling: selects from the smallest set of tokens whose cumulative probability exceeds p. This dynamically adjusts the number of considered tokens based on the probability distribution, providing more natural diversity than fixed top-k.
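The same idea in code: walk down the sorted tokens until the cumulative probability reaches p, then cut. A sketch with illustrative values:

```python
def top_p_filter(probs, p):
    # Keep the smallest set of tokens whose cumulative probability reaches p
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cum = set(), 0.0
    for i in order:
        keep.add(i)
        cum += probs[i]
        if cum >= p:
            break
    filtered = [q if i in keep else 0.0 for i, q in enumerate(probs)]
    total = sum(filtered)
    return [q / total for q in filtered]

probs = [0.5, 0.3, 0.15, 0.05]
print(top_p_filter(probs, 0.9))  # keeps the first three tokens
```

Unlike top-k, the number of surviving tokens here depends on the shape of the distribution: a very peaked distribution may keep only one token, a flat one many.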

Presence Penalty (λₚ)

Boosts the logits of tokens that haven't appeared in the recent context (equivalently, penalizes every token that has appeared at least once, regardless of how often). Positive values encourage the model to use new vocabulary and avoid repetition. Useful for creative writing and avoiding repetitive patterns.

Frequency Penalty (λᶠ)

Reduces the probability of tokens based on how frequently they've appeared in the recent context. Higher values more aggressively penalize repeated tokens, helping to maintain variety and reduce redundancy in the generated text.

Calculation Steps

  1. Apply penalties: Adjust logits based on frequency and presence penalties.
  2. Apply temperature: Divide logits by temperature to control randomness.
  3. Compute softmax: Convert to probabilities using the exponential function.
  4. Apply filters: Use top-k and top-p to limit the token selection.
  5. Renormalize: Recalculate probabilities after filtering.
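The five steps above can be chained into one function. A minimal sketch (all parameter values and logits are illustrative, not from any real model):

```python
import math

def sample_distribution(logits, counts, T=1.0, k=None, top_p=None,
                        presence=0.0, frequency=0.0):
    # 1. Apply penalties
    z = [zi + (presence if c == 0 else 0.0) - frequency * c
         for zi, c in zip(logits, counts)]
    # 2. Apply temperature
    z = [zi / T for zi in z]
    # 3. Compute softmax (max-subtracted for numerical stability)
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    s = sum(exps)
    probs = [e / s for e in exps]
    # 4. Apply filters: top-k first, then top-p on the survivors
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if k is not None:
        order = order[:k]
    if top_p is not None:
        kept, cum = [], 0.0
        for i in order:
            kept.append(i)
            cum += probs[i]
            if cum >= top_p:
                break
        order = kept
    probs = [p if i in order else 0.0 for i, p in enumerate(probs)]
    # 5. Renormalize
    total = sum(probs)
    return [p / total for p in probs]

dist = sample_distribution([2.0, 1.5, 1.0, 0.5], [2, 0, 0, 1],
                           T=0.8, k=3, top_p=0.95,
                           presence=0.6, frequency=0.3)
print(dist)  # sums to 1, with at most 3 nonzero entries
```

The final distribution is what the model actually samples a token from.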

Mathematical Formulas

Adjusted Logit: z'ᵢ = zᵢ + λₚ·1[countᵢ = 0] - λᶠ·countᵢ

Temperature: z̃ᵢ = z'ᵢ / T

Softmax: pᵢ = e^(z̃ᵢ - max(z̃)) / Σⱼ e^(z̃ⱼ - max(z̃))

Entropy: H = -Σᵢ pᵢ log(pᵢ)

Perplexity: PP = e^H