Hyperparameters in LLMs
Several hyperparameters can be used to shape the output of Large Language Models (LLMs). This post covers a few common ones that can be adjusted via most LLM APIs.
Temperature
Temperature controls the "sharpness" or randomness of the LLM's output distribution. It's commonly set between 0 and 1, though some systems allow higher values. Mathematically, the temperature (t) divides the logits before the softmax function is applied:
softmax(logits / t)
A higher temperature (t -> 1 or greater):
- Makes the probability distribution more uniform.
- Reduces the difference between high and low probabilities.
- Increases the likelihood of selecting less probable, more "creative" outcomes.
- Leads to more diverse and sometimes more "surprising" selections.
- Can help with open-ended tasks such as brainstorming, where a wider range of possibilities is desirable.
A lower temperature (t -> 0):
- Makes the probability distribution more "peaked" or concentrated.
- Increases the difference between high and low probabilities.
- Decreases the likelihood of selecting less probable outcomes.
- Leads to more predictable and focused selections, often picking the most probable token.
- Can make the model's output more deterministic and less varied.
For example, suppose we have a probability distribution over 5 elements: [0.1, 0.2, 0.3, 0.2, 0.2]. With a high temperature, the distribution might become more uniform, e.g., [0.18, 0.2, 0.22, 0.2, 0.2]. Conversely, with a low temperature, the distribution becomes more peaked around the highest probability, e.g., [0.01, 0.1, 0.78, 0.1, 0.01].
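As a sketch, the temperature scaling described above can be implemented in a few lines of NumPy. The logits here are derived from the example distribution; the function name and the specific t values are illustrative, not part of any particular API:

```python
import numpy as np

def softmax_with_temperature(logits, t):
    """Divide logits by temperature t, then apply softmax.

    Lower t sharpens the distribution; higher t flattens it.
    """
    scaled = np.asarray(logits, dtype=float) / t
    scaled -= scaled.max()          # subtract max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

# Logits that softmax back to the example distribution at t = 1.
logits = np.log([0.1, 0.2, 0.3, 0.2, 0.2])

hot = softmax_with_temperature(logits, 2.0)   # flatter, more uniform
cold = softmax_with_temperature(logits, 0.5)  # more peaked around 0.3
```

Here `cold.max()` comes out larger than `hot.max()`, matching the peaked-versus-uniform behavior described above.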
Top P Sampling
Top P sampling, also known as Nucleus Sampling, selects the smallest set of elements whose cumulative probability exceeds a threshold P (e.g., 0.8). This is useful when we want to:
- Avoid selecting low probability elements.
- Capture the most significant portion of the distribution.
When to use this method: You need a more flexible and adaptive sampling strategy. By setting a cumulative probability threshold, Top P ensures you're always considering the most relevant tokens, regardless of how many there are, making it more adaptive than Top K. If P is too high (e.g., 0.99), it allows for a very wide range of less probable tokens, potentially leading to more nonsensical output. If P is too low (e.g., 0.1), it becomes very restrictive, similar to a very low K value.
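A minimal sketch of the nucleus selection step, applied to the example distribution from earlier (production decoders typically do this on logits during generation; the function name is illustrative):

```python
import numpy as np

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability
    exceeds p, zero out the rest, and renormalize."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]              # indices, most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # smallest prefix exceeding p
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

# With P = 0.8 on [0.1, 0.2, 0.3, 0.2, 0.2], the 0.1 token is dropped
# (0.3 + 0.2 + 0.2 = 0.7 is not enough, so four tokens are kept).
print(top_p_filter([0.1, 0.2, 0.3, 0.2, 0.2], 0.8))
```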
Top K Sampling
Top K sampling selects the top K elements with the highest probabilities from the distribution. This method is useful when we want to:
- Focus on the most likely outcomes.
- Improve computational efficiency.
- Reduce the dimensionality of the output space.
When to use this method: You have a clear idea of the fixed number of elements you want to select. Top K is simpler and faster because it fixes the number of choices, making it ideal when you need a predictable range of highly probable outcomes. If K is too high (e.g., K=1000 for a small vocabulary), it effectively includes almost all tokens, diminishing its control. If K is too low (e.g., K=1), it makes the model very deterministic, always picking the single most probable token.
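The Top K step is even simpler to sketch, again over probabilities for clarity (the function name is illustrative; real decoders operate on logits):

```python
import numpy as np

def top_k_filter(probs, k):
    """Keep only the k highest-probability tokens and renormalize."""
    probs = np.asarray(probs, dtype=float)
    keep = np.argsort(probs)[::-1][:k]   # indices of the k most probable tokens
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

# With K = 2 on [0.1, 0.2, 0.3, 0.2, 0.2], only the 0.3 token and
# one of the 0.2 tokens survive, renormalized to 0.6 and 0.4.
print(top_k_filter([0.1, 0.2, 0.3, 0.2, 0.2], 2))
```

Note that K is fixed regardless of the shape of the distribution, which is exactly why Top P is often described as the more adaptive of the two.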
Other Hyperparameters
Max length: This caps how many tokens the model may generate. It's crucial for controlling verbosity and preventing runaway generation from your LLM.
Repetition penalty: This reduces the likelihood of tokens that have already appeared in the text. It's important for preventing monotonous or looping output and making the generated text sound more natural.
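One common formulation of the repetition penalty divides the logits of already-seen tokens by a penalty factor (and multiplies negative logits, so they are always pushed down). A sketch under that assumption; the function name and values are illustrative:

```python
import numpy as np

def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Discourage tokens that were already generated.

    Positive logits are divided by the penalty and negative logits are
    multiplied by it, so penalized tokens always become less likely.
    """
    logits = np.asarray(logits, dtype=float).copy()
    for idx in set(generated_ids):
        if logits[idx] > 0:
            logits[idx] /= penalty
        else:
            logits[idx] *= penalty
    return logits

# Tokens 0 and 1 were already generated, so both are pushed down;
# token 2 is untouched.
print(apply_repetition_penalty([2.0, -1.0, 0.5], [0, 1], penalty=2.0))
```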
Understanding and experimenting with these hyperparameters is crucial for effectively controlling the LLM's behavior and generating outputs tailored to specific needs.