Muralidhar Kashipathi's blog

Hyperparameters in LLMs

Several hyperparameters are commonly used to tweak the output of Large Language Models (LLMs). This post covers a few of the most common ones that can be set via APIs.

Temperature

Temperature controls the "sharpness" or randomness of the LLM's output distribution. It's commonly set between 0 and 1, though some systems allow for higher values. Mathematically, temperature (t) is used to scale the logits before applying the softmax function.

softmax(logits/t)

For example, suppose the model's output distribution over 5 tokens is [0.1, 0.2, 0.3, 0.2, 0.2]. With a high temperature, the distribution becomes more uniform, e.g., [0.18, 0.2, 0.22, 0.2, 0.2]. Conversely, with a low temperature, the distribution becomes more peaked around the highest-probability token, e.g., [0.01, 0.1, 0.78, 0.1, 0.01].
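The scaling above can be sketched in a few lines of Python. This is a minimal illustration using made-up logits, not any particular API's implementation:

```python
import math

def softmax_with_temperature(logits, t):
    """Divide logits by temperature t, then apply softmax."""
    scaled = [x / t for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.0, 2.0, 3.0, 2.0, 2.0]
cold = softmax_with_temperature(logits, 0.2)   # low t: peaked around the max logit
hot = softmax_with_temperature(logits, 10.0)   # high t: closer to uniform
```

Printing `cold` and `hot` shows the effect described above: the low-temperature distribution concentrates almost all mass on the largest logit, while the high-temperature one is nearly flat.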


Top P Sampling

Top P sampling, also known as Nucleus Sampling, selects the smallest set of tokens whose cumulative probability exceeds a threshold P (e.g., 0.8), then samples from that set.

When to use this method: you need a more flexible and adaptive sampling strategy. Because the threshold is on cumulative probability, Top P always considers the most relevant tokens regardless of how many there are, making it more adaptive than Top K. If P is too high (e.g., 0.99), it admits a very wide range of low-probability tokens, potentially leading to more nonsensical output. If P is too low (e.g., 0.1), it becomes very restrictive, similar to a very low K value.
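A minimal sketch of the nucleus-selection step, assuming an already-normalized probability list over a toy vocabulary (indices stand in for tokens):

```python
def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p,
    then renormalize over that set."""
    # Sort token indices by probability, highest first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

probs = [0.05, 0.1, 0.5, 0.25, 0.1]
nucleus = top_p_filter(probs, 0.8)  # top two tokens cover only 0.75, so a third is pulled in
```

Note how the size of the nucleus is not fixed in advance: a sharply peaked distribution might need only one token to reach P, while a flat one pulls in many.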


Top K Sampling

Top K sampling keeps only the K highest-probability tokens from the distribution and samples from among them.

When to use this method: you have a clear idea of the fixed number of candidates you want to consider. Top K is simpler and faster because it fixes the number of choices, making it ideal when you need a predictable range of highly probable outcomes. If K is too high (e.g., K=1000 for a small vocabulary), it effectively includes almost all tokens, diminishing its control. If K is too low (e.g., K=1), the model becomes deterministic, always picking the single most probable token.
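For contrast with Top P, here is the same kind of toy sketch for Top K, again over a hypothetical normalized distribution:

```python
def top_k_filter(probs, k):
    """Keep the k highest-probability tokens and renormalize over them."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = order[:k]  # exactly k candidates, no matter the distribution's shape
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

probs = [0.05, 0.1, 0.5, 0.25, 0.1]
shortlist = top_k_filter(probs, 2)  # always two tokens, here indices 2 and 3
```

The key design difference from Top P is visible in the slice `order[:k]`: the candidate set size is fixed regardless of how probability mass is spread, which is exactly what makes Top K predictable but less adaptive.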


Other Hyperparameters

Beyond the sampling controls above, most APIs expose additional knobs such as the maximum number of output tokens and repetition controls (e.g., frequency and presence penalties), which are worth exploring on their own.

Understanding and experimenting with these hyperparameters is crucial for effectively controlling an LLM's behavior and generating outputs tailored to specific needs.

#AI & ML #ai #blog #llm