SoftMax: Statistical Physics Meets Probabilistic AI
The SoftMax function is often used in the final layer of neural networks to convert K numerical values into K probabilities, one for each of the K classes the input might belong to. It's pretty cool that one can do this with any set of real numbers! It's a standard part of the Transformer architectures that underlie LLMs like ChatGPT, for example.
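Here's a minimal sketch of this in NumPy (the scores below are made up purely for illustration):

```python
import numpy as np

def softmax(x):
    """Map K real-valued scores to K probabilities that sum to 1."""
    z = x - np.max(x)          # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

# Any real numbers work, including negatives and zero
scores = np.array([2.0, -1.3, 0.0, 4.5])
probs = softmax(scores)
print(probs)        # ~[0.075, 0.003, 0.010, 0.912]
print(probs.sum())  # 1.0
```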
Incidentally, the probabilities produced by the softmax function are of equivalent* form to the probabilities implied by the well-known Boltzmann distribution of statistical physics. In the latter case, these probabilities give the chances of finding a system in states of varying energy when it is held at fixed temperature. *Note that technically, the Boltzmann distribution is a generalized form of the softmax function where different choices of physical temperature correspond to different choices of base in the softmax context (that is, exponentiating the energies with a base of e^(-Beta) rather than e itself, where Beta = 1/(Boltzmann's constant * Temperature)).
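As a quick numerical sanity check of that footnote (a sketch with hypothetical energy levels, not tied to any particular physical system): the Boltzmann probabilities come out identical to an ordinary base-e softmax applied to the scaled energies -Beta*E_i.

```python
import numpy as np

k_B = 1.380649e-23   # Boltzmann's constant (J/K)
T = 300.0            # temperature (K)
beta = 1.0 / (k_B * T)

# Hypothetical energy levels of a small system (in joules)
E = np.array([0.0, 1.0e-21, 2.0e-21, 5.0e-21])

# Boltzmann distribution: p_i = exp(-beta * E_i) / Z
weights = np.exp(-beta * E)
p_boltzmann = weights / weights.sum()

# The same thing, viewed as a base-e softmax of the "logits" x_i = -beta * E_i
def softmax(x):
    z = x - np.max(x)
    return np.exp(z) / np.exp(z).sum()

p_softmax = softmax(-beta * E)

print(np.allclose(p_boltzmann, p_softmax))  # True
```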
It's a pleasant coincidence to see the key statistical mechanics math reappear in data science and AI applications. It seems to me that in ML one basically wants a mapping of the form f(x_i) / sum_j f(x_j) which obeys certain niceness properties like f(x) >= 0. Happily, ratios of this form lend themselves naturally to a probability distribution interpretation, since the sum of the ratios is necessarily 1 by construction. You also want f(x) to be monotonically increasing across all of the real numbers so that larger input values are mapped to higher probabilities.
Makes me wonder-- are there any other obvious smooth functions f that satisfy:
- f(x) >= 0
- f(x) monotonically increasing
- domain all of real numbers
- smooth (continuously differentiable etc)
Apart from pure exponentials and their alternate non-e bases, a few candidates come to mind: the logistic sigmoid 1/(1+e^(-x)), a shifted tanh such as (1+tanh(x))/2 (tanh itself dips below zero, so it needs the shift to satisfy f(x) >= 0), and the softplus ln(1+e^x).
Interestingly, the above alternate forms of f that meet the stated criteria are all function compositions that use the exponential function in different ways. It's worth mentioning that the ReLU activation function would not meet the stated criteria, despite being a common ML activation function, as it is not differentiable at the origin (being a piecewise function)-- this causes issues for optimization procedures like backpropagation and gradient descent, which rely on taking derivatives through the neural network.
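Here is a toy comparison of that general recipe (my own sketch, with an arbitrary input vector): each candidate f is plugged into the ratio f(x_i) / sum_j f(x_j), and each yields a valid probability distribution, though only the pure exponential reproduces the standard softmax.

```python
import numpy as np

def normalize(f, x):
    """Generic softmax-like map: f(x_i) / sum_j f(x_j)."""
    w = f(x)
    return w / w.sum()

x = np.array([2.0, -1.0, 0.5, 3.0])

candidates = {
    "exp (softmax)":    np.exp,
    "logistic sigmoid": lambda t: 1.0 / (1.0 + np.exp(-t)),
    "shifted tanh":     lambda t: 0.5 * (1.0 + np.tanh(t)),
    "softplus":         lambda t: np.log1p(np.exp(t)),
}

for name, f in candidates.items():
    p = normalize(f, x)
    print(f"{name:18s} {np.round(p, 3)}  sum={p.sum():.3f}")
```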
DeepLearning.AI's Coursera class on Generative AI and LLMs has a great diagram expanding on the usage of a shrewdly named temperature parameter in softmax. It looks like softmax functions of the form e^(Beta*x_i) / sum_j e^(Beta*x_j) are being used, where the parameter Beta plays the role of the inverse temperature 1/(kT), in direct analogy to the Boltzmann distribution of statistical physics!
In physics, at low temperature the distribution concentrates on the lowest-energy state, which becomes by far the most likely, whereas at high temperature the distribution becomes more uniform as all microstates become comparably likely. In LLMs, it's the same thing! At low temperature the distribution is strongly peaked on the most likely word, while at high temperature it flattens out toward a more uniform distribution over words.
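As a concrete sketch (with invented next-token logits), here is a temperature-scaled softmax, p_i = e^(x_i/T) / sum_j e^(x_j/T) (i.e., Beta = 1/T): a small T makes the distribution sharply peaked on the top token, while a large T pushes it toward uniform.

```python
import numpy as np

def softmax_with_temperature(logits, T):
    """Temperature-scaled softmax: divide the logits by T before exponentiating."""
    z = logits / T
    z = z - np.max(z)                 # numerical stability
    return np.exp(z) / np.exp(z).sum()

# Hypothetical next-token logits for four candidate words
logits = np.array([4.0, 2.5, 1.0, 0.0])

for T in (0.1, 1.0, 10.0):
    p = softmax_with_temperature(logits, T)
    print(f"T = {T:5.1f} -> {np.round(p, 3)}")

# T = 0.1  -> sharply peaked on the highest-logit token
# T = 1.0  -> the ordinary softmax distribution
# T = 10.0 -> nearly uniform across all tokens
```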
References:
Boltzmann distribution - Wikipedia
Activation function - Wikipedia
DeepLearning.AI's Coursera class on Generative AI and LLMs Lecture Notes Week 1 | Coursera