Data structure: a (very) long list of numbers defining a line

  • One can speculate about how a neural network stores information/meaning.
  • One way to think of it is that a piece of information is stored as a direction and magnitude in a very high-dimensional co-ordinate system, i.e. as a “vector”.
  • Using a 3-D graph as a visual example, perhaps an arrow in a certain direction measures how “sports-related” something is.
  • From an origin, a point on a 1-D graph (a line) is defined by 1 number, on a 2-D graph by 2 numbers, and on a 3-D graph by 3 numbers.
  • The embedding in the GPT-3 model has 12,288 dimensions. A space of such high dimension is practically impossible to visualize accurately; low-dimensional intuitions break down (e.g. the relationships between length, surface area and volume become very surprising). See for example the interactive Tesseract explanation.
  • But the essential idea remains that the magnitude and direction of the vector likely encode some helpful information (see the sketch after this list).
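
A minimal sketch of that idea, under stated assumptions: the 12,288 figure matches GPT-3's embedding width, but the “sports direction” and the word vector here are random placeholders, not real learned embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 12_288  # GPT-3's embedding width: each token is a list of 12,288 numbers

# Placeholder vectors -- a real model learns these during training.
sports_direction = rng.normal(size=D)
football = rng.normal(size=D) + 0.5 * sports_direction  # nudged toward "sports"

def cosine_similarity(u, v):
    """How closely two directions align, ignoring length (1.0 = same direction)."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Projecting onto the "sports" direction yields a single sports-relatedness score.
print(cosine_similarity(football, sports_direction))  # ~0.45, clearly positive
```

Dividing by the norms makes the score depend only on direction, which is the point of the bullet above: where a vector points, not just how long it is, carries meaning.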

Training: backpropagation and gradient descent

  • A large labelled dataset and two mathematical tools, backpropagation (to compute how the error changes with each parameter) and gradient descent (to nudge each parameter in the direction that reduces the error), are required to train the weights and biases of each layer of the neural network (a toy version appears after this list).
  • This only became feasible in the late 20th century, once massive computing power was readily available.
  • It would be interesting to estimate how many hours it would take if human “computers” had been tasked with the calculations, e.g. in 1900 (a back-of-envelope estimate follows below); in practical terms it could not have been done.
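
To make the heading concrete, here is a toy sketch of gradient descent on the smallest possible “network”: one weight and one bias fitted to fabricated data. Backpropagation is what computes these gradients layer by layer in a deep network; with a single layer they can be written by hand:

```python
import numpy as np

# Fabricated training data: y = 3x + 1, plus a little noise.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=100)
y = 3 * x + 1 + 0.1 * rng.normal(size=100)

w, b = 0.0, 0.0        # start from an arbitrary weight and bias
learning_rate = 0.1

for step in range(500):
    y_hat = w * x + b                  # forward pass: the model's prediction
    error = y_hat - y
    # Gradients of the mean squared error with respect to w and b.
    grad_w = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    w -= learning_rate * grad_w        # gradient descent: step "downhill"
    b -= learning_rate * grad_b

print(w, b)  # converges to roughly 3 and 1
```

Real training applies exactly this update rule, just to billions of parameters at once, which is where the massive computing power comes in.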
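
And a rough version of that 1900 estimate. Every figure here is an assumption: the GPT-3 paper reports training compute of roughly 3.14 × 10^23 floating-point operations, and the human speed is a guess of one hand calculation every five seconds:

```python
total_ops = 3.14e23           # approximate GPT-3 training compute (assumption)
ops_per_human_second = 0.2    # one hand calculation every 5 seconds (assumption)

seconds = total_ops / ops_per_human_second
hours = seconds / 3600
years = hours / (24 * 365)
print(f"{hours:.1e} hours, i.e. about {years:.1e} years for one person")
# Roughly 4e20 hours, millions of times the age of the universe --
# hence "in practical terms this could not be done".
```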