AI IQ: How Smart Are You About AI?

What does the 'attention mechanism' in transformers fundamentally compute?

The gradient of the loss function with respect to each individual token embedding in the sequence
A weighted sum of values, where weights come from query-key similarity
The probability distribution over the entire vocabulary for predicting the next token in the sequence
A depthwise separable convolution applied across the input token embeddings