x = “I hate apples, I only drink _” keys = key_weights \* x query = query_weight...

x = “I hate apples, I only drink _”

keys = key_weights * x

query = query_weights * x

values = value_weights * what_to_look_at

For self attention, what_to_look_at = x

For regular attention, where_to_look_at could be a database, memory or anything else.

So in this example if we’re trying to predict the second “apple” the first “apples” is very helpful. If we’re predicting “juice” then we’d use one head of self-attention to look at the first “apples” and a second head to also look at the second “apple”

That’s my understanding at least