multi_head_attention#
- ivy.multi_head_attention(query, /, *, key=None, value=None, batch_first=True, num_heads=8, scale=None, attention_mask=None, in_proj_weights=None, q_proj_weights=None, k_proj_weights=None, v_proj_weights=None, out_proj_weights=None, in_proj_bias=None, out_proj_bias=None, is_causal=False, key_padding_mask=None, bias_k=None, bias_v=None, static_k=None, static_v=None, add_zero_attn=False, return_attention_weights=False, average_attention_weights=True, dropout=0.0, training=False, out=None)[source]#
Apply multi-head attention to the inputs. This is an implementation of multi-headed attention as described in the paper “Attention Is All You Need” (Vaswani et al., 2017). If query, key and value are the same, then this is self-attention. Each timestep in query attends to the corresponding sequence in key, and returns a fixed-width vector. This layer first projects query, key and value. These are (effectively) a list of tensors of length num_heads, where the corresponding shapes are (batch_size, <query dimensions>, key_dim), (batch_size, <key/value dimensions>, key_dim) and (batch_size, <key/value dimensions>, value_dim). The query and key tensors are then dot-producted and scaled, and the results are softmaxed to obtain attention probabilities. The value tensors are interpolated by these probabilities and concatenated back into a single tensor. Finally, the resulting tensor, whose last dimension is value_dim, can be passed through a final linear projection and returned.
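For orientation, the following is a minimal NumPy sketch of the computation outlined above, with the input and output projections omitted. It is purely illustrative and does not reflect Ivy's actual implementation.

```python
import numpy as np

def sketch_multi_head_attention(q, k, v, num_heads):
    # Shapes follow the docstring: q is (N, L, E), k and v are (N, S, E),
    # and E is split across `num_heads` heads of size E // num_heads.
    N, L, E = q.shape
    head_dim = E // num_heads

    def split_heads(x):
        # (N, T, E) -> (N, num_heads, T, head_dim)
        return x.reshape(N, -1, num_heads, head_dim).transpose(0, 2, 1, 3)

    qh, kh, vh = split_heads(q), split_heads(k), split_heads(v)

    # Scaled dot-product similarity: (N, num_heads, L, S)
    scores = qh @ kh.transpose(0, 1, 3, 2) / np.sqrt(head_dim)

    # Softmax over the key dimension to obtain attention probabilities.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)

    # Interpolate the values by these probabilities and merge the heads: (N, L, E)
    out = (weights @ vh).transpose(0, 2, 1, 3).reshape(N, L, E)
    return out, weights
```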
- Parameters:
  - query (Union[Array, NativeArray]) – The query embeddings. Shape: (L, Q) or (N, L, Q), where L is the number of queries, N is the batch size, Q is the query embedding dimension.
  - key (Optional[Union[Array, NativeArray]], default: None) – The key embeddings. Shape: (S, K) or (N, S, K), where S is the number of keys, N is the batch size, K is the key embedding dimension.
  - value (Optional[Union[Array, NativeArray]], default: None) – The value embeddings. Shape: (S, V) or (N, S, V), where S is the number of keys, N is the batch size, V is the value embedding dimension.
  - batch_first (bool, default: True) – If False, query, key and value will have shapes (L, N, Q), (S, N, K) and (S, N, V) respectively (if batched).
  - num_heads (int, default: 8) – The number of attention heads to use.
  - scale (Optional[float], default: None) – The value by which to scale the query-key similarity measure before softmax.
  - attention_mask (Optional[Union[Array, NativeArray]], default: None) – The mask to apply to the query-key values. Shape: (L, S) or (N*num_heads, L, S).
  - in_proj_weights (Optional[Union[Array, NativeArray]], default: None) – The weights used to project query, key and value. Shape: (3*E, E’), where E is the new embedding dimension and E’ is the input embedding dimension, i.e. E’ = Q = K = V.
  - q_proj_weights (Optional[Union[Array, NativeArray]], default: None) – The weights used to project query if in_proj_weights is None. Shape: (E, Q).
  - k_proj_weights (Optional[Union[Array, NativeArray]], default: None) – The weights used to project key if in_proj_weights is None. Shape: (E, K).
  - v_proj_weights (Optional[Union[Array, NativeArray]], default: None) – The weights used to project value if in_proj_weights is None. Shape: (E, V).
  - out_proj_weights (Optional[Union[Array, NativeArray]], default: None) – The weights used to project the attention output. Shape: (O, E), where O is the output embedding dimension.
  - in_proj_bias (Optional[Union[Array, NativeArray]], default: None) – The bias used when projecting query, key and value. Shape: (3*E,).
  - out_proj_bias (Optional[Union[Array, NativeArray]], default: None) – The bias used when projecting the output. Shape: (O,).
  - is_causal (bool, default: False) – If True, use a causal attention mask and ignore the provided attention_mask.
  - key_padding_mask (Optional[Union[Array, NativeArray]], default: None) – A binary mask to apply to the key sequence. Shape: (S,) or (N, S).
  - bias_k (Optional[Union[Array, NativeArray]], default: None) – An additional bias added to the key sequence. Shape: (E,).
  - bias_v (Optional[Union[Array, NativeArray]], default: None) – An additional bias added to the value sequence. Shape: (E,).
  - static_k (Optional[Union[Array, NativeArray]], default: None) – A static key to be used in the attention operators. Shape: (N*num_heads, S, E//num_heads).
  - static_v (Optional[Union[Array, NativeArray]], default: None) – A static value to be used in the attention operators. Shape: (N*num_heads, S, E//num_heads).
  - add_zero_attn (bool, default: False) – A boolean flag indicating whether to add a batch of zeros to key and value.
  - return_attention_weights (bool, default: False) – If True, return the attention weights alongside the attention output.
  - average_attention_weights (bool, default: True) – If True, the returned attention weights will be averaged across heads. Otherwise, the attention weights will be provided separately per head. Note that this flag only has an effect when return_attention_weights=True.
  - dropout (float, default: 0.0) – Specifies the dropout probability. Dropout is applied to the attention weights.
  - training (bool, default: False) – If True, dropout is applied; otherwise dropout is not activated.
  - out (Optional[Array], default: None) – Optional output array, for writing the result to. It must have a shape that the inputs broadcast to.
- Return type:
Union[Array, NativeArray]
- Returns:
ret – The output following the application of multi-head attention. Either output or (output, attention_weights). output will have shape (L, E) if the inputs were unbatched or (N, L, E) otherwise, and attention_weights will have shape (L, S) or (N, L, S) respectively. If batch_first is False and the inputs were batched, the output will have shape (L, N, E).
Both the description and the type hints above assume an array input for simplicity, but this function is nestable, and therefore also accepts ivy.Container instances in place of any of the arguments.
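A minimal usage sketch of the functional form is shown below. It assumes a NumPy backend is installed and that, when no projection weights are supplied, the inputs are used as given, so the embedding dimension of the output matches that of the inputs; the shapes noted in the comments follow the Returns description above.

```python
import ivy

ivy.set_backend("numpy")  # assumption: any installed backend would work here

# Batched self-attention: N=2 sequences of L=5 tokens with embedding dim E=16.
x = ivy.random_normal(shape=(2, 5, 16))

# No projection weights supplied, so the inputs are attended over as given.
out = ivy.multi_head_attention(x, key=x, value=x, num_heads=4)
print(out.shape)  # expected: (2, 5, 16)

# Also request the (head-averaged) attention weights.
out, weights = ivy.multi_head_attention(
    x, key=x, value=x, num_heads=4, return_attention_weights=True
)
print(weights.shape)  # expected: (N, L, S) = (2, 5, 5)
```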
- Array.multi_head_attention(self, /, *, key=None, value=None, num_heads=8, scale=None, attention_mask=None, in_proj_weights=None, q_proj_weights=None, k_proj_weights=None, v_proj_weights=None, out_proj_weights=None, in_proj_bias=None, out_proj_bias=None, is_causal=False, key_padding_mask=None, bias_k=None, bias_v=None, static_k=None, static_v=None, add_zero_attn=False, return_attention_weights=False, average_attention_weights=True, dropout=0.0, training=False, out=None)[source]#
- Return type:
Array
- Container.multi_head_attention(self, /, *, key=None, value=None, num_heads=8, scale=None, attention_mask=None, in_proj_weights=None, q_proj_weights=None, k_proj_weights=None, v_proj_weights=None, out_proj_weights=None, in_proj_bias=None, out_proj_bias=None, is_causal=False, key_padding_mask=None, bias_k=None, bias_v=None, static_k=None, static_v=None, add_zero_attn=False, return_attention_weights=False, average_attention_weights=True, dropout=0.0, training=False, key_chains=None, to_apply=True, prune_unapplied=False, map_sequences=False, out=None)[source]#
- Return type:
Container
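A brief sketch of the instance-method forms, under the same assumptions as the functional example above (NumPy backend installed, no projection weights supplied); self is used as the query.

```python
import ivy

ivy.set_backend("numpy")

# Instance-method form on ivy.Array.
x = ivy.random_normal(shape=(2, 5, 16))
out = x.multi_head_attention(key=x, value=x, num_heads=4)

# Container form: the same call is mapped over each leaf array.
c = ivy.Container(a=ivy.random_normal(shape=(2, 5, 16)),
                  b=ivy.random_normal(shape=(2, 7, 16)))
out_c = c.multi_head_attention(key=c, value=c, num_heads=4)
```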