In Transformer models, which mechanism is primarily responsible for capturing long-range dependencies in sequences?
Convolutional layers
Recurrent layers
Attention mechanisms
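Of these, the attention mechanism is what lets every token attend directly to every other token in a single step, regardless of distance, which is why Transformers capture long-range dependencies so well. A minimal NumPy sketch of scaled dot-product self-attention (the function name and toy shapes here are illustrative, not from any particular library):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each query scores every key, so position i can draw
    information from any position j in one step (no distance limit)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq, seq) pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                              # weighted sum of value vectors

# Toy self-attention example: 4 tokens, model dimension 8
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (4, 8)
```

By contrast, a convolutional layer mixes only a fixed local window per layer, and a recurrent layer must propagate information step by step through the sequence.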
