To address these limitations, a variety of approaches have been proposed to enhance cross-modal interaction and reinforce temporal coherence [7]. Some methods employ spatio-temporal attention mechanisms ...