Abstract: Improving cross-modal image–text retrieval (CIR) depends fundamentally on finer-grained modeling of homogeneous features between modalities. However, in remote sensing (RS) ...