Enhanced Swin Transformer and Edge Spatial Attention for Remote Sensing Image Semantic Segmentation

Fuxiang Liu, Zhiqiang Hu, Lei Li*, Hanlu Li, Xinxin Liu

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Combining convolutional neural networks (CNNs) and Transformers is a crucial direction in remote sensing image semantic segmentation. However, because the two differ in their spatial information focus and feature extraction methods, existing feature transfer and fusion strategies do not effectively integrate the advantages of both. To address these issues, we propose a CNN-Transformer hybrid network for precise remote sensing image semantic segmentation. We introduce a novel Swin Transformer block that optimizes feature extraction and enables the model to handle remote sensing images of arbitrary sizes. Additionally, we design an Edge Spatial Attention module that focuses attention on local edge structures, effectively integrating global features with local details and facilitating efficient information flow between the Transformer encoder and the CNN decoder. Finally, a multi-scale convolutional decoder fully leverages both the global information from the Transformer and the local features from the CNN, leading to accurate segmentation results. Our network achieves state-of-the-art performance on the Vaihingen and Potsdam datasets, reaching mIoU and F1 scores of 67.37% and 79.82%, and 72.39% and 83.68%, respectively.
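The core idea behind edge-guided spatial attention — re-weighting feature maps by a spatial map derived from local edge structure — can be illustrated with a minimal NumPy sketch. This is a toy for intuition only, not the authors' implementation: the Sobel edge extraction, channel-mean summary, and sigmoid gating are all assumptions chosen for demonstration.

```python
import numpy as np


def sobel_edge_map(x):
    """Gradient magnitude of a 2-D map via 3x3 Sobel filters."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    pad = np.pad(x, 1, mode="edge")  # replicate borders to keep shape
    h, w = x.shape
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            win = pad[i:i + 3, j:j + 3]
            gx[i, j] = (win * kx).sum()
            gy[i, j] = (win * ky).sum()
    return np.hypot(gx, gy)


def edge_spatial_attention(feat):
    """Gate a (C, H, W) feature tensor by an edge-derived spatial map.

    Strong edges in the channel-averaged summary receive weights near 1,
    flat regions near 0.5, so edge detail is emphasised while smooth
    context is attenuated rather than discarded.
    """
    summary = feat.mean(axis=0)            # (H, W) channel average
    edges = sobel_edge_map(summary)        # (H, W) edge strength
    attn = 1.0 / (1.0 + np.exp(-edges))    # sigmoid -> weights in (0.5, 1)
    return feat * attn[None, :, :]         # broadcast over channels
```

In an actual CNN-Transformer hybrid such a map would be learned and applied where encoder features are passed to the decoder; here the fixed Sobel operator simply makes the gating behaviour easy to inspect.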

Original language: English
Pages (from-to): 1296-1300
Number of pages: 5
Journal: IEEE Signal Processing Letters
Volume: 32
DOIs
Publication status: Published - 2025

Keywords

  • Edge detection
  • Swin Transformer
  • remote sensing image
  • semantic segmentation

