A CNN-Transformer Knowledge Distillation for Remote Sensing Scene Classification
- Creators
- Nabi M.
- Maggiolo L.
- Moser G.
- Serpico S. B.
Description
Scene classification of remote sensing images is a challenging task due to the complexity and variety of natural scenes. In recent years, Convolutional Neural Networks (CNNs) have achieved impressive performance on many remote sensing scene classification benchmarks. However, CNNs often neglect long-range visual dependencies because of their local filter design, which leads to suboptimal performance in cluttered scenes such as urban areas. The recently proposed Transformer architecture addresses this issue by taking a broader neighborhood into account through its multi-head self-attention component. In this paper, we propose a novel method that borrows ideas from knowledge distillation and applies them to recent vision Transformers. Specifically, we propose a compound loss computed jointly on a Transformer-based student and a CNN teacher, and we use it for the task of single-label scene classification. Because the student can capture long-range visual dependencies and also inherits the inductive bias of the teacher, our proposed model improves classification accuracy on four well-known datasets compared to state-of-the-art approaches.
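The abstract does not spell out the form of the compound loss, so the following is only a minimal sketch of one common formulation, assuming classic soft-label distillation: a cross-entropy term on the student's prediction against the ground-truth label, plus a temperature-scaled KL divergence between the teacher's and the student's softened class distributions. The function names, the weight `alpha`, and the `temperature` value are illustrative assumptions, not taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Cross-entropy of the predicted distribution against one true class index."""
    return -math.log(softmax(logits)[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def compound_distillation_loss(student_logits, teacher_logits, label,
                               alpha=0.5, temperature=2.0):
    """Hypothetical compound loss: hard-label term plus soft teacher term.

    alpha and temperature are assumed hyperparameters, not values from the paper.
    """
    # Hard term: the student (e.g. a vision Transformer) against the ground truth.
    hard = cross_entropy(student_logits, label)
    # Soft term: the student against the softened output of the teacher (e.g. a CNN).
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # Scaling by temperature^2 keeps the soft term's gradient magnitude comparable.
    soft = kl_divergence(p_teacher, p_student) * temperature ** 2
    return (1.0 - alpha) * hard + alpha * soft
```

In this sketch, setting `alpha=0.0` reduces the loss to plain supervised training of the student, while larger values give the teacher's predictions more influence; the actual weighting used by the authors may differ.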
Additional details
- URL
- https://hdl.handle.net/11567/1102957
- URN
- urn:oai:iris.unige.it:11567/1102957
- Origin repository
- UNIGE