A 28nm 192.3TFLOPS/W Accurate/Approximate Dual-Mode-Transpose Digital 6T-SRAM CIM Macro for Floating-Point Edge Training and Inference

Yiyang Yuan; Bingxin Zhang; Yiming Yang; Yishan Luo; Qirui Chen; Shidong Lv; Hao Wu; Cailian Ma; Ming Li; Jinshan Yue; Xinghua Wang; Guozhong Xing; Pui In Mak; Xiaoran Li; Feng Zhang

doi:10.1109/ISSCC49661.2025.10904659

* Corresponding authors: Xiaoran Li, Feng Zhang

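School of Integrated Circuits and Electronics

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

4 Citations (Scopus)

Abstract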

SRAM CIM macros have been developed to enhance energy efficiency (EF) in edge-AI applications. However, most research has focused on inference [1-6], with relatively little exploration of the training process. Fine-tuning during training is critical for improving the accuracy of neural network models, which directly impacts the user experience. Unlike remote- or cloud-based training, on-device training offers advantages such as real-time response, power saving, and user-privacy protection. The training process primarily involves two phases: feed-forward (FF) and backpropagation (BP). The FF phase resembles the inference process, while the BP phase requires the multiplication of the error gradient with the transposed weight matrix. Although several studies have introduced transpose CIM (T-CIM) to support both FF and BP [7-9], several challenges remain: (1) Previous work uses separate circuits for FF and BP, thereby diminishing area and energy efficiency due to the lack of multiply-accumulate (MAC) circuit reuse. (2) Prior work is limited to integer (INT) formats; research indicates that INT representation can significantly reduce accuracy during training due to its lower resolution [10]. Pre-aligned FP-CIM schemes have been developed [1,11,12], but these still suffer from accuracy losses due to mantissa truncation during the pre-alignment process. (3) Reliance on analog-CIM schemes leads to accuracy losses due to process, voltage, and temperature (PVT) variations, which further degrade training accuracy. Although digital CIMs (DCIM) can mitigate these issues, optimizing the tradeoffs between SRAM arrays and MAC circuits to simultaneously achieve high memory density (MD) and area efficiency (AF) remains challenging [1]. Increasing the number of SRAM sub-bank rows and employing bit-parallel techniques can improve MD and AF [2], while approximate computing can enhance access speed and EF [3,4], as illustrated in Fig. 14.5.1.
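Two points in the abstract can be illustrated in software: BP multiplies the error gradient by the transpose of the same weight matrix used in FF, and pre-aligning floating-point values to a shared exponent truncates the mantissas of the smaller values. The following is a minimal Python sketch under stated assumptions, not the macro's circuit or the authors' method. The function names (`matvec`, `matvec_transposed`, `prealign`), the example matrices, and the `mant_bits` widths are all hypothetical and chosen only for illustration.

```python
import math

def matvec(W, x):
    """Feed-forward (FF): y = W @ x, with W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matvec_transposed(W, g):
    """Backpropagation (BP): W^T @ g, reading the same stored W
    column-wise instead of keeping a second, transposed copy."""
    n_rows, n_cols = len(W), len(W[0])
    return [sum(W[r][c] * g[r] for r in range(n_rows)) for c in range(n_cols)]

def prealign(values, mant_bits):
    """Simplified pre-alignment model (an assumption, not the paper's
    scheme): align all values to the largest exponent in the group and
    keep only `mant_bits` mantissa bits, truncating toward zero."""
    nonzero = [v for v in values if v != 0.0]
    if not nonzero:
        return list(values)
    # math.frexp(v) returns (m, e) with v == m * 2**e and 0.5 <= |m| < 1.
    e_max = max(math.frexp(v)[1] for v in nonzero)
    scale = 2.0 ** (mant_bits - e_max)
    return [math.trunc(v * scale) / scale for v in values]

if __name__ == "__main__":
    W = [[1.0, 2.0, 3.0],
         [4.0, 5.0, 6.0]]
    print(matvec(W, [1.0, 0.0, -1.0]))      # FF output
    print(matvec_transposed(W, [1.0, 2.0])) # BP input gradient
    # A small value sharing an exponent with a large one loses precision.
    print(prealign([8.0, 0.3], mant_bits=4))
    print(prealign([8.0, 0.3], mant_bits=8))
```

In this model, the value 0.3 grouped with 8.0 is truncated to 0.0 with 4 mantissa bits and to 0.25 with 8 bits, which shows why truncation during pre-alignment can cost accuracy when exponents within a group differ widely.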

Original language	English
Title of host publication	2025 IEEE International Solid-State Circuits Conference, ISSCC 2025
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	258-260
Number of pages	3
ISBN (Electronic)	9798331541019
DOI	http://doi.org/10.1109/ISSCC49661.2025.10904659
Publication status	Published - 2025
Event	72nd IEEE International Solid-State Circuits Conference, ISSCC 2025 - San Francisco, United States. Duration: 16 Feb 2025 → 20 Feb 2025

Publication series

Name	Digest of Technical Papers - IEEE International Solid-State Circuits Conference
ISSN (Print)	0193-6530

Conference

Conference	72nd IEEE International Solid-State Circuits Conference, ISSCC 2025
Country/Territory	United States
City	San Francisco
Period	16/02/25 → 20/02/25

Cite this

Yuan, Y., Zhang, B., Yang, Y., Luo, Y., Chen, Q., Lv, S., Wu, H., Ma, C., Li, M., Yue, J., Wang, X., Xing, G., Mak, P. I., Li, X., & Zhang, F. (2025). A 28nm 192.3TFLOPS/W Accurate/Approximate Dual-Mode-Transpose Digital 6T-SRAM CIM Macro for Floating-Point Edge Training and Inference. In 2025 IEEE International Solid-State Circuits Conference, ISSCC 2025 (pp. 258-260). (Digest of Technical Papers - IEEE International Solid-State Circuits Conference). Institute of Electrical and Electronics Engineers Inc. http://doi.org/10.1109/ISSCC49661.2025.10904659

@inproceedings{ed18628d6a3442e8b823677e040480d8,

title = "A 28nm 192.3TFLOPS/W Accurate/Approximate Dual-Mode-Transpose Digital 6T-SRAM CIM Macro for Floating-Point Edge Training and Inference",

author = "Yiyang Yuan and Bingxin Zhang and Yiming Yang and Yishan Luo and Qirui Chen and Shidong Lv and Hao Wu and Cailian Ma and Ming Li and Jinshan Yue and Xinghua Wang and Guozhong Xing and Mak, {Pui In} and Xiaoran Li and Feng Zhang",

note = "Publisher Copyright: {\textcopyright} 2025 IEEE.; 72nd IEEE International Solid-State Circuits Conference, ISSCC 2025 ; Conference date: 16-02-2025 Through 20-02-2025",

year = "2025",

doi = "10.1109/ISSCC49661.2025.10904659",

language = "English",

series = "Digest of Technical Papers - IEEE International Solid-State Circuits Conference",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

pages = "258--260",

booktitle = "2025 IEEE International Solid-State Circuits Conference, ISSCC 2025",

address = "United States",

}


TY - GEN

T1 - A 28nm 192.3TFLOPS/W Accurate/Approximate Dual-Mode-Transpose Digital 6T-SRAM CIM Macro for Floating-Point Edge Training and Inference

AU - Yuan, Yiyang

AU - Zhang, Bingxin

AU - Yang, Yiming

AU - Luo, Yishan

AU - Chen, Qirui

AU - Lv, Shidong

AU - Wu, Hao

AU - Ma, Cailian

AU - Li, Ming

AU - Yue, Jinshan

AU - Wang, Xinghua

AU - Xing, Guozhong

AU - Mak, Pui In

AU - Li, Xiaoran

AU - Zhang, Feng

PY - 2025

Y1 - 2025

UR - http://www.scopus.com/pages/publications/105000823391

U2 - 10.1109/ISSCC49661.2025.10904659

DO - 10.1109/ISSCC49661.2025.10904659

M3 - Conference contribution

AN - SCOPUS:105000823391

T3 - Digest of Technical Papers - IEEE International Solid-State Circuits Conference

SP - 258

EP - 260

BT - 2025 IEEE International Solid-State Circuits Conference, ISSCC 2025

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 72nd IEEE International Solid-State Circuits Conference, ISSCC 2025

Y2 - 16 February 2025 through 20 February 2025

ER -
