Online Estimation and Improvement of Cache Soft Error Vulnerability
Subject Areas : Special
Mohammad Moeini Jahromi 1 , Mohammad Hasan Ahmadilivani 2 , Mostafa Salehi 3 *
1 - School of Electrical and Computer Engineering, Engineering Faculty, University of Tehran, Tehran, Iran
2 - School of Electrical and Computer Engineering, Engineering Faculty, University of Tehran, Tehran, Iran
3 - School of Electrical and Computer Engineering, Engineering Faculty, University of Tehran, Tehran, Iran
Keywords: Reliability, Soft Error, Error Masking, Cache, Cache Vulnerability, Reliability-Performance Trade-off, Cache Size, Online Cache Vulnerability Estimation
Abstract :
Because of their high transistor density, memories are highly susceptible to soft errors. The processor cache holds the data in use during execution and is accessed frequently, so it has a large impact on system reliability. This is even more important in embedded systems and safety-critical applications. One of the most significant factors affecting cache reliability is cache size. A smaller cache is more reliable because it has a smaller area and retains data for a shorter time, but reducing the cache size lengthens program execution time, which in turn increases the probability of a soft error. Furthermore, cache reliability is not uniform during program execution, so a fixed cache size cannot optimize reliability over the whole run. The main issue in improving cache vulnerability is therefore to determine the optimum cache size and when to change it, taking the overhead of each change into account. Accordingly, this paper defines a model for estimating cache vulnerability that determines vulnerability from the data held in the cache and the type of each access to it. Based on this model, an algorithm has been implemented that estimates cache vulnerability online during execution. Time is modeled with counters that record access times within each decision-making interval. The method has been optimized by estimating at the granularity of blocks instead of memory words and by tuning the sizes of the counters and the decision intervals. The estimated vulnerability trend matches the reference model with an accuracy of 95.22%. In addition, an algorithm is proposed that uses the vulnerability trend estimated during execution, together with the effective cache size of each program, to reconfigure the cache and improve its vulnerability. Implementation showed that, with only 5.4% area overhead and 6% time overhead, a reconfigurable memory equipped with the vulnerability management algorithm has lower runtime vulnerability than a fixed-size cache and achieves an overall vulnerability improvement of 36%.
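The counter-based, block-level estimation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class and method names (`VulnerabilityEstimator`, `advance`, `access`, `end_interval`), the 8-bit saturating counter, and the rule that only the time between an access and a later read of the same block counts as vulnerable are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class BlockState:
    last_access_tick: int = 0  # interval counter value at the most recent access
    valid: bool = False        # True once the block holds data


class VulnerabilityEstimator:
    """Hypothetical per-interval vulnerability estimate at block granularity."""

    def __init__(self, num_blocks: int, counter_bits: int = 8):
        # Assumed: a small saturating counter models time inside one interval.
        self.max_tick = (1 << counter_bits) - 1
        self.blocks = [BlockState() for _ in range(num_blocks)]
        self.tick = 0
        self.vulnerable_ticks = 0

    def advance(self, ticks: int = 1) -> None:
        # Saturate instead of wrapping so late accesses stay ordered.
        self.tick = min(self.tick + ticks, self.max_tick)

    def access(self, block: int, is_read: bool) -> None:
        b = self.blocks[block]
        if is_read and b.valid:
            # Assumed rule: data that is later read was exposed while it waited.
            self.vulnerable_ticks += self.tick - b.last_access_tick
        # A write overwrites the data, so the time before it is treated as masked.
        b.last_access_tick = self.tick
        b.valid = True

    def end_interval(self) -> float:
        # Normalize by (number of blocks x interval length in ticks).
        total = len(self.blocks) * max(self.tick, 1)
        estimate = self.vulnerable_ticks / total
        # Reset the counters for the next decision interval; data stays valid.
        self.tick = 0
        self.vulnerable_ticks = 0
        for b in self.blocks:
            b.last_access_tick = 0
        return estimate


est = VulnerabilityEstimator(num_blocks=4)
est.access(0, is_read=False)   # fill block 0 at tick 0
est.advance(10)
est.access(0, is_read=True)    # 10 vulnerable ticks for block 0
est.advance(5)
est.access(1, is_read=False)   # fill block 1 at tick 15
est.advance(5)
est.access(1, is_read=True)    # 5 vulnerable ticks for block 1
est.access(0, is_read=False)   # overwrite: ticks 10..20 of block 0 are masked
interval_estimate = est.end_interval()
```

In a reconfigurable cache, a controller could compare such per-interval estimates against a threshold to decide whether a size change is worth its overhead; the abstract does not give that decision rule, so it is left out here.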