Online Estimation and Improvement of Cache Soft-Error Vulnerability
Subject area: Specialized
Mohammad Moeini Jahromi 1, Mohammad Hasan Ahmadi Livani 2, Mostafa Ersali Salehi Nasab 3 *
1 - School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran
2 - School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran
3 - School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran
Keywords: reliability, soft error, error masking, cache memory, cache vulnerability, reliability-performance trade-off, cache size, run-time vulnerability estimation
Abstract:
Because of their high transistor density, memories are highly susceptible to soft errors. The processor cache holds the data a program is actively working on and is accessed very frequently, so it strongly affects system reliability; this matters even more in embedded systems and safety-critical applications. One of the most important parameters affecting cache reliability is its size. A smaller cache is more reliable because it occupies less area and data reside in it for a shorter time, but shrinking the cache lengthens program execution time, which in turn raises the probability that a soft error occurs during execution. Moreover, cache reliability is not uniform over the course of a program's execution, so a fixed cache size cannot optimize reliability throughout the run. The central problem in reducing cache vulnerability is therefore to determine the cache size and the moment at which to change it, taking the overhead of each change into account. To this end, this paper defines a model that estimates cache vulnerability from the data held in the cache and the type of accesses made to them. Based on this model, an algorithm is implemented that estimates cache vulnerability online, at run time. To model time, the method uses counters that track access times within decision intervals. The method is optimized by estimating at block granularity rather than per memory word and by choosing suitable sizes for the counters and the decision intervals. The estimated vulnerability trend matches the reference model with 95.22% accuracy. In addition, using the run-time vulnerability trend estimate together with the effective cache size of each program, an algorithm is proposed that reconfigures the cache to reduce its vulnerability. Implementation results show that, with only 5.4% area overhead and 6% time overhead, a reconfigurable cache equipped with this vulnerability-management algorithm achieves lower run-time vulnerability than a fixed-size cache and a 36% improvement in overall vulnerability.
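The counter-based, block-granularity estimation summarized in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the class and method names, the saturating-counter width, and the rule that only intervals ending in a read or in a dirty eviction count as vulnerable are all assumptions, borrowed from generic lifetime-style vulnerability analysis. The sketch also assumes, optimistically, that a write overwrites the whole block.

```python
from dataclasses import dataclass


@dataclass
class BlockState:
    valid: bool = False
    dirty: bool = False
    counter: int = 0  # saturating count of ticks since the last access


class VulnerabilityEstimator:
    """Hypothetical block-level estimator; all names here are illustrative."""

    KINDS = ("fill", "read", "write", "evict")

    def __init__(self, num_blocks: int, counter_bits: int = 8):
        self.max_count = (1 << counter_bits) - 1  # counters saturate here
        self.blocks = [BlockState() for _ in range(num_blocks)]
        self.vulnerable_ticks = 0
        self.elapsed_ticks = 0

    def tick(self, n: int = 1) -> None:
        """Advance time by n ticks; valid blocks age, up to saturation."""
        self.elapsed_ticks += n
        for b in self.blocks:
            if b.valid:
                b.counter = min(self.max_count, b.counter + n)

    def access(self, index: int, kind: str) -> None:
        """Record an access and classify the interval that it closes."""
        if kind not in self.KINDS:
            raise ValueError(f"unknown access kind: {kind}")
        b = self.blocks[index]
        if kind == "fill":
            b.valid, b.dirty = True, False
        elif not b.valid:
            return  # nothing resident, so nothing to account for
        elif kind == "read":
            # Assumption: an error in the closing interval would be consumed.
            self.vulnerable_ticks += b.counter
        elif kind == "write":
            # Assumption: the write overwrites the block, masking any error.
            b.dirty = True
        else:  # evict
            if b.dirty:
                # Dirty data is written back, so an error would propagate.
                self.vulnerable_ticks += b.counter
            b.valid = False
        b.counter = 0

    def end_interval(self) -> float:
        """Return the vulnerable fraction for this interval, then reset."""
        total = len(self.blocks) * self.elapsed_ticks
        ratio = self.vulnerable_ticks / total if total else 0.0
        self.vulnerable_ticks = 0
        self.elapsed_ticks = 0
        return ratio
```

A controller could call `end_interval()` once per decision interval and compare successive values to follow the vulnerability trend. Narrower counters cost less area but saturate sooner and therefore underestimate long idle intervals. This is one way to picture the trade-off between counter size and interval length that the abstract mentions.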