Chen L.,Chinese Academy of Sciences | Wang Y.,Chinese Academy of Sciences | Wang H.,Chinese Academy of Sciences | Wang H.,CAS Institute of Computing Technology | And 8 more authors.
Future Generation Computer Systems | Year: 2016

As the number of cores in chip multiprocessors (CMPs) increases rapidly, networks-on-chip (NoCs) have come to play a major role in ensuring performance and power scalability. In this paper, we propose multiple-combinational-channel (MCC), a load-balancing and deadlock-free interconnect network for cache-coherent non-uniform memory access (CC-NUMA). Because messages transmitted over the NoC have different widths and injection rates, we combine low-usage channels and make high-usage channels independent and sufficiently wide, which balances load better and reduces power dissipation. Furthermore, based on an in-depth analysis of network traffic, we summarize four traffic patterns and establish several rules to avoid protocol-level deadlocks. We implement MCC on a 16-core CMP and evaluate workload balance, area, power and performance using universal workloads. The experimental results show that MCC reduces power by nearly 21% compared with multiple-physical-channel at similar throughput. Moreover, MCC improves performance by 10% at similar area and power, compared with a packet-switching architecture with virtual channels. © 2015 Elsevier B.V.

Fu J.,University of Chinese Academy of Sciences | Fu J.,CAS Institute of Computing Technology | Jin G.,Loongson Technology Corporation Ltd | Zhang L.,CAS Institute of Computing Technology | Wang J.,CAS Institute of Computing Technology
2016 ACM International Conference on Computing Frontiers - Proceedings | Year: 2016

Dynamic compilation has a great impact on the performance of virtual machines. In this paper, we study the features of dynamic compilation and then identify objectives for optimizing dynamic compilation systems. Following these objectives, we propose a novel dynamic compilation scheduling algorithm called combined analysis with online sifting (CAOS). It consists of a combined priority analysis model and an online sifting mechanism. The combined priority analysis model determines the priority of methods during scheduling, aiming to reconcile responsiveness with the average delay of the compilation queue. Online sifting further reduces runtime overhead, since methods with little benefit to performance are sifted out. CAOS can significantly improve the startup performance of applications. Experimental results show that CAOS improves startup performance by 14.0% on average, with a highest performance boost of 55.1%. Being highly versatile and easy to implement, CAOS can be applied to most dynamic compilation systems.

Wu R.,Chinese Academy of Sciences | Wu R.,University of Chinese Academy of Sciences | Wu R.,Loongson Technology Corporation Ltd | Tai Y.,Oracle Inc.
Gaojishu Tongxin/Chinese High Technology Letters | Year: 2015

The cost of virtual machine exit and restoration was studied, and a method based on delayed storing was proposed to reduce the cost of saving and restoring registers when virtual machines exit or resume. The main mechanism of the method is to reduce the number of registers to be saved and restored by changing the source code of the virtual machine software and checking whether the virtual machine being resumed is the same one that exited last time. The proposed method needs no hardware change and supports multicore operating systems and concurrent operation of multiple virtual machines, giving it wide applicability. Tests conducted on the Loongson 3A1500 platform demonstrated that the proposed method reduced the cost of virtual machine exit by 65% compared with the existing method, and increased the performance of the whole virtual machine by 3% to 10%. © 2015, Inst. of Scientific and Technical Information of China. All rights reserved.

Fu J.,University of Chinese Academy of Sciences | Fu J.,CAS Institute of Computing Technology | Jin G.,Loongson Technology Corporation Ltd | Zhang L.,CAS Institute of Computing Technology | Wang J.,CAS Institute of Computing Technology
Gaojishu Tongxin/Chinese High Technology Letters | Year: 2016

To reduce the overhead caused by instruction dispatch and thereby improve the performance of interpreters, an instruction dispatch approach based on hardware and software co-design is proposed. On the software side, its main idea is to eliminate the expensive operation of constant address loading by optimizing the instruction dispatch table; on the hardware side, it is to accelerate memory access by enhancing the processor's instruction set. The hardware-software co-design can minimize the runtime overhead of instruction dispatch, thus improving the performance of interpreters. The experimental results showed that the proposed approach significantly improved the performance of interpreters. For the SPECjvm98 and DaCapo benchmarks, the overall performance of interpreters was improved by 11.5%, and the highest performance boost was up to 15.4%. The approach is highly versatile, easy to implement and can be applied to the design and implementation of high performance interpreters on mainstream processors. © 2016, Inst. of Scientific and Technical Information of China. All rights reserved.

Zeng L.,CAS Institute of Computing Technology | Zeng L.,University of Chinese Academy of Sciences | Li P.,CAS Institute of Computing Technology | Li P.,University of Chinese Academy of Sciences | Wang H.,Loongson Technology Corporation Ltd
Gaojishu Tongxin/Chinese High Technology Letters | Year: 2016

Cache compression was studied as a way to increase the effective capacity of caches, and a region cooperative compression (RCC) algorithm was proposed to improve the compression ratio of the last-level cache. Unlike traditional cache compression algorithms, the RCC algorithm exploits compression locality to compress the cache blocks in a cache region with the cooperation of the first block in the region, instead of compressing the whole cache region. RCC effectively exploits the duplication across the cache blocks in a cache region and shows a compression ratio comparable with dictionary compression approaches that use the whole cache region as the compression granularity, while the (de)compression latency is not increased. The experimental results show that RCC improves the average compression ratio over the C-PACK compression algorithm by 12.34%, which yields a performance improvement of 5%. Compared to the non-compressed cache of double size, the effective capacity increases by 27%, the performance increases by 8.6% and the area decreases by 63.1%. © 2016, Inst. of Scientific and Technical Information of China. All rights reserved.

Hu W.,Chinese Academy of Sciences | Yang L.,Loongson Technology Corporation Ltd | Fan B.,Loongson Technology Corporation Ltd | Wang H.,Loongson Technology Corporation Ltd | Chen Y.,Chinese Academy of Sciences
IEEE Journal of Solid-State Circuits | Year: 2014

This paper is an extension of Hu et al., ISSCC, 2013, and it introduces the 32/28 nm implementations of Godson-3B1500, which are 8-core MIPS-compatible microprocessors with vector extensions. Godson-3B1500 is fabricated in STMicroelectronics 32/28 nm high-κ metal-gate low-power bulk CMOS with 10 metal layers. It contains 1.14 billion transistors and operates at frequencies of 1.0 GHz to 1.5 GHz with a supply voltage ranging from 1.0 V to 1.3 V. Compared to its predecessor (Hu et al., ISSCC, 2011), Godson-3B1500 brings significant power efficiency improvements with enhanced performance (150 GFLOPS at 1.2 GHz) and reduced power dissipation (< 40 W), due not only to technology scaling but also to a great deal of design effort. © 2013 IEEE.

Wu Y.,CAS Institute of Computing Technology | Wu Y.,University of Chinese Academy of Sciences | Lu C.,University of Chinese Academy of Sciences | Lu C.,Loongson Technology Corporation Ltd | Chen Y.,CAS Institute of Computing Technology
Frontiers of Computer Science | Year: 2016

With the rapid development of the semiconductor industry, the number of cores integrated on a chip is increasing quickly, which brings tough challenges in bandwidth, scalability and power to on-chip interconnection. Against this background, Network-on-Chip (NoC) was proposed and is gradually replacing traditional on-chip interconnections such as the shared bus and crossbar. For the convenience of physical layout, mesh is the most used topology in NoC design. The routing algorithm, which decides the paths of packets, has a significant impact on the latency and throughput of the network, and thus plays a vital role in a well-performing network. This study mainly focuses on the routing algorithms of mesh NoC. According to whether network information is taken into consideration in the routing decision, routing algorithms of NoC can be roughly classified into oblivious routing and adaptive routing. Oblivious routing costs less but lacks adaptiveness, while adaptive routing is the opposite. To combine the advantages of oblivious and adaptive routing algorithms, half-adaptive algorithms were proposed. In this paper, the concepts, taxonomy and features of routing algorithms of NoC are introduced. Then the importance of routing algorithms in mesh NoC is highlighted, and representative routing algorithms with their respective features are reviewed and summarized. Finally, we try to shed light upon the future work of NoC routing algorithms. © 2016 Higher Education Press and Springer-Verlag Berlin Heidelberg
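
As a small illustration of oblivious routing on a mesh, the sketch below implements dimension-order (XY) routing, a widely used oblivious scheme. The abstract does not name specific algorithms, so the choice of XY routing, the function name `xy_route`, and the `(x, y)` coordinate convention are assumptions made here for illustration only.

```python
def xy_route(src, dst):
    """Return the nodes visited from src to dst on a 2D mesh.

    Dimension-order (XY) routing: move along X until the column matches,
    then along Y. The path depends only on src and dst, never on network
    load, which is what makes the scheme oblivious.
    Coordinates are hypothetical (x, y) tuples.
    """
    x, y = src
    dst_x, dst_y = dst
    path = [(x, y)]
    # Phase 1: correct the X coordinate one hop at a time.
    while x != dst_x:
        x += 1 if dst_x > x else -1
        path.append((x, y))
    # Phase 2: correct the Y coordinate one hop at a time.
    while y != dst_y:
        y += 1 if dst_y > y else -1
        path.append((x, y))
    return path


if __name__ == "__main__":
    print(xy_route((0, 0), (2, 1)))
```

An adaptive algorithm would instead pick among several candidate next hops using congestion information, at the cost of extra logic and deadlock-avoidance constraints.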

Hu W.,CAS Institute of Computing Technology | Hu W.,Loongson Technology Corporation Ltd | Zhang Y.,CAS Institute of Computing Technology | Zhang Y.,University of Chinese Academy of Sciences | And 2 more authors.
Science China Information Sciences | Year: 2016

In recent years, China has witnessed considerable achievements in the production of domestically-designed CPUs and DSPs. Owing to fifteen years of hard work that began in 2001, significant progress has been made in Chinese domestic CPUs and DSPs, primarily represented by Loongson and ShenWei processors. Furthermore, parts of the CPU design techniques are comparable to the world’s most advanced designs. A special issue published in Scientia Sinica Informationis in April 2015 is dedicated to exhibiting the technical advancements in Chinese domestically-designed CPUs and DSPs. The content in this issue describes the design and optimization of high performance processors and the key technologies in processor development; these include high-performance micro-architecture design, many-core and multi-core design, radiation hardening design, high-performance physical design, complex chip verification, and binary translation technology. We hope that the articles we collected will promote understanding of CPU/DSP progress in China. Moreover, we believe that the future of Chinese domestic CPU/DSP processors is quite promising. © 2016, Science China Press and Springer-Verlag Berlin Heidelberg.

Zhu X.-J.,Chinese Academy of Sciences | Zhu X.-J.,Loongson Technology Corporation Ltd
Jisuanji Xuebao/Chinese Journal of Computers | Year: 2011

The development of integrated circuits has increased the number of on-chip cores. Communication among the cores demands higher throughput, lower latency and more scalability, and the traditional on-chip bus cannot satisfy these needs. Researchers have therefore presented a new interconnect architecture, called network on chip. To meet the special demands of network on chip, this paper gives a scalable topology named Rgrid and its routing algorithm, called DR. Rgrid can reduce the average number of hops between on-chip cores, and its physical implementation is much easier than that of the Torus topology. The author implements the Rgrid and Mesh topologies in the Godson3 simulator. The simulation results show that the simulator achieves much better performance with the Rgrid topology than with the Mesh topology for the Splash2 benchmarks. Compared to the Mesh topology, the IPC of the benchmarks on Rgrid increases by 0.5% to 148%, and the average latency drops by 5% to 81%.
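
To make the "average hops" metric concrete, the sketch below computes the average minimal hop count between distinct nodes of a k x k Mesh and a k x k Torus. It does not model Rgrid or the DR algorithm, whose construction the abstract does not describe; the function name `avg_hops` and the uniform all-pairs traffic assumption are illustrative choices made here.

```python
from itertools import product


def avg_hops(k, wrap):
    """Average minimal hop count over all ordered pairs of distinct nodes.

    k    -- side length of the k x k grid
    wrap -- False for a Mesh, True for a Torus (wrap-around links)
    Assumes uniform traffic between every pair of nodes.
    """
    nodes = list(product(range(k), repeat=2))

    def dist(a, b):
        d = abs(a - b)
        # On a torus a packet may go the short way around the ring.
        return min(d, k - d) if wrap else d

    total = sum(
        dist(ax, bx) + dist(ay, by)
        for (ax, ay), (bx, by) in product(nodes, nodes)
        if (ax, ay) != (bx, by)
    )
    return total / (len(nodes) * (len(nodes) - 1))


if __name__ == "__main__":
    for name, wrap in (("Mesh", False), ("Torus", True)):
        print(name, round(avg_hops(4, wrap), 3))
```

The wrap-around links are what lower the Torus average, and they are also what makes its physical layout harder than a Mesh.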

Su M.,CAS Institute of Computing Technology | Su M.,University of Chinese Academy of Sciences | Su M.,Loongson Technology Corporation Ltd | Chen Y.,CAS Institute of Computing Technology | And 3 more authors.
Proceedings - Design, Automation and Test in Europe, DATE | Year: 2010

Nondeterminism in multi-clock systems often complicates system validation processes such as post-silicon debugging and at-speed testing, which has brought many difficulties to system designers and testers. The major source of nondeterministic behavior is clock domain crossing, because the clocks that determine the timing of events are sensitive to variations. In this paper, we propose a general method to eliminate the nondeterminism resulting from clock domain crossing. This method does not assume any specific relationship among the clocks. Instead, to adapt to various clock conditions, an automatic configuration procedure and a periodic error canceling mechanism, which require only trivial hardware support, are proposed by analyzing the deterministic boundaries theoretically. To demonstrate the applicability of our method in practice, we implement it on an FPGA platform. Experimental results validate that the performance loss brought by our method over a conventional multi-clock FIFO is less than 2%. © 2010 EDAA.