Loongson Corporation

Beijing, China


Chen L.,Chinese Academy of Sciences | Chen L.,CAS Institute of Computing Technology | Chen L.,University of Chinese Academy of Sciences | Chen L.,Loongson Corporation | And 13 more authors.
Gaojishu Tongxin/Chinese High Technology Letters | Year: 2013

Based on an analysis of the memory accesses made during synchronization on chip multiprocessors (CMPs), a method for recognizing the type of synchronization was proposed, and a novel cache coherence protocol that optimizes CMP synchronization was designed. The protocol adds a new cache state for synchronization information and serializes synchronization operations by blocking, which guarantees that atomic operations execute successfully. As a result, the number of memory accesses caused by synchronization conflicts is greatly reduced, and the memory access behavior during synchronization approaches the ideal. The experimental results show that synchronization performance with the proposed protocol is nearly ideal: compared with a traditional cache coherence protocol, synchronization performance increases by 100%, and the total execution time of parallel programs is reduced by 25%.

Liu H.,Chinese Academy of Sciences | Liu H.,CAS Institute of Computing Technology | Liu H.,University of Chinese Academy of Sciences | Liu H.,Loongson Corporation | And 12 more authors.
Proceedings - 2013 IEEE 8th International Conference on Networking, Architecture and Storage, NAS 2013 | Year: 2013

Compared with homogeneous multi-core processors, heterogeneous multi-core processors have strong potential for improving performance, energy efficiency, and area efficiency. Existing execution migration methods for heterogeneous multi-core processors fall short in efficiency, cost, compatibility, or programmability. In this paper, we propose a HW/SW co-design migration method based on binary instrumentation. Our method takes full advantage of the shared ISA and enhances the performance of heterogeneous chip multiprocessors at low HW/SW cost, without requiring changes to the source code or the compilation system. The experimental results show that the efficiency of our method is 3.29 times that of kernel simulation. © 2013 IEEE.

Liao B.,University of Chinese Academy of Sciences | Liao B.,Chinese Academy of Sciences | Liao B.,CAS Institute of Computing Technology | Liao B.,Loongson Corporation | And 17 more authors.
Gaojishu Tongxin/Chinese High Technology Letters | Year: 2015

To address the garbage collection (GC) performance degradation that non-uniform memory access (NUMA) architectures suffer because of the large number of remote accesses during GC, each phase of the GC process was analyzed, and an efficient, real-time, and stable GC algorithm was proposed for NUMA. The algorithm first adapts the heap space of the traditional generational GC mechanism to the NUMA architecture. It then greatly decreases the number of remote accesses during GC by controlling three things: the selection of initial root objects in the live object scanning phase, the choice of task queue to steal from in the dynamic load balancing phase, and the destination of copies in the live object copying phase. The improved GC algorithm can be applied to all NUMA platforms. Experiments on the Godson-3 NUMA platform show that the proposed algorithm reduces the stop-the-world (STW) time during GC and enhances the performance and stability of the application program. For the SpecJVM2008 benchmarks, the new algorithm reduced the STW time by 14.6% on average (reducing the total time by 4.1% to 41.58%), increased application performance by 4.68% on average (up to 17.8%), and improved stability by 76.2%. ©, 2015, Inst. of Scientific and Technical Information of China. All rights reserved.
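The abstract above does not describe how the task queue to steal from is chosen, so the following is only a minimal sketch of one plausible policy: an idle GC worker prefers a non-empty queue on its own NUMA node and falls back to a remote node only when no local queue has work. The names `pick_victim`, `steal`, `queues`, and `node_of` are hypothetical and are not taken from the paper.

```python
from collections import deque


def pick_victim(thief, queues, node_of):
    """Return the index of the queue to steal from, or None if all are empty.

    Assumption (not from the paper): non-empty queues on the thief's own
    NUMA node are preferred; a remote queue is used only as a fallback.
    """
    candidates = [i for i, q in enumerate(queues) if i != thief and q]
    local = [i for i in candidates if node_of[i] == node_of[thief]]
    pool = local if local else candidates
    if not pool:
        return None
    # Among the eligible queues, take the longest one.
    return max(pool, key=lambda i: len(queues[i]))


def steal(thief, queues, node_of):
    """Move one task from the chosen victim queue to the thief's queue."""
    victim = pick_victim(thief, queues, node_of)
    if victim is None:
        return None
    task = queues[victim].popleft()
    queues[thief].append(task)
    return task


# Worker 0 and worker 1 share node 0; worker 2 sits on node 1.
queues = [deque(), deque(["a"]), deque(["b", "c", "d"])]
node_of = [0, 0, 1]
first = steal(0, queues, node_of)    # taken from the local worker 1
second = pick_victim(1, queues, node_of)
```

In this example worker 0 steals from worker 1 even though worker 2 has a longer queue, because worker 1 is on the same node. A real collector would also need atomic queue operations, which this single-threaded sketch leaves out.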

Chen X.,Chinese Academy of Sciences | Chen X.,University of Chinese Academy of Sciences | Zhang G.,Huawei | Wang H.,Loongson Corporation | And 6 more authors.
Proceedings - Design, Automation and Test in Europe, DATE | Year: 2015

Because of the speed bottleneck of software-based simulators, FPGA-based simulation has been increasingly explored. This paper proposes a novel methodology for simulating a chip multiprocessor (CMP) with limited FPGA resources. By mixing real cores and pseudo cores (MRP), we can simulate a multicore system with fewer FPGA resources and achieve a much higher simulation speed. We propose several methods for constructing the pseudo cores. We implement our idea on a dual Virtex-6 FPGA board to simulate a general-purpose, high-performance 4-core CMP. Comparison experiments against the corresponding taped-out chip demonstrate the effectiveness of MRP. We also evaluate the performance of the MRP prototype by running SPEC CPU2006 benchmarks on an unmodified Linux operating system, achieving speedups of tens to hundreds of times over two other commonly used simulators. © 2015 EDAA.

Zhang G.,Loongson Corporation | Zhang G.,University of Chinese Academy of Sciences | Zhang G.,Chinese Academy of Sciences | Wang H.,Loongson Corporation | And 11 more authors.
Proceedings - Design Automation Conference | Year: 2012

We propose a novel memory controller architecture, called HMC (Heterogeneous Multi-Channel), as an improvement on the previous homogeneous multi-channel memory controller. HMC groups physical DRAM devices into logical sub-ranks with different data bus widths and controls them simultaneously. Using a newly proposed memory access algorithm, HMC flexibly manages the number of devices involved in a single memory access and achieves the best performance/power efficiency. With four-core multiprogramming workloads, our experimental results show that HMC improves system performance by 27.6% with a 24.2% reduction in DRAM power consumption on average. © 2012 ACM.

Liu H.,Chinese Academy of Sciences | Liu H.,CAS Institute of Computing Technology | Liu H.,University of Chinese Academy of Sciences | Liu H.,Loongson Corporation | And 10 more authors.
Gaojishu Tongxin/Chinese High Technology Letters | Year: 2014

The characteristics of execution migration in heterogeneous multi-core processors were analyzed, and an execution migration method based on binary instrumentation was put forward for heterogeneous multi-core processors. It addresses the shortcomings of existing methods for execution migration between shared-ISA heterogeneous cores in efficiency, cost, compatibility, or programmability. The proposed method takes full advantage of the shared ISA to enhance the performance of heterogeneous chip multiprocessors at low cost, and it does not require changes to the source code or the compilation system. Experimental results obtained with SPEC programs showed that its run-time efficiency was 2.25 times that of kernel simulation.

Zhang X.,Chinese Academy of Sciences | Zhang X.,CAS Institute of Computing Technology | Zhang X.,University of Chinese Academy of Sciences | Zhang X.,Loongson Corporation | And 13 more authors.
Gaojishu Tongxin/Chinese High Technology Letters | Year: 2014

Consistency maintenance for address mapping during indirect branch handling in a dynamic binary translation (DBT) system was studied. The traditional scheme, which maintains consistency with locks, has the major shortcoming of causing large overhead in both single-threaded and multithreaded execution. Based on an analysis of this shortcoming, a novel approach to optimizing consistency maintenance was proposed. The new method avoids lock operations when handling hot branches by tracing the hotspots of indirect branches, and it performs redundant address mapping when read-write conflicts are detected. For the detection, a dedicated mechanism was designed to organize the timing sequence of instructions and the address mapping data. Experiments on the Godson-3 platform emulating the X86 architecture show that the proposed approach reduces the execution overhead by 27.7% on average (1.8% to 58.5%) for single-threaded benchmarks and by 18.4% on average (3.3% to 64.6%) for multithreaded benchmarks.
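The abstract does not specify the detection mechanism, so the sketch below swaps in a generic seqlock-style version counter to illustrate the general idea of a lock-free read path that repeats the address mapping when a read-write conflict is detected. It is not the paper's design. The class `TargetMap` and its methods are hypothetical names, and the example assumes CPython, where individual dictionary reads and writes are not torn.

```python
import threading


class TargetMap:
    """Guest-to-host address map with a lock-free read path (illustrative)."""

    def __init__(self):
        self._lock = threading.Lock()   # taken by writers only
        self._version = 0               # even: stable, odd: write in progress
        self._table = {}                # guest address -> translated address

    def insert(self, guest_addr, host_addr):
        with self._lock:
            self._version += 1          # odd: a write is in progress
            self._table[guest_addr] = host_addr
            self._version += 1          # even: the table is stable again

    def lookup(self, guest_addr):
        while True:
            before = self._version
            host_addr = self._table.get(guest_addr)
            if before % 2 == 0 and before == self._version:
                return host_addr        # no writer interfered with this read
            # Conflict detected: repeat the mapping instead of taking a lock.


mapping = TargetMap()
miss = mapping.lookup(0x400000)         # nothing translated yet
mapping.insert(0x400000, 0x7F0000)
hit = mapping.lookup(0x400000)
```

Readers never take the lock, so a lookup on a hot indirect branch costs two version reads and one table read in the common case. The price is an occasional repeated lookup when a writer is active.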
