Gao Y.,University of Chinese Academy of Sciences | Gao Y.,CAS Institute of Computing Technology | Gao Y.,Loongson Technologies Corporation Ltd | Kachris C.,Athens Information Technology | Katevenis M.,Foundation for Research and Technology Hellas
Proceedings - IEEE Symposium on Computers and Communications | Year: 2011

This paper presents a sequential mode that can be used to improve the efficiency of iterative matching algorithms for CIOQ crossbar switches. The proposed matching algorithms are collectively called SIM-δ, and four different configurations are presented: restricted iteration, pipelined iteration, exhaustive transmission, and GSI (general state information)-based matching arbitration. Implementation-based simulation results show that switches with SIM-δ outperform switches with other matching algorithms, such as iSLIP, under uniform Bernoulli and bursty traffic as well as non-uniform local-hotspot and diagonal traffic models. We also explore the impact of speedup on throughput and delay. For uniform Bernoulli traffic, with no or only a small speedup, SIM-δ achieves higher aggregate throughput and lower average delay than iSLIP. © 2011 IEEE.
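The abstract does not spell out SIM-δ's internals. As background, here is a minimal Python sketch of one request-grant-accept iteration of an iSLIP-style matcher, the family of algorithms SIM-δ is compared against; the pointer-array representation is an illustrative choice, not the paper's:

```python
def rga_iteration(requests, grant_ptr, accept_ptr):
    """One request-grant-accept iteration of an iSLIP-style matcher.

    requests:     set of (input, output) pairs that have queued cells.
    grant_ptr[o]: round-robin grant pointer at output o.
    accept_ptr[i]: round-robin accept pointer at input i.
    Returns the set of (input, output) pairs matched this iteration.
    """
    n = len(grant_ptr)
    # Grant phase: each output grants the requesting input nearest its pointer.
    grants = {}
    for o in range(n):
        reqs = [i for i in range(n) if (i, o) in requests]
        if reqs:
            grants[o] = min(reqs, key=lambda i: (i - grant_ptr[o]) % n)
    # Accept phase: each input accepts the granting output nearest its pointer.
    matches = set()
    for i in range(n):
        offers = [o for o, gi in grants.items() if gi == i]
        if offers:
            o = min(offers, key=lambda o: (o - accept_ptr[i]) % n)
            matches.add((i, o))
            # iSLIP rule: pointers advance one beyond the matched port.
            grant_ptr[o] = (i + 1) % n
            accept_ptr[i] = (o + 1) % n
    return matches
```

Running several such iterations per cell time, with pointers updated only on accepted grants, is what desynchronizes the round-robin pointers and drives throughput toward 100% under uniform traffic.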

Guo Q.,CAS Institute of Computing Technology | Guo Q.,Loongson Technologies Corporation Ltd | Guo Q.,University of Chinese Academy of Sciences | Chen T.,CAS Institute of Computing Technology | And 7 more authors.
IJCAI International Joint Conference on Artificial Intelligence | Year: 2011

During the design of a microprocessor, Design Space Exploration (DSE) is a critical step that determines the appropriate design configuration of the microprocessor. In the computer architecture community, supervised learning techniques have been applied to DSE to build models for predicting the quality of design configurations. For supervised learning, however, considerable simulation cost is required to obtain labeled design configurations, and with limited resources it is difficult to achieve high accuracy. In this paper, inspired by recent advances in semi-supervised learning, we propose the COMT approach, which exploits unlabeled design configurations to improve the models. In addition to improved predictive accuracy, COMT is able to guide the design of microprocessors, owing to its use of comprehensible model trees. An empirical study demonstrates that COMT significantly outperforms the state-of-the-art DSE technique, reducing mean squared error by 30% to 84%; thus, promising architectures can be attained more efficiently.
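The co-training idea behind COMT can be sketched compactly: two learners, each seeing a different view of the configuration features, take turns pseudo-labeling the unlabeled point they are most confident about and handing it to the other learner. This is a generic co-training-regression sketch, with 1-nearest-neighbor predictors standing in for the paper's model trees and nearest-distance standing in for its confidence measure:

```python
def knn1(train, x):
    """1-nearest-neighbor prediction (a simple stand-in for a model tree)."""
    return min(train, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)))[1]

def co_train(labeled, unlabeled, rounds=3):
    """Semi-supervised co-training for regression, in the spirit of COMT.

    labeled:   list of (feature_tuple, quality) pairs from simulation.
    unlabeled: list of feature tuples never simulated.
    """
    half = len(labeled[0][0]) // 2
    views = [lambda x: x[:half], lambda x: x[half:]]   # two feature views
    train = [[(v(x), y) for x, y in labeled] for v in views]
    pool = list(unlabeled)
    for _ in range(rounds):
        for j in (0, 1):
            if not pool:
                break
            v = views[j]
            # Confidence proxy: the pool point closest to learner j's data.
            x = min(pool, key=lambda u: min(
                sum((a - b) ** 2 for a, b in zip(p[0], v(u))) for p in train[j]))
            # Learner j pseudo-labels x for the *other* learner.
            train[1 - j].append((views[1 - j](x), knn1(train[j], v(x))))
            pool.remove(x)
    return lambda x: 0.5 * (knn1(train[0], views[0](x)) +
                            knn1(train[1], views[1](x)))
```

The payoff is that unlabeled configurations, which cost nothing to enumerate, refine both learners without extra simulation runs.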

Ren T.,University of Chinese Academy of Sciences | Ren T.,CAS Institute of Computing Technology | Xue S.,CAS Institute of Computing Technology | Xue S.,Loongson Technologies Corporation Ltd | And 4 more authors.
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) | Year: 2015

The browser is the entry point to cloud computing services, and the performance of JavaScript, in which web applications are built, has become critically important to the user experience. The key to JavaScript execution efficiency is just-in-time (JIT) compilation. Firefox is currently one of the most popular cross-platform browsers; however, IonMonkey, Firefox's next-generation optimizing JavaScript JIT compiler, has no MIPS code generator, leaving the low-performance interpreter as the only way to execute JavaScript on the MIPS platform in Firefox. In this paper, we implement an efficient and reliable MIPS code generator for IonMonkey. We examine the inner mechanisms of IonMonkey and solve a series of platform-related problems, such as the double-layer cross-platform architecture, patching, the jump-source chain, and the ABI. Additionally, we optimize IonMonkey for the MIPS architecture using methods such as short-distance jump optimization, range analysis for arithmetic operations, and peephole optimization. With the JIT port and these optimizations, the V8 benchmark score rose from 38.8 to 957, and the running time of the SunSpider benchmark fell from 20428.7 ms to 2689.5 ms; the efficiency of the JS engine on MIPS was significantly improved. © Springer International Publishing Switzerland 2015.
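Short-distance jump optimization rests on a property of the MIPS ISA: conditional branches carry a signed 16-bit word offset relative to the instruction after the branch (the delay slot), so a nearby target costs one instruction, while a far target needs an inverted branch that skips over an absolute jump. A hypothetical encoder illustrating the choice (the emitted tuples are schematic, not real machine encodings):

```python
def emit_branch(branch_addr, target_addr):
    """Pick a short or long branch form by distance (illustrative sketch).

    Short form: a single `beq`-style branch with a 16-bit word offset.
    Long form:  an inverted `bne` that hops over an absolute `j` jump.
    """
    offset = (target_addr - (branch_addr + 4)) >> 2    # offset in words
    if -(1 << 15) <= offset < (1 << 15):
        return [("beq", offset)]                       # fits: short form
    return [("bne", 2), ("j", target_addr >> 2)]       # too far: long form
```

A JIT that pessimistically emits the long form everywhere wastes an instruction plus a delay slot on most branches, which is why patching branches down to the short form when the final offset is known pays off.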

Guo Q.,CAS Institute of Computing Technology | Guo Q.,University of Chinese Academy of Sciences | Chen T.,CAS Institute of Computing Technology | Chen T.,Loongson Technologies Corporation Ltd | And 6 more authors.
Microprocessors and Microsystems | Year: 2013

Predictive modeling is an emerging methodology for microarchitectural design space exploration. However, this method suffers from the high cost of constructing predictive models, especially when unseen programs are employed in performance evaluation. In this paper, we propose a fast predictive-model-based approach for microarchitectural design space exploration. The key to our approach is utilizing inherent program characteristics as prior knowledge (in addition to microarchitectural configurations) to build a universal predictive model, so that no additional simulation is required to evaluate new programs on new configurations. Moreover, thanks to the employed model-tree technique, we can provide insights into the design space for early design decisions. Experimental results demonstrate that our approach is comparable to previous approaches in the prediction accuracy of performance and energy, while its training time achieves a 7.6-11.8× speedup over previous approaches for each workload. The training cost of our approach can be further reduced via instrumentation. © 2012 Elsevier B.V. All rights reserved.

Guo Q.,CAS Institute of Computing Technology | Guo Q.,University of Chinese Academy of Sciences | Guo Q.,Loongson Technologies Corporation Ltd
Jisuanji Fuzhu Sheji Yu Tuxingxue Xuebao/Journal of Computer-Aided Design and Computer Graphics | Year: 2012

One of the most critical issues in functional verification is generating highly effective stimuli, yet as verification proceeds, the effectiveness of the generated stimuli decreases. To improve stimulus effectiveness, an online filtering technique for generated stimuli is proposed. The technique employs one-class support vector machines to construct, online, a classifier that predicts whether a newly generated stimulus is redundant; a stimulus predicted to be redundant is not sent for simulation. We also propose an instruction-sequence kernel to measure the similarity among instruction sequences. Experimental results demonstrate that the technique reduces the number of stimuli by about 83% and the verification time by about 79% in comparison with conventional constrained random generation.
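The shape of such a filter is easy to sketch. Below, a shared-bigram count stands in for the paper's instruction-sequence kernel (whose exact definition is not in the abstract), and a plain similarity threshold stands in for the one-class SVM; both substitutions are illustrative only:

```python
def ngram_kernel(seq_a, seq_b, n=2):
    """Toy instruction-sequence kernel: count of shared n-grams."""
    grams = lambda s: {tuple(s[i:i + n]) for i in range(len(s) - n + 1)}
    return len(grams(seq_a) & grams(seq_b))

def filter_stimuli(stimuli, threshold=2, n=2):
    """Online redundancy filter: a new stimulus is simulated only if its
    kernel similarity to every previously kept stimulus is below threshold."""
    kept = []
    for s in stimuli:
        if all(ngram_kernel(s, k, n) < threshold for k in kept):
            kept.append(s)    # novel: send to simulation and remember it
    return kept
```

The real scheme replaces the threshold test with a one-class SVM decision function trained online in the same kernel space, which adapts the redundancy boundary as simulation coverage grows.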

Tang D.,CAS Institute of Computing Technology | Tang D.,Loongson Technologies Corporation Ltd | Tang D.,University of Chinese Academy of Sciences | Bao Y.,CAS Institute of Computing Technology | And 3 more authors.
Proceedings - International Symposium on High-Performance Computer Architecture | Year: 2010

As technology advances, both by increasing bandwidth and by reducing latency for I/O buses and devices, moving I/O data into and out of memory has become critical. In this paper, we observe the different characteristics of I/O and CPU memory reference behavior, and identify the potential benefits of separating I/O data from CPU data. We propose a DMA cache technique that stores I/O data in dedicated on-chip storage, and present two DMA cache designs. The first design, Decoupled DMA Cache (DDC), adopts additional on-chip storage as the DMA cache to buffer I/O data. The second design, Partition-Based DMA Cache (PBDC), requires no additional on-chip storage but can dynamically use some ways of the processor's last-level cache (LLC) as the DMA cache. We implemented and evaluated the two DMA cache designs using an FPGA-based emulation platform and the memory reference traces of real-world applications. Experimental results show that, compared with the existing snooping-cache scheme, DDC reduces memory access latency (in bus cycles) by 34.8% on average (up to 58.4%), while PBDC achieves about 80% of DDC's performance improvement despite adding no on-chip storage. ©2009 IEEE.
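The benefit of PBDC-style way partitioning is that streaming DMA fills cannot evict CPU working-set lines, because each traffic class is confined to its own ways. A minimal behavioral sketch, assuming per-partition LRU within each set (the real replacement and coherence machinery is far richer):

```python
class PartitionedCache:
    """Toy set-associative cache with PBDC-style way partitioning:
    DMA lines may occupy only `dma_ways` ways per set, CPU lines the rest."""

    def __init__(self, sets, ways, dma_ways):
        self.sets = sets
        self.cap = {"dma": dma_ways, "cpu": ways - dma_ways}
        # One LRU list per (partition, set); front = most recently used.
        self.part = {k: [[] for _ in range(sets)] for k in self.cap}

    def access(self, addr, kind):
        """Look up a line as `kind` ('cpu' or 'dma'). True on hit;
        on a miss, fill into that kind's partition, evicting its own LRU."""
        lru = self.part[kind][addr % self.sets]
        tag = addr // self.sets
        if tag in lru:
            lru.remove(lru[lru.index(tag)])
            lru.insert(0, tag)
            return True
        lru.insert(0, tag)
        if len(lru) > self.cap[kind]:
            lru.pop()          # victim comes from the same partition only
        return False
```

The test below shows the isolation property: a burst of DMA fills leaves a resident CPU line untouched, which is exactly what an unpartitioned LLC cannot guarantee.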

Wang W.,CAS Institute of Computing Technology | Wang W.,Loongson Technologies Corporation Ltd | Wang W.,University of Chinese Academy of Sciences | Shen H.,CAS Institute of Computing Technology | Shen H.,Loongson Technologies Corporation Ltd
Jisuanji Fuzhu Sheji Yu Tuxingxue Xuebao/Journal of Computer-Aided Design and Computer Graphics | Year: 2011

Subpixel interpolation is one of the most computation-intensive parts of HD video decoding. Existing subpixel interpolation architectures have difficulty achieving high performance and flexibility simultaneously. This paper presents a reconfigurable subpixel interpolation architecture for multi-standard video decoding. Based on an analysis and comparison of the commonalities and differences among the interpolation algorithms of various standards, a novel reconfigurable parallel-serial-mixed filtering architecture is proposed, which allows dynamic configuration of the data transfer path, the I/O data pattern, and the filter computation unit. It supports video coding standards including VC-1, H.264/H.263, AVS, and MPEG-1/2/4. Experimental results show that the design achieves real-time multi-standard HDTV 1080p (1920x1088@30 fps) video decoding. Compared with previous work, the proposed design supports more types of HD video coding standards while consuming the same amount of silicon resources. It has been applied in a multimedia SoC chip.
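To make the workload concrete: the H.264 luma half-sample filter, one of the kernels such an architecture must support, is a 6-tap FIR over neighboring integer pixels with coefficients (1, -5, 20, 20, -5, 1), rounded and clipped to 8 bits. A reference computation in Python:

```python
def halfpel_h264(px):
    """H.264 luma half-sample interpolation for one output pixel.

    px: the six neighboring integer-position pixels (0..255) along the
    filtering direction. Applies the standard 6-tap (1,-5,20,20,-5,1)
    filter with rounding, then clips the result to the 8-bit range.
    """
    taps = (1, -5, 20, 20, -5, 1)
    acc = sum(t * p for t, p in zip(taps, px))
    return max(0, min(255, (acc + 16) >> 5))   # round (divide by 32) and clip
```

Other standards use different tap sets and lengths (e.g. VC-1's 4-tap bicubic), which is precisely the variability the reconfigurable filter computation unit absorbs.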

Guo Q.,CAS Institute of Computing Technology | Guo Q.,University of Chinese Academy of Sciences | Chen T.,CAS Institute of Computing Technology | Chen T.,Loongson Technologies Corporation Ltd | And 6 more authors.
Proceedings of the Asian Test Symposium | Year: 2010

As a primary method for the functional verification of microprocessors, simulation-based verification has been studied extensively over the last decade. Most investigations have been devoted to the generation of stimuli (test cases), while relatively few have focused on explicitly reducing redundancy among the generated stimuli. In this paper, we propose an on-the-fly approach for reducing stimulus redundancy based on machine learning techniques, which learns from new knowledge in every cycle of simulation-based verification. Our approach can easily be embedded in the traditional framework of simulation-based functional verification, and experiments on an industrial microprocessor validate that the approach is both effective and efficient. © 2010 IEEE.

Chen Y.,CAS Institute of Computing Technology | Chen Y.,Loongson Technologies Corporation Ltd | Hu W.,CAS Institute of Computing Technology | Hu W.,Loongson Technologies Corporation Ltd | And 4 more authors.
Proceedings - International Symposium on Computer Architecture | Year: 2010

Debugging parallel programs is a well-known difficult problem. A promising way to facilitate it is to use hardware support to achieve deterministic replay. A hardware-assisted deterministic replay scheme should have a small log size, as well as a low design cost, to be feasible for adoption in industrial processors. To achieve these goals, we propose a novel and succinct hardware-assisted deterministic replay scheme named LReplay. The key innovation of LReplay is that, instead of recording the logical time orders between instructions or instruction blocks as in previous investigations, it is built upon recording pending-period information [6]. According to experimental results on Godson-3, the overall log size of LReplay is about 0.55 B/K-inst (bytes per thousand instructions) for sequential consistency and 0.85 B/K-inst for Godson-3 consistency. This log size is an order of magnitude smaller than that of state-of-the-art deterministic replay schemes, while incurring no performance loss. Furthermore, LReplay consumes only about 1.3% of the area of Godson-3, since it requires only trivial modifications to Godson-3's existing components. These features of LReplay demonstrate the potential of integrating hardware-assisted deterministic replay into future industrial processors. Copyright 2010 ACM.

Li L.,CAS Institute of Computing Technology | Li L.,Loongson Technologies Corporation Ltd | Chen Y.-J.,CAS Institute of Computing Technology | Chen Y.-J.,Loongson Technologies Corporation Ltd | And 6 more authors.
Journal of Computer Science and Technology | Year: 2011

The general-purpose processor (GPP) is an important platform for the fast Fourier transform (FFT), owing to its flexibility, reliability, and practicality. Because FFT is a representative application that is intensive in both computation and memory access, optimizing the FFT performance of a GPP also benefits many other applications. To facilitate the analysis of FFT, this paper proposes a theoretical model of FFT processing. The model gives a tight lower bound on the runtime of FFT on a GPP, and guides architectural optimization of GPPs as well. Based on the model, two theorems on the optimization of architecture parameters are deduced, concerning the lower bounds on register count and memory bandwidth. Experimental results on different processor architectures (including the Intel Core i7 and Godson-3B) validate the performance model. These investigations were adopted in the development of Godson-3B, an industrial GPP. The optimization techniques deduced from our performance model improve FFT performance by about 40%, while incurring only 0.8% additional area cost. Consequently, Godson-3B solves a 1024-point single-precision complex FFT in 0.368 μs with about 40 W of power consumption, and to our knowledge has the highest performance-per-watt in complex FFT among processors. This work can benefit the optimization of other GPPs as well. ©2011 Springer Science+Business Media, LLC & Science Press, China.
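The general shape of such a runtime bound can be illustrated with a roofline-style estimate; this is a sketch under common FFT cost assumptions, not the paper's actual model, whose details the abstract does not give:

```python
import math

def fft_cycle_lower_bound(n, flops_per_cycle, words_per_cycle, cache_words):
    """Roofline-style lower bound on the cycles for an n-point complex FFT.

    Assumptions (illustrative): a radix-2 complex FFT performs about
    5*n*log2(n) flops; with on-chip capacity for `cache_words` complex
    words, one sweep over the data covers log2(cache_words) butterfly
    stages, so the data set streams from memory
    ceil(log2(n) / log2(cache_words)) times, read and written each pass.
    """
    stages = math.log2(n)
    compute_cycles = 5 * n * stages / flops_per_cycle
    passes = math.ceil(stages / math.log2(cache_words))
    memory_cycles = 2 * n * passes / words_per_cycle
    # The runtime can be no smaller than either resource's demand.
    return max(compute_cycles, memory_cycles)
```

Whichever term dominates tells the architect which parameter (functional-unit throughput versus memory bandwidth) is the binding constraint, which is how such a model can guide decisions like Godson-3B's register and bandwidth provisioning.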
