Gong S.,Chinese Academy of Sciences |
Gong S.,CAS Institute of Computing Technology |
Xiong J.,Chinese Academy of Sciences |
Xiong J.,CAS Institute of Computing Technology |
And 3 more authors.
Proceedings - 2014 IEEE International Conference on Web Services, ICWS 2014 | Year: 2014
The increasing number of web services on the Internet enables users to compose them to satisfy their needs efficiently. Such service composition is prone to errors. Automatically detecting incompatible web-service interactions and correcting them would greatly improve the user experience of service composition. When correcting the errors, two major issues must be addressed: first, how to satisfy the diverse correction requirements of different users; second, how to find the corrections efficiently. This paper proposes an approach to discovering maximum-diversity corrections, reducing the risk of failing to satisfy different end users' needs when presenting correction plans to them. To solve the problem efficiently, this paper proposes an approximate algorithm to find diverse correction plans. Furthermore, two pruning strategies are adopted to reduce the algorithm's runtime. Experiments show that our approach outperforms the baseline on the diversity of correction plans, and that the two pruning strategies reduce the runtime significantly. © 2014 IEEE.
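The abstract does not give the approximate algorithm itself, but greedy farthest-point selection is a standard way to approximate a maximum-diversity subset. Below is a minimal sketch under assumed representations: each correction plan is a set of edit operations, and diversity is Jaccard distance; none of these specifics come from the paper.

```python
# Hypothetical sketch: greedily pick k correction plans that maximize
# pairwise diversity. Plan representation (sets of edit-operation labels)
# and the Jaccard distance metric are illustrative assumptions.

def jaccard_distance(a, b):
    """Diversity between two plans, each a set of edit operations."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def select_diverse_plans(plans, k):
    """Greedy approximation: start from the first plan, then repeatedly
    add the remaining plan farthest (in total) from those already chosen."""
    chosen = [plans[0]]
    remaining = list(plans[1:])
    while remaining and len(chosen) < k:
        best = max(remaining,
                   key=lambda p: sum(jaccard_distance(p, c) for c in chosen))
        chosen.append(best)
        remaining.remove(best)
    return chosen

plans = [{"add:A", "del:B"}, {"add:A", "del:C"}, {"swap:D", "del:B"}]
print(select_diverse_plans(plans, 2))
```

A pruning strategy, as in the paper, would skip candidates whose best possible distance cannot exceed the current maximum; that bound is omitted here for brevity.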
Yuan X.,State Key Laboratory of Computer Architecture |
Yuan X.,University of Chinese Academy of Sciences |
Wu C.,State Key Laboratory of Computer Architecture |
Wang Z.,State Key Laboratory of Computer Architecture |
And 7 more authors.
Proceedings - International Conference on Software Engineering | Year: 2015
Multi-threaded programs play an increasingly important role in current multi-core environments. Exposing concurrency bugs and debugging such multi-threaded programs have become quite challenging due to their inherent non-determinism. To eliminate such non-determinism, many approaches, such as record-and-replay and similar bug-reproducing systems, have been proposed. However, those approaches often suffer significant performance degradation because they require a large amount of recorded information and/or long analysis and replay time. In this paper, we propose an effective approach, ReCBuLC, that takes advantage of the hardware clocks available on modern processors. The key idea is to reduce the overhead of recording and analyzing events' global order by using timestamps recorded in each thread. These timestamps are used to determine the global order of shared accesses. To avoid the large overhead incurred in accessing a system-wide global clock, we opt to use local per-core clocks, which incur much less access overhead. We then propose techniques to resolve differences among local clocks and obtain an accurate global event order. By using per-core clocks, state-of-the-art bug-reproducing systems such as PRES and CLAP can reduce their recording overheads by 1% to 85% and their analysis time by 84.66% to 99.99%, respectively. © 2015 IEEE.
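The core merging step can be illustrated in a few lines: once each core's clock offset relative to a reference core has been estimated, per-thread event logs are adjusted and sorted into one global order. This is a minimal sketch of that idea, not ReCBuLC's actual offset-resolution technique; the event tuples and offset table are assumptions.

```python
# Illustrative sketch: order shared-access events recorded with per-core
# timestamps by compensating each core's clock with an estimated offset
# (in ticks) relative to a reference core, then sorting globally.

def global_order(events, core_offsets):
    """events: list of (core_id, local_timestamp, event_label).
    core_offsets: estimated per-core clock offset, reference core = 0."""
    adjusted = [(ts + core_offsets[core], core, label)
                for core, ts, label in events]
    return [label for _, _, label in sorted(adjusted)]

events = [(0, 100, "t0: write x"),
          (1, 90,  "t1: read x"),
          (0, 120, "t0: unlock")]
offsets = {0: 0, 1: 25}        # core 1's clock lags the reference by 25 ticks
print(global_order(events, offsets))   # -> t0 write, t1 read, t0 unlock
```

The hard part in practice, which the paper addresses, is estimating those offsets accurately when cores' clocks drift; the sketch simply takes them as given.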
Ma J.,State Key Laboratory of Computer Architecture |
Ma J.,University of Chinese Academy of Sciences |
Sui X.,State Key Laboratory of Computer Architecture |
Sun N.,State Key Laboratory of Computer Architecture |
And 11 more authors.
International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS | Year: 2015
This paper presents PARD, a programmable architecture for resourcing-on-demand that provides a new programming interface to convey an application's high-level information, such as quality-of-service requirements, to the hardware. PARD enables new functionalities like fully hardware-supported virtualization and differentiated services in computers. PARD is inspired by the observation that a computer is inherently a network in which hardware components communicate via packets (e.g., over the NoC or PCIe). We apply principles of software-defined networking to this intra-computer network and address three major challenges. First, to bridge the semantic gap between high-level applications and underlying hardware packets, PARD attaches a high-level semantic tag (e.g., a virtual machine or thread ID) to each memory-access, I/O, or interrupt packet. Second, to make hardware components more manageable, PARD implements programmable control planes that can be integrated into various shared resources (e.g., cache, DRAM, and I/O devices) and can differentially process packets according to tag-based rules. Third, to facilitate programming, PARD abstracts all control planes as a device file tree, providing a uniform programming interface via which users create and apply tag-based rules. Full-system simulation results show that by co-locating latency-critical memcached applications with other workloads, PARD can improve a four-core computer's CPU utilization by up to a factor of four without significantly increasing tail latency. FPGA emulation based on a preliminary RTL implementation demonstrates that the cache control plane introduces no extra latency and that the memory control plane can reduce queueing delay for high-priority memory-access requests by up to a factor of 5.6. Copyright © 2015 ACM.
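The control-plane idea can be made concrete with a small model: a table of tag-based rules attached to a shared resource, consulted for every packet. The class, method names, and priority-based rule format below are illustrative assumptions, not PARD's actual hardware interface.

```python
# Hypothetical software model of a PARD-style control plane: packets carry
# a semantic tag (e.g., a VM or thread ID); the control plane looks up the
# tag and differentiates processing (here, by assigning a priority).

class ControlPlane:
    def __init__(self, default_priority=0):
        self.rules = {}                  # tag -> priority rule
        self.default = default_priority

    def program(self, tag, priority):
        """Install a tag-based rule (the uniform programming interface)."""
        self.rules[tag] = priority

    def classify(self, packet):
        """packet: (tag, payload). Return (priority, payload) for the
        downstream resource (cache, memory controller, I/O device)."""
        tag, payload = packet
        return self.rules.get(tag, self.default), payload

cp = ControlPlane()
cp.program("vm1", 10)                    # latency-critical VM: high priority
print(cp.classify(("vm1", "mem-read 0x1000")))   # -> (10, 'mem-read 0x1000')
print(cp.classify(("vm2", "mem-read 0x2000")))   # -> (0, 'mem-read 0x2000')
```

In hardware, the equivalent lookup happens per packet in the resource's datapath; the device-file-tree abstraction the paper describes is what lets software call something like `program` on each control plane.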
Li B.,State Key Laboratory of Computer Architecture |
Li B.,CAS Institute of Computing Technology |
Li B.,University of Chinese Academy of Sciences |
Shan S.,State Key Laboratory of Computer Architecture |
And 8 more authors.
Jisuanji Fuzhu Sheji Yu Tuxingxue Xuebao/Journal of Computer-Aided Design and Computer Graphics | Year: 2014
An address-mapping design that supports the adjustment of hybrid ECC codes is proposed to achieve a tradeoff between reliability and cost. We observe that, for ECC codes, the length of the check information is proportional to the length of the data, and based on this relationship we propose an address-mapping logic. Since the data and the check information are stored separately, whenever the memory is accessed or the fault-tolerance level is changed, the address-mapping logic guarantees access to the check information and facilitates adjusting the fault-tolerance level. The experimental results indicate that this method is effective, with negligible performance and power cost.
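Because the check-information length is proportional to the data length for a given ECC level, a data address can be mapped to its check-information address with a per-level scale factor and a base offset into the separate check region. The sketch below illustrates that proportional mapping; the ECC levels, ratios, and region layout are assumptions, not the paper's exact design.

```python
# Illustrative sketch of proportional data-to-check address mapping.
# Ratios (check bytes per data byte) are assumed example values.

ECC_RATIO = {
    "none":     0,        # no check information stored
    "sec-ded":  1 / 8,    # e.g. 8 check bits per 64 data bits
    "chipkill": 1 / 4,    # stronger code, more check information
}

def check_address(data_addr, data_base, check_base, level):
    """Map a data address to the address of its check information in a
    separately stored check region, scaled by the fault-tolerance level."""
    offset = data_addr - data_base
    return check_base + int(offset * ECC_RATIO[level])

# 0x40 bytes into the data region -> 0x40 / 8 = 8 bytes into the check region
print(hex(check_address(0x1040, 0x1000, 0x8000, "sec-ded")))   # -> 0x8008
```

Changing the fault-tolerance level then only changes the scale factor, which is what lets the mapping logic support adjustment without relocating the data itself.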
Chen T.,State Key Laboratory of Computer Architecture |
Du Z.,State Key Laboratory of Computer Architecture |
Sun N.,State Key Laboratory of Computer Architecture |
Wang J.,State Key Laboratory of Computer Architecture |
And 3 more authors.
IEEE Micro | Year: 2015
Machine-learning tasks are becoming pervasive in a broad range of domains and systems (from embedded systems to datacenters). Recent advances in machine learning show that neural networks are the state of the art across many applications. As architectures evolve toward heterogeneous multicores comprising a mix of cores and accelerators, a neural network accelerator can achieve the rare combination of efficiency (due to the small number of target algorithms) and broad application scope. Until now, most machine-learning accelerator designs have focused on efficiently implementing the computational part of the algorithms. However, recent state-of-the-art neural networks are characterized by their large size. The authors designed an accelerator architecture for large-scale neural networks, with a special emphasis on the impact of memory on accelerator design, performance, and energy. In this article, they present a concrete design at 65 nm that can perform 496 16-bit fixed-point operations in parallel every 1.02 ns, that is, 452 GOP/s, in a 3.02 mm², 485-mW footprint (excluding main memory accesses). © 1981-2012 IEEE.
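The 16-bit fixed-point operations such an accelerator performs are multiply-accumulates over neuron inputs and weights. The sketch below shows one such dot product in software, assuming a Q8.8 format (8 integer, 8 fractional bits); the format choice is an illustration, not the accelerator's documented encoding.

```python
# Illustrative 16-bit fixed-point multiply-accumulate (Q8.8 assumed):
# the elementary operation a neural network accelerator runs in parallel.

FRAC_BITS = 8                       # Q8.8: 8 integer bits, 8 fractional bits

def to_fixed(x):
    """Convert a float to Q8.8 fixed point."""
    return int(round(x * (1 << FRAC_BITS)))

def fixed_mul(a, b):
    """16x16 -> 32-bit product, rescaled back to Q8.8."""
    return (a * b) >> FRAC_BITS

def neuron(inputs, weights, bias=0.0):
    """One neuron's dot product, computed entirely in fixed point,
    returned as a float for inspection."""
    acc = to_fixed(bias)
    for x, w in zip(inputs, weights):
        acc += fixed_mul(to_fixed(x), to_fixed(w))
    return acc / (1 << FRAC_BITS)

print(neuron([1.0, 2.0], [0.5, 0.25]))   # -> 1.0
```

A hardware design performs hundreds of these multiply-accumulates per cycle; the article's emphasis is that for large networks the limiting factor is feeding them from memory, not the arithmetic itself.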