State Key Laboratory of High Performance Computing

Changsha, China


Jiang J.,National University of Defense Technology | Chen L.,National University of Defense Technology | Wu X.,National University of Defense Technology | Wang J.,National University of Defense Technology | Wang J.,State Key Laboratory of High Performance Computing
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) | Year: 2017

Statement-wise abstract interpretation, which calculates the abstract semantics of a program statement by statement, is scalable but may lose precision because of the limited local information attached to each statement. While Satisfiability Modulo Theories (SMT) formulas can precisely characterize the semantics of a loop-free program fragment, analyzing loops efficiently with plain SMT formulas is challenging. In this paper, we propose a block-wise abstract interpretation framework that analyzes a program block by block, combining abstract domains with SMT. We first partition a program into blocks and encode the transfer semantics of each block as an SMT formula; at the exit of a block, we abstract the SMT formula that encodes the block's post-state w.r.t. a given pre-state into an element of a chosen abstract domain. We leverage the widening operator of abstract domains to handle loops. We then design a disjunctive lifting functor on top of abstract domains to represent and transmit useful disjunctive information between blocks. Furthermore, we exploit sparsity inside large blocks to improve the efficiency of the analysis. We have developed a prototype based on block-wise abstract interpretation and conducted experiments on benchmarks from SV-COMP 2015. Experimental results show that block-wise analysis can check about twice as many properties as statement-wise analysis. © Springer International Publishing AG 2017.
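The widening step the abstract relies on can be illustrated with a minimal interval domain (a toy sketch only, with hypothetical names; the paper's actual framework works on SMT-encoded blocks and richer domains):

```python
# Minimal interval domain with widening, showing how a loop analysis
# is forced to converge (a sketch, not the paper's tool).

def join(a, b):
    """Least upper bound of two intervals (lo, hi)."""
    return (min(a[0], b[0]), max(a[1], b[1]))

def widen(a, b):
    """Classic interval widening: unstable bounds jump to infinity."""
    lo = a[0] if a[0] <= b[0] else float("-inf")
    hi = a[1] if a[1] >= b[1] else float("inf")
    return (lo, hi)

def analyze_loop(init, transfer):
    """Iterate `transfer` from `init`, widening until a fixpoint."""
    state = init
    while True:
        nxt = widen(state, join(state, transfer(state)))
        if nxt == state:
            return state
        state = nxt

# Loop body "i = i + 1" over the interval for i, starting from i = 0:
# the upper bound is unstable, so widening sends it to infinity.
inc = lambda iv: (iv[0] + 1, iv[1] + 1)
print(analyze_loop((0, 0), inc))
```

Without widening, the iteration (0,0), (0,1), (0,2), … would never terminate; widening trades precision for a guaranteed fixpoint, which is exactly why the framework applies it at loop heads.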


Mao H.,State Key Laboratory of High Performance Computing | Mao H.,National University of Defense Technology | Zhang H.,State Key Laboratory of High Performance Computing | Zhang H.,National University of Defense Technology | And 7 more authors.
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) | Year: 2012

In daily life, people increasingly use multiple machines for their work. Because platform switching and file modification happen so frequently, a mechanism for file synchronization across multiple machines is needed to keep files consistent. In this paper, we propose EaSync, a transparent file synchronization service across multiple machines. EaSync introduces several key technologies for a synchronization-oriented service, including a timestamp-based synchronization protocol and an enhanced deduplication algorithm, DS-Dedup. We implement and evaluate an EaSync prototype system. The results show that EaSync outperforms other synchronization systems in operation latency and other metrics. © IFIP International Federation for Information Processing 2012.
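The deduplication idea can be sketched with content hashing over fixed-size chunks (the abstract does not specify how DS-Dedup works, so this is a generic illustration with hypothetical names):

```python
# Generic chunk-level deduplication: identical chunks are stored once,
# keyed by their SHA-256 digest (a sketch; not EaSync's actual DS-Dedup).
import hashlib

CHUNK_SIZE = 4  # tiny for demonstration; real systems use KB-sized chunks

def dedup_store(data, store):
    """Split `data` into chunks, store unseen chunks, return the recipe."""
    recipe = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)   # store each unique chunk once
        recipe.append(digest)
    return recipe

def restore(recipe, store):
    """Rebuild the original bytes from the recipe and the chunk store."""
    return b"".join(store[d] for d in recipe)

store = {}
recipe = dedup_store(b"abcdabcdabcdxyz", store)
assert restore(recipe, store) == b"abcdabcdabcdxyz"
print(len(store))  # only 2 unique chunks stored ("abcd" and "xyz")
```

During synchronization, only chunk digests the peer does not already hold need to be transferred, which is what makes deduplication pay off in sync traffic.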


Huang D.,National University of Defense Technology | Huang D.,State Key Laboratory of High Performance Computing | Wen M.,National University of Defense Technology | Wen M.,State Key Laboratory of High Performance Computing | And 11 more authors.
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) | Year: 2014

When adapting GPU-specific OpenCL kernels to run on multi-core/many-core CPUs, coarsening the thread granularity is necessary and thus extensively used. However, locality optimizations embedded in GPU-specific OpenCL code are usually inherited without analysis, which may hurt CPU performance. When GPU-specific kernels execute on CPUs, local-memory arrays no longer match the hardware well, and the associated synchronizations are costly. To resolve this dilemma, we analyze memory access patterns using array-access descriptors derived from the GPU-specific kernels, so that the kernels can be adapted for CPUs by removing the unwanted local-memory arrays together with the obsolete barrier statements. Experiments show that the automated transformation satisfactorily improves OpenCL kernel performance on a Sandy Bridge CPU and on Intel's Many-Integrated-Core coprocessor. © 2014 Springer International Publishing Switzerland.
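Thread-granularity coarsening, the transformation the abstract builds on, can be sketched in plain Python: a per-work-item kernel is wrapped in a loop so one CPU worker executes a whole work-group serially (hypothetical names; the real tools transform OpenCL C, not Python):

```python
# Sketch of thread coarsening: instead of launching one "thread" per
# work-item (the GPU style), one CPU worker iterates over all work-items
# of a group, and barrier-free kernels then need no synchronization.

def kernel(gid, src, dst):
    """A per-work-item kernel body: dst[gid] = src[gid] * 2."""
    dst[gid] = src[gid] * 2

def coarsened(group_range, src, dst):
    """The coarsened version: one serial loop replaces many work-items."""
    for gid in group_range:
        kernel(gid, src, dst)

src = list(range(8))
dst = [0] * 8
coarsened(range(8), src, dst)
print(dst)  # [0, 2, 4, 6, 8, 10, 12, 14]
```

Once work-items run serially in one thread, work-group barriers and staging through local-memory arrays become pointless overhead, which is why the paper's transformation removes them.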


Xu W.,State Key Laboratory of High Performance Computing | Xu W.,National University of Defense Technology | Lu Y.,State Key Laboratory of High Performance Computing | Lu Y.,National University of Defense Technology | And 14 more authors.
Frontiers of Computer Science | Year: 2014

With the rapid improvement of computation capability in high-performance supercomputer systems, the performance imbalance between the computation subsystem and the storage subsystem has become increasingly serious, especially as applications produce big data ranging from tens of gigabytes up to terabytes. To narrow this gap, large-scale storage systems must be designed and implemented with high performance and scalability. The MilkyWay-2 (TH-2) supercomputer, with a peak performance of 54.9 Pflops, clearly has such requirements for its storage system. This paper introduces the storage system of the MilkyWay-2 supercomputer, including the hardware architecture and the parallel file system. The storage system exploits a novel hybrid hierarchical storage architecture to enable high scalability in I/O clients, I/O bandwidth, and storage capacity. To fit this architecture, a user-level virtualized file system named H2FS is designed and implemented; it combines local storage and shared storage into a dynamic single namespace to optimize I/O performance for I/O-intensive applications. Evaluation results show that the storage system of the MilkyWay-2 supercomputer satisfies the critical requirements of a large-scale supercomputer, such as performance and scalability. © 2014 Higher Education Press and Springer-Verlag Berlin Heidelberg.
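The single-namespace idea behind H2FS can be illustrated with a toy router that serves reads from local storage when possible and falls back to the shared tier (all names here are hypothetical; the real H2FS is a user-level parallel file system, not a dictionary):

```python
# Toy hybrid namespace: writes land in fast node-local storage and are
# also persisted to shared storage, while reads go through one namespace
# that prefers the local copy (a sketch only, not H2FS itself).

class HybridNamespace:
    def __init__(self):
        self.local = {}    # fast, node-local tier (e.g. local SSDs)
        self.shared = {}   # slower, globally shared tier

    def write(self, path, data):
        self.local[path] = data       # absorb the write locally first
        self.shared[path] = data      # then persist to the shared tier

    def read(self, path):
        # One namespace: prefer the local copy, fall back to shared.
        if path in self.local:
            return self.local[path]
        return self.shared[path]

ns = HybridNamespace()
ns.write("/out/result.dat", b"payload")
ns.local.clear()                       # simulate local-tier eviction
print(ns.read("/out/result.dat"))      # still readable via the shared tier
```

The point of the hierarchy is that I/O-intensive jobs hit the fast local tier, while the shared tier keeps every file reachable from any node under the same path.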


Mao H.,National University of Defense Technology | Mao H.,State Key Laboratory of High Performance Computing | Xiao N.,National University of Defense Technology | Xiao N.,State Key Laboratory of High Performance Computing | And 2 more authors.
Communications in Computer and Information Science | Year: 2012

For convenience, cloud storage services are now commonly used in daily life. However, these services sometimes suffer from availability problems: they may not be accessible in a timely fashion for various reasons (e.g., service outages or blocked network connections), which reduces the availability of cloud services. To overcome the problems and challenges of backing up data on cloud services, we propose a new storage architecture, RAIC, which treats each cloud storage service like a disk and combines them into a RAID-like system that provides users with a highly available storage service. We have designed and implemented a prototype system for RAIC. Our evaluation shows that RAIC performs efficiently: upload performance reaches about 90.6% of the ideal upload bandwidth, and download performance reaches 74.2%. © 2012 Springer-Verlag.
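The RAID-like layout over cloud providers can be sketched with XOR parity: data is striped across providers, and any single missing provider can be reconstructed from the rest (a toy sketch assuming RAID-4-style parity, which the abstract does not actually specify):

```python
# Striping with XOR parity across "cloud" backends: losing any single
# backend is recoverable (a sketch; RAIC's exact layout is not specified).

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def store(stripes, clouds):
    """Place data stripes on clouds[:-1] and XOR parity on clouds[-1]."""
    parity = stripes[0]
    for s in stripes[1:]:
        parity = xor_bytes(parity, s)
    for cloud, s in zip(clouds, stripes + [parity]):
        cloud["stripe"] = s

def recover(clouds, lost):
    """Rebuild the stripe of the lost cloud from all surviving ones."""
    survivors = [c["stripe"] for i, c in enumerate(clouds) if i != lost]
    out = survivors[0]
    for s in survivors[1:]:
        out = xor_bytes(out, s)
    return out

clouds = [{}, {}, {}]                    # two data clouds + one parity cloud
store([b"AAAA", b"BBBB"], clouds)
print(recover(clouds, 0))                # reconstructs b"AAAA"
```

With this layout a single unreachable provider costs only a parity reconstruction rather than data loss, which is the availability property RAIC is after.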


Shen L.,State Key Laboratory of High Performance Computing | Shen L.,National University of Defense Technology | Xu F.,State Key Laboratory of High Performance Computing | Xu F.,National University of Defense Technology | And 2 more authors.
Journal of Computer Science and Technology | Year: 2016

Thread-level speculation provides not only a simple parallel programming model but also an effective mechanism for exploiting thread-level parallelism. The performance of software speculative parallel models is limited by high global overheads caused by different types of loops, which usually exhibit different dependence characteristics and require different optimization strategies. In this paper, we propose three comprehensive optimization techniques that reduce different components of the global overhead, targeting the requirements of different loop types. Inter-thread fetching reduces the high mis-speculation rate of loops with frequent dependences; out-of-order committing reduces the control overhead of loops with infrequent dependences; and enhanced dynamic task granularity resizing reduces the control overhead and optimizes the global overhead of loops whose dependence characteristics change over time. All three optimization techniques have been implemented in HEUSPEC, a software TLS system. Experimental results indicate that they satisfy the demands of different groups of benchmarks, and that their combination improves the performance of all benchmarks and reaches a higher average speedup. © 2016, Springer Science+Business Media New York.
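One of the three techniques, dynamic task granularity resizing, can be sketched as a simple feedback controller over the observed mis-speculation rate (a toy illustration; HEUSPEC's actual policy is more elaborate, and the thresholds below are made up):

```python
# Toy granularity controller: grow the speculative task size while
# mis-speculation stays rare, shrink it when conflicts become frequent.
# (Illustrative only; not HEUSPEC's real resizing policy.)

def resize(granularity, misspec_rate, lo=0.05, hi=0.20):
    """Return the next task granularity given the last window's rate."""
    if misspec_rate > hi:
        return max(1, granularity // 2)    # frequent conflicts: shrink
    if misspec_rate < lo:
        return granularity * 2             # almost no conflicts: grow
    return granularity                     # in the sweet spot: keep

g = 8
for rate in [0.01, 0.01, 0.30, 0.10]:
    g = resize(g, rate)
print(g)  # 8 -> 16 -> 32 -> 16 -> 16
```

Larger tasks amortize the per-task control overhead, while smaller tasks waste less work on a squash, so adapting the size to the observed conflict rate addresses both overhead sources the abstract names.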


Wang C.,National University of Defense Technology | Wang C.,State Key Laboratory of High Performance Computing | Lu Y.,National University of Defense Technology | Lu Y.,State Key Laboratory of High Performance Computing | And 4 more authors.
Proceedings of 2015 IEEE International Conference on Computer and Communications, ICCC 2015 | Year: 2015

Graph traversal is a widely used algorithm in a variety of fields, including social networks, business analytics, and high-performance computing. Graph traversal on single nodes has been well studied and optimized on modern CPU architectures. Heterogeneous computing is now becoming increasingly popular, and CPU+MIC is a typical heterogeneous architecture. The Intel MIC (Many Integrated Core) coprocessor has up to 57 cores but has not been fully evaluated for graph traversal. When a MIC is used to traverse a graph, it may suffer from load imbalance because vertex degrees can differ greatly, which degrades system performance. In this paper, we present an algorithmic design and optimization techniques for load balancing on the MIC. The main idea of the optimization is to treat high-degree and low-degree vertices separately, which requires adjustments to existing algorithms and data structures. As shown in Section VI, the resulting algorithm achieves large performance improvements over a BFS implementation without load balancing on the MIC. We believe this algorithm can be successfully applied to a broader class of graph algorithms on many MIC cores. © 2015 IEEE.
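The degree-splitting idea can be sketched by partitioning each BFS frontier into high-degree and low-degree vertices, so that the two groups could be scheduled with different strategies (a serial Python sketch; the threshold and names are hypothetical, and on a real MIC the two groups would run with different parallelization schemes):

```python
# BFS with the frontier split by degree, mimicking the idea of handling
# hub vertices separately from low-degree ones (serial toy sketch).

def bfs_degree_split(adj, source, threshold=2):
    dist = {source: 0}
    frontier = [source]
    while frontier:
        # Split: hubs would get one scheduling strategy, the rest another.
        hubs = [v for v in frontier if len(adj[v]) >= threshold]
        rest = [v for v in frontier if len(adj[v]) < threshold]
        nxt = []
        for v in hubs + rest:            # here both groups run serially
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    nxt.append(w)
        frontier = nxt
    return dist

adj = {0: [1, 2, 3], 1: [0], 2: [0, 3], 3: [0, 2, 4], 4: [3]}
print(bfs_degree_split(adj, 0))  # {0: 0, 1: 1, 2: 1, 3: 1, 4: 2}
```

The payoff on a many-core device comes from expanding each hub's edge list with many threads while batching the numerous low-degree vertices per thread, so no core stalls on one huge adjacency list.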


Cheng P.,State Key Laboratory of High Performance Computing | Cheng P.,National University of Defense Technology | Lu Y.,State Key Laboratory of High Performance Computing | Lu Y.,National University of Defense Technology | And 4 more authors.
Proceedings - 2015 IEEE 12th International Conference on Ubiquitous Intelligence and Computing, 2015 IEEE 12th International Conference on Advanced and Trusted Computing, 2015 IEEE 15th International Conference on Scalable Computing and Communications, 2015 IEEE International Conference on Cloud and Big Data Computing, 2015 IEEE International Conference on Internet of People and Associated Symposia/Workshops, UIC-ATC-ScalCom-CBDCom-IoP 2015 | Year: 2015

Heterogeneous computing has been widely adopted, since accelerators such as the Graphics Processing Unit (GPU) and the Intel Many Integrated Core (MIC) coprocessor can offer an order of magnitude more compute power for arithmetic-intensive data-parallel workloads. However, heterogeneous programming is more complicated because there is no shared memory between the CPU and the MIC: programmers must distinguish local from remote data accesses and transfer data between CPU and MIC explicitly. Furthermore, standard offload programming models such as Intel Language Extensions for Offload (LEO) are restricted to a single compute node, and hence to a limited number of coprocessors, which further complicates the efficient use of MIC for heterogeneous computation. In this paper, we propose CoGA, an extension of Global Arrays (GA) for heterogeneous systems consisting of CPUs and MICs. Our implementation of CoGA is built on top of the Symmetric Communication Interface (SCIF), a sockets-like API for communication between processes on the MIC and the host within the same system. CoGA provides a shared-memory abstraction between CPU and MIC and simplifies programming by allowing programmers to access shared data regardless of where the referenced data is located. Our evaluation of data transmission bandwidth and a sparse matrix-vector multiplication problem shows that CoGA is practical and effective. © 2015 IEEE.
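The location-transparent access that CoGA provides can be sketched as a get/put interface that hides which device owns a given index (a pure-Python toy; SCIF and the real address spaces are simulated with dictionaries, and all names are hypothetical):

```python
# Toy global array: data is distributed across "devices" but accessed
# through location-transparent get/put (a sketch, not CoGA's real API).

class GlobalArray:
    def __init__(self, size, devices):
        self.devices = devices                 # e.g. ["host", "mic0"]
        self.chunk = size // len(devices)      # block distribution
        self.mem = {d: [0] * self.chunk for d in devices}

    def _locate(self, i):
        """Map a global index to (owning device, local offset)."""
        d = self.devices[min(i // self.chunk, len(self.devices) - 1)]
        return d, i - self.devices.index(d) * self.chunk

    def put(self, i, value):
        d, off = self._locate(i)
        self.mem[d][off] = value               # may be a "remote" write

    def get(self, i):
        d, off = self._locate(i)
        return self.mem[d][off]                # caller never names the device

ga = GlobalArray(8, ["host", "mic0"])
ga.put(6, 42)                                  # lands on "mic0" transparently
print(ga.get(6))                               # read back without knowing where
```

In the real system the `_locate` step would resolve to either a local memory access or a SCIF transfer, which is exactly the distinction CoGA hides from the programmer.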


Du J.,State Key Laboratory of Complex Electromagnetic Environment Effects on Electronics and Information System | Du J.,State Key Laboratory of High Performance Computing | Ao F.,State Key Laboratory of Complex Electromagnetic Environment Effects on Electronics and Information System | Sui S.,State Key Laboratory of Complex Electromagnetic Environment Effects on Electronics and Information System | Wang H.,State Key Laboratory of Complex Electromagnetic Environment Effects on Electronics and Information System
Proceedings - 2015 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery, CyberC 2015 | Year: 2015

Real-time synthetic aperture radar (SAR) imaging has recently become a research hotspot in remote sensing and military applications. Because SAR imaging algorithms are both data- and computation-intensive, they are well suited to hybrid storage systems, e.g., a cluster, for performance acceleration. To design a high-performance SAR algorithm, we must first maximize the algorithm's parallelizability, given the multi-level parallelization features of the cluster platform. Focusing on large-scale data, we explore the concurrency characteristics of the SAR imaging algorithm on a hybrid storage system and propose several parallel optimization techniques to accelerate it. Based on this study, we implement a parallel SAR imaging algorithm and evaluate its performance. Experimental results show that the optimized SAR imaging program achieves high network utilization and a clear performance improvement. © 2015 IEEE.
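The coarse-grained data parallelism described above can be sketched by partitioning the raw-data matrix into rows and processing them concurrently (a generic sketch; `focus_row` is a hypothetical stand-in, since the real pipeline runs FFT-based focusing steps):

```python
# Row-wise parallelism over a 2-D data set, standing in for the
# per-row processing stages of a SAR imaging pipeline (a sketch only;
# `focus_row` is a placeholder for the real range-compression step).
from concurrent.futures import ThreadPoolExecutor

def focus_row(row):
    """Placeholder per-row transform (real code would run an FFT here)."""
    return [2 * x for x in row]

def process_rows(matrix, workers=4):
    """Process each row concurrently and reassemble results in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(focus_row, matrix))

raw = [[1, 2], [3, 4], [5, 6]]
print(process_rows(raw))  # [[2, 4], [6, 8], [10, 12]]
```

Because the per-row stages of the pipeline are independent, this embarrassingly parallel decomposition is the natural first level of parallelism before finer, intra-row optimizations are considered.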
