Lin C.,Hefei University of Technology | Lin C.,Anhui Province Key Laboratory of Computing and Communication Software | Gu N.,Hefei University of Technology | Gu N.,Anhui Province Key Laboratory of Computing and Communication Software | And 2 more authors.
Journal of University of Science and Technology of China | Year: 2011

A general method for identifying SIMD instructions is presented, targeting the characteristics of digital signal processing applications. A new cluster assignment algorithm and a new register allocation algorithm are then given to suit SIMD instructions on clustered architectures. Finally, these compiler optimization methods were implemented on BWDSP100, with satisfactory results.


Peng J.,Hefei University of Technology | Peng J.,Anhui Province Key Laboratory of Computing and Communication Software | Gu N.,Hefei University of Technology | Gu N.,Anhui Province Key Laboratory of Computing and Communication Software | And 6 more authors.
Journal of University of Science and Technology of China | Year: 2013

The Linux 2.6 load balancing algorithm, which operates on scheduling domains, supports CMP, CMT, SMP, and NUMA architectures. For CMT, the algorithm tries to assign a new process to the idlest CPU of the idlest core, and if the first CPU of a core is comparatively idle, it periodically tries to pull a moderate number of tasks from the busiest CPU of the core to balance the system workload. Under certain circumstances, this strategy makes the system more unbalanced. To address these defects, the algorithm can be adjusted in two ways: the idlest CPU of the entire system should be selected for a new process, and the idlest CPU of a scheduling domain can periodically move tasks from the scheduling domain to itself. After applying this optimization, the system's performance improves by as much as 8% in HackBench testing on an 8-core, 32-thread XLR532 processor.
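
The difference between the two placement policies described in this abstract can be illustrated with a toy model. This is only a sketch under stated assumptions: the topology is a plain dictionary mapping a core id to the run-queue lengths of its hardware threads, a core's load is taken to be the sum of its threads' loads, and the function names are hypothetical. The real kernel works on scheduling domains and groups with weighted load, which is not modeled here.

```python
from typing import Dict, List, Tuple

# Assumed toy topology: core id -> run-queue length of each hardware thread.
Topology = Dict[int, List[int]]


def idlest_cpu_of_idlest_core(topo: Topology) -> Tuple[int, int]:
    """Original policy: pick the least-loaded core, then its least-loaded thread."""
    core = min(topo, key=lambda c: sum(topo[c]))
    loads = topo[core]
    return core, loads.index(min(loads))


def idlest_cpu_global(topo: Topology) -> Tuple[int, int]:
    """Adjusted policy: pick the least-loaded hardware thread in the whole system."""
    candidates = ((c, t) for c in topo for t in range(len(topo[c])))
    return min(candidates, key=lambda ct: topo[ct[0]][ct[1]])


# Core 0 has one idle thread and one busy thread; core 1 is evenly loaded.
topo = {0: [0, 5], 1: [2, 2]}
print(idlest_cpu_of_idlest_core(topo))  # lands on a thread that already has 2 tasks
print(idlest_cpu_global(topo))          # lands on the completely idle thread
```

In this example the two-level choice skips an idle hardware thread because its sibling is busy, which is one way a placement decision could leave the system less balanced.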


Zhang X.,Hefei University of Technology | Zhang X.,Anhui Province Key Laboratory of Computing and Communication Software | Gu N.,Hefei University of Technology | Gu N.,Anhui Province Key Laboratory of Computing and Communication Software | And 4 more authors.
Proceedings - 2016 International Conference on Networking and Network Applications, NaNA 2016 | Year: 2016

Due to its conservative additive-increase, multiplicative-decrease strategy, traditional TCP has a severe problem utilizing bandwidth over high-speed long-distance networks. This paper proposes a delay-based congestion control algorithm, named DFTCP, for these networks. DFTCP calculates a reasonable equilibrium point for a network flow based on an estimate of the maximum size of available buffers along its network path, and adjusts the congestion window size according to the distance between the equilibrium point and the current network state. The experimental results demonstrate that the new algorithm performs well in terms of bandwidth utilization, intra-protocol fairness and RTT fairness, and exhibits reasonable friendliness to standard TCP. © 2016 IEEE.
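
To make the bandwidth-utilization problem concrete, the following sketch models only the standard additive-increase, multiplicative-decrease rule that the abstract criticizes. It is not DFTCP, whose update rule is not given here. The window is measured in packets, and the parameters `alpha` (increase per RTT) and `beta` (decrease factor) use the conventional values 1 and 0.5 as assumptions.

```python
def aimd_step(cwnd: float, loss: bool, alpha: float = 1.0, beta: float = 0.5) -> float:
    """One RTT of AIMD: add alpha on success, multiply by beta on loss."""
    if loss:
        return max(1.0, cwnd * beta)
    return cwnd + alpha


def rtts_to_recover(target: float) -> int:
    """Count loss-free RTTs needed to regain `target` after a single loss."""
    cwnd = aimd_step(target, loss=True)
    rtts = 0
    while cwnd < target:
        cwnd = aimd_step(cwnd, loss=False)
        rtts += 1
    return rtts


# A window of 10 packets recovers quickly; a window of 10,000 packets does not.
print(rtts_to_recover(10))
print(rtts_to_recover(10_000))
```

Because recovery time grows linearly with the window, a path with a large bandwidth-delay product spends thousands of round trips below capacity after each loss.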


Cao H.,Hefei University of Technology | Gu N.,Hefei University of Technology | Ren K.,Anhui Province Key Laboratory of Computing and Communication Software | Li Y.,Hefei University of Technology
Proceedings of the 2015 Federated Conference on Computer Science and Information Systems, FedCSIS 2015 | Year: 2015

This paper presents a performance study of CPython's latest interpreter, concluding that bytecode dispatching takes about 25 percent of total execution time on average. Based on this observation, a novel bytecode dispatching mechanism is proposed to minimize the time spent in this phase. With this mechanism, the blocks associated with each kind of bytecode are rewritten in hand-tuned assembly, their opcodes are renumbered, and their memory spaces are rescheduled. With these preparations, the new bytecode dispatching mechanism replaces time-consuming memory reads with fast register operations. The mechanism is implemented in CPython-3.3.0. Experiments on a large set of benchmarks demonstrate its correctness and efficiency. A comparison between the original and the optimized CPython shows that the new mechanism achieves a performance improvement of about 8.5 percent on average. For some particular benchmarks, the improvement reaches 18 percent. © 2015, IEEE.


Yang Y.,Hefei University of Technology | Yang Y.,Anhui Province Key Laboratory of Computing and Communication Software | Gu N.,Hefei University of Technology | Gu N.,Anhui Province Key Laboratory of Computing and Communication Software | And 4 more authors.
Journal of Computational Information Systems | Year: 2014

Modern Very Long Instruction Word (VLIW) DSPs improve performance by exploiting instruction-level parallelism (ILP) and obtain better parallelism through instruction clustering. The essence of clustering is resource allocation. Traditional clustering methods assume that an instruction can be assigned to only one cluster, which does not apply to architectures with SIMD structure. To address this issue, this paper presents IPRAR, a novel DFG-based instruction clustering algorithm tailored to the features of SIMD structure. Because it accounts for these features, IPRAR can be implemented on BWDSP100 with satisfactory results. In addition, IPRAR balances the load across clusters and minimizes inter-cluster communication overhead. Finally, numerous experimental results show that IPRAR achieves an average performance improvement of 7-10% over traditional methods. © 2014 Binary Information Press.


Zhao Z.,Hefei University of Technology | Zhao Z.,Anhui Province Key Laboratory of Computing and Communication Software | Gu N.,Hefei University of Technology | Gu N.,Anhui Province Key Laboratory of Computing and Communication Software | And 4 more authors.
Journal of Computational Information Systems | Year: 2014

Basic Linear Algebra Subprograms (BLAS) is a widely used basic mathematical library for high-performance computing, and it has a great impact on the performance of supercomputers. This paper focuses on memory access optimization for the level-2 BLAS library (BLAS2) on Godson-3B chips. The key contribution of this work is a set of methods that make full use of the Godson-3B's hardware features and improve memory access performance. GEMV, the kernel function of BLAS2, is selected to demonstrate the optimization methods for Godson-3B. Using the Single Instruction Multiple Data (SIMD) instructions, Direct Register Access (DRA) and Direct Cache Access (DCA) of Godson-3B, two optimization algorithms for GEMV are introduced. Experiments show that the best computing performance of the optimized GEMVs on Godson-3B is more than 1.8 Gops, which is far superior to that of other versions. When the test scale of the GEMVs exceeds the size of the L2 cache, their average performance exceeds that of UBLAS and OpenBLAS, the two best-known optimized BLAS libraries for Godson, by more than 60% and 20% respectively. © 2014 Binary Information Press.
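
For readers unfamiliar with the kernel, GEMV computes y ← αAx + βy for a matrix A and vectors x and y. The following is a plain reference sketch of that operation, assuming a dense row-major matrix given as nested sequences. It shows only what the routine computes and has nothing to do with the SIMD, DRA or DCA optimizations the abstract describes.

```python
from typing import List, Sequence


def gemv(
    alpha: float,
    a: Sequence[Sequence[float]],
    x: Sequence[float],
    beta: float,
    y: Sequence[float],
) -> List[float]:
    """Reference GEMV: return alpha * A @ x + beta * y without modifying y."""
    if len(a) != len(y) or any(len(row) != len(x) for row in a):
        raise ValueError("dimension mismatch between A, x and y")
    return [
        alpha * sum(a_ij * x_j for a_ij, x_j in zip(row, x)) + beta * y_i
        for row, y_i in zip(a, y)
    ]


a = [[1.0, 2.0], [3.0, 4.0]]
print(gemv(2.0, a, [1.0, 1.0], 1.0, [10.0, 20.0]))  # [16.0, 34.0]
```

Each row of A is read once and each element takes part in a single multiply-add, so the routine is bound by memory traffic rather than arithmetic, which is why the abstract concentrates on memory access.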


Zhang R.,Hefei University of Technology | Zhang R.,Anhui Province Key Laboratory of Computing and Communication Software | Huang Z.,Hefei University of Technology | Huang Z.,Anhui Province Key Laboratory of Computing and Communication Software | And 2 more authors.
Journal of Information and Computational Science | Year: 2015

Active contour models have been widely applied in image segmentation with promising results. However, many of them fail to segment images successfully due to intensity inhomogeneity. To address this problem, we propose a hybrid region-based active contour model which fully utilizes local intensity information. This model develops a localized active contour framework and incorporates a local Gaussian distribution fitting energy. In order to increase both the computational efficiency and stability of the proposed model, a multi-resolution strategy is employed and the active contour is evolved in a reinitialization-free way. Extensive experiments demonstrate that our model is effective and efficient in segmenting images with intensity inhomogeneity, and outperforms several state-of-the-art algorithms. © 2015 by Binary Information Press.


Zhant K.,Hefei University of Technology | Zhant K.,Anhui Province Key Laboratory of Computing and Communication Software
Journal of University of Science and Technology of China | Year: 2013

In Web applications, eliminating redundancy in downloaded data can improve responsiveness. Schemes proposed so far do not exploit HTML structure, leading to higher cache miss rates. A novel Web caching scheme called structure-based adaptive Web caching (SBAWC) is proposed, which exploits HTML structure and semantics to achieve better redundancy elimination in Web applications. SBAWC differentiates stable and unstable structures and caches only stable structures, resulting in higher cache hit rates. Experimental results on real-world Web pages show that SBAWC is significantly more efficient at eliminating redundancy in the communications of Web applications than existing schemes.


Lin C.,Hefei University of Technology | Lin C.,Anhui Province Key Laboratory of Computing and Communication Software | Gu N.,Hefei University of Technology | Gu N.,Anhui Province Key Laboratory of Computing and Communication Software | Cai S.,CAS Institute of Computing Technology
Journal of University of Science and Technology of China | Year: 2013

The JVM uses a just-in-time (JIT) compiler to improve its performance. The JIT compiles a method once it has been invoked a given number of times, and the JVM executes the compiled version the next time the method is invoked. A cache locking mechanism allows the JVM to lock compiled methods in the cache, which can improve cache performance because the JVM executes compiled methods frequently. By analyzing how compiled methods are called in the JVM, their call distribution, average size, and memory distribution can be obtained. Based on these calling patterns, a dynamic cache locking optimization algorithm is proposed that locks the active compiled methods in the cache and reduces the cache miss rate when the JVM executes them. The algorithm has been implemented in HotSpot on Loongson-3A. Experimental results show that, on average, the dynamic cache locking algorithm reduces the cache miss rate by 8.5% and improves performance by 4% on the SPECjvm2008 benchmark.


He S.,University of Science and Technology of China | He S.,Anhui Province Key Laboratory of Computing and Communication Software | Gu N.,University of Science and Technology of China | Gu N.,Anhui Province Key Laboratory of Computing and Communication Software | And 4 more authors.
2011 International Conference on Electronics, Communications and Control, ICECC 2011 - Proceedings | Year: 2011

Multi-core sorting is one of the research focuses in the field of parallel algorithms, and it is of great significance for improving the performance of multi-core systems and promoting their application. PSRS is a general multi-core sorting algorithm, but it suffers from low portability and poor performance because it does not take architectural characteristics into account. This paper presents an optimized method that combines architectural features of multi-core systems with optimization techniques commonly used in sorting. It adopts middle-element selection by binary search, a loser tree, and CPU affinity setting to solve the problems of fast location in multi-position binary search and of binding multi-way merges to processes. Experimental results show that the improved PSRS is broadly applicable, achieving a speedup of more than 2.69 on three 4-core systems. © 2011 IEEE.
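
The phases of baseline PSRS (Parallel Sorting by Regular Sampling) can be sketched as a sequential simulation. This is a minimal illustration of the textbook algorithm, not the optimized version from the abstract. The worker count `p` is simulated with a loop rather than real threads, no CPU affinity is set, and `heapq.merge` stands in for the loser tree as the k-way merge. The function name and the fallback to `sorted` for small inputs are assumptions made for the example.

```python
import heapq
from bisect import bisect_right
from typing import List


def psrs_sort(data: List[int], p: int = 4) -> List[int]:
    """Sequential simulation of PSRS with p simulated workers."""
    if p < 1:
        raise ValueError("p must be at least 1")
    n = len(data)
    if n < p * p:
        return sorted(data)  # too small for regular sampling to be useful

    # Phase 1: split the input into p chunks and sort each chunk locally.
    size = -(-n // p)  # ceiling division
    chunks = [sorted(data[i:i + size]) for i in range(0, n, size)]
    p = len(chunks)

    # Phase 2: take p regularly spaced samples from every sorted chunk.
    samples = sorted(
        chunk[(j * len(chunk)) // p] for chunk in chunks for j in range(p)
    )

    # Phase 3: choose p - 1 pivots from the sorted samples.
    pivots = [samples[j * p + p // 2 - 1] for j in range(1, p)]

    # Phase 4: cut each sorted chunk at the pivots using binary search.
    parts: List[List[List[int]]] = [[] for _ in range(p)]
    for chunk in chunks:
        cuts = [0] + [bisect_right(chunk, pv) for pv in pivots] + [len(chunk)]
        for k in range(p):
            parts[k].append(chunk[cuts[k]:cuts[k + 1]])

    # Phase 5: each worker k-way merges the sorted runs it received.
    result: List[int] = []
    for runs in parts:
        result.extend(heapq.merge(*runs))
    return result


data = [(i * 37) % 101 for i in range(200)]
print(psrs_sort(data, p=4) == sorted(data))  # True
```

In a real multi-core implementation, phases 1, 4 and 5 run concurrently on the p workers, and phases 4 and 5 are where the binary search and merge optimizations named in the abstract would apply.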
