Li C.-S.,Zhejiang University | Huang K.,Zhejiang University | Xiu S.-W.,Zhejiang University | Ma D.,Zhejiang University | And 2 more authors.
Zhejiang Daxue Xuebao (Gongxue Ban)/Journal of Zhejiang University (Engineering Science) | Year: 2011

A two-level pipeline architecture is proposed to reduce the high complexity of the sub-pixel interpolation process in an H.264/AVC decoding system. The first-level pipeline exploits parallelism across the interpolation of different 4×4 blocks, with two stages (fetching a 4×4 block's reference pixels and the interpolation computation), when the four 4×4 blocks inside one 8×8 block share the same motion information. The second-level pipeline accelerates the sub-pixel interpolation of different pixels by exploiting the independence of adjacent half-pixels and the symmetry between the horizontal and vertical interpolation computations. The kernel interpolation unit is implemented with 13 six-tap filters, 4 bilinear interpolation filters, and 4 chroma interpolation filters. The pipelining and parallelism reduce interpolation computation time by at least 75%. Experimental results show that the proposed design reduces external memory bandwidth by 47% and improves sub-pixel interpolation performance by 30% at a lower hardware cost than other designs.
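The six-tap filters mentioned above implement the standard H.264/AVC half-pel interpolation, whose coefficients (1, -5, 20, 20, -5, 1) come from the specification; a minimal software sketch of one such filter tap (the hardware units compute many of these in parallel):

```python
def halfpel_6tap(p):
    """Compute one H.264 half-pel sample from six neighboring
    full-pel samples using the standard (1,-5,20,20,-5,1) filter."""
    assert len(p) == 6
    s = p[0] - 5 * p[1] + 20 * p[2] + 20 * p[3] - 5 * p[4] + p[5]
    # Round, normalize by 32, and clip to the 8-bit pixel range.
    return min(255, max(0, (s + 16) >> 5))
```

A flat region of constant pixels passes through unchanged (the coefficients sum to 32), which is a quick sanity check for any hardware implementation of the filter.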


Ma D.,Hangzhou Dianzi University | Yan R.,CAS Institute of Software | Huang K.,Zhejiang University | Yu M.,Zhejiang University | And 4 more authors.
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | Year: 2013

Efficient design of multiprocessor system-on-chip (MPSoC) requires early, fast, and accurate performance estimation techniques. In this paper, we present new techniques based on fine-grained code analysis to estimate performance accurately during simulation of MPSoC transaction-accurate models. First, a GCC profiling tool is applied in the native simulation process. Based on the profiling result, an instruction analyzer for the target CPU architecture analyzes the cycle cost of the C code under estimation. In addition, a memory analyzer further estimates memory access latency, covering both instruction/data cache time cost and global memory access cycles. Both data and instruction cache models are proposed to estimate cache miss penalty, and a segment-based strategy updates the cache models more efficiently. Furthermore, an equalized access model imitates the memory access behavior of processors to estimate global memory access latency caused by bus contention and memory bandwidth. We have applied these techniques to an H.264 decoder application on different hardware architectures. The experimental results show that these techniques bring the estimation accuracy of transaction-accurate models close to that of virtual prototype models, with a tolerable overhead in simulation speed. © 1982-2012 IEEE.
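The cache-model idea above can be illustrated with a toy sketch, assuming a direct-mapped cache and a fixed per-miss penalty (both simplifications; the paper's models, segment-based updates, and penalty values are not reproduced here):

```python
class DirectMappedCache:
    """Toy direct-mapped cache model: counts hits and misses for an
    address trace, the kind of annotation a transaction-accurate
    simulator could use to add miss penalties to its cycle estimate."""
    def __init__(self, lines=256, line_size=32):
        self.line_size = line_size
        self.num_lines = lines
        self.tags = [None] * lines
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        block = addr // self.line_size
        idx = block % self.num_lines
        tag = block // self.num_lines
        if self.tags[idx] == tag:
            self.hits += 1
        else:
            # Miss: fill the line so later accesses to this block hit.
            self.tags[idx] = tag
            self.misses += 1

def estimate_cycles(trace, base_cycles, miss_penalty=20):
    """Add a hypothetical fixed miss penalty per cache miss to a
    base (profiled) cycle count."""
    cache = DirectMappedCache()
    for addr in trace:
        cache.access(addr)
    return base_cycles + cache.misses * miss_penalty
```

Spatial locality is visible immediately: addresses within one cache line cost a single miss, while conflicting blocks that map to the same line each pay the penalty.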


Huang K.,Zhejiang University | Yu M.,Zhejiang University | Yan R.,CAS Institute of Software | Zhang X.,Zhejiang University | And 4 more authors.
ACM Transactions on Embedded Computing Systems | Year: 2015

Communication frequency is increasing with the growing complexity of emerging embedded applications and the number of processors in the implemented multiprocessor SoC architectures. In this article, we consider reducing communication cost during multithreaded code generation from partitioned Simulink models, to help designers optimize code and improve system performance. We first propose a technique combining message aggregation and communication pipelining, which groups communications with the same destinations and sources and overlaps communication with computation. We also present a method that applies static analysis and dynamic emulation for efficient communication buffer allocation, further reducing synchronization cost and increasing processor utilization. Cyclic dependencies in the mapped model may hinder the effectiveness of these two techniques. We therefore propose a set of optimizations: repartitioning strongly connected threads to maximize the degree of communication reduction, and preprocessing strategies that use the delays available in the model to reduce the number of communication channels that cannot otherwise be optimized. Experimental results demonstrate the advantages of the proposed optimizations, with 11-143% throughput improvement.
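Message aggregation, the first technique above, can be sketched minimally as grouping by channel; the `(src, dst, payload)` tuple format here is a hypothetical stand-in for the generated code's actual channel representation:

```python
from collections import defaultdict

def aggregate_messages(messages):
    """Group messages that share the same (source, destination) pair
    into one batched transfer, so each channel pays its send/receive
    synchronization overhead once instead of once per payload."""
    batches = defaultdict(list)
    for src, dst, payload in messages:
        batches[(src, dst)].append(payload)
    # One aggregated message per channel instead of one per payload.
    return [(src, dst, payloads) for (src, dst), payloads in batches.items()]
```

The throughput benefit comes from amortizing fixed per-message costs (synchronization, channel setup) over the batch, which is why it pairs naturally with pipelining communication against computation.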


Huang K.,Zhejiang University | Ma D.,Zhejiang University | Yan R.-J.,CAS Institute of Software | Ge H.-T.,Hangzhou C Sky Micro System Company | Yan X.-L.,Zhejiang University
Journal of Zhejiang University: Science C | Year: 2013

Context-based adaptive binary arithmetic coding (CABAC) is the major entropy-coding algorithm employed in H.264/AVC. In this paper, we present a new VLSI architecture for an H.264/AVC CABAC decoder, which optimizes both the decode-decision and decode-bypass engines for high throughput, and improves context model allocation for efficient external memory access. Based on the fact that the most probable symbol (MPS) branch is much simpler than the least probable symbol (LPS) branch, a newly organized decode-decision engine consisting of two serially concatenated MPS branches and one LPS branch is proposed to achieve better parallelism at a lower timing-path cost. A look-ahead context index (ctxIdx) calculation mechanism provides the context model for the second MPS branch. A head-zero detector improves the performance of the decode-bypass engine according to UEGk encoding features. In addition, to lower the frequency of memory access, we reorganize the context models in external memory and use three circular buffers to cache the context models, neighboring information, and bit stream, respectively. A pre-fetching mechanism with a prediction scheme loads the corresponding content into a circular buffer to hide external memory latency. Experimental results show that our design operates at 250 MHz with a 20.71k gate count in SMIC18 silicon technology, and achieves an average decoding rate of 1.5 bins/cycle. © 2013 Journal of Zhejiang University Science Editorial Office and Springer-Verlag Berlin Heidelberg.
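The MPS/LPS asymmetry that motivates the two-MPS-branch engine can be seen in a simplified decode step. This sketch uses a fixed hypothetical LPS range (`pLPS`) instead of H.264's table-driven `rangeTabLPS` lookup and state transitions, so it illustrates the branch structure rather than the standard's exact arithmetic:

```python
def decode_bin(state):
    """Simplified renormalizing binary arithmetic decode step.
    `state` holds codIRange, codIOffset, valMPS, and pLPS (a
    hypothetical fixed LPS subrange)."""
    r_lps = state['pLPS']
    r_mps = state['codIRange'] - r_lps
    if state['codIOffset'] < r_mps:
        # MPS branch: only the range shrinks -- short critical path.
        state['codIRange'] = r_mps
        bit = state['valMPS']
    else:
        # LPS branch: offset is rebased and the decoded bit flips --
        # more work, hence the longer timing path in hardware.
        state['codIOffset'] -= r_mps
        state['codIRange'] = r_lps
        bit = 1 - state['valMPS']
    # Renormalize so codIRange stays in [256, 512).
    while state['codIRange'] < 256:
        state['codIRange'] <<= 1
        state['codIOffset'] <<= 1
    return bit
```

Because the MPS branch performs only a subtraction and a comparison, cascading two of them in one cycle (as the paper's engine does) costs far less than cascading an LPS branch.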


Wang Y.-B.,Zhejiang University | Huang K.,Zhejiang University | Chen C.,Zhejiang University | Feng J.,Hangzhou C Sky Micro System Company | And 2 more authors.
Zhejiang Daxue Xuebao (Gongxue Ban)/Journal of Zhejiang University (Engineering Science) | Year: 2014

Several cache-based embedded Flash data-fetch acceleration techniques were proposed and implemented for low-cost, low-power applications, including low-frequency fast access, backfill hiding with a modified critical-word-first strategy, cache locking with adaptive prefetching, and pre-lookup. Combining these techniques improves Flash data-fetch performance while keeping power dissipation low. Simulations show that when on-chip resources (cache size) are limited and the system frequency is low, as in some low-power applications, an embedded Flash accelerator using these techniques achieves 20%-40% higher performance and about 40% lower dynamic power consumption than a conventional two-way set-associative cache.
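The critical-word-first idea underlying the backfill-hiding strategy is simple to sketch: refill the cache line starting at the requested word and wrap around, so the stalled processor is unblocked as early as possible (the paper's "modified" variant is not reproduced here):

```python
def critical_word_first_order(line_words, critical_index):
    """Return the refill order for a cache-line fill that delivers the
    requested (critical) word first, then wraps around the line."""
    n = len(line_words)
    return [line_words[(critical_index + i) % n] for i in range(n)]
```

With an in-order fill the processor would wait for `critical_index + 1` word transfers before resuming; with critical-word-first it waits for exactly one.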
