Bangalore, India

Merchant F.,Indian Institute of Science | Vatwani T.,IIT | Chattopadhyay A.,Nanyang Technological University | Raha S.,Indian Institute of Science | And 2 more authors.
Proceedings of the IEEE International Conference on VLSI Design | Year: 2016

Householder Transformation (HT) is a prime building block of widely used numerical linear algebra primitives such as QR factorization. Despite years of intense research on HT, there remains scope to expose higher Instruction-Level Parallelism (ILP) in HT through algorithmic transforms. In this paper, we propose several novel algorithmic transformations in HT to expose higher ILP. Our propositions are backed by theoretical proofs and a series of experiments using commercial general-purpose processors. Finally, we show that algorithm-architecture co-design leads to the most efficient realization of HT. A detailed experimental study with architectural modifications is presented for a commercial CGRA. Benchmarking against some recent HT implementations shows a 30-40% improvement in performance. © 2016 IEEE.
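For readers unfamiliar with the primitive, here is a minimal sketch of a single textbook Householder reflection in plain Python. It is not the transformed algorithm the paper proposes; the function names `householder_vector` and `apply_reflector` are illustrative assumptions. It shows how one reflector zeroes every entry of a vector below the first, which is the step QR factorization repeats column by column.

```python
import math

def householder_vector(x):
    """Return (v, alpha) such that (I - 2 v v^T / v^T v) x = (alpha, 0, ..., 0)."""
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    # Pick the sign opposite to x[0] so that x[0] - alpha does not cancel.
    alpha = -math.copysign(norm_x, x[0])
    v = list(x)
    v[0] -= alpha
    return v, alpha

def apply_reflector(v, y):
    """Compute H y = y - 2 v (v^T y) / (v^T v) without forming the matrix H."""
    vtv = sum(vi * vi for vi in v)
    if vtv == 0.0:
        # Zero vector: the reflector degenerates to the identity.
        return list(y)
    scale = 2.0 * sum(vi * yi for vi, yi in zip(v, y)) / vtv
    return [yi - scale * vi for vi, yi in zip(v, y)]

# Illustrative input (an assumption, not data from the paper).
x = [3.0, 4.0, 0.0]
v, alpha = householder_vector(x)
reflected = apply_reflector(v, x)  # all entries after the first become zero
```

The dot product and the scaled vector update in `apply_reflector` are dependent steps in this formulation, which is the kind of serialization that algorithmic restructuring can target.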


Das S.,Indian Institute of Science | Narayan R.,Morphing Machines Pvt. Ltd | Nandy S.K.,Indian Institute of Science
Communications in Computer and Information Science | Year: 2012

In this paper we present a hardware-software hybrid technique for modular multiplication over large binary fields. The technique involves application of the Karatsuba-Ofman algorithm for polynomial multiplication and a novel technique for reduction. The proposed reduction technique is based on the popular repeated multiplication technique and Barrett reduction. We propose a new design of a parallel polynomial multiplier that serves as a hardware accelerator for large field multiplications. We show that the proposed reduction technique, accelerated using the modified polynomial multiplier, achieves significantly higher performance compared to a purely software technique and other hybrid techniques. We also show that the hybrid accelerated approach to modular field multiplication is significantly faster than the Montgomery algorithm based integrated multiplication approach. © Springer-Verlag Berlin Heidelberg 2012.
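As a rough software-only illustration of the multiplication step, the sketch below implements Karatsuba-Ofman multiplication of binary polynomials, with Python integers used as bit vectors (bit i is the coefficient of x^i). The reduction shown is plain shift-and-XOR long division, not the paper's Barrett-based technique, and all function names and the `threshold` parameter are assumptions made for this example.

```python
def clmul_schoolbook(a, b):
    """Carry-less product of two GF(2)[x] polynomials by shift-and-XOR."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result

def clmul_karatsuba(a, b, threshold=32):
    """Karatsuba-Ofman product in GF(2)[x]: three half-size products instead of four."""
    n = max(a.bit_length(), b.bit_length())
    if n <= threshold:
        return clmul_schoolbook(a, b)
    half = n // 2
    mask = (1 << half) - 1
    a_lo, a_hi = a & mask, a >> half
    b_lo, b_hi = b & mask, b >> half
    lo = clmul_karatsuba(a_lo, b_lo, threshold)
    hi = clmul_karatsuba(a_hi, b_hi, threshold)
    # Over GF(2), addition and subtraction are both XOR.
    mid = clmul_karatsuba(a_lo ^ a_hi, b_lo ^ b_hi, threshold) ^ lo ^ hi
    return (hi << (2 * half)) ^ (mid << half) ^ lo

def poly_mod(value, modulus):
    """Reduce value modulo modulus in GF(2)[x] (simple long division, not Barrett)."""
    deg = modulus.bit_length() - 1
    while value.bit_length() - 1 >= deg:
        value ^= modulus << (value.bit_length() - 1 - deg)
    return value

# Small check in GF(2^8) with the AES polynomial x^8 + x^4 + x^3 + x + 1 (0x11B).
product = poly_mod(clmul_karatsuba(0x57, 0x83, threshold=2), 0x11B)
```

The small field is used only because the result is easy to verify by hand; the same functions work unchanged on operands hundreds of bits wide.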


Kala S.,Indian Institute of Science | Nalesh S.,Indian Institute of Science | Maity A.,National Institute of Technology Durgapur | Nandy S.K.,Indian Institute of Science | Narayan R.,Morphing Machines Pvt. Ltd.
Proceedings - IEEE International Symposium on Circuits and Systems | Year: 2013

In this paper we propose a fully parallel 64K point radix-4^4 FFT processor. The radix-4^4 parallel unrolled architecture uses a novel radix-4 butterfly unit which takes all four inputs in parallel and can selectively produce one out of the four outputs. The radix-4^4 block can take all 256 inputs in parallel and can use the select control signals to generate one out of the 256 outputs. The resultant 64K point FFT processor shows significant reduction in intermediate memory but with increased hardware complexity. Compared to the state-of-the-art implementation [5], our architecture shows reduced latency with comparable throughput and area. The 64K point FFT architecture was synthesized using a 130nm CMOS technology which resulted in a throughput of 1.4 GSPS and latency of 47.7μs with a maximum clock frequency of 350MHz. When compared to [5], the latency is reduced by 303μs with 50.8% reduction in area. © 2013 IEEE.
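The idea of a butterfly that accepts all four inputs and emits only a selected output can be sketched in software as follows. This is a behavioural illustration of a plain 4-point DFT with an output select, not the paper's hardware unit; the function name and the `select` argument are assumptions, and inter-stage twiddle factors are omitted.

```python
def radix4_butterfly_select(x0, x1, x2, x3, select):
    """Return output `select` (0..3) of a radix-4 butterfly (4-point DFT)."""
    if select not in (0, 1, 2, 3):
        raise ValueError("select must be 0, 1, 2 or 3")
    # Powers of W_4 = exp(-2*pi*i/4) = -j, so no real multiplier is needed.
    w = (1, -1j, -1, 1j)
    inputs = (x0, x1, x2, x3)
    return sum(x * w[(n * select) % 4] for n, x in enumerate(inputs))

# Illustrative inputs (an assumption): produce each output in turn.
outputs = [radix4_butterfly_select(1, 2, 3, 4, k) for k in range(4)]
```

The sizes quoted in the abstract are consistent with this building block: four cascaded radix-4 stages cover 4^4 = 256 points, and 256^2 = 65536 points is the 64K transform.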


Das S.,Indian Institute of Science | Sivanandan N.,Indian Institute of Science | Madhu K.T.,Indian Institute of Science | Nandy S.K.,Indian Institute of Science | Narayan R.,Morphing Machines Pvt. Ltd.
Proceedings of the IEEE International Conference on VLSI Design | Year: 2016

In this paper, we present an architecture named REDEFINE Hyper Cell Multicore (RHyMe) designed to efficiently realize HPC application kernels, such as loops. RHyMe relies on the compiler to generate the meta-data for its functioning. Most of the orchestration activity for executing kernels is governed by compiler generated meta-data made use of at runtime. In RHyMe, macro operations can be realized as a hardware overlay of MIMO operations on hardware structures called Hyper Cells. While a Hyper Cell enables exploiting fine-grain instruction level and pipeline parallelism, coarse-grain parallelism is exploited among multiple Hyper Cells. Regularity exhibited by computations such as loops results in efficient usage of simple compute hardware such as Hyper Cells as well as memory structures that can be managed explicitly. © 2016 IEEE.


Biswas A.K.,Indian Institute of Science | Nandy S.K.,Indian Institute of Science | Narayan R.,Morphing Machines Pvt Ltd
2015 IEEE International Conference on Electronics, Computing and Communication Technologies, CONECCT 2015 | Year: 2015

NoC based high performance MP-SoCs can have multiple secure regions or Trusted Execution Environments (TEEs). These TEEs can be separated by non-secure regions or Rich Execution Environments (REEs) in the same MP-SoC. All communications between two TEEs need to cross the in-between REEs. Without any security mechanisms, these traffic flows can face router attacks in REEs. Both routing table and routing logic based routers are vulnerable to such attacks. In this paper, we address attacks on routing tables. We propose two countermeasures: Run-time protector and Restart-time protector. In addition to detection and prevention, the proposed protectors can also locate a malicious router. Synthesis results show that the areas of a Run-time protector and a Restart-time protector are only 6.6% and 2% of a conventional router area, respectively. © 2015 IEEE.


Merchant F.,Indian Institute of Science | Choudhary N.,Indian Institute of Science | Nandy S.K.,Indian Institute of Science | Narayan R.,Morphing Machines Pvt. Ltd.
Proceedings of the IEEE International Conference on VLSI Design | Year: 2016

In this paper we present different optimization techniques on look-up table based algorithms for double precision floating point arithmetic. Based on our analysis of different look-up table based algorithms in the literature, we re-engineer basic blocks of the algorithms (i.e., multiplier(s) and adder(s)) to facilitate area and timing benefits to achieve higher performance. We propose different look-up table optimization techniques for the algorithms. We also analyze the trade-off in employing exact rounding (0.5 ulp, unit in the last place) in the double precision floating point unit. Based on performance and extensibility criteria we take algorithms proposed by Wong and Goto as a base case to validate our optimization techniques and compare the performance with other algorithms in the literature. We improve the performance (latency × area) of the Wong and Goto division algorithm by 26.94%. © 2016 IEEE.


Mahale G.,Indian Institute of Science | Nandy S.K.,Indian Institute of Science | Bhatia E.,BITS Pilani | Narayan R.,Morphing Machines Pvt Ltd.
Proceedings of the IEEE International Conference on VLSI Design | Year: 2016

In this paper we propose the architecture of a processor for vector operations involved in on-line learning of neural networks. We aim to implement on-line learning on a Radial Basis Function Neural Network (RBFNN) based Face Recognition (FR) system that has pseudo inverse computation as an essential component during training. Synaptic weights of the RBFNN output layer need to be updated whenever the FR system comes across a new face to be learnt. For real-time on-line learning, the update of synaptic weights is done using an existing Incremental Pseudo Inverse (IPI) algorithm in place of the compute-intensive pseudo inverse algorithm. We design a custom data-path for vector operations appearing in the IPI algorithm. The custom data-path along with configuration and memory access mechanisms forms a processing unit, termed Processor for Vector Operations (VOP). We simulate and synthesize VOP to target a Virtex-6 FPGA using the Xilinx ISE. Apart from on-line learning, the VOPs can be used in acceleration of several applications involving predominant vector-matrix operations. © 2016 IEEE.


A method and System on Chip (SoC) for adapting a reconfigurable hardware for an application kernel at run time is provided. The method includes obtaining a plurality of Hyper-Operations corresponding to the application. A Hyper-Operation performs one or more of a plurality of MIMO functions of the application. The method further includes retrieving compute metadata and transport metadata corresponding to each Hyper-Operation. Compute metadata specifies functionality of a Hyper-Operation and transport metadata specifies data flow path of a Hyper-Operation. Thereafter, the method maps each Hyper-Operation to a corresponding set of tiles in the hardware. The set of tiles includes one or more tiles and a tile performs one or more of the plurality of MIMO functions of the application.


Biswas A.K.,Indian Institute of Science | Nandy S.K.,Indian Institute of Science | Narayan R.,Morphing Machines Pvt Ltd
Circuits, Systems, and Signal Processing | Year: 2015

The growing number of applications and processing units in modern Multiprocessor Systems-on-Chips (MPSoCs) comes along with reduced time to market. Different IP cores can come from different vendors, and their trust levels are also different, but typically they use Network-on-Chip (NoC) as their communication infrastructure. An MPSoC can have multiple Trusted Execution Environments (TEEs). Apart from performance, power, and area research in the field of MPSoC, robust and secure system design is also gaining importance in the research community. To build a secure system, the designer must know beforehand all kinds of attack possibilities for the respective system (MPSoC). In this paper we survey the possible attack scenarios on present-day MPSoCs and investigate a new attack scenario, i.e., router attack targeted toward NoC architecture. We show the validity of this attack by analyzing different present-day NoC architectures and show that they are all vulnerable to this type of attack. By launching a router attack, an attacker can control the whole chip very easily, which makes it a very serious issue. Both routing tables and routing logic-based routers are vulnerable to such attacks. In this paper, we address attacks on routing tables. We propose different monitoring-based countermeasures against routing table-based router attack in an MPSoC having multiple TEEs. Synthesis results show that the proposed countermeasures, viz. Runtime-monitor, Restart-monitor, Intermediate manager, and Auditor, occupy areas that are 26.6, 22, 0.2, and 12.2 % of a routing table-based router area. Apart from these, we propose an Ejection address checker and a Local monitoring module inside a router that cause 3.4 and 10.6 % increases in router area, respectively. Simulation results are also given, which show the effectiveness of the proposed monitoring-based countermeasures. © 2015, Springer Science+Business Media New York.


Kala S.,Indian Institute of Science | Nalesh S.,Indian Institute of Science | Nandy S.K.,Indian Institute of Science | Narayan R.,Morphing Machines Pvt Ltd.
Journal of Low Power Electronics | Year: 2015

Fast Fourier Transform is an integral part of OFDM systems. FFT is the most compute intensive operation that critically affects the OFDM system performance. In order to support the various OFDM standards, a scalable and reconfigurable FFT architecture is necessary. This paper presents an energy efficient and scalable FFT architecture, which can be dynamically reconfigured to adapt to specifications of different standards. The proposed architecture is based on the Radix-4^n algorithm and uses a parallel-pipelined unrolled architecture. The proposed architecture can be scaled to support FFTs of sizes up to 64 K points. As a proof of concept, an FFT architecture for computation of FFTs of sizes 64 to 4096 points has been implemented in the UMC 65 nm 1P10M CMOS process with a maximum clock frequency of 125 MHz and area of 1.05 mm^2. The power consumption at 40 MHz is 33.5 mW for the computation of a 4096 point FFT. Energy efficiency (FFTs per unit of energy) of the proposed architecture is 1176 for 1 K point, 584 for 2 K point and 291 for 4 K point FFTs at 40 MHz. The proposed architecture shows better performance in terms of scalability and energy efficiency when compared to existing implementations. Copyright © 2015 American Scientific Publishers All rights reserved.
