Bangalore, India

Nalesh S.,Indian Institute of Science | Madhu K.T.,Indian Institute of Science | Das S.,Indian Institute of Science | Nandy S.K.,Indian Institute of Science | Narayan R.,Morphing Machines Pvt. Ltd.
Integration, the VLSI Journal | Year: 2017

Transistor supply voltages no longer scale at the same rate as transistor density and frequency of operation. This has led to the Dark Silicon problem, wherein only a fraction of transistors can operate at maximum frequency and nominal voltage, in order to ensure that the chip functions within the power and thermal budgets. Heterogeneous computing systems which consist of General Purpose Processors (GPPs), Graphics Processing Units (GPUs) and application specific accelerators can provide improved performance while keeping power dissipation at a realistic level. For the accelerators to be effective, they have to be specialized for related classes of application kernels and have to be synthesized from high level specifications. Coarse Grained Reconfigurable Arrays (CGRAs) have been proposed as accelerators for a variety of application kernels. For CGRAs to be used as accelerators in the Dark Silicon era, a synthesis framework which focuses on optimizing energy efficiency while achieving the target performance is essential. However, existing compilation techniques for CGRAs focus on optimizing only for performance, and any reduction in energy is just a side-effect. In this paper we explore synthesizing application kernels expressed as functions, on a coarse grained composable reconfigurable array (CGCRA). The proposed reconfigurable array comprises HyperCells, which are reconfigurable macro-cells that facilitate modeling power and performance in terms of easily measurable parameters. The proposed synthesis approach takes kernels expressed in a functional language, applies a sequence of well known program transformations, explores trade-offs between throughput and energy using the power and performance models, and realizes the kernels on the CGCRA. This approach, when used to map a set of signal processing and linear algebra kernels, achieves resource utilization varying from 50% to 80%. © 2017 Elsevier B.V.

Madhu K.T.,Indian Institute of Science | Rao A.,Morphing Machines Pvt Ltd | Das S.,Indian Institute of Science | KrishnaC M.,Indian Institute of Science | And 2 more authors.
ACM International Conference Proceeding Series | Year: 2016

Performance of an application on a many-core machine primarily hinges on the ability of the architecture to exploit parallelism and to provide fast memory accesses. Exploiting parallelism in static application graphs on a multicore target is relatively easy owing to the fact that compilers can map them onto an optimal set of processing elements and memory modules. Dynamic application graphs have computations and data dependencies that manifest at runtime and hence may not be schedulable statically. Load balancing of such graphs requires runtime support (such as support for work-stealing) but results in overheads due to data and code movement. In this work, we use ReNÉ MPSoC as an alternative to the traditional many-core processing platforms to target application kernel graphs. ReNÉ is designed to be used as an accelerator to a host and offers the ability to exploit massive parallelism at multiple granularities and supports work-stealing for dynamic load-balancing. Further, it offers handles to enable and disable work-stealing selectively. ReNÉ employs an explicitly managed global memory with minimal hardware support for address translation required for relocating application kernels. We present a resource management methodology on ReNÉ MPSoC that encompasses a lightweight resource management hardware module and a compilation flow. Our methodology aims at identifying resource requirements at compile time and creating resource boundaries (per application kernel) to guarantee performance and maximize resource utilization. The approach offers similar flexibility in resource allocation as a dynamic scheduling runtime but guarantees performance since locality of reference of data and code can be ensured. © 2016 ACM.

Das S.,Indian Institute of Science | Narayan R.,Morphing Machines Pvt. Ltd | Nandy S.K.,Indian Institute of Science
Communications in Computer and Information Science | Year: 2012

In this paper we present a hardware-software hybrid technique for modular multiplication over large binary fields. The technique involves application of the Karatsuba-Ofman algorithm for polynomial multiplication and a novel technique for reduction. The proposed reduction technique is based on the popular repeated multiplication technique and Barrett reduction. We propose a new design of a parallel polynomial multiplier that serves as a hardware accelerator for large field multiplications. We show that the proposed reduction technique, accelerated using the modified polynomial multiplier, achieves significantly higher performance compared to a purely software technique and other hybrid techniques. We also show that the hybrid accelerated approach to modular field multiplication is significantly faster than the Montgomery algorithm based integrated multiplication approach. © Springer-Verlag Berlin Heidelberg 2012.
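
As a rough software-only illustration of the two steps named in this abstract, the sketch below multiplies polynomials over GF(2) with the Karatsuba-Ofman recursion and then reduces the product modulo a field polynomial. It is not the paper's design: polynomials are packed into Python integers (bit i is the coefficient of x^i), the function names and the `threshold` parameter are hypothetical, and the reduction is plain shift-and-XOR long division rather than the repeated-multiplication/Barrett scheme the authors propose.

```python
def clmul(a: int, b: int) -> int:
    """Schoolbook carry-less multiplication in GF(2)[x]."""
    result = 0
    while b:
        if b & 1:
            result ^= a  # addition in GF(2) is XOR
        a <<= 1
        b >>= 1
    return result


def karatsuba_gf2(a: int, b: int, threshold: int = 32) -> int:
    """Karatsuba-Ofman multiplication in GF(2)[x]; threshold must be >= 1."""
    n = max(a.bit_length(), b.bit_length())
    if n <= threshold:
        return clmul(a, b)
    half = n // 2
    mask = (1 << half) - 1
    a_lo, a_hi = a & mask, a >> half
    b_lo, b_hi = b & mask, b >> half
    lo = karatsuba_gf2(a_lo, b_lo, threshold)
    hi = karatsuba_gf2(a_hi, b_hi, threshold)
    # Three sub-products instead of four: the middle term is recovered
    # from (a_lo + a_hi)(b_lo + b_hi) by adding (XOR-ing) lo and hi.
    mid = karatsuba_gf2(a_lo ^ a_hi, b_lo ^ b_hi, threshold) ^ lo ^ hi
    return (hi << (2 * half)) ^ (mid << half) ^ lo


def reduce_gf2(p: int, modulus: int) -> int:
    """Reduce p modulo a nonzero polynomial by shift-and-XOR long division."""
    deg = modulus.bit_length() - 1
    while p.bit_length() - 1 >= deg:
        p ^= modulus << (p.bit_length() - 1 - deg)
    return p


# Example in GF(2^8) with the AES polynomial x^8 + x^4 + x^3 + x + 1 (0x11B).
product = karatsuba_gf2(0x57, 0x83, threshold=2)
print(hex(reduce_gf2(product, 0x11B)))  # 0xc1
```

The low `threshold` in the example only forces the recursion to run on small operands; for large binary fields a larger cut-over to the schoolbook routine would normally be chosen.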

Kala S.,Indian Institute of Science | Nalesh S.,Indian Institute of Science | Maity A.,National Institute of Technology Durgapur | Nandy S.K.,Indian Institute of Science | Narayan R.,Morphing Machines Pvt. Ltd.
Proceedings - IEEE International Symposium on Circuits and Systems | Year: 2013

In this paper we propose a fully parallel 64K point radix-4^4 FFT processor. The radix-4^4 parallel unrolled architecture uses a novel radix-4 butterfly unit which takes all four inputs in parallel and can selectively produce one out of the four outputs. The radix-4^4 block can take all 256 inputs in parallel and can use the select control signals to generate one out of the 256 outputs. The resultant 64K point FFT processor shows significant reduction in intermediate memory but with increased hardware complexity. Compared to the state-of-the-art implementation [5], our architecture shows reduced latency with comparable throughput and area. The 64K point FFT architecture was synthesized using a 130nm CMOS technology which resulted in a throughput of 1.4 GSPS and latency of 47.7μs with a maximum clock frequency of 350MHz. When compared to [5], the latency is reduced by 303μs with 50.8% reduction in area. © 2013 IEEE.
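
The idea of a radix-4 butterfly that takes all four inputs at once but produces only one selected output can be sketched in software as follows. This is only a behavioural model under the usual 4-point DFT definition; the function name and the select argument `k` are hypothetical and say nothing about the paper's actual hardware structure.

```python
def radix4_butterfly_select(x, k):
    """Return output k (0..3) of the 4-point DFT of x, computing only that output."""
    # Powers of the twiddle factor exp(-2j*pi/4) = -1j; no real multipliers needed.
    w = (1, -1j, -1, 1j)
    return sum(x[n] * w[(n * k) % 4] for n in range(4))


# All four outputs of one butterfly, generated one select value at a time.
samples = [1, 2, 3, 4]
outputs = [radix4_butterfly_select(samples, k) for k in range(4)]
```

Because every twiddle factor here is one of 1, -1, -1j or 1j, each selected output costs only additions, subtractions and real/imaginary swaps.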

Das S.,Indian Institute of Science | Sivanandan N.,Indian Institute of Science | Madhu K.T.,Indian Institute of Science | Nandy S.K.,Indian Institute of Science | Narayan R.,Morphing Machines Pvt. Ltd.
Proceedings of the IEEE International Conference on VLSI Design | Year: 2016

In this paper, we present an architecture named REDEFINE Hyper Cell Multicore (RHyMe) designed to efficiently realize HPC application kernels, such as loops. RHyMe relies on the compiler to generate the meta-data for its functioning. Most of the orchestration activity for executing kernels is governed by compiler generated meta-data made use of at runtime. In RHyMe, macro operations can be realized as a hardware overlay of MIMO operations on hardware structures called Hyper Cells. While a Hyper Cell enables exploiting fine-grain instruction level and pipeline parallelism, coarse-grain parallelism is exploited among multiple Hyper Cells. Regularity exhibited by computations such as loops results in efficient usage of simple compute hardware such as Hyper Cells as well as memory structures that can be managed explicitly. © 2016 IEEE.

Biswas A.K.,Indian Institute of Science | Nandy S.K.,Indian Institute of Science | Narayan R.,Morphing Machines Pvt Ltd
2015 IEEE International Conference on Electronics, Computing and Communication Technologies, CONECCT 2015 | Year: 2015

NoC based high performance MP-SoCs can have multiple secure regions or Trusted Execution Environments (TEEs). These TEEs can be separated by non-secure regions or Rich Execution Environments (REEs) in the same MP-SoC. All communications between two TEEs need to cross the in-between REEs. Without any security mechanisms, these traffic flows can face router attacks in REEs. Both routing table and routing logic based routers are vulnerable to such attacks. In this paper, we address attacks on routing tables. We propose two countermeasures: a Run-time protector and a Restart-time protector. In addition to detection and prevention, the proposed protectors can also locate a malicious router. Synthesis results show that the areas of a Run-time protector and a Restart-time protector are only 6.6% and 2% of a conventional router area, respectively. © 2015 IEEE.

Merchant F.,Indian Institute of Science | Choudhary N.,Indian Institute of Science | Nandy S.K.,Indian Institute of Science | Narayan R.,Morphing Machines Pvt. Ltd.
Proceedings of the IEEE International Conference on VLSI Design | Year: 2016

In this paper we present different optimization techniques on look-up table based algorithms for double precision floating point arithmetic. Based on our analysis of different look-up table based algorithms in the literature, we re-engineer basic blocks of the algorithms (i.e. multiplier(s) and adder(s)) to facilitate area and timing benefits to achieve higher performance. We propose different look-up table optimization techniques for the algorithms. We also analyze the trade-off in employing exact rounding (0.5 ulp, unit in the last place) in the double precision floating point unit. Based on performance and extensibility criteria we take algorithms proposed by Wong and Goto as a base case to validate our optimization techniques and compare the performance with other algorithms in the literature. We improve the performance (latency × area) of the Wong and Goto division algorithm by 26.94%. © 2016 IEEE.

Mahale G.,Indian Institute of Science | Nandy S.K.,Indian Institute of Science | Bhatia E.,BITS Pilani | Narayan R.,Morphing Machines Pvt Ltd.
Proceedings of the IEEE International Conference on VLSI Design | Year: 2016

In this paper we propose the architecture of a processor for vector operations involved in on-line learning of neural networks. We aim to implement on-line learning on a Radial Basis Function Neural Network (RBFNN) based Face Recognition (FR) system that has pseudo inverse computation as an essential component during training. Synaptic weights of the RBFNN output layer need to be updated whenever the FR system comes across a new face to be learnt. For real-time on-line learning, the update of synaptic weights is done using an existing Incremental Pseudo Inverse (IPI) algorithm in place of the compute intensive pseudo inverse algorithm. We design a custom data-path for vector operations appearing in the IPI algorithm. The custom data-path along with configuration and memory access mechanisms forms a processing unit, termed Processor for Vector Operations (VOP). We simulate and synthesize VOP to target a Virtex-6 FPGA using the Xilinx ISE. Apart from on-line learning, the VOPs can be used in acceleration of several applications involving predominant vector-matrix operations. © 2016 IEEE.

A method and System on Chip (SoC) for adapting a reconfigurable hardware for an application kernel at run time is provided. The method includes obtaining a plurality of Hyper-Operations corresponding to the application. A Hyper-Operation performs one or more of a plurality of MIMO functions of the application. The method further includes retrieving compute metadata and transport metadata corresponding to each Hyper-Operation. Compute metadata specifies functionality of a Hyper-Operation and transport metadata specifies data flow path of a Hyper-Operation. Thereafter, the method maps each Hyper-Operation to a corresponding set of tiles in the hardware. The set of tiles includes one or more tiles and a tile performs one or more of the plurality of MIMO functions of the application.

Biswas A.K.,Indian Institute of Science | Nandy S.K.,Indian Institute of Science | Narayan R.,Morphing Machines Pvt Ltd
Circuits, Systems, and Signal Processing | Year: 2015

The growing number of applications and processing units in modern Multiprocessor Systems-on-Chips (MPSoCs) comes along with reduced time to market. Different IP cores can come from different vendors, and their trust levels are also different, but typically they use Network-on-Chip (NoC) as their communication infrastructure. An MPSoC can have multiple Trusted Execution Environments (TEEs). Apart from performance, power, and area research in the field of MPSoC, robust and secure system design is also gaining importance in the research community. To build a secure system, the designer must know beforehand all kinds of attack possibilities for the respective system (MPSoC). In this paper we survey the possible attack scenarios on present-day MPSoCs and investigate a new attack scenario, i.e., router attack targeted toward NoC architecture. We show the validity of this attack by analyzing different present-day NoC architectures and show that they are all vulnerable to this type of attack. By launching a router attack, an attacker can control the whole chip very easily, which makes it a very serious issue. Both routing tables and routing logic-based routers are vulnerable to such attacks. In this paper, we address attacks on routing tables. We propose different monitoring-based countermeasures against routing table-based router attack in an MPSoC having multiple TEEs. Synthesis results show that the proposed countermeasures, viz. Runtime-monitor, Restart-monitor, Intermediate manager, and Auditor, occupy areas that are 26.6, 22, 0.2, and 12.2 % of a routing table-based router area. Apart from these, we propose an Ejection address checker and a Local monitoring module inside a router that cause 3.4 and 10.6 % increase of a router area, respectively. Simulation results are also given, which show the effectiveness of the proposed monitoring-based countermeasures. © 2015, Springer Science+Business Media New York.
