Xilinx Research Labs

San Jose, CA, United States

Xilinx Research Labs

San Jose, CA, United States
Time filter
Source Type

Di Tucci L.,Polytechnic of Milan | O'Brien K.,Xilinx Research Labs | Blott M.,Xilinx Research Labs | Santambrogio M.D.,Polytechnic of Milan
Proceedings of the 2017 Design, Automation and Test in Europe, DATE 2017 | Year: 2017

Smith-Waterman is a dynamic programming algorithm that plays a key role in the modern genomics pipeline as it is guaranteed to find the optimal local alignment between two strings of data. The state of the art presents many hardware acceleration solutions that have been implemented in order to exploit the high degree of parallelism available in this algorithm. The majority of these implementations use heuristics to increase the performance of the system at the expense of the accuracy of the result. In this work, we present an implementation of the pure version of the algorithm. We include the key architectural optimizations to achieve highest possible performance for a given platform and leverage the Berkeley roofline model to track the performance and guide the optimizations. To achieve scalability, our custom design comprises of systolic arrays, data compression features and shift registers, while a custom port mapping strategy aims to maximize performance. Our designs are built leveraging an OpenCL-based design entry, namely Xilinx SDAccel, in conjunction with a Xilinx Virtex 7 and Kintex Ultrascale platform. Our final design achieves a performance of 42.47 GCUPS (giga cell updates per second) with an energy efficiency of 1.6988 GCUPS/W. This represents an improvement of 1.72x in performance and energy efficiency over previously published FPGA implementations and 8.49x better in energy efficiency over comparable GPU implementations. © 2017 IEEE.

Doumoulakis A.,Xilinx Research Labs | Keryell R.,Xilinx Research Labs | O'Brien K.,Xilinx Research Labs
ACM International Conference Proceeding Series | Year: 2017

Heterogeneous computing is required in systems ranging from low-end embedded systems up to the high-end HPC systems to reach high-performance while keeping power consumption low. Having more and more accelerators and CPUs also creates challenges for the programmer, requiring even more expertise of them. Fortunately, new modern C++-based domain-specific languages, such as the SYCL open standard from Khronos Group, simplify the programming at the full system level while keeping high performance. SYCL is a single-source programming model providing a task graph of heterogeneous kernels that can be run on various accelerators or even just the CPU. The memory heterogeneity is abstracted through buffer objects and the memory usage is abstracted with accessor objects. From these accessors, the task graph is implicitly constructed, the synchronizations and the data movements across the various physical memories are done automatically, by opposition to OpenCL or CUDA. Sometimes, some applications or libraries already exist using the OpenCL standard or some OpenCL kernels are provided, either as OpenCL kernel source code or even as built-in OpenCL kernels written in RTL for extreme optimization on FPGA. SYCL provides an OpenCL interoperability mode to reuse existing OpenCL code while keeping the higher level task graph programming model without needing explicit memory transfers. We present some experiments on two applications on GPU and FPGA with the triSYCL open-source implementation to show the benefits of this OpenCL interoperability mode. © 2017 ACM.

Zhou S.,University of Southern California | Jiang W.,Xilinx Research Labs | Prasanna V.,University of Southern California
HotSDN 2014 - Proceedings of the ACM SIGCOMM 2014 Workshop on Hot Topics in Software Defined Networking | Year: 2014

This work presents a hardware-software co-design approach of an OpenFlow switch using a state-of-the-art heterogeneous System-on-chip (SoC) platform. Specifically, we implement the OpenFlow switch on a Xilinx Zynq ZC706 board. The Xilinx Zynq SoC family provides a tight coupling of field programmable gate array (FPGA) fabric and ARM processor cores, making it an attractive on-chip implementation platform for SDN switches. High-performance, yet highly-programmable, data plane processing can reside in the programmable logic (PL), while complex control software can reside in ARM processor. Our proposed architecture scales across a range of possible packet throughput rates and a range of possible flow table sizes. Post-place-and-route results show that our design targeted at Zynq can achieve a total 88 Gbps throughput for a 1K flow table which supports dynamic updates. Correct operation has been demonstrated using a ZC706 board. © 2014 Authors.

Ganegedara T.,University of Southern California | Jiang W.,Xilinx Research Labs | Prasanna V.K.,University of Southern California
IEEE Transactions on Parallel and Distributed Systems | Year: 2014

Packet classification is widely used as a core function for various applications in network infrastructure. With increasing demands in throughput, performing wire-speed packet classification has become challenging. Also the performance of today's packet classification solutions depends on the characteristics of rulesets. In this work, we propose a novel modular Bit-Vector (BV) based architecture to perform high-speed packet classification on Field Programmable Gate Array (FPGA). We introduce an algorithm named StrideBV and modularize the BV architecture to achieve better scalability than traditional BV methods. Further, we incorporate range search in our architecture to eliminate ruleset expansion caused by range-to-prefix conversion. The post place-and-route results of our implementation on a state-of-the-art FPGA show that the proposed architecture is able to operate at 100+ Gbps for minimum size packets while supporting large rulesets up to 28 K rules using only the on-chip memory resources. Our solution is ruleset-feature independent , i.e. the above performance can be guaranteed for any ruleset regardless the composition of the ruleset. © 1990-2012 IEEE.

Lotze J.,Trinity College Dublin | Fahmy S.A.,Nanyang Technological University | Noguera J.,Xilinx Research Labs | Doyle L.E.,Trinity College Dublin
IEEE Journal on Selected Areas in Communications | Year: 2011

Cognitive radio is a promising technology for fulfilling the spectrum and service requirements of future wireless communication systems. Real experimentation is a key factor for driving research forward. However, the experimentation testbeds available today are cumbersome to use, require detailed platform knowledge, and often lack high level design methods and tools. In this paper we propose a novel cognitive radio design technique, based on a high-level model which is implementation independent, supports design-time correctness checks, and clearly defines the underlying execution semantics. A radio designed using this technique can be synthesised to various real radio platforms automatically; detailed knowledge of the target platform is not required. The proposed technique therefore simplifies cognitive radio design and implementation significantly, allowing researchers to validate ideas in experiments without extensive engineering effort. One example target platform is proposed, comprising software and reconfigurable hardware. The design technique is demonstrated for this platform through the development of two realistic cognitive radio applications. © 2006 IEEE.

Brebner G.,Xilinx Research Labs
Conference on Optical Fiber Communication, Technical Digest Series | Year: 2015

There is a significant speed mismatch between optical transmission and the current data center. The Field Programmable Gate Array offers a flexible bridge between emergent photonic technologies and next-generation data center servers, via programmable hardware. © 2015 OSA.

Jiang W.,Xilinx Research Labs
ANCS 2013 - Proceedings of the 9th ACM/IEEE Symposium on Architectures for Networking and Communications Systems | Year: 2013

Ternary Content Addressable Memory (TCAM) is widely used in network infrastructure for various search functions. There has been a growing interest in implementing TCAM using reconfigurable hardware such as Field Programmable Gate Array (FPGA). Most of existing FPGA-based TCAM designs are based on brute-force implementations, which result in inefficient on-chip resource usage. As a result, existing designs support only a small TCAM size even with large FPGA devices. They also suffer from significant throughput degradation in implementing a large TCAM, mainly caused by deep priority encoding. This paper presents a scalable random access memory (RAM)-based TCAM architecture aiming for efficient implementation on state-of-the-art FPGAs. We give a formal study on RAM-based TCAM to unveil the ideas and the algorithms behind it. To conquer the timing challenge, we propose a modular architecture consisting of arrays of small-size RAM-based TCAM units. After decoupling the update logic from each unit, the modular architecture allows us to share each update engine among multiple units. This leads to resource saving. The capability of explicit range matching is also offered to avoid range-to-ternary conversion for search functions that require range matching. Implementation on a Xilinx Virtex 7 FPGA shows that our design can support a large TCAM of up to 2.4 Mbits while sustaining high throughput of 150 million packets per second. The resource usage scales linearly with the TCAM size. The architecture is configurable, allowing various performance trade-offs to be exploited. To the best of our knowledge, this is the first FPGA design that implements a TCAM larger than 1 Mbits. © 2013 IEEE.

Brebner G.,Xilinx Research Labs
2011 Optical Fiber Communication Conference and Exposition and the National Fiber Optic Engineers Conference, OFC/NFOEC 2011 | Year: 2011

This paper demonstrates that Xilinx FPGA partial reconfiguration technology can yield resource and power savings in optical network elements through selective hardware modification during live operation, illustrated by two examples: data framing and EFEC calculation. © 2011 Optical Society of America.

Sutton P.D.,Trinity College Dublin | Lotze J.,Trinity College Dublin | Lahlou H.,Trinity College Dublin | Fahmy S.A.,Nanyang Technological University | And 5 more authors.
IEEE Communications Magazine | Year: 2010

Iris is a software architecture for building highly reconfigurable radio networks. It has formed the basis for a wide range of dynamic spectrum access and cognitive radio demonstration systems presented at a number of international conferences between 2007 and 2010. These systems have been developed using heterogeneous processing platforms including generalpurpose processors, field-programmable gate arrays and the Cell Broadband Engine. Focusing on runtime reconfiguration, Iris offers support for all layers of the network stack and provides a platform for the development of not only reconfigurable point-to-point radio links but complete networks of cognitive radios. This article provides an overview of Iris, presenting the unique features of the architecture and illustrating how it can be used to develop a cognitive radio testbed. © 2006 IEEE.

Brebner G.,Xilinx Research Labs
European Conference on Optical Communication, ECOC | Year: 2015

Software presents performance difficulties if it is the sole means of programmability in highspeed optical SDN and NFV. Programmable hardware, specifically the FPGA, offers an ideal cost-effective and power-efficient companion for implementing flexible SDN data planes and accelerating NFV functions. © 2015 Viajes el Corte Ingles, VECISA.

Loading Xilinx Research Labs collaborators
Loading Xilinx Research Labs collaborators