Distributed Computing Group

Polaris, Italy

Versaci F., Distributed Computing Group | Pireddu L., Distributed Computing Group | Zanetti G., Distributed Computing Group
Proceedings - 2016 IEEE International Conference on Big Data, Big Data 2016 | Year: 2016

The adoption of Big Data technologies can potentially boost the scalability of data-driven biology and health workflows by orders of magnitude. Consider, for instance, that technologies in the Hadoop ecosystem have been successfully used by data-driven industry to scale processes to levels much larger than any biological- or health-driven work attempted thus far. In this work we demonstrate the scalability of a sequence alignment pipeline based on technologies from the Hadoop ecosystem, namely Apache Flink and Hadoop MapReduce, both running on the distributed Apache YARN platform. Unlike previous work, our pipeline starts processing directly from the raw BCL data produced by Illumina sequencers. A Flink-based distributed algorithm reconstructs reads from the Illumina BCL data and then demultiplexes them, analogously to the bcl2fastq2 program provided by Illumina. Subsequently, the BWA-MEM-based distributed aligner from the Seal project is used to perform read mapping on the YARN platform. While the standard Illumina and BWA-MEM programs are limited to shared-memory parallelism (multi-threading), our solution is completely distributed and can scale across a large number of computing nodes. Results show excellent pipeline scalability, linear in the number of nodes. In addition, this approach automatically benefits from the robustness to hardware failures and transient cluster problems provided by the YARN platform, as well as from the scalability of the Hadoop Distributed File System. Moreover, this YARN-based approach complements the upcoming version 4 of the GATK toolkit, which is based on Spark and can therefore run on YARN. Together, they can form a complete, scalable YARN-based variant calling pipeline for Illumina data, which will be further improved by the arrival of distributed in-memory filesystem technologies such as Apache Arrow, removing the need to write intermediate data to disk. © 2016 IEEE.
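
The demultiplexing step mentioned above can be illustrated with a minimal, self-contained sketch. The code below is not taken from the paper; the barcode table, read data and function names are illustrative assumptions about how reads reconstructed from BCL data might be assigned to samples by index barcode, tolerating one mismatch as bcl2fastq2 does by default.

    # Hypothetical sketch: assign reads to samples by index barcode,
    # tolerating up to one mismatch (the bcl2fastq2 default).
    def hamming(a, b):
        # Number of positions at which two equal-length strings differ.
        return sum(x != y for x, y in zip(a, b))

    def demultiplex(reads, sample_barcodes, max_mismatches=1):
        # reads: iterable of (read_sequence, index_sequence) pairs.
        # sample_barcodes: dict mapping sample name -> index barcode.
        buckets = {name: [] for name in sample_barcodes}
        buckets["undetermined"] = []
        for seq, index in reads:
            matches = [name for name, bc in sample_barcodes.items()
                       if hamming(index, bc) <= max_mismatches]
            # A read is assigned only if exactly one sample matches.
            target = matches[0] if len(matches) == 1 else "undetermined"
            buckets[target].append(seq)
        return buckets

    barcodes = {"sampleA": "ACGTAC", "sampleB": "TGCATG"}  # illustrative
    reads = [("GATTACA", "ACGTAC"), ("CCCGGGT", "TGCATC")]
    print({k: len(v) for k, v in demultiplex(reads, barcodes).items()})

In the actual pipeline this per-read logic would run as a distributed Flink job over reads reconstructed from the raw BCL tiles, rather than over an in-memory list.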


Biffi A., San Raffaele Scientific Institute | Montini E., San Raffaele Scientific Institute | Lorioli L., San Raffaele Scientific Institute | Lorioli L., Vita-Salute San Raffaele University | And 38 more authors.
Science | Year: 2013

Metachromatic leukodystrophy (MLD) is an inherited lysosomal storage disease caused by arylsulfatase A (ARSA) deficiency. Patients with MLD exhibit progressive motor and cognitive impairment and die within a few years of symptom onset. We used a lentiviral vector to transfer a functional ARSA gene into hematopoietic stem cells (HSCs) from three presymptomatic patients who showed genetic, biochemical, and neurophysiological evidence of late infantile MLD. After reinfusion of the gene-corrected HSCs, the patients showed extensive and stable ARSA gene replacement, which led to high enzyme expression throughout hematopoietic lineages and in cerebrospinal fluid. Analyses of vector integrations revealed no evidence of aberrant clonal behavior. The disease did not manifest or progress in the three patients 7 to 21 months beyond the predicted age of symptom onset. These findings indicate that extensive genetic engineering of human hematopoiesis can be achieved with lentiviral vectors and that this approach may offer therapeutic benefit for MLD patients.


Biffi A., San Raffaele Telethon Institute for Gene Therapy HSR TIGET | Biffi A., San Raffaele Scientific Institute | Bartholomae C.C., National Center for Tumor Diseases | Cesana D., San Raffaele Telethon Institute for Gene Therapy HSR TIGET | And 30 more authors.
Blood | Year: 2011

A recent clinical trial for adrenoleukodystrophy (ALD) showed the efficacy and safety of lentiviral vector (LV) gene transfer in hematopoietic stem/progenitor cells. However, several common insertion sites (CIS) were found in patients' cells, suggesting that LV integrations conferred a selective advantage. We performed high-throughput LV integration site analysis on human hematopoietic stem/progenitor cells engrafted in immunodeficient mice and found the same CISs reported in patients with ALD. Strikingly, most CISs in our experimental model and in patients with ALD cluster in megabase-wide chromosomal regions of high LV integration density. Conversely, cancer-triggering integrations at CISs found in tumor cells from γ-retroviral vector-based clinical trials and oncogene-tagging screenings in mice always target a single gene and are contained in narrow genomic intervals. These findings imply that LV CISs are produced by an integration bias toward specific genomic regions rather than by oncogenic selection. © 2011 by The American Society of Hematology.
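
To make the notion of "integration density" concrete, here is a minimal sketch of the kind of windowed analysis that distinguishes broad integration-dense regions from narrow, single-gene CISs. The window size, site counts and threshold are illustrative assumptions, not values from the study.

    # Hypothetical sketch: count lentiviral integration sites per
    # fixed-size genomic window to flag regions of high density.
    from collections import Counter

    WINDOW = 1_000_000  # 1 Mb windows, matching the megabase scale discussed

    def density_windows(sites, window=WINDOW):
        # sites: list of (chromosome, position) integration coordinates.
        return Counter((chrom, pos // window) for chrom, pos in sites)

    def candidate_cis(sites, min_sites=5, window=WINDOW):
        # Windows holding at least min_sites integrations are flagged
        # as candidate common insertion site regions.
        return {w: n for w, n in density_windows(sites, window).items()
                if n >= min_sites}

    # Toy data: eight sites clustered within one megabase of chr11.
    sites = [("chr11", 3_200_000 + i * 10_000) for i in range(8)]
    print(candidate_cis(sites))  # {('chr11', 3): 8}

A genuine CIS analysis would additionally compare observed counts against a background model of random integration, which this toy omits.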


Lebre A., Ecole des Mines de Nantes | Anedda P., Distributed Computing Group | Gaggero M., Distributed Computing Group | Quesnel F., Ecole des Mines de Nantes
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) | Year: 2012

Although the use of virtual environments provided by cloud computing infrastructures is gaining acceptance in the scientific community, running applications in these environments is still far from reaching the maturity of more traditional computing facilities such as clusters or grids. Indeed, current solutions for managing virtual environments are mostly based on centralized approaches that trade away large-scale concerns such as scalability, reliability and reactivity for simplicity. However, considering current trends in cloud infrastructures, both in size (ever larger) and in usage (cross-federation), all large-scale concerns must be addressed as soon as possible to efficiently manage the next generation of cloud computing platforms. In this work, we propose to investigate an alternative approach leveraging DIStributed and COoperative mechanisms to manage Virtual EnviRonments autonomicallY (DISCOVERY). This initiative aims at overcoming the main limitations of traditional server-centric solutions while integrating all mandatory mechanisms into a unified distributed framework. The system we propose to implement relies on a peer-to-peer model in which each agent can efficiently deploy, dynamically schedule and periodically checkpoint the virtual environments it manages. The article introduces the global design of the DISCOVERY proposal and gives a preliminary description of its internals. © 2012 Springer-Verlag Berlin Heidelberg.
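
As an illustration of the peer-to-peer model described above, the following sketch shows a minimal agent that hosts virtual environments locally and cooperates with a neighbor when saturated, with no central server involved. All class and method names are hypothetical; the article only outlines DISCOVERY's internals.

    # Hypothetical sketch of a DISCOVERY-style cooperative agent:
    # each peer deploys and checkpoints its own virtual environments
    # and forwards deployment requests to neighbors when full.
    class PeerAgent:
        def __init__(self, name, capacity, neighbors=None):
            self.name = name
            self.capacity = capacity
            self.running = []            # virtual environments hosted here
            self.neighbors = neighbors or []  # assumed acyclic in this toy

        def deploy(self, ve):
            # Host locally if capacity allows; otherwise cooperate with
            # a neighboring peer instead of asking a central server.
            if len(self.running) < self.capacity:
                self.running.append(ve)
                return self.name
            for peer in self.neighbors:
                placed = peer.deploy(ve)
                if placed is not None:
                    return placed
            return None  # no capacity anywhere in this neighborhood

        def checkpoint_all(self):
            # Periodic checkpointing of locally managed environments.
            return [f"checkpoint({ve})" for ve in self.running]

    a, b = PeerAgent("a", capacity=1), PeerAgent("b", capacity=2)
    a.neighbors = [b]
    print(a.deploy("ve1"), a.deploy("ve2"))  # ve1 stays on a, ve2 goes to b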


Anedda P., Distributed Computing Group | Gaggero M., Distributed Computing Group | Busonera G., Distributed Computing Group | Schiaratura O., Distributed Computing Group | Zanetti G., Distributed Computing Group
Proceedings - 2010 12th IEEE International Conference on High Performance Computing and Communications, HPCC 2010 | Year: 2010

High Performance Computational Clusters are, in general, rather rigid objects that present their users with a limited number of degrees of freedom, usually related only to the specification of the requested resources and to the selection of specific applications and libraries. While this is reasonable, and indeed desirable, in standard production environments, it can become a hindrance when one needs a dynamic and flexible computational environment, for instance for experiments and evaluation, where very different computational approaches, e.g., MapReduce, standard parallel jobs and virtual HPC clusters, need to coexist on the same physical facility. In this paper we present our efforts to address some of these challenges while maintaining a unified cluster management environment. © 2010 IEEE.


Anedda P., Distributed Computing Group | Leo S., Distributed Computing Group | Manca S., Distributed Computing Group | Gaggero M., Distributed Computing Group | Zanetti G., Distributed Computing Group
Future Generation Computer Systems | Year: 2010

A systematic study of issues related to suspending, migrating and resuming virtual clusters for data-driven HPC applications is presented. The focus is on nontrivial virtual clusters, that is, those where the running computation is expected to be coordinated and strongly coupled. It is shown that this requires all cluster-level operations, such as start and save, to be performed as synchronously as possible on all nodes, introducing the need for barriers at the virtual cluster computing meta-level. Once a synchronization mechanism is provided, and appropriate transport strategies have been set up, it is possible to suspend, migrate and resume whole virtual clusters composed of "heavy" (4 GB RAM, 6 GB disk images) virtual machines in times on the order of a few minutes, without disrupting the parallel computation, albeit of the MapReduce type, running inside them. The approach is intrinsically parallel and should scale without problems to larger virtual clusters. © 2010 Elsevier B.V. All rights reserved.
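
The cluster-level barrier described above can be illustrated with a small sketch: every node controller must reach the barrier before any of them issues the local suspend, so the coupled computation is frozen as synchronously as possible. The use of threading.Barrier and the suspend placeholder are illustrative assumptions; the paper's mechanism operates at the virtual cluster management layer, across physical hosts.

    # Hypothetical sketch: suspend all VMs of a virtual cluster as
    # synchronously as possible using a barrier across node controllers.
    import threading

    NODES = ["node0", "node1", "node2"]
    barrier = threading.Barrier(len(NODES))

    def suspend_vm(node):
        # Placeholder for the per-node hypervisor call; a real system
        # would invoke the virtualization layer's save operation here.
        print(f"{node}: VM suspended")

    def node_controller(node):
        # Wait until every node is ready, then suspend together, so no
        # VM keeps computing against peers that are already frozen.
        barrier.wait()
        suspend_vm(node)

    threads = [threading.Thread(target=node_controller, args=(n,))
               for n in NODES]
    for t in threads: t.start()
    for t in threads: t.join()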
