Santa Cruz, CA, United States

News Article | August 21, 2016
Site: motherboard.vice.com

I know a game has captured the imagination of the mainstream when my mom asks me about it, and I know No Man’s Sky is an exception in this rare category of games because she asked me about it twice: Once when Hello Games founder Sean Murray demoed it to Stephen Colbert on The Late Show, and again when Murray and No Man’s Sky were profiled in a big piece in The New Yorker. The New Yorker story describes how “the universe is being built” by Murray and his small team (12 people at the time) by creating mathematical rules that will procedurally generate 18,446,744,073,709,551,616 (18 quintillion) unique planets for players to explore. “Because the designers are building their universe by establishing its laws of nature, rather than by hand-crafting its details, much about it remains unknown, even to them,” The New Yorker said. The New Yorker story explains how these “laws” fundamentally shape No Man’s Sky’s universe with evocative examples. “Because of its algorithmic structure, nearly everything in it is interconnected: changes to the handling of a ship can affect the way insects fly,” the story said. This basic notion was repeated, practically unchallenged, in virtually every writeup of the game since it was revealed in 2013. Now that it’s out, players are, to varying degrees, outraged that No Man’s Sky is not as large, unique, or enchanting as they thought it would be based on what Murray said leading up to release. An image of his face with the title “One Man’s Lie” was on the front page of the r/gaming subreddit at the time of writing. On the r/games subreddit, a dedicated player has gathered a comprehensive list of grievances, comparing what Murray has said about the game versus the final product to point out where it fell short of what was promised. 
No Man’s Sky is not what many players hoped it would be, and while it does appear that a handful of features were cut or scaled back for release (which is common in game development), I believe that the outrage is less about lies or supersonic hype than it is rooted in a willful misunderstanding of procedural generation and its limits. Yes, No Man’s Sky technically has 18 quintillion planets, but they all start to feel the same after a while because they were created with a mathematical formula that only has so many variables to work with. By definition, you cannot perfectly automate a process for creating special things that are meaningfully different from one another. Kate Compton, who creates a lot of interesting procedurally generated things as a PhD candidate at UCSC and who worked on creating planets for Spore, knows all about this. There are a lot of parallels between Spore and No Man’s Sky. Spore, which was headed by SimCity creator Will Wright, was another game that leveraged procedural generation to create a vast universe. It also generated huge buzz prior to release with an appearance on The Colbert Report and a feature in The New Yorker, only to fall short of expectations. Compton and other Spore developers felt such empathy for Hello Games that they even sent the studio a case of beer as a gift. Players expected Spore to be a vast, rich, procedurally generated universe, but that’s not exactly what they got. “I can easily generate 10,000 bowls of plain oatmeal, with each oat being in a different position and different orientation, and mathematically speaking they will all be completely unique,” she said. “But the user will likely just see a lot of oatmeal. Perceptual uniqueness is the real metric, and it’s darn tough.” Compton told me over the phone that she could see that No Man’s Sky was about to get the same backlash as Spore. “I could see it coming because I saw the exact same thing with Spore,” she said. 
“We also said things that sounded like the right thing to say at the time, but it’s hard when you’re in the technical weeds, to explain [procedural generation] to someone who isn’t technical.” We all wanted to believe that 12 game developers in Guildford, England could create a deep, believable universe, but that is a ridiculous thing to believe given even basic facts about video game development. More than 200 people are currently working on Gears of War 4, which is a linear, narrative-driven game most players will see end-to-end in a few hours. Many hundreds more worked on the last Assassin’s Creed game, for example, which takes place in a large open world that is still smaller than one of those quintillion planets in No Man’s Sky. I understand how people got the impression that Hello Games was insinuating that it would be able to do what teams that were orders of magnitude bigger could not. You hear the number “18 quintillion” and your imagination goes wild, but when I asked Murray about what the number really means when he showed me a demo of No Man’s Sky in March, he was honest about it. “Even in our universe people will start to see the rules, that creatures can't have more than so many legs, creatures can't have that many eyes, but hopefully at that point they're really more focused on the game than the aesthetics of the game,” Murray told me. “I struggle to get this across to people, when you're watching people play, and they played for 10 hours of whatever, they're as excited about a barren world as they are about one that is full of life, because they're going down and scanning for resources and mining them and they're excited because they found a new technology not because they found a new shape of tree.” Players can go back to pre-release coverage of the game and pick nits until they find all the differences between what we saw in the years leading up to release and the final product, and they could do the same to every video game for similar results. 
The difference here is what players allowed themselves to imagine these quintillion planets will look and play like. Infinite space does not equal infinite possibilities. This is a lesson I learned from Minecraft, or any other procedurally generated world. At first I’m awestruck by the scale of the space, that I could go anywhere, and that I could play with the game’s system to make my own fun. This is exactly what I do in No Man’s Sky: I land, I look for the resources I need, and I use them to build upgrades. It’s cool because I chose where to go and I chose what to build. But then there’s the second phase, where I realize that these spaces aren’t handcrafted. They’re a mix of formulas that create an infinite number of situations that are only superficially different than one another. This planet is red. That planet is blue. I’m oversimplifying here to prove a point, but not by much. It’s not like Star Trek, when you had no idea what to expect whenever the ground team landed on an alien planet. “All the quotes from that time were similar, we were gonna have this many planets, content forever,” Compton told me about the marketing for Spore leading up to its release. “We got that the big challenge was technical, and we thought if all the procedurally generated stuff worked, we would win … But after the technical challenge is complete there’s this whole other challenge of making it meaningful.” When Sony came to New York to demo No Man’s Sky they also came with a demo for Uncharted 4, which is the exact opposite to No Man’s Sky in terms of design. Uncharted 4, which is made by another studio of about 300 people, is one of the most beautiful, rich virtual worlds I’ve ever seen. Every leaf and stone in it was hand crafted and placed at just the right spot to create the desired effect. 
It’s a highly directed experience, where the designers are trying to funnel me from point A to B in a way that feels like an exciting, unique experience, but that in reality is identical to the experience of every other player. It’s easier to make things feel meaningful this way because it’s a much more controlled environment. Murray is fully aware that he’s sacrificing some of that when he leans so heavily on procedural generation, but he believes that it’s a worthy sacrifice. “I think we've gone really far down that path [of directed experiences], and I feel like a lot of games, and I've done this before, they're grabbing the players’ head and forcing them to look at something, you're leading them by the nose, look, here's a boss battle coming up, so have some extra health, that kind of thing,” Murray said. “You're trying to normalize everyone's experience, but that's not what games are good at.”

By comparing disease dynamics in North American and Asian bat populations, researchers have found evidence that Asian bat species have much lower levels of infection with the fungus that causes white-nose syndrome than North American species and therefore are resistant to it. The study, published March 9, 2016 in Proceedings of the Royal Society B, also suggests that some declining North American bat species may be able to evolve enough resistance to the disease to persist, while other species appear less likely to do so. Led by researchers at the University of California, Santa Cruz, an international team sampled hibernating bats at five sites in China and five sites in the United States, using a standardized swabbing technique to detect and quantify the amount of fungus on each bat. "Uniformly, across all the species we sampled in China, we found much lower levels of infection: both the fraction of bats infected and the amount of fungus on infected bats were lower than in North America," said first author Joseph Hoyt, a graduate student at UC Santa Cruz. Co-first author Kate Langwig, a former UCSC graduate student now a postdoctoral researcher at Harvard University, said the team collected samples from bats at hibernation sites in Northeastern China and the Midwestern United States where the latitude and winter climate are very similar. The fungus that causes white-nose syndrome is endemic in Asia and Europe, so bats there have coexisted with it for a long time, whereas the disease only recently invaded North America, where it was first discovered in 2006. "This is the first study to compare disease dynamics in an endemic region and a region where the pathogen is invading, and the results can help us understand the course it might take in North America," Hoyt said. 
The researchers considered four possible hypotheses for the ability of Asian bats to persist with the fungus: host resistance, host tolerance, lower transmission due to smaller populations, or lower fungal growth rates due to environmental factors. Their results pointed toward host resistance and did not support the other hypotheses, Hoyt said. The variation in infection intensity observed within some North American species, such as the little brown bat, may be a hopeful sign, he said. Overall, little brown bats had much higher levels of infection than Asian bats, but some individual little brown bats had relatively low fungal loads. If the variation is a result of genetic differences, it could lead to the evolution of resistance in that species. Langwig noted that one North American species, the big brown bat, has not suffered as dramatically from the disease as other North American species. In contrast, northern long-eared bats, which showed very low variability in fungal loads, have experienced drastic population declines. "The northern long-eared bat suffers really high fungal loads, and nearly all individuals are infected; there's no overlap with the Asian species," Langwig said. "From previous work, we've seen their populations crashing toward extinction, so it could be a poor omen for that species." The mechanisms underlying the resistance of Asian bat species remain unknown. "It doesn't have to be the same strategy for every species (it could be differences in the skin microbiome in one and hibernation behavior in another), but we just don't have those details yet," Langwig said. More information: Host persistence or extinction from emerging infectious diseases: insights from white-nose syndrome in endemic and invading regions, Proceedings of the Royal Society B: Biological Sciences, rspb.royalsocietypublishing.org/lookup/doi/10.1098/rspb.2015.2861

The following strains and heteroallelic combinations were used: y1w1118 as the wild-type stock (yw), aubHN2/QC42 (aub) and tud1/Df(2R)PurP133 (tud), for aub and tud mutants (loss-of-function), respectively15, 31, 32, 33. All flies were grown at 25 °C with 70% relative humidity on a 12-h light–dark cycle. The 2–4-day female flies were crossed to yw males for 2 days in standard cornmeal food supplied with yeast paste before ovary dissection. Embryos collected at well-defined time-windows were dechorionated in 50% commercial bleach for 2 min, washed extensively in water and collected in PBS or HBSS or fixation solution, depending on downstream applications. Antibody against Aubergine (Aub-83) was produced by immunizing rabbits with Aub peptide (HKSEGDPRGSVRGRC, in which terminal cysteine was used to couple to KLH; Genscript) and selected with peptide-affinity purification of sera. Other antibodies that were used in this study: mouse monoclonal anti-PABP (6E2 clone)34, E7 mouse monoclonal anti-β-tubulin (Developmental Studies Hybridoma Bank) and anti-Tudor mouse monoclonal (gift from M. Siomi). Fixation and immunohistochemistry of dissected ovaries and embryos was performed according to standard protocols. Primary antibodies against Aub and Tud were used at 1 ng μl−1 final concentration. Secondary antibodies conjugated to Alexa 488 and 594 (Life technologies) were used at 1:1,000 dilution. Ovary and embryo samples were imaged on Leica TCS SPE confocal microscope. CLIP was performed as previously described for Mili, Miwi and MOV10L1 (refs 17, 35, 36). The protocol is described in detail previously36 and uses stringent buffer conditions to ensure high specificity. The experiment was performed in three biological replicates for each condition (yw ovaries, yw embryos 0–2 h, tud embryos 0–2 h). Approximately 40 mg of Drosophila embryos (0–2 h) or ~80 ovaries from 4–6-day females were collected in ice-cold HBSS and ultraviolet-irradiated (3×) at 254 nm (400 mJ cm−2). 
The tissues were pelleted, washed with PBS and the final tissue pellet was flash-frozen in liquid nitrogen and kept at −80 °C. Ultraviolet-light-treated tissues were lysed in 350 μl PMPG (PBS (no Mg2+ and no Ca2+), 2% Empigen) with protease inhibitors and rRNasin (2 U μl−1) and no exogenous RNases; lysates were treated with DNase I (Promega) for 5 min at 37 °C, and then were centrifuged at 100,000g for 30 min at 4 °C. For each immunoprecipitation, approximately 10 μl of our anti-Aub antibody was bound on 150 μl (slurry) of protein A Dynabeads in Ab binding buffer (0.1 M Na-phosphate, pH 8, and 0.1% NP-40) at room temperature for 2 h; antibody-bound beads were washed three times with PMPG. Antibody beads were incubated with lysates (supernatant of 100,000g) for 3 h at 4 °C. Low- and high-salt washes of immunoprecipitation beads were performed with 1× and 5× PMPG (5× PBS, 2% Empigen). RNA linkers (RL3 and RL5), as well as 3′ adaptor labelling and ligation to CIP (calf intestinal phosphatase)-treated RNA CLIP tags were performed as previously described36. Immunoprecipitation beads were eluted at 70 °C for 12 min using 30 μl of 2× Novex reducing loading buffer. Samples were analysed by NuPAGE (4–12% gradient precast gels, run with MOPS buffer). Cross-linked RNA–protein complexes were transferred onto nitrocellulose (Invitrogen), and the membrane was exposed to film for 1–2 h. Membrane fragments containing the main radioactive signal and fragments up to ~15 kDa higher were excised (Fig. 1a). RNA extraction, 5′ linker ligation, Reverse transcriptase PCR (RT–PCR) and a second PCR step were performed with the DNA primers (DP3 and DP5, DSFP3 and DSFP5) as described previously36. Complementary DNA from two PCR steps was resolved on and extracted from 3% Metaphor 1× TAE gels. Size profiles of cDNA libraries prepared from the main radioactive signal and higher molecular mass signal were similar (Fig. 1a). 
DNA was extracted with QIAquick Gel Extraction kit and submitted for deep sequencing. The cDNA libraries were sequenced with Hi-Seq Illumina at 100 cycles. Solid-support directional RNA-seq was performed as previously described17, using total RNA (depleted of ribosomal RNA with Ribo-Zero (EpiCentre)) isolated from 0–2-h embryos of appropriate genotypes. Nycodenz density gradient separation of RNPs was performed as previously described17 with modifications. A 20–60% (top to bottom) Nycodenz gradient (4.8 ml) in 1× KMH150 (150 mM KCl, 2 mM MgCl2, 20 mM HEPES, pH 7.4, 0.5% NP-40, 0.1 U μl−1 rRNasin, and protease inhibitors) was prepared as a step gradient by overlaying five equal parts of Nycodenz solutions and was left to diffuse overnight at 4 °C. 0.2 microlitres of post-nuclear yw embryo lysate in 1× KMH150 was laid over the gradient and centrifuged at 150,000g for 20 h. We used embryos of stages 4–6, to avoid earlier stages where mRNAs at the soma form mRNPs distinct from the ones formed in the pole plasm PGCs. The gradient was collected in 12 equal fractions. Samples from each fraction were used for protein determination by Bradford and RNA extraction with Trizol LS. Right before RNA extraction, 500 ng of in vitro transcript of Renilla luciferase mRNA was spiked in each fraction for normalization purposes in subsequent steps. An equal volume of RNA extracted from each fraction was reverse transcribed by SuperScript III (Invitrogen 18080-051) in the presence of random hexamers. An equal volume of the cDNA was mixed with primers (gcl, osk, Hsp83, dhd, cycB: Qiagen QuantiTect Assay; Renilla luciferase (rLuc), forward: 5′-CGCTGAAAGTGTAGTAGATGTG-3′ and reverse: 5′-TCCACGAAGAAGTTATTCTCCA-3′) and Power SYBR Green reaction mix (Applied Biosystems 4367659). The reactions were run on a StepOnePlus System (Applied Biosystems) using the default program. 
Aub immunoprecipitation, 5′ end labelling of piRNAs and cDNA library preparation were carried out as previously described37, 38. We used CLIPSeqTools39, a bioinformatics suite that we created for analysis of CLIP-seq data sets (accessible at: http://mourelatos.med.upenn.edu/clipseqtools/) and a Perl programming framework that we developed40. The latter framework is named GenOO and has been specifically developed for analysis of high-throughput sequencing data. The source code for GenOO has been deposited in GitHub and can be accessed at https://github.com/genoo/. In statistical analyses, we ensured that the assumptions of each statistical test were met and that the statistical test used was appropriate for the analysis. In all analyses the statistical tests and methods used are clearly stated in relevant sections. No statistical methods were used to predetermine sample size. The experiments were not randomized and investigators were not blinded to allocation during experiments and outcome assessment. Drosophila (assembly dm3) transcript, exon and repeat genomic locations were downloaded from the UCSC genome browser (downloaded 22 March 2011 from http://genome.ucsc.edu). Repeat consensus sequences were downloaded from Flybase (http://flybase.org/ - transposon_sequence_set v9.42). Localization categories for Drosophila genes were taken from ref. 19. The localization annotation matrix was downloaded from http://fly-fish.ccbr.utoronto.ca/annotation_matrix.csv. Transposon categories were as in ref. 31. The 3′ end ligated adaptor (GTGTCAGTCACTTCCAGCGGTCGTATGCCGTCTTCTGCTTG) was removed from the sequences using the cutadapt software and a 0.25 acceptable error rate for the alignment of the adaptor on the read. To eliminate reads in which the adaptor was ligated more than one time, adaptor removal was performed three times. 
Reads for all samples were aligned against the dm3 Drosophila melanogaster genome assembly using the aligner bwa v0.6.2-r126, with the default settings41. Reads were also aligned against the Repeat consensus sequences using the same aligner. All mapped reads were divided in the following genomic categories: sense repeat, antisense repeat, non-coding RNA, (protein) coding RNA. The remaining reads were considered to be intergenic reads. Gene expression was defined as the number of reads that map on each gene and the values were normalized by the upper quartile normalization method42. The log gene expression levels of replicates are compared using the Pearson Correlation function in R. Reads mapping in the same position (same 5′ end mapping) were considered as coinciding. When comparing CLIP with immunoprecipitation libraries, the percentage of piRNA-size CLIP reads that had a coinciding start with any standard immunoprecipitation read were counted as positive. For each localization category, the quartile-normalized lgCLIP binding level (‘mRNA expression level’ in each CLIP library) is compared via a two-sided t-test between genes that belong to the category versus genes that do not belong to it. To compare two samples, we measure the difference in binding (per gene) between the two conditions (log (gene.expr.cond1/gene.expr.cond2)) and then perform a t-test of differences in genes belonging to the category versus genes not belonging in the category. 
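The normalization and per-gene comparison described above can be illustrated with a minimal Python sketch. It is not the authors' code (the analysis above was done in R and Perl). The function names, the use of the inclusive quantile method, the base-2 logarithm and the skipping of genes with a zero value are all assumptions made for illustration.

```python
import math
import statistics


def upper_quartile(values):
    """75th percentile; the 'inclusive' interpolation method is an assumption."""
    return statistics.quantiles(values, n=4, method="inclusive")[2]


def upper_quartile_normalize(counts):
    """Divide each gene's read count by the upper quartile of all counts.

    counts: dict mapping a gene name to its raw read count.
    """
    uq = upper_quartile(list(counts.values()))
    if uq == 0:
        raise ValueError("upper quartile is zero; cannot normalize")
    return {gene: count / uq for gene, count in counts.items()}


def log_ratios(expr_cond1, expr_cond2):
    """Per-gene log2(cond1 / cond2) for genes present in both conditions.

    Genes with a zero value in either condition are skipped (an assumption;
    the text does not say how such genes were handled).
    """
    ratios = {}
    for gene, value1 in expr_cond1.items():
        value2 = expr_cond2.get(gene, 0)
        if value1 > 0 and value2 > 0:
            ratios[gene] = math.log2(value1 / value2)
    return ratios


# Hypothetical counts for two libraries.
yw = upper_quartile_normalize({"a": 0, "b": 10, "c": 20, "d": 30, "e": 40})
tud = upper_quartile_normalize({"a": 5, "b": 10, "c": 10, "d": 15, "e": 20})
differences = log_ratios(yw, tud)
```

The resulting per-gene differences would then be split into genes inside and outside a localization category and compared with a t-test, which this sketch leaves out.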
The following twelve mRNA localization categories19 were found significantly depleted in tud embryo Aub CLIP libraries compared to yw embryo libraries, and were used in analyses where ‘posterior localized mRNAs’ are mentioned: ‘1:41:RNA islands’, ‘1:42:Pole buds’, ‘1:40:Pole plasm’, ‘3:265:Perinuclear around pole cell nuclei’, ‘4:370:Germ cell localization’, ‘4:403:Germ cell enrichment’, ‘3:348:Pole cell enrichment’, ‘2:141:Pole cell localization’, ‘2:153:Perinuclear around pole cell nuclei’, ‘2:142:Pole cell enrichment’, ‘3:347:Pole cell localization’, ‘1:59:Perinuclear around pole cell nuclei’ (http://fly-fish.ccbr.utoronto.ca/). The remaining mRNAs are mentioned as non-posterior localized mRNAs. The following three posterior localization categories were also depleted in tud embryo Aub CLIP libraries compared to yw: ‘1:39:Posterior localization’, ‘2:124:Posterior localization’, ‘3:352:Posterior localization’. Almost all of the mRNAs contained in the above twelve categories are also contained in these three, but these three categories also contain some mRNAs that do not actually localize in the pole plasm or the germ cells (that is, with apical localization); therefore, mRNAs belonging to any of these three localization categories but not to any of the twelve posterior categories mentioned above were not considered for the generation of Supplementary Table 4. Many mRNAs do not have a designated localization pattern, and they are mentioned as ‘undetermined localization’. It is worth mentioning that this category contains several mRNAs with clear posterior–pole-plasm localization. 
Through manual searches of the Berkeley Drosophila Genome Project chromogenic ISH database (http://insitu.fruitfly.org/cgi-bin/ex/insitu.pl) we noticed that many Aub-bound mRNAs, the localization of which is not annotated in the Fly-FISH database, are indeed localized in the germ plasm/cells (such as CG4735/shu, CG7070/PyK, CG4903/MESR4, CG5452/dnk and CG9429/Calr), therefore our analysis is most likely underestimating the true number of Aub-bound mRNAs that are important for germline specification and function. Because of this, mRNAs with ‘undetermined localization’ were never mixed with ‘non-posterior localized’ mRNAs in our analyses. To identify highly bound genes, we used the rank product method43. Specifically, genes are sorted by expression per sample, and for each gene the product of their ranks is calculated. The probability of this rank product produced by chance is calculated by permutations of all non-zero value genes. We calculated the expression for protein-coding transcripts by counting the number of RNA-seq reads that map within the exons of each transcript. The counts were normalized using RPKM and upper quartile normalization, effectively dividing each count by the upper quartile of all counts42. The transcript with the highest RPKM score was used (‘best transcript’) unless otherwise noted. We calculated the expression for protein-coding transcripts by counting the number of CLIP reads that map within the exons of each transcript in the sense orientation. The counts were normalized using reads per million and upper quartile normalization, effectively dividing each count by the upper quartile of all counts42. Upper quartile normalized RPKM for RNA-seq was compared to similarly normalized CLIP binding levels defined as average number of reads per transcript in CLIP replicates. Correlation was calculated using the Pearson Correlation function in R. (1) Identified lgCLIP size reads (read length >35) that did not align to the genome. 
(2) Made a set of piRNA-size substrings (L = [23,29]) from both ends of the reads from (1). (3) Matched the substrings from (2) to full-length piRNAs (L = [23,29]) from the corresponding low samples (Extended Data Fig. 1b). (4) The longest aligning piRNAs are retained and coupled with the remainder of the read as piRNA–lgCLIP couples. (5) The piRNA aligning fragment is cut from the read. Very small remainder reads (L < 20) are discarded. (6) The remainders are aligned to the genome (using bwa default settings). (7) Remainders aligned in one single position that is on a known mRNA are retained. (1) Regions of 200-nucleotide length were cut around the midpoint of the genomic alignment region from step 7 of the previous routine. Specifically, if d = 200 is the length of the final region we want and L is the length of the read, the read together with a flanking genomic region of length d/2 on each side was excised from the chromosome sequence. If the alignment was located on the minus strand, the sequence was reversed and complemented at this point. This total region has length d + L. We discard an equal number of nucleotides from each side to reach a final length of d (specifically, we take the substring starting from int(L/2) and extending for d nucleotides; note that int always rounds down). At this point we have a region of length 200 nucleotides centred around the alignment region of the fragment. (2) We use a slightly modified Smith–Waterman44 alignment method (weights: match = +1, mismatch = −1, gap = −2) to align piRNAs on the 200-nucleotide long regions from (1). Differences of our alignment versus Smith–Waterman: (a) No penalties are given to non-matching nucleotides on the edges of the alignment. (b) If there are multiple optimal alignment scores, one is picked randomly. (c) Alignments in which part of one sequence is outside the boundaries of the other sequence are not considered. 
(3) The midpoint of the alignment (if k nucleotides matched, that is the int(k/2) nucleotide) is used for graphs of alignment positioning on regions. We grouped piRNA sequences into families based on the first 23 nucleotides of each piRNA. Using the alignment algorithm described above we aligned one piRNA (the most abundant) for each of the top 2,000 families to the longest annotated transcript for each protein-coding gene. These 2,000 piRNA families represent ~37% of piRNA reads from low yw CLIP libraries. To factor in transcript abundance, we multiplied the RNA-seq (yw 0–2-h embryo) RPKM value for each mRNA by the number of predicted piRNA target sites found within the mRNA. This provides a ‘targeting potential’ of every mRNA species, corrected for its abundance. We then evaluated the targeting potential of each piRNA–mRNA pair using three different scoring schemes. For the first, we summed the alignment scores of all putative piRNA binding sites on the mRNA. For the second, we calculated a weighted alignment score for each putative piRNA binding site and then summed all scores as in the previous scheme. The weighted score for each binding site is calculated based on the formula $\sum_i x_i A_i$, in which $x_i$ is 1 or 0 based on whether the nucleotide at position i of the piRNA is bound or not, and $A_i$ is the weight for nucleotide i. For the third, we multiplied the total number of predicted complementary sites per piRNA by the piRNA copy number. Transcript sequences (fasta file) for each species were downloaded from Flybase (ftp://ftp.flybase.net/genomes/ on 1 September 2015, current version used for each genome). For each gene (identified as the ‘parent’ tag in the fasta file header), the longest transcript length was identified. For the analysis of the expressed mRNAs (Fig. 4d), we used our yw embryo RNA-seq data to identify the longest transcript with the highest length normalized abundance. 
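The region extraction and the alignment weights given earlier in this section can be illustrated with a short Python sketch. It is an illustration only, not the authors' implementation: the function names and the 0-based half-open coordinates are assumptions, and the scoring function is the standard Smith–Waterman local alignment score with match = +1, mismatch = −1 and gap = −2. It does not reproduce the three modifications to Smith–Waterman described in the text.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA sequence (uppercase A, C, G, T, N)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def centred_region(chrom_seq, start, end, strand="+", d=200):
    """Return a d-nucleotide region centred on the alignment [start, end).

    Coordinates are 0-based and half-open (an assumption of this sketch).
    """
    read_len = end - start
    flank = d // 2
    if start - flank < 0 or end + flank > len(chrom_seq):
        raise ValueError("flanks extend beyond the chromosome sequence")
    # Read plus d/2 on each side: total length d + L.
    extended = chrom_seq[start - flank:end + flank]
    if strand == "-":
        extended = reverse_complement(extended)
    # Start at int(L/2) and keep d nucleotides.
    offset = read_len // 2
    return extended[offset:offset + d]


def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best standard Smith-Waterman local alignment score of a against b."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,
                previous[j - 1] + substitution,  # align a[i-1] with b[j-1]
                previous[j] + gap,               # gap in b
                current[j - 1] + gap,            # gap in a
            )
            best = max(best, current[j])
        previous = current
    return best


# Hypothetical chromosome and alignment coordinates.
chromosome = "ACGT" * 100
region = centred_region(chromosome, 150, 180)
score = local_alignment_score("ACGTACGT", region)
```

Using the linear-space form of the dynamic programme keeps memory proportional to the length of the region, which is enough when only the score is needed. Recovering the alignment midpoint would require keeping the full matrix or a traceback.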
Orthologue gene tables were downloaded from Flybase (gene_orthologs_fb_2015_03.tsv.gz) and were used to identify orthologue genes across species. For each species, all genes that mapped to localized and unlocalized Drosophila melanogaster genes were used in the comparison and were assigned to the same group as their D. melanogaster orthologue. Boxplots were created using the lattice package in R (bwplot), omitting outliers. P values were calculated using the Wilcoxon exact rank test (wilcox.test in R), one-sided, with the alternative hypothesis that localized genes are longer than non-localized genes.
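To make the one-sided exact rank test concrete, here is a minimal Python sketch of an exact Wilcoxon rank-sum P value computed by enumeration. It is not the code used in the analysis above, which relied on wilcox.test in R. It assumes there are no tied values and that the samples are small enough to enumerate, and the transcript lengths shown are made up.

```python
from itertools import combinations


def exact_rank_sum_p_greater(x, y):
    """Exact one-sided P value that values in x tend to be larger than in y.

    Assumes no ties. Enumerates every way of assigning len(x) ranks out of
    the pooled sample, so it is only practical for small samples.
    """
    pooled = sorted(x + y)
    if len(set(pooled)) != len(pooled):
        raise ValueError("this sketch does not handle tied values")
    rank_of = {value: rank for rank, value in enumerate(pooled, start=1)}
    observed = sum(rank_of[value] for value in x)

    total = 0
    at_least_as_extreme = 0
    for ranks in combinations(range(1, len(pooled) + 1), len(x)):
        total += 1
        if sum(ranks) >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / total


# Hypothetical transcript lengths (nucleotides), for illustration only.
localized = [5200, 6100, 7300]
non_localized = [1500, 2400, 3100, 4000]
p_value = exact_rank_sum_p_greater(localized, non_localized)
```

For data sets of realistic size, a statistics package that uses a recursive or approximate computation of the null distribution is the practical choice; enumeration is shown here only because it makes the definition of the exact test explicit.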

NOD/SCID Il2rgnull mice (Jackson Laboratory) were bred and maintained in the Stem Cell Unit animal barrier facility at McMaster University. All procedures were approved by the Animal Research Ethics Board at McMaster University. All patient samples were obtained with informed consent and with the approval of local human subject research ethics boards at McMaster University. Human umbilical cord blood mononuclear cells were collected by centrifugation with Ficoll-Paque Plus (GE), followed by red blood cell lysis with ammonium chloride (StemCell Technologies). Cells were then incubated with a cocktail of lineage-specific antibodies (CD2, CD3, CD11b, CD11c, CD14, CD16, CD19, CD24, CD56, CD61, CD66b, and GlyA; StemCell Technologies) for negative selection of Lin− cells using an EasySep immunomagnetic column (StemCell Technologies). Live cells were discriminated on the basis of cell size, granularity and, as needed, absence of viability dye 7-AAD (BD Biosciences) uptake. All flow cytometry analysis was performed using a BD LSR II instrument (BD Biosciences). Data acquisition was conducted using BD FACSDiva software (BD Biosciences) and analysis was performed using FlowJo software (Tree Star). To quantify MSI2 expression in human HSPCs, Lin− cord blood cells were stained with the appropriate antibody combinations to resolve HSC (CD34+ CD38− CD45RA− CD90+), MPP (CD34+ CD38− CD45RA− CD90−), CMP (CD34+ CD38+ CD71−) and EP (CD34+ CD38+ CD71+) fractions as similarly described previously18, 19 with all antibodies from BD Biosciences: CD45RA (HI100), CD90 (5E10), CD34 (581), CD38 (HB7) and CD71 (M-A712). Cell viability was assessed using the viability dye 7AAD (BD Biosciences). All cell subsets were isolated using a BD FACSAria II cell sorter (BD Biosciences) or a MoFlo XDP cell sorter (Beckman Coulter). HemaExplorer20 analysis was used to confirm MSI2 expression in human HSPCs and across the hierarchy. 
For all qRT–PCR determinations total cellular RNA was isolated with TRIzol LS reagent according to the manufacturer’s instructions (Invitrogen) and cDNA was synthesized using the qScript cDNA Synthesis Kit (Quanta Biosciences). qRT–PCR was done in triplicate with PerfeCTa qPCR SuperMix Low ROX (Quanta Biosciences) with gene-specific probes (Universal Probe Library (UPL), Roche) and primers: MSI2 UPL-26, F-GGCAGCAAGAGGATCAGG, R-CCGTAGAGATCGGCGACA; HSP90 UPL-46, F-GGGCAACACCTCTACAAGGA, R-CTTGGGTCTGGGTTTCCTC; CYP1B1 UPL-20, F-ACGTACCGGCCACTATCACT, R-CTCGAGTCTGCACATCAGGA; GAPDH UPL-60, F-AGCCACATCGCTCAGACAC, R-GCCCAATACGACCAAATCC; ACTB (UPL Set Reference Gene Assays, Roche). The mRNA content of samples compared by qRT–PCR was normalized based on the amplification of GAPDH or ACTB. MSI2 shRNAs were designed with the Dharmacon algorithm (http://www.dharmacon.com). Predicted sequences were synthesized as complementary oligonucleotides, annealed and cloned downstream of the H1 promoter of the modified cppt-PGK-EGFP-IRES-PAC-WPRE lentiviral expression vector18. Sequences for the MSI2 targeting and control RFP targeting shRNAs were as follows: shMSI2, 5′-GAGAGATCCCACTACGAAA-3′; shRFP, 5′-GTGGGAGCGCGTGATGAAC-3′. Human MSI2 cDNA (BC001526; Open Biosystems) was subcloned into the MA bi-directional lentiviral expression vector21. Human CYP1B1 cDNA (BC012049; Open Biosystems) was cloned into psMALB22. All lentiviruses were prepared by transient transfection of 293FT (Invitrogen) cells with pMD2.G and psPAX2 packaging plasmids (Addgene) to create VSV-G pseudotyped lentiviral particles. All viral preparations were titrated on HeLa cells before use on cord blood. Standard SDS–PAGE and western blotting procedures were performed to validate the effects of knockdown on transduced NB4 cells (DSMZ) and overexpression on 293FT cells. 
Immunoblotting was performed with anti-MSI2 rabbit monoclonal IgG (EP1305Y, Epitomics) and β-actin mouse monoclonal IgG (ACTBD11B7, Santa Cruz Biotechnology) antibodies. Secondary antibodies used were IRDye 680 goat anti-rabbit IgG and IRDye 800 goat anti-mouse IgG (LI-COR). 293FT and NB4 cell lines tested negative for mycoplasma. NB4 cells were authenticated by ATRA treatment before use. Cord blood transductions were conducted as described previously18, 23. Briefly, thawed Lin− cord blood or flow-sorted Lin− CD34+ CD38− or Lin− CD34+ CD38+ cells were prestimulated for 8–12 h in StemSpan medium (StemCell Technologies) supplemented with growth factors interleukin 6 (IL-6; 20 ng ml−1, Peprotech), stem cell factor (SCF; 100 ng ml−1, R&D Systems), Flt3 ligand (FLT3-L; 100 ng ml−1, R&D Systems) and thrombopoietin (TPO; 20 ng ml−1, Peprotech). Lentivirus was then added in the same medium at a multiplicity of infection of 30–100 for 24 h. Cells were then given 2 days after transduction before use in in vitro or in vivo assays. For in vitro cord blood studies biological (experimental) replicates were performed with three independent cord blood samples. Human clonogenic progenitor cell assays were done in semi-solid methylcellulose medium (Methocult H4434; StemCell Technologies) with flow-sorted GFP+ cells post transduction (500 cells per ml) or from day seven cultured transduced cells (12,000 cells per ml). Colony counts were carried out after 14 days of incubation. CFU-GEMMs can seed secondary colonies owing to their limited self-renewal potential24. Replating of MSI2-overexpressing and control CFU-GEMMs for secondary CFU analysis was performed by picking single CFU-GEMMs at day 14 and disassociating colonies by vortexing. Cells were spun and resuspended in fresh methocult, mixed with a blunt-ended needle and syringe, and then plated into single wells of a 24-well plate. 
Secondary CFU analysis for shMSI2- and shControl-expressing cells was performed by harvesting total colony growth from a single dish (as nearly equivalent numbers of CFU-GEMMs were present in each dish), resuspending cells in fresh methocult by mixing vigorously with a blunt-ended needle and syringe and then plating into replicate 35-mm tissue culture dishes. In both protocols, secondary colony counts were done following incubation for 10 days. For primary and secondary colony forming assays performed with the AHR agonist FICZ (Santa Cruz Biotechnology), 200 nM FICZ or 0.1% DMSO was added directly to H4434 methocult medium. Two-way ANOVA analysis was performed to compare secondary CFU output and FICZ treatment for MSI2-overexpressing or control conditions. Colonies were imaged with a Q-Colour3 digital camera (Olympus) mounted to an Olympus IX5 microscope with a 10× objective lens. Image-Pro Plus imaging software (Media Cybernetics) was used to acquire pictures and subsequent image processing was performed with ImageJ software (NIH). Transduced human Lin− cord blood cells were sorted for GFP expression and seeded at a density of 10^5 cells per ml in IMDM 10% FBS supplemented with human growth factors IL-6 (10 ng ml−1), SCF (50 ng ml−1), FLT3-L (50 ng ml−1), and TPO (20 ng ml−1) as previously described25. To generate growth curves, every seven days cells were counted, washed, and resuspended in fresh medium with growth factors at a density of 10^5 cells per ml. Cells from suspension cultures were also used in clonogenic progenitor, cell cycle and apoptosis assays. Experiments performed on transduced Lin− CD34+ cord blood cells used serum-free conditions as described in the cord blood transduction subsection of Methods. For in vitro cord blood studies, biological (experimental) replicates were performed with three independent cord blood samples. 
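The weekly count-and-reseed scheme used for the growth curves can be turned into a cumulative fold-expansion curve. The following is a minimal sketch, assuming that the seeding density is 10^5 cells per ml at every passage and that expansion is simply the product of the weekly fold changes; the function name and the example densities are hypothetical and do not come from the study.

```python
def cumulative_expansion(weekly_densities, seed_density=1e5):
    """Cumulative fold expansion for a culture reseeded at a fixed density.

    weekly_densities: cell densities (cells per ml) counted at each weekly
    passage, before washing and reseeding at seed_density (assumed 1e5).
    """
    fold = 1.0
    curve = []
    for density in weekly_densities:
        # Fold change over this week relative to the density that was seeded.
        fold *= density / seed_density
        curve.append(fold)
    return curve


# Hypothetical counts for three consecutive weeks (cells per ml).
curve = cumulative_expansion([4e5, 3e5, 2e5])
print(curve)
```

Multiplying the weekly fold changes, rather than plotting raw counts, keeps the curve comparable across passages because every week starts from the same density.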
Cell cycle progression was monitored with the addition of BrdU to day 10 suspension cultures at a final concentration of 10 μM. After 3 h of incubation, cells were assayed with the BrdU Flow Kit (BD Biosciences) according to the manufacturer’s protocol. Cell proliferation and quiescence were measured using Ki67 (BD Biosciences) and Hoechst 33342 (Sigma) on day 4 suspension cultures after fixing and permeabilizing cells with the Cytofix/Cytoperm kit (BD Biosciences). For apoptosis analysis, Annexin V (Invitrogen) and 7-AAD (BD Biosciences) staining of day 7 suspension cultures was performed according to the manufacturer’s protocol. Lin− cord blood cells were initially stained with anti-CD34 PE (581) and anti-CD38 APC (HB7) antibodies (BD Biosciences) then fixed with the Cytofix/Cytoperm kit (BD Biosciences) according to the manufacturer’s instructions. Fixed and permeabilized cells were immunostained with anti-MSI2 rabbit monoclonal IgG antibody (EP1305Y, Abcam) and detected by Alexa-488 goat anti-rabbit IgG antibody (Invitrogen). CD34+ cells were transduced with an MSI2-overexpression or MSI2-knockdown lentivirus along with their corresponding controls and sorted for GFP expression 3 days later. Transductions for MSI2 overexpression or knockdown were each performed on two independent cord blood samples. Total RNA from transduced cells (>1 × 10^5) was isolated using TRIzol LS as recommended by the manufacturer (Invitrogen), and then further purified using RNeasy columns (Qiagen). Sample quality was assessed using Bioanalyzer RNA Nano chips (Agilent). Paired-end, barcoded RNA-seq sequencing libraries were then generated using the TruSeq RNA Sample Prep Kit (v2) (Illumina) following the manufacturer’s protocols starting from 1 μg total RNA. Library quality was then assessed using a Bioanalyzer platform (Agilent), and libraries were either checked with an Illumina MiSeq QC run or quantified by qPCR using the KAPA quantification kit (KAPA Biosystems). 
Sequencing was performed using an Illumina HiSeq2000 using TruSeq SBS v3 chemistry at the Institute for Research in Immunology and Cancer’s Genomics Platform (University of Montreal) with cluster density targeted at 750,000 clusters per mm2 and paired-end 2 × 100-bp read lengths. For each sample, 90–95 million reads were produced and mapped to the hg19 (GRCh37) human genome assembly using CASAVA (version 1.8). Read counts generated by CASAVA were processed in EdgeR (edgeR_3.12.0, R 3.2.2) using TMM normalization, paired design, and estimation of differential expression using a generalized linear model (glmFit). The false discovery rate (FDR) was calculated from the output P values using the Benjamini–Hochberg method. The fold change of logarithm of base 2 of TMM normalized data (logFC) was used to rank the data from top upregulated to top downregulated genes and FDR (0.05) was used to define significantly differentially expressed genes. RNA-seq data have been deposited in NCBI’s Gene Expression Omnibus (GEO) and are accessible through GEO Series accession number GSE70685. iRegulon26 was used to retrieve the top 100 AHR predicted targets with a minimal occurrence count threshold of 5. The data were analysed using GSEA27 with ranked data as input with parameters set to 2,000 gene-set permutations. The GEO dataset GSE28359, which contains Affymetrix Human Genome U133 Plus 2.0 Array gene expression data for CD34+ cells treated with SR1 at 30 nM, 100 nM, 300 nM and 1,000 nM was used to obtain lists of genes differentially expressed in the treated samples compared to the control ones (0 nM)2. Data were background corrected using Robust Multi-Array Average (RMA) and quantile normalized using the expresso() function of the affy Bioconductor package (affy_1.38.1, R 3.0.1). Lists of genes were created from the 150 top upregulated and downregulated genes from the SR1-treated samples at each dose compared to the non-treated samples (0 nM). 
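The Benjamini–Hochberg adjustment applied to the RNA-seq P values above can be illustrated in a few lines. This is a sketch of the standard procedure in plain Python, not the edgeR or R code used in the study; the function name and the example P values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the original order."""
    n = len(p_values)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P value down so that adjusted values
    # stay monotone (a smaller P never gets a larger adjusted value).
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical P values for five genes.
example_p = [0.001, 0.008, 0.039, 0.041, 0.60]
fdr = benjamini_hochberg(example_p)
# Genes called significant at the FDR threshold of 0.05 used in the text.
significant = [q < 0.05 for q in fdr]
print(fdr, significant)
```

In this toy input the third and fourth genes have raw P values below 0.05 but adjusted values just above it, which is the kind of case the FDR cut-off is meant to exclude.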
The data were analysed using GSEA with ranked data as input with parameters set to 2,000 gene-set permutations. The normalized enrichment score (NES) and false discovery rate (FDR) were calculated for each comparison. The GEO data set GSE24759, which contains Affymetrix GeneChip HT-HG_U133A Early Access Array gene expression data for 38 distinct haematopoietic cell states4, was compared to the MSI2 overexpression and knockdown data. GSE24759 data were background corrected using Robust Multi-Array Average (RMA), quantile normalized using the expresso() function of the affy Bioconductor package (affy_1.38.1, R 3.0.1), batch corrected using the ComBat() function of the sva package (sva_3.6.0) and scaled using the standard score. Bar graphs were created by calculating for significantly differentially expressed genes the number of scaled data that were above (>0) or below (<0) the mean for each population. Percentages indicating for how long the observed value (set of up- or downregulated genes) was better represented in that population than random values were calculated from 1,000 trials. A unique list of genes closest to AHR-bound regions previously identified from TCDD-treated MCF7 ChIP–seq data14 was used to calculate the overlap with genes showing >1.5-fold downregulation in response to treatment with UM171 (35 nM) or SR1 (500 nM) relative to DMSO-treated samples3 as well as with genes significantly downregulated in MSI2-overexpressing versus control treated samples (FDR < 0.05). The percentage of downregulated genes with AHR-bound regions was then plotted for each gene set. P values were generated with Fisher’s exact test for comparisons between gene lists. AHR transcription factor binding sites in downregulated gene sets were identified with oPOSSUM-328. 
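To make the gene-list comparison concrete, the sketch below computes a one-sided Fisher’s exact test P value for a 2 × 2 table (for example, downregulated genes with or without an AHR-bound region versus background genes with or without one). The text does not state whether one- or two-sided tests were used, so the one-sided form here is an assumption, as are the function name and the example counts.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the table [[a, b], [c, d]].

    Returns the hypergeometric probability of observing a value of at
    least `a` in the top-left cell, with all margins held fixed.
    """
    row1 = a + b          # e.g. genes in the list of interest
    col1 = a + c          # e.g. all genes carrying the feature
    total = a + b + c + d
    denom = comb(total, row1)
    upper = min(row1, col1)
    tail = sum(
        comb(col1, k) * comb(total - col1, row1 - k)
        for k in range(a, upper + 1)
    )
    return tail / denom


# Hypothetical counts: 3 of 4 listed genes carry the feature,
# versus 1 of 4 background genes.
p_value = fisher_exact_greater(3, 1, 1, 3)
print(p_value)
```

Exact integer binomial coefficients are used so that the result is not affected by floating-point error until the final division.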
Genes showing >1.5-fold downregulation in response to treatment with UM171 (35 nM) or SR1 (500 nM) relative to DMSO-treated samples3 were used along with significantly downregulated genes (FDR < 0.05) with EdgeR-analysed MSI2-overexpressing versus control-treated samples. The three gene lists were uploaded into oPOSSUM-3 and the AHR:ARNT transcription factor binding site profile was used with the matrix score threshold set at 80% to analyse the region 1,500 bp upstream and 1,000 bp downstream of the transcription start site. The percentage of downregulated genes with AHR-binding sites in their promoters was then plotted for each gene set. Fisher’s exact test was used to identify significant overrepresentation of AHR-binding sites in gene lists relative to background. Eight- to 12-week-old male or female NSG mice were sublethally irradiated (315 cGy) one day before intrafemoral injection with transduced cells carried in IMDM 1% FBS at 25 μl per mouse. Injected mice were analysed for human haematopoietic engraftment 12–14 weeks after transplantation or at 3 and 6.5 weeks for STRC experiments. Mouse bones (femurs, tibiae and pelvis) and spleen were removed and bones were crushed with a mortar and pestle then filtered into single-cell suspensions. Bone marrow and spleen cells were blocked with mouse Fc block (BD Biosciences) and human IgG (Sigma) and then stained with fluorochrome-conjugated antibodies specific to human haematopoietic cells. For multilineage engraftment analysis, cells from mice were stained with CD45 (HI30) (Invitrogen), CD33 (P67.6), CD15 (HI98), CD14 (MφP9), CD19 (HIB19), CD235a/GlyA (GA-R2), CD41a (HIP8) and CD34 (581) (BD Biosciences). For MSI2 knockdown in HSCs, 5.0 × 104 and 2.5 × 104 sorted Lin− CD34+ CD38− cells were used per short-hairpin transduction experiment, leading to transplantation of day zero equivalent cell doses of 10 × 103 and 6.25 × 103, respectively, per mouse. 
For STRC LDA transplantation experiments, 10^5 sorted CD34+CD38+ cells were used per control or MSI2-overexpressing transduction. After assessing levels of gene transfer, day zero equivalent GFP+ cell doses were calculated to perform the LDA. Recipients with greater than 0.1% GFP+CD45+/− cells were considered to be repopulated. For STRC experiments that read out extended engraftment at 6.5 weeks, 2 × 10^5 CD34+ CD38+ cells were used per overexpressing or control transduction to allow non-limiting 5 × 10^4 day zero equivalent cell doses per mouse. For HSC expansion and LDA experiments, CD34+CD38− cells were sorted and transduced with MSI2-overexpressing or control vectors (50,000 cells per condition) for 3 days and then analysed for gene-transfer levels (% GFP+/−) and primitive cell marker expression (% CD34 and CD133). To ensure that equal numbers of GFP+ cells were transplanted into both control and MSI2-overexpressing recipient mice, we added identically cultured GFP− cells to the MSI2 culture to match the % GFP+ of the control culture (necessary owing to the differing efficiency of transduction). The adjusted MSI2-overexpressing culture was recounted and aliquoted (63,000 cells) to match the output of half of the control culture. Three day 0 equivalent GFP+ cell doses (1,000, 300 and 62 cells) were then transplanted per mouse to perform the D3 primary LDA. A second aliquot of the adjusted MSI2-overexpressing culture was then taken and put into culture in parallel with the remaining half of the control culture to perform another LDA after 7 days of growth (10 days total growth, D10 primary LDA). Altogether, four cell doses were transplanted; when converted back to day 0 equivalents these equalled approximately 1,000, 250, 100, and 20 GFP+ cells per mouse, respectively. 
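Limiting dilution data of this kind are commonly analysed under a single-hit Poisson model, in which a mouse receiving d cells stays unengrafted with probability exp(−f·d), where f is the frequency of repopulating cells. The sketch below estimates f by maximum likelihood. It is an illustration of that model only and is not the software used in the study; the function names and the engraftment counts are hypothetical, and only the three D3 cell doses are taken from the text.

```python
import math


def log_likelihood(freq, doses, tested, positive):
    """Log-likelihood of a repopulating-cell frequency (single-hit model)."""
    ll = 0.0
    for dose, n, r in zip(doses, tested, positive):
        p_pos = 1.0 - math.exp(-freq * dose)
        if r > 0:
            ll += r * math.log(p_pos)      # engrafted mice
        ll += (n - r) * (-freq * dose)     # non-engrafted mice
    return ll


def estimate_frequency(doses, tested, positive, lo=1e-9, hi=1.0, iters=200):
    """Maximum-likelihood frequency by ternary search.

    The log-likelihood is concave in freq, so ternary search on the
    interval [lo, hi] converges to its maximum.
    """
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if log_likelihood(m1, doses, tested, positive) < log_likelihood(
            m2, doses, tested, positive
        ):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2.0


# Doses from the D3 primary LDA; mice tested and engrafted are hypothetical.
doses = [1000, 300, 62]
tested = [5, 5, 5]
positive = [5, 3, 1]
freq = estimate_frequency(doses, tested, positive)
print(f"Estimated frequency: 1 in {1.0 / freq:.0f} cells")
```

With a single dose the estimate has the closed form f = −ln(fraction of negative mice) / dose, which gives a quick sanity check on the search.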
Pooled bone marrow from six engrafted primary mice that received D10 cultured control or MSI2-overexpressing cells (from the two highest doses transplanted) was aliquoted into five cell doses of 15 million, 10 million, 6 million, 2 million and 1 million cells. The number of GFP+ cells within primary mice was estimated from nucleated cell counts obtained from NSG femurs, tibias and pelvises and from Colvin et al.29. The actual numbers of GFP+ cells used for determining numbers of GFP+ HSCs and the number of mice transplanted for all LDA experiments are shown in Supplementary Tables 3–5. The cut-off for HSC engraftment was a demonstration of multilineage reconstitution that was set at bone marrow having >0.1% GFP+ CD33+ and >0.1% GFP+ CD19+ cells. HSC and STRC frequency was assessed using ELDA software30. For all mouse transplantation experiments, mice were age- (6–12 week) and sex-matched. All transplanted mice were included for analysis unless mice died from radiation sickness before the experimental endpoint. No randomization or blinding was performed for animal experiments. Approximately 3–6 mice were used per cell dose for each cord blood transduction and transplantation experiment. CLIP–seq was performed as previously described15. Briefly, 25 million NB4 cells (a transformed human cell line of haematopoietic origin) were washed in PBS and UV-cross-linked at 400 mJ cm−2 on ice. Cells were pelleted, lysed in wash buffer (PBS, 0.1% SDS, 0.5% Na-deoxycholate, 0.5% NP-40) and DNase-treated, and supernatants from lysates were collected for immunoprecipitation. MSI2 was immunoprecipitated overnight using 5 μg of anti-MSI2 antibody (EP1305Y, Abcam) and Protein A Dynabeads (Invitrogen). Beads containing immunoprecipitated RNA were washed twice with wash buffer, high-salt wash buffer (5× PBS, 0.1% SDS, 0.5% Na-deoxycholate, 0.5% NP-40), and PNK buffer (50 mM Tris-Cl pH 7.4, 10 mM MgCl2, 0.5% NP-40). 
Samples were then treated with 0.2 U MNase for 5 min at 37 °C with shaking to trim immunoprecipitated RNA. MNase inactivation was then carried out with PNK + EGTA buffer (50 mM Tris-Cl pH 7.4, 20 mM EGTA, 0.5% NP-40). The sample was dephosphorylated using alkaline phosphatase (CIP, NEB) at 37 °C for 10 min followed by washing with PNK + EGTA buffer, PNK buffer, and then 0.1 mg ml−1 BSA in nuclease-free water. 3′RNA linker ligation was performed at 16 °C overnight with the following adaptor: 5′P-UGGAAUUCUCGGGUGCCAAGG-puromycin. Samples were then washed with PNK buffer, radiolabelled using γ-32P-ATP (Perkin Elmer), run on a 4–12% Bis-Tris gel and then transferred to a nitrocellulose membrane. The nitrocellulose membrane was developed via autoradiography and RNA–protein complexes 15–20 kDa above the molecular weight of MSI2 were extracted with proteinase K followed by RNA extraction with acid phenol-chloroform. A 5′RNA linker (5′HO-GUUCAGAGUUCUACAGUCCGACGAUC-OH) was ligated to the extracted RNA using T4 RNA ligase (Fermentas) for 2 h and the RNA was again purified using acid phenol-chloroform. Adaptor-ligated RNA was re-suspended in nuclease-free water and reverse transcribed using Superscript III reverse transcriptase (Invitrogen). Twenty cycles of PCR were performed with NEB Phusion Polymerase using a 3′PCR primer that contained a unique Illumina barcode sequence. PCR products were run on an 8% TBE gel. Products ranging between 150 and 200 bp were extracted using the QIAquick gel extraction kit (Qiagen) and re-suspended in nuclease-free water. Two separate libraries were prepared and sent for single-end 50-bp Illumina sequencing at the Institute for Genomic Medicine at the University of California, San Diego. 47,098,127 reads from the first library passed quality filtering, of which 73.83% mapped uniquely to the human genome. 57,970,220 reads from the second library passed quality filtering, of which 69.53% mapped uniquely to the human genome. 
CLIP-data reproducibility was verified through high correlation between gene RPKMs and statistically significant overlaps in the clusters and genes within replicates. CLIP–seq data have been deposited in NCBI’s GEO and are accessible through GEO Series accession number GSE69583. Before sequence alignment of CLIP–seq reads to the human genome was performed, sequencing reads from libraries were trimmed of polyA tails, adapters, and low quality ends using Cutadapt with parameters --match-read-wildcards --times 2 -e 0 -O 5 --quality-cutoff 6 -m 18 -b TCGTATGCCGTCTTCTGCTTG -b ATCTCGTATGCCGTCTTCTGCTTG -b CGACAGGTTCAGAGTTCTACAGTCCGACGATC -b TGGAATTCTCGGGTGCCAAGG -b AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA -b TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT. Reads were then mapped against a database of repetitive elements derived from RepBase (version 18.05). Bowtie (version 1.0.0) with parameters -S -q -p 16 -e 100 -l 20 was used to align reads against an index generated from Repbase sequences31. Reads not mapped to Repbase sequences were aligned to the hg19 human genome (UCSC assembly) using STAR (version 2.3.0e)32 with parameters --outSAMunmapped Within --outFilterMultimapNmax 1 --outFilterMultimapScoreRange 1. To identify clusters in the genome of significantly enriched CLIP–seq reads, reads that were PCR replicates were removed from each CLIP–seq library using a custom script of the same method as in ref. 33; otherwise, reads were kept at each nucleotide position when more than one read’s 5′-end was mapped. Clusters were then assigned using the CLIPper software with parameters --bonferroni --superlocal --threshold-34. The ranked list of significant targets was calculated assuming a Poisson distribution, where the observed value is the number of reads in the cluster, and the background is the number of reads across the entire transcript and/or across a window of 1000 bp ± the predicted cluster. 
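The Poisson ranking described above can be illustrated as follows. This is a minimal sketch, assuming that the expected read count for a cluster is the background read density scaled to the cluster length; it is not the CLIPper implementation, and the function names and example numbers are hypothetical.

```python
import math


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), by summing the lower tail.

    Adequate for illustration; for extremely small tail probabilities
    the subtraction from 1 loses precision.
    """
    term = math.exp(-lam)   # P(X = 0)
    cdf = 0.0
    for i in range(k):
        cdf += term
        term *= lam / (i + 1)   # P(X = i + 1) from P(X = i)
    return max(0.0, 1.0 - cdf)


def cluster_p_value(cluster_reads, cluster_len, background_reads, background_len):
    """P value for a cluster given reads in a background region.

    The background may be the whole transcript or a local window
    around the cluster, as described in the text.
    """
    expected = background_reads / background_len * cluster_len
    return poisson_sf(cluster_reads, expected)


# Hypothetical cluster: 12 reads in 50 nt, against 40 reads across
# a 1,000-nt background region (expected 2 reads per 50 nt).
p_value = cluster_p_value(12, 50, 40, 1000)
print(p_value)
```

Clusters can then be ranked by this P value, with the choice of background (transcript-wide or local window) changing only the expected count.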
Transcriptomic regions and gene classes were defined using annotations found in gencode v17. Depending on the analysis, clusters were associated by the Gencode-annotated 5′UTR, 3′UTR, CDS or intronic regions. If a cluster overlapped multiple regions, or a single part of a transcript was annotated as multiple regions, clusters were iteratively assigned first as CDS, then 3′UTR, 5′UTR and finally as proximal (<500 bases from an exon) or distal (>500 bases from an exon) introns. Overlapping peaks were calculated using bedtools and pybedtools35, 36. Significantly enriched gene ontology (GO) terms were identified using a hypergeometric test that compared the number of genes that were MSI2 targets in each GO term to genes expressed in each GO term as the proper background. Expressed genes were identified using the control samples in SRA study SRP012062. Mapping was performed identically to CLIP–seq mapping, without peak calling and changing the STAR parameter outFilterMultimapNmax to 10. Counts were calculated with featureCounts37 and RPKMs were then computed. Only genes with a mean RPKM > 1 between the two samples were used in the background expressed set. Randomly located clusters within the same genic regions as predicted MSI2 clusters were used to calculate a background distribution for motif and conservation analyses. Motif analysis was performed using the HOMER algorithm as in ref. 34. For evolutionary sequence conservation analysis, the mean (mammalian) phastCons score for each cluster was used. CD34+ cells (>5 × 104) were transduced with an MSI2-overexpression or control lentivirus. Three days later, GFP+ cells were sorted and then put back in to StemSpan medium containing growth factors IL-6 (20 ng ml−1), SCF (100 ng ml−1), FLT3-L (100 ng ml−1) and TPO (20 ng ml−1). A minimum of 10,000 cells were used for immunostaining at culture days 3 and 7 after GFP sorting. Cells were fixed in 2% PFA for 10 min, washed with PBS and then cytospun on to glass slides. 
Cytospun cells were then permeabilized (PBS, 0.2% Triton X-100) for 20 min, blocked (PBS, 0.1% saponin, 10% donkey serum) for 30 min and stained with primary antibodies (CYP1B1 (EPR14972, Abcam); HSP90 (68/hsp90, BD Biosciences)) in PBS with 10% donkey serum for 1 h. Detection with secondary antibody was performed in PBS 10% donkey serum with Alexa-647 donkey anti-rabbit antibody or Alexa-647 donkey anti-mouse antibodies for 45 min. Slides were mounted with Prolong Gold Antifade containing DAPI (Invitrogen). Several images (200–1,000 cells total) were captured per slide at 20× magnification using an Operetta HCS Reader (Perkin Elmer) with epifluorescence illumination and standard filter sets. Columbus software (Perkin Elmer) was used to automate the identification of nuclei and cytoplasm boundaries in order to quantify mean cell fluorescence. A 271-bp region of the CYP1B1 3′UTR that flanked CLIP–seq-identified MSI2-binding sites was cloned from human HEK293FT genomic DNA using the forward primer GTGACACAACTGTGTGATTAAAAGG and reverse primer TGATTTTTATTATTTTGGT AATGGTG and placed downstream of renilla luciferase in the dual-luciferase reporter vector pGL4 (Promega). A 271-bp geneblock (IDT) with 6 TAG > TCC mutations was cloned in to pGL4 using XbaI and NotI. The HSP90 3′UTR was amplified from HEK293FT genomic DNA with the forward primer TCTCTGGCTGAGGGATGACT and reverse primer TTTTAAGGCCAAGGAATTAAGTGA and cloned into pGL4. A geneblock of the HSP90 3′UTR (IDT) with 14 TAG > TCC mutations was cloned in to pGL4 using SfaAI and NotI. Co-transfection of wild-type or mutant luciferase reporter (40 ng) and control or MSI2-overexpressing lentiviral expression vector (100 ng) was performed in the NIH-3T3 cell line, which does not express MSI1 or MSI2 (50,000 cells per co-transfection). Reporter activity was measured using the Dual-Luciferase Reporter Assay System (Promega) 36–40 h later. 
For MSI2-overexpressing cultures with the AHR antagonist SR1, Lin− CD34+ cells were transduced with MSI2-overexpression or control lentivirus in medium supplemented with SR1 (750 nM; Abcam) or DMSO vehicle (0.1%). GFP+ cells were isolated (20,000 cells per culture) and allowed to proliferate with or without SR1 for an additional 7 days at which point they were counted and immunophenotyped for CD34 and CD133 expression. For MSI2-overexpressing cultures with the AHR agonist FICZ, Lin− CD34+ cells were transduced with MSI2-overexpression or control lentivirus. GFP+ cells were isolated (20,000 cells per culture) and allowed to proliferate with FICZ (200 nM; Santa Cruz Biotechnology) or DMSO (0.1%) for an additional 3 days, at which point they were immunophenotyped for CD34 and CD133 expression. Lin− CD34+ cells were cultured for 72 h (lentiviral treated but non-transduced flow-sorted GFP− cells) in StemSpan medium containing growth factors IL-6 (20 ng ml−1), SCF (100 ng ml−1), FLT3-L (100 ng ml−1) and TPO (20 ng ml−1) before the addition of the CYP1B1 inhibitor TMS (Abcam) at a concentration of 10 μM or mock treatment with 0.1% DMSO. Equal numbers of cells (12,000 per condition) were then allowed to proliferate for 7 days at which point they were counted and immunophenotyped for CD34 and CD133 expression. Unless stated otherwise (that is, analysis of RNA–seq and CLIP–seq data sets), all statistical analysis was performed using GraphPad Prism (GraphPad Software version 5.0). Unpaired student t-tests or Mann–Whitney tests were performed with P < 0.05 as the cut-off for statistical significance. No statistical methods were used to predetermine sample size.
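For readers unfamiliar with the two tests named above, the sketch below computes their test statistics in plain Python. It is an illustration only and not the GraphPad Prism procedure; converting the statistics to P values requires the t and U reference distributions, which are omitted here. The function names and the example data are hypothetical.

```python
import math
from statistics import mean, variance


def student_t(x, y):
    """Unpaired Student t statistic (pooled variance) and degrees of freedom."""
    nx, ny = len(x), len(y)
    df = nx + ny - 2
    pooled = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / df
    t = (mean(x) - mean(y)) / math.sqrt(pooled * (1 / nx + 1 / ny))
    return t, df


def mann_whitney_u(x, y):
    """Mann-Whitney U statistic (smaller of the two U values), ties averaged."""
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    u_x = rank_sum_x - len(x) * (len(x) + 1) / 2
    u_y = len(x) * len(y) - u_x
    return min(u_x, u_y)


# Hypothetical measurements from two conditions.
control = [1.0, 2.0, 3.0]
treated = [4.0, 5.0, 6.0]
t_stat, dof = student_t(control, treated)
u_stat = mann_whitney_u(control, treated)
print(t_stat, dof, u_stat)
```

The t-test compares means and assumes roughly normal data with similar variances, whereas the Mann–Whitney test uses only ranks, which is why both are listed as alternatives.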

No statistical methods were used to predetermine sample size. Experiments were not randomized, and investigators were not blinded to allocation during experiments and outcome assessment. The recombineering technique21 was adapted to construct all targeting vectors for homologous recombination in ES cells. Retrieval vectors were obtained by combining 5′ miniarm (NotI/SpeI), 3′ miniarm (SpeI/BamHI) and the plasmid PL253 (NotI/BamHI). SW102 cells21 containing a BAC encompassing the carboxy-terminal part of the gene encoding the remodeller, were electroporated with the SpeI-linearized retrieval vector. This allowed the subcloning of genomic fragments of approximately 10 kilobases (kb) comprising the last exon of the gene encoding each remodeller. The next step was the insertion of a TAP-tag into the subcloned DNA, immediately 3′ to the coding sequence. The TAP-tag was (Flag) -TEV-HA for Chd1, Chd2, Chd4, Chd6, Chd8, Ep400, Brg1 and 6His-Flag-HA for Chd9. We first inserted the TAP-tag and an AscI site into the PL452 vector, to clone 5′ homology arms as SalI/AscI fragments into the PL452TAP-tag vector. 46C ES cells were electroporated with NotI-linearized targeting constructs and selected with G418. In all cases, G418-positive clones were screened by Southern blot. Details on the Southern genotyping strategy, as well as sequences of primers and plasmids used in this study are available on request. Correctly targeted ES cell clones were karyotyped, and the expression of each tagged remodeller was controlled by western blot analysis, using antibodies against Flag and haemagglutinin (HA) epitopes (see Extended Data Fig. 6). We also verified by immunofluorescence, using monoclonal antibodies anti-Flag (M2, Sigma F1804) and anti-HA (HA.11, Covance MMS-101P) epitopes, that each tagged remodeller was properly localized in the nucleus of ES cells. ES cell lines expressing a tagged remodeller were all indistinguishable in culture from their mother cell line (46C). 
Pluripotency of tagged ES cell lines was verified by detecting alkaline phosphatase activity on ES cell colonies 5 days after plating, using the Millipore alkaline detection kit, following manufacturer’s instructions. In addition, we verified by immunofluorescence using an antibody against Oct4 (also known as Pou5f1) (Abcam ab19857, lot 943333) that expression of this pluripotency-associated transcription factor was uniform in each tagged ES cell line. Mouse 46C ES cells have been described previously22. 46C ES cells and their tagged derivatives were cultured at 37 °C, 5% CO2, on mitomycin C-inactivated mouse embryonic fibroblasts, in DMEM (Sigma) with 15% fetal bovine serum (Invitrogen), l-glutamine (Invitrogen), MEM non-essential amino acids (Invitrogen), penicillin/streptomycin (Invitrogen), β-mercaptoethanol (Sigma), and a saturating amount of leukaemia inhibitory factor (LIF), as described previously23. Mouse ES nucleosomal tags were acquired from a published MNase-seq data set7 to make the reference map shown in Fig. 2. Reference nucleosomes were called using MACS 2.0 before assigning the first MNase-resistant nucleosome upstream and downstream of TSSs as −1 and +1, respectively. Because long NFRs may actually contain MNase-sensitive nucleosome-like structures or histone-containing complexes, defining the first downstream MNase-resistant nucleosome as ‘+1’ is problematic, and so we refer to it as the ‘first stable nucleosome’. Regions between the associated −1 and +1 (or first stable) nucleosomes were defined as NFRs. We further defined narrow and wide NFR categories, which have the median width of 28 bp and 808 bp, respectively. We define HFRs as lacking histones as defined by ChIP-seq. The list of 14,623 genes used in Figs 1 and 2 was obtained by filtering all mm9 RefSeq genes24. 
We removed redundancies (that is, genes having the same start and end sites), unmappable genes, blacklisted genomic regions (those with artefact signal regardless of which NGS techniques were used), and genes shorter than 2 kb. The purpose of this last filtering step was to unambiguously distinguish the promoter region from the end of the genes in heat maps. Lists of genes defined as having H3K4me3 and bivalent promoters: we first defined, among the 14,623 RefSeq genes, those with a promoter that was positive for H3K4me3 (accession number: GSM590111). This was accomplished by operating with the seqMINER platform. Tag densities from this data set were collected in a −500/+1,000-bp window around the TSS, and subjected to three successive rounds of k-means clustering, to remove all genes with a promoter that was clustered with low H3K4me3. We next conducted on this series of H3K4me3-positive promoters three successive rounds of k-means clustering, using several published data sets for H3K27me3. The genes with a promoter positive for H3K27me3 in four distinct H3K27me3 data sets (accession numbers: GSM590115, GSM590116, GSM307619 and GSM392046/GSM392047) were considered as bivalent. We eventually obtained a list of 6,481 genes with H3K4me3-only promoters, and a list of 3,411 bivalent genes. A detailed version of this protocol is available on the protocol exchange website: http://dx.doi.org/10.1038/protex.2014.040. In brief, about 400 million ES cells were fixed either with formaldehyde, or with a combination of disuccinimidyl glutarate (DSG) and formaldehyde (Supplementary Table 1), then permeabilized with IGEPAL, and incubated with 2,800 units of micrococcal nuclease (MNase, New England Biolabs) in order to fragment the genome into mononucleosomes (Extended Data Fig. 1). This nucleosome preparation was next incubated with agarose beads coupled with an antibody anti-HA or anti-Flag. Anti-HA-agarose (ref. A2095) and anti-Flag-agarose (ref. 
A2220) beads were purchased from Sigma. After a series of washes, tagged remodeller–nucleosome complexes were eluted, either by TEV protease cleavage or by peptide competition (Supplementary Table 1). The eluted complexes were then subjected to a second immunopurification step, using beads coupled to the antibody specific for the second HA or Flag epitope. After elution, DNA was extracted from the highly purified mononucleosome fraction, and processed for high-throughput sequencing (see below). As a negative control, chromatin from untagged ES cells was subjected to the same protocol to define background signal. Two biological replicates were used for each tagged and control ES cell line, using independent cell cultures and chromatin preparations. After crosslink reversal, phenol–chloroform extraction and ethanol precipitation, the DNA from remodeller–nucleosome complexes was quantified using the picogreen method (Invitrogen) or by running 1/20 of the ChIP material on a high sensitivity DNA chip on a 2100 Bioanalyzer (Agilent). Approximately 5–10 ng of ChIP DNA was used for library preparation according to the Illumina ChIP-seq protocol (ChIP-seq sample preparation kit). Following end-repair and adaptor ligation, fragments were size-selected on an agarose gel in order to purify nucleosome-sized genomic DNA fragments between 140 and 180 bp. Purified fragments were next amplified (18 cycles) and verified on a 2100 Bioanalyzer before clustering and single-read sequencing on an Illumina Genome Analyzer (GA) or GA II, according to manufacturer’s instructions. Sequencing characteristics are shown in Supplementary Table 1. Chd1, Chd2, Chd4, Chd6, Chd8, Chd9, Ep400 and Brg1 MNase remodeller ChIP-seq short reads were mapped to the mouse mm9 genome using Bowtie 0.12.7 with the following settings: -a -m1 --best --strata -v2 -p3. Data sets were next converted to BED format files, and data analysis was performed using the seqMINER platform25 (Fig. 1c). 
To examine the distribution of remodellers at individual genes, we used WigMaker3 (default settings) to convert BED files into wig files, which were uploaded onto the IGV genome browser (Extended Data Fig. 2).

Nucleosome calls were made from MNase remodeller ChIP-seq tags using GeneTrack (ref. 26) with the following parameters: sigma = 20, exclusion = 146. We then globally shifted tags to the median value of half distances of all nucleosome calls. GRO-seq tags (ref. 10) sharing the same or opposite orientation with the TSS were assigned as ‘sense’ and ‘divergent’ tags, respectively. The orientation of each NFR was arranged so that sense transcription proceeds to the right. ES nucleosomal tags, globally shifted tags from MNase remodeller ChIP-seq (this current study), tags from DHS regions (Mouse ENCODE), GRO-seq oriented tags from transcriptionally engaged Pol II and CpG islands (UCSC, mm9 build) were then aligned to the midpoint of each NFR. Promoter regions were then sorted by NFR length and visualized by Java TreeView (Fig. 2a, b). CpG island information was retrieved from UCSC (mm9 build) and assigned to the closest TSS by using bedtools. We noticed that promoters with wide NFRs were mostly CpG island (CpGI)-rich, while those with narrow NFRs were globally CpGI-poor, in agreement with a previous report showing that CpGIs induce nucleosome exclusion (ref. 9) (Fig. 2b).

Tags from reference nucleosomes (ref. 7), remodeller-interacting nucleosomes (this study) and transcriptionally engaged Pol II (GRO-seq; ref. 10) were aligned to nucleosome −1 and +1 (or the first stable nucleosome) dyad positions. The direction of each dyad was assigned according to the orientation of its associated TSS, the orientation of which was arranged so that the transcription proceeds to the right. After normalization to the gene count in the two different NFR subclasses, tags were plotted from the NFR midpoint to 500 bp distal to the reference nucleosome.
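As an illustration of the orientation convention described above, the following minimal Python sketch expresses tag positions relative to a reference point (an NFR midpoint or a nucleosome dyad) so that sense transcription always proceeds to the right, and labels GRO-seq tags as sense or divergent. The function names and the "+"/"-" strand encoding are assumptions made for this example; they are not taken from the original analysis scripts.

```python
def relative_position(tag_pos, reference_pos, tss_strand):
    """Signed distance (bp) from the reference point, flipped for minus-strand
    genes so that sense transcription always points to the right."""
    offset = tag_pos - reference_pos
    return offset if tss_strand == "+" else -offset


def classify_gro_tag(tag_strand, tss_strand):
    """Label a GRO-seq tag by its orientation relative to the TSS."""
    return "sense" if tag_strand == tss_strand else "divergent"


# Hypothetical example: a minus-strand gene with its NFR midpoint at 10,000 bp.
tags = [(9_950, "-"), (10_120, "+")]
oriented = [
    (relative_position(pos, 10_000, "-"), classify_gro_tag(strand, "-"))
    for pos, strand in tags
]
print(oriented)  # [(50, 'sense'), (-120, 'divergent')]
```

Flipping the sign for minus-strand genes is what allows promoters on both strands to be stacked in a single heat map or composite plot.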
An x axis gap in the NFR was introduced to normalize variations in NFR length inside each class.

We used DNaseI-Seq data from the mouse ENCODE consortium (GSM1004653) for the identification of DHS regions in the mouse ES cell genome. DHS regions were defined using MACS 2.0 (ref. 27) (default setting), which resulted in the identification of 139,454 DHS regions. Each of these DHS regions was represented as a 500-bp window (−250 bp/+250 bp) centred on the midpoint of the DHS peak. DHS regions overlapping with the blacklisted (high background signal) genomic areas (mm9) were removed, resulting in a final list of 138,582 DHS regions. Tags from each tested ChIP-seq data set were summed up for each DHS region before pair-wise Pearson correlation comparison. The R² value from each pair-wise Pearson correlation was then visualized by heat map (Fig. 1a).

Pearson correlation analysis at promoter-like DHS regions. Operating with the seqMINER platform, we retrieved, from the 138,582 DHS regions list, those positive for H3K4me3, TBP and Pol II S5ph. We obtained 16,300 promoter-like DHS regions meeting these criteria. Pair-wise Pearson correlation was performed and plotted (Fig. 1b) as described for Fig. 1a.

We used the pHYPER shRNA vector for remodeller depletion in ES cells, as previously described (ref. 28). shRNA design was performed using DSIR software (http://biodev.extra.cea.fr/DSIR/DSIR.html). Below are the shRNAs selected for each remodeller. The sense strand sequence is given; the rest of the shRNA sequence is as described previously (ref. 28).
Chd1 shRNA 1: 5′-GCAAAGACGGCGACTAGAAGA-3′; Chd1 shRNA 2: 5′-GACAGTGCTTAATCAAGATCG-3′; Chd4 shRNA 1: 5′-GGACGACGATTTAGATGTAGA-3′; Chd4 shRNA 2: 5′-GCTGACGTCTTCAAGAATATG-3′; Chd6 shRNA 1: 5′-GTACTATCGTGCTATCCTAGA-3′; Chd6 shRNA 2: 5′-CAGTCAGAACCCACAATAACT-3′; Chd8 shRNA 1: 5′-GCAGTTACACTGACGTCTACA-3′; Chd8 shRNA 2: 5′-GACTTTCTGTACCGCTCAAGA-3′; Chd9 shRNA 1: 5′-TATACCAATTGAACAAGAGCC-3′; Chd9 shRNA 2: 5′-AGTTAAAGTCTACAGATTAGT-3′; Ep400 shRNA 1: 5′-GGTAAAGAGTCCAGATTAAAG-3′; Ep400 shRNA 2: 5′-GGTCCACACTCAACAACGAGC-3′; Smarca4 shRNA 1: 5′-ACTTCTTGATAGAATTCTACC-3′; Smarca4 shRNA 2: 5′-CCTTCGAACAGTGGTTCAATG-3′. Each shRNA was transfected in its corresponding tagged ES cell line, to follow remodeller depletion by western blotting using monoclonal antibodies anti-Flag (M2, Sigma F1804), or anti-HA (H7, Sigma H3663) epitopes (Extended Data Fig. 6), in comparison with the signal obtained with a control antibody anti-Gapdh (Abcam ab9485). The pHYPER shRNA vectors were transfected in ES cell by electroporation, using an Amaxa nucleofector (Lonza). Twenty-four hours after transfection, puromycin (2 μg ml−1) selection was applied for an additional 48 h period, before cell collection and RNA preparation, except for Chd4, for which cells were collected after 30 h of selection. Total RNA was extracted using an RNeasy kit (Qiagen). Total RNA yield was determined using a NanoDrop ND-100 (Labtech). Total RNA profiles were recorded using a Bioanalyzer 2100 (Agilent). For each remodeller, RNA was prepared from three independent transfection experiments, and processed for transcriptome analysis. 46C ES cells were amplified on feeder cells except for the last passage, at which point cells were plated onto 60-mm dishes coated with gelatine, and grown to 70% confluence in D15 medium with LIF. Total RNA was extracted using an RNeasy Kit (Qiagen). The RNA quality was verified on a 2100 Bioanalyzer. 
Library preparation was performed using the Illumina mRNaseq sample preparation kit according to manufacturer’s instructions. Briefly, the total RNA was depleted of ribosomal RNA using the Sera-mag Magnetic Oligo (dT) Beads (Illumina) and after mRNA fragmentation, reverse transcription and second strand cDNA synthesis the Illumina specific adaptors were ligated. The ligation product was then purified and enriched with 15 cycles of PCR to create the final library for single-read sequencing of 75 bp carried out on an Illumina GAIIx. To keep only sequences of good quality, we retained the first 40 bp of each read and discarded all sequences with more than 10% of bases having a quality score below 20, using FASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/). Mapping of these sequences onto the mm9 assembly of mouse genome and RPKM computation were then performed using ERANGE v3.1.0 (ref. 29) and bowtie v0.12.0 (ref. 30). In brief, a splice file was created with UCSC known genes and maxBorder = 36. We created an expanded genome containing genomic and splice-spanning sequences using bowtie-build and bowtie was used to map the reads onto this expanded genome. Then the ERANGE runStandardAnalysis.sh script was used to compute RPKM values following steps previously described29, using a consolidation radius of 20 kb. Random-primed reverse transcription was performed at 52 °C in 20 μl using Maxima First strand cDNA synthesis kit (Thermo Scientific) with 1 μg of total RNA isolated from ES cells (Qiagen), quantified with NanoDrop instrument (Thermo Scientific). Reverse transcription products were diluted 40-fold before use. Composition of quantitative PCR assay included 2.5 μl of the diluted RT reaction, 0.2–0.5 mM forward and reverse primers, and 1× Maxima SYBR Green qPCR Master Mix (Thermo Scientific). Reactions were performed in a 10 μl total volume. 
Amplification was performed as follows: 2 min at 95 °C, 40 cycles at 95 °C for 15 s and 60 °C for 60 s in the ABI/Prism 7900HT real-time PCR machine (Applied Biosystems). The real-time fluorescence data from qPCR were analysed with the Sequence Detection System 2.3 (Applied Biosystems). Each qPCR reaction was performed using the set of primer pairs listed in Supplementary Table 2, validated for their specificity and efficiency of amplification. All reactions were performed in triplicate, using RNA prepared from three independent cell transfection experiments. Control reactions without enzyme were verified to be negative. Relative expression was calculated after normalization with three reference genes (Actb, Nmt1 and Ddb1), validated for this study.

cRNA was synthesized, amplified and purified using the Illumina TotalPrep RNA Amplification Kit (Life Technologies) following manufacturer’s instructions. In brief, 200 ng of RNA were used to prepare double-stranded cDNA using a T7 oligonucleotide (dT) primer. Second-strand synthesis was followed by in vitro transcription in the presence of biotinylated nucleotides. cRNA samples were hybridized to the Illumina BeadChips Mouse WG-6v2.0 arrays. These BeadChips contain 45,281 unique 50-mer oligonucleotides in total, with hybridization to each probe assessed at 30 different beads on average. A total of 26,822 probes (59%) are targeted at RefSeq transcripts, and the remaining 18,459 (41%) are for other transcripts. BeadChips were scanned on the Illumina iScan scanner using Illumina BeadScan image data acquisition software (version 2.3). Data were then normalized using the ‘normalize quantiles’ function in the GenomeStudio Software (version 1.9.0). Subsequent analyses were performed using Genespring software (version 13.0-GX). For Brg1, we used a previously published transcriptome data set, in which loss of Brg1 function was obtained by genetic ablation (ref. 18).
All array analyses were undertaken using the Limma package from the R/Bioconductor software (R-Development-Core-Team, 2007). Microarray spot intensities were normalized using the RMA method as implemented in the R affy package. Normalized measures served to compute the log ratios for each gene between the wild-type strain and the Brg1 knockout mutant. Then, to identify genes with a log ratio significantly different between the mutant and wild-type strain, P values were calculated for each gene using a moderated t-test. The moderated t-test applied here was based on an empirical Bayes analysis and was equivalent to shrinkage (or expansion) of the estimated sample variances towards a pooled estimate, resulting in a more stable inference. Finally, adjusted P values were calculated using the false discovery rate (FDR)-controlling procedure of Benjamini and Hochberg. We identified deregulated genes using the thresholds of 0.05 for the P value, and 1.5 for the fold change (FC 1.5). This FC 1.5 threshold was chosen based on a previous study on Brg1 (ref. 18), and also because it was compatible with the analysis of the remodellers more modestly involved in transcriptional control in ES cells such as Chd1, Chd6 and Chd8. Note that seemingly modest fold changes might arise from many sources including a response lag, residual remodelling activity, and relatively high experimental background. Using an FC 2 threshold, we could, however, confirm that Ep400, Chd4 and Brg1 are important transcriptional regulators in ES cells, with 535, 293 and 570 genes deregulated, respectively. This level of deregulation is indicative of a context-specific function of remodellers in transcriptional activation or repression, which is distinct from the function of general transcription factors, whose depletion is expected to affect most genes.
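The Benjamini and Hochberg adjustment and the two thresholds described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the Limma workflow used in the study: it assumes that the 0.05 cut-off is applied to the adjusted P values and that log ratios are expressed in base 2, neither of which is stated explicitly in the text. The function names and example values are hypothetical.

```python
import math


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


def deregulated(genes, log2_ratios, pvalues, alpha=0.05, fold_change=1.5):
    """Select genes passing both the adjusted-P and fold-change thresholds.

    Assumption: log ratios are log2(mutant / wild type).
    """
    adjusted = benjamini_hochberg(pvalues)
    cutoff = math.log2(fold_change)
    return [
        gene
        for gene, ratio, q in zip(genes, log2_ratios, adjusted)
        if q < alpha and abs(ratio) >= cutoff
    ]


# Hypothetical example with four genes.
print(deregulated(
    ["geneA", "geneB", "geneC", "geneD"],
    [1.0, 2.0, 0.1, 3.0],
    [0.01, 0.04, 0.03, 0.20],
))  # ['geneA']
```

Passing fold_change=2 reproduces the stricter FC 2 filter mentioned above.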
Statistical analysis of the differences in transcriptional activation and repression by remodellers was performed using a two-sample test for equality of proportions with continuity correction. For the generation of GC-content-based lists of promoters, we used the list of promoters defined in figure 3 of ref. 15, which we crossed with the 14,623 promoter list, to obtain a list of 6,317 promoters rank ordered according to GC content. In Fig. 3b, we compared the percentages of genes either down- or upregulated by loss of function of each remodeller in the following two groups: (1) NFR length classes: genes from the narrow and wide NFR classes shown in Fig. 2a were each further divided into two subclasses, which resulted in the following four categories: narrow NFR subclass 1 (NFR < 15 bp), narrow NFR subclass 2 (15–115 bp NFR), wide NFR subclass 1 (116–504 bp) and wide NFR subclass 2 (505–1,500 bp). Genes in these groups were further subdivided into H3K4me3 and bivalent subgroups. (2) GC content classes: genes were divided into four quartiles based on GC content at promoters and further subdivided into H3K4me3 and bivalent subclasses. The number of genes analysed in Fig. 3b is indicated in brackets for the following subgroups. H3K4me3 genes: narrow NFR subclass 1 (739), subclass 2 (1,829), wide NFR subclass 1 (2,613), subclass 2 (1,253), GC content quartile 1 (low GC content) (450), quartile 2 (719), quartile 3 (644), quartile 4 (high GC content) (430). Bivalent genes: narrow NFR subclass 1 (271), subclass 2 (866), wide NFR subclass 1 (2,266), subclass 2 (1,184), GC content quartile 1 (220), quartile 2 (485), quartile 3 (750) and quartile 4 (1149). FAIRE was performed as described31 with modifications. 46C ES cells were amplified as described above for RNA preparation. Formaldehyde was added directly to the growth media (final concentration 1%), and cells were fixed for 5 min at room temperature. 
After quenching with glycine (125 mM) and several washes, cells were collected, resuspended in 500 μl of cold lysis buffer (2% Triton X-100, 1% SDS, 100 mM NaCl, 10 mM Tris-HCl, pH 8.0 and 1 mM EDTA) and disrupted using glass beads for five 1-min sessions with 2-min incubations on ice between disruption sessions. Samples were then sonicated for 16 sessions of 1 min (30 s on/30 s off) using a bioruptor (Diagenode) at maximum intensity, at 4 °C. After centrifugation, the supernatant was extracted twice with phenol–chloroform. The aqueous fractions were collected and pooled, and a final phenol–chloroform extraction was performed before DNA precipitation. FAIRE experiments were performed in triplicate, using independent ES cell cultures. Before sequencing, FAIRE DNA was analysed and quantified by running 1/25 of the FAIRE material on a high sensitivity DNA chip on a 2100 Bioanalyzer (Agilent, USA). Approximately 20 ng of FAIRE DNA was used for library preparation according to manufacturer’s instructions using the ChIP-seq sample preparation kit (Illumina). Single-read sequencing (36 bp) was performed on a Genome Analyzer II (Illumina).

ES cells were grown and transfected with shRNA vectors as described for RNA analysis. Biological replicates were obtained by performing two independent transfection experiments for each shRNA vector. ATAC-seq libraries were constructed by adapting a published protocol (ref. 20). In brief, 50,000 cells were collected, washed with cold PBS and resuspended in 50 μl of ES buffer (10 mM Tris, pH 7.4, 10 mM NaCl, 3 mM MgCl2). Permeabilized cells were resuspended in 50 μl transposase reaction (1× tagmentation buffer, 1.0–1.5 μl Tn5 transposase enzyme (Illumina)) and incubated for 30 min at 37 °C. Subsequent steps of the protocol were performed as previously described (ref. 20). Libraries were purified using a Qiagen MinElute kit and Ampure XP magnetic beads (1:1.6 ratio) to remove remaining adapters.
Libraries were checked using a 2100 Bioanalyzer, and an aliquot of each library was sequenced at low depth on a MiSeq platform to check the duplicate level and estimate DNA concentration. Each library was then paired-end sequenced (2 × 100 bp) on a HiSeq instrument (Illumina). As ATAC-seq libraries are composed in large part of short genomic DNA fragments, reads were cropped to 50 bp using trimmomatic-0.32 to optimize paired-end alignment. Reads were aligned to the mouse genome (mm9) using Bowtie with the parameters -m 1 --best --strata -X 2000, with two mismatches permitted in the seed (default value). The -X 2000 option allows fragments of up to 2 kb to align, and the -m 1 option keeps only uniquely aligning reads. Duplicated reads were removed with picard-tools-1.85. To perform differential analysis, libraries were adjusted to 33 million aligned reads using samtools-1.2 and by making a random permutation of the initial input libraries (Linux shuf command). Adjusted BAM data sets were next converted to BED. We used the seqMINER platform with the lists of 6,481 H3K4me3-only and 3,411 bivalent genes described above, to collect tag densities from ATAC-seq data sets, in a window of −2 kb/+2 kb around the TSS. Output tag density files were analysed using R software to establish the average ATAC-seq signal profiles shown in Extended Data Fig. 8.

ES cells were grown and transfected with shRNA vectors as described above. Biological replicates were obtained by performing two independent transfection experiments for each shRNA vector. For each experiment, 1 million cells were fixed for 10 min in ES cell culture medium containing 1% formaldehyde, quenched with glycine (125 mM), washed with PBS buffer, collected in 175 μl of solution I (15 mM Tris-HCl, pH 7.5, 0.3 M sucrose, 60 mM KCl, 15 mM NaCl, 5 mM MgCl2 and 0.1 mM EGTA), and stored on ice. Cells were permeabilized by adding 175 μl of solution II (solution I with 0.8% Igepal CA-630 (Sigma)) and incubating for 15 min on ice.
We next added 700 μl of MNase digestion buffer (50 mM Tris-HCl, pH 7.5, 0.3 M sucrose, 15 mM KCl, 60 mM NaCl, 4 mM MgCl2 and 2 mM CaCl2), 4 U of MNase, and incubated for 10 min at 37 °C. MNase digestion was stopped by adding 10 mM EDTA (final concentration), and storing on ice. Cells were then disrupted by 15 passages through a 25 G needle, followed by a 10 min centrifugation at 18,000g. The supernatant was collected and incubated for 1 h at 65 °C with 15 μg of RNase A. We next added 10 μg of proteinase K, adjusted each sample to 0.1% SDS (final concentration) and incubated for 2 h at 55 °C. NaCl concentration was then adjusted to 200 mM and the samples were incubated overnight at 65 °C for crosslink reversal. DNA was purified from each sample by phenol–chloroform extraction followed by ethanol precipitation. Purified DNA (20 ng) was used for library preparation according to manufacturer’s instructions, using the Ultralow ovation library system (Nugen). Following end-repair and adaptor ligation, fragments were size-selected on an agarose gel in order to purify genomic DNA fragments between ~60 and 220 bp. Libraries were verified using a 2100 Bioanalyzer before clustering and paired-read sequencing. Sequencing of each sample was performed in a single lane of a HiSeq instrument (Illumina).

The midpoint of each paired-end sequencing read was used to represent the dyad location of each nucleosomal tag. We assumed that remodeller depletion has no bulk effect on nucleosome occupancy, hence the total reads of control and remodeller-depleted cells were adjusted to be the same. The adjusted tags were aligned to −1 nucleosome dyads (determined by the first MNase-defined peak upstream of annotated RefSeq TSS), or the first stable (MNase-defined) nucleosome dyad position downstream of the TSS for different NFR categories. These tags were further normalized to the number of genes involved in each NFR class.
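The two normalization steps described above (equalizing total read counts between control and remodeller-depleted libraries, then dividing by the number of genes in the class) amount to a simple rescaling. The sketch below is a minimal Python illustration; the function name, argument names and numbers are hypothetical, and the choice of the control library size as the common reference is an assumption.

```python
def normalize_tags(tag_counts, library_total, reference_total, n_genes):
    """Scale per-position tag counts to a common library size, then divide by
    the number of genes in the class."""
    if library_total <= 0 or n_genes <= 0:
        raise ValueError("library_total and n_genes must be positive")
    scale = reference_total / library_total
    return [count * scale / n_genes for count in tag_counts]


# Hypothetical example: control library of 40 million reads, depleted library
# of 50 million reads, and an NFR class containing 600 genes.
control = normalize_tags([120, 300, 90], 40_000_000, 40_000_000, 600)
depleted = normalize_tags([150, 240, 120], 50_000_000, 40_000_000, 600)
```

After this rescaling, profiles from the two conditions and from NFR classes of different sizes can be compared on the same axis.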
The normalized tags were then binned (5 bp) and smoothed (10-bin moving average) before plotting (Fig. 3c). Distances (bp) are indicated relative to these reference points. An x axis gap in the NFR was introduced to normalize variations in NFR length inside each class.

ES cells were grown and transfected with shRNA vectors as described above. Biological replicates were obtained by performing two independent transfection experiments for each shRNA vector. Following a 10 min fixation with 1% formaldehyde in ES cell culture medium, chromatin was prepared from 5–10 million cells and sonicated as described (ref. 32). ChIP-exo experiments were carried out essentially as described (ref. 33). This included an immunoprecipitation step using antibodies against Pol II (sc-899, Santa Cruz Biotechnology) attached to magnetic beads, followed by DNA polishing, A-tailing, Illumina adaptor ligation (ExA2), and lambda and recJ exonuclease digestion on the beads. After elution, a primer was annealed to ExA2 and extended with phi29 DNA polymerase, then A-tailed. A second Illumina adaptor was then ligated, and the products were PCR-amplified and gel-purified. Sequencing was performed using a NextSeq500.

Sequence tags were mapped to the mouse genome (mm9) using BWA-MEM (version 0.7.9a-r786; ref. 34), and only uniquely aligned tags were used for downstream analysis. The 5′ ends of mapped tags, representing exonuclease stop sites, were consolidated into peak calls (sigma = 5, exclusion = 20) using GeneTrack (ref. 26), and peak pairs were matched when found on opposite strands and 0–100 bp apart in the 3′ direction. Tags were globally shifted to the median value of half distance between all peak pairs. These globally shifted tags were then aligned relative to the annotated RefSeq TSSs for H3K4me3-only and bivalent promoters separately, before remodeller-affected genes were further selected.
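The peak-pair matching and global shift described above can be illustrated with a short Python sketch. It is not the GeneTrack implementation: the nearest-downstream matching rule, the function names and the example coordinates are assumptions made for illustration, and only the 0–100 bp window and the use of the median half distance come from the text.

```python
import statistics


def match_peak_pairs(forward_peaks, reverse_peaks, max_gap=100):
    """Pair each forward-strand peak with the nearest reverse-strand peak
    located 0..max_gap bp downstream (3' direction)."""
    reverse_sorted = sorted(reverse_peaks)
    pairs = []
    for fwd in sorted(forward_peaks):
        candidates = [rev for rev in reverse_sorted if 0 <= rev - fwd <= max_gap]
        if candidates:
            pairs.append((fwd, candidates[0]))
    return pairs


def global_shift(pairs):
    """Median of the half distances between matched peaks."""
    return statistics.median((rev - fwd) / 2 for fwd, rev in pairs)


def shift_tags(tags, shift):
    """Move forward-strand tags downstream and reverse-strand tags upstream
    by the same amount. Tags are (position, strand) tuples."""
    return [pos + shift if strand == "+" else pos - shift for pos, strand in tags]


# Hypothetical peak coordinates on one chromosome.
pairs = match_peak_pairs([100, 500, 900], [140, 560, 1200])
shift = global_shift(pairs)
print(pairs, shift)  # [(100, 140), (500, 560)] 25.0
print(shift_tags([(100, "+"), (140, "-")], shift))  # [125.0, 115.0]
```

Shifting both strands towards each other by the same amount brings the two exonuclease borders of a bound complex closer to a single coordinate before tags are aligned to the TSS.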
We assumed that remodeller depletion caused no bulk change in Pol II occupancy, and hence total tags among wild type and all remodeller mutants were normalized to be the same. To make direct comparison between different gene groups, we further normalized tags to the number of genes within the group. These normalized tags were then smoothed (5 bp binned before 10-bin moving average) before plotting (Extended Data Fig. 9a). To examine Pol II occupancy change in remodeller mutants among different promoter groups, we first calculated total Pol II occupancy by summing up tags from transcript start to end sites (annotated RefSeq TSS and TES, respectively; ref. 24) for the tested genes. Change in Pol II occupancy was calculated by dividing the total Pol II occupancy of mutant by that of wild type before log transformation and bar graph plotting (Extended Data Fig. 9b).

Genes were rank-ordered according to reads per kb of transcript per million mapped reads (RPKM) and divided into four quartiles (highest: Q4, second: Q3, third: Q2 and lowest: Q1). Operating with the k-means clustering function of seqMINER, genes in each quartile were further subdivided into H3K4me3-only and bivalent genes, as described above. Using these lists of genes, tag densities from remodeller ChIP-seq data sets were collected in a window of −2 kb/+2 kb around the TSS, except for Chd2, for which densities were collected from the TSS until +4 kb. Output tag density files were first analysed using R software to establish average binding profiles. Statistical comparisons were performed between remodeller distributions at H3K4me3 promoters, to assess a significant increasing trend among distributions. Differences between successive pairs of quartiles (Q4 − Q3, Q3 − Q2 and Q2 − Q1) were compared against a null distribution using a one-sided t-test. The respective P values are reported for each remodeller:

Chd1, Q4 − Q3 P = 1.371138 × 10^−27; Q3 − Q2 P = 1.728126 × 10^−16; Q2 − Q1 P = 7.985217 × 10^−23.
Chd2, Q4 − Q3 P = 7.543473 × 10^−33; Q3 − Q2 P = 1.115223 × 10^−25; Q2 − Q1 P = 3.283427 × 10^−38.
Chd4, Q4 − Q3 P = 0.2094255; Q3 − Q2 P = 0.1081455; Q2 − Q1 P = 0.07202865.
Chd6, Q4 − Q3 P = 0.4168748; Q3 − Q2 P = 0.1534144; Q2 − Q1 P = 0.01138035.
Chd8, Q4 − Q3 P = 4.031959 × 10^−15; Q3 − Q2 P = 1.231527 × 10^−6; Q2 − Q1 P = 1.34455 × 10^−9.
Chd9, Q4 − Q3 P = 9.484578 × 10^−44; Q3 − Q2 P = 1.059783 × 10^−14; Q2 − Q1 P = 4.646352 × 10^−28.
Ep400, Q4 − Q3 P = 3.046796 × 10^−20; Q3 − Q2 P = 1.215304 × 10^−14; Q2 − Q1 P = 6.462667 × 10^−11.
Brg1, Q4 − Q3 P = 3.512021 × 10^−24; Q3 − Q2 P = 2.515217 × 10^−7; Q2 − Q1 P = 0.977422.

We concluded from this analysis that Chd1, Chd2, Chd9 and Ep400 binding at promoters is tightly linked to gene expression level. By contrast, Brg1, Chd4 and Chd6 deposition showed little correlation with gene expression level (the test was not significant for at least one comparison for these remodellers). Although statistical analysis of the Chd8 distributions indicated significant differences between quartiles, inspection of the distributions in Extended Data Fig. 3 showed that the Chd8 binding profile was intermediate between these two categories.
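The quartile comparison can be sketched as follows. The text does not specify exactly which values entered the one-sided t-test, so this Python example makes an assumption: it treats the position-wise differences between the average profiles of two successive quartiles as the sample and tests whether their mean exceeds zero. Only the t statistic and degrees of freedom are computed; the P value would be the upper tail of the t distribution, obtained from a statistics library. All names and numbers are hypothetical.

```python
import math
import statistics


def split_into_quartiles(rpkm_by_gene):
    """Rank genes by RPKM and split them into Q1 (lowest) to Q4 (highest)."""
    ranked = sorted(rpkm_by_gene, key=rpkm_by_gene.get)
    n = len(ranked)
    bounds = [i * n // 4 for i in range(5)]
    return {f"Q{i + 1}": ranked[bounds[i]:bounds[i + 1]] for i in range(4)}


def one_sided_t_statistic(higher, lower):
    """t statistic and degrees of freedom for H1: mean(higher - lower) > 0,
    computed on paired values (for example, two average binding profiles)."""
    diffs = [h - l for h, l in zip(higher, lower)]
    n = len(diffs)
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance")
    return statistics.mean(diffs) / (sd / math.sqrt(n)), n - 1


# Hypothetical RPKM values for eight genes.
quartiles = split_into_quartiles({
    "g1": 0.5, "g2": 12.0, "g3": 3.0, "g4": 40.0,
    "g5": 1.0, "g6": 7.0, "g7": 25.0, "g8": 0.1,
})

# Hypothetical average profiles for Q4 and Q3 at four positions.
t_stat, dof = one_sided_t_statistic([5.0, 6.0, 7.0, 8.0], [4.0, 4.0, 5.0, 5.0])
```

A large positive t statistic for each of Q4 − Q3, Q3 − Q2 and Q2 − Q1 corresponds to the increasing trend reported above for Chd1, Chd2, Chd9 and Ep400.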
