Karen Ambrose is the database team lead at the Francis Crick Institute in St. Pancras, London. We caught up with her at the Percona Live 2019 conference in Amsterdam to understand the complexities involved in managing databases in a scientific setting. Karen has been with the Francis Crick for about five years. She has a background in bioinformatics, and it was during her Masters that she became interested in applying technology to gain better insights into scientific data.
Karen began her career at the Sanger Institute in Cambridge at the time when they were mapping the Human Genome, before moving on to the Francis Crick Institute. The Francis Crick Institute itself came about as a merger between several research organisations, including the National Institute of Medical Research (NIMR) and the London Research Institute (LRI).
Her first job was to migrate the data from the different databases across the various organisations: “We initially had a time frame I think of about nine months to a year to physically migrate and move into the Francis Crick. And so we have to migrate about 300 databases. But that was in a landscape where the groups weren’t entirely moving in one go. So you might have a group, which essentially will talk to a cluster of databases at one site. Half of that group is then moved into the Francis Crick, and the other half is staying in because they have to shut down their lab in order to move. And we’ve got to make that data available at the new site and the old site.”
What made it even more difficult was that it wasn’t just a set of databases assigned to one group that was moving; some of these databases were shared between five teams that were moving at different times. Karen describes the migration as shuffling chess pieces, during which she had to make sure they did not corrupt any data and that it remained accessible to the groups still working on it, with the least amount of downtime, if any.
It sounds like a herculean task, and given their strict deadlines would surely have required an army of database wranglers: “There’s four of us in the team, including me.”
“Over the years we’ve basically been building a scientific data mountain. Data doesn’t get smaller, it just seems to get more complex and large.”
The institute has about 1,500 people, including about 1,300 scientists and 200 operational staff. There are some 130 lab groups supported by about 18 to 20 Scientific Technology Platforms (STPs) that provide core services to the lab groups to enable them to further their science: “So things like structural biology, and electromagnetic microscopy, high throughput sequencing, scientific computing, of which the database team which I manage is part of. So we provide a core service to the rest of the Institute.”
“For us, it’s very much about the data that comes off these instruments”, Karen tells us. Besides making sure they provide the right platform to help scientists examine the raw data that comes off the machines, a major job for Karen and her team is to store the data efficiently: “We need to work out what can we contain within the storage that we have within the institute, and also what other strategies do we need to incorporate, in terms of maybe looking at cloud, to help us provide the scientific insights that a particular lab group requires.”
The first challenge, she tells us, is to manage and secure all the generated data: “If people generate data, they generally want to keep everything, because you never quite know when you might need it. But we can’t physically keep everything.” So her team works with the lab groups to identify the important data and separate it from the data that can be regenerated.
The next challenge is performance. While for some scientists throughput is not critical as long as they can access the data, for others performance is key: “We’re always looking how can we best design their database, how does their data need to be structured so that it will be performant.” Once again, Karen says, the answer comes out of discussions with the labs to understand what they need to achieve from the data.
The open source advantage
The Francis Crick Institute uses several types of databases. While for the business side of things they use Oracle or SQL Server, Karen tends to steer the science teams towards open source databases. The Institute uses relational databases like MySQL and Postgres, but is starting to explore NoSQL databases like MongoDB, Neo4j, Cassandra, and others. She’s particularly keen on investigating Neo4j because “it’s interesting in terms of how it graphs the relationships between data.”
Karen also likes working with open source databases because of their open development model: “If you come up with something, a new problem that you want to solve, it’s a lot easier to be able to talk to all the community to be able to come up with a solution. They’re always innovating, always pushing things forward. So you never feel like you’re always going to be confined by stagnant release process.”