Showing posts with label big data. Show all posts

Friday, January 29, 2016

Internet connected inhalers: Technology to watch

From Reuters:
Novartis wants every puff of its emphysema drug Onbrez to go into the cloud.
The Swiss drugmaker has teamed up with U.S. technology firm Qualcomm to develop an internet-connected inhaler that can send information about how often it is used to remote computer servers known as the cloud.

This kind of new medical technology is designed to allow patients to keep track of their drug usage on their smartphones or tablets and for their doctors to instantly access the data over the web to monitor their condition.

It also creates a host of "Big Data" opportunities for the companies involved - with huge amounts of information about a medical condition and the efficacy of a drug or device being wirelessly transmitted to a database from potentially thousands, even millions, of patients.
This technology has amazing potential. If you have an idea regarding how much this device would cost, send me a message on Twitter.

Presumably a pricier inhaler wouldn't be disposable like the current plastic devices; it would be a reusable device that simply accepts replacement Onbrez cartridges as new prescriptions are filled.
In this case, the inhaler cost becomes less relevant as it's amortized over the life of the patient's disease (long) versus the life of the patient's prescription fill (short).

Since it's internet-connected, it would presumably be easy to add features like a reminder at the next dose (e.g. a sound or LED), automatic prescription refills, etc.
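As a toy illustration of how simple that reminder logic could be, here's a sketch in Python. The 24-hour interval (I believe Onbrez is once-daily) and all function names are my assumptions, not anything from the actual Novartis/Qualcomm device:

```python
from datetime import datetime, timedelta

DOSE_INTERVAL = timedelta(hours=24)  # assuming a once-daily inhaler

def next_dose_due(last_dose, interval=DOSE_INTERVAL):
    """Time at which the next reminder should fire."""
    return last_dose + interval

def reminder_needed(last_dose, now):
    """True when the device should sound or flash its reminder."""
    return now >= next_dose_due(last_dose)

last = datetime(2016, 1, 28, 8, 0)
print(reminder_needed(last, datetime(2016, 1, 29, 8, 30)))  # dose overdue -> True
```

On a connected device, the same timestamps going into this check would also be the ones uploaded to the cloud.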

All in all, very nice!

Monday, January 4, 2016

Why academia has a data sharing problem

Martin Bobrow, Chair of the Wellcome Trust's advisory group on data access, submitted an enlightening summary of data sharing problems in Nature, where he asked:
Most research-funding agencies, and most scientists, now agree that research data should be shared — provided that those who donate their data and samples are protected. This approach is strongly advocated by organizations such as the Global Alliance for Genomics and Health. But data sharing will work well only when it is streamlined, efficient and fair. How can more scientists be encouraged and helped to make their data available, without adding an undue administrative burden?
I think the burden he's addressing is actually split into at least two parts:

1. The burden of actually sharing data. This is what usually comes to mind when people think of data sharing being difficult, and it involves hammering down infrastructure and data formats to enable sharing.

2. The burden created by actually making data available. Being the 'owner' of data brings both the opportunity for first crack at investigating that data and the responsibility to share it. Sharing imposes a real cost, both in serving the people who want access to the data and in storing it (though both costs are continually falling).

Thinking realistically, there's an actual disincentive to share academically generated data. Sharing data essentially gives potential competitors 'your data' at no cost, which may vaporize whatever competitive scientific advantage you may have gained.

Further on in the article, Bobrow offers this explanation:
It is reasonable for scientists to impose certain conditions or restrictions on the use of their hard-earned data sets, but these should be proportionate and kept to a minimum. Justifiable conditions can range from requiring secondary users to acknowledge the source of the data in publications, to stipulating a fair embargo time on the use of new data releases. Whatever the conditions imposed, they need to be presented clearly to data users.

Criteria used to judge academic careers still focus heavily on individual publication records and provide little incentive for wider data sharing. Scientists who let others use their data deserve reward too.
So yes, the issue with academic data sharing is incentive.

People who put together well-designed data sets should be rewarded for their expertise and talents in doing so. Good data isn't as simple as sending a box of samples to a [insert your favourite high-throughput technology] production center; it requires knowledge of what constitutes 'normal' samples and of experimental design, not to mention actually handling the logistics of obtaining the right samples in the first place.

Why wouldn't someone deserve credit for that?

Wednesday, October 7, 2015

Severin Schwan on Big Data and Big Pharma

Severin Schwan, CEO of Roche:
Top tech companies like Google, IBM, SAP and their ilk are all obviously interested in healthcare, Schwan told Japan's Nikkei news service. And those companies are experts at digitizing and analyzing data; they have the tools and algorithms necessary to make sense of mountains of information. "But what they miss is the medical knowledge, the understanding of biology," Schwan pointed out. "They can't ask the right questions. They can program, but they don't know what to program."
This reminds me of a time when a software-developer-turned-bioinformatics-guy asked me about some RNA-seq data and was confused by the fact that a lot of genes weren't being used: "Why do cells code for all these genes when they're not using most of them?" he asked.

To which I replied: "That's because many different kinds of cells have to use the same genome. You don't run every command in each R package you load, do you?"
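The point of that exchange (one genome, many expression programs) can be made concrete with a toy expression table. The genes, cell types, and counts below are invented for illustration:

```python
# Toy expression matrix: rows = genes, columns = cell types.
# Counts are invented; real RNA-seq data would have ~20,000 gene rows.
expression = {
    "HBB":    {"red_blood_precursor": 5000, "neuron": 0,   "hepatocyte": 0},
    "ALB":    {"red_blood_precursor": 0,    "neuron": 0,   "hepatocyte": 900},
    "SNAP25": {"red_blood_precursor": 0,    "neuron": 450, "hepatocyte": 0},
    "ACTB":   {"red_blood_precursor": 800,  "neuron": 700, "hepatocyte": 650},
}

def fraction_expressed(expression, cell_type, threshold=1):
    """Fraction of genes with counts at or above threshold in one cell type."""
    expressed = sum(1 for counts in expression.values()
                    if counts[cell_type] >= threshold)
    return expressed / len(expression)

for cell in ("red_blood_precursor", "neuron", "hepatocyte"):
    # Every cell type carries all four genes but expresses only a subset.
    print(cell, fraction_expressed(expression, cell))
```

Same dictionary (genome) for every column, different nonzero entries per column: no single cell type uses everything it carries.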

Monday, July 20, 2015

AncestryHealth: Another company moves into consumer genomics

Sarah Buhr, at TechCrunch:
Family history site Ancestry launched a new generational health database called AncestryHealth today. The news comes right as AncestryDNA – Ancestry’s genetics site that connects those on the platform with distant relatives – announced it now has genotyped more than a million customers.

Ancestry.com launched in the early 80’s and went public in 2009. It is now the largest genealogical database in the world, holding more than 16 billion family history records from all over the world and more than 70 million user created family trees.

The company saw an opportunity in consumer genetic testing similar to 23andMe three years ago and launched AncestryDNA as a subsidiary of Ancestry.com. Ancestry’s patented algorithm began matching users to relatives as well as DNA matches to ancestors as far back as the 1700’s. 
By applying genotyping chips to genealogy, AncestryDNA has put itself in a good niche vis-à-vis 23andMe. Not only is it potentially much easier to link distant relatives using SNP chip data than it would be to, say, predict users' health risks from the same information, it's also very likely legally safer. I doubt there are many (any?) regulatory pitfalls in telling people that they're potentially related to someone who's their second cousin once removed.

Despite the health spin, I'm not sure how valuable correlating health information between really distant individuals in your family tree will be, except in odd cases of familial diseases, like some cancers. However, where I think AncestryHealth might actually prove very useful is for users who don't have tight relationships with family members and would like to know the health problems of their grandparents' generation, though that would still depend on someone related to you providing information to AncestryHealth.

In terms of information, the data scientists at organizations like AncestryDNA must be thrilled to wade through a treasure trove of a million genotypes, relatively unrestricted by data access rules in the public sector.  This data aggregation model is essentially similar to 23andMe's: Collect your data first, then decide how to monetize it.

When monetization comes into play, it's easy to put a pessimistic spin on big data collection. DNAeXplained points out that they didn't receive much information beyond what was input, and asks: "What did Ancestry get? Health, ethnicity and lifestyle information for you and your family to sell along with your DNA information, if you signed the informed consent. If you don't sign the informed consent, your information can still be utilized, just without your identity attached, per the verbiage in their terms and conditions, privacy statement and informed consent documents."

With feedback like this, I suspect AncestryHealth is going to work very, very hard to design some kind of user report that avoids leaving people with the impression that they've just given something precious away and received little in return.

Finally, customer experiences aside, I'll leave you with this thought.  I tend to side with Ken Chahine, SVP at Ancestry.com, on this one:
I began my career in the pharmaceutical and biotechnology industry where it progressed from bench scientist to CEO. During those years I experienced first-hand the inefficiencies and frustration of bringing medicines to market. Simply put, the healthcare industry makes data collection and sharing difficult [Emphasis mine]. The lack of data was one of the two primary reasons that brought me to my current position where I'm helping bring personal genomics to all through direct-to-consumer tests based on the newest breakthroughs in science and technology.

Friday, June 19, 2015

Genomics England picks these smaller names to crunch UK100K Genomes Project data

The most stunning news is delivered in the opener of Nick Paul Taylor's report on FierceBiotechIT:
Genomics England has named the four companies it wants to work with on the interpretation of the first 8,000 genomes from its massive sequencing effort. The list of successful bidders is lacking some big-name applicants, notably Illumina which Genomics England asked not to tender for the contract.
Though it's shocking that Illumina's BaseSpace wasn't a contender, Genomics England's news release makes no mention of Illumina being asked not to tender; rather, Illumina was simply not asked to tender, a much less damning conclusion than the one implied above, namely that Illumina's analyses were so bad that no one in the UK ever wanted to see them again. Sadly, there was no mention of DNAnexus or Ingenuity either, but I suspect they were in the running as well.

But that's beside the point. Starting in August, Omicia, NantHealth, WuXi NextCODE, and Congenica will each provide reports on 2,500 patients from within Genomics England's data centers. Some of these shortlisted companies are not too surprising: Omicia, for example, is the primary licensor of the VAAST mutation analysis software, which is pretty good at analyzing the family mutation patterns I expect the UK100K Genomes Project contains a ton of, while Congenica is partnered with Genomics England to begin with, making them a natural fit.

WuXi NextCODE is a spinout of deCODE genetics (which was a hot company at one point) and has become an interesting arm of WuXi AppTec, a large Chinese CRO that's listed on the NYSE with a $3B market cap. I'd like to dig deeper into their business models in future posts.

However, the most enigmatic company of the set is NantHealth. This company, led by Patrick Soon-Shiong, has been trying to make a huge splash in the genomics market, mostly by building systems for hospitals to crunch genomic big data and present treatment propositions for patients, according to Matthew Herper at Forbes. This is the same Dr. Soon-Shiong who brought the world Abraxane, so he has credibility (which I admit was slightly reduced when I saw him holding a Circos plot on a BlackBerry Passport). If you read other stories by Herper, you get the sense that there's a good dose of hyperbole coming from Soon-Shiong and NantHealth, so how that translates into results with the UK 100K Genomes Project is up in the air.

We'll have to wait until next year when the four companies complete the pilot phase of the study, and hopefully one will be an obvious winner to crunch the rest of the data.

Thursday, July 3, 2014

More Data Doesn't Mean More Interesting Data

David Beer, at Adaptive Computing, writes:
One of the keys to winning at Big Data will be ignoring the noise. As the amount of data increases exponentially, the amount of interesting data doesn’t.
He describes the problem of predicting which online video a user is going to watch next, and how an analysis can quickly run the number of predictions up into thousands of possible 'next steps' to evaluate.
These are then compared with all of the other empirical data from all other customers to determine the likelihood that you might also want to watch the sequel, other work by the director, other work from the stars in the movie, things from the same genre, etc. As I perform these calculations, how much data should be ignored? How many people aren’t using the multiple user profiles and therefore don’t represent what one person’s interests might be? How many data points aren’t related to other data points and therefore shouldn’t be evaluated as a valid permutation the same as another point?
These points are probably where an experienced scientist provides the biggest value on data problems of this scale. This kind of person has at least several years of work experience in a hypothesis-driven research environment and is able to solve problems using incomplete data. They probably have a PhD to go with that quantitative experience.

The first point, working in a hypothesis-driven environment, means that person should be able to devise a strategy to prove or disprove a hypothesis (I hypothesize that this customer will watch video Y after video X) and figure out how to do so efficiently without getting stuck in the weeds, or in the irrelevant data Beer describes. Unfortunately, it takes some skill to interview a person before you can determine whether they can actually do this, especially if there are differences in background between yourself and the interviewee.

The second point, being able to use incomplete data, is something that seems to come with experience. Most people trained in research fields start off trying to collect as much data as possible, and don't make a decision until 'more data is collected'. It's easy to get stuck in a data collection rut, but eventually most people realize that it's actually OK to come to a conclusion before seeing the whole picture.

Collecting a lot of extra data costs time and resources, and puts a demand on your attention span until that elusive point of having 'enough data' is reached. Sometimes that data is worth it, but many times it's not. It just sits there because no one has time to do anything with it, so the data remains idle and risks becoming stale. Unless it's actually your job to do so, be careful of making data for the sake of making data.
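Beer's 'ignore the noise' advice boils down to pruning the candidate set before any expensive modelling happens. A minimal sketch, with thresholds and record fields that are mine rather than his:

```python
def prune_candidates(candidates, min_support=50, min_score=0.05):
    """Drop 'next video' candidates backed by too little data or too weak a signal."""
    return [c for c in candidates
            if c["views"] >= min_support and c["score"] >= min_score]

# Invented candidates: score = estimated watch probability,
# views = number of data points supporting that estimate.
candidates = [
    {"title": "sequel",        "score": 0.40, "views": 12000},
    {"title": "same director", "score": 0.08, "views": 3000},
    {"title": "same genre",    "score": 0.04, "views": 90000},  # weak signal: ignore
    {"title": "obscure short", "score": 0.30, "views": 7},      # too little data: ignore
]
print(prune_candidates(candidates))
```

Note that the 'same genre' candidate has by far the most supporting data yet still gets dropped: more data points didn't make it more interesting, which is exactly Beer's point.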
 
ASIDE: One of the neatest things I find about the customer analytics field (as compared with genomics or computational biology) is that the data is basically generated by the study population itself, essentially for free.

Monday, December 2, 2013

Christophe Lambert of Golden Helix, on the Utility of Electronic Health Records

Christophe Lambert, Chairman of Golden Helix Inc., recently gave a great lecture on the many ways the health system can be improved using different approaches to analyzing the wealth of information in health records. He captures the general idea at around 41:00:
I've done some interesting work on a project with Medco and Golden Helix, where we were looking at millions of patient records for drug safety and efficacy.

The end game that we were envisioning was: if you're sick, it would be great to look at tens of millions of records to find patients who were similar to you, anonymously, and then find out, evidence-based, what courses of treatment led to the best outcomes, and have a set of possibilities to present to a doctor as: here are the many outcomes of various drugs, various treatments, and so forth.
Check it out on his Vimeo channel.

Monday, October 21, 2013

Big Data in Biology: Too Much to Handle

Here's a timely article on how biologists are handling the information overload that's come with (relatively) inexpensive sequencing hardware:
Researchers need more computing power and more efficient ways to move their data around. Hard drives, often sent via postal mail, are still often the easiest solution to transporting data, and some argue that it’s cheaper to store biological samples than to sequence them and store the resulting data. Though the cost of sequencing technology has fallen fast enough for individual labs to own their own machines, the concomitant price of processing power and storage has not followed suit. “The cost of computing is threatening to become a limiting factor in biological research,” said Folker Meyer, a computational biologist at Argonne National Laboratory in Illinois, who estimates that computing costs ten times more than research. “That’s a complete reversal of what it used to be.”
Exactly.

Monday, May 27, 2013

"Data Science": Just a Buzzword?

Fred Ross writes:
A good hunk of the data science certificate gets taught to physics majors in one semester of their second or third year at University of Virginia as “Fundamentals of scientific computing”. I single out University of Virginia’s class as an example because I happened to be there when it started in 2005, and remember talking about what should be in it with Bob Hirosky, its creator. My friends were the teaching assistants.

And the topics in these certificates are the basics, not the advanced material. Not that there aren’t legions of professional analysts out there with less statistical skill and no knowledge of programming, but no one would dream of giving them a title other than “Excel grunt”—sure, gussied up somehow to stroke their ego, but that’s what it comes down to.
Here's the original NY Times article Ross references, which describes "data science" as a hot new field.  There's some truth to Ross' claim that journalists have pumped up data and science as the new big thing to report on.

I don't know about you, but data and science go hand in hand. There aren't any scientists (that I know of) who don't work with data. But more fundamentally, you can't expect to understand what to do in science without the ability to analyze data in ways more abstract than the default outputs of algorithms or R vignettes, and you can't imagine what to do with your data without understanding even a little about the science that went into generating it.

A data science certificate is a small step on the way towards becoming a "Data Scientist", but it won't magically convert you into one of these folks.

Tuesday, March 19, 2013

On Science, Statistics, and Lamp-posts

"An unsophisticated forecaster uses statistics as a drunken man uses lamp-posts - for support rather than for illumination." - Andrew Lang
I first heard this quote a few weeks ago at AGBT, where it was used to describe an unsophisticated scientist for the sake of the audience.

The gist of the talk was to explore how high-throughput sequencing of RNA can help discover alternative forms of a gene that happens to be a drug target. This is particularly important as the alternative forms might bind the drug poorly, or not at all. The speaker emphasized that statistics, particularly when dealing with big data like genomic sequencing, should be designed and used very carefully, and that users need to be aware of what stats can tell you, what they can't, and when obsessing about the right statistics to use isn't worthwhile.

Most importantly, she criticized the use of stats to lend significance to an already obvious conclusion. If two observations are really that different, a t-test to hammer home the message with a P-value of 10^-167 is overkill, especially if it's part of a series of experiments that close the book on the statistical question in the next figure.
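To see how overkill that is, here's a minimal sketch with invented data: when two groups barely overlap, Welch's t statistic is so enormous that the resulting P-value adds nothing a simple plot wouldn't already show.

```python
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic: difference in means scaled by the standard error."""
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return (mean(a) - mean(b)) / se

# Two groups so far apart that no reader needs a P-value to see the difference.
control   = [1.0, 1.1, 0.9, 1.0, 1.05]
treatment = [100.2, 99.8, 100.1, 100.0, 99.9]
print(abs(welch_t(control, treatment)))  # a t statistic in the hundreds
```

The honest move here is to show the two distributions and save formal testing for comparisons where the answer isn't already obvious.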

Using stats for the wrong purposes can also raise a red flag for people reviewing your work. If the test you've used isn't appropriate for your data, it can mean one of two things: you weren't aware of the right test to use (which is OK), or you don't really care and your stats are just for show (which is not).

In the end, it's better to rely on statistics to help guide questions than to 'prove' something that's already self-evident in the data you're communicating.

Wednesday, March 13, 2013

A Very Thought-Provoking Interview with a Google Statistician

From an interview with Nick Chamandy, via Simply Statistics:
Grad school teaches us to spend weeks thinking about a problem and coming up with an elegant or novel methodology to solve it (sometimes without even looking at data). This process certainly has its place, but in some contexts a better outcome is to produce an unsophisticated but useful and data-driven answer, and then refine it further as needed. Sometimes the simple answer provides 80% of the benefit, and there is no reason to deprive the consumers of your method this short-term win while you optimize for the remaining 20%.
I wish more people doing academic research thought this way.  Striving for perfection is a habit that dies very hard.

Link

Friday, February 8, 2013

Big Data, insurance companies, and how both can improve people's health

Kat McGowan, interviewing Colin Hill, of GNS Healthcare:
Defining personalized medicine only in the context of genomics for drug discovery, patient stratification, and biomarker discovery is thinking too small, says Hill, a former computational physicist who founded GNS Healthcare, a health data analytics company, in 2000. We need to consider the whole universe of clinical information generated by randomized clinical trials, claims, electronic health care records, payers and providers—what’s being called the “data exhaust” of the digital healthcare era.
You have to read this article - it provides a small snippet of what big data sets can do to improve people's health.

Monday, January 28, 2013

Data scientists are in demand: but sexy?

The Harvard Business Review claims as much. 'Data scientist' seems like a catch-all marketable title that many quantitative researchers can fall under when speaking to someone outside of academia. As a computational biologist, I will vouch for HBR and tell you they definitely hit home with several observations in this article. Here's a bit of what 'data scientists' can do:
More than anything, what data scientists do is make discoveries while swimming in data. It’s their preferred method of navigating the world around them. At ease in the digital realm, they are able to bring structure to large quantities of formless data and make analysis possible. They identify rich data sources, join them with other, potentially incomplete data sources, and clean the resulting set.
Why they do it:
The data scientists we’ve spoken with say they want to build things, not just give advice to a decision maker. One described being a consultant as “the dead zone—all you get to do is tell someone else what the analyses say they should do.” By creating solutions that work, they can have more impact and leave their marks as pioneers of their profession.
And how they like to do things:
Data scientists don’t do well on a short leash. They should have the freedom to experiment and explore possibilities. That said, they need close relationships with the rest of the business. The most important ties for them to forge are with executives in charge of products and services rather than with people overseeing business functions. As the story of Jonathan Goldman illustrates, their greatest opportunity to add value is not in creating reports or presentations for senior executives but in innovating with customer-facing products and processes.
 Link

Wednesday, January 16, 2013

Big Data firm nets Big Bucks

Ayasdi, a company that has developed a suite of network-based analysis and visualization tools, has netted a big investment, writes the Guardian:
A US big data firm is set to establish algebraic topology as the gold standard of data science with the launch of the world's leading topological data analysis (TDA) platform.
Ayasdi, whose co-founders include renowned mathematics professor Gunnar Carlsson, launched today in Palo Alto, California, having secured $10.25m from investors including Khosla Ventures in the first round of funding.
The article cites health care, and specifically cancer research, as the primary beneficiary of topological data analysis, otherwise commonly known in biological circles as network analysis. Resources like Reactome, Cytoscape, and GeneMANIA are well known to researchers interested in biological networks.

Among other things, network-based approaches have also been used to redefine basketball player positions, moving the number of player types from five to thirteen. There's a good video from the Sloan Sports Conference on the approach here.



Sunday, January 13, 2013

Data-driven scientists are lazy

Ouch. So says petrkiel at R-bloggers in a commentary on Hans Rosling in the BBC's Joy of Stats. Maybe not as concisely as 'lazy', but 'failure of imagination' when one is expected to be imaginative is pretty damn close:

Data-driven scientists (data miners) ... believe that data can tell a story, that observation equals information, that the best way towards scientific progress is to collect data, visualize them and analyze them (data miners are not specific about what analyze means exactly). When you listen to Rosling carefully he sometimes makes data equivalent to statistics: a scientist collects statistics. He also claims that “if we can uncover the patterns in the data then we can understand“. I know this attitude: there are massive initiatives to mobilize data, integrate data, there are methods for data assimilation and data mining, and there is an enormous field of scientific data visualization. Data-driven scientists sometimes call themselves informaticians or data scientists. And they are all excited about big data: the larger is the number of observations (N) the better.
And the punchline:
Emphasizing data at the expense of hypothesis means that we ignore the actual thinking and we end up with trivial or arbitrary statements, spurious relationships emerging by chance, maybe even with plenty of publications, but with no real understanding. This is the ultimate and unfortunate fate of all data miners. I shall note that the opposite is similarly dangerous: Putting emphasis on hypotheses (the extreme case of hypothesis-driven science) can lead to lunatic abstractions disconnected from what we observe. Good science keeps in mind both the empirical observations (data) and theory (hypotheses, models).

Statistics is not Mathematics

Rafael Irizarry posted a thoughtful argument on why work focused on Statistics should live within its own department, separate from the Division of Mathematical Sciences at the National Science Foundation, under which Statistics currently falls.
Statistics is analogous to other disciplines that use mathematics as a fundamental language, like Physics, Engineering, and Computer Science. But like those disciplines, Statistics contributes separate and fundamental scientific knowledge. While the field of applied mathematics tries to explain the world with deterministic equations, Statistics takes a dramatically different approach. In highly complex systems, such as the weather, Mathematicians battle Laplace's demon and struggle to explain nature using mathematics derived from first principles. Statisticians accept that deterministic approaches are not always useful and instead develop and rely on random models. These two approaches are both important as demonstrated by the improvements in meteorological predictions achieved once data-driven statistical models were used to complement deterministic mathematical models.
Given the huge importance of statistics in genome sciences and other big data sciences, new tools need to be created by statisticians wholly dedicated to solving problems created by new technologies in the biosciences.
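The meteorology example can be caricatured in a few lines: a 'deterministic' forecast with a systematic bias, corrected by a statistic learned from observations. Toy numbers throughout, not a real model output statistics scheme:

```python
from statistics import mean

# Pretend first-principles model for tomorrow's temperature.
# It has a systematic bias we don't know from theory alone.
def physics_forecast(today_temp):
    return today_temp + 1.0

# Observed history: (today's temperature, actual temperature tomorrow).
# The deterministic model runs roughly 2 degrees hot against these.
history = [(10.0, 9.2), (12.0, 11.1), (8.0, 6.9), (15.0, 14.0)]

# Statistical post-processing: estimate the model's average error from data.
bias = mean(physics_forecast(t) - actual for t, actual in history)

def corrected_forecast(today_temp):
    """Deterministic prediction, nudged by a data-driven bias estimate."""
    return physics_forecast(today_temp) - bias

print(round(corrected_forecast(11.0), 2))
```

Neither piece works as well alone: the physics carries the structure, and the statistics absorbs the error the first-principles model can't account for.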