Friday, May 31, 2013

Let's Just Let the "Valley of Death" Die

A few authors at The Guardian put a new spin on making science and innovation work by reframing its challenges in a positive light:
Let's do away with valley of death metaphors, which rest on flawed linear assumptions and reinforce the idea that linking science to policy is the task of heroic pioneers. It doesn't take a solitary, genius scientist or a brave, visionary policy-maker to cross the valley of death and come out the other side. It takes an entirely different kind of courage: to work in teams, to share accountability and to develop and maintain complex relationships with other people who have different training, expertise and interests. 
They also note that Sir Paul Nurse, recipient of the 2001 Nobel Prize in Medicine, argued for more support for "the best scientists", which would in turn solve problems like those associated with the "Valley of Death". But if those "best scientists" happen to be doing basic research that has nothing to do with work meant to be translated to the marketplace - the work that arguably has to cross the existing disconnect - throwing more funds at them will not solve this particular problem.

Monday, May 27, 2013

"Data Science": Just a Buzzword?

Fred Ross writes:
A good hunk of the data science certificate gets taught to physics majors in one semester of their second or third year at University of Virginia as “Fundamentals of scientific computing”. I single out University of Virginia’s class as an example because I happened to be there when it started in 2005, and remember talking about what should be in it with Bob Hirosky, its creator. My friends were the teaching assistants.

And the topics in these certificates are the basics, not the advanced material. Not that there aren’t legions of professional analysts out there with less statistical skill and no knowledge of programming, but no one would dream of giving them a title other than “Excel grunt”—sure, gussied up somehow to stroke their ego, but that’s what it comes down to.
Here's the original NY Times article Ross references, which describes "data science" as a hot new field. There's some truth to Ross's claim that journalists have pumped up data science as the next big thing to report on.

I don't know about you, but in my world data and science go hand in hand. There aren't any scientists (that I know of) who don't work with data. But more fundamentally, you can't expect to understand what to do in science without the ability to analyze data in ways more abstract than the default outputs of algorithms or R vignettes, and you can't know what to do with your data without understanding at least a little about the science that went into generating it.

A data science certificate is a small step on the way towards becoming a "Data Scientist", but it won't magically convert you into one of these folks.

Friday, May 24, 2013

An Argument Against Academics Hiring Academics

Not sure that I agree with Istvan Aranyosi's position:
[Academic] hiring decisions should be taken out of academic hands and given to managers who are able ruthlessly to apply genuine market principles when these are called for.
But the article still makes for a good read. 

From what I've observed, most new hires to academic departments come in with skills that complement those already in the department, so I'd argue against his opinion that academics tend to hire their "pet students and research projects".

Of course, if you show me the data on this I might change my mind!

Wednesday, May 22, 2013

Tips on Choosing the Right Statistical Tool

Here's a poster that provides a very quick overview of a resource that helps choose statistical tests for biological data, or most other data for that matter.
The basic BioStat Decision Tool, developed by two biologists at the University of Auckland, is available as a free web tool and also as a paid app for portable devices like iPads, which is an interesting decision on the part of the creators. It basically runs through a decision tree to help people choose which statistical test to use for their data, and it seems like a useful site to bookmark.
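To give a flavour of the decision-tree idea, here's a toy R function of my own - not the BioStat tool's actual logic, just a sketch of the kind of questions such a tool walks through when comparing two groups of measurements:

# Toy sketch only: suggest a test for comparing two groups of values.
suggest_test <- function(x, y, paired = FALSE) {
  # Shapiro-Wilk as a rough normality check on each group
  normal <- shapiro.test(x)$p.value > 0.05 && shapiro.test(y)$p.value > 0.05
  if (normal) {
    if (paired) "paired t-test" else "two-sample t-test"
  } else {
    if (paired) "Wilcoxon signed-rank test" else "Wilcoxon rank-sum (Mann-Whitney U) test"
  }
}

suggest_test(rnorm(20), rnorm(20))   # likely "two-sample t-test"
suggest_test(rexp(20), rexp(20))     # likely the rank-based alternative

The real tool asks about study design, data type, and more, but the principle is the same: a handful of yes/no questions that steer you away from, say, running a t-test on badly skewed data.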

What caught my attention was the poster's presentation of how bad statistical fumbling in the published literature is (thus making the BioStat Tool necessary):
Common errors found in half of published biomedical papers (1979–2003)
  • Failure to adjust or account for multiple comparisons
  • Reporting that a result is “significant” without conducting a statistical test
  • Use of statistical tests that assume a normal distribution on data that is skewed
  • Unlabelled or inappropriate error bars/measures of variability
  • Failure to describe the tests performed
Really? Granted, these statements are derived from a review of articles in Infection and Immunity, but I didn't think the error rate was that high. Maybe the rate is lower for other journals, or for more recent articles, but 50% is pretty shameful.
Finally, there's a call to action in the poster, too:
If we are to improve the quality of published biomedical papers, it is clear that we need a paradigm shift in the way biologists approach data presentation, statistics and data analysis.
I don't know about paradigm shift, but easy-to-use tools that guide people through the many options offered by the stats field are a step in the right direction. Some education and pushback from referees and editors would be even better.

Tuesday, May 21, 2013

3D Face Portraits Inferred from DNA

Smithsonian Mag writes about Heather Dewey-Hagborg's work in rendering faces from genomic information:
The 30-year-old PhD student, studying electronic arts at Rensselaer Polytechnic Institute in Troy, New York, extracts DNA from each piece of evidence she collects, focusing on specific genomic regions from her samples. She then sequences these regions and enters this data into a computer program, which churns out a model of the face of the person who left the hair, fingernail, cigarette or gum behind.

Image from Dewey-Hagborg's "Stranger Visions"

The comments on the Smithsonian article are plentiful but fall into a few categories, ranging from those firmly against the approach to many that doubt the program's accuracy. I'm not convinced, either, after seeing the comparison of Manu Sporny's 3D image (though the image of Kurt Anderson's 25-year-old equivalent is pretty good). Then again, it's portrayed as art, not a highly accurate system for generating faces from DNA.

I was also left wondering what data her method was based on, as the Smithsonian only mentioned a lab returning "about 400 base pair sequences" (Sanger sequencing?) and "about 40 or 50 different traits" that the artist is currently considering, which I took to be SNPs.

In a blog post last year, Dewey-Hagborg explained the method behind "Stranger Visions" in more detail:
I have worked with facial recognition algorithms in the past and one technique I had read about was the use of a morphable 3d facial model to attempt to recognize faces at weird angles. As I have previously turned systems like this, intended for surveillance, into systems for creativity, it naturally occurred to me that the same system could be used to generate faces. And if you can determine what correlates certain types of faces you can then generate faces with those characteristics using principal components analysis.
It turns out the research group behind the Basel Morphable Model realized this as well and after much digging I figured out how to use their matlab model as a starting point for many different types of parameters. They had already found the primary axes for gender, age and weight so the main parameter missing was ethnicity.
Further on in that article, she also explains how she expanded the training data set of images to encompass a more diverse set of ethnicities to better reflect the composition of the United States.  Nevertheless, it still leaves the following process in my mind:  DNA --> SNPs --> Ethnicity --> Facial Features.

Without seeing more of the code used to fit the models, I have to assume that the fairly subjective parameter of 'ethnicity' sits in the middle of the information flow. The method could be dramatically improved by eliminating 'ethnicity' from the workflow and using a bigger and better data set that correlates SNPs to facial features directly.

I'm not aware of any data sets that offer that, but I wouldn't be surprised to see it done in the next few years. As there are apparently only 14 different kinds of noses, deeper knowledge of genetics would yield a more objective predictor of facial features.
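To make concrete what "directly" might look like, here's a minimal R sketch. Everything in it - the matrices, the number of SNPs, the number of morphable-model coefficients - is invented for illustration and has nothing to do with Dewey-Hagborg's actual code or data:

# Hypothetical training data: genotypes coded 0/1/2 for n people at m SNPs,
# and the same people's faces expressed as k morphable-model PCA coefficients.
n <- 200; m <- 50; k <- 10
genotypes   <- matrix(sample(0:2, n * m, replace = TRUE), nrow = n)
face_coeffs <- matrix(rnorm(n * k), nrow = n)

# Fit one linear model per face coefficient, with no 'ethnicity' step at all.
fits <- lm(face_coeffs ~ genotypes)

# Predict the face coefficients of a new sample from its genotype alone;
# these would then drive the morphable model to render a face.
new_genotype   <- matrix(sample(0:2, m, replace = TRUE), nrow = 1)
predicted_face <- cbind(1, new_genotype) %*% coef(fits)

With real data you'd want far more samples than SNPs (or a regularized model), but the point stands: map genotype to face shape directly and let the data, rather than an ethnicity label, carry the information.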

Wednesday, May 15, 2013

Delivering the Hard Truth in a Science Talk

Karen Lips writes:
It’s profoundly frustrating to have a platform and a voice, but not to have a clear call to action for the public. A common theme in science communication is that we have to make the audience care. And people do care – a lot! They are eager to help, to offer suggestions, to get involved. But at the end of my talks there is no magic bullet. The truths I have to offer are not easy, they don’t instantly make us feel better. If there is tough love, let’s call what I have to offer “hard hope”.
Part of the research process is to define problems, show how you solve them, and present the knowledge you've created in that cycle of work. That's the happy-ending scenario. Usually a few minutes are tacked on to discuss the next steps: your current work and unsolved problems. It's part of a delivery format that most scientists are taught to follow first, and I think that's what Lips' example speaks to. Scientists aren't taught to give happy endings, but it's important to learn how to do exactly that.

If the problems you face are still huge and potentially unsolvable (like the extinction of species, as Lips writes about), the audience departs on a sad note. 

Most of the time selling science isn't like selling a book or a product: there's no action that makes the audience feel better.  Selling your science is about educating your audience about something new, novel, or useful to them, which helps them in whatever they do, regardless of whether they're researchers or a more general audience.

The hard part is convincing them that you've given something valuable in return for their time.  Only then do they buy into what you're speaking about.

Monday, May 13, 2013

Barns Are Red Because of How Nuclear Fusion Works

Yonatan Zunger offers a tongue-in-cheek, yet accurate, explanation of why barns are usually painted red:
The answer ... is “because red paint is cheaper,” which is absolutely true, but it doesn’t really tell you why red paint is cheaper. It clearly isn’t because the Central Committee for the Pricing of Paints has decreed that red shall be in vogue this century, or because of the secret Communist sympathies of early American farmers. In fact, to answer this we have to go all the way to the formation of matter itself.
Stars burn light elements in a well-established order of fusion reactions, going through stages of burning hydrogen, helium, carbon, and successively heavier elements:
Until it hits 56. At that point, the reactions simply stop producing energy at all; the star shuts down and collapses without stopping. This collapse raises the pressure even more, and sets off various nuclear reactions which will produce even heavier elements, but they don’t produce any energy: just stuff. These reactions only happen briefly, for a few centuries (or for some reactions, just a few hours!) while the star is collapsing, so they don’t produce very much stuff that’s heavier than 56.
What has 56 nucleons in it and is stable? A mixture of 26 protons and 30 neutrons -- that is, iron.
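As an aside of my own (this part isn't in Zunger's post), a standard back-of-the-envelope way to see why the payoff runs out near a mass number of 56 is the semi-empirical mass formula for the binding energy of a nucleus with A nucleons and Z protons:

B(A,Z) \approx a_V A - a_S A^{2/3} - a_C \frac{Z(Z-1)}{A^{1/3}} - a_A \frac{(A-2Z)^2}{A}

The binding energy per nucleon, B/A, rises for light nuclei - so fusing them releases energy - and peaks around A ≈ 56-62, with iron-56 sitting essentially at the top. Past that point, fusion costs energy rather than releasing it.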
And it's that iron that ends up in red ochre (Fe2O3, a.k.a. hematite), the pigment used in barn paint, explains Zunger. I have to wonder why the other iron-based ochres get passed over: yellow ochre (Fe2O3•H2O, a.k.a. limonite), purple ochre (like red ochre but with a coarser particle size), and brown ochre (goethite, a partly hydrated iron oxide - essentially rust). I'm also surprised at the number of reference books available that explain the basis of colors, like The Chemical History of Color.

The comments have a few other interesting side notes, like this one from Francisco D'Antonia:
There is a specific combination of paint colors made from raw materials that when combined can create all the natural colors of the living world. Its been awhile, but yellow ochre, burnt sienna, cadmium yellow, burnt umber, cobalt blue and titanium white are a few.  I met someone once that worked for a company that made several of the raw colored powders from metal. Fascinating process.
The final question I'm left with is whether red-ochre-based paints really have that much of a cost advantage over other colors. I've seen green barns and a blue barn or two, which suggests that not all farmers are the rational, price-optimizing paint pickers that Zunger imagines.

Nonetheless, I give two thumbs up for his explanation!

Thursday, May 9, 2013

The Art of Self Reliance

Most complex fields, like science, are collaborative by nature. People specialize in an area they're talented in and contribute to projects on that basis. When someone else can do a required piece of work better, faster, and cheaper, some coordination of efforts takes place and the project is passed around like a relay baton. They call this collaboration, and it's supposed to be seamless.

In reality, there's a cost to collaboration. It's the overhead required to coordinate all these separate parts of work: the meetings, the identification of work to be shared, and usually some hunting around for the right person to do the experiments in just the right way, followed by informal negotiation of when the work gets done.

Sometimes the quantum of work to be delegated is so small that it's not even worth spreading the collaboration out. You simply have to find the best person on your team (sometimes just you) and have them get the job done. So someone who has never built a Markov model will learn how to build one, or learn how to prepare next-generation sequencing libraries. Some would argue that, in the long run, this is still an inefficient way to get things done.

But the process is educational, it moves you towards self-reliance, and it builds an appreciation for the difficulties other people live with. You might even learn that pitching some "collaborations" to prospective partners is a much bigger request of them than you initially imagined. Displaying sensitivity to their time might even help move them from No to Yes!

Tuesday, May 7, 2013

US Proposal to Replace Peer Review with "Political Review" in Grants Process

Steven Novella, at Science Based Medicine, writes:
[U.S.] Representative Lamar Smith has been developing legislation that would in effect replace the peer-review process by which grants are currently given with a congressional process. Rather than having relevant scientists and experts decide on the merits of proposed research Smith would have politicians decide. It is difficult to imagine a more intrusive and disastrous process for public science funding.
Novella also points out that the three basic tenets of Smith's proposed legislation sound reasonable, namely that funded science must 1) advance the prosperity of the United States, 2) be of the highest quality and groundbreaking, and 3) not duplicate other research projects.

Sounds good.  We want research that's useful, excellent, and efficiently delivered.

The problem with the first two points is that clear goals and applications need to be defined for research to be useful or deemed to be groundbreaking in order to receive funds.  This generally rules out a lot of academic research, which usually has a clear goal but not necessarily a good application for the knowledge that's to be acquired, while 'groundbreaking' research is usually recognized as such only after the fact.

He also points out that duplication of effort is needed to tackle scientific problems.

To a point this is true, as I've seen numerous cases of nearly identical articles being published in the same issue of a journal. Even so, I tend to believe that most 'duplicated' research really attacks the same question using two or more complementary approaches, which makes the end result(s) much more believable - so even if we were to demand absolutely zero duplication of effort, you could argue that such research isn't really a duplication at all.

Duplication of effort aside, what might be worth arguing for is better coordination between groups interested in the same questions - but that's what conferences are for.

The last point I like in Novella's post is that while some political forces are eager to attack wasteful government spending, even private funding isn't as efficient as some would like to believe:
This can happen with private funding as well. I have seen it happen with disease research. Private charitable organizations raise money to research a disease. The organizers want that money to go to research that will directly benefit patients (who are often their primary donors). But if this prematurely pushes researchers toward clinical studies when we don’t have the basic science sufficiently worked out yet, you end up wasting a lot of time and resources on dead ends.
Just something to keep in mind when you're tempted to push toward a research goal, be it academic or applied, before the results are ready to stand on their own.

Friday, May 3, 2013

Genomic Sequencing Companies Continue To Evolve

There's a nice, short review of the evolving genome sequencing market in Nature Reviews Drug Discovery:
Historically, manufacturers have relied on selling sequencing technologies and reagents. Today, Illumina and other leading companies operate complex business models that encompass the manufacture of genomic sequencing technologies, the provision of commercial genomic sequencing services and the sale of products in the informatics and diagnostics markets.
I recently mentioned that sequencing companies are positioning themselves to become the backbone of the medical system, as it's the kind of technology that's suited to having a single point of contact if genomic information is needed from a wide range of samples.

I'm starting to think the trend isn't even limited to human health or research uses; Jay Flatley, Illumina's CEO, quipped on a recent earnings call that "ultimately, ... you're going to be doing genotyping on every cow that's born and using that as a way to triage its future". With over 30 million calves to genotype, that's another huge application, and one that isn't mired in the safety issues relevant to humans.

Returning to the Nature review, it's important to keep in mind several obstacles that still block genomic technologies from widespread use, in addition to the bottleneck of analyzing all the data, which is the province of computational biologists like myself:
Despite the rapid progress in the development of sequencing strategies, the era of personalized medicine is still a distant goal. Several challenges remain, including the inadequate training of physicians in the area of personalized medicine, attaining the $1,000 genome, enhanced pharmaceutical R&D processes to leverage genomic advances and an international framework for regulating the use of genomic data in the clinic and thereby protecting patient privacy.

Thursday, May 2, 2013

The Art of Fitting Distributions With R

Here's some good advice on fitting distributions using R from Marcus Gesmann, a mathematician involved in the analysis of insurance markets.  He makes it clear that it's a bit of an art:
Suppose I have only 50 data points, of which I believe that they follow a log-normal distribution. How much variance can I expect? Well, let's experiment. I draw 50 random numbers from a log-normal distribution, fit the distribution to the sample data and repeat the exercise 50 times and plot the results using the plot function of the fitdistrplus package.
I notice quite a big variance in the results. For some samples other distributions, e.g. logistic, could provide a better fit. You might argue that 50 data points is not a lot of data, but in real life it often is, and hence this little example already shows me that fitting a distribution to data is not just about applying an algorithm, but requires a sound understanding of the process which generated the data as well.
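The experiment is easy enough to reproduce. Here's a minimal sketch, assuming the fitdistrplus package is installed; the parameter values are mine, not Gesmann's:

# Draw 50 log-normal samples, fit a log-normal to them, repeat 50 times.
library(fitdistrplus)

set.seed(42)
fits <- replicate(50, {
  x <- rlnorm(50, meanlog = 0, sdlog = 1)
  fitdist(x, "lnorm")$estimate   # fitted meanlog and sdlog
})

# How much do the fitted parameters wander across the 50 repetitions?
apply(fits, 1, summary)

Even with the 'true' distribution known, the fitted parameters bounce around noticeably at 50 data points, which is exactly Gesmann's point.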
He also republished a handy guide for deciding what distribution your data might belong to, taken from Probabilistic Approaches to Risk by Aswath Damodaran.

Wednesday, May 1, 2013

At What Point Does More Detail = Less Understanding?

I enjoy visiting Martin Krzywinski's homepage at the BC Genome Sciences Centre from time to time, as it's a fascinating collection of great design ideas for communicating scientific data. This time around, a presentation on designing effective visualizations in the biological sciences was worth the visit.

One slide caught my eye with a warning that most people probably consider obvious: "DO NOT DIVIDE YOUR SCALE INTO MORE THAN 500 INTERVALS".  Regrettably, I can remember a few biologists that would disagree and try to put everything possible into one intricately prepared figure.


Slide 15 from "Designing Effective Visualizations"


You could half-jokingly claim that most hyperdetailed scales are of limited use, except perhaps for pointing out how not to design a scale.  A scientist might counter with "The figure contains all the data!" but as a tool to communicate a concept they fall short.

It also turns out that designing good biological data visualizations isn't just an aesthetic exercise; ironically, the limits at play are themselves biological. The example above reminded me of a very similar one in a book I received as a gift many years ago.

In Hack #34, O'Reilly's Mind Hacks points out that there's a limit to the visual selective attention the mind gives to groups of crowded dots or lines.  Basically, when details are crammed together beyond this limit, the viewer can't willfully focus their attention on any particular detail.  

I tried the examples in Mind Hacks (again) and found that truly, I can't concentrate on something as simple as an individual dot on a crowded field.  The surrounding points draw my attention away from the points I look at, again, and again, and again.

Which leads me to believe that if you're cramming data into scatterplots, I probably won't be able to focus on the points you think are important.
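If you want to see the effect for yourself, a quick base-R comparison does the job (the numbers here are made up purely for illustration):

# 50,000 points drowned in overplotting vs. a density view of the same data
set.seed(1)
x <- rnorm(5e4); y <- x + rnorm(5e4)
par(mfrow = c(1, 2))
plot(x, y, pch = 16)                  # a solid blob; no single point stands out
smoothScatter(x, y)                   # density shading instead of 50,000 dots
points(2, 2, col = "red", pch = 19)   # the one point you actually want seen

The left panel is the crammed scatterplot; the right one at least shows where the mass is and lets a highlighted point survive the crowd.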


Mind Hacks contains many fun examples of where the average person's perception starts to break down, and it's a good guide for becoming aware of some very basic limitations of your eyes and brain.