To quote Albert Einstein: "Erst die Theorie entscheidet darüber, was man beobachten kann" (it is the theory that decides what can be observed). The deductive approach starts from assumptions and first principles, upon which a mechanistic model is built that predicts what could be observed as a consequence of the theory.
Statistics is the science of learning from experience, that is, learning from data. The approach is inductive: data drive the building of knowledge. These data-driven methods are collectively called statistical inference. In statistics, observations decide which theory is correct. In contemporary biomedical research and in the Big Data era, both approaches are useful and cross-fertilize each other.
We provide consulting to help researchers prepare their experimental design before they collect data.
Never ask a statistician to analyse only the data you already have:
to paraphrase R.A. Fisher, most often that amounts to a post-mortem examination.
It is always good practice to meet a biostatistician when designing your experiments,
and to be ready to discuss the following:
- What is your primary research question?
- What is the nature of your expected outcome (continuous, categorical, binary, count, survival/censored or ranked variables)?
- What should your quantitative measure of success be?
- What are the sources of variability?
- What do you expect the outcome to depend on? What is your list of covariates (predictors)?
- Are your covariates possibly correlated? Pay attention to confounding and possible multicollinearity issues.
- Do you have a statistical model to fit? How good is the fit of the model to the data? Are you looking for outliers? A minimal fitting sketch is given after this list.
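As an illustration of that last point, here is a minimal sketch of fitting a simple linear model and screening its residuals, on purely hypothetical data (a real analysis would of course use your own outcome and covariates):

```python
# Minimal sketch: fitting a simple linear model and inspecting the fit
# (hypothetical data; replace with your own outcome and covariate).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(size=50)                           # hypothetical covariate
y = 2.0 * x + rng.normal(scale=1.0, size=50)      # hypothetical outcome

fit = stats.linregress(x, y)
residuals = y - (fit.intercept + fit.slope * x)

print(f"slope = {fit.slope:.2f}, R^2 = {fit.rvalue**2:.2f}")
print(f"largest absolute residual = {np.abs(residuals).max():.2f}")  # crude outlier screen
```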
To determine a sample size n, you will need to know the variability (σ) of your outcome, fix the effect size (δ) you want to detect, set a value for the risk of false positive results (type I error, α) and require a minimal power for your design, i.e. the probability of detecting an effect when one truly exists, 1 − β.
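As a sketch, for a two-sample, two-sided comparison of means under a normal approximation, the per-group sample size follows directly from σ, δ, α and 1 − β (the numbers used below are purely illustrative):

```python
# Sketch: per-group sample size for a two-sample, two-sided comparison of means,
# using the normal approximation n >= 2 * ((z_{1-alpha/2} + z_{1-beta}) * sigma / delta)^2.
import math
from scipy.stats import norm

def sample_size_two_means(sigma, delta, alpha=0.05, power=0.80):
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided type I error
    z_beta = norm.ppf(power)            # required power = 1 - beta
    n = 2 * ((z_alpha + z_beta) * sigma / delta) ** 2
    return math.ceil(n)                 # round up to whole subjects

# Illustrative values: sigma = 1.5, detectable effect delta = 1.0
print(sample_size_two_means(sigma=1.5, delta=1.0))   # about 36 per group
```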
Might it be advisable to carry out a pilot study before you proceed further with the full study?
Can you trust a panel of raters to monitor the quality of a manufactured food or beverage product?
How do you assess the consistency of a panel of raters, or the agreement among the jury members?
This is where ranked outcomes and non-parametric statistical methods come into play.
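One way to quantify the agreement of a panel on ranked items is Kendall's coefficient of concordance W; a small sketch on a hypothetical ratings matrix (raters by products, no tie correction) could look like this:

```python
# Sketch: Kendall's coefficient of concordance W for a panel of raters
# scoring the same set of products (hypothetical data, no tie correction).
import numpy as np
from scipy.stats import rankdata

ratings = np.array([           # rows = raters, columns = products (hypothetical scores)
    [7.1, 5.3, 8.0, 6.2],
    [6.8, 5.9, 7.5, 6.0],
    [7.4, 5.0, 7.9, 6.5],
])

ranks = np.apply_along_axis(rankdata, 1, ratings)   # rank within each rater
m, n = ranks.shape                                  # m raters, n products
rank_sums = ranks.sum(axis=0)
S = ((rank_sums - rank_sums.mean()) ** 2).sum()
W = 12 * S / (m ** 2 * (n ** 3 - n))                # W = 1 means perfect concordance
print(f"Kendall's W = {W:.2f}")
```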
Permutation and bootstrap methods can be helpful to obtain the empirical distribution of your outcome variable, at least under particular assumptions.
You might also need simulated datasets to test the performance of your statistical toolbox.
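For instance, a percentile bootstrap confidence interval for a mean (or any other statistic) can be sketched as follows, on a single hypothetical skewed sample:

```python
# Sketch: percentile bootstrap confidence interval for the mean of a
# hypothetical, skewed sample; the same recipe works for medians, ratios, etc.
import numpy as np

rng = np.random.default_rng(0)
sample = rng.lognormal(mean=0.0, sigma=0.8, size=60)   # skewed toy data

boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(5000)                               # resample with replacement
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])        # 95% percentile interval
print(f"mean = {sample.mean():.2f}, 95% bootstrap CI = ({lo:.2f}, {hi:.2f})")
```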
Quite often in molecular biology experiments, the measures made on the experimental units (observations) do not comply with the classical distributional assumptions. As an example, the codon usage frequencies for a given amino acid observed across transcripts expressed in cells under a given condition (treatment vs. control) may not be normally distributed and may be skewed. For those situations, nonparametric statistical methods are of interest: the Wilcoxon rank-sum test (Mann-Whitney U statistic) when variances are equal, the Fligner-Policello test for medians when the variances are not equal, the Ansari-Bradley rank test for dispersion when medians are equal, and the Kolmogorov-Smirnov distribution-free test for general differences between two populations.
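Three of these tests are readily available in scipy.stats; a minimal sketch on two hypothetical skewed samples (standing in for codon usage frequencies under treatment vs. control):

```python
# Sketch: two-sample nonparametric comparisons on hypothetical skewed data
# (e.g. codon usage frequencies, treatment vs. control).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
treatment = rng.gamma(shape=2.0, scale=1.0, size=40)   # hypothetical skewed samples
control = rng.gamma(shape=2.0, scale=1.3, size=40)

print(stats.mannwhitneyu(treatment, control))  # Wilcoxon rank-sum / Mann-Whitney U (location)
print(stats.ansari(treatment, control))        # Ansari-Bradley test for dispersion
print(stats.ks_2samp(treatment, control))      # Kolmogorov-Smirnov test for general differences
```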
Will you suffer the curse of dimensionality with your big data? If the number of variables is much larger than the number of experimental subjects, you certainly will. Should you filter out and prune some possibly irrelevant variables? There are unsupervised and supervised machine learning techniques which can help you get better insights into your big data: classification and regression trees, hierarchical clustering, k-nearest neighbours (kNN), principal component analysis (PCA), support vector machines (SVM), random forests (RF) and the Lasso (Least Absolute Shrinkage and Selection Operator), just to mention a few.
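As a minimal sketch of this kind of workflow (using scikit-learn on a synthetic "wide" dataset with many more variables than subjects), dimension reduction with PCA followed by a cross-validated random forest might look like this:

```python
# Sketch: PCA for dimension reduction followed by a random forest classifier,
# cross-validated on a synthetic dataset with p >> n (illustrative only).
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=60, n_features=500, n_informative=10,
                           random_state=0)          # synthetic wide data

model = make_pipeline(PCA(n_components=10), RandomForestClassifier(random_state=0))
scores = cross_val_score(model, X, y, cv=5)         # 5-fold cross-validation
print(f"mean accuracy = {scores.mean():.2f}")
```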
We illustrate hereafter, with a few selected examples, some of the above issues and how they are dealt with:
- Logistic regression with generalized estimating equations (GEE) to select the best possible chemical additive to increase a product's shelf life (download this report here);
- Nonparametric statistical analysis and unsupervised learning with principal component analysis, supporting evidence that protein translation efficiency depends on the location of positively charged amino acids and on transcript codon usage (download this report soon here);
- Support Vector Machine (SVM) to build a classifier for a lung cancer metabolomic signature in patients' blood samples (download this report here);
- Unsupervised and supervised machine learning and data mining methods in breast cancer diagnostics (download this report here).
Should you consider Bayesian methods instead of the frequentist approach? How reliable is the prior expert knowledge?
Two examples are given below to provide a flavour of the Bayesian approach to statistical analysis.
The full power of Bayesian methods has emerged in the electronic computing era over the last three decades.
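Before turning to simulation-based methods, here is a flavour of the prior-to-posterior update in a case that does admit a closed-form solution: a Beta prior on a success probability combined with binomial data yields a Beta posterior. A minimal sketch with purely illustrative numbers:

```python
# Sketch: conjugate Beta-Binomial update (illustrative numbers only).
# Prior expert knowledge: Beta(a, b); data: k successes out of n trials.
from scipy import stats

a, b = 2, 8          # hypothetical prior: roughly a 20% expected success rate
k, n = 7, 20         # hypothetical data: 7 successes out of 20 trials

posterior = stats.beta(a + k, b + n - k)        # closed-form posterior Beta(a+k, b+n-k)
print(f"posterior mean = {posterior.mean():.2f}")
print(f"95% credible interval = {posterior.interval(0.95)}")
```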
We present hereafter, in a very intuitive way, the main sampling algorithms useful in advanced Bayesian analysis to evaluate the posterior probability density (when the problem is not amenable to a closed-form analytical solution), known as:
- MCMC: Markov chain Monte Carlo methods, among which are the following two:
  - the Metropolis algorithm (a minimal sketch is given after this list)
  - the Gibbs sampler
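A minimal sketch of a random-walk Metropolis sampler, targeting a standard normal density that stands in for an unnormalized posterior (purely illustrative):

```python
# Sketch: random-walk Metropolis sampler targeting a standard normal
# density (a stand-in for an unnormalized posterior; illustrative only).
import numpy as np

def log_target(x):
    return -0.5 * x ** 2            # log of an unnormalized N(0, 1) density

rng = np.random.default_rng(123)
n_iter, step = 10_000, 1.0
samples = np.empty(n_iter)
x = 0.0                             # starting value of the chain

for i in range(n_iter):
    proposal = x + step * rng.normal()              # symmetric random-walk proposal
    if np.log(rng.uniform()) < log_target(proposal) - log_target(x):
        x = proposal                                # accept
    samples[i] = x                                  # otherwise keep the current value

print(f"posterior mean ~ {samples.mean():.2f}, sd ~ {samples.std():.2f}")
```

The Gibbs sampler proceeds in a similar spirit, but updates each parameter in turn by drawing from its full conditional distribution given the current values of the others.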