Statistics is the science of learning from experience.
We provide consulting to help researchers prepare their experimental design before they collect data.
Never ask a statistician to analyse the data you already have.
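To paraphrase R.A. Fisher, calling in the statistician at that stage is most often asking for a post-mortem examination: he or she may only be able to say what the experiment died of.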
It is always good practice to meet a biostatistician when designing your experiments and be ready to discuss the following:
- What is your primary research question?
- What is the nature of your expected outcome (continuous, categorical, binary, count, survival/censored or ranked)?
- What should be your quantitative measure of success?
- What are the sources of variability?
- What do you expect the outcome to depend on? What is your list of covariates (predictors)?
- Are your covariates possibly correlated? Pay attention to confounding and possible multicollinearity issues.
- Do you have a statistical model to fit? How good is the fit of the model to the data?
- Are you looking for outliers?
To determine a sample size, n, you will need to know the variability (σ) of your outcome, fix the effect size (δ) you want to detect, set the acceptable risk of false positive results (type I error, α) and require a minimal power for your setup (i.e. the probability of detecting an effect if there truly is one, 1-β).
Might it be advisable to carry out a pilot test before you proceed with the full study?
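As an illustration of how σ, δ, α and 1-β combine into a sample size, here is a minimal sketch for one common situation only: a two-sided comparison of the means of two groups of equal size, using the normal approximation n = 2 (z_{1-α/2} + z_{1-β})² σ² / δ² per group. The function name and the default values are our own choices for the example; other designs (paired data, proportions, survival outcomes) call for other formulas.

```python
import math
from statistics import NormalDist


def sample_size_per_group(sigma, delta, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample comparison of means.

    Assumptions (for illustration): equal group sizes, a common standard
    deviation `sigma` treated as known, and the normal approximation
        n = 2 * (z_{1-alpha/2} + z_{1-beta})**2 * sigma**2 / delta**2
    """
    if sigma <= 0 or delta == 0:
        raise ValueError("sigma must be positive and delta non-zero")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # controls the type I error
    z_beta = NormalDist().inv_cdf(power)           # power = 1 - beta
    n = 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2
    return math.ceil(n)  # round up to a whole number of subjects


if __name__ == "__main__":
    # Detect a difference of half a standard deviation with 80% power.
    print(sample_size_per_group(sigma=1.0, delta=0.5))
```

Halving the effect size δ roughly quadruples the required n, which is why the choice of δ deserves careful discussion before the study starts. For small samples, a calculation based on the t distribution gives a slightly larger n.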
Can you trust a panel of raters to monitor the quality of a manufactured food or beverage product? How do you assess the consistency of a panel of raters or the objectivity of a jury's members? This is where ranked outcomes and non-parametric statistical methods come into play. Permutation and bootstrap methods can help you obtain the empirical distribution of your outcome variable, at least under particular assumptions. You may also need simulated datasets to test the performance of your statistical toolbox.
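One possible way to quantify the agreement of a panel on ranked outcomes is Kendall's coefficient of concordance W, combined with a permutation test. The sketch below is only an illustration under stated assumptions: each rater ranks the same n items from 1 to n without ties, and the null hypothesis is that each rater ranks the items at random, independently of the others. The function names and the toy ranks are hypothetical.

```python
import random


def kendalls_w(ranks):
    """Kendall's W for m raters ranking the same n items (no ties assumed).

    ranks[i][j] is the rank (1..n) given by rater i to item j.
    W = 1 means perfect agreement, W = 0 means no agreement at all.
    """
    m = len(ranks)
    n = len(ranks[0])
    rank_sums = [sum(rater[j] for rater in ranks) for j in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((rs - mean_sum) ** 2 for rs in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


def permutation_p_value(ranks, n_perm=2000, seed=0):
    """Estimate P(W >= observed W) when every rater ranks at random."""
    rng = random.Random(seed)
    observed = kendalls_w(ranks)
    hits = 0
    for _ in range(n_perm):
        # Shuffle each rater's ranks independently to break any agreement.
        shuffled = [rng.sample(rater, len(rater)) for rater in ranks]
        if kendalls_w(shuffled) >= observed:
            hits += 1
    # The add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Hypothetical panel: 3 raters ranking 5 products.
    panel = [[1, 2, 3, 4, 5], [2, 1, 3, 5, 4], [1, 3, 2, 4, 5]]
    print(kendalls_w(panel), permutation_p_value(panel))
```

A small p-value suggests the panel agrees more than random ranking would produce. It does not show that the raters are accurate, only that they are consistent with one another.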
Will you suffer from the curse of dimensionality with your big data? If the number of variables is much larger than the number of experimental subjects, you certainly will. Should you filter out and prune possibly irrelevant variables? Unsupervised and supervised machine learning techniques can help you gain better insight into your big data: classification and regression trees (CART), hierarchical clustering, k-nearest neighbours (kNN), principal component analysis (PCA), support vector machines (SVM), random forests (RF) and the Lasso (Least Absolute Shrinkage and Selection Operator), to mention just a few.
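To give a flavour of the simplest of these techniques, here is a minimal sketch of a k-nearest-neighbours classifier using the Euclidean distance and a majority vote. The function name and the toy data are hypothetical. Note that distances become less informative as the number of variables grows, which is one reason why filtering out irrelevant variables beforehand matters for this method.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    if not 1 <= k <= len(train_x):
        raise ValueError("k must be between 1 and the number of training points")
    # Sort the training points by Euclidean distance to the query point.
    by_distance = sorted(
        zip(train_x, train_y),
        key=lambda pair: math.dist(pair[0], query),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical toy data: two well-separated groups in two dimensions.
    points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
    labels = ["a", "a", "a", "b", "b", "b"]
    print(knn_predict(points, labels, (0.2, 0.1)))
```

In practice the variables should be put on comparable scales first, and k is usually chosen by cross-validation rather than fixed in advance.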
We illustrate below, with a few selected examples, some of the above issues and how they are dealt with:
- Logistic regression with generalized estimating equations (GEE) to select the best possible chemical additive to increase a product's shelf life (download this report here);
- Support Vector Machine (SVM) to build a classifier for a lung cancer metabolomic signature in patients' blood samples (download this report here);
- Unsupervised and supervised machine learning and data mining methods in breast cancer diagnostics (download this report here).
Should you consider Bayesian methods instead of the frequentist approach? How reliable is the prior expert knowledge?
Two examples are given below providing a flavour of the Bayesian approach to statistical analysis.
The full power of Bayesian methods has emerged in the electronic computing era over the last three decades.
We present below, in a very intuitive way, the main sampling algorithms used in advanced Bayesian analysis to evaluate the posterior probability density when problems are not amenable to closed-form analytical solutions. They are known as:
- MCMC (Markov chain Monte Carlo), a family of methods that includes:
  - the Metropolis algorithm
  - the Gibbs sampler
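As a first taste of the Metropolis algorithm, here is a minimal random-walk sketch. The example problem is our own choice for illustration and is not taken from the reports above: a binomial proportion θ with 7 successes in 10 trials and a uniform prior. In this case the posterior has a closed form, Beta(8, 4), so the sampler can be checked against the known posterior mean 8/12. The step size, chain length and burn-in are arbitrary illustrative values.

```python
import math
import random


def log_posterior(theta, successes=7, trials=10):
    """Unnormalised log posterior of a binomial proportion, uniform prior."""
    if not 0.0 < theta < 1.0:
        return -math.inf  # zero posterior density outside (0, 1)
    return successes * math.log(theta) + (trials - successes) * math.log(1.0 - theta)


def metropolis(log_target, start, n_samples=20000, step=0.2, burn_in=1000, seed=42):
    """Random-walk Metropolis sampler with a symmetric Gaussian proposal.

    `start` is assumed to lie where the target density is positive.
    """
    rng = random.Random(seed)
    current = start
    current_lp = log_target(current)
    samples = []
    for i in range(n_samples + burn_in):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(current)).
        accept_prob = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept_prob:
            current, current_lp = proposal, proposal_lp
        if i >= burn_in:  # discard the first draws while the chain settles
            samples.append(current)
    return samples


if __name__ == "__main__":
    draws = metropolis(log_posterior, start=0.5)
    print(sum(draws) / len(draws))  # should be close to 8/12
```

Only the ratio of posterior densities enters the acceptance step, so the normalising constant of the posterior is never needed. This is what makes the method usable when no closed-form solution exists.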