Unit(s) of assessment: Allied Health Professions, Dentistry, Nursing and Pharmacy

Research theme: Health and Wellbeing

School: School of Science and Technology


Transforming Biological Data into Clinical Benefits

Medics and leaders in healthcare understand that we must base the creation of new drugs and diagnostics on the analysis of Big Data, but the pharma and diagnostics industry does not yet know how to do so. The Ball group provides that know-how.

We can now answer the critical questions in biotechnology today:

  • What are the most important genetic factors or drivers in the disease I am studying that research has not yet identified?
  • Is my biomarker panel the optimum panel in terms of brevity, sensitivity and specificity?
  • What simple test can I use as a companion diagnostic for my new drug, with confidence that the test will work?
  • How can I split my patient population so that I can focus my trial on the right patient group, improving the likelihood of success in my clinical trial?
  • Do the results of my failed trial conceal a successful drug, but for a sub-population of patients?
  • Can you identify a better target for drug discovery in the disease I am studying?
  • Can you derive a test to identify who will respond to my drug, and who will not suffer significant toxicity issues?
  • Existing diagnostics are inadequate for the disease I am researching as they are not fast enough, accurate enough or suitable for point of care diagnosis. Can you help?
  • Can you find a new molecular druggable target for the disease I am working on?

Professor Ball collaborates with leading cancer and bioinformatics researchers at a number of universities and SMEs across the UK, Europe and the USA.


Since the sequencing of the human genome, new approaches for studying disease systems at the genomic, epigenetic, proteomic and metabolomic levels have been developed continually. One challenge in analysing such data is the volume, resolution and complexity of the data generated, together with its quality. From a bioinformatics perspective, these issues are often typified by the criticism that different data sets do not yield consistent results. This suggests that individual data sets often have high levels of noise and too few cases to achieve sufficient statistical power. Analysis of such data therefore requires careful consideration, paying attention to the non-linearity of biological systems, the interaction of molecular entities in pathways, the fluidity of biological systems and the need to determine consistent entities across multiple data sets.

We have developed cutting-edge systems biology and bioinformatics approaches, based on computational intelligence, which identify robust non-linear biomarkers that are associated with clinical features and concordant across multiple data sets. Furthermore, we have developed approaches which study the interactions between key features in the context of a given problem. In effect, these approaches determine the level of influence of a set of driver markers in a given biological system, which allows us to determine the molecular drivers that result in a given phenotype.

Based on our patented technology, we have developed a number of approaches to address the challenges of “Big Data” whilst ensuring that the results are relevant to the questions being asked.

This is the first algorithm to be applied. It mines the data, distilling out the key biomarkers that best predict the answer to the clinical question being asked. It does so by ranking biomarkers according to their prediction error on an unseen data set, averaged across multiple repeats. The findings from this approach feed into the other approaches described below.
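As a minimal sketch of this ranking step, the Python below scores each marker by its mean classification error on held-out samples over repeated random splits. The nearest-class-mean rule, the name rank_markers and its parameters are illustrative assumptions chosen for brevity; they are not the patented algorithm, whose details are not given here.

```python
import random
from statistics import mean


def rank_markers(X, y, n_repeats=50, test_fraction=0.3, seed=0):
    """Rank markers by mean held-out error over repeated random splits.

    X: list of samples, each a list of marker values.
    y: list of 0/1 class labels, one per sample.
    Returns (marker indices sorted best first, mean error per marker).
    """
    rng = random.Random(seed)
    n_samples, n_markers = len(X), len(X[0])
    n_test = max(1, int(n_samples * test_fraction))
    errors = [[] for _ in range(n_markers)]

    for _ in range(n_repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        if len({y[i] for i in train}) < 2:
            continue  # both classes are needed to fit the rule

        for m in range(n_markers):
            # Stand-in model: assign each test sample to the nearer class mean.
            mu0 = mean(X[i][m] for i in train if y[i] == 0)
            mu1 = mean(X[i][m] for i in train if y[i] == 1)
            wrong = 0
            for i in test:
                pred = 1 if abs(X[i][m] - mu1) < abs(X[i][m] - mu0) else 0
                wrong += pred != y[i]
            errors[m].append(wrong / len(test))

    mean_err = [mean(e) for e in errors]
    ranking = sorted(range(n_markers), key=lambda m: mean_err[m])
    return ranking, mean_err
```

Averaging over many random splits, rather than relying on a single split, makes the ranking less sensitive to which samples happen to be held out.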

Expression array studies are frequently underpowered if considered in isolation. This approach applies the data mining algorithm described above across multiple data sets. The top-ranked probes are then cross-compared to find commonality (Figure 2). Probes found in common between multiple data sets are unlikely to have occurred by chance, and the approach uses this statistical principle to increase the certainty of the markers discovered, increasing the statistical power of the marker set and reducing the risk of false discovery. It also increases the generality of the markers discovered, so they are more likely to be predictive for the general population. Integration is achieved at the probe level for the same array platform and at the gene level for different array platforms. The approach was used in the Abdel-Fatah et al. (2016, Lancet Oncology) study to identify a key set of markers for proliferation in breast cancer by integrating data mining across 4 data sets. The biomarkers identified had a false discovery probability of 2 × 10^-74 and have now been validated using immunohistochemistry in over 15,000 cases.
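The statistical principle can be illustrated for two data sets with a hypergeometric tail probability: the chance that two top-ranked lists, drawn at random from the same set of probes, share at least the observed number of probes. This is a sketch under a simple random-draw assumption; the function name is hypothetical, and it is not necessarily how the false discovery probability quoted above was calculated.

```python
from math import comb


def overlap_p_value(n_probes, top_a, top_b):
    """Return (overlap size, P(overlap >= observed) under random selection).

    n_probes: total number of probes shared by both data sets.
    top_a, top_b: top-ranked probe identifiers from each data set.
    """
    a, b = set(top_a), set(top_b)
    k = len(a & b)
    # Hypergeometric tail: sum the probability of every overlap size >= k.
    total = comb(n_probes, len(b))
    tail = sum(
        comb(len(a), i) * comb(n_probes - len(a), len(b) - i)
        for i in range(k, min(len(a), len(b)) + 1)
    )
    return k, tail / total
```

For example, two identical top-3 lists drawn from 10 probes give an overlap of 3 with probability 1/120 under random selection. With thousands of probes and longer lists, the same calculation yields very small probabilities for even modest overlaps.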

By using the above algorithm in a stepwise additive fashion on large molecular data sets, we can build an optimized panel from the best subset of markers, those with the greatest sensitivity and specificity. If this is coupled with the concordance analysis, a very robust diagnostic panel can be developed. Classifier panels are assessed on both seen and unseen data to gauge their suitability as a diagnostic for the wider population, and performance is evaluated using Receiver Operating Characteristic (ROC) curves. The population is then characterized and ranked based on the probability of disease for a given individual.
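The stepwise additive idea can be sketched as a greedy forward selection that adds, at each step, the marker giving the largest gain in area under the ROC curve (AUC). The summed panel score, the stopping threshold min_gain and the function names are assumptions made for illustration, and the sketch assumes markers are already scaled and oriented so that higher values indicate disease.

```python
def auc(scores, labels):
    """Area under the ROC curve by pairwise comparison; ties count one half."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def panel_score(X, panel):
    """Score each sample as the sum of its values for the panel's markers."""
    return [sum(row[m] for m in panel) for row in X]


def build_panel(X, y, max_size=5, min_gain=0.01):
    """Greedily add markers while the AUC improves by at least min_gain."""
    panel, best = [], 0.5  # 0.5 is the AUC of a random classifier
    while len(panel) < max_size:
        candidates = [m for m in range(len(X[0])) if m not in panel]
        if not candidates:
            break
        top_auc, top_m = max(
            (auc(panel_score(X, panel + [m]), y), m) for m in candidates
        )
        if top_auc - best < min_gain:
            break  # no remaining marker adds enough to justify a longer panel
        panel.append(top_m)
        best = top_auc
    return panel, best
```

The AUC returned here is measured on the same data used for selection, so it is optimistic; a held-out set would be passed through panel_score and auc separately to check performance on unseen cases.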

This approach draws on systems biology and pathway analysis. A set of markers derived from the core analysis described above is used in network inference algorithms to identify a network of interactions. This network is analysed to identify the key molecular drivers, that is, the most influential molecules in a given system. The approach has been used in a commercial contract with Syngenta to identify transcriptomic regulators of ripening in tomato (Pan Y. et al., Plant Physiology, 2013).

The approach was also used in the Abdel-Fatah et al. (2016, Lancet Oncology) study. A systems approach goes further than a simple list of markers because biology is defined by the interaction of molecular markers. The approach refines the marker set identified and can be used to identify molecular disease processes that differentiate a healthy population from a diseased population (in silico therapeutic target identification), as well as processes associated with therapeutic response or pharmacodynamics.

Knowledge of biological pathways is essential for drug discovery and development. However, our knowledge of the biology of pathways and what drives them is scant. Often pathways have been modelled based on simple reductionist experiments in the lab and the results do not reflect the nature of the pathway in a given disease or reflect a broader evaluation of the whole transcriptomic or proteomic nature of the pathway.  In short there are numerous gaps and omissions in our knowledge of pathways, yet they are the basis of many expensive drug discovery strategies.

Based on the success of this work, we have developed a new systems biology and pathway interrogation methodology (Intellomx Pathway Miner). This approach first uses our patented Distiller algorithm to mine transcriptomic data for selected diseases, using existing pathway knowledge as the framework for data interrogation. It identifies new disease-specific pathway features from across the whole proteome/transcriptome, in effect creating an augmented pathway specific to the disease being investigated. The extensive cross-validation conducted in Distiller ensures that the markers we identify are robust, having been validated across multiple data sets.

Once the augmented pathway has been created, new and existing features are added to the Driver algorithm. This uses the network inference and systems biology algorithms to identify the strength and nature of the influence of each molecule on the pathway. In this way the most influential and the most influenced molecules in a given pathway for a given disease can be identified. These influential molecules, given their relationship to phenotype as described above, are the most likely druggable target contenders.
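To illustrate the final ranking step only, the sketch below assumes the network inference has already produced signed, directed interaction weights, and ranks molecules by total outgoing strength (most influential) and total incoming strength (most influenced). The edge format and the function name are hypothetical; how the Driver algorithm derives or aggregates its weights is not described here.

```python
from collections import defaultdict


def rank_influence(edges):
    """Rank molecules by outgoing and incoming interaction strength.

    edges: dict mapping (source, target) to a signed weight, where the sign
    marks activation or inhibition and the magnitude marks strength.
    Returns (most influential first, most influenced first).
    """
    out_strength = defaultdict(float)
    in_strength = defaultdict(float)
    for (source, target), weight in edges.items():
        out_strength[source] += abs(weight)
        in_strength[target] += abs(weight)

    def ranked(strength):
        # Strongest first; ties broken alphabetically for a stable order.
        return sorted(strength, key=lambda name: (-strength[name], name))

    return ranked(out_strength), ranked(in_strength)


# Placeholder molecule names, not real findings.
influential, influenced = rank_influence(
    {("A", "B"): 0.9, ("A", "C"): -0.8, ("B", "C"): 0.2}
)
```

Absolute weights are summed so that strong inhibitory and strong stimulatory interactions both count towards a molecule's influence.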

Through the interrogation of drug-gene interaction databases, we can identify existing drugs that will interact with these molecules. Thus, we can identify potential repurposing targets using in silico methods.

Our ability to identify new potential targets is demonstrated by the results of our R&D programme, in which we analysed a number of high-quality public data sets for lung cancer and, through application of our Pathway Miner methods, identified a number of novel targets for known drugs and potential therapies. Below we present some of the results for unknown but influential genes for the enriched MEKK pathway in lung cancer (see Figure 6). We have also applied our approaches to a wide range of other cancers and pathways, generating a pathway driver IP portfolio.




PYK2 promotes HER-2 positive breast cancer invasion. Al-Juboori SI, Vadakekolathu J, Idri S, Wagner S, Zafeiris D, Pearson JR, Almshayakchi R, Caraglia M, Desiderio V, Miles AK, Boocock DJ, Ball GR, Regad T. (2019). J Exp Clin Cancer Res. 38(1):210. Doi: 10.1186/s13046-019-1221-0.

A parsimonious 3-gene signature predicts clinical outcomes in an acute myeloid leukemia multicohort study. Wagner S, Vadakekolathu J, Tasian S, Altmann H, Bornhauser M, Pockley AG, Ball GR, Rutella S. (2019). Blood Adv. 3(8):1330-1346. Doi: 10.1182/bloodadvances.2018030726.

MTSS1 and SCAMP1 cooperate to prevent invasion in breast cancer. Vadakekolathu J, Al-Juboori SIK, Johnson C, Schneider A, Buczek ME, Di Biase A, Pockley AG, Ball GR, Powe DG, Regad T. (2018). Cell Death Dis. 9(3):344. Doi: 10.1038/s41419-018-0364-9.

An Artificial Neural Network Integrated Pipeline for Biomarker Discovery using Alzheimer’s Disease as a Case Study. Zafeiris D, Rutella S, Ball GR. (2018). Comput Struct Biotechnol J. 16:77-87. Doi: 10.1016/j.csbj.2018.02.001.

Discovery and application of immune biomarkers for hematological malignancies. Zafeiris D, Vadakekolathu J, Wagner S, Pockley AG, Ball GR, Rutella S. (2017). Expert Rev Mol Diagn. 17(11):983-1000. Doi: 10.1080/14737159.2017.1381560.

Related Projects

Research Excellence Framework (REF) 2021

The Bioinformatics Research Group submitted an impact case study to REF 2021. 99% of NTU's research submitted to the 'Allied Health Professions, Dentistry, Nursing and Pharmacy' Unit of Assessment was considered to be either world-leading or internationally excellent in terms of quality.



Within the John van Geest Cancer Research Centre, we have access to instrumentation that facilitates multi-omic analyses, including gene expression microarrays, the NanoString platform, mass spectrometers and laser capture microdissection.