How to Use Big Data in Healthcare

In recent years, Big Data technologies have moved beyond the IT industry and into many areas of everyday life: government, business, science, healthcare, and more. Healthcare is a good example of how Big Data can do more than optimize production and increase company revenue: it can also help solve global problems and save millions of lives.

The population of our planet grows every year, which makes predicting and preventing epidemics and fighting disease some of the main tasks of modern society. Data Science can help do this more efficiently.

In this article, we will tell you about using Big Data in the healthcare industry. You will learn about electronic health record systems, how portable electronic devices can help with treatment, and how Big Data algorithms can save lives in intensive care. We will also talk about telemedicine: remote medical consultations between a doctor and a patient.

We will also tell you about a Data Science project created by our team. It analyzes statistics from the World Health Organization to identify the most frequent causes of death among the European population. Stay with us, it is going to be interesting!

Data Science and Healthcare

In the future, a person’s entire medical history, from birth onward, will be stored in an electronic database.

Machine learning algorithms can find statistical correlations in worldwide volumes of medical data. This will make it possible to give patients and their doctors recommendations quickly.

Analysis of all known clinical records will make it possible to build decision support systems for doctors, giving them access to the experience of tens of thousands of colleagues all over the world. Let’s look at the benefits of introducing Big Data technologies in healthcare.

Electronic health records

An electronic health record is a system that collects information from different sources. It holds information about the patient’s diagnoses, medications, current health issues, completed procedures, medical screenings, etc. Smart health record systems can also email patients reminders to follow their doctor’s recommendations.

Using data from an electronic health record system, a doctor may find correlations between diseases that seem completely unrelated. For example, the risk management system developed by members of the Kaiser Permanente Consortium can calculate the risk of mental illness among diabetes patients. The American army uses this model in its efforts to minimize the suicide rate among veterans.
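One simple way to quantify such a link between two conditions is relative risk: how much more often an outcome appears among patients who have a given condition than among those who do not. The sketch below is only an illustration of that idea, not the Kaiser Permanente model; the `PatientRecord` structure and all of the records are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class PatientRecord:
    """Hypothetical, heavily simplified health record entry."""
    patient_id: int
    diagnoses: frozenset


def relative_risk(records, exposure, outcome):
    """Risk of `outcome` among patients with `exposure`, divided by the risk
    among patients without it. Returns None when either group is empty or
    the unexposed group has no cases of the outcome."""
    exposed = [r for r in records if exposure in r.diagnoses]
    unexposed = [r for r in records if exposure not in r.diagnoses]
    if not exposed or not unexposed:
        return None
    risk_exposed = sum(outcome in r.diagnoses for r in exposed) / len(exposed)
    risk_unexposed = sum(outcome in r.diagnoses for r in unexposed) / len(unexposed)
    if risk_unexposed == 0:
        return None
    return risk_exposed / risk_unexposed


# Made-up records for illustration only; not real patient data.
records = [
    PatientRecord(1, frozenset({"diabetes", "depression"})),
    PatientRecord(2, frozenset({"diabetes"})),
    PatientRecord(3, frozenset({"diabetes", "depression"})),
    PatientRecord(4, frozenset({"diabetes"})),
    PatientRecord(5, frozenset({"depression"})),
    PatientRecord(6, frozenset()),
    PatientRecord(7, frozenset({"hypertension"})),
    PatientRecord(8, frozenset()),
]

rr = relative_risk(records, exposure="diabetes", outcome="depression")
print(f"Relative risk of depression given diabetes: {rr:.2f}")
```

A value above 1 means the outcome is more common in the exposed group. A real system would also have to account for age, sex, and other confounding factors.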

Portable electronic devices

The number of portable electronic devices, such as fitness watches, increases every year. In the USA, the practice of passing this data to the attending doctor is now being introduced.

Even if the patient’s health is normal, the petabytes of collected information form a flexible and constantly growing database. Neural networks will be able to find correlations between tracker data and a person’s susceptibility to diseases. This is an example of data analytics in healthcare that can spot the weak points in people’s health, predict probable diseases, and give recommendations on how to prevent them. For doctors, it is a way to predict the outcome of a particular treatment based on its results among patients with similar genetics and lifestyle.

Resuscitation & intensive care

In intensive care, predictive analysis takes first place in patient care. The most vulnerable patients are prone to sudden deterioration caused by infections. Busy intensive care doctors cannot always predict these cases, but healthcare data science algorithms can help.

These algorithms help doctors make sure they do not miss important information about a patient, such as the risk of sepsis. Because sepsis has no strong symptoms in its early stages, doctors often detect it too late, which is why about 40% of sepsis cases are fatal. By analyzing the patient’s state against millions of similar cases, the system can flag the risk of sepsis and help doctors detect it as early as possible.
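To make the idea of automatic flagging concrete, here is a minimal sketch that uses qSOFA, a published bedside screening score, as a rule-based stand-in. It is not the machine-learning approach described above, which would learn from many more variables, and it is not a diagnostic tool. The bed names and vital-sign readings are hypothetical.

```python
def qsofa_score(respiratory_rate, systolic_bp, glasgow_coma_scale):
    """Return the qSOFA score (0-3): one point each for a respiratory rate of
    22 breaths/min or more, a systolic blood pressure of 100 mmHg or less,
    and altered mentation (Glasgow Coma Scale below 15)."""
    score = 0
    if respiratory_rate >= 22:
        score += 1
    if systolic_bp <= 100:
        score += 1
    if glasgow_coma_scale < 15:
        score += 1
    return score


def needs_sepsis_review(vitals):
    """Flag a patient for clinical review when the qSOFA score is 2 or more."""
    return qsofa_score(**vitals) >= 2


# Hypothetical vital-sign readings, for illustration only.
ward = {
    "bed-1": {"respiratory_rate": 16, "systolic_bp": 124, "glasgow_coma_scale": 15},
    "bed-2": {"respiratory_rate": 24, "systolic_bp": 96, "glasgow_coma_scale": 15},
    "bed-3": {"respiratory_rate": 23, "systolic_bp": 118, "glasgow_coma_scale": 14},
}

flagged = [bed for bed, vitals in ward.items() if needs_sepsis_review(vitals)]
print("Beds flagged for review:", flagged)
```

A learned model would replace the fixed thresholds with patterns found in historical cases, but the output would play the same role: a prompt for a doctor to take a closer look.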


Telemedicine

Another example of using Big Data in healthcare is the development of telemedicine. The term covers both primary diagnostics and ongoing monitoring of the patient’s state. Thanks to telemedicine, it is possible to communicate with a doctor remotely.

Doctors use telemedicine to create an individual treatment plan for the patient and to prevent hospitalization. With fewer patients in the hospital, it is possible to reduce spending on medical services without losing quality of service. A consultation can be held at any time and from any place convenient for both the patient and the doctor.

Challenges of introducing Big Data in healthcare

For all its promise, it would be wrong to consider Big Data technology the key to all knowledge in the world. Processing huge amounts of information raises three major challenges.

Unstructured data

Algorithms for processing text were created long ago and are widely used in Big Data processing. It is still unclear, however, what to do with audio and video. If we use standard speech-to-text algorithms, the volume of information becomes too large.

This makes it even harder to find useful information. About 78% of medical data is unstructured, and it is too expensive to filter and analyze such amounts of information.

Lots of junk information

Big data experts are sure that most projects in this area fail because the data being analyzed contains too much irrelevant information.

Collecting information is not difficult, and storing data is cheaper than destroying it. Yet large amounts of low-quality information can lead analytical systems to false conclusions, for example a false correlation between a disease and external factors.
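The risk of false correlations is easy to reproduce. In the sketch below, both the "outcome" and every "feature" are pure random noise, yet with enough junk features and few patients, at least one feature ends up looking noticeably correlated with the outcome. The sample sizes and the fixed seed are arbitrary choices made for the illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable
n_patients = 20
n_junk_features = 1000

# The outcome and every feature are independent random noise.
outcome = [rng.gauss(0, 1) for _ in range(n_patients)]
correlations = [
    pearson([rng.gauss(0, 1) for _ in range(n_patients)], outcome)
    for _ in range(n_junk_features)
]

strongest = max(correlations, key=abs)
print(f"Strongest correlation among pure-noise features: {strongest:+.2f}")
```

None of these features carries any information about the outcome, so any apparent relationship is an artifact of testing many irrelevant variables on a small sample.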

Lack of standards

Big Data companies need to create a universal protocol for exchanging medical information. The more standardized medical information is available, the more precise analytical descriptions of diseases and prediction systems will be.

How we made a healthcare project using big data

We have created a report that shows mortality trending downward since 2012. This is not surprising: in 2012 the World Health Organization (WHO) revised its budget in response to both the new reality of financial austerity and a series of reforms being undertaken to improve the overall performance of WHO. These reforms include an improved results-based management and accountability framework and a more realistic and flexible funding model.

The report suggests that the WHO’s reforms are working: both the total number of deaths and the number of deaths per 100,000 people fall sharply.

[Figure: healthcare graphics]

To collect the raw data for the analysis, we used the API on the official website of the World Health Organization to obtain statistics on causes of death. For information about the financial situation in European countries, we used data from the website of the European Central Bank.

Our goal was to identify the most frequent causes of death in different European countries and to find the correlation between a country’s economic conditions and its mortality rate.

Practical application of the project: Development process

Many factors influence health status and a country’s ability to provide quality health services for its people. Ministries of health are important actors, but so are other government departments, donor organizations, civil society groups and communities themselves. For example, investments in roads can improve access to health services; inflation targets can constrain health spending; and civil service reform can create opportunities, or limits, to hiring more health workers. The visualization shows mortality per 100,000 people in Europe and indicates which countries need reforms and measures to reduce this indicator.

[Figure: deaths statistics]

As the visualization shows, such measures are urgently needed in many countries, in particular the CIS countries, Moldova, Malta, Romania, Latvia, Montenegro, Bulgaria, Poland and Hungary.

Measuring how many people die each year and why they died is one of the most important means, along with gauging how diseases and injuries are affecting people, of assessing the effectiveness of a country’s health system.

Cause-of-death statistics help health authorities determine the focus of their public health actions. A country in which deaths from heart disease and diabetes rise rapidly over a period of a few years, for example, has a strong interest in starting a vigorous program to encourage lifestyles to help prevent these illnesses. Improvements in producing high-quality cause-of-death data are crucial for improving health and reducing preventable deaths in these countries.

Let’s look closer at the development process stages

We chose Power BI as our business analysis platform. This tool offers a wide range of local and cloud-based BI and analytical capabilities. Power BI offers data preparation and data discovery, interactive dashboards, and tools for integrating different data sources and third-party applications. The interface is clear and convenient, and the pricing is moderate.

Project stages

The process of data mining for this project was divided into the following stages:

  • Collecting the information

At this stage, we selected several data sets from the official WHO site, along with data from the European Central Bank on poverty in different countries.

  • Preparing the information

We cleaned the data of invalid figures, such as missing values, typos, and impossible figures. Some data was transformed by creating auxiliary values for analysis, for example the mortality coefficient per 100,000 people. Then the data from health organizations was combined with the information from the European Central Bank.

  • Data research

This stage was necessary for a deep understanding of the data. We created algorithms that found correlations and differences using descriptive and visual methods. While working on the project, we found that mortality per 100,000 people correlates with a society’s living standards: the higher the poverty rate, the higher the mortality. We also checked for a correlation between alcohol consumption and mortality, but did not find an obvious one.

  • Data modeling

At this stage, we used ML methods for modeling. We compared several regression methods and decided that the most suitable one is the decision tree. We chose decision trees because linear regression is not effective on large volumes of data with complicated, nonlinear relationships. A decision tree is well suited to learning complicated nonlinear decision rules and, like neural networks, performs well on such data. The tree helped us model the mortality coefficient as a function of the country, body mass index (BMI), and number of years spent studying: the more years of schooling, the lower the mortality rate, while a higher BMI goes with a higher coefficient.

  • Modeling and visualization

When we finished processing the data, we visualized the results to present our conclusions and predictions. This approach is universal: whatever the field, it is necessary to understand and analyze the information deeply. Modeling is the main factor in building a working system or business model.
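The preparation stage described above can be sketched in a few lines. This is a minimal illustration, not the pipeline used in the project (which was built in Power BI): the field names, the validation rules, and the sample rows with placeholder country names are all assumptions made for the example.

```python
import math


def parse_number(raw):
    """Convert a raw cell to a non-negative finite number, or return None if
    the cell is missing, mistyped, or impossible (for example, negative)."""
    try:
        value = float(str(raw).replace(",", "").strip())
    except ValueError:
        return None
    if not math.isfinite(value) or value < 0:
        return None
    return value


def mortality_per_100k(deaths, population):
    """Auxiliary value: deaths per 100,000 people."""
    return deaths * 100_000 / population


def prepare(health_rows, poverty_by_country):
    """Clean the health rows, add the derived rate, and join the poverty
    figure for each country. Rows that fail validation are dropped."""
    prepared = []
    for row in health_rows:
        deaths = parse_number(row.get("deaths"))
        population = parse_number(row.get("population"))
        poverty = poverty_by_country.get(row.get("country"))
        if deaths is None or not population or poverty is None:
            continue  # missing value, typo, or no matching economic data
        if deaths > population:
            continue  # impossible figure
        prepared.append({
            "country": row["country"],
            "mortality_per_100k": mortality_per_100k(deaths, population),
            "poverty_rate": poverty,
        })
    return prepared


# Placeholder rows for illustration only; not real statistics.
health_rows = [
    {"country": "Country A", "deaths": "1,200", "population": "100000"},
    {"country": "Country B", "deaths": "", "population": "250000"},
    {"country": "Country C", "deaths": "45o0", "population": "500000"},
    {"country": "Country D", "deaths": "900000", "population": "300000"},
    {"country": "Country E", "deaths": "2500", "population": "250000"},
]
poverty_by_country = {"Country A": 12.5, "Country B": 9.0, "Country C": 20.0,
                      "Country D": 8.0, "Country E": 10.0}

prepared = prepare(health_rows, poverty_by_country)
for row in prepared:
    print(row)
```

Only the rows for Country A and Country E survive: the others contain a missing value, a typo, and more deaths than inhabitants respectively.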


Using the decision tree in this report, you can simulate the mortality rate per 100,000 people by selecting a country. The coefficient is calculated from BMI and the number of years of study (Schooling). The decision tree showed that the higher the BMI, the higher the coefficient, and the higher the Schooling, the lower the coefficient.
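To show how a regression tree arrives at such a result, here is a small tree written from scratch in plain Python. It is a sketch of the general technique, not the model from the report: it ignores the country, uses only BMI and Schooling, and is trained on four synthetic data points invented for the example.

```python
def mean(values):
    return sum(values) / len(values)


def sse(values):
    """Sum of squared errors around the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(rows, targets):
    """Find the (feature index, threshold) pair that minimises the total
    squared error of the two groups. Returns None if no split helps."""
    best = None
    best_error = sse(targets)
    for feature in range(len(rows[0])):
        values = sorted({row[feature] for row in rows})
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [t for r, t in zip(rows, targets) if r[feature] <= threshold]
            right = [t for r, t in zip(rows, targets) if r[feature] > threshold]
            error = sse(left) + sse(right)
            if error < best_error:
                best_error = error
                best = (feature, threshold)
    return best


def build_tree(rows, targets, max_depth=2):
    """Return a nested dict; each leaf holds the mean target of its group."""
    split = best_split(rows, targets) if max_depth > 0 and len(rows) > 1 else None
    if split is None:
        return {"value": mean(targets)}
    feature, threshold = split
    left = [(r, t) for r, t in zip(rows, targets) if r[feature] <= threshold]
    right = [(r, t) for r, t in zip(rows, targets) if r[feature] > threshold]
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree([r for r, _ in left], [t for _, t in left], max_depth - 1),
        "right": build_tree([r for r, _ in right], [t for _, t in right], max_depth - 1),
    }


def predict(tree, row):
    """Walk from the root to a leaf and return that leaf's value."""
    while "value" not in tree:
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["value"]


# Synthetic data for illustration only: (BMI, years of schooling) -> rate.
rows = [(22.0, 16.0), (22.0, 10.0), (28.0, 16.0), (28.0, 10.0)]
targets = [800.0, 1000.0, 950.0, 1200.0]

tree = build_tree(rows, targets, max_depth=2)
print(predict(tree, (30.0, 12.0)))  # high BMI, less schooling
print(predict(tree, (21.0, 17.0)))  # low BMI, more schooling
```

On this toy data the tree splits first on Schooling and then on BMI, so its predictions rise with BMI and fall with Schooling, matching the direction of the relationship described above.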

[Figure: bmi graphics]

The following analysis tools were used in the work: maps for displaying countries and numerical characteristics, time series for analyzing mortality by year, a decision tree for modeling the mortality rate, a hexbin scatterplot for analyzing data with several numerical characteristics, and a heatmap for visualizing the dependence between various characteristics and the classification of diabetes (high, medium, low incidence).
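The article does not say how the high, medium and low incidence classes were defined. One simple possibility, assumed here purely for illustration, is to rank the countries by prevalence and split them into three groups of roughly equal size. The country names and prevalence values below are placeholders.

```python
def classify_incidence(prevalence_by_country):
    """Label each country 'low', 'medium' or 'high' by splitting the
    countries, ranked by prevalence, into three roughly equal groups."""
    ranked = sorted(prevalence_by_country, key=prevalence_by_country.get)
    n = len(ranked)
    labels = {}
    for position, country in enumerate(ranked):
        if position * 3 < n:
            labels[country] = "low"
        elif position * 3 < 2 * n:
            labels[country] = "medium"
        else:
            labels[country] = "high"
    return labels


# Placeholder prevalence values (percent); not real statistics.
prevalence = {
    "Country A": 5.1,
    "Country B": 12.0,
    "Country C": 7.4,
    "Country D": 9.3,
    "Country E": 6.2,
    "Country F": 10.8,
}

labels = classify_incidence(prevalence)
for country in sorted(labels):
    print(country, labels[country])
```

Fixed clinical thresholds would be an equally valid choice; ranking into thirds simply guarantees that every class is populated.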


The number of people with diabetes in Europe is estimated to be 58 million representing 8.8% of the population aged 20-79 years, including 22 million undiagnosed cases. While Europe has the second-lowest age-adjusted comparative diabetes prevalence of any IDF region (after IDF Africa Region), there are still many countries with relatively high diabetes prevalence rates. Turkey has the highest age-adjusted comparative prevalence (12.1%) and the third-highest number of people with diabetes in Europe (6.7 million), after Germany (7.5 million) and the Russian Federation.

More than 477,000 deaths among people aged 20 – 79 are attributed to diabetes in Europe (9% of all mortality).

These figures show that diabetes is an important and relevant healthcare issue.

You are welcome to check the full statistics here.


Having read this article, you now know how Big Data can be used in healthcare. We have looked at several applications of data science in healthcare:

  • Electronic health records give a full picture of a particular person’s state of health.
  • Portable electronic devices can track the body’s everyday state.
  • Intensive care will become more effective, since algorithms will be able to predict possible complications.

We have also told you about our recent healthcare project based on Big Data analysis. It demonstrates that algorithms can be used to obtain a picture of the global situation. Such statistics highlight the weak points of human society and point to areas for improvement.

In general, Big Data opens up huge possibilities for developing preventive measures in medicine. Since it is easier to prevent a disease than to cure it, we can expect Data Science to greatly improve the quality of medical services all over the world.

Posted by Contributor