Data Projects Portfolio

Childhood Obesity in the UK

Analysis of the nature and causes of childhood obesity in England for the Guy's and St Thomas' charity

python

geovisualisation

plotly

statistics

data analysis

feature selection

Background & Objectives

This analysis was originally commissioned by the Guy's and St Thomas' Charity through the Viz for Social Good website as a crowd-sourced data visualisation project. The Guy's and St Thomas' Charity is an independent, place-based foundation working to improve the health of people in the London boroughs of Lambeth and Southwark.

The aim of this project is for the charity to understand where to target their resources and which areas to fund in the London boroughs of Lambeth and Southwark in order to help lower childhood obesity levels. This will involve:

identifying the places most (and least) affected by childhood obesity
identifying the contributing factors
identifying how childhood obesity has evolved over time
identifying neighbourhoods across the UK that have similar characteristics to neighbourhoods in Lambeth and Southwark with high obesity rates

Data

We have two datasets containing obesity rate data over 8 time periods (2008 to 2018). Each dataset represents a stage of the UK national school curriculum: Reception relates to children aged between 4 and 5 years, and Year 6 relates to children aged between 10 and 11 years.

In addition, we have been provided with a dataset containing sociodemographic data for all of the UK statistical areas. The features available (over 100) cover various subjects such as health, employment, local amenities and the education levels of each area's population.

In terms of the data quality of the datasets, my analysis found a large number of missing values in the obesity rates, but only for specific areas. Where the latest rate could be inferred from previous years' data, a calculated rate has been used to replace the missing values. Where not enough data was available to infer a rate, the value was left blank and the area is simply excluded from the analysis. This was deemed an acceptable solution as the areas with missing values are not the most relevant for the Guy's and St Thomas' Charity (GSST), but it must be borne in mind as it will affect the accuracy of the reporting, for rankings in particular.
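The exact calculation used to infer a missing rate is in the notebook. As a minimal sketch of the idea only, assuming a straight-line extrapolation from an area's earlier periods (the function name, the `min_points` threshold and the sample figures are hypothetical, not taken from the project):

```python
from typing import Optional, Sequence


def infer_latest_rate(history: Sequence[Optional[float]],
                      min_points: int = 3) -> Optional[float]:
    """Estimate a missing latest rate (%) from earlier periods.

    `history` holds one rate per earlier period, oldest first, with None
    for gaps. Assumption: a least-squares line through the known points
    is extrapolated one period ahead. Returns None when fewer than
    `min_points` (and never fewer than 2) rates are known.
    """
    known = [(i, r) for i, r in enumerate(history) if r is not None]
    if len(known) < max(min_points, 2):
        return None  # not enough data: the rate stays blank

    n = len(known)
    mean_x = sum(x for x, _ in known) / n
    mean_y = sum(y for _, y in known) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in known)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in known)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # The missing period sits immediately after the last entry in history.
    estimate = intercept + slope * len(history)
    return min(max(estimate, 0.0), 100.0)  # keep within a valid percentage


# Hypothetical histories for two made-up areas.
print(infer_latest_rate([20.0, 21.0, None, 23.0]))  # enough points to extrapolate
print(infer_latest_rate([20.0, None, None, None]))  # too sparse, stays None
```

Areas for which the function returns None would be the ones left out of the rankings later on.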

Analysis

This section presents the key findings from the analysis. For the full step-by-step investigation, including all the data cleaning & processing steps as well as the Python code used, please refer to the Jupyter notebook available on GitHub.

Though the study will primarily focus on the London areas of Lambeth and Southwark, it is important to place it in the national context initially, in order to get an overall view of the issue and to understand how these specific areas may differ from the UK average. Looking at the average childhood obesity rates across the UK, there doesn't seem to have been any overall increase since 2008 (figure below, left). However, when we look at the distribution across the country, we can see that inequalities between areas have increased over the last 10 years.

The latest figures for Year 6 children range from 3% to 41%, with a standard deviation of 5.8%, confirming the high level of variation between MSOAs (Middle Layer Super Output Areas, the small statistical areas used throughout this analysis). We also note from the first chart that the obesity levels are much lower for Reception-age children, and there has been no increase in inequalities for this age group. Unsurprisingly, the obesity rates of the two age groups are correlated, but perhaps not as strongly as anticipated, with a correlation coefficient of 0.63. This suggests that the two age groups may be affected by environmental factors in different ways and to different degrees. Hover over the chart below to view the data for each MSOA.
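For readers who want to reproduce these two summary statistics, here is a minimal sketch in plain Python. It assumes a sample standard deviation and a Pearson coefficient; the rates are hypothetical values for made-up areas, so the printed numbers will not match the 5.8% and 0.63 reported above.

```python
import math
from statistics import stdev


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical obesity rates (%) for five made-up MSOAs, not real data.
reception_rates = [8.0, 9.5, 10.0, 12.0, 7.5]
year6_rates = [18.0, 22.0, 19.5, 27.0, 20.0]

spread = stdev(year6_rates)  # sample standard deviation
r = pearson(reception_rates, year6_rates)
print(f"Year 6 standard deviation: {spread:.1f} percentage points")
print(f"Reception vs Year 6 correlation: {r:.2f}")
```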

How do Lambeth & Southwark compare with the national picture?

To place them in the national context, we can visualise the distribution of the obesity rates in those boroughs against the national distribution. From the chart below, it looks like the MSOAs in both boroughs fare worse than the national average. Southwark MSOAs seem worse than Lambeth MSOAs, with a left-skewed distribution indicating a higher proportion of areas with very high obesity rates. The wider range for Southwark areas also indicates more inequality within that borough. Hover over charts to view data points and click on legend names to isolate each dataset.

Looking at the evolution over time for these 2 boroughs, there doesn't seem to have been any particular worsening, but as there is much variation within these areas, it might be more meaningful to look at the individual timelines for individual MSOAs within the boroughs instead.

Most MSOAs seem to fluctuate only slightly over time, but a few stand out as having more interesting patterns. The MSOAs following a clear downward trend could be investigated to identify any positive factors potentially leading to better rates. Significant downward or upward trends can also be taken into account as criteria when determining which areas will need more effort from the charity. Double-click on an MSOA name to isolate its line.

Identifying areas most and least affected by childhood obesity

Before computing these rankings, we must first decide what constitutes an area. There is a choice between aggregating at the Local Authority level (e.g. Lambeth, Southwark) or using the MSOA-level figures. The average standard deviation within each Local Authority is 4.15%, which is quite close to the standard deviation at the national level, so we risk losing valuable information if we aggregate. Therefore, we will look at areas at the MSOA level.
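The comparison behind this decision can be sketched as follows: compute the spread of MSOA rates inside each Local Authority, average those spreads, and set the result against the spread across all MSOAs. The function name and the records are hypothetical, and a sample standard deviation is assumed.

```python
from collections import defaultdict
from statistics import mean, stdev


def average_within_la_std(records):
    """Mean of the per-Local-Authority standard deviations of MSOA rates.

    `records` is an iterable of (local_authority, rate) pairs, one per MSOA.
    Authorities with a single MSOA are skipped, since a sample standard
    deviation needs at least two values.
    """
    rates_by_la = defaultdict(list)
    for local_authority, rate in records:
        rates_by_la[local_authority].append(rate)
    spreads = [stdev(rates) for rates in rates_by_la.values() if len(rates) >= 2]
    return mean(spreads)


# Hypothetical (local authority, Year 6 rate %) pairs, not real data.
records = [
    ("LA one", 18.0), ("LA one", 24.0), ("LA one", 29.0),
    ("LA two", 12.0), ("LA two", 15.0), ("LA two", 21.0),
]

within = average_within_la_std(records)
overall = stdev([rate for _, rate in records])
print(f"average within-LA spread: {within:.2f}, overall spread: {overall:.2f}")
```

When the within-authority figure is close to the overall one, most of the variation sits between MSOAs of the same authority, which is the argument for not aggregating.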

I computed the rankings nationally, for Lambeth & Southwark together, and for each borough separately. The most recent rate for each MSOA was used to determine the rank, but the evolution over time could also be looked at later as an alternative, in order to visualise which MSOAs have had the largest increase in rate since 2008.
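A minimal sketch of such a ranking, assuming that rank 1 is the highest (worst) rate, that tied areas share a rank, and that areas with a blank rate are left out. The function name and the area names are made up for illustration.

```python
def rank_areas(latest_rates):
    """Rank areas from worst (highest rate) to best.

    `latest_rates` maps an area name to its most recent rate, or None when
    the rate is missing. Returns a list of (rank, area, rate) tuples; tied
    areas share the same rank and the next rank is skipped accordingly.
    """
    available = [(area, rate) for area, rate in latest_rates.items()
                 if rate is not None]
    ordered = sorted(available, key=lambda item: item[1], reverse=True)

    ranked = []
    for position, (area, rate) in enumerate(ordered, start=1):
        if ranked and rate == ranked[-1][2]:
            rank = ranked[-1][0]  # tie: reuse the previous rank
        else:
            rank = position
        ranked.append((rank, area, rate))
    return ranked


# Hypothetical latest Year 6 rates (%) for made-up areas.
rates = {"Area 001": 31.5, "Area 002": None, "Area 003": 24.0, "Area 004": 31.5}
for rank, area, rate in rank_areas(rates):
    print(rank, area, rate)
```

The same function can be applied to any subset of areas, for example one borough at a time, by filtering the dictionary before ranking.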

Two MSOAs from the Southwark borough appear amongst the 20 most affected areas nationally: Southwark 014 (2nd worst affected in the UK) and Southwark 021 (14th). To decide where to prioritise their budget spend, I would recommend that GSST look at the combined Lambeth & Southwark rankings, in order to target the areas that are the worst affected overall. This will mean spending a larger part of their budget on Southwark areas and is only feasible if they are not required to spend equal amounts in each borough.

The wider range of rates in Southwark identified previously is also evident from these rankings.

The choropleth map below can help visualise any localised patterns across London boroughs and place Southwark & Lambeth within the London context. Zoom in to view specific areas and hover over them to view the data.

Identifying the factors contributing to high obesity rates

The challenge here is that we have a large number of features in the sociodemographic dataset and they cannot be visualised together. After calculating the correlation coefficient between each feature and the Year 6 obesity rate, we can make the following observations:

None of the variables in our dataset has a very strong correlation with the obesity rate. The strongest correlation coefficient is -0.68. This may be an indication that our dataset is missing some important features and that more research is necessary to confidently ascertain the factors driving the obesity rates.
The most strongly correlated features at the national level differ from those for the Lambeth & Southwark boroughs. There are significant differences even between the factors affecting the rate in Lambeth and in Southwark. For example, the Reception-age obesity rate seems highly positively correlated with the Year 6 rate in Southwark, but not at all in Lambeth.
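The screening step behind these observations can be sketched as below: compute the Pearson coefficient of every feature against the Year 6 rate, then order the features by absolute value so that strong negative correlations are not overlooked. The feature names and values are hypothetical placeholders, not columns from the real dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def rank_features_by_correlation(features, target):
    """Return (feature, coefficient) pairs, strongest correlation first.

    Strength is judged on the absolute value, so a coefficient of -0.68
    ranks above one of +0.50.
    """
    scores = {name: pearson(values, target) for name, values in features.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical values for four made-up areas.
year6_rate = [18.0, 22.0, 25.0, 30.0]
features = {
    "deprivation_score": [10.0, 18.0, 21.0, 35.0],
    "household_income": [42.0, 35.0, 36.0, 27.0],
    "green_space_share": [12.0, 30.0, 8.0, 20.0],
}
for name, coefficient in rank_features_by_correlation(features, year6_rate):
    print(f"{name}: {coefficient:+.2f}")
```

Running the same function on the rows for a single borough gives the borough-level coefficients, though with far fewer areas those values are much less stable.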

In addition to looking at the Pearson correlation coefficients, I used a feature selection technique (more often used to prepare a dataset for statistical modelling) to find other potential non-linear relationships between the features and the obesity rate. Combining the findings from the two methods, I arrived at the final list of factors below.

Female & Male healthy life expectancy at birth
Percentage of children in poverty (after housing costs)
Numbers of households in poverty
Percentage of people who participated in sport and physical activity at least twice in the last 28 days
Net annual household income estimate after housing costs
Proportion of households with no car
Index of Multiple Deprivation (IMD) Score
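The feature selection technique used alongside Pearson is not named here, so the sketch below does not reproduce it. As a stand-in that illustrates why a second measure is useful, it uses Spearman rank correlation, which captures monotonic but non-linear relationships that Pearson understates. All names and values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks of `values`, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1  # mean of positions start+1..end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson applied to the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical feature with a curved but strictly increasing relationship.
feature = [1.0, 2.0, 3.0, 4.0, 5.0]
rate = [1.0, 8.0, 27.0, 64.0, 125.0]
print(f"Pearson:  {pearson(feature, rate):.3f}")
print(f"Spearman: {spearman(feature, rate):.3f}")
```

A feature that scores noticeably higher on the rank-based measure than on Pearson is a candidate for a non-linear relationship and worth a closer look.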

The factors above are calculated at the national level. When looking at the correlation coefficients for Lambeth and Southwark individually, they are quite different from each other, with no strong correlations showing at all for Lambeth. There is probably not enough data available to make reliable inferences from those coefficients for each borough independently.

From the list above, we can observe that the most impactful features relate to poverty and general physical health. They are very high-level features and unfortunately not very helpful for GSST to develop an action plan from. I would recommend looking at a different set of features, rather than high-level sociodemographic data, in order to make more specific recommendations on what issues to target in those areas to reduce the childhood obesity rate. In particular, studying the impact of different measures & initiatives implemented in the past (at the local or national level) could prove very informative.

Since a large majority of the features have a correlation coefficient between 0.2 and 0.5, developing a model could be useful here to determine whether a combination of weakly correlated features has more impact together. This could help make further recommendations to GSST in order to support the development of an action plan.