Data analysis#

Learning goals

After finishing this chapter, you are expected to be able to

  • import SciPy and use some of its functions

  • create and manipulate Pandas dataframes

  • import data in CSV format

  • process this data

  • write data to different file formats

There are hundreds if not thousands of packages available for Python. In the final chapter of this manual, we’ll introduce some additional commonly used Python packages, in addition to NumPy and Matplotlib which you have already used. We will show how to use SciPy to perform mathematics, how to use Pandas to represent data, and how to use Seaborn to make beautiful figures.

In addition to these, there are many other packages with more specialized uses that we will not introduce here but which you’ll probably encounter during your studies. Examples include

  • The scikit-image package offers a lot of functionality in image analysis.

  • The scikit-learn package provides tools for machine learning.

  • The PyTorch package lets you build and train neural networks.

SciPy#

SciPy provides algorithms for many mathematical problems like optimization, integration, interpolation, eigenvalue problems, algebraic equations, differential equations, statistics and many other classes of problems. It uses NumPy arrays (which you already have used) for most of these things.

Read the docs

As always, we cannot mention all possible functionality. To grasp everything what you can do with SciPy, take a look at the documentation.

SciPy works similarly to NumPy. It is a package, and its code is organized in multiple different modules. There is a module for optimization, for differential equations, etc. To import a module, the convention is to use from scipy import <module_name>. For example

from scipy import ndimage # This import the SciPy module that lets you work with images

The ndimage module allows you to work with images. It contains many different kinds of filters (that let you change an image) and operations on images. For example, take a look at the rotate function here. It lets you rotate an image by a prespecified angle.

Exercise 10.1

  1. Download this image file.

  2. Load the image file using the imageio package as follows

import imageio

image = imageio.imread('art.jpeg')
  1. Show the image using Matplotlib’s imshow function, give the figure a descriptive title.

  2. Rotate the image by 90 degrees using Scipy.

  3. Visualize the rotated image using Matplotlib, give the figure a decriptive title.

Opdracht 11.1

  1. Download dit bestand dat een plaatje bevat.

  2. Laad het plaatje in Python met behulp van de imageio package, als volgt

import imageio

image = imageio.imread('art.jpeg')
  1. Bekijk het plaatje met Matplotlib’s imshow functie (zoals in Hoofdstuk 8). Geef de figuur een goede titel.

  2. Roteer het plaatje 90 graden met behulp van Scipy.

  3. Bekijk het geroteerde plaatje met Matplotlib. Geef de figuur weer een goede titel.

Another useful Scipy module is the integrate module that lets you numerically integrate functions in a given range. For this, we need to explain one important thing. Remember in Chapter 7 when we told you that everything in Python is an object? Well, even functions are actually objects. We can define the following function

def square(x):
    return x**2

and use it like we have used functions so far, e.g.

square(4)
16

or

square(-3)
9

But we can also get the type of the function when we use its name without ( ). E.g.,

type(square)
function

and Python tells us it has type function (no surprises there). Now, if we refer to square in this way, we can also pass the function itself as an argument to other functions, and that is what happens when we use the SciPy integrate package.

For example, we can integrate our square function between \(x=0\) and \(x=4\), i.e., compute

\[\int_0^4x^2 dx\]

SciPy provides multiple ways to integrate. Here, we’ll just use (an optimized version of) Gaussian quadrature.

from scipy import integrate

integrate.quad(square, 0, 4) # Use quadrature to integrate function x2 (x**2) between 0 and 4
(21.333333333333336, 2.368475785867001e-13)

Take a look a the function call and what is returned. The first return value is the result of the integral, the second the absolute error. We can verify that the answer is approximately correct by analytically computing the integral as

\[\int_0^4x^2 dx = \frac{1}{3}4^3 - \frac{1}{3}0^3 = \frac{64}{3} \approx 21.33\]

Exercise 10.2

Consider the function \(y=\sin(x)\).

  1. Plot this function between \(x=0\) and \(=\pi\).

  2. Compute the integral between \(x=0\) and \(x=\pi\) using Scipy.

  3. Verify that the answer is correct by computing the analytical gradient.

Opdracht 10.2

Gegeven de functie \(y=\sin(x)\).

  1. Plot de functie tussen \(x=0\) en \(x=\pi\) met behulp van Matplotlib.

  2. Integreer de functie tussen \(x=0\) en \(x=\pi\) met Scipy.

  3. Ga na dat het antwoord dat je krijgt klopt door zelf de integraal te berekenen.

Pandas#

The second library that we’ll look into is the Pandas package. Pandas is a very popular library that lets you work with large tables or data files, which it represents as DataFrames. A DataFrame is used to store tabular data: data that has rows (from top to bottom) and columns (from left to right). This data can be diverse. Note that Pandas does not support data structures with more than two dimensions. An important difference between NumPy arrays and Pandas dataframes is that arrays have one data type (dtype) per array, while dataframes have one dtype per column. This makes them more versatile. Moreover, we can give labels to columns, which makes it easier to understand what exactly is in the DataFrame, and avoids mixing up of columns.

The image below shows an example DataFrame. The DataFrame has five rows and five columns. Each row has an index label, which can be used to identify the row. Each column has a column label. In each column, there is one data type: the Name column contains strings, the Age column contains integers, the Marks column contains floating point numbers, the Grade and Hobby columns contain strings.

df

Series

Pandas also contains the Series class, which is similar to a DataFrame but only supports 1D arrays. Because DataFrames are much more commonly used, we will here focus on DataFrames.

Just like other packages, Pandas needs to be imported into Python. We import Pandas as follows. Here, pd is an alias for the pandas package name that is widely used (just like np is short for numpy).

import pandas as pd

To create a new DataFrame, we call its constructor (as you have learned in Chapter 7). The DataFrame constructor has three commonly used arguments:

pd.DataFrame(data=None, index=None, columns=None)
  1. data: Here, you can use most of the data structures that you have seen so far: lists, NumPy arrays, dictionaries. For example, if we here provide a 2D NumPy array, columns in the DataFrame will correspond to columns in the array, and rows will correspond to rows (see example below).

  2. index: This defines how rows should be indexed. By default, this is just \(0, 1, 2, \ldots, n_{rows}-1\).

  3. columns: This defined how columns should be labeled. You can here provide, e.g., a list of strings. By default, this is just \(0, 1, 2, \ldots, n_{columns}-1\).

As you can understand, there are many ways to construct a new DataFrame object. For example, we can initialize an object based on a NumPy array:

import numpy as np

data = np.random.randint(0, 8, (5, 4)) # Create a random NumPy array
df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D']) # Create a DataFrame based on the array, where column labels are A, B, C, D
print(df)
   A  B  C  D
0  4  5  5  7
1  0  3  7  0
2  4  1  0  5
3  3  2  1  7
4  4  1  6  0

We can also provide row indices using the index parameter:

df = pd.DataFrame(data, index=['Row 1', 'Row 2', 'Row 3', 'Row 4', 'Row 5'], columns=['A', 'B', 'C', 'D'])
print(df)
       A  B  C  D
Row 1  4  5  5  7
Row 2  0  3  7  0
Row 3  4  1  0  5
Row 4  3  2  1  7
Row 5  4  1  6  0

We can also initialize a DataFrame using a dictionary. The dictionary keys will then be used as column labels, and the dictionary values will be used to fill the columns. For example, the code below creates a DataFrame that contains names and ages. To initialize the dataframe, we provide a dictionary where the keys correspond to the column name and the values to the values in the column.

import numpy as np

names = ['Mila', 'Mette', 'Tycho', 'Mike', 'Valentijn', 
         'Fay', 'Phileine', 'Sem', 'Selena', 'Dion'] # Make a list of names
ages = np.random.randint(18, 80, len(names))         # Make a list of ages

df = pd.DataFrame(data={'Names': names, 'Ages': ages}) # Initialize a DataFrame with columns 'Names' and 'Ages'

Inspecting a Pandas DataFrame#

Pandas DataFrames can be very large. There are several ways to quickly check what’s in them. The head method will list only the first few rows in a dataframe, and the tail method only the last few.

df.head()
Names Ages
0 Mila 26
1 Mette 40
2 Tycho 70
3 Mike 47
4 Valentijn 60
df.tail()
Names Ages
5 Fay 68
6 Phileine 74
7 Sem 27
8 Selena 24
9 Dion 26

Optionally, you can add an index to head or tail to show only the rows up to some index, or starting at some index.

print(df.head(3))
   Names  Ages
0   Mila    26
1  Mette    40
2  Tycho    70

The info method provides a summary of the columns and data types (dtypes) in the DataFrame.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Names   10 non-null     object
 1   Ages    10 non-null     int64 
dtypes: int64(1), object(1)
memory usage: 288.0+ bytes

You can also use the describe method to get a description of all numerical columns including some statistics.

df.describe()
Ages
count 10.000000
mean 46.200000
std 20.334973
min 24.000000
25% 26.250000
50% 43.500000
75% 66.000000
max 74.000000

Indexing and slicing a DataFrame#

Just as in a list or a NumPy array, we can use indexing and slicing to select specific rows.

df[4:7]
Names Ages
4 Valentijn 60
5 Fay 68
6 Phileine 74

Similarly as in a NumPy array with advanced indexing (see Chapter 6), we can select rows that match a particular pattern. For example, to get the rows of all people older than 40 years, we can index the DataFrame with an array of booleans.

df[df['Ages']>40]
Names Ages
2 Tycho 70
3 Mike 47
4 Valentijn 60
5 Fay 68
6 Phileine 74

Although indexing and slicing in this way works, it is recommend that in Pandas you use the loc and iloc methods for indexing. Using loc, to select the same rows as above, we can do.

df.loc[df['Ages'] > 40]
Names Ages
2 Tycho 70
3 Mike 47
4 Valentijn 60
5 Fay 68
6 Phileine 74

We can also select a specific column using loc. For example, to get all the names of people younger than 40.

df.loc[df['Ages'] < 40, 'Names']
0      Mila
7       Sem
8    Selena
9      Dion
Name: Names, dtype: object

The iloc method works similarly, but just takes indices or ranges as inputs. For example, to get the age of the person in the fourth row, we use

df.iloc[3, 1]
47

Here, we can use all indexing and slicing tricks we have seen so far. For example, to select every the age of every third person:

df.iloc[2::3, 1]
2    70
5    68
8    24
Name: Ages, dtype: int64

Editing a Pandas DataFrame#

You can remove rows or columns by calling the drop method. Columns are removed by using their name, as in

df.drop(columns=['Ages'])
Names
0 Mila
1 Mette
2 Tycho
3 Mike
4 Valentijn
5 Fay
6 Phileine
7 Sem
8 Selena
9 Dion

and rows can be dropped by using their index

df.drop([0, 1, 2])
Names Ages
3 Mike 47
4 Valentijn 60
5 Fay 68
6 Phileine 74
7 Sem 27
8 Selena 24
9 Dion 26

Here, you can also use a range to remove, e.g., every even row

df.drop(range(0, len(df), 2))
Names Ages
1 Mette 40
3 Mike 47
5 Fay 68
7 Sem 27
9 Dion 26

The drop function returns a new object with the rows or columns of your choosing dropped. This might be inefficient for very large DataFrames. In that case, you can use inplace=True, which drops columns and rows in-place, or in the DataFrame that the method is working on. Take a look at the following example code

df = pd.DataFrame({'Names': names, 'Ages': ages})

print('Original DataFrame')
print(df.head())
print('\nOriginal DataFrame after drop with inplace=False')
df.drop(columns=['Names'])
print(df.head())
print('\nOriginal DataFrame after drop with inplace=True')
df.drop(columns=['Names'], inplace=True)
print(df.head())
Original DataFrame
       Names  Ages
0       Mila    26
1      Mette    40
2      Tycho    70
3       Mike    47
4  Valentijn    60

Original DataFrame after drop with inplace=False
       Names  Ages
0       Mila    26
1      Mette    40
2      Tycho    70
3       Mike    47
4  Valentijn    60

Original DataFrame after drop with inplace=True
   Ages
0    26
1    40
2    70
3    47
4    60

In the last case, we’ve permanently dropped the Names column from the DataFrame.

We can add rows to a dataframe using the append method. Make sure to set ignore_index=True so the new row will be assigned an index that matches the already existing indices in the dataframe.

df = pd.DataFrame({'Names': names, 'Ages': ages})
df.append({'Names': 'Ivo', 'Ages': 55}, ignore_index=True)
Names Ages
0 Mila 26
1 Mette 40
2 Tycho 70
3 Mike 47
4 Valentijn 60
5 Fay 68
6 Phileine 74
7 Sem 27
8 Selena 24
9 Dion 26
10 Ivo 55

To add multiple rows at the same time, it is recommended that you use the concat function of Pandas.

names_new = ['Sebastiaan', 'Noëlle', 'Pien', 'Luke', 'Samuel', 'Jet', 'Louise', 'Noëlle', 'David', 'Abel']
ages_new = np.random.randint(18, 80, len(names_new)) # Make a list of ages

df_new = pd.DataFrame({'Names': names_new, 'Ages': ages_new})
df = pd.concat([df, df_new], ignore_index=True)
print(df)
         Names  Ages
0         Mila    26
1        Mette    40
2        Tycho    70
3         Mike    47
4    Valentijn    60
5          Fay    68
6     Phileine    74
7          Sem    27
8       Selena    24
9         Dion    26
10  Sebastiaan    24
11      Noëlle    20
12        Pien    74
13        Luke    26
14      Samuel    38
15         Jet    26
16      Louise    27
17      Noëlle    42
18       David    49
19        Abel    73

Another nice thing in Pandas is that we can easily define new columns. For example, in the DataFrame above, we can add a column that simply contains whether someone is over 60 or not, a column that checks whether someone is the oldest person in the group, a column that checks whether someone is celebrating a lustrum etc.

df['Senior'] = df['Ages'] > 60
df['Oldest'] = df['Ages'] == df['Ages'].max()
df['Lustrum'] = df['Ages'] % 5 == 0
print(df)
         Names  Ages  Senior  Oldest  Lustrum
0         Mila    26   False   False    False
1        Mette    40   False   False     True
2        Tycho    70    True   False     True
3         Mike    47   False   False    False
4    Valentijn    60   False   False     True
5          Fay    68    True   False    False
6     Phileine    74    True    True    False
7          Sem    27   False   False    False
8       Selena    24   False   False    False
9         Dion    26   False   False    False
10  Sebastiaan    24   False   False    False
11      Noëlle    20   False   False     True
12        Pien    74    True    True    False
13        Luke    26   False   False    False
14      Samuel    38   False   False    False
15         Jet    26   False   False    False
16      Louise    27   False   False    False
17      Noëlle    42   False   False    False
18       David    49   False   False    False
19        Abel    73    True   False    False

This makes it very easy and interpretable to work with datasets.

Sorting DataFrames#

We can sort a DataFrame according to one of its colums. For example, we can sort all rows alphabetically by name

print(df.sort_values('Names'))
         Names  Ages  Senior  Oldest  Lustrum
19        Abel    73    True   False    False
18       David    49   False   False    False
9         Dion    26   False   False    False
5          Fay    68    True   False    False
15         Jet    26   False   False    False
16      Louise    27   False   False    False
13        Luke    26   False   False    False
1        Mette    40   False   False     True
3         Mike    47   False   False    False
0         Mila    26   False   False    False
17      Noëlle    42   False   False    False
11      Noëlle    20   False   False     True
6     Phileine    74    True    True    False
12        Pien    74    True    True    False
14      Samuel    38   False   False    False
10  Sebastiaan    24   False   False    False
8       Selena    24   False   False    False
7          Sem    27   False   False    False
2        Tycho    70    True   False     True
4    Valentijn    60   False   False     True

or sort by age, in descending or ascending order. For example, to get the four youngest people in the list, we can use

print(df.sort_values('Ages', ascending=True)[:4])
         Names  Ages  Senior  Oldest  Lustrum
11      Noëlle    20   False   False     True
10  Sebastiaan    24   False   False    False
8       Selena    24   False   False    False
0         Mila    26   False   False    False

Exercise 6.7 revisited

Consider the grades.npy file in Exercise 6.7. Using Pandas, with a few lines of Python we can address all questions in that exercise. First, we create a DataFrame

grades = np.load('grades.npy')
df = pd.DataFrame(data=grades, columns=['Exam', 'Resit'])

Then, we answer the questions as follows:

  1. How many students are there in the class?

len(df)
  1. How many students passed the exam the first time (grade \(\geq 5.5\))?

df['Passed'] = df['Exam'] >= 5.5  # Make a new column that says which students passed
df['Passed'].sum()                # Count all True values
  1. What was the average grade of students that passed the exam the first time?

df.loc[df['Passed'], 'Exam'].mean() # Compute the average grade of those students that passed
  1. Not all students that failed the exam, also took the resit. How many students didn’t?

df['No show'] = df['Resit'].isna() & ~df['Passed'] # Make a new column that contains the students that didn't take the resit (isna), and didnt't pass (the ~ is short for negative)
df['No show'].sum()

In the end, the DataFrame df then looks like

Exam

Resit

Passed

Failed

No show

0

3.0

6.2

False

True

False

1

6.5

NaN

True

False

False

2

6.1

5.9

True

False

False

3

3.9

5.4

False

True

False

4

6.5

NaN

True

False

False

5

1.3

6.4

False

True

False

6

7.4

NaN

True

False

False

7

8.5

NaN

True

False

False

8

1.8

NaN

False

True

True

9

5.3

7.2

False

True

False

10

3.4

6.8

False

True

False

11

5.5

6.0

True

False

False

12

1.4

NaN

False

True

True

13

5.4

7.2

False

True

False

14

4.6

5.0

False

True

False

15

6.3

NaN

True

False

False

16

6.5

NaN

True

False

False

17

4.0

NaN

False

True

True

18

5.3

6.2

False

True

False

19

5.4

5.3

False

True

False

20

3.7

6.2

False

True

False

21

1.2

3.4

False

True

False

22

9.4

NaN

True

False

False

Plotting Pandas data#

Pandas allows you to directly plot data from a DataFrame in a way that seemlessly integrates with Matplotlib. For example, to make a bar plot of all ages in our example DataFrame, we can use

import matplotlib.pyplot as plt   # Import matplotlib

df = pd.DataFrame({'Names': names, 'Ages': ages}) # Create dataframe

fig, ax = plt.subplots() # Create figure and axes 
df.plot(x='Names', y='Ages', ax=ax, kind='bar') # Plot directly from dataframe. Note that we provide the ax object as axes.
ax.set_title('My first DataFrame plot'); # Add title
../../_images/048177223f875f48008bb3b267b95b076983a3b530ad33e153dbc84dc7dd3d89.png

The kind argument lets you choose the kind of plot you want to make for a DataFrame. Take a look at the plot documentation to see which kinds of plots you can make.

Reading data to Pandas#

Pandas is able to read several useful file formats, the most common ones being CSV (comma-separated values) and Excel files. To read a CSV file, use read_csv.

df = pd.read_csv(<filename>)

Moreover, you can directly read Excel files using pd.read_excel. It could be that you have to install one additional package for that like openpyxl.

Exercise 10.3

  1. Download a CSV file here.

  2. Load this CSV file into a Pandas DataFrame.

  3. Inspect the contents of the DataFrame. The first column contains (if all is well) all last names in The Netherlands.

  4. How many times does your own last name occur?

  5. Find out what are the ten most common names in The Netherlands. Make a bar plot for the occurence of these names.

  6. See if you can find who among your fellow students has the most common last name.

Opdracht 11.3

  1. Download hier een CSV bestand.

  2. Laad dit CSV bestand in een Pandas DataFrame.

  3. Bekijk de inhoud van het DataFrame met de inspect en info methods. Bekijk de eerste paar rijen van het DataFrame.

  4. Het bestand bevat (als het goed is) alle achternamen in Nederland. Hoe vaak komt jouw achternaam voor?

  5. Wat zijn de tien meest voorkomende achternamen in Nederland? Maak een bar plot (staafdiagram) waarin je laat zien hoe vaak deze namen voorkomen.

  6. Wie van je studiegenoten heeft de meest voorkomende achternaam?

Exercise 10.4

Download the CSV file here. This file shows the cost of a Big Mac over time in 57 countries over a period of twenty years, from 2000 to 2020.

bigmac

  1. Load this file into a Pandas DataFrame.

  2. Inspect the DataFrame.

  3. Use Matplotlib to plot the cost of a Big Mac in local currency in Denmark over time. Use the date column as your \(x\)-value. Hint Use plt.xticks(rotation=70) to rotate the time labels.

  4. The dollar_price column shows the cost of a Big Mac in dollars, also referred to as the Big Mac Index. Find out which country has the most expensive Big Mac in 2020, and which country has the cheapest Big Mac. Hint Use np.unique(df['name']) to get all unique country names in the DataFrame.

Opdracht 11.4

Download hier een CSV bestand. Dit bestand bevat de prijs van een Big Mac in 57 landen, in de periode van 2000 tot 2020.

bigmac

  1. Laad het bestand in een Pandas DataFrame.

  2. Inspecteer het DataFrame om te kijken wat de kolommen betekenen.

  3. Gebruik Matplotlib om de prijs van een Big Mac in Denemarken Deense kronen tussen 2000 en 2020 te plotten. Gebruik hiervoor de date column als \(x\)-waarde. Hint Gebruik plt.xticks(rotation=70) om alle labels op de \(x\)-as een tikje te roteren.

  4. De dollar_price kolom bevat de waarde van een Big Mac in dollars. Dit wordt ook wel de Big Mac-index genoemd. Bepaal welk land de duurste Big Mac had in 2020, en welk land de goedkoopste. Hint Gebruik np.unique(df['name']) om een lijst unieke namen van landen te krijgen, en gebruik een for-loop om je antwoord te krijgen.

Seaborn#

Seaborn provides a high-level plotting library that works seamlessly with Matplotlib. Take a look at the Seaborn examples library to see some of the things that you can do with Seaborn. By convention, we use Seaborn with an sns alias, i.e.,

import seaborn as sns

Seaborn is the result of a perfect marriage between Matplotlib and Pandas. It naturally makes very informative figures of Pandas dataframes, and it is often sufficient to just mention which column you want to use in which role when making a figure. For example, consider the Iris dataset that you used in Chapter 8. This dataset can be loaded using the scikit-learn (or sklearn) package as below. Then, we make a DataFrame that contains that data, with the features in columns, and a column for the label of each flower.

from sklearn import datasets

iris_data = datasets.load_iris() # Directly loads the Iris dataset into a dictionary iris_data

df = pd.DataFrame(data=iris_data['data'], columns=iris_data['feature_names']) # Make a dataframe using dict
df['species'] = iris_data['target_names'][iris_data['target']]                # Add a column for the target names
df.head()
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa

Now using Seaborn, we can - in one line - make a pairplot for this data. This pairplot contains multiple scatter plots (one for each pair of features), and histograms (one for each feature). Individual flowers/samples are color-coded according to their label. You might recognize the bottom left figure as the one you created in Exercise 8.5.

import seaborn as sns

sns.pairplot(df, hue='species');
../../_images/42632b6c0ff2fb35ab6ed5cef33b920482bd0c31ed9c50ca13c9fa37c155924a.png
df.describe()
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000 1.199333
std 0.828066 0.435866 1.765298 0.762238
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000

By default, Seaborn plots look quite pleasant. You can also style your plots by choosing a color palette.

Exercise 10.5

  1. Get the healthexp dataset from Seaborn as follows

df = sns.load_dataset('healthexp')
  1. Inspect the data set using Pandas methods head and describe. What do you see in this data set?

  2. Use the Seaborn function lineplot to replicate the image below. fig

  3. Use Matplotlib, Pandas, or Seaborn to make a bar plot of spending per life years in 2020 as below. One solution is to first make a DataFrame with only rows relating to 2020. Note: Depending on the method that you use, the resulting figure will not match exactly.

fig

Opdracht 10.5

  1. Maak een DataFrame op basis van de healthexp dataset in Seaborn, als volgt

df = sns.load_dataset('healthexp')
  1. Bekijk de data set met Pandas. Wat zie in je de dataset?

  2. Gebruik de Seaborn functie lineplot om zo goed mogelijk onderstaande figuur na te maken.

fig 4. Gebruik Matplotlib, Pandas, of Seaborn om een bar plot (staafdiagram) te maken van de uitgaven per levensjaar in 2020, zoals hieronder. Een mogelijke oplossing is om eerst alleen de data (rows) van 2020 te selecteren met loc. Afhankelijk van de methode die je kiest zal je figuur er waarschijnlijk net wat anders uitzien, dat is niet erg.

fig