In [2]:
import numpy as np
import scipy as scp
import scipy.special  # ensures the `special` submodule (needed for `erf`) is loaded
import matplotlib.pyplot as plt

# Practicing with Chi Square distribution on discrete data
In this exercise, we will practice by checking whether a die is fair.
To do this, we compare the frequencies of each number in a sample of throws with the theoretical frequencies of a fair die.

First, let us set the bins, i.e. the possible values that can come out of our experiment. To underline that these values have no mathematical significance, use the strings "one", "two", "three", etc. instead of 1, 2, 3, etc.
Create an array of possible values, an array containing their probabilities, and an array containing the expected occurrence frequencies for a sample size (number of throws) of 60.
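
A minimal sketch of this setup, assuming a six-sided die. The names `values`, `probabilities` and `expected` are illustrative choices, not names imposed by the exercise.

```python
import numpy as np

Ns = 60  # sample size (number of throws)

# Possible outcomes, stored as strings to stress that they are categories
values = np.array(["one", "two", "three", "four", "five", "six"])

# A fair die gives every face the same probability
probabilities = np.full(values.size, 1 / values.size)

# Expected number of occurrences of each face among Ns throws
expected = Ns * probabilities
```

Keeping `probabilities` as its own array means the same code could be reused for a die that is suspected to be loaded in a known way.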

For our chi-square test to be reliable, it is recommended to have an expected frequency of at least 5 in each bin / category.
Add a test to check if this is the case for **Ns=60**.
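
One possible form for this check is sketched below. The helper name `has_enough_expected_counts` and its `minimum` parameter are hypothetical, introduced only for illustration.

```python
import numpy as np


def has_enough_expected_counts(expected, minimum=5):
    """Return True if every bin has an expected count of at least `minimum`."""
    return bool(np.all(np.asarray(expected) >= minimum))


Ns = 60
expected = Ns * np.full(6, 1 / 6)  # fair six-sided die, as assumed above

if not has_enough_expected_counts(expected):
    print("Warning: some bins have an expected count below 5")
```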

Now, load the values from _dice1.csv_. Using a for loop, go through the possible values and count the number of occurrences of each one in the sample. Use these counts to calculate the chi-square statistic.

To count the number of occurrences, you may want to use the *np.count_nonzero()* function, passing as argument the condition you want to see verified. For example, *np.count_nonzero(my_array=="two")* will return the number of times that the string _"two"_ appears in *my_array*.

In fact, *my_array=="two"* returns an array of True and False values depending on whether each element satisfies the condition, and *np.count_nonzero()* counts the nonzero entries, i.e. the True values.
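
The counting loop and the statistic $\chi^2 = \sum_i (O_i - E_i)^2 / E_i$ (observed counts $O_i$, expected counts $E_i$) can be sketched as follows. The sample here is made up for illustration and is not the content of _dice1.csv_; the names `toy_counts`, `sample` and `chi2_stat` are assumptions of this sketch.

```python
import numpy as np

values = np.array(["one", "two", "three", "four", "five", "six"])

# Made-up sample for illustration only (NOT the contents of dice1.csv)
toy_counts = [8, 12, 10, 9, 11, 10]
sample = np.repeat(values, toy_counts)

Ns = sample.size
expected = Ns * np.full(values.size, 1 / values.size)

chi2_stat = 0.0
for value, e in zip(values, expected):
    mask = sample == value              # boolean array, True where the face matches
    observed = np.count_nonzero(mask)   # True counts as nonzero, False as zero
    chi2_stat += (observed - e) ** 2 / e

print(chi2_stat)
```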

Can you say with a confidence level of 99% that the die is fair?  
The critical chi-square value for $\alpha=0.01$ and $6-1$ degrees of freedom is 15.086: the hypothesis of a fair die is rejected if the statistic exceeds this value.
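
Rather than reading it from a table, the critical value can be recomputed with `scipy.stats.chi2.ppf`. This is a small sketch; the helper `reject_fair_die` is a hypothetical name used only here.

```python
from scipy.stats import chi2

alpha = 0.01   # significance level (1 - confidence level)
dof = 6 - 1    # number of categories minus one

# Value that a chi-square variable with `dof` degrees of freedom
# exceeds with probability alpha
critical_value = chi2.ppf(1 - alpha, dof)


def reject_fair_die(chi2_stat, critical=critical_value):
    """Return True if the statistic is too large to be compatible with a fair die."""
    return chi2_stat > critical


print(round(critical_value, 3))
```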

Perform the same test for _dice2.csv_.

# Chi Squared on continuous distribution
For continuous data, binning can be more challenging. One must choose a number of bins, their size and the size of the sample.
In this exercise, we will consider a sample of fiber-reinforced polymers that you have produced. After putting them under strain, you wish to observe how cracks propagate through the samples, in order to study their response and their potential application in nautical transport.

To do so, you write down the length of the cracks for each polymer. If the lengths follow a Gaussian distribution, the material responds evenly to the stress and the process can be sold. However, if they follow a Laplace distribution, this might indicate the presence of imperfections, and the process is too dangerous.

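The Laplace and normal probability density and cumulative distribution functions are given below. We are considering a mean crack length of 0 cm, with a scale of 1 cm (the standard deviation `sigma` for the normal distribution, the scale parameter `lam` for the Laplace distribution).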

In [None]:
def laplace(x, mean=0, lam=1):
    return np.exp(-np.abs(x-mean)/lam)/(2*lam)

def laplace_cdf(x, mean=0, lam=1):
    # The scale parameter lam divides the exponent |x - mean|
    return 1/2 + 1/2 * np.sign(x-mean)*(1-np.exp(-np.abs(x-mean)/lam))

def normal(x, sigma=1, mean=0):
    return np.exp(-(x-mean)**2/(2*sigma**2))/np.sqrt(2*np.pi*sigma**2)

def normal_cdf(x, sigma=1, mean=0):
    return 1/2 * ( 1 + scp.special.erf((x-mean)/(sigma*np.sqrt(2))) )

# plt.plot( normal_cdf(np.linspace(-5,5, 1000)) )
# plt.plot( laplace_cdf(np.linspace(-5,5, 1000)) )

To prepare the chi-square study on the sample, we must first create a set of bins to categorise the different lengths.
Create a set of 10 bins centered around the mean, with a bin width of 0.5 cm. Make sure both the first and last bins take into account the entire tail of the distributions (meaning they extend to infinity).
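
One reading of this specification is sketched below: 10 bins need 11 edges, the two outermost edges are infinite, and the 9 finite edges are spaced 0.5 cm apart around the mean. The variable names are illustrative.

```python
import numpy as np

mean = 0.0    # cm
width = 0.5   # cm
n_bins = 10

# The n_bins - 1 finite edges are spread symmetrically around the mean;
# the first and last bins are open-ended to capture the tails.
half_span = width * (n_bins - 2) / 2
inner_edges = mean + np.linspace(-half_span, half_span, n_bins - 1)
bin_edges = np.concatenate(([-np.inf], inner_edges, [np.inf]))

print(bin_edges)
```

With this choice the eight inner bins are 0.5 cm wide and the two outer bins cover everything below -2 cm and above 2 cm.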

Calculate the expected frequencies for both distributions in each bin, making use of the cumulative distribution functions.
Start with a number of samples of *Ns=2000*.

As before, check that the expected frequencies are all greater than 5.
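
The expected count in a bin is the sample size times the probability mass of that bin, i.e. the difference of the CDF between its two edges. The sketch below uses `scipy.stats.norm.cdf` and `scipy.stats.laplace.cdf` as stand-ins so that it is self-contained; the notebook's own `normal_cdf` and `laplace_cdf` could be passed instead. The helper name `expected_frequencies` is hypothetical.

```python
import numpy as np
from scipy.stats import norm, laplace


def expected_frequencies(cdf, bin_edges, n_samples):
    """Expected count per bin: n_samples * (CDF(upper edge) - CDF(lower edge))."""
    return n_samples * np.diff(cdf(bin_edges))


Ns = 2000
# Assumed bin edges: 10 bins, inner width 0.5, open-ended outer bins
bin_edges = np.concatenate(([-np.inf], np.linspace(-2, 2, 9), [np.inf]))

expected_normal = expected_frequencies(norm.cdf, bin_edges, Ns)
expected_laplace = expected_frequencies(laplace.cdf, bin_edges, Ns)

for name, e in [("normal", expected_normal), ("laplace", expected_laplace)]:
    print(name, "all expected counts > 5:", bool(np.all(e > 5)))
```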

Load the data from _cracks.csv_.  
Use the _np.histogram()_ function seen previously, with the created bins, to categorise the sample data and get the frequencies.  
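
A sketch of the binning step follows. Since the layout of _cracks.csv_ is not shown here, loading is left out and a synthetic sample (drawn from a normal distribution with a fixed seed) stands in for the measurements; it is not the real data.

```python
import numpy as np

# Assumed bin edges: 10 bins, inner width 0.5, open-ended outer bins
bin_edges = np.concatenate(([-np.inf], np.linspace(-2, 2, 9), [np.inf]))

# Synthetic stand-in for the measurements (NOT the contents of cracks.csv)
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=2000)

# With an explicit array of edges, np.histogram returns one count per bin
observed, _ = np.histogram(sample, bins=bin_edges)

print(observed)
```

Because the outer edges are infinite, every measurement falls in some bin, so the counts add up to the sample size.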

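Calculate the chi-square statistic for each of the two distributions. With 10 bins there are $10-1=9$ degrees of freedom, so the critical value for $\alpha=0.01$ is 21.666. Are your polymer samples safe for use with a confidence level of 99%?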