ENG270 Data models and programming paradigms

Satoshi Takahama

EPFL

2024-09-20

Data models

What and why

What

Common data models

  • key-value pairs
  • array
  • relational
  • graph / hierarchical

Why

“Knowledge representation”

Structuring data in a particular way facilitates data access and specific operations appropriate for that structure.

Illustration of data models (source: IBM, Wikipedia).
http://publib.boulder.ibm.com/infocenter/dzichelp/v2r2/index.jsp?topic=%2Fcom.ibm.ims11.doc.apg%2Fims_comparehierandreldbs.htm   http://www-03.ibm.com/ibm/history/ibm100/us/en/icons/reldb/ http://publib.boulder.ibm.com/infocenter/ablxhelp/v8r4m0/index.jsp?topic=%2Fcom.ibm.db2.abx.cub.doc%2Fabx-c-cube-cubingconceptsoverview.html   https://en.wikipedia.org/wiki/Key%E2%80%93value_database

Relational

Representation on hard disk

Representation in memory

  • pandas DataFrame

Power of DataFrames - one example

Data tables (implemented as pandas DataFrames) are useful for a variety of data processing operations

  • filtering rows
  • variable transformations
  • aggregating information
  • merging data sets (join)

Joins

source: datacomy.com

Examples from here

Array

Representation on hard disk

Representation in memory

  • NumPy matrix, \(N\)-D array

Example - matrix operations

Edge detection

source: MathWorks

Image segmentation

source: MathWorks

Spatial analysis


Matrix factorization

Graph / hierarchical

Hierarchical (tree-like) data can be thought of as a subset of graph data models consisting of edges and nodes. (Graph models can additionally represent more complex networks.)

Representation on hard disk

(graph formats have GraphML, GraphSON equivalents of the above)

Representation in memory

  • Python list
  • NetworkX Graph

Example - trees and networks

Key-value pairs

Representation on hard disk

Representation in memory

  • Python dictionary

Programming Paradigms

What and why

What

Programming paradigms

  • Logic
  • Procedural
  • Functional
  • Object-oriented

Various aspects of these paradigms are not necessarily orthogonal to each other and can be used together in the same language.

Why

Alternative ways to express computation.

Modularity and code reuse.

Functional programming (FP)

What and why

What

Computation primarily by evaluation of functions.

Features:

  • "variables" are immutable
  • functions are pure: accept arguments, return value. No "side effects" or change of global state (i.e., value of variables / properties) besides what is returned
  • functions are first-class objects - can be treated as data (functions can be 1) passed to other functions and 2) returned from functions)

Why

Advantages

  • high level
  • transparent
  • parallizable
  • avoids common programming errors:
    • inadvertent variable modification
    • exceeding list/array bounds

Some languages enforce these features strictly (Haskell).

Some languages (Python and MATLAB) borrow constructs from FP to enable a “functional style” and receieve some of its benefits (e.g., reducing common programming errors), but does not strictly enforce immutability.

Mutability

Problems with mutation of global variables

  • unintended reuse of variables (problematic in large code bases and collaborative projects)
  • hard to reason about code - need to keep track of all prior mutations to understand what the variable represents at a particular point in the code
  • difficult for interpreter or compiler to optimize if a variable is modified at any point in the program

How to reduce instances of mutation:

  • avoid variable reassignment
  • instead of loops:
    • list comprehensions
    • functions

Anatomy of a function

Definitions

def fn(x):
    y = 1
    return x + y + 2*z
  • x: bound variable
  • y: local variable
  • z: free variable

Python uses lexical scoping - value of z is taken from the namespace in which it was defined. (A namespace is can be thought of the collection of symbols (objects/variables of the program) and their associated values.)

Namespaces

If your code refers to the name x, then Python searches for x in the following namespaces in the order shown (ref):

  • Local: innermost scope that’s local to the function ("local namespace")
  • Enclosing: namespace in which function is defined
  • Global: namespace of main program
  • Built-in: namespace of all of Python's built-in objects


Source: realpython.com

Namespace example

def fn(x):
    y = 1
    return x + y + 2*z

fn is defined in the global environment.

z = 3
print(fn(0))
7

fn is still defined in the global environment, as is z=3.

def wrapper():
    z = 4
    return(fn(0))

print(wrapper())
7

In this last example, fn is an example of a closure - a function in which data is attached (i.e., the value of z).

fn is still defined in the global environment, as is z=3.

def wrapper(z):
    def fn(x):
        y = 1
        return x + y + 2*z
    return fn

fn = wrapper(4)
print(fn(0))
9

Mutations

Motivating example

Intuitive behavior:

def fn1(a):
    a = [2]
    return a

a_inp = [1]
a_out = fn1(a_inp)

print(a_inp, a_out)
[1] [2]

Surprising behavior:

def fn2(a):
    a[0] = 2
    return a

a_inp = [1]
a_out = fn2(a_inp)

print(a_inp, a_out)
[2] [2]

In Python: scalar arguments are passed by value; compound data type (dictionaries and lists) are effectively passed by reference.

To prevent modification of the global variable, use copy or deepcopy.

import copy

def fn3(a):
    a = copy.deepcopy(a)
    a[0] = 2
    return a

a_inp = [1]
a_out = fn3(a_inp)

print(a_inp, a_out)
[1] [2]

Constructs to reduce global mutations

  • functions and anonymous functions
  • expressions rather than statements
  • in lieu of loops:
    • list/dictionary comprehensions
    • map, filter, reduce
  • closures

Constructs

List comprehensions mirror set builder notation in mathematics. Example from here.

Mathematical expression - let \(\mathcal{I}\) be the set of all integers.

\[\begin{equation*} \{x^2 : x \in \mathcal{I} \vee 0 \leq x < 10 \vee \textrm{prime}(x)\} \end{equation*}\]

Python notation.

from sympy import isprime
[x**2 for x in range(10) if isprime(x)]
[4, 9, 25, 49]

Note that x does not exist outside of the scope of the list comprehension in contrast to performing this calculation wtih a loop.

y = []
for x in range(10):
    if isprime(x):
        y.append(x**2)

print(x)
9

List comprehensions are “safer” in that the variable containing individual elements are not introduced or modified outside of the expression.

Example using map, filter, and partial function application:

from functools import partial
list(map(partial(pow, exp=2), filter(isprime, range(10))))
[4, 9, 25, 49]

or using anonymous functions

list(map(lambda x: x**2, filter(isprime, range(10))))
[4, 9, 25, 49]

Note that map returns an iterator (effectively, an unevaluated sequence) so we wrap the output in list() to obtain the results.

Use ternary operator expression for conditional statements. Instead of

if condition:
    result = value_if_True
else:
    result = value_if_False

we can rewrite this as an expression:

result = value_if_True if condition else value_if_False

Function factories

Function that returns a function

Example from here.

Compute the root of the function

\[ f(x) = x^3 - 100 x^2 - x + 100 \]

from scipy.optimize import fsolve

def polynomial_3rd_order(b):
    return lambda x: b[0] + b[1] * x + b[2] * x**2 + b[3] * x**3

coefficients = (100, -1, -100, 1)
x_init = [2, 80]

fsolve(polynomial_3rd_order(coefficients), [2, 80])
array([  1., 100.])

Function that accepts function as argument

from operator import add
list(map(add, [1, 2, 3], [2, 3, 4]))
[3, 5, 7]

Example

Define functions

from math import sin, cos, pi, sqrt
import wave
import array
import matplotlib.pyplot as plt
from functools import reduce, partial

def readWavFile(filename):
    wav = wave.open(filename)
    framerate = wav.getframerate()
    bytes = wav.readframes(wav.getnframes())
    signal = array.array('h', bytes).tolist()
    return signal, framerate

def fourier(signal, r, f):
    seq = range(len(signal))
    x = reduce(lambda acc, k: acc + signal[k] * cos(k * 2 * pi * f / r), seq, 0)
    y = reduce(lambda acc, k: acc + signal[k] * sin(k * 2 * pi * f / r), seq, 0)
    e = reduce(lambda acc, k: acc + signal[k]**2, seq, 0)
    return sqrt(2 * (x * x + y * y) / (e * len(signal)))

"""
General moving window function

Parameters
-----
fn: function
seq: sequence
n: window
d: increment

Return value
-----
processed signal as list
"""
def movingwindow(fn, seq, n, d=None):
    if not d:
        d = n
    return [fn(seq[i:i+n]) for i in range(0, len(seq), d)]

def findLocalMaxima(time, power, threshold):
    def fn(acc, i):
        if acc['state'] == 0:
            if power[i] > threshold:
                acc['state'] = 1
                acc['localMaximum'] = i
        elif acc['state'] == 1:
            if power[i] > power[acc['localMaximum']]:
                acc['localMaximum'] = i
            elif power[i] < threshold:
                acc['state'] = 0
                acc['maxima'] += [acc['localMaximum']]
        return acc
    out = reduce(fn, range(len(power)), {'state': 0, 'localMaximum': 0, 'maxima': []})
    return out['maxima']

Import data and apply functions

nframes = 1000
frequency = 2260  # Hz
window_size = 100
window_incr = 20
threshold = 0.16

signal, framerate = readWavFile('recording.wav')
freqTime = movingwindow(lambda x: x[0]/framerate, range(len(signal)), nframes)
freqPower = movingwindow(partial(fourier, r=framerate, f=frequency), signal, nframes)
birdTime = movingwindow(lambda x: x[0], freqTime, window_size, window_incr)
birdPower = movingwindow(lambda x: sum(x)/len(x), freqPower, window_size, window_incr)
maxima = findLocalMaxima(birdTime, birdPower, threshold)

Plot

plt.plot(freqTime, freqPower, label = f'{frequency:d} Hz (puissance)')
plt.plot(birdTime, birdPower, label = 'Signal "oiseau"')

for i in maxima:
    time = birdTime[i]
    minute = int(time / 60)
    seconde = int(time % 60)
    timeText = f'{minute:02d}:{seconde:02d}'
    plt.annotate(timeText, xy = (birdTime[i], birdPower[i]),
                 horizontalalignment = 'center',
                 xytext = (birdTime[i], birdPower[i] + 0.2),
                 arrowprops = dict(facecolor = 'black', width = 2))
    print('Oiseau à ' + timeText)

plt.legend()
plt.show()

Object-oriented programming (OOP)

What and why

What

A unit of data with associated functions.

  • simple analogy: Python dictionary that may contain multiple entries for various values and functions that primarily operate on these values
  • objects have additional features like inheritance, and functions (object methods) have access to object variables (attributes).
  • objects are specific instances of a defined /class/ (which prescribes the attributes and methods that all instantiated objects should have)
  • inheritance allows definition of new object classes that have attributes or methods of its parent object class

Why

  • Method dispatch
  • Organize functions around entities on which they act

Alternatives to method dispatch:

  • multiple functions with slightly different names
  • conditional branching within function based on object class

Example

Define classes

from math import sin, cos, pi, sqrt
import wave
import array
import matplotlib.pyplot as plt

class RawSignal:

    def __init__(self, wavefile):
        self.readWavFile(wavefile)

    def readWavFile(self, filename):
        wav = wave.open(filename)
        bytes = wav.readframes(wav.getnframes())
        self.framerate = wav.getframerate()        
        self.signal = array.array('h', bytes).tolist()

    def __fourier(self, signal, r, f): # private function
        x = 0.0
        y = 0.0
        e = 0.0
        for i, s in enumerate(signal):
            x += s * cos(i * 2 * pi * f / r)
            y += s * sin(i * 2 * pi * f / r)
            e += s * s
        return sqrt(2 * (x * x + y * y) / (e * len(signal)))

    def extractfreq(self, f, nframes):
        signal = self.signal
        r = self.framerate        
        time = []
        power = []
        for k in range(0, len(signal), nframes):
            time.append(k / r)
            power.append(self.__fourier(signal[k:k + nframes], r, f))
        return FreqPower(time, power)

class FreqPower:

    def __init__(self, time, power):
        self.time = time
        self.power = power

    def movingaverage(self, n, d):
        time = self.time
        power = self.power
        aveTime = []
        avePower = []
        for k in range(0, len(power) - n, d):
            aveTime.append(time[k])
            avePower.append(sum(power[k:k + n]) / n)
        self.time = aveTime
        self.power = avePower

    def localMaxima(self, threshold):
        time = self.time
        power = self.power
        state = 0
        localMaximum = 0
        maxima = []
        for i, value in enumerate(power):
            if state == 0:
                if power[i] > threshold:
                    state = 1
                    localMaximum = i
            elif state == 1:
                if power[i] > power[localMaximum]:
                    localMaximum = i
                elif power[i] < threshold:
                    state = 0
                    maxima.append(localMaximum)
        return maxima

Instantiate objects and apply methods

nframes = 1000
frequency = 2260  # Hz
window_size = 100
window_incr = 20
threshold = 0.16

raw = RawSignal('recording.wav')
mybird = raw.extractfreq(frequency, nframes)
mybird.movingaverage(window_size, window_incr)
maxima = mybird.localMaxima(threshold)

To use the same code for plotting, I will create a reference to them (assignment = does not copy lists in Python)

freqTime = mybird.time 
freqPower = mybird.power
birdTime = mybird.time
birdPower = mybird.power
plt.plot(freqTime, freqPower, label = f'{frequency:d} Hz (puissance)')
plt.plot(birdTime, birdPower, label = 'Signal "oiseau"')

for i in maxima:
    time = birdTime[i]
    minute = int(time / 60)
    seconde = int(time % 60)
    timeText = f'{minute:02d}:{seconde:02d}'
    plt.annotate(timeText,
                 xy = (birdTime[i], birdPower[i]),
                 horizontalalignment = 'center',
                 xytext = (birdTime[i], birdPower[i] + 0.2),
                 arrowprops = dict(facecolor = 'black', width = 2))
    print('Oiseau à ' + timeText)

plt.legend()
plt.show()

Another example

Here define a parent class Topo, and two children classes TopoXYZ and TopoMNT associated with two different data representations (and file formats) for storing topological information introduced in sieprog.ch.

  • Data in the XYZ format contains space-delimited columns (\(x\), \(y\), \(z\)) while MNT format contains a sequence of \(z\) values.
  • TopoXYZ and TopoMNT classes have different methods for 1) reading in files and 2) preprocessing the data to provide x, y, and z arrays for plotting.
  • After the appropriate arrays are generated, they use a common method defined by the parent class Topo for displaying the topographic map in the same format.


import numpy as np
import matplotlib.pyplot as plt

class Topo:

    def plot(self, xp, yp, array, xlabel = 'X [m]', ylabel = 'Y [m]'):

        fig, ax = plt.subplots()
        ax.set_aspect('equal')
        ax.pcolormesh(xp, yp, array, shading = 'auto')
        ax.set_xlabel(xlabel)
        ax.set_ylabel(ylabel)

Cont’d

class TopoXYZ(Topo):

    def __init__(self, fname, res):

        xyz = np.genfromtxt(fname, delimiter = ' ', dtype = None)
        self.x = xyz[:, 0]
        self.y = xyz[:, 1]
        self.z = xyz[:, 2]
        self.res = res

    def plot(self, xlabel = None, ylabel = None):

        x = self.x
        y = self.y
        z = self.z
        res = self.res

        xmin = x.min()
        ymin = y.min()
        xidx = ((x - xmin) / res).astype('int32')
        yidx = ((y - ymin) / res).astype('int32')

        xp = xmin + np.arange(0, xidx.max() + 1) * res
        yp = ymin + np.arange(0, yidx.max() + 1) * res
        array = np.full((yidx.max() + 1, xidx.max() + 1), np.nan)
        array[yidx, xidx] = z

        super().plot(xp, yp, array)
        
class TopoMNT(Topo):

    def __init__(self, fname, nx, ny, res):

        self.values = np.fromfile(fname, np.float64)
        self.nx = nx
        self.ny = ny
        self.res = res

    def plot(self):
        xp = np.arange(self.nx) * self.res
        yp = np.arange(self.ny) * self.res
        array = self.values.reshape(self.ny, self.nx)

        super().plot(xp, yp, array)
mnt = TopoMNT('mnt-suisse-romande', nx = 2501, ny = 3601, res = 90)
mnt.plot()
plt.savefig('mnt.png')
xyz = TopoXYZ('DHM200.xyz', res = 200)
xyz.plot()
plt.savefig('xyz.png')

Polymorphism

Implementing method dispatch:

  • [shown previously] collect associated functions by object. (I.e., if it is defined as an object attribute, behave according to how it is defined within this function.)
  • [alternative approach] related functions can be associated with a generic function. The behavior of the function is determined by the object class of the first argument (or additional arguments in the case of “multiple dispatch”).

Class definitions only include attributes.

import numpy as np
import matplotlib.pyplot as plt
from functools import singledispatch

class TopoXYZ(Topo):

    def __init__(self, fname, res):

        xyz = np.genfromtxt(fname, delimiter = ' ', dtype = None)
        self.x = xyz[:, 0]
        self.y = xyz[:, 1]
        self.z = xyz[:, 2]
        self.res = res

class TopoMNT(Topo):

    def __init__(self, fname, nx, ny, res):

        self.values = np.fromfile(fname, np.float64)
        self.nx = nx
        self.ny = ny
        self.res = res

plot_topo is a generic function under which class-specific methods are defined (definitions are same as before)

@singledispatch
def plot_topo(xp, yp, array, xlabel = 'X [m]', ylabel = 'Y [m]'):
    fig, ax = plt.subplots()
    ax.set_aspect('equal')
    ax.pcolormesh(xp, yp, array, shading = 'auto')
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)

@plot_topo.register(TopoXYZ)
def _(obj, *args, **kwargs):
    x = obj.x
    y = obj.y
    z = obj.z
    res = obj.res
    xmin = x.min()
    ymin = y.min()
    xidx = ((x - xmin) / res).astype('int32')
    yidx = ((y - ymin) / res).astype('int32')
    xp = xmin + np.arange(0, xidx.max() + 1) * res
    yp = ymin + np.arange(0, yidx.max() + 1) * res
    array = np.full((yidx.max() + 1, xidx.max() + 1), np.nan)
    array[yidx, xidx] = z
    plot_topo(xp, yp, array, *args, **kwargs)

@plot_topo.register(TopoMNT)
def _(obj, *args, **kwargs):
    xp = np.arange(obj.nx) * obj.res
    yp = np.arange(obj.ny) * obj.res
    array = obj.values.reshape(obj.ny, obj.nx)
    plot_topo(xp, yp, array, *args, **kwargs)

Plotting method is determined by class of first argument.

plot_topo(xyz)
plot_topo(mnt)

Duck typing

To perform method dispatch, it is not necessary that the object have a certain type, but that it has certain attributes or methods.

Example:

"a" + "b"
'ab'
3 + 2
5
3.5 + 2
5.5

The operator + or, equivalently, operator.add, will perform addition on any objects if it has an “add” attribute.

"a".__add__
<method-wrapper '__add__' of str object at 0x7f7ce2c61730>
getattr(3, "__add__")
<method-wrapper '__add__' of int object at 0x7f7ce2d4e970>
getattr(3.5, "__add__")
<method-wrapper '__add__' of float object at 0x7f7c9212c990>

Another example:

"a" * 2
'aa'
3 * 2
6
[3] * 2
[3, 3]
"a".__mul__
<method-wrapper '__mul__' of str object at 0x7f7ce2c61730>
getattr(3, "__mul__")
<method-wrapper '__mul__' of int object at 0x7f7ce2d4e970>
[3].__add__
<method-wrapper '__add__' of list object at 0x7f7c920dc9c0>