Scientific and Engineering Libraries
====================================

*(C) Copyright Notice: This chapter is part of the book available
at*\ https://pp4e-book.github.io/\ *and copying, distributing, modifying
it requires explicit permission from the authors. See the book page for
details:*\ https://pp4e-book.github.io/

In this chapter, we will cover several libraries that are very
functional in scientific & engineering-related computing problems. In
order to keep our focus on the practical usages of these libraries and
considering that this is an introductory textbook, the coverage in this
chapter is in not intended to be comprehensive.

Numerical Computing with NumPy
------------------------------

NumPy can be considered as a library for working with vectors and
matrices. NumPy calls vectors and matrices as *arrays*.

.. raw:: html

   <table bgcolor="pink" width="75%" align="center">

.. raw:: html

   <tr>

.. raw:: html

   </tr>

.. raw:: html

   <tr>

.. raw:: html

   <td>

Installation Notes

To be able to use the NumPy library, you will need to download it from
`numpy.org <http://www.numpy.org>`__ and install it on your computer. If
you are using a Python package manager (e.g. pip), you can install it
directly using: ``$ pip install numpy``. If you are using a Windows/Mac
machine, you should install
`anaconda <https://docs.anaconda.com/anaconda/install/>`__ first. If you
are using Colab or another Jupyter Notebook viewer, the platform may
already have numpy installed.

.. raw:: html

   </td>

.. raw:: html

   </tr>

.. raw:: html

   </table>

Arrays and Their Basic Properties
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Let us consider a simple vector and a matrix:

.. math::


    array1 = \begin{pmatrix}
    1 & 2 & 3 \\
    \end{pmatrix}
    

and

.. math::


    array2 = \begin{pmatrix}
    1 & 2 & 3 \\
    4 & 5 & 6
    \end{pmatrix}
    

Let us see how we can represent and work with these two arrays in NumPy:

.. code:: python

   >>> import numpy as np          # Import the NumPy library
   >>> array1 = np.array([1, 2, 3])
   >>> array2 = np.array([[1, 2, 3], [4, 5, 6]])
   >>> type(array1)
   <class 'numpy.ndarray'>
   >>> type(array2)
   <class 'numpy.ndarray'>
   >>> array1
   array([1, 2, 3])
   >>> array2
   array([[1, 2, 3],
          [4, 5, 6]])
   >>> print(array2)
   [[1 2 3]
    [4 5 6]]

We see from this example that we can pass lists of number as arguments
to ``np.array`` function which creates a NumPy array for us. If the
argument is a nested list, each element of the list is used as a row of
a 2D array.

Arrays can contain any data type as elements; however, we will limit
ourselves to numbers (integers and real numbers) in this chapter.

**Shapes, Dimensions and Number of Elements of Arrays**. The first thing
we can do with a NumPy array is check its shape. For our example
``array1`` and ``array2``, we can do so as follows:

.. code:: python

   >>> array1.shape
   (3,)
   >>> array2.shape
   (2,3)

where we can see that ``array1`` is a one-dimensional array with
:math:`3` elements and ``array2`` is a :math:`2\times 3` array (a 2D
matrix).

For a 2D array, a shape value ``(R, C)`` denotes the number of rows
first (``R``), which is sometimes also called the first dimension, and
then the number of columns (``C``), which is the second dimension. For
:math:`n`\ D arrays with :math:`n>2`, the meaning of the shape values is
the same except that there are :math:`n` values in the shape.

In NumPy, we can easily change the shape of an array without losing
content:

.. code:: python

   >>> array1.reshape((3,1))
   array([[1],
          [2],
          [3]])
   >>> array2.reshape((1,6))
   array([[1, 2, 3, 4, 5, 6]])

For many applications, we will need to access the number of dimensions
of an array. For this purpose, we can use ``.ndim`` value:

.. code:: python

   >>> array1.ndim
   1
   >>> array2.ndim
   2

The number of elements in an array is another important value that we
are frequently interested in. To access that, we can use the ``.size``
value:

.. code:: python

   >>> array1.size
   3
   >>> array2.size
   6

**Accessing elements in arrays**. NumPy allows the same indexing
mechanisms that you can use with Python’s native container data types.
For NumPy, let us look at some examples:

.. code:: python

   >>> array1[-1]
   3
   >>> array2[1][2]
   6
   >>> array2[-1]
   array([4, 5, 6])

**Creating arrays**. We have already seen that we can create arrays
using the ``np.array()`` function. However, there are other ways for
creating arrays conforming to predefined specifications. We can for
example create arrays filled with zeros or ones (note that the argument
is a tuple describing the shape of the matrix):

.. code:: python

   >>> np.zeros((3, 4))
   array([[0., 0., 0., 0.],
          [0., 0., 0., 0.],
          [0., 0., 0., 0.]])
   >>> np.ones((2,6))
   array([[1., 1., 1., 1., 1., 1.],
          [1., 1., 1., 1., 1., 1.]])

Alternatively, we can create an array filled with a range of values
using ``np.arange()`` function:

.. code:: python

   >>> np.arange(1,10)     # 1: starting value. 10: ending value (excluded)
   array([1, 2, 3, 4, 5, 6, 7, 8, 9])
   >>> np.arange(1,10,2)
   array([1, 3, 5, 7, 9])
   >>> np.arange(1,10).reshape((3,3))
   array([[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]])

Working with Arrays
~~~~~~~~~~~~~~~~~~~

The previous section covered how we can access elements in an array and
properties of an array. Now let us see how the different types of
operations we can do with arrays.

**Arithmetic, Relational and Membership Operations with Arrays**

The arithmetic operations (``+``, ``-``, ``*``, ``/``, ``**``), the
relational operations (``==``, ``<``, ``<=``, ``>``, ``>=``) and the
membership operations (``in``, ``not in``) that we can apply on numbers
and other data types in Python can be applied on arrays with NumPy.
These operations are performed elementwise. This means that the arrays
that are provided as operands to a binary arithmetic operator need to
have the same shape.

Let us see some examples for arithmetic operations:

.. code:: python

   >>> A = np.arange(4).reshape((2,2))
   >>> A
   array([[0, 1],
          [2, 3]])
   >>> B = np.arange(4, 8).reshape((2,2))
   >>> B
   array([[4, 5],
          [6, 7]])
   >>> print(B-A)
   [[4 4]
    [4 4]]
   >>> print(B+A)
   [[ 4  6]
    [ 8 10]]
   >>> A
   array([[0, 1],
          [2, 3]])
   >>> B
   array([[4, 5],
          [6, 7]])

Note here that these operations create a new array whose elements are
the results of applying the operation. Therefore, the original arrays
are not modified. If you are interested in in-place operations that
modify an existing array during the operation, you can use combined
statements such as ``+=``, ``-=``, ``*=``.

Relational and membership operations are also applied elementwise and we
can easily anticipate the outcomes of such operations, e.g. as follows:

.. code:: python

   >>> A < B
   array([[ True,  True],
          [ True,  True]])
   >>> B > A
   array([[ True,  True],
          [ True,  True]])
   >>> A > B
   array([[False, False],
          [False, False]])
   >>> 4 in B
   True
   >>> 10 in B
   False

**Useful Functions**

NumPy arrays provide several useful functions already provided. These
include: - Standard mathematical function such as exponent, sin, cos,
square-root: ``np.exp(<array>)``, ``np.sin(<array>)``,
``np.cos(<array>)``, ``np.sqrt(<array>)``. - Minimum and maximum:
``<array object>.min()`` and ``<array object>.max()``. - Summation, mean
and standard deviation: ``<array object>.sum()``,
``<array object>.mean()`` and ``<array object>.std()``.

Note that minimum, maximum, summation, mean and standard deviation can
be applied on the whole array as well as along a pre-specified dimension
(specified with an ``axis`` parameter). Let us see some examples to
clarify this important aspect:

.. code:: python

   >>> A
   array([[0, 1],
          [2, 3]])
   >>> A.sum()
   6
   >>> A.sum(axis=0)
   array([2, 4])
   >>> A.sum(axis=1)
   array([1, 5])
   >>> A.sum(axis=2)
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/usr/local/lib/python3.7/site-packages/numpy/core/_methods.py", line 47, in _sum
       return umr_sum(a, axis, dtype, out, keepdims, initial, where)
   numpy.AxisError: axis 2 is out of bounds for array of dimension 2

where we see that axes start being numbered from zero.

**Splitting and Combining Arrays**

For many problems, we will need to split an array into multiple arrays
or combine multiple arrays into one. For splitting arrays, we can use
functions such as ``np.hsplit`` (for horizontal split), ``np.vsplit``
(for vertical split) and ``np.array_split`` (for more general split
operations). Below is an example for ``hsplit`` and ``vsplit``:

.. code:: python

   >>> L = np.arange(16).reshape(4,4)
   >>> L
   array([[ 0,  1,  2,  3],
          [ 4,  5,  6,  7],
          [ 8,  9, 10, 11],
          [12, 13, 14, 15]])
   >>> np.hsplit(L,2)  # Divide L into 2 arrays along the horizontal axis
   [array([[ 0,  1],
          [ 4,  5],
          [ 8,  9],
          [12, 13]]), array([[ 2,  3],
          [ 6,  7],
          [10, 11],
          [14, 15]])]
   >>> np.vsplit(L,2)
   [array([[0, 1, 2, 3],
          [4, 5, 6, 7]]), array([[ 8,  9, 10, 11],
          [12, 13, 14, 15]])]

Note that the resultant arrays are provided as a Python ``list``. Note
also that ``hsplit`` and ``vsplit`` functions work on even sizes
(i.e. they can split into equally sized arrays) – for general split
operations, you can use ``array_split``.

For combining multiple arrays, we can use ``np.hstack`` (for horizontal
stacking), ``np.vstack`` (for vertical stacking) and ``np.stack`` (for
more general stacking operations). Below is an example (for the ``A``
and ``B`` arrays that we have created before):

.. code:: python

   >>> A
   array([[0, 1],
          [2, 3]])
   >>> B
   array([[4, 5],
          [6, 7]])
   >>> np.hstack((A, B))
   array([[0, 1, 4, 5],
          [2, 3, 6, 7]])
   >>> np.vstack((A, B))
   array([[0, 1],
          [2, 3],
          [4, 5],
          [6, 7]])

**Iterations with Arrays**.

NumPy library already provides for us many functionalities that we might
need while working with arrays. However, in many circumstances those
will not be necessary and we will need to be able to iterate over the
elements of arrays and perform custom algorithmic steps.

Luckily, iteration with arrays is very similar to how we would perform
iterations with other container data types in Python. Let us see some
examples:

.. code:: python

   >>> L
   array([[ 0,  1,  2,  3],
          [ 4,  5,  6,  7],
          [ 8,  9, 10, 11],
          [12, 13, 14, 15]])
   >>> for r in L:
   ...     print("row: ", r)
   ...
   row:  [0 1 2 3]
   row:  [4 5 6 7]
   row:  [ 8  9 10 11]
   row:  [12 13 14 15]

where we see that iteration over a multi-dimensional array iterates over
the first dimension. To iterate over each element, we have at least two
options:

.. code:: python

   >>> for element in L.flat:
   ...     print(element)
   ...
   0
   1
   2
   3
   4
   5
   6
   7
   8
   9
   10
   11
   12
   13
   14
   15
   >>> for r in L:
   ...     for element in r:
   ...             print(element)
   ...
   0
   1
   2
   3
   4
   5
   6
   7
   8
   9
   10
   11
   12
   13
   14
   15

Now let us implement one of the operations we have seen above in Python
from scratch as an exercise:

.. code:: python

    import numpy as np
    
    def horizontal_stack(A,B):
      """A function that combines two 2D arrays A and B.
         A and B need to have the same height.
      """
      # Let us get dimensions first and check whether A and B
      # are compatible for stacking.
      (H_A, W_A) = A.shape
      (H_B, W_B) = B.shape
      if H_A != H_B:
        print("Arguments A and B have incompatible heights!")
        return None
      # Let us create an empty `result` array:
      H_result = H_A #or H_B
      W_result = W_A + W_B
      result = np.zeros((H_result, W_result))
    
      # Now let us iterate over each position in A and B and place
      # their elements into the corresponding positions
      for i in range(H_A):
        for j in range(W_A):
          result[i][j] = A[i][j]
        for j in range(W_B):
          result[i][j+W_A] = B[i][j]
    
      return result
    
    # Let us test our code:
    M = np.random.randn(2,2) # Create a random 3x4 array
    N = np.random.randn(2,3) # Create a random 3x6 array
    print("Random array M is:\n", M)
    print("Random array N is:\n", N)
    print("Horizontal stacking of M and N yields:\n", horizontal_stack(M, N))


.. parsed-literal::
    :class: output

    Random array M is:
     [[ 0.39521366  1.41939618]
     [-0.13770429  1.19319949]]
    Random array N is:
     [[-1.61066212e-01  8.15354855e-01  7.19234023e-01]
     [ 2.65985989e-04 -2.89111500e-01  4.52781848e-02]]
    Horizontal stacking of M and N yields:
     [[ 3.95213656e-01  1.41939618e+00 -1.61066212e-01  8.15354855e-01
       7.19234023e-01]
     [-1.37704293e-01  1.19319949e+00  2.65985989e-04 -2.89111500e-01
       4.52781848e-02]]


Linear Algebra with NumPy
~~~~~~~~~~~~~~~~~~~~~~~~~

Now let us give a flavour of the Linear Algebra operations provided by
NumPy. Note that some of these operations is provided in the module
``numpy.linalg``.

**Transpose**.

The transpose operation flips a matrix along the diagonal. For an
example matrix :math:`A`,

.. math::


   A = \begin{pmatrix}
   1 & 2  \\
   3 & 4  
   \end{pmatrix}

its transpose is the following:

.. math::


   A^T = \begin{pmatrix}
   1 & 3 \\
   2 & 4  \\
   \end{pmatrix}

In NumPy, the transpose of a matrix can be easily accessed by accessing
its ``.T`` member variable or by calling its ``.transpose()`` member
function:

.. code:: python

   >>> A
   array([[1, 2],
          [3, 4]])
   >>> A.T
   array([[1, 3],
          [2, 4]])
   >>> A
   array([[1, 2],
          [3, 4]])

Note that ``.T`` is simply a member variable of the array object and is
not defined as an operation. Try finding the transpose of ``A`` with
``A.transpose()`` and see that you obtain the expected result. Note also
that the transpose operation does not change the original array.

**Inverse**.

The inverse of an :math:`n\times n` matrix :math:`A` is an
:math:`n\times n` matrix :math:`A^{-1}` that yields an :math:`n\times n`
identity matrix :math:`I` when multiplied:

.. math::


   A \times A^{-1} = I.

In NumPy, we can use ``np.linalg.inv(<array>)`` to find the inverse of a
square array. Here is an example:

.. code:: python

   >>> A
   array([[1, 2],
          [3, 4]])
   >>> A_inv = np.linalg.inv(A)
   >>> A_inv
   array([[-2. ,  1. ],
          [ 1.5, -0.5]])

Let us check whether the inverse was correctly calculated:

.. code:: python

   >>> np.matmul(A, A_inv)
   array([[1.0000000e+00, 0.0000000e+00],
          [8.8817842e-16, 1.0000000e+00]])

which is the identity matrix (:math:`I`) except for some rounding error
at index :math:`(1,0)` owing to floating point approximation. However,
as we have seen in Chapter 2, while writing our programs, we should
compare numbers such that ``8.8817842e-16`` can be considered to be
equal to zero.

**Determinant, norm, rank, condition number, trace**

While working with matrices, we often require the following properties
which can be easily calculated. For the sake of brevity and to keep the
focus, we will omit their explanations:

=========================================== ============================
Matrix Property                             How to Calculate with NumPy
=========================================== ============================
Determinant (:math:`|A|` or :math:`det(A)`) ``np.linalg.det(A)``
Norm (:math:`||A||`)                        ``np.linalg.norm(A)``
Rank (:math:`rank(A)`)                      ``np.linalg.matrix_rank(A)``
Condition number (:math:`\kappa(A)`)        ``np.linalg.cond(A)``
Trace (:math:`tr(A)`)                       ``np.trace(A)``
=========================================== ============================

**Dot Product, Inner Product, Outer Product, Matrix Multiplication**

The product of two matrices can be calculated in different ways:

-  ``np.dot(a, b)``: For vectors (1D arrays), this is equivalent to the
   dot product (i.e. :math:`\sum_i \mathbf{a}_i \times \mathbf{b}_i` for
   two vectors :math:`\mathbf{a}` and :math:`\mathbf{b}`). For 2D arrays
   and arrays with more dimensions, the result is matrix multiplication.

-  ``np.inner(a,b)``: For vectors, this is dot product (like
   ``np.dot(a,b)``). For higher-dimensional arrays, the result is a sum
   product over the last axes. Consider the following example for
   clarification on this:

.. code:: python

   >>> A
   array([[0, 1],
          [2, 3]])
   >>> B
   array([[4, 5],
          [6, 7]])
   >>> np.inner(A,B)
   array([[ 5,  7],
          [23, 33]])

The last axes for both ``A`` and ``B`` are the horizontal axes.
Therefore, each first-axis element of ``A`` (i.e. [0, 1] and [2,3]) is
multiplied with each first-axis element of ``B`` ([4, 5] and [6, 7]).

-  ``np.outer(a,b)``: Outer product is defined on vectors such that
   ``result[i,j] = a[i] * b[j]``.

-  ``np.matmul(a,b)``: Matrix multiplication of matrices ``a`` and
   ``b``. The result is simply calculated as
   :math:`\textrm{result}_{ij} = \sum_k \mathbf{a}_{ik} \mathbf{b}_{kj}`,
   as also illustrated in :numref:`ch10_matrix_multiplication`.

.. _ch10_matrix_multiplication:

.. figure:: ../figures/ch10_matrix_multiplication.png
   :width: 500px

   Illustration of matrix multiplication of two matrices A and B.
   (Figure source: Wikipedia)


**Eigenvectors and Eigenvalues**

Explaining eigenvectors and eigenvalues is beyond the scope of the book.
However, we can briefly remind that eigenvectors and eigenvalues are the
variables that satisfy the following for a matrix :math:`A` which can be
considered as a transformation in a vector space:

.. math::


   Av = \lambda v.

In other words, :math:`v` is such a vector that :math:`A` just changes
its scale by a factor :math:`\lambda`.

In NumPy, eigenvectors and eigenvalues can be obtained by
``np.linalg.eigh(<array>)`` which returns a tuple with the following two
elements:

1. Eigenvalues (an array of :math:`\lambda`) in decreasing order.
2. Eigenvectors (:math:`v`) as a column matrix. I.e. ``v[:, i]`` is the
   eigenvector corresponding to the eigenvalue :math:`\lambda[i]`.

**Matrix Decompositions**

Numpy also provides many frequently used matrix decompositions, as
listed below:

============================ ===========================
Matrix Decomposition         How to Calculate with NumPy
============================ ===========================
Cholesky decomposition       ``linalg.cholesky(A)``
QR factorization             ``np.linalg.qr(A)``
Singular Value Decomposition ``linalg.svd(a)``
============================ ===========================

**Solve a linear system of equations**

Given a linear system of equations as follows:

.. math::


     a_{11} x_1 + a_{12} x_2 + ... + a_{1n} x_n = b_1,\\
     a_{21} x_1 + a_{22} x_2 + ... + a_{2n} x_n = b_2,\\
     ... \\
     a_{n1} x_1 + a_{22} x_2 + ... + a_{nn} x_n = b_n,\\

which can be represented in compact form as:

.. math::


   \mathbf{a} \mathbf{x} = \mathbf{b},

can be solved in NumPy using ``np.linalg.solve(a, b)``.

Below is an easy example:

.. code:: python

   >>> A
   array([[0, 1],
          [2, 3]])
   >>> B
   array([3, 4])
   >>> np.linalg.solve(A,B)
   array([-2.5,  3. ])
   >>> X = np.linalg.solve(A,B)
   >>> X
   array([-2.5,  3. ])

which can be easily verified using multiplication:

.. code:: python

   >>> np.inner(A, X)
   array([3., 4.])

which is equal to ``B``.

Why Use NumPy? Efficiency Benefits
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The previous sections have illustrated how capable the NumPy library is
and how easily we can do many calculations with the pre-defined
functionalities e.g. in its linear algebra module. Since we can access
each element of an array using Python’s indexing mechanisms, we can be
tempted to implementing those functionalities ourselves, from scratch.

However, those functionalities in NumPy are *not* implemented in Python
but in C, another high-level programming language where programs are
directly compiled into machine executable binary code. Therefore, they
would execute much faster in comparison to what we would be implementing
in Python.

The following example illustrates this. When executed, you should see an
order of $~$1000 difference in running time.

In other words, you are strongly advised, whenever possible, to use
NumPy’s existing functions and routines, and to write every operation in
vector or matrix form as much as possible if you are working with
matrices.

.. code:: python

    import numpy as np
    from time import time
    
    def matmul_2D(M, N):
      """Custom defined matrix multiplication for two 2D matrices M and N"""
      (H_M, W_M) = M.shape
      (H_N, W_N) = N.shape
      if W_M != H_N:
        print("Dimensions of M and N mismatch!")
        return None
    
      result = np.zeros((H_M, W_N))
      for i in range(H_M):
        for j in range(W_N):
          for k in range(W_M):
            result[i][j] += M[i][k] * N[k][j]
    
      return result
    
    # First let us check that our code works as expected
    M = np.random.randn(2,3)
    N = np.random.randn(3,4)
    
    print("Our matmul_2D result: \n", matmul_2D(M, N))
    print("Correct result: \n", np.matmul(M, N))
    
    # Now let us measure the running-time performances
    # Create two 2D large matrices
    M = np.random.randn(100, 100)
    N = np.random.randn(100, 100)
    
    # Option 1: Use NumPy's matrix multiplication
    t1 = time()
    result = np.matmul(M, N)
    t2 = time()
    print("NumPy's matmul took ", t2-t1, "ms.")
    
    # Option 2: Use our matmul_2D function
    t1 = time()
    result = matmul_2D(M, N)
    t2 = time()
    print("Our matmul_2D function took ", t2-t1, "ms.")


.. parsed-literal::
    :class: output

    Our matmul_2D result: 
     [[-0.24208963  1.18790239 -1.5397112  -1.14521578]
     [-2.00697411  4.23675778 -3.74909068 -0.23854542]]
    Correct result: 
     [[-0.24208963  1.18790239 -1.5397112  -1.14521578]
     [-2.00697411  4.23675778 -3.74909068 -0.23854542]]
    NumPy's matmul took  0.00040411949157714844 ms.
    Our matmul_2D function took  1.0593342781066895 ms.


Scientific Computing with SciPy
-------------------------------

SciPy is a library that includes many methods and facilities for
Scientific Computing. It is closely linked with NumPy so much that NumPy
needs to be imported first to be able to use SciPy.

.. raw:: html

   <table bgcolor="pink" width="75%" align="center">

.. raw:: html

   <tr>

.. raw:: html

   </tr>

.. raw:: html

   <tr>

.. raw:: html

   <td>

Installation Notes

To be able to use the SciPy library, you will need to download it from
`scipy.org <http://www.scipy.org>`__ and install it on your computer. If
you are using a Python package manager (e.g. pip), you can install it
directly using: ``$ pip install scipy``. If you are using a Windows/Mac
machine, you should install
`anaconda <https://docs.anaconda.com/anaconda/install/>`__ first. If you
are using Colab or another Jupyter Notebook viewer, the platform may
already have scipy installed.

.. raw:: html

   </td>

.. raw:: html

   </tr>

.. raw:: html

   </table>

Below is a list of some modules provided by SciPy. Even a brief coverage
of these modules is not feasible in a chapter. However, we will see an
example in Chapter 12 using SciPy’s ``stats`` and ``optimize`` modules.

=========== ======================================================
Module      Description
=========== ======================================================
cluster     Clustering algorithms
constants   Physical and mathematical constants
fftpack     Fast Fourier Transform routines
integrate   Integration and ordinary differential equation solvers
interpolate Interpolation and smoothing splines
io          Input and Output
linalg      Linear algebra
ndimage     N-dimensional image processing
odr         Orthogonal distance regression
optimize    Optimization and root-finding routines
signal      Signal processing
sparse      Sparse matrices and associated routines
spatial     Spatial data structures and algorithms
special     Special functions
stats       Statistical distributions and functions
=========== ======================================================

Data handling & analysis with Pandas
------------------------------------

Pandas is a very handy library for working with files with different
formats and analyzing data in different format. To keep things simple,
in this section we will just look at CSV files. Note that the facilities
for other formats are very similar.

.. raw:: html

   <table bgcolor="pink" width="75%" align="center">

.. raw:: html

   <tr>

.. raw:: html

   </tr>

.. raw:: html

   <tr>

.. raw:: html

   <td>

Installation Notes

To be able to use the Pandas library, you will need to download it from
`pandas.pydata.org <https://pandas.pydata.org/>`__ and install it on
your computer. If you are using a Python package manager (e.g. pip), you
can install it directly using: ``$ pip install pandas``. If you are
using a Windows/Mac machine, you should install
`anaconda <https://docs.anaconda.com/anaconda/install/>`__ first. If you
are using Colab or another Jupyter Notebook viewer, the platform may
already have pandas installed.

.. raw:: html

   </td>

.. raw:: html

   </tr>

.. raw:: html

   </table>

Supported File Formats
~~~~~~~~~~~~~~~~~~~~~~

As we have seen before, Python already provides facilities for reading
and writing files. However, if you need to work with “structured” files
such as CSV, XML, XLSX, JSON, HTML, the native facilities of Python
would need to extended significantly.

That’s where Pandas comes in. It complements Python’s native facilities
to be able to work with structured files for both reading data from them
and creating new files. Below is a list of the different file formats
that Pandas supports and the corresponding functions for reading or
writing them.

.. raw:: html

   <center>

*Table 10.1. The file formats supported by the Pandas library. The table
is adapted from and the reader is encouraged to check the following page
to see a more up-to-date list:* Pandas IO Reference.

.. raw:: html

   </center>

=========== ===================== ============== ============
Format Type Data Description      Reader         Writer
=========== ===================== ============== ============
text        CSV                   read_csv       to_csv
text        Fixed-Width Text File read_fwf       -
text        JSON                  read_json      to_json
text        HTML                  read_html      to_html
text        Local clipboard       read_clipboard to_clipboard
-           MS Excel              read_excel     to_excel
binary      OpenDocument          read_excel     -
binary      HDF5 Format           read_hdf       to_hdf
binary      Feather Format        read_feather   to_feather
binary      Parquet Format        read_parquet   to_parquet
binary      ORC Format            read_orc       -
binary      Msgpack               read_msgpack   to_msgpack
binary      Stata                 read_stata     to_stata
binary      SAS                   read_sas       -
binary      SPSS                  read_spss      -
binary      Python Pickle Format  read_pickle    to_pickle
SQL         SQL                   read_sql       to_sql
SQL         Google BigQuery       read_gbq       to_gbq
=========== ===================== ============== ============

Data Frames
~~~~~~~~~~~

Pandas library is built on top of a data type called ``DataFrame`` which
is used to store all types of data while working with Pandas. In the
following, we will illustrate how you can get your data into a
``DataFrame`` object:

**1. Loading data from a file**. If you have your data in a file that is
supported by Pandas, you can use its reader for directly obtaining a
``DataFrame`` object. Let us look at an example CSV file:

.. code:: python

    # Download an example CSV file:
    !wget -nc https://raw.githubusercontent.com/sinankalkan/CENG240/master/figures/ch10_example.csv
    
    # Import the necessary libraries
    import pandas as pd
    
    # Read the file named 'ch10_example.csv'
    df = pd.read_csv('ch10_example.csv')
    
    # Print the CSV file's contents:
    print("The CSV file contains the following:\n", df, "\n")
    
    # Check the types of each column
    df.dtypes


.. parsed-literal::
    :class: output

    --2021-06-21 20:22:24--  https://raw.githubusercontent.com/sinankalkan/CENG240/master/figures/ch10_example.csv
    Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.108.133, ...
    Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 162 [text/plain]
    Saving to: ‘ch10_example.csv’
    
    ch10_example.csv    100%[===================>]     162  --.-KB/s    in 0s      
    
    2021-06-21 20:22:25 (2.53 MB/s) - ‘ch10_example.csv’ saved [162/162]
    
    The CSV file contains the following:
           Name  Grade  Age
    0     Jack   40.2   20
    1   Amanda   30.0   25
    2     Mary   60.2   19
    3     John   85.0   30
    4    Susan   70.0   28
    5     Bill   58.0   28
    6     Jill   90.0   27
    7      Tom   90.0   24
    8    Jerry   72.0   26
    9   George   79.0   22
    10  Elaine   82.0   23 
    

.. parsed-literal::
    :class: output

    Name      object
    Grade    float64
    Age        int64
    dtype: object


Note the following critical details:

-  ``read_csv`` file automatically understood that our CSV file had a
   header (“Name”, “Grade” and “Age”). If your file does not have a
   header, you can call ``read_csv`` with the ``header`` parameter set
   to ``None`` as follows: ``pd.read_csv(filename, header=None)``.

-  ``read_csv`` read all columns in the CSV file. If you wish to load
   only some of the columns (e.g. ‘Name’, ‘Age’ in our example), you can
   relay this using the ``usecols`` parameter as follows:
   ``pd.read_csv(filename, usecols=[ 'Name', 'Age'])``.

**2. Convert Python data into a ``DataFrame``**. Alternatively, you can
have already a Python data object which you can provide as argument to a
``DataFrame`` constructor as illustrated with the following example:

.. code:: python

    lst = [('Jack', 40.2, 20), ('Amanda',	30, 25), ('Mary',	60.2, 19)]
    df = pd.DataFrame(data = lst, columns=['Name', 'Grade', 'Age'])
    print(df)


.. parsed-literal::
    :class: output

         Name  Grade  Age
    0    Jack   40.2   20
    1  Amanda   30.0   25
    2    Mary   60.2   19


In many cases, we will require the rows to be associated with names, or
sometimes called as keys. For example, instead of referring to a row as
“the row at index 1”, we might require accessing a row with a
non-integer value. This can be achieved as follows (note how the printed
DataFrame looks different):

.. code:: python

    names = ['Jack', 'Amanda', 'Mary']
    lst = [(40.2, 20), (30, 25), (60.2, 19)]
    df = pd.DataFrame(data = lst, index=names, columns=['Grade', 'Age'])
    print(df)


.. parsed-literal::
    :class: output

            Grade  Age
    Jack     40.2   20
    Amanda   30.0   25
    Mary     60.2   19


Alternatively, we can obtain a DataFrame from a dictionary, which might
be easier for us to create data column-wise – note that the column names
are obtained from the keys of the dictionary:

.. code:: python

    d = {'Grade': [40.2, 30, 60.2],
          'Age': [20, 25, 19]}
    names = ['Jack', 'Amanda', 'Mary']
    df = pd.DataFrame(data = d, index=names)
    print(df)


.. parsed-literal::
    :class: output

            Grade  Age
    Jack     40.2   20
    Amanda   30.0   25
    Mary     60.2   19


Accessing Data with DataFrames
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We can access data columnwise or row-wise.

**1. Columnwise access.**

For columnwise access, you can use ``df["Grade"]``, which returns a
sequence of values. To access a certain element in that sequence, you
can either use an integer index (as in ``df["Grade"][1]``) or named
index if it has been defined (as in ``df["Grade"]["Amanda"]``). The
following example illustrates this for the example ``DataFrame`` we have
created above:

.. code:: python

   >>> print(df)
            Grade  Age
   Jack     40.2   20
   Amanda   30.0   25
   Mary     60.2   19
   >>> print(df['Grade'][1])
   30.0
   >>> print(df['Grade']['Amanda'])
   30.0

**2. Rowwise access.**

To access a particular row, you can either use integer indexes with
``df.iloc[<row index>]`` or ``df.loc[<row name>]`` if a named index is
defined. This is illustrated below:

.. code:: python

   >>> print(df)
           Grade  Age
   Jack     40.2   20
   Amanda   30.0   25
   Mary     60.2   19
   >>> print(df.iloc[1])
   Grade    30.0
   Age      25.0
   Name: Amanda, dtype: float64
   >>> print(df.loc['Amanda'])
   Grade    30.0
   Age      25.0
   Name: Amanda, dtype: float64

You can also provide both column index and name index in a single
operation, i.e. ``[<row index>, <column index>]`` or
``[<row name>, <column name>]`` as illustrated below:

.. code:: python

   >>> df.loc['Amanda','Grade']
   30.0
   >>> df.iloc[1, 1]
   25

While accessing a ``DataFrame`` with integer indexes, you can also use
Python’s slicing; i.e. ``[start:end:step]`` indexing.

Modifying Data with DataFrames
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Although modifying data in a ``DataFrame`` is simple, you need to be
careful about one crucial concept: Let us say you use column and row
names to change the contents of a cell and you access first the column
and then the corresponding row as follows:

.. code:: python

   >>> print(df)
               Grade  Age
   Jack     40.2   20
   Amanda   30.0   25
   Mary     60.2   19
   >>> df['Grade']['Amanda'] = 45
   /usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:6: SettingWithCopyWarning:
   A value is trying to be set on a copy of a slice from a DataFrame
   See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

This is called *chained indexing* and Pandas will give you a warning
that you are modifying a ``DataFrame`` that was *returned* while
accessing columns or rows in your original ``DataFrame``:
``df['Grade']`` returns a direct access for the original ``Grade``
column or a copy to it and depending on how your data is structured and
stored in the memory, the end result may be different and your change
might not be reflected on the original ``DataFrame``.

Instead, you should use a single access operation as illustrated below
(you may do this with ``loc`` and ``iloc`` as well):

.. code:: python

   >>> print(df)
           Grade  Age
   Jack     40.2   20
   Amanda   30.0   25
   Mary     60.2   19
   >>> df.loc['Amanda','Grade'] = 45
   >>> df.iloc[1,1] = 30
   >>> print(df)
           Grade  Age
   Jack     40.2   20
   Amanda   45.0   30
   Mary     60.2   19

With these facilities, you can now access every item in a ``DataFrame``
and modify them.

Analyzing Data with DataFrames
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Once we have our data in a ``DataFrame``, we can use Pandas’s built-in
facilities for analyzing our data. One very simple way to analyze data
is via descriptive statistics, which you can access with
``df.describe()`` and ``df[<column>].value_counts()``:

.. code:: python

    print(df)
    df.describe()


.. parsed-literal::
    :class: output

            Grade  Age
    Jack     40.2   20
    Amanda   30.0   25
    Mary     60.2   19


.. raw:: html

    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }
    
        .dataframe tbody tr th {
            vertical-align: top;
        }
    
        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>Grade</th>
          <th>Age</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>count</th>
          <td>3.000000</td>
          <td>3.000000</td>
        </tr>
        <tr>
          <th>mean</th>
          <td>43.466667</td>
          <td>21.333333</td>
        </tr>
        <tr>
          <th>std</th>
          <td>15.362725</td>
          <td>3.214550</td>
        </tr>
        <tr>
          <th>min</th>
          <td>30.000000</td>
          <td>19.000000</td>
        </tr>
        <tr>
          <th>25%</th>
          <td>35.100000</td>
          <td>19.500000</td>
        </tr>
        <tr>
          <th>50%</th>
          <td>40.200000</td>
          <td>20.000000</td>
        </tr>
        <tr>
          <th>75%</th>
          <td>50.200000</td>
          <td>22.500000</td>
        </tr>
        <tr>
          <th>max</th>
          <td>60.200000</td>
          <td>25.000000</td>
        </tr>
      </tbody>
    </table>
    </div>


Apart from these descriptive functions, Pandas provides functions for
sorting (using ``.sort_values()`` function), finding the maximum or the
minimum (using ``.max()`` or ``.min()`` functions) or finding the
largest or smallest n values (using ``.nsmallest()`` or ``.nlargest()``
functions:

.. code:: python

    print("Maximum grade is: ", df['Grade'].max())
    print("\nRecords sorted according to age:\n", df.sort_values(by="Age"))
    print("\n\nTop two grades are:\n", df['Grade'].nlargest(2))


.. parsed-literal::
    :class: output

    Maximum grade is:  60.2
    
    Records sorted according to age:
             Grade  Age
    Mary     60.2   19
    Jack     40.2   20
    Amanda   30.0   25
    
    
    Top two grades are:
     Mary    60.2
    Jack    40.2
    Name: Grade, dtype: float64


Note that, the displayed output includes the names because we selected
names as the indices for accessing the rows.

Presenting Data in DataFrames
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Pandas provides a very easy mechanism for plotting your data via
``.plot()`` function. Examples are provided below to illustrate how you
can plot your own data. This plotting facility is provided by the
Matplotlib which we will see next. If you are not using Colab, you need
to use the following method to show the plot:
``df.plot().figure.show()``

.. code:: python

    #Plots all columns in different colours
    df.plot()


.. parsed-literal::
    :class: output

    <AxesSubplot:>


.. figure:: output_ch10_scientific_libraries_ea0544_39_1.png


.. code:: python

    # Plots a single column
    df['Age'].plot()


.. parsed-literal::
    :class: output

    <AxesSubplot:>


.. figure:: output_ch10_scientific_libraries_ea0544_40_1.png


Plotting data with Matplotlib
-----------------------------

Matplotlib is a very capable library for drawing different types of
plots in Python. It is very well integrated with Numpy, Scipy and
Pandas, and therefore, all these libraries are very frequently used
together seamlessly.

.. raw:: html

   <table bgcolor="pink" width="75%" align="center">

.. raw:: html

   <tr>

.. raw:: html

   </tr>

.. raw:: html

   <tr>

.. raw:: html

   <td>

Installation Notes

To be able to use the matplotlib library, you will need to download it
from `matplotlib.org <https://matplotlib.org/>`__ and install it on your
computer. If you are using a Python package manager (e.g. pip), you can
install it directly using: ``$ pip install matplotlib``. If you are
using a Windows/Mac machine, you should install
`anaconda <https://docs.anaconda.com/anaconda/install/>`__ first. If you
are using Colab or another Jupyter Notebook viewer, the platform may
already have matplotlib installed.

.. raw:: html

   </td>

.. raw:: html

   </tr>

.. raw:: html

   </table>

Parts of a Figure
~~~~~~~~~~~~~~~~~

A figure consists of the following elements (see also
:numref:`ch10_figure_anatomy`): - Title of the figure. - Axes,
together with their ticks, tick labels, and axis labels. - The canvas of
plot, which consists of a drawing of your data in the form of a dots
(scatter plot), lines (line plot), bars (bar plot), surfaces etc. -
Legend, which informs the perceiver about the different plots in the
canvas.

.. _ch10_figure_anatomy:

.. figure:: ../figures/ch10_figure_anatomy.png
   :width: 500px

   A figure consists of several components all of which you can change
   in matplotlib easily. Figure source: Matplotlib Usage Guides.


Preparing your Data for Plotting
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Matplotlib expects numpy arrays as input and therefore, if you have your
data in a numpy array, you can directly plot it without any data type
conversion. With pandas ``DataFrame``, the behavior is not guaranteed
and therefore, it is recommended that the values in a ``DataFrame`` are
first converted to a numpy array, using e.g.:

.. code:: python

    print(df)
    age_array = df['Age'].values
    print("The `Age` values in an array form are:", age_array)
    print("The type of our new data is: ", type(age_array))


.. parsed-literal::
    :class: output

            Grade  Age
    Jack     40.2   20
    Amanda   30.0   25
    Mary     60.2   19
    The `Age` values in an array form are: [20 25 19]
    The type of our new data is:  <class 'numpy.ndarray'>


Drawing Single Plots
~~~~~~~~~~~~~~~~~~~~

There are two ways to plot with matplotlib: In an object-oriented style
or the so-called Pyplot style:

**1. Drawing in an Object-Oriented Style**.

In this style, we create a figure object and an axes object and work
with those to create our plots. This is illustrated in the following
example:

.. code:: python

    import matplotlib.pyplot as plt
    import numpy as np
    
    # Uniformly sample 50 x values between -2 and 2:
    x = np.linspace(-2, 2, 50)
    
    # Create an empty figure
    fig, ax = plt.subplots()
    
    # Plot y = x
    ax.plot(x, x, label='$y=x$')
    
    # Plot y = x^2
    ax.plot(x, x**2, label='$y=x^2$')
    
    # Plot y = x^3
    ax.plot(x, x**3, label='$y=x^3$')
    
    # Set the labels for x and y axes:
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    
    # Set the title of the figure
    ax.set_title("Our First Plot -- Object-Oriented Style")
    
    # Create a legend
    ax.legend()
    
    # Show the plot
    # fig.show() # Uncomment if not using Colab


.. parsed-literal::
    :class: output

    <matplotlib.legend.Legend at 0x125532250>


.. figure:: output_ch10_scientific_libraries_ea0544_44_1.png


**2. Drawing in a Pyplot Style**. In the Pyplot style, we directly call
functions in the pyplot module (``matplotlib.pyplot``) to create a
figure and draw our plots. This style does not work with an explicit
figure or axes objects. This is illustrated in the following example:

.. code:: python

    # Uniformly sample 50 x values between -2 and 2:
    x = np.linspace(-2, 2, 50)
    
    # Plot y = x
    plt.plot(x, x, label='$y=x$')
    
    # Plot y = x^2
    plt.plot(x, x**2, label='$y=x^2$')
    
    # Plot y = x^3
    plt.plot(x, x**3, label='$y=x^3$')
    
    # Set the labels for x and y axes:
    plt.xlabel('x')
    plt.ylabel('y')
    
    # Set the title of the figure
    plt.title("Our First Plot -- Pyplot Style")
    
    # Create a legend
    plt.legend()
    
    # Show the plot
    #plt.show() # Uncomment if not using Colab


.. parsed-literal::
    :class: output

    <matplotlib.legend.Legend at 0x1256ab9a0>


.. figure:: output_ch10_scientific_libraries_ea0544_46_1.png


Drawing Multiple Plots in a Figure
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In many situations, you will need to draw multiple plots side by side in
a single figure. This can be performed in the object-oriented style
using the ``subplots()`` function to create a grid and then use the
created subplots and axes to draw the plots. This is illustrated with an
example below:

.. code:: python

    # Create a 2x2 grid of plots
    fig, axes = plt.subplots(2, 2)
    
    # Plot (1,1)
    axes[0,0].plot(x, x)
    axes[0,0].set_title("$y=x$")
    
    # Plot (1,2)
    axes[0,1].plot(x, x**2)
    axes[0,1].set_title("$y=x^2$")
    
    # Plot (2,1)
    axes[1,0].plot(x, x**3)
    axes[1,0].set_title("$y=x^3$")
    
    # Plot (2,2)
    axes[1,1].plot(x, x**4)
    axes[1,1].set_title("$y=x^4$")
    
    # Adjust vertical space between rows
    plt.subplots_adjust(hspace=0.5)
    
    # Show the plot
    #fig.show() # Uncomment if not using Colab


.. figure:: output_ch10_scientific_libraries_ea0544_48_0.png


It is possible to do this in the PyPlot style as well, as illustrated
below:

.. code:: python

    # Plot (1,1)
    plt.subplot(2, 2, 1)
    plt.plot(x, x)
    plt.title('$y=x$')
    
    # Plot (1,2)
    plt.subplot(2, 2, 2)
    plt.plot(x, x**2)
    plt.title('$y=x^2$')
    
    # Plot (2,1)
    plt.subplot(2, 2, 3)
    plt.plot(x, x**3)
    plt.title('$y=x^3$')
    
    # Plot (2,2)
    plt.subplot(2, 2, 4)
    plt.plot(x, x**4)
    plt.title('$y=x^4$')
    
    # Adjust vertical space between rows
    plt.subplots_adjust(hspace=0.5)
    
    # Show the plot
    #plt.show() # Uncomment if not using Colab


.. figure:: output_ch10_scientific_libraries_ea0544_50_0.png


Changing elements of a plot
~~~~~~~~~~~~~~~~~~~~~~~~~~~

All elements that are visualized in Figure 10.4.1 can be changed in
matplotlib. We will skip these details to keep our focus in the book to
the practical uses of these libraries. The interested reader can look up
the extensive documentation at
https://matplotlib.org/2.1.1/contents.html or look at the help page of a
function (e.g. ``help(plt.plot)``) to see how to modify all elements of
a figure.

Important Concepts
------------------

We would like our readers to have grasped the following crucial concepts
and keywords from this chapter:

-  NumPy arrays and their properties: array shape, dimensions, sizes,
   elements.
-  Accessing and modifying elements of a NumPy array.
-  Simple algebraic functions on NumPy arrays.
-  SciPy and its basic capabilities.
-  Pandas, DataFrame, loading files with Pandas.
-  Accessing and modifying content in DataFrames.
-  Analyzing and presenting data in DataFrames.
-  Matplotlib and different ways to make plots.
-  Drawing single and multiple plots. Changing elements of a plot.

Further Reading
---------------

-  NumPy documentation: https://numpy.org/doc/
-  SciPy documenation: https://docs.scipy.org/doc/scipy/reference/
-  Pandas documentation: https://pandas.pydata.org/docs/
-  Matplotlib documentation: https://matplotlib.org/2.1.1/contents.html
-  Introduction to Algebra with Python:
   https://pabloinsente.github.io/intro-linear-algebra

Exercises
---------

-  Define functions that work like the ``sum``, ``mean``, ``min`` and
   ``max`` operations provided by NumPy. These functions should take a
   single 2D array and return the result as a number. You can assume
   that the operation applies to the whole array and not to a single
   axis.

-  Create a simple CSV file using your favorite spreadsheet editor
   (e.g. Microsoft Excel or Google Spreadsheets) and create a file with
   your exams and their grades as two separate columns. Save the file,
   upload it to the Colab notebook and do the following:

   -  Load the file using Pandas.
   -  Calculate the mean of your exam grades.
   -  Calculate the standard deviation of your grades.

-  Using Matplotlib, generate the following plots with suitable names
   for the axes and the titles.

   -  Draw the following four functions in separate single plots:
      :math:`\sin(x), \cos(x), \tan(x), \cot(x)`.

   -  Draw these four functions in a single plot.

   -  Draw a multiple 2x2 plot where each subplot is one of the four
      functions.
	Grade	Age
count	3.000000	3.000000
mean	43.466667	21.333333
std	15.362725	3.214550
min	30.000000	19.000000
25%	35.100000	19.500000
50%	40.200000	20.000000
75%	50.200000	22.500000
max	60.200000	25.000000