Scientific and Engineering Libraries
====================================

*(C) Copyright Notice: This chapter is part of the book available at*\ https://pp4e-book.github.io/\ *and copying, distributing, modifying it requires explicit permission from the authors. See the book page for details:*\ https://pp4e-book.github.io/

In this chapter, we will cover several libraries that are very useful for scientific and engineering computing problems. To keep our focus on the practical usage of these libraries, and considering that this is an introductory textbook, the coverage in this chapter is not intended to be comprehensive.

Numerical Computing with NumPy
------------------------------

NumPy can be considered a library for working with vectors and matrices. NumPy refers to vectors and matrices as *arrays*.

**Installation Notes**. To be able to use the NumPy library, you will need to download it from numpy.org and install it on your computer. If you are using a Python package manager (e.g. pip), you can install it directly using: ``$ pip install numpy``. If you are using a Windows/Mac machine, you should install Anaconda first. If you are using Colab or another Jupyter Notebook viewer, the platform may already have NumPy installed.
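A quick way to check that the installation works is to import the library and print its version. The following is a minimal sketch; the variable name ``numpy_version`` is our own choice, and the version reported on your machine will depend on your installation:

.. code:: python

   import numpy as np  # conventional short alias for NumPy

   # Every NumPy installation exposes its version as a string
   numpy_version = np.__version__
   print("NumPy version:", numpy_version)

If the import fails with a ``ModuleNotFoundError``, the library is not installed in the Python environment that you are running.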
Arrays and Their Basic Properties
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Let us consider a simple vector and a matrix:

.. math::

   array1 = \begin{pmatrix} 1 & 2 & 3 \\ \end{pmatrix}

and

.. math::

   array2 = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix}

Let us see how we can represent and work with these two arrays in NumPy:

.. code:: python

   >>> import numpy as np  # Import the NumPy library
   >>> array1 = np.array([1, 2, 3])
   >>> array2 = np.array([[1, 2, 3], [4, 5, 6]])
   >>> type(array1)
   <class 'numpy.ndarray'>
   >>> type(array2)
   <class 'numpy.ndarray'>
   >>> array1
   array([1, 2, 3])
   >>> array2
   array([[1, 2, 3],
          [4, 5, 6]])
   >>> print(array2)
   [[1 2 3]
    [4 5 6]]

We see from this example that we can pass lists of numbers as arguments to the ``np.array`` function, which creates a NumPy array for us. If the argument is a nested list, each element of the list is used as a row of a 2D array.

Arrays can contain any data type as elements; however, we will limit ourselves to numbers (integers and real numbers) in this chapter.

**Shapes, Dimensions and Number of Elements of Arrays**. The first thing we can do with a NumPy array is check its shape. For our example ``array1`` and ``array2``, we can do so as follows:

.. code:: python

   >>> array1.shape
   (3,)
   >>> array2.shape
   (2, 3)

where we can see that ``array1`` is a one-dimensional array with :math:`3` elements and ``array2`` is a :math:`2\times 3` array (a 2D matrix).

For a 2D array, a shape value ``(R, C)`` denotes the number of rows first (``R``), which is sometimes also called the first dimension, and then the number of columns (``C``), which is the second dimension. For :math:`n`\ D arrays with :math:`n>2`, the meaning of the shape values is the same except that there are :math:`n` values in the shape.

In NumPy, we can easily change the shape of an array without losing content, as long as the total number of elements stays the same:

.. code:: python

   >>> array1.reshape((3,1))
   array([[1],
          [2],
          [3]])
   >>> array2.reshape((1,6))
   array([[1, 2, 3, 4, 5, 6]])

For many applications, we will need to access the number of dimensions of an array. For this purpose, we can use the ``.ndim`` attribute:

.. code:: python

   >>> array1.ndim
   1
   >>> array2.ndim
   2

The number of elements in an array is another important value that we are frequently interested in. To access that, we can use the ``.size`` attribute:

.. code:: python

   >>> array1.size
   3
   >>> array2.size
   6

**Accessing elements in arrays**. NumPy allows the same indexing mechanisms that you can use with Python’s native container data types. Let us look at some examples:

.. code:: python

   >>> array1[-1]
   3
   >>> array2[1][2]
   6
   >>> array2[-1]
   array([4, 5, 6])

**Creating arrays**. We have already seen that we can create arrays using the ``np.array()`` function. However, there are other ways of creating arrays that conform to predefined specifications. We can, for example, create arrays filled with zeros or ones (note that the argument is a tuple describing the shape of the array):

.. code:: python

   >>> np.zeros((3, 4))
   array([[0., 0., 0., 0.],
          [0., 0., 0., 0.],
          [0., 0., 0., 0.]])
   >>> np.ones((2,6))
   array([[1., 1., 1., 1., 1., 1.],
          [1., 1., 1., 1., 1., 1.]])

Alternatively, we can create an array filled with a range of values using the ``np.arange()`` function:

.. code:: python

   >>> np.arange(1,10)  # 1: starting value. 10: ending value (excluded)
   array([1, 2, 3, 4, 5, 6, 7, 8, 9])
   >>> np.arange(1,10,2)  # 2: step size
   array([1, 3, 5, 7, 9])
   >>> np.arange(1,10).reshape((3,3))
   array([[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]])

Working with Arrays
~~~~~~~~~~~~~~~~~~~

The previous section covered how we can access the elements and the properties of an array. Now let us see the different types of operations we can perform with arrays.

**Arithmetic, Relational and Membership Operations with Arrays**

The arithmetic operations (``+``, ``-``, ``*``, ``/``, ``**``), the relational operations (``==``, ``<``, ``<=``, ``>``, ``>=``) and the membership operations (``in``, ``not in``) that we can apply on numbers and other data types in Python can also be applied on arrays with NumPy.
Arithmetic and relational operations are performed elementwise. This means that the arrays provided as operands to a binary operator generally need to have the same shape (NumPy relaxes this requirement in certain cases through a mechanism called broadcasting, which we do not cover here). Let us see some examples for arithmetic operations:

.. code:: python

   >>> A = np.arange(4).reshape((2,2))
   >>> A
   array([[0, 1],
          [2, 3]])
   >>> B = np.arange(4, 8).reshape((2,2))
   >>> B
   array([[4, 5],
          [6, 7]])
   >>> print(B-A)
   [[4 4]
    [4 4]]
   >>> print(B+A)
   [[ 4  6]
    [ 8 10]]
   >>> A
   array([[0, 1],
          [2, 3]])
   >>> B
   array([[4, 5],
          [6, 7]])

Note here that these operations create a new array whose elements are the results of applying the operation. Therefore, the original arrays are not modified. If you are interested in in-place operations that modify an existing array during the operation, you can use combined statements such as ``+=``, ``-=``, ``*=``.

Relational operations are also applied elementwise and yield an array of Boolean values. Membership operations, in contrast, check whether a value is equal to any element of the array and yield a single Boolean value. We can easily anticipate the outcomes of such operations, e.g. as follows:

.. code:: python

   >>> A < B
   array([[ True,  True],
          [ True,  True]])
   >>> B > A
   array([[ True,  True],
          [ True,  True]])
   >>> A > B
   array([[False, False],
          [False, False]])
   >>> 4 in B
   True
   >>> 10 in B
   False

**Useful Functions**

NumPy provides several useful functions for arrays. These include:

- Standard mathematical functions such as exponent, sine, cosine and square root: ``np.exp()``, ``np.sin()``, ``np.cos()``, ``np.sqrt()``.
- Minimum and maximum: ``.min()`` and ``.max()``.
- Summation, mean and standard deviation: ``.sum()``, ``.mean()`` and ``.std()``.

Note that minimum, maximum, summation, mean and standard deviation can be applied on the whole array as well as along a pre-specified dimension (specified with an ``axis`` parameter). Let us see some examples to clarify this important aspect:

.. code:: python

   >>> A
   array([[0, 1],
          [2, 3]])
   >>> A.sum()
   6
   >>> A.sum(axis=0)
   array([2, 4])
   >>> A.sum(axis=1)
   array([1, 5])
   >>> A.sum(axis=2)
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/usr/local/lib/python3.7/site-packages/numpy/core/_methods.py", line 47, in _sum
       return umr_sum(a, axis, dtype, out, keepdims, initial, where)
   numpy.AxisError: axis 2 is out of bounds for array of dimension 2

where we see that axes are numbered starting from zero: ``axis=0`` sums along the first dimension (yielding one sum per column) and ``axis=1`` sums along the second dimension (yielding one sum per row).

**Splitting and Combining Arrays**

For many problems, we will need to split an array into multiple arrays or combine multiple arrays into one. For splitting arrays, we can use functions such as ``np.hsplit`` (for horizontal split), ``np.vsplit`` (for vertical split) and ``np.array_split`` (for more general split operations). Below is an example for ``hsplit`` and ``vsplit``:

.. code:: python

   >>> L = np.arange(16).reshape(4,4)
   >>> L
   array([[ 0,  1,  2,  3],
          [ 4,  5,  6,  7],
          [ 8,  9, 10, 11],
          [12, 13, 14, 15]])
   >>> np.hsplit(L,2)  # Divide L into 2 arrays along the horizontal axis
   [array([[ 0,  1],
          [ 4,  5],
          [ 8,  9],
          [12, 13]]), array([[ 2,  3],
          [ 6,  7],
          [10, 11],
          [14, 15]])]
   >>> np.vsplit(L,2)
   [array([[0, 1, 2, 3],
          [4, 5, 6, 7]]), array([[ 8,  9, 10, 11],
          [12, 13, 14, 15]])]

Note that the resultant arrays are provided as a Python ``list``. Note also that, when given a number of sections, the ``hsplit`` and ``vsplit`` functions require that the array can be split into equally sized arrays; for general split operations, you can use ``array_split``.

For combining multiple arrays, we can use ``np.hstack`` (for horizontal stacking), ``np.vstack`` (for vertical stacking) and ``np.stack`` (for more general stacking operations). Below is an example (for the ``A`` and ``B`` arrays that we have created before):

.. code:: python

   >>> A
   array([[0, 1],
          [2, 3]])
   >>> B
   array([[4, 5],
          [6, 7]])
   >>> np.hstack((A, B))
   array([[0, 1, 4, 5],
          [2, 3, 6, 7]])
   >>> np.vstack((A, B))
   array([[0, 1],
          [2, 3],
          [4, 5],
          [6, 7]])

**Iterations with Arrays**.
The NumPy library already provides many of the functionalities that we might need while working with arrays. However, in many circumstances those will not be sufficient, and we will need to iterate over the elements of arrays and perform custom algorithmic steps. Luckily, iteration with arrays is very similar to how we would perform iterations with other container data types in Python. Let us see some examples:

.. code:: python

   >>> L
   array([[ 0,  1,  2,  3],
          [ 4,  5,  6,  7],
          [ 8,  9, 10, 11],
          [12, 13, 14, 15]])
   >>> for r in L:
   ...     print("row: ", r)
   ...
   row:  [0 1 2 3]
   row:  [4 5 6 7]
   row:  [ 8  9 10 11]
   row:  [12 13 14 15]

where we see that iteration over a multi-dimensional array iterates over the first dimension. To iterate over each element, we have at least two options:

.. code:: python

   >>> for element in L.flat:
   ...     print(element)
   ...
   0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
   >>> for r in L:
   ...     for element in r:
   ...         print(element)
   ...
   0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

(each element is printed on its own line; the output is condensed here to save space).

Now let us implement one of the operations we have seen above in Python from scratch as an exercise:

.. code:: python

   import numpy as np

   def horizontal_stack(A, B):
       """A function that combines two 2D arrays A and B.
       A and B need to have the same height.
       """
       # Let us get dimensions first and check whether A and B
       # are compatible for stacking.
       (H_A, W_A) = A.shape
       (H_B, W_B) = B.shape
       if H_A != H_B:
           print("Arguments A and B have incompatible heights!")
           return None

       # Let us create an empty `result` array:
       H_result = H_A  # or H_B
       W_result = W_A + W_B
       result = np.zeros((H_result, W_result))

       # Now let us iterate over each position in A and B and place
       # their elements into the corresponding positions
       for i in range(H_A):
           for j in range(W_A):
               result[i][j] = A[i][j]
           for j in range(W_B):
               result[i][j+W_A] = B[i][j]

       return result

   # Let us test our code:
   M = np.random.randn(2,2)  # Create a random 2x2 array
   N = np.random.randn(2,3)  # Create a random 2x3 array
   print("Random array M is:\n", M)
   print("Random array N is:\n", N)
   print("Horizontal stacking of M and N yields:\n", horizontal_stack(M, N))

.. parsed-literal::
    :class: output

    Random array M is:
     [[ 0.39521366  1.41939618]
     [-0.13770429  1.19319949]]
    Random array N is:
     [[-1.61066212e-01  8.15354855e-01  7.19234023e-01]
     [ 2.65985989e-04 -2.89111500e-01  4.52781848e-02]]
    Horizontal stacking of M and N yields:
     [[ 3.95213656e-01  1.41939618e+00 -1.61066212e-01  8.15354855e-01
       7.19234023e-01]
     [-1.37704293e-01  1.19319949e+00  2.65985989e-04 -2.89111500e-01
       4.52781848e-02]]

Linear Algebra with NumPy
~~~~~~~~~~~~~~~~~~~~~~~~~

Now let us give a flavour of the linear algebra operations provided by NumPy. Note that some of these operations are provided in the module ``numpy.linalg``.

**Transpose**. The transpose operation flips a matrix along the diagonal. For an example matrix :math:`A`,

.. math::

   A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}

its transpose is the following:

.. math::

   A^T = \begin{pmatrix} 1 & 3 \\ 2 & 4 \\ \end{pmatrix}

In NumPy, the transpose of a matrix can be easily obtained through its ``.T`` attribute or by calling its ``.transpose()`` method:

.. code:: python

   >>> A
   array([[1, 2],
          [3, 4]])
   >>> A.T
   array([[1, 3],
          [2, 4]])
   >>> A
   array([[1, 2],
          [3, 4]])

Note that ``.T`` is an attribute of the array object and not a method; therefore, it is used without parentheses. Try finding the transpose of ``A`` with ``A.transpose()`` and see that you obtain the expected result.

Note also that the transpose operation does not change the original array.

**Inverse**. The inverse of an :math:`n\times n` matrix :math:`A` is an :math:`n\times n` matrix :math:`A^{-1}` that yields an :math:`n\times n` identity matrix :math:`I` when multiplied with :math:`A`:

.. math:: A \times A^{-1} = I.

In NumPy, we can use ``np.linalg.inv()`` to find the inverse of a square array. Here is an example:

.. code:: python

   >>> A
   array([[1, 2],
          [3, 4]])
   >>> A_inv = np.linalg.inv(A)
   >>> A_inv
   array([[-2. ,  1. ],
          [ 1.5, -0.5]])

Let us check whether the inverse was correctly calculated:

.. code:: python

   >>> np.matmul(A, A_inv)
   array([[1.0000000e+00, 0.0000000e+00],
          [8.8817842e-16, 1.0000000e+00]])

which is the identity matrix (:math:`I`) except for some rounding error at index :math:`(1,0)` owing to floating point approximation. As we have seen in Chapter 2, when writing our programs, we should compare floating point numbers with a tolerance (e.g. using ``np.allclose()``) so that a value such as ``8.8817842e-16`` is considered to be equal to zero.

**Determinant, norm, rank, condition number, trace**

While working with matrices, we often require the following properties, which can be easily calculated.
For the sake of brevity and to keep the focus, we will omit their explanations:

=========================================== ============================
Matrix Property                             How to Calculate with NumPy
=========================================== ============================
Determinant (:math:`|A|` or :math:`det(A)`) ``np.linalg.det(A)``
Norm (:math:`||A||`)                        ``np.linalg.norm(A)``
Rank (:math:`rank(A)`)                      ``np.linalg.matrix_rank(A)``
Condition number (:math:`\kappa(A)`)        ``np.linalg.cond(A)``
Trace (:math:`tr(A)`)                       ``np.trace(A)``
=========================================== ============================

**Dot Product, Inner Product, Outer Product, Matrix Multiplication**

The product of two arrays can be calculated in different ways:

- ``np.dot(a, b)``: For vectors (1D arrays), this is the dot product (i.e. :math:`\sum_i \mathbf{a}_i \times \mathbf{b}_i` for two vectors :math:`\mathbf{a}` and :math:`\mathbf{b}`). For 2D arrays, the result is matrix multiplication. For arrays with more dimensions, the result is a sum product over the last axis of ``a`` and the second-to-last axis of ``b``.

- ``np.inner(a, b)``: For vectors, this is the dot product (like ``np.dot(a, b)``). For higher-dimensional arrays, the result is a sum product over the last axes. Consider the following example for clarification on this:

  .. code:: python

     >>> A
     array([[0, 1],
            [2, 3]])
     >>> B
     array([[4, 5],
            [6, 7]])
     >>> np.inner(A,B)
     array([[ 5,  7],
            [23, 33]])

  The last axes for both ``A`` and ``B`` are the horizontal axes. Therefore, each row of ``A`` (i.e. ``[0, 1]`` and ``[2, 3]``) is multiplied elementwise with each row of ``B`` (``[4, 5]`` and ``[6, 7]``) and the products are summed; e.g. the entry ``23`` is obtained as ``2*4 + 3*5``.

- ``np.outer(a, b)``: Outer product is defined on vectors such that ``result[i,j] = a[i] * b[j]``.

- ``np.matmul(a, b)``: Matrix multiplication of matrices ``a`` and ``b``. The result is simply calculated as :math:`\textrm{result}_{ij} = \sum_k \mathbf{a}_{ik} \mathbf{b}_{kj}`, as also illustrated in :numref:`ch10_matrix_multiplication`.

.. _ch10_matrix_multiplication:

.. figure:: ../figures/ch10_matrix_multiplication.png
   :width: 500px

   Illustration of matrix multiplication of two matrices A and B. (Figure source: Wikipedia)

**Eigenvectors and Eigenvalues**

Explaining eigenvectors and eigenvalues is beyond the scope of the book. However, we can briefly recall that eigenvectors and eigenvalues are the variables that satisfy the following for a matrix :math:`A`, which can be considered as a transformation in a vector space:

.. math:: Av = \lambda v.

In other words, :math:`v` is such a vector that :math:`A` just changes its scale by a factor :math:`\lambda`.

In NumPy, eigenvectors and eigenvalues of a general square matrix can be obtained with ``np.linalg.eig()``; for symmetric matrices, ``np.linalg.eigh()`` can be used instead. Both return a tuple with the following two elements:

1. Eigenvalues (an array of :math:`\lambda`). ``np.linalg.eigh()`` returns them in ascending order, whereas ``np.linalg.eig()`` does not guarantee any particular order.
2. Eigenvectors (:math:`v`) as the columns of a matrix, i.e. ``v[:, i]`` is the eigenvector corresponding to the eigenvalue :math:`\lambda[i]`.

**Matrix Decompositions**

NumPy also provides many frequently used matrix decompositions, as listed below:

============================ ===========================
Matrix Decomposition         How to Calculate with NumPy
============================ ===========================
Cholesky decomposition       ``np.linalg.cholesky(A)``
QR factorization             ``np.linalg.qr(A)``
Singular Value Decomposition ``np.linalg.svd(A)``
============================ ===========================

**Solve a linear system of equations**

Consider a linear system of equations as follows:

.. math::

   a_{11} x_1 + a_{12} x_2 + ... + a_{1n} x_n = b_1,\\
   a_{21} x_1 + a_{22} x_2 + ... + a_{2n} x_n = b_2,\\
   ... \\
   a_{n1} x_1 + a_{n2} x_2 + ... + a_{nn} x_n = b_n,

which can be represented in compact form as:

.. math:: \mathbf{a} \mathbf{x} = \mathbf{b}.

Such a system can be solved in NumPy using ``np.linalg.solve(a, b)``. Below is an easy example:

.. code:: python

   >>> A
   array([[0, 1],
          [2, 3]])
   >>> B
   array([3, 4])
   >>> np.linalg.solve(A,B)
   array([-2.5,  3. ])
   >>> X = np.linalg.solve(A,B)
   >>> X
   array([-2.5,  3. ])

which can be easily verified using multiplication:

.. code:: python

   >>> np.matmul(A, X)
   array([3., 4.])

which is equal to ``B``.

Why Use NumPy? Efficiency Benefits
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The previous sections have illustrated how capable the NumPy library is and how easily we can do many calculations with its pre-defined functionalities, e.g. in its linear algebra module. Since we can access each element of an array using Python’s indexing mechanisms, we can be tempted to implement those functionalities ourselves, from scratch.

However, those functionalities in NumPy are *not* implemented in Python but in C, another high-level programming language where programs are directly compiled into machine executable binary code. Therefore, they execute much faster than what we would implement in Python.

The following example illustrates this. When executed, you should see a difference on the order of 1000 times in running time. Therefore, you are strongly advised, whenever possible, to use NumPy’s existing functions and routines, and to write every operation in vector or matrix form as much as possible if you are working with matrices.

.. code:: python

   import numpy as np
   from time import time

   def matmul_2D(M, N):
       """Custom defined matrix multiplication for two 2D matrices M and N"""
       (H_M, W_M) = M.shape
       (H_N, W_N) = N.shape
       if W_M != H_N:
           print("Dimensions of M and N mismatch!")
           return None

       result = np.zeros((H_M, W_N))
       for i in range(H_M):
           for j in range(W_N):
               for k in range(W_M):
                   result[i][j] += M[i][k] * N[k][j]
       return result

   # First let us check that our code works as expected
   M = np.random.randn(2,3)
   N = np.random.randn(3,4)
   print("Our matmul_2D result: \n", matmul_2D(M, N))
   print("Correct result: \n", np.matmul(M, N))

   # Now let us measure the running-time performances
   # Create two 2D large matrices
   M = np.random.randn(100, 100)
   N = np.random.randn(100, 100)

   # Option 1: Use NumPy's matrix multiplication
   t1 = time()  # time() returns the current time in seconds
   result = np.matmul(M, N)
   t2 = time()
   print("NumPy's matmul took ", t2-t1, "seconds.")

   # Option 2: Use our matmul_2D function
   t1 = time()
   result = matmul_2D(M, N)
   t2 = time()
   print("Our matmul_2D function took ", t2-t1, "seconds.")

.. parsed-literal::
    :class: output

    Our matmul_2D result: 
     [[-0.24208963  1.18790239 -1.5397112  -1.14521578]
     [-2.00697411  4.23675778 -3.74909068 -0.23854542]]
    Correct result: 
     [[-0.24208963  1.18790239 -1.5397112  -1.14521578]
     [-2.00697411  4.23675778 -3.74909068 -0.23854542]]
    NumPy's matmul took  0.00040411949157714844 seconds.
    Our matmul_2D function took  1.0593342781066895 seconds.

Scientific Computing with SciPy
-------------------------------

SciPy is a library that includes many methods and facilities for scientific computing. It is built on top of NumPy and works directly with NumPy arrays; therefore, NumPy needs to be installed to be able to use SciPy.
**Installation Notes**. To be able to use the SciPy library, you will need to download it from scipy.org and install it on your computer. If you are using a Python package manager (e.g. pip), you can install it directly using: ``$ pip install scipy``. If you are using a Windows/Mac machine, you should install Anaconda first. If you are using Colab or another Jupyter Notebook viewer, the platform may already have SciPy installed.
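As a small taste of how SciPy routines are used together with NumPy, the following minimal sketch numerically integrates :math:`\sin(x)` over :math:`[0, \pi]` with ``scipy.integrate.quad``; the exact value of this integral is :math:`2`. The variable names ``value`` and ``abs_error`` are our own choices for illustration, and the sketch assumes that both NumPy and SciPy are installed:

.. code:: python

   import numpy as np
   from scipy import integrate

   # quad() returns the estimated value of the integral together with
   # an estimate of the absolute error of that value
   value, abs_error = integrate.quad(np.sin, 0, np.pi)

   print("Integral of sin(x) over [0, pi]:", value)
   print("Estimated absolute error:", abs_error)

Note that the function being integrated is NumPy's ``np.sin``, which is passed directly to the SciPy routine.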
Below is a list of some modules provided by SciPy. Even a brief coverage of these modules is not feasible in a single chapter. However, we will see an example in Chapter 12 using SciPy’s ``stats`` and ``optimize`` modules.

=========== ======================================================
Module      Description
=========== ======================================================
cluster     Clustering algorithms
constants   Physical and mathematical constants
fftpack     Fast Fourier Transform routines
integrate   Integration and ordinary differential equation solvers
interpolate Interpolation and smoothing splines
io          Input and Output
linalg      Linear algebra
ndimage     N-dimensional image processing
odr         Orthogonal distance regression
optimize    Optimization and root-finding routines
signal      Signal processing
sparse      Sparse matrices and associated routines
spatial     Spatial data structures and algorithms
special     Special functions
stats       Statistical distributions and functions
=========== ======================================================

Data handling & analysis with Pandas
------------------------------------

Pandas is a very handy library for working with files in different formats and for analyzing the data stored in them. To keep things simple, in this section we will just look at CSV files. Note that the facilities for other formats are very similar.
**Installation Notes**. To be able to use the Pandas library, you will need to download it from pandas.pydata.org and install it on your computer. If you are using a Python package manager (e.g. pip), you can install it directly using: ``$ pip install pandas``. If you are using a Windows/Mac machine, you should install Anaconda first. If you are using Colab or another Jupyter Notebook viewer, the platform may already have Pandas installed.
Supported File Formats
~~~~~~~~~~~~~~~~~~~~~~

As we have seen before, Python already provides facilities for reading and writing files. However, if you need to work with “structured” files such as CSV, XML, XLSX, JSON or HTML, the native facilities of Python would need to be extended significantly. That’s where Pandas comes in. It complements Python’s native facilities so that we can work with structured files, both for reading data from them and for creating new files.

Below is a list of the different file formats that Pandas supports and the corresponding functions for reading or writing them.

*Table 10.1. The file formats supported by the Pandas library. The table is adapted from the Pandas IO Reference, which the reader is encouraged to check for a more up-to-date list.*
=========== ===================== ============== ============
Format Type Data Description      Reader         Writer
=========== ===================== ============== ============
text        CSV                   read_csv       to_csv
text        Fixed-Width Text File read_fwf       -
text        JSON                  read_json      to_json
text        HTML                  read_html      to_html
text        Local clipboard       read_clipboard to_clipboard
-           MS Excel              read_excel     to_excel
binary      OpenDocument          read_excel     -
binary      HDF5 Format           read_hdf       to_hdf
binary      Feather Format        read_feather   to_feather
binary      Parquet Format        read_parquet   to_parquet
binary      ORC Format            read_orc       -
binary      Msgpack               read_msgpack   to_msgpack
binary      Stata                 read_stata     to_stata
binary      SAS                   read_sas       -
binary      SPSS                  read_spss      -
binary      Python Pickle Format  read_pickle    to_pickle
SQL         SQL                   read_sql       to_sql
SQL         Google BigQuery       read_gbq       to_gbq
=========== ===================== ============== ============

Data Frames
~~~~~~~~~~~

The Pandas library is built around a data type called ``DataFrame``, which is used to store all types of data while working with Pandas. In the following, we will illustrate how you can get your data into a ``DataFrame`` object:

**1. Loading data from a file**. If you have your data in a file that is supported by Pandas, you can use the corresponding reader to directly obtain a ``DataFrame`` object. Let us look at an example CSV file:

.. code:: python

   # Download an example CSV file (the leading `!` runs a shell
   # command in a Colab/Jupyter notebook):
   !wget -nc https://raw.githubusercontent.com/sinankalkan/CENG240/master/figures/ch10_example.csv

   # Import the necessary libraries
   import pandas as pd

   # Read the file named 'ch10_example.csv'
   df = pd.read_csv('ch10_example.csv')

   # Print the CSV file's contents:
   print("The CSV file contains the following:\n", df, "\n")

   # Check the types of each column
   df.dtypes

.. parsed-literal::
    :class: output

    --2021-06-21 20:22:24--  https://raw.githubusercontent.com/sinankalkan/CENG240/master/figures/ch10_example.csv
    Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.108.133, ...
    Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 162 [text/plain]
    Saving to: ‘ch10_example.csv’

    ch10_example.csv    100%[===================>]     162  --.-KB/s    in 0s

    2021-06-21 20:22:25 (2.53 MB/s) - ‘ch10_example.csv’ saved [162/162]

    The CSV file contains the following:
           Name  Grade  Age
    0     Jack   40.2   20
    1   Amanda   30.0   25
    2     Mary   60.2   19
    3     John   85.0   30
    4    Susan   70.0   28
    5     Bill   58.0   28
    6     Jill   90.0   27
    7      Tom   90.0   24
    8    Jerry   72.0   26
    9   George   79.0   22
    10  Elaine   82.0   23

.. parsed-literal::
    :class: output

    Name      object
    Grade    float64
    Age        int64
    dtype: object

Note the following critical details:

- The ``read_csv`` function automatically understood that our CSV file had a header (“Name”, “Grade” and “Age”). If your file does not have a header, you can call ``read_csv`` with the ``header`` parameter set to ``None`` as follows: ``pd.read_csv(filename, header=None)``.

- ``read_csv`` read all columns in the CSV file. If you wish to load only some of the columns (e.g. ‘Name’, ‘Age’ in our example), you can specify this using the ``usecols`` parameter as follows: ``pd.read_csv(filename, usecols=['Name', 'Age'])``.

**2. Converting Python data into a DataFrame**. Alternatively, you may already have your data in a Python object, which you can provide as an argument to the ``DataFrame`` constructor, as illustrated with the following example:

.. code:: python

   lst = [('Jack', 40.2, 20), ('Amanda', 30, 25), ('Mary', 60.2, 19)]
   df = pd.DataFrame(data = lst, columns=['Name', 'Grade', 'Age'])
   print(df)

.. parsed-literal::
    :class: output

         Name  Grade  Age
    0    Jack   40.2   20
    1  Amanda   30.0   25
    2    Mary   60.2   19

In many cases, we will require the rows to be associated with names, which are sometimes called keys. For example, instead of referring to a row as “the row at index 1”, we might require accessing a row with a non-integer value. This can be achieved as follows (note how the printed DataFrame looks different):

.. code:: python

   names = ['Jack', 'Amanda', 'Mary']
   lst = [(40.2, 20), (30, 25), (60.2, 19)]
   df = pd.DataFrame(data = lst, index=names, columns=['Grade', 'Age'])
   print(df)

.. parsed-literal::
    :class: output

            Grade  Age
    Jack     40.2   20
    Amanda   30.0   25
    Mary     60.2   19

Alternatively, we can obtain a DataFrame from a dictionary, which might be easier when we create data column-wise. Note that the column names are obtained from the keys of the dictionary:

.. code:: python

   d = {'Grade': [40.2, 30, 60.2], 'Age': [20, 25, 19]}
   names = ['Jack', 'Amanda', 'Mary']
   df = pd.DataFrame(data = d, index=names)
   print(df)

.. parsed-literal::
    :class: output

            Grade  Age
    Jack     40.2   20
    Amanda   30.0   25
    Mary     60.2   19

Accessing Data with DataFrames
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We can access data columnwise or row-wise.

**1. Columnwise access.** For columnwise access, you can use ``df["Grade"]``, which returns a sequence of values (a Pandas ``Series``). To access a certain element in that sequence, you can either use an integer position (as in ``df["Grade"].iloc[1]``) or a named index if it has been defined (as in ``df["Grade"]["Amanda"]``). Older Pandas versions also accept ``df["Grade"][1]`` for positional access when a named index is defined, but this form is deprecated in recent versions. The following example illustrates this for the example ``DataFrame`` we have created above:

.. code:: python

   >>> print(df)
           Grade  Age
   Jack     40.2   20
   Amanda   30.0   25
   Mary     60.2   19
   >>> print(df['Grade'].iloc[1])
   30.0
   >>> print(df['Grade']['Amanda'])
   30.0

**2. Rowwise access.** To access a particular row, you can either use integer indexes with ``df.iloc[]``, or use ``df.loc[]`` if a named index is defined. This is illustrated below:

.. code:: python

   >>> print(df)
           Grade  Age
   Jack     40.2   20
   Amanda   30.0   25
   Mary     60.2   19
   >>> print(df.iloc[1])
   Grade    30.0
   Age      25.0
   Name: Amanda, dtype: float64
   >>> print(df.loc['Amanda'])
   Grade    30.0
   Age      25.0
   Name: Amanda, dtype: float64

You can also provide both the row and the column in a single operation, i.e. ``df.loc[<row name>, <column name>]`` or ``df.iloc[<row index>, <column index>]``, as illustrated below:

.. code:: python

   >>> df.loc['Amanda','Grade']
   30.0
   >>> df.iloc[1, 1]
   25

While accessing a ``DataFrame`` with integer indexes, you can also use Python’s slicing, i.e. ``[start:end:step]`` indexing.

Modifying Data with DataFrames
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Although modifying data in a ``DataFrame`` is simple, you need to be careful about one crucial concept. Let us say you use column and row names to change the contents of a cell, and you access first the column and then the corresponding row as follows:

.. code:: python

   >>> print(df)
           Grade  Age
   Jack     40.2   20
   Amanda   30.0   25
   Mary     60.2   19
   >>> df['Grade']['Amanda'] = 45
   /usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:6: SettingWithCopyWarning:
   A value is trying to be set on a copy of a slice from a DataFrame

   See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

This is called *chained indexing* and Pandas will give you a warning that you are modifying a ``DataFrame`` that was *returned* while accessing columns or rows in your original ``DataFrame``: ``df['Grade']`` may return either direct access to the original ``Grade`` column (a view) or a copy of it. Depending on how your data is structured and stored in memory, the end result may differ, and your change might not be reflected in the original ``DataFrame``.

Instead, you should use a single access operation, with either ``loc`` or ``iloc``, as illustrated below:

.. code:: python

   >>> print(df)
           Grade  Age
   Jack     40.2   20
   Amanda   30.0   25
   Mary     60.2   19
   >>> df.loc['Amanda','Grade'] = 45
   >>> df.iloc[1,1] = 30
   >>> print(df)
           Grade  Age
   Jack     40.2   20
   Amanda   45.0   30
   Mary     60.2   19

With these facilities, you can now access every item in a ``DataFrame`` and modify it.

Analyzing Data with DataFrames
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Once we have our data in a ``DataFrame``, we can use Pandas’s built-in facilities for analyzing our data.
One very simple way to analyze data is via descriptive statistics, which you can access with ``df.describe()`` and ``df[<column name>].value_counts()``:

.. code:: python

   print(df)
   df.describe()

.. parsed-literal::
    :class: output

            Grade  Age
    Jack     40.2   20
    Amanda   30.0   25
    Mary     60.2   19

.. parsed-literal::
    :class: output

               Grade        Age
    count   3.000000   3.000000
    mean   43.466667  21.333333
    std    15.362725   3.214550
    min    30.000000  19.000000
    25%    35.100000  19.500000
    50%    40.200000  20.000000
    75%    50.200000  22.500000
    max    60.200000  25.000000
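The ``value_counts()`` function mentioned above counts how many times each distinct value occurs in a column. Below is a minimal sketch; since every value in our small example is unique, it uses a separate, made-up ``DataFrame`` (named ``df_ages`` here purely for illustration) in which some ages are repeated:

.. code:: python

   import pandas as pd

   # A made-up DataFrame for illustration (not the df used above)
   df_ages = pd.DataFrame({'Age': [20, 25, 20, 19, 25, 20]})

   # Count the occurrences of each distinct age
   counts = df_ages['Age'].value_counts()
   print(counts)

Here, the age 20 occurs three times, 25 twice and 19 once. The result is a ``Series`` whose index holds the distinct values, listed in decreasing order of their counts.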
Apart from these descriptive functions, Pandas provides functions for sorting (using the ``.sort_values()`` function), finding the maximum or the minimum (using the ``.max()`` or ``.min()`` functions), and finding the largest or smallest n values (using the ``.nlargest()`` or ``.nsmallest()`` functions):

.. code:: python

   print("Maximum grade is: ", df['Grade'].max())
   print("\nRecords sorted according to age:\n", df.sort_values(by="Age"))
   print("\n\nTop two grades are:\n", df['Grade'].nlargest(2))

.. parsed-literal::
    :class: output

    Maximum grade is:  60.2

    Records sorted according to age:
             Grade  Age
    Mary     60.2   19
    Jack     40.2   20
    Amanda   30.0   25


    Top two grades are:
     Mary    60.2
    Jack    40.2
    Name: Grade, dtype: float64

Note that the displayed output includes the names because we selected names as the indices for accessing the rows.

Presenting Data in DataFrames
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Pandas provides a very easy mechanism for plotting your data via the ``.plot()`` function. Examples are provided below to illustrate how you can plot your own data. This plotting facility is provided by Matplotlib, which we will see next. If you are not using Colab, you need to use the following method to show the plot: ``df.plot().figure.show()``

.. code:: python

   # Plots all columns in different colours
   df.plot()

.. figure:: output_ch10_scientific_libraries_ea0544_39_1.png

.. code:: python

   # Plots a single column
   df['Age'].plot()

.. figure:: output_ch10_scientific_libraries_ea0544_40_1.png

Plotting data with Matplotlib
-----------------------------

Matplotlib is a very capable library for drawing different types of plots in Python. It is very well integrated with NumPy, SciPy and Pandas, and therefore, all these libraries are very frequently used together.
**Installation Notes**. To be able to use the Matplotlib library, you will need to download it from matplotlib.org and install it on your computer. If you are using a Python package manager (e.g. pip), you can install it directly using: ``$ pip install matplotlib``. If you are using a Windows/Mac machine, you should install Anaconda first. If you are using Colab or another Jupyter Notebook viewer, the platform may already have Matplotlib installed.
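
A quick way to check that the installation succeeded is to import the library and print its version. This is only a sketch; the version string you see depends on your own installation:

.. code:: python

    import matplotlib

    # An ImportError here means that Matplotlib is not installed
    # in the Python environment you are currently running
    print("Matplotlib version:", matplotlib.__version__)
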
Parts of a Figure
~~~~~~~~~~~~~~~~~

A figure consists of the following elements (see also :numref:`ch10_figure_anatomy`):

- Title of the figure.
- Axes, together with their ticks, tick labels, and axis labels.
- The canvas of the plot, which contains a drawing of your data in the form of dots (scatter plot), lines (line plot), bars (bar plot), surfaces, etc.
- Legend, which informs the viewer about the different plots in the canvas.

.. _ch10_figure_anatomy:
.. figure:: ../figures/ch10_figure_anatomy.png
   :width: 500px

   A figure consists of several components, all of which you can change in Matplotlib easily. Figure source: Matplotlib Usage Guides.

Preparing your Data for Plotting
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Matplotlib expects NumPy arrays as input; therefore, if your data is already in a NumPy array, you can plot it directly without any data type conversion. With a Pandas ``DataFrame``, the behavior is not guaranteed, so it is recommended to first convert the values in the ``DataFrame`` to a NumPy array, e.g.:

.. code:: python

    print(df)
    age_array = df['Age'].values
    print("The `Age` values in an array form are:", age_array)
    print("The type of our new data is: ", type(age_array))

.. parsed-literal::
    :class: output

            Grade  Age
    Jack     40.2   20
    Amanda   30.0   25
    Mary     60.2   19
    The `Age` values in an array form are: [20 25 19]
    The type of our new data is:  <class 'numpy.ndarray'>

Drawing Single Plots
~~~~~~~~~~~~~~~~~~~~

There are two ways to plot with Matplotlib: in an object-oriented style or in the so-called Pyplot style.

**1. Drawing in an Object-Oriented Style**. In this style, we create a figure object and an axes object and work with those to create our plots. This is illustrated in the following example:

.. code:: python

    import matplotlib.pyplot as plt
    import numpy as np

    # Uniformly sample 50 x values between -2 and 2:
    x = np.linspace(-2, 2, 50)

    # Create a figure with a single axes object
    fig, ax = plt.subplots()

    # Plot y = x
    ax.plot(x, x, label='$y=x$')

    # Plot y = x^2
    ax.plot(x, x**2, label='$y=x^2$')

    # Plot y = x^3
    ax.plot(x, x**3, label='$y=x^3$')

    # Set the labels for x and y axes:
    ax.set_xlabel('x')
    ax.set_ylabel('y')

    # Set the title of the figure
    ax.set_title("Our First Plot -- Object-Oriented Style")

    # Create a legend
    ax.legend()

    # Show the plot
    # fig.show() # Uncomment if not using Colab

.. figure:: output_ch10_scientific_libraries_ea0544_44_1.png

**2. Drawing in a Pyplot Style**. In the Pyplot style, we directly call functions in the pyplot module (``matplotlib.pyplot``) to create a figure and draw our plots. This style does not work with explicit figure or axes objects. This is illustrated in the following example:

.. code:: python

    # Uniformly sample 50 x values between -2 and 2:
    x = np.linspace(-2, 2, 50)

    # Plot y = x
    plt.plot(x, x, label='$y=x$')

    # Plot y = x^2
    plt.plot(x, x**2, label='$y=x^2$')

    # Plot y = x^3
    plt.plot(x, x**3, label='$y=x^3$')

    # Set the labels for x and y axes:
    plt.xlabel('x')
    plt.ylabel('y')

    # Set the title of the figure
    plt.title("Our First Plot -- Pyplot Style")

    # Create a legend
    plt.legend()

    # Show the plot
    # plt.show() # Uncomment if not using Colab

.. figure:: output_ch10_scientific_libraries_ea0544_46_1.png

Drawing Multiple Plots in a Figure
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In many situations, you will need to draw multiple plots side by side in a single figure. In the object-oriented style, this can be done by creating a grid with the ``subplots()`` function and then drawing on the axes objects it returns. This is illustrated with an example below:

.. code:: python

    # Create a 2x2 grid of plots
    fig, axes = plt.subplots(2, 2)

    # Plot (1,1)
    axes[0,0].plot(x, x)
    axes[0,0].set_title("$y=x$")

    # Plot (1,2)
    axes[0,1].plot(x, x**2)
    axes[0,1].set_title("$y=x^2$")

    # Plot (2,1)
    axes[1,0].plot(x, x**3)
    axes[1,0].set_title("$y=x^3$")

    # Plot (2,2)
    axes[1,1].plot(x, x**4)
    axes[1,1].set_title("$y=x^4$")

    # Adjust vertical space between rows
    plt.subplots_adjust(hspace=0.5)

    # Show the plot
    # fig.show() # Uncomment if not using Colab

.. figure:: output_ch10_scientific_libraries_ea0544_48_0.png

It is possible to do this in the Pyplot style as well, as illustrated below:

.. code:: python

    # Plot (1,1)
    plt.subplot(2, 2, 1)
    plt.plot(x, x)
    plt.title('$y=x$')

    # Plot (1,2)
    plt.subplot(2, 2, 2)
    plt.plot(x, x**2)
    plt.title('$y=x^2$')

    # Plot (2,1)
    plt.subplot(2, 2, 3)
    plt.plot(x, x**3)
    plt.title('$y=x^3$')

    # Plot (2,2)
    plt.subplot(2, 2, 4)
    plt.plot(x, x**4)
    plt.title('$y=x^4$')

    # Adjust vertical space between rows
    plt.subplots_adjust(hspace=0.5)

    # Show the plot
    # plt.show() # Uncomment if not using Colab

.. figure:: output_ch10_scientific_libraries_ea0544_50_0.png

Changing Elements of a Plot
~~~~~~~~~~~~~~~~~~~~~~~~~~~

All elements that are visualized in :numref:`ch10_figure_anatomy` can be changed in Matplotlib. We will skip these details to keep the focus of the book on the practical uses of these libraries. The interested reader can look up the extensive documentation at https://matplotlib.org/2.1.1/contents.html or look at the help page of a function (e.g. ``help(plt.plot)``) to see how to modify the elements of a figure.

Important Concepts
------------------

We would like our readers to have grasped the following crucial concepts and keywords from this chapter:

- NumPy arrays and their properties: array shape, dimensions, sizes, elements.
- Accessing and modifying elements of a NumPy array.
- Simple algebraic functions on NumPy arrays.
- SciPy and its basic capabilities.
- Pandas, DataFrame, loading files with Pandas.
- Accessing and modifying content in DataFrames.
- Analyzing and presenting data in DataFrames.
- Matplotlib and different ways to make plots.
- Drawing single and multiple plots. Changing elements of a plot.

Further Reading
---------------

- NumPy documentation: https://numpy.org/doc/
- SciPy documentation: https://docs.scipy.org/doc/scipy/reference/
- Pandas documentation: https://pandas.pydata.org/docs/
- Matplotlib documentation: https://matplotlib.org/2.1.1/contents.html
- Introduction to Linear Algebra with Python: https://pabloinsente.github.io/intro-linear-algebra

Exercises
---------

- Define functions that work like the ``sum``, ``mean``, ``min`` and ``max`` operations provided by NumPy. These functions should take a single 2D array and return the result as a number. You can assume that the operation applies to the whole array and not to a single axis.

- Using your favorite spreadsheet editor (e.g. Microsoft Excel or Google Sheets), create a simple CSV file with your exams and their grades as two separate columns. Save the file, upload it to the Colab notebook and do the following:

  - Load the file using Pandas.
  - Calculate the mean of your exam grades.
  - Calculate the standard deviation of your grades.

- Using Matplotlib, generate the following plots with suitable names for the axes and the titles:

  - Draw the following four functions in separate single plots: :math:`\sin(x), \cos(x), \tan(x), \cot(x)`.
  - Draw these four functions in a single plot.
  - Draw a 2x2 grid of plots where each subplot is one of the four functions.