Build Your Own Neural Network
=============================

In this section we will build a simple neural network, train it and validate it on a sample test data.
For this exercise, we will use the `Mushroom dataset from the Audobon Society Field Guide <https://archive.ics.uci.edu/dataset/73/mushroom>`_.
This dataset includes 22 physical characteristics of ~8,000 mushrooms spanning 23 species of gilled mushrooms in the Agaricus and Lepiota Family.
Our task is to predict whether a mushroom is edible or poisonous based on its physical characteristics.

By the end of this exercise, you should be able to:

* Import the Mushroom dataset from the UCI Machine Learning Repository
* Examine and preprocess the data to be fed to the neural network
* Build a sequential model neural network using TensorFlow Keras
* Evaluate the model's performance on test data


TensorFlow and Keras Fundamentals
---------------------------------

Before we dive into the hands-on exercise, let's briefly introduce the tools we'll be using.


TensorFlow and Keras
^^^^^^^^^^^^^^^^^^^^

.. image:: ./images/TensorFlow-Icon.png
    :width: 100px
    :align: right

`TensorFlow <https://www.tensorflow.org/>`_ is one of the most powerful open-source machine learning libraries available today. 
Developed by Google, TensorFlow offers a wide range of tools and resources to help you build, train, and deploy neural networks, making it accessible to both beginners and experts.

At its core, TensorFlow uses multi-dimensional arrays called *tensors* to represent:

* Input data
* Model parameters (weights and biases that the model learns)
* Outputs (predictions from the model)

.. list-table:: Common Tensor Types
    :widths: 20 27 15 30
    :align: center
    :header-rows: 1

    * - **Tensor Type**
      - **Example**
      - **Shape**
      - **Description**
    * - **Scalar (Rank-0)**
      - ``5``
      - ``()``
      - A single number. No dimensions.
    * - **Vector (Rank-1)**
      - ``[1, 2, 3]``
      - ``(3,)``
      - A list of three numbers.
    * - **Matrix (Rank-2)**
      - ``[[1, 2, 3], [4, 5, 6]]``
      - ``(2, 3)``
      - A table of numbers with 2 rows and 3 columns. 


`Keras <https://www.tensorflow.org/guide/keras>`_ is the high-level API of the TensorFlow platform. 
It provides a simple and intuitive way to define neural network architectures, and it's designed to be easy to use and understand.

Keras simplifies every step of the machine learning workflow, including data preprocessing, model building, training, and deployment.
Unless you're developing custom tools on top of TensorFlow, you should use Keras as your default API for deep learning tasks. 


Building Models with Keras
^^^^^^^^^^^^^^^^^^^^^^^^^^

Keras offers three approaches to building neural networks, but we'll focus on the Sequential API, which is perfect for the linear stack of layers we need for our mushroom classifier.

The basic workflow we'll follow is:

1. **Define the model architecture**: Specify the layers, their sizes, and activation functions
2. **Compile the model**: Set the optimizer, loss function, and metrics
3. **Train the model**: Fit the model to our training data
4. **Evaluate performance**: Test the model on unseen data

Here's a preview of what our model code will look like:

.. warning::

   The code below is just a template and will not run as-is. It needs to be modified as described 
   below.

.. code-block:: python
    :linenos:

    from tensorflow.keras import Sequential
    from tensorflow.keras.layers import Input, Dense

    # 1. Define the model architecture
    model = Sequential([
        Input(shape=(number_of_features,)),    # Input layer matching our feature count
        Dense(units=10, activation='relu'),    # Hidden layer with 10 perceptrons
        Dense(units=1, activation='sigmoid')   # Outputs a probability between 0 and 1
    ])

    # 2. Compile the model
    model.compile(                             
        optimizer='adam',                      # Gradient-based optimizer
        loss='binary_crossentropy',            # Loss function for binary classification
        metrics=['accuracy']                   # Track accuracy during training
    )

    # Display model summary to understand its structure
    model.summary()

    # 3. Train the model
    model.fit(
        X_train, y_train,                      # Training data and labels
        validation_split=0.2,                  # Use 20% of training data for validation
        epochs=5,                              # Number of complete passes through the dataset
        batch_size=32                          # Number of samples per gradient update
    )

    # 4. Evaluate model performance
    test_loss, test_accuracy = model.evaluate(X_test, y_test)
    print(f"Test accuracy: {test_accuracy:.4f}")

With this foundation in place, let's start building our our own neural network!


Building a Sequential Model Neural Network
------------------------------------------

Tutorial Setup and Materials
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

All materials and instructions for running this tutorial in the `TACC Analysis Portal <https://tap.tacc.utexas.edu/>`_ are available in our GitHub repository: `TACC Deep Learning Tutorials <https://github.com/kbeavers/tacc-deep-learning-tutorials>`_.

Step 0: Check GPU Availability and TensorFlow Version
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Before training deep learning models, it's important to check whether TensorFlow can access the GPU on your machine. Training on a GPU is significantly faster than on a CPU, especially for large image datasets. 

If you've followed the setup instructions in the `TACC Deep Learning Tutorials README <https://github.com/kbeavers/tacc-deep-learning-tutorials>`_, and you've run the ``install_kernel.sh`` script on **Frontera**, you should now be running this notebook inside a containerized Jupyter kernel that includes:

 - TensorFlow v. 2.13.0 with GPU support
 - CUDA libraries compatible with the system
 - All required Python packages pre-installed

This cell will confirm that your environment is correctly configured (TIP: Make sure you change your kernel to ``Day3-tf-213``).

.. code-block:: python

    >>> import tensorflow as tf
    >>> print(tf.config.list_physical_devices('GPU'))

.. code-block:: text

    Devices: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'),
    PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'),
    PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'),
    PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU')]


Step 1: Importing and Examining the Data
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The Mushroom dataset is available in the University of California, Irvine Machine Learning Repository, which is a popular repository for machine learning datasets.
Conveniently, the ``ucimlrepo`` Python package provides a simple interface to download and load datasets directly from this repository.

First, we will import the Mushroom dataset using the ``ucimlrepo`` package:

.. code-block:: python

    >>> import pandas as pd
    >>> from ucimlrepo import fetch_ucirepo 
    >>> import random
    >>> random.seed(123)

    >>> # fetch dataset 
    >>> mushroom = fetch_ucirepo(id=73) 

Let's inspect the metadata:

.. code-block:: python

    >>> print("Dataset Overview:", mushroom.metadata.abstract)
    >>> print("Number of Instances:", mushroom.metadata.num_instances)
    >>> print("Number of Features:", mushroom.metadata.num_features)
    >>> print("Has Missing Values:", mushroom.metadata.has_missing_values)

.. code-block:: text

    Dataset Overview: From Audobon Society Field Guide; mushrooms described in terms of physical characteristics; classification: poisonous or edible
    Number of Instances: 8124
    Number of Features: 22
    Has Missing Values: yes

We know that the Mushroom dataset has 8124 instances (samples) and 22 features (physical characteristics), and there are missing values in the dataset.
Now that we have loaded the dataset, let's separate the features (``X``) from the target variable and examine the structure of our feature data.

.. code-block:: python

    >>> X = mushroom.data.features
    >>> print(X.info())

Examine the outout of ``X.info()``:

.. code-block:: text

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 8124 entries, 0 to 8123
    Data columns (total 22 columns):
     #   Column                    Non-Null Count  Dtype 
    ---  ------                    --------------  ----- 
     0   cap-shape                 8124 non-null   object
     1   cap-surface               8124 non-null   object
     2   cap-color                 8124 non-null   object
     3   bruises                   8124 non-null   object
     4   odor                      8124 non-null   object
     5   gill-attachment           8124 non-null   object
     6   gill-spacing              8124 non-null   object
     7   gill-size                 8124 non-null   object
     8   gill-color                8124 non-null   object
     9   stalk-shape               8124 non-null   object
     10  stalk-root                5644 non-null   object
     11  stalk-surface-above-ring  8124 non-null   object
     12  stalk-surface-below-ring  8124 non-null   object
     13  stalk-color-above-ring    8124 non-null   object
     14  stalk-color-below-ring    8124 non-null   object
     15  veil-type                 8124 non-null   object
     16  veil-color                8124 non-null   object
     17  ring-number               8124 non-null   object
     18  ring-type                 8124 non-null   object
     19  spore-print-color         8124 non-null   object
     20  population                8124 non-null   object
     21  habitat                   8124 non-null   object
    Dtypes: object(22)
    memory usage: 1.4+ MB
    None

Next, let's isolate and examine our target variable ``y``:

.. code-block:: python

    >>> y = mushroom.data.targets 
    >>> print(y.info())

Examine the outout of ``y.info()``:

.. code-block:: text

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 8124 entries, 0 to 8123
    Data columns (total 1 columns):
     #   Column     Non-Null Count  Dtype 
    ---  ------     --------------  ----- 
     0   poisonous  8124 non-null   object
    Dtypes: object(1)
    memory usage: 63.6+ KB
    None

In pandas, a Dtype (data type) specifies how the data in a column should be stored and interpreted.
See the section on `Exploratory Data Analysis (EDA) <../section1/exploratory_data_analysis.html>`_
for more information on Dtypes.

When we see a Dtype of ``object``, it typically means the column contains strings or a mix of different data types. Let's examine our data further:

.. code-block:: python

    >>> print(X.head(3))

.. code-block:: text

      cap-shape cap-surface cap-color bruises odor gill-attachment gill-spacing  \
    0         x           s         n       t    p               f            c   
    1         x           s         y       t    a               f            c   
    2         b           s         w       t    l               f            c   

      gill-size gill-color stalk-shape  ... stalk-surface-below-ring  \
    0         n          k           e  ...                        s   
    1         b          k           e  ...                        s   
    2         b          n           e  ...                        s   

      stalk-color-above-ring stalk-color-below-ring veil-type veil-color  \
    0                      w                      w         p          w   
    1                      w                      w         p          w   
    2                      w                      w         p          w   

      ring-number ring-type spore-print-color population habitat  
    0           o         p                 k          s       u  
    1           o         p                 n          n       g  
    2           o         p                 n          n       m  

    [3 rows x 22 columns] 

In this dataset, the features are categorical variables stored as strings (which pandas represents as ``object`` Dtype). 
Each feature is encoded with single-character values that represent specific categories.

For a complete reference of all categorical values and their meanings, visit the `UCI Mushroom Dataset page <https://archive.ics.uci.edu/dataset/73/mushroom>`_.

Here are a few examples of the categorical encodings:
 
 * **cap-shape**: 'x' (convex), 'b' (bell), 'f' (flat), etc.
 * **cap-color**: 'n' (brown), 'y' (yellow), 'w' (white), etc.
 * **odor**: 'p' (pungent), 'a' (almond), 'l' (anise), etc.

Next, let's take a look at the target variable:

.. code-block:: python

    >>> print(y.head())

.. code-block:: text

      poisonous
    0         p
    1         e
    2         e
    3         p
    4         e

The target variable contains two categorical labels: ``p`` (poisonous) and ``e`` (edible).
With this insight into our dataset's structure, our next step is to prepare the data for model training.

**Thought Challenge:** What are some things that you have noticed about the data that you think we will need to fix before feeding it to the neural network? Pause here and write down your thoughts before continuing.


Step 2: Data Pre-processing
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Our exploration of the Mushroom dataset reveals a collection of 8124 samples with 22 features and a single target variable. Before proceeding with model development, several preprocessing challenges need to be addressed:

 1. The dataset contains missing values that require handling.
 2. All features are categorical, encoded as text strings (represented as ``object`` type in pandas).
 3. The target variable itself is categorical, using ``p`` to indicate poisonous mushrooms and ``e`` for edible ones.

First, let's handle the missing values. Let's see how many missing values are in the dataset, and where they are located:

.. code-block:: python

    >>> missing_values = X.isnull().sum()
    >>> print("Columns with missing values:")
    >>> print(missing_values[missing_values > 0])

.. code-block:: text
    
    Columns with missing values:
    stalk-root    2480
    Dtype: int64

The output shows that ``stalk-root`` is missing data for 2480 samples, while all other features have complete data.
Let's remove this column from the dataset:

.. code-block:: python

    >>> X_clean = X.drop(columns=['stalk-root'])
    
Now we need to encode our categorical variables into a format suitable for the neural network. We'll use one-hot encoding via ``pd.get_dummies()`` to transform each categorical feature into multiple binary columns. For example, if a feature has three possible values (A, B, C), it will be converted into three separate columns, where only one column will have a value of 1 (True) and the others 0 (False):

.. code-block:: python

    >>> X_encoded = pd.get_dummies(X_clean)
    >>> print(X_encoded.head(2))

.. code-block:: text

       cap-shape_b  cap-shape_c  cap-shape_f  cap-shape_k  cap-shape_s  \
    0        False        False        False        False        False   
    1        False        False        False        False        False   

       cap-shape_x  cap-surface_f  cap-surface_g  cap-surface_s  cap-surface_y  \
    0         True          False          False           True          False   
    1         True          False          False           True          False   

       ...  population_s  population_v  population_y  habitat_d  habitat_g  \
    0  ...          True         False         False      False      False   
    1  ...         False         False         False      False       True   

       habitat_l  habitat_m  habitat_p  habitat_u  habitat_w  
    0      False      False      False       True      False  
    1      False      False      False      False      False  

    [2 rows x 112 columns]

Now, instead of having 22 features, we have 112 features, each representing a binary True/False value for each categorical value in the original features.

Finally, let's encode the target variable. We will simply convert the string labels ``p`` and ``e`` into binary numeric values of 1 and 0, respectively.
In this case, 1 will represent a poisonous mushroom and 0 will represent an edible mushroom.

.. code-block:: python

    >>> y_encoded = y['poisonous'].map({'p': 1, 'e': 0})

Now would be a good time to check the class distribution of our dataset:

.. code-block:: python3

    >>> print("\nClass Distribution:")
    >>> print(y_encoded.value_counts())
    >>> print("\nPercentage:")
    >>> print(y_encoded.value_counts(normalize=True) * 100)

We have a roughly balanced dataset with 51.8% of the samples being edible and 48.2% being poisonous.
We can now split the dataset into training and test sets:

.. code-block:: python

    >>> from sklearn.model_selection import train_test_split

    >>> # Split the dataset into training and testing sets
    >>> X_train, X_test, y_train, y_test = train_test_split(
    >>>     X_encoded,
    >>>     y_encoded,
    >>>     test_size=0.3,
    >>>     stratify=y_encoded,
    >>>     random_state=123
    >>> )

    >>> # Examine the shape of the training and testing sets
    >>> print("Training set shape:", X_train.shape, y_train.shape)
    >>> print("Testing set shape:", X_test.shape, y_test.shape)

.. code-block:: python-console

    Training set shape: (5686, 112) (5686,)
    Testing set shape: (2438, 112) (2438,)


Understanding the Train-Test Split
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The code above divides our data into training and testing sets, creating four objects:
``X_train``, ``X_test``, ``y_train``, and ``y_test``.

.. list-table:: Key Train-Test Split Parameters
   :widths: 20 50 30
   :header-rows: 1

   * - Parameter
     - Purpose
     - In Our Example
   * - ``test_size``
     - Determines what portion of data is reserved for testing
     - 30% for testing, 70% for training
   * - ``stratify``
     - Maintains the same class distribution in both splits
     - Ensures balanced representation of poisonous/edible classes
   * - ``random_state``
     - Controls the shuffling of data before splitting
     - Ensures we get the same samples in train/test splits each time we run the code


Why These Parameters Matter:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* **Test Size**: Finding the right balance between having enough data for training while reserving sufficient data for testing is crucial. Too little test data may not reliably assess model performance; too little training data may limit learning.

* **Stratification**: When working with classification problems, maintaining class proportions is essential. Without stratification, you might accidentally create a test set with disproportionate class representation, leading to misleading evaluation metrics.

* **Random State**: Without setting ``random_state``, you'd get a different train/test split each time you run the code. When you set a fixed value here, you'll get the same splits, allowing you to make fair comparisons when you make changes to your model. 

.. tip::
  
    While our dataset has roughly balanced classes, stratification becomes especially important with
    imbalanced datasets. Always consider using ``stratify`` as a best practice.


Step 3: Building a Sequential Model Neural Network 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Now we'll create a simple neural network for our mushroom classification task. The model will consist of:

- An **input layer** that matches our feature dimensions
- A **hidden layer** with 10 perceptrons and ReLU activation
- An **output layer** with sigmoid activation for binary classification

This architecture provides a good starting point for understanding how neural networks learn from tabular data.

.. code-block:: python

    >>> # Import necessary libraries from TensorFlow
    >>> import tensorflow as tf
    >>> from tensorflow.keras import Sequential
    >>> from tensorflow.keras.layers import Input, Dense

    >>> # Set random seed for reproducibility
    >>> tf.random.set_seed(123)

    >>> # Create model with sequential API
    >>> model = Sequential([
    >>>     # Input layer - shape matches our feature count
    >>>     Input(shape=(112,)),  # Each sample is a 1D tensor with 112 features
    >>>     
    >>>     # Hidden layer - 10 perceptrons with ReLU activation
    >>>     # ReLU allows the network to learn non-linear patterns
    >>>     Dense(10, activation='relu'),
    >>>     
    >>>     # Output layer - single perceptron with sigmoid activation
    >>>     # Sigmoid squashes output between 0-1, perfect for binary classification
    >>>     Dense(1, activation='sigmoid')
    >>> ])

    >>> # Compile the model with appropriate settings for binary classification
    >>> model.compile(
    >>>     optimizer='adam',              # Adam: efficient gradient-based optimizer
    >>>     loss='binary_crossentropy',    # Standard loss function for binary problems
    >>>     metrics=['accuracy']           # ßTrack accuracy during training
    >>> )

    >>> # Display model architecture and parameter count
    >>> model.summary()

For fully connected layers, the number of trainable parameters can be calculated with the following formula:

.. math:: 

  \text{Parameters} = (\text{Input units} \times \text{Output units}) + \text{Output units}

Let's understand what each part means:

 1. **Weights**: ``Input units x Output units``
    Each Input unit connects to each Output unit, so there's one weight per connection. This forms a weight matrix of shape ``(Input units, Output units)``. 

 2. **Biases**: ``+ Output units``
    Each output perceptron has one bias term, regardless of the number of Input units. So the total number of bias terms is equal to the number of Output units. 


**Thought Challenge**: 

How many parameters does the model have? Can you calculate this manually and get the same result?

.. toggle:: Click to see the answer

    Let's calculate the parameters manually:

    **Layer 1** (Input -> Hidden):

    - Input Units: ``X_train.shape[1]`` (112 features after one-hot encoding)
    - Output Units: 10 perceptrons
  
    - Weights: 112 x 10 = 1120 parameters
    - Biases: 10 (one per Output unit (perceptrons in the next layer))
    - Total for Layer 1: 1120 + 10 = 1130 parameters

    **Layer 2** (Hidden -> Output):

    - Input Units: 10 perceptrons
    - Output Units: 1 perceptron
  
    - Weights: 10 × 1 = 10 parameters
    - Biases: 1 (one per Output unit)
    - Total for Layer 2: 10 + 1 = 11 parameters

    **Total parameters**: 1130 + 11 = 1141 parameters

    This should match the parameter count shown in the model.summary() output.


Training the Neural Network
~~~~~~~~~~~~~~~~~~~~~~~~~~~

With our model built and compiled, we can now train it on our data. Before executing the training code, let's understand the key parameters we'll use:

.. list-table:: Key Training Parameters
   :widths: 20 80
   :header-rows: 1

   * - Parameter
     - Description
   * - **validation_split=0.2**
     - Reserves 20% of training data to evaluate performance during training, without affecting model weights
   * - **epochs=5**
     - Number of complete passes through the dataset; more epochs allow for more learning iterations but risk overfitting
   * - **batch_size=32**
     - Number of samples processed before weight update; affects memory usage, training speed, and convergence behavior
   * - **verbose=2**
     - Controls output level (0=silent, 1=progress bar, 2=one line per epoch)

**Thought Challenge**: How does the choice of ``batch_size`` affect the training process?

.. toggle:: Click to see the answer

    The ``batch_size`` parameter determines how many samples the model processes before updating its weights.
    
    **Effects of batch size:**
    
    - **Small batch sizes** (e.g., 8-32):
      - Use less memory
      - Update weights more frequently
      - Can help the model escape local minima
      - May make training slower overall
    
    - **Large batch sizes** (e.g., 128-512):
      - More efficient use of GPU/CPU
      - More stable training (less "noisy" updates)
      - Require more memory
      - May get stuck in poor solutions
    
    The batch size of 32 in our example is relatively small, which is good for learning complex patterns in modest-sized datasets.

Now let's train our model with these parameters:

.. code-block:: python

    >>> # Train the model with the specified parameters
    >>> model.fit(X_train, y_train, validation_split=0.2, epochs=5, batch_size=32, verbose=2)

Below shows the output of the training process:

.. code-block:: text

  Epoch 1/5
  143/143 - 1s - loss: 0.3543 - accuracy: 0.8709 - val_loss: 0.1458 - val_accuracy: 0.9569 - 1s/epoch - 9ms/step
  Epoch 2/5
  143/143 - 0s - loss: 0.0966 - accuracy: 0.9756 - val_loss: 0.0647 - val_accuracy: 0.9851 - 398ms/epoch - 3ms/step
  Epoch 3/5
  143/143 - 0s - loss: 0.0486 - accuracy: 0.9888 - val_loss: 0.0372 - val_accuracy: 0.9938 - 398ms/epoch - 3ms/step
  Epoch 4/5
  143/143 - 0s - loss: 0.0291 - accuracy: 0.9954 - val_loss: 0.0235 - val_accuracy: 0.9982 - 394ms/epoch - 3ms/step
  Epoch 5/5
  143/143 - 0s - loss: 0.0192 - accuracy: 0.9976 - val_loss: 0.0161 - val_accuracy: 0.9991 - 386ms/epoch - 3ms/step

Let's understand what this output tells us:

1. **Progress metrics**:

  - ``143/143``: Shows progress through the training batches; 143 batches were completed out of 143, and each batch contains 32 samples (as specified by ``batch_size=32``)
  - ``1s``: Indicates the time taken for each epoch; here, the first epoch took <1 second to complete.
  - ``9ms/step``: This indicates the average time taken per training step (one forward and backward pass through a single batch) during training.

2. **Training metrics**:

  - ``accuracy: 0.8709``: Represents the accuracy of the model on the training dataset. The accuracy value of approximately 0.8709 indicates that the model correctly predicted 87.09% of the training samples.
  - ``loss: 0.3543``: Represents the training loss value (using binary cross-entropy loss function) on the training dataset. Higher loss values indicate that the model's predictions are further from the true labels.

3. **Validation metrics**:

  - ``val_accuracy: 0.9569``: Represents the accuracy of the model on the validation dataset. The accuracy value of approximately 0.9569 indicates that the model correctly predicted 95.69% of the validation samples.
  - ``val_loss: 0.1458``: Represents the validation loss value (using binary cross-entropy loss function) on the validation dataset. Lower loss values indicate that the model's predictions are closer to the true labels.

Looking at our training results after 5 epochs, we can observe:

1. The model achieved excellent performance, with final training accuracy of 99.85% and validation accuracy of 99.82%.
2. Both training and validation loss steadily decreased across epochs, indicating consistent learning.
3. Validation metrics consistently tracked close to training metrics, suggesting the model generalizes well rather than memorizing the training data.

Let's visualize our training progress before moving on:

.. code-block:: python

    >>> import matplotlib.pyplot as plt
    
    >>> # Create a simple visualization of training history
    >>> plt.figure(figsize=(10, 4))
    
    >>> # Plot training & validation accuracy
    >>> plt.subplot(1, 2, 1)
    >>> plt.plot([0.8709, 0.9776, 0.9894, 0.9949, 0.9985], label='Training Accuracy')
    >>> plt.plot([0.9569, 0.9851, 0.9938, 0.9982, 0.9982], label='Validation Accuracy')
    >>> plt.title('Model Accuracy')
    >>> plt.ylabel('Accuracy')
    >>> plt.xlabel('Epoch')
    >>> plt.legend()

    >>> # Plot training & validation loss
    >>> plt.subplot(1, 2, 2)
    >>> plt.plot([0.3543, 0.0964, 0.0481, 0.0288, 0.0186], label='Training Loss')
    >>> plt.plot([0.1458, 0.0638, 0.0364, 0.0230, 0.0157], label='Validation Loss')
    >>> plt.title('Model Loss')
    >>> plt.ylabel('Loss')
    >>> plt.xlabel('Epoch')
    >>> plt.legend()
    
    >>> plt.tight_layout()
    >>> plt.show()

.. figure:: ./images/mushroom-training-progress.png
    :width: 600px
    :align: center
    :alt: Training and validation metrics over epochs

|

This high performance is promising, but we should verify it on our completely separate test set, which the model has never seen during training. This will give us the most reliable measure of how well our model might perform in real-world scenarios.


Step 4: Evaluate the Model's Performance on Test Data
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The true test of our model's capabilities comes from evaluating it on our completely separate test dataset. Let's see how our neural network performs when classifying mushrooms it has never encountered before!

.. code-block:: python

    >>> # Make predictions on the test data
    >>> y_pred=model.predict(X_test)

For a binary classification problem like our (poisonous vs edible), the model outputs probabilities between 0 and 1 for each sample. Let's show the first sample's prediction:

.. code-block:: python

    >>> y_pred[0]

.. code-block:: text
    
    array([0.00323989], dtype=float32)

This shows the probability for the first mushroom sample in the test set.
The output is a single value between 0 and 1, where:

 - Values closer to 1 indicate the model is more confident that the sample is poisonous.
 - Values closer to 0 indicate the model is more confident that the sample is edible.

For example, our output value is 0.00323989, which means that the model is ~99.68% confident that the sample is edible.

The model outputs probability values, but for practical mushroom classification, we need definitive "edible" or "poisonous" predictions. We need to convert these continuous probability values into discrete class labels:

.. code-block:: python

    >>> import numpy as np
    
    >>> # Convert probabilities to binary predictions using a threshold of 0.5
    >>> y_pred_final = (y_pred > 0.5).astype(int)
    
This code performs what's called "thresholding":

1. First, we compare each probability to the threshold value (0.5)
   
   - If probability > 0.5, the result is True (model thinks it's more likely poisonous)
   - If probability ≤ 0.5, the result is False (model thinks it's more likely edible)

2. Then, we convert these True/False values to integers (1/0) with ``.astype(int)``
   
   - True becomes 1 (poisonous)
   - False becomes 0 (edible)

The 0.5 threshold represents the decision boundary - the point where the model is equally confident in either class. We could adjust this threshold if we wanted to be more conservative about certain types of errors (e.g., lowering the threshold would classify more mushrooms as poisonous, reducing the chance of missing toxic ones).

Now, let's visualize the model's prediction accuracy with a **confusion matrix**. 
This will allow us to see how many correct vs incorrect predictions were made using the model above.

.. code-block:: python

    >>> from sklearn.metrics import confusion_matrix
    >>> import seaborn as sns

    >>> # Create confusion matrix
    >>> cm=confusion_matrix(y_test,y_pred_final)

    >>> # Create visualization
    >>> plt.figure(figsize=(10,7))          # Set figure size to 10x7 inches
    >>> sns.heatmap(cm,annot=True,fmt='d')  # Create heatmap with annotations and display counts as integers
    >>> plt.xlabel('Predicted')             # Label x-axis as 'Predicted'
    >>> plt.ylabel('Truth')                 # Label y-axis as 'Truth'
    >>> plt.show()                          # Display the plot

Output of the above confusion matrix is as follows:

.. figure:: ./images/nn-confusion-matrix.png
    :width: 600px
    :align: center
    :alt: 

The confusion matrix visualization shows how well our model classifies mushrooms as edible or poisonous. The matrix is a 2x2 grid where:

* The y-axis (Truth) shows the actual class of the mushrooms
* The x-axis (Predicted) shows what our model predicted
* Each cell contains the count of predictions falling into that category
* The heatmap coloring provides visual intensity, where lighter colors indicate higher counts

Reading the matrix:

* **Top-left**: True Negatives (TN) - Correctly identified edible mushrooms
* **Top-right**: False Positives (FP) - Edible mushrooms incorrectly classified as poisonous
* **Bottom-left**: False Negatives (FN) - Poisonous mushrooms incorrectly classified as edible
* **Bottom-right**: True Positives (TP) - Correctly identified poisonous mushrooms 


Key Classification Metrics
~~~~~~~~~~~~~~~~~~~~~~~~~~

From these confusion matrix values, we can calculate several important evaluation metrics:

.. list-table:: Classification Metrics for Mushroom Model
   :widths: 20 30 40
   :header-rows: 1

   * - Metric
     - Definition
     - Interpretation for Mushrooms
   * - **Accuracy**
     - :math:`\frac{TP + TN}{TP + TN + FP + FN}`
     - Percentage of all mushrooms correctly classified
   * - **Precision**
     - :math:`\frac{TP}{TP + FP}`
     - When model predicts "poisonous," how often is it right?
   * - **Recall**
     - :math:`\frac{TP}{TP + FN}`
     - Of all poisonous mushrooms, how many did we correctly identify? 
   * - **F1-Score**
     - :math:`2 \times \frac{Precision \times Recall}{Precision + Recall}`
     - Harmonic mean of precision and recall; useful when you need to balance both
   * - **Specificity**
     - :math:`\frac{TN}{TN + FP}`
     - Of all edible mushrooms, how many did we correctly identify?

**Thought Challenge**: Which prediction metric is most important for this model? Why? 

.. toggle:: Click to see the answer

    For mushroom classification, false negatives (bottom-left) are particularly concerning as they represent poisonous mushrooms that were incorrectly classified as edible.

    **Recall** measures a model's ability to correctly identify all true positives within a dataset, minimizing false negatives. 
    Therefore, **recall** is the most important metric for this model.

Let's also print the full classification report of this model using code below

.. code-block:: python

    >>> from sklearn.metrics import classification_report
    >>> print(classification_report(y_test,y_pred_final, digits=4))

.. code-block:: python-console

               precision    recall  f1-score   support

            0     0.9976    0.9992    0.9984      1263
            1     0.9991    0.9974    0.9983      1175

     accuracy                         0.9984      2438
    macro avg     0.9984    0.9983    0.9984      2438
 weighted avg     0.9984    0.9984    0.9984      2438


The accuracy of our model is 99.84%, so 99.84% of the time, this model predicted the correct label on the test data.

**Thought Challenge**: Did we build a successful model? Why or why not? Is there anything we can do to improve the model?

.. toggle:: Click to see the answer

    **Did we build a successful model?**
    
    By standard performance metrics, our model is remarkably successful:
    
    * Accuracy of 99.84% on the test set
    * Recall of 99.74% for poisonous predictions
    * Precision of 99.91% for poisonous predictions
    * F1-score of 99.83% for poisonous predictions
    
    **Why it's successful:**
    
    * The model efficiently learned the patterns distinguishing edible from poisonous mushrooms
    * The architecture, despite being simple (just one hidden layer), was sufficient for this task
    * The dataset is well-structured with clear categorical features that strongly correlate with mushroom edibility
    
    **However, there are important considerations:**
    
    In a real-world mushroom classification system, even our 99.74% recall means that approximately 3 out of 1000 poisonous mushrooms were misclassified as edible. For a life-critical application like mushroom toxicity detection, this error rate is still too high. 
    
    **Potential improvements:**
    
    1. **Domain-specific threshold adjustment**: Lower the classification threshold from 0.5 to a more conservative value (e.g., 0.3) to reduce the likelihood of false negatives (missing poisonous mushrooms)
    
    2. **More sophisticated architecture**: Try deeper networks or different architectures that might capture more subtle patterns
    
    3. **Ensemble methods**: Combine multiple models to reduce the chance of missing poisonous mushrooms
    
    4. **Cost-sensitive learning**: Explicitly penalize false negatives (missing poisonous mushrooms) more heavily during training
    
    5. **Uncertainty estimation**: Add methods to quantify prediction uncertainty, so users know when to seek additional verification
    
    **Real-world deployment considerations:**
    
    Even with an improved model, it would be ethically questionable to deploy such a system as the sole decision-maker for mushroom consumption. It should be presented as a tool to assist experts rather than replace human judgment, especially for life-critical decisions.


Additional Resources
--------------------

* Adapted from: 
  `COE 379L: Software Design For Responsible Intelligent Systems <https://coe-379l-sp24.readthedocs.io/en/latest/index.html>`_
* `Mushroom dataset from the Audobon Society Field Guide <https://archive.ics.uci.edu/dataset/73/mushroom>`_
* `Tensorflow <https://www.tensorflow.org/>`_
* `Keras Guide <https://www.tensorflow.org/guide/keras>`_
* `Keras Sequential Model <https://www.tensorflow.org/guide/keras/sequential_model>`_