How to solve Binary Classification Problems in Deep Learning with Tensorflow & Keras?
In this tutorial, we will focus on how to select Accuracy Metrics, Activation & Loss functions in Binary Classification Problems. First, we will review the types of Classification Problems, Activation & Loss functions, label encodings, and accuracy metrics. Furthermore, we will discuss how the target encoding can impact the selection of Activation & Loss functions. Moreover, we will talk about how to select the accuracy metric correctly. Then, for each type of classification problem, we will apply several Activation & Loss functions and observe their effects on performance. We will experiment with all these concepts by designing and evaluating a deep learning model built with Transfer Learning on the horses and humans dataset. In the end, we will summarize the experiment results.
I split the tutorial into three parts. In this first part, we will focus on Binary Classification. The next parts will focus on multi-class classification and multi-label classification.
If you would like to learn more about Deep Learning with practical coding examples, please subscribe to my YouTube channel or follow my blog on Medium. Do not forget to turn on notifications so that you will be notified when new parts are uploaded.
You can access this Colab Notebook using the link given in the video description below. Furthermore, you can watch this notebook on YouTube as well!
If you are ready, let's get started!
You can watch this notebook on the Murat Karakaya Akademi YouTube channel.
Types of Classification Tasks
In general, there are three main types/categories of Classification Tasks in machine learning:
A. binary classification: two target classes
B. multi-class classification: more than two exclusive targets; only one class can be assigned to an input
C. multi-label classification: more than two non-exclusive targets; one input can be labeled with multiple target classes
We will see the details of each classification task along with an example dataset and Keras model below.
Types of Label Encoding
In general, we can use different encodings for true (actual) labels (y values):

- a floating number (e.g. in binary classification: 1 or 0)
- one-hot encoding (e.g. in multi-class classification: [0 0 1 0 0])
- a vector (array) of integers (e.g. in multi-label classification: [14 225 3])

We will cover all possible encodings in the following examples.
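To make these encodings concrete, here is a minimal sketch (my own toy values, not from the original notebook) of how the three label formats look as tensors:

import tensorflow as tf

# Binary classification: a single floating number per sample (1. or 0.)
y_binary = tf.constant([1., 0., 1.], dtype=tf.float32)

# Multi-class classification: one-hot vectors, exactly one 1 per sample
y_one_hot = tf.constant([[0., 0., 1., 0., 0.],
                         [0., 1., 0., 0., 0.]], dtype=tf.float32)

# Multi-label classification: a vector (array) of integer class ids per sample
y_multi_label = tf.constant([14, 225, 3], dtype=tf.int32)

print(y_binary.numpy(), y_one_hot.numpy(), y_multi_label.numpy(), sep="\n")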
Types of Activation Functions for Classification Tasks
In Keras, there are several Activation Functions. Below I summarize two of them:

Sigmoid or Logistic Activation Function: The sigmoid function maps any input to an output ranging from 0 to 1. For small values (< -5), sigmoid returns a value close to zero, and for large values (> 5) the result of the function gets close to 1. Sigmoid is equivalent to a two-element Softmax, where the second element is assumed to be zero. Therefore, sigmoid is mostly used for binary classification.

Example: Assume the last layer of the model is:

outputs = keras.layers.Dense(1, activation=tf.keras.activations.sigmoid)(x)

# Let the last layer output vector be:
y_pred_logit = tf.constant([-20, -1.0, 0.0, 1.0, 20], dtype=tf.float32)
print("y_pred_logit:", y_pred_logit.numpy())
# and the last layer activation function is sigmoid:
y_pred_prob = tf.keras.activations.sigmoid(y_pred_logit)
print("y_pred:", y_pred_prob.numpy())
print("sum of all the elements in y_pred: ", y_pred_prob.numpy().sum())

y_pred_logit: [-20.  -1.   0.   1.  20.]
y_pred: [2.0611537e-09 2.6894143e-01 5.0000000e-01 7.3105860e-01 1.0000000e+00]
sum of all the elements in y_pred:  2.5

Softmax function: Softmax converts a real vector to a vector of categorical probabilities. The elements of the output vector are in the range (0, 1) and sum to 1. Each vector is handled independently. Softmax is often used as the activation for the last layer of a classification network because the result can be interpreted as a probability distribution. Therefore, Softmax is mostly used for multi-class or multi-label classification.

For example: Assume the last layer of the model is:

outputs = keras.layers.Dense(3, activation=tf.keras.activations.softmax)(x)

# Assume the last layer output is:
y_pred_logit = tf.constant([[-20, -1.0, 4.5], [0.0, 1.0, 20]], dtype=tf.float32)
print("y_pred_logit:\n", y_pred_logit.numpy())
# and the last layer activation function is softmax:
y_pred_prob = tf.keras.activations.softmax(y_pred_logit)
print("y_pred:", y_pred_prob.numpy())
print("sum of all the elements in each vector in y_pred: ",
      y_pred_prob.numpy()[0].sum(), " ",
      y_pred_prob.numpy()[1].sum())

y_pred_logit:
[[-20.  -1.   4.5]
 [  0.   1.  20. ]]
y_pred: [[2.2804154e-11 4.0701381e-03 9.9592990e-01]
 [2.0611537e-09 5.6027964e-09 1.0000000e+00]]
sum of all the elements in each vector in y_pred:  1.0   1.0
These two activation functions are the most used ones for classification tasks in the last layer.
PLEASE NOTE THAT: If we don't specify any activation function at the last layer, no activation is applied to the outputs of the layer (i.e. "linear" activation: a(x) = x).
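A tiny sketch of this default behavior (my own illustration, with hand-picked initializers): with activation=None, a Dense layer returns its raw affine output, so with unit weights and zero bias it reproduces the input exactly.

import tensorflow as tf
from tensorflow import keras

# One Dense unit with no activation: output = w*x + b (raw logits).
layer = keras.layers.Dense(1, activation=None,
                           kernel_initializer=keras.initializers.Ones(),
                           bias_initializer=keras.initializers.Zeros())
x = tf.constant([[-3.0], [0.0], [3.0]])
print(layer(x).numpy())  # [[-3.], [0.], [3.]] -> "linear" activation a(x) = x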
Types of Loss Functions for Classification Tasks
In Keras, there are several Loss Functions. Below, I summarize the ones used in Classification tasks:

BinaryCrossentropy: Computes the cross-entropy loss between true labels and predicted labels. We use this cross-entropy loss when there are only two label classes (assumed to be 0 and 1). For each example, there should be a single floating-point value per prediction.

CategoricalCrossentropy: Computes the cross-entropy loss between the labels and predictions. We use this cross-entropy loss function when there are two or more label classes. We expect labels to be provided in a one-hot representation. If you want to provide labels as integers, please use SparseCategoricalCrossentropy loss. There should be # classes floating point values per feature.

SparseCategoricalCrossentropy: Computes the cross-entropy loss between the labels and predictions. We use this cross-entropy loss function when there are two or more label classes. We expect labels to be provided as integers. If you want to provide labels using a one-hot representation, please use CategoricalCrossentropy loss. There should be # classes floating point values per feature for y_pred and a single floating-point value per feature for y_true.
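As a quick illustration (a minimal sketch with my own toy values), each loss consumes a different label format for the same kind of prediction:

import tensorflow as tf
from tensorflow import keras

# BinaryCrossentropy: labels are single floating numbers (0. or 1.)
bce = keras.losses.BinaryCrossentropy()
print(bce([[1.], [0.]], [[0.9], [0.2]]).numpy())

# CategoricalCrossentropy: labels are one-hot vectors
cce = keras.losses.CategoricalCrossentropy()
print(cce([[0., 1., 0.]], [[0.1, 0.8, 0.1]]).numpy())

# SparseCategoricalCrossentropy: labels are integer class ids
scce = keras.losses.SparseCategoricalCrossentropy()
print(scce([1], [[0.1, 0.8, 0.1]]).numpy())  # same loss as the one-hot version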
Important:
1. In Keras, these three Cross-Entropy functions expect two inputs: correct / true / actual labels (y) and predicted labels (y_pred).
As mentioned above, correct (actual) labels can be encoded as floating numbers, one-hot, or an array of integer values. However, the predicted labels should be presented as a probability distribution. If the predicted labels are not converted to a probability distribution by the last layer of the model (using a sigmoid or softmax activation function), we need to inform these three Cross-Entropy functions by setting their from_logits = True.
2. If the parameter from_logits is set to True in any cross-entropy function, then the function expects ordinary numbers as predicted label values and applies a sigmoid transformation on these predicted label values to convert them into a probability distribution. For details, you can check the tf.keras.backend.binary_crossentropy source code. The code below is taken from the TF source code:

if from_logits: return nn.sigmoid_cross_entropy_with_logits(labels=target, logits=output)
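A short sketch (my own check) of what this means in practice: feeding raw logits with from_logits=True produces the same loss as feeding sigmoid(logits) with the default from_logits=False.

import tensorflow as tf
from tensorflow import keras

y_true = tf.constant([[1.], [0.], [1.]])
logits = tf.constant([[2.0], [-1.0], [0.5]])  # raw last-layer outputs

loss_on_logits = keras.losses.BinaryCrossentropy(from_logits=True)
loss_on_probs = keras.losses.BinaryCrossentropy()  # from_logits=False by default

print(loss_on_logits(y_true, logits).numpy())
print(loss_on_probs(y_true, tf.sigmoid(logits)).numpy())  # ~ the same value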
3. Both categorical cross-entropy and sparse categorical cross-entropy have the same loss function, which we have mentioned above. The only difference is the format of the true labels:
- If correct (actual) labels are one-hot encoded, use categorical_crossentropy. Examples (for a three-class classification): [1,0,0], [0,1,0], [0,0,1]
- But if correct (actual) labels are integers, use sparse_categorical_crossentropy. Examples for the above three-class classification problem: [0], [1], [2]
- The usage entirely depends on how we load our dataset.
- One advantage of using sparse categorical cross-entropy is that it saves storage in memory as well as time in computation, because it simply uses a single integer for a class, rather than a whole one-hot vector. This equivalence is sketched below.
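Here is a minimal sketch (my own toy values) showing that the two losses agree and only the label format differs:

import tensorflow as tf
from tensorflow import keras

y_pred = tf.constant([[0.05, 0.90, 0.05],
                      [0.80, 0.10, 0.10]])

y_one_hot = tf.constant([[0., 1., 0.], [1., 0., 0.]])  # one-hot labels
y_int = tf.constant([1, 0])                            # integer class ids

print(keras.losses.CategoricalCrossentropy()(y_one_hot, y_pred).numpy())
print(keras.losses.SparseCategoricalCrossentropy()(y_int, y_pred).numpy())
# Both print the same loss value; y_int needs far less memory than y_one_hot.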
I will explain the above concepts by designing models in three parts.
Types of Accuracy Metrics
Keras has several accuracy metrics. In classification, we can use the following ones:

Accuracy: Calculates how often predictions equal labels.

y_true = [[1], [1], [0], [0]]
y_pred = [[0.99], [1.0], [0.01], [0.0]]
print("Which predictions equal labels:", np.equal(y_true, y_pred).reshape(-1,))
m = tf.keras.metrics.Accuracy()
m.update_state(y_true, y_pred)
print("Accuracy: ", m.result().numpy())

Which predictions equal labels: [False  True False  True]
Accuracy:  0.5

Binary Accuracy: Calculates how often predictions match binary labels.

y_true = [[1], [1], [0], [0]]
y_pred = [[0.49], [0.51], [0.5], [0.51]]
m = tf.keras.metrics.binary_accuracy(y_true, y_pred, threshold=0.5)
print("Which predictions match with binary labels:", m.numpy())
m = tf.keras.metrics.BinaryAccuracy()
m.update_state(y_true, y_pred)
print("Binary Accuracy: ", m.result().numpy())

Which predictions match with binary labels: [0. 1. 1. 0.]
Binary Accuracy:  0.5

Categorical Accuracy: Calculates how often predictions match one-hot labels.

# assume there are 3 classes
y_true = [[0, 0, 1], [0, 1, 0]]
y_pred = [[0.1, 0.9, 0.8], [0.05, 0.95, 0.3]]
m = tf.keras.metrics.categorical_accuracy(y_true, y_pred)
print("Which predictions match with one-hot labels:", m.numpy())
m = tf.keras.metrics.CategoricalAccuracy()
m.update_state(y_true, y_pred)
print("Categorical Accuracy:", m.result().numpy())

Which predictions match with one-hot labels: [0. 1.]
Categorical Accuracy: 0.5
Part A: Binary Classification (two target classes)
For a binary classification task, I will use the "horses_or_humans" dataset, which is available in TF Datasets.
A. 1. True (Actual) Labels are encoded with a single floating number (1./0.)
First, let's load the data from Tensorflow Datasets:

ds_raw_train, ds_raw_test = tfds.load('horses_or_humans',
                                      split=['train','test'], as_supervised=True)
print("Number of samples in train : ", ds_raw_train.cardinality().numpy(),
      " in test : ", ds_raw_test.cardinality().numpy())

Number of samples in train :  1027  in test :  256

def show_samples(dataset):
    fig = plt.figure(figsize=(14, 14))
    columns = 3
    rows = 3
    print(columns*rows, "samples from the dataset")
    i = 1
    for a, b in dataset.take(columns*rows):
        fig.add_subplot(rows, columns, i)
        plt.imshow(a)
        #plt.imshow(a.numpy())
        plt.title("image shape:" + str(a.shape) + " Label:" + str(b.numpy()))
        i = i + 1
    plt.show()

show_samples(ds_raw_test)

9 samples from the dataset
Notice that:
- There are only two label classes: horses and humans.
- For each sample, there is a single floating-point value per label: (0 → horse, 1 → human)
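As an optional sanity check (my own addition, not part of the original notebook), we can count how many samples carry each label in the test split:

import collections

# ds_raw_test yields (image, label) pairs because of as_supervised=True
label_counts = collections.Counter(int(label.numpy()) for _, label in ds_raw_test)
print(label_counts)  # counts for label 0 (horse) and label 1 (human)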
Let's resize and scale the images so that we can save time in training:

#VGG16 expects min 32 x 32
def resize_scale_image(image, label):
    image = tf.image.resize(image, [32, 32])
    image = image/255.0
    return image, label

ds_train_resize_scale = ds_raw_train.map(resize_scale_image)
ds_test_resize_scale = ds_raw_test.map(resize_scale_image)
show_samples(ds_test_resize_scale)

9 samples from the dataset
Prepare the data pipeline by setting batch size & buffer size using tf.data:

batch_size = 64
#buffer_size = ds_train_resize_scale.cardinality().numpy()/10
#ds_resize_scale_batched = ds_raw.repeat(3).shuffle(buffer_size=buffer_size).batch(64)
ds_train_resize_scale_batched = ds_train_resize_scale.batch(64, drop_remainder=True)
ds_test_resize_scale_batched = ds_test_resize_scale.batch(64, drop_remainder=True)
print("Number of batches in train: ", ds_train_resize_scale_batched.cardinality().numpy())
print("Number of batches in test: ", ds_test_resize_scale_batched.cardinality().numpy())

Number of batches in train:  16
Number of batches in test:  4
To train fast, let's use Transfer Learning by importing VGG16:

base_model = keras.applications.VGG16(
    weights='imagenet',       # Load weights pre-trained on ImageNet.
    input_shape=(32, 32, 3),  # VGG16 expects min 32 x 32
    include_top=False)        # Do not include the ImageNet classifier at the top.
base_model.trainable = False
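An optional check (illustrative): after freezing, the backbone should expose no trainable weights, so only the new classification head added below will be updated during training.

print("trainable weights in base_model:", len(base_model.trainable_weights))  # expected: 0
print("non-trainable weights in base_model:", len(base_model.non_trainable_weights))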
Create the classification model:

inputs = keras.Input(shape=(32, 32, 3))
x = base_model(inputs, training=False)
x = keras.layers.GlobalAveragePooling2D()(x)
initializer = tf.keras.initializers.GlorotUniform(seed=42)
activation = None  # tf.keras.activations.sigmoid or softmax
outputs = keras.layers.Dense(1,
                             kernel_initializer=initializer,
                             activation=activation)(x)
model = keras.Model(inputs, outputs)
Pay attention:
- The last layer has only 1 unit. So the output (y_pred) will be a single floating point, just like the true (actual) label (y_true).
- For the last layer, the activation function can be:
  - None
  - sigmoid
  - softmax
- When no activation function is used in the model's last layer, we need to set from_logits=True in the cross-entropy loss functions, as we discussed above. Thus, the cross-entropy loss functions will apply a sigmoid transformation on the predicted label values:

if from_logits: return nn.sigmoid_cross_entropy_with_logits(labels=target, logits=output)
Compile the model:

model.compile(optimizer=keras.optimizers.Adam(),
              loss=keras.losses.BinaryCrossentropy(from_logits=True),  # default from_logits=False
              metrics=[keras.metrics.BinaryAccuracy()])
Important:
We need to use keras.metrics.BinaryAccuracy() for measuring the accuracy, since it calculates how often predictions match binary labels.
- As we mentioned above, Keras does not define a single accuracy metric, but several different ones, among them: accuracy, binary_accuracy and categorical_accuracy.
- What happens under the hood is that, if you mistakenly select categorical cross-entropy as your loss function in binary classification and do not specify a particular accuracy metric, but just write metrics="Accuracy", Keras (wrongly...) infers that you are interested in categorical_accuracy, and this is what it returns, while in fact you are interested in binary_accuracy, since our problem is binary classification.
In summary:
- to get model.fit() and model.evaluate() to run correctly (without mixing up the loss function and the classification problem at hand), we need to specify the actual accuracy metric!
- if the true (actual) labels are encoded as binary (0./1.), you need to use keras.metrics.BinaryAccuracy() for measuring the accuracy, since it calculates how often predictions match binary labels.
Try & See
Now, we can try and see the performance of the model by using a combination of activation and loss functions. Each epoch takes almost 15 seconds on a Colab TPU accelerator.

model.fit(ds_train_resize_scale_batched, validation_data=ds_test_resize_scale_batched, epochs=20)

Epoch 1/20
16/16 [==============================] - 17s 1s/step - loss: 0.7149 - binary_accuracy: 0.4824 - val_loss: 0.6762 - val_binary_accuracy: 0.5039
... ...
Epoch 19/20
16/16 [==============================] - 17s 1s/step - loss: 0.3041 - binary_accuracy: 0.8730 - val_loss: 0.5146 - val_binary_accuracy: 0.8125
Epoch 20/20
16/16 [==============================] - 17s 1s/step - loss: 0.2984 - binary_accuracy: 0.8809 - val_loss: 0.5191 - val_binary_accuracy: 0.8125

model.evaluate(ds_test_resize_scale_batched)

4/4 [==============================] - 2s 556ms/step - loss: 0.5191 - binary_accuracy: 0.7266
[0.519140362739563, 0.7265625]
Obtained Results*:
*When you run this notebook, most probably you will not get the exact numbers; rather, you will observe very similar values due to the stochastic nature of ANNs.
Note that:
- Generally, we use softmax activation instead of sigmoid with the cross-entropy loss because softmax activation distributes the probability throughout each output node.
- But, for binary classification, we use sigmoid rather than softmax.
- The practical reason is that softmax is specially designed for multi-class and multi-label classification tasks.
- Sigmoid is equivalent to a two-element Softmax, where the second element is assumed to be zero. Therefore, sigmoid is generally used for binary classification.
- The above results support this recommendation.
Why does the BinaryCrossentropy loss function with from_logits=True lead to good accuracy without any activation function?
Because using from_logits=True tells the BinaryCrossentropy loss function to apply its own sigmoid transformation over the inputs:

if from_logits: return nn.sigmoid_cross_entropy_with_logits(labels=target, logits=output)

In the Keras documentation: "Using from_logits=True may be more numerically stable."
In summary:
We can conclude that, if the task is binary classification and the true (actual) labels are encoded as a single floating number (0./1.), we have two options to go with:
- Option 1: activation = sigmoid, loss = BinaryCrossentropy(), accuracy metric = BinaryAccuracy()
- Option 2: activation = None, loss = BinaryCrossentropy(from_logits=True), accuracy metric = BinaryAccuracy()
Both options translate into code as sketched below.
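Here is a sketch of both options (my own summary of the steps above, reusing the inputs and x tensors from the model definition earlier):

# Option 1: sigmoid activation + default BinaryCrossentropy
outputs = keras.layers.Dense(1, activation=tf.keras.activations.sigmoid)(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer=keras.optimizers.Adam(),
              loss=keras.losses.BinaryCrossentropy(),
              metrics=[keras.metrics.BinaryAccuracy()])

# Option 2: no activation + BinaryCrossentropy(from_logits=True)
outputs = keras.layers.Dense(1, activation=None)(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer=keras.optimizers.Adam(),
              loss=keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=[keras.metrics.BinaryAccuracy()])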
A. 2. True (Actual) Labels are one-hot encoded: [1 0] or [0 1]
Normally, in binary classification problems, we do not use one-hot encoding for y_true values. However, I would like to investigate the effects of doing so. In your real-life applications, it is up to you how to encode your y_true. You can think of this section as an experiment.
First, convert the true (actual) label encoding to one-hot:

def one_hot(image, label):
    label = tf.one_hot(label, depth=2)
    return image, label

ds_train_resize_scale_one_hot = ds_train_resize_scale.map(one_hot)
ds_test_resize_scale_one_hot = ds_test_resize_scale.map(one_hot)
show_samples(ds_test_resize_scale_one_hot)

9 samples from the dataset
Notice that:
- There are just two label classes: horses and humans.
- Labels are now one-hot encoded: [1. 0.] → horse, [0. 1.] → human
Prepare the data pipeline by setting the batch size:

ds_train_resize_scale_one_hot_batched = ds_train_resize_scale_one_hot.batch(64)
ds_test_resize_scale_one_hot_batched = ds_test_resize_scale_one_hot.batch(64)
Create the classification model:

inputs = keras.Input(shape=(32, 32, 3))
x = base_model(inputs, training=False)
x = keras.layers.GlobalAveragePooling2D()(x)
initializer = tf.keras.initializers.GlorotUniform(seed=42)
activation = None  # tf.keras.activations.sigmoid or softmax
outputs = keras.layers.Dense(2,
                             kernel_initializer=initializer,
                             activation=activation)(x)
model = keras.Model(inputs, outputs)
Pay attention:
- The last layer now has two units instead of 1. Thus, the output will support one-hot encoding of the true (actual) label. Remember that the one-hot vector has two floating-point numbers in binary classification: [1. 0.] or [0. 1.]
- For the last layer, the activation function can be:
  - None
  - sigmoid
  - softmax
- When no activation function is used, we need to set from_logits=True in the cross-entropy functions, as we discussed above.
Compile the model:

model.compile(optimizer=keras.optimizers.Adam(),
              loss=keras.losses.CategoricalCrossentropy(from_logits=True),  # default from_logits=False
              metrics=[keras.metrics.CategoricalAccuracy()])
Important:
We need to use keras.metrics.CategoricalAccuracy() for measuring the accuracy, since it calculates how often predictions match one-hot labels.
Do NOT use just metrics=['accuracy'] as a performance metric! Because, as explained above in detail:
- Keras does not define a single accuracy metric, but several different ones, among them: accuracy, binary_accuracy and categorical_accuracy.
- What happens under the hood is that, if you mistakenly select binary cross-entropy as your loss function when y_true is encoded one-hot and do not specify a particular accuracy metric, providing only metrics="Accuracy", Keras (wrongly...) infers that you are interested in binary_accuracy, and this is what it returns, while in fact you are interested in categorical_accuracy (because of the one-hot encoding!).
In summary:
- to get model.fit() and model.evaluate() to run correctly (without mixing up the loss function and the classification problem at hand), we need to specify the actual accuracy metric!
- if the true (actual) labels are encoded one-hot, you need to use keras.metrics.CategoricalAccuracy() for measuring the accuracy, since it calculates how often predictions match one-hot labels.
Try & See
You can try and see the performance of the model by using a combination of activation and loss functions. Each epoch takes almost 15 seconds on a Colab TPU accelerator.

model.fit(ds_train_resize_scale_one_hot_batched, validation_data=ds_test_resize_scale_one_hot_batched, epochs=20)

Epoch 1/20
17/17 [==============================] - 17s 1s/step - loss: 0.8083 - categorical_accuracy: 0.4956 - val_loss: 0.7656 - val_categorical_accuracy: 0.4648
... ...
Epoch 19/20
17/17 [==============================] - 17s 997ms/step - loss: 0.2528 - categorical_accuracy: 0.9182 - val_loss: 0.5972 - val_categorical_accuracy: 0.7031
Epoch 20/20
17/17 [==============================] - 17s 1s/step - loss: 0.2476 - categorical_accuracy: 0.9211 - val_loss: 0.6044 - val_categorical_accuracy: 0.6992

model.evaluate(ds_test_resize_scale_one_hot_batched)

4/4 [==============================] - 2s 557ms/step - loss: 0.6044 - categorical_accuracy: 0.6992
Obtained Results*:
*When you run this notebook, most probably you will not get the exact numbers; rather, you will observe very similar values due to the stochastic nature of ANNs.
Why do the Binary and Categorical cross-entropy loss functions lead to similar accuracy?
I would like to remind you that when we tested the two loss functions with the true labels encoded as one-hot, the calculated loss values were very similar. The model converges by using the loss function results, and since both functions generate similar loss values, the resulting trained models have similar accuracy, as seen above.
Why do the Sigmoid and Softmax activation functions lead to similar accuracy?
- Since we use one-hot encoding for the true labels, sigmoid generates two floating numbers ranging from 0 to 1, but the sum of these two numbers does not necessarily equal 1 (they are not a probability distribution).
- On the other hand, softmax generates two floating numbers ranging from 0 to 1, and the sum of these two numbers is exactly 1.
- Normally, the Binary and Categorical cross-entropy loss functions expect a probability distribution over the input values (when from_logits = False, the default).
- However, the sigmoid activation function output is not a probability distribution over the two outputs.
- Nevertheless, the Binary and Categorical cross-entropy loss functions can consume sigmoid outputs and generate similar loss values; the difference between the two activations is sketched below.
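A short sketch (my own toy logits) of that difference on a two-unit output:

import tensorflow as tf

logits = tf.constant([[2.0, -1.0]])  # a single two-unit output

sig = tf.keras.activations.sigmoid(logits)
soft = tf.keras.activations.softmax(logits)

print("sigmoid:", sig.numpy(), "sum =", sig.numpy().sum())   # sum != 1 in general
print("softmax:", soft.numpy(), "sum =", soft.numpy().sum())  # sum == 1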
Why 0.6992?
I have run the models for 20 epochs starting with the same initial weights to isolate the effect of the initial weights on performance. Here, 4 models achieve exactly 0.6992 accuracy and the rest similarly achieve exactly 0.7148. One reason might be that it is just chance. Another reason could be that all the loss calculations end up with the same values so that the gradients are exactly the same, but that is not likely. I checked several times and the process seems to be correct. Please try it yourself at home :))
According to the above experiment results, if the task is binary classification and the true (actual) labels are encoded as one-hot, we have two options:
- Option A: activation = None, loss = BinaryCrossentropy(from_logits=True), accuracy metric = CategoricalAccuracy()
- Option B: activation = sigmoid, loss = BinaryCrossentropy(), accuracy metric = CategoricalAccuracy()
Binary Classification Summary
In a nutshell, in binary classification:
- we use floating numbers 0. or 1. to encode the class labels,
- BinaryAccuracy is the correct accuracy metric,
- (generally recommended) the last layer activation function is Sigmoid and the loss function is BinaryCrossentropy,
- but we observed that a last layer activation of None with the loss function BinaryCrossentropy(from_logits=True) could also work.
The summary of the experiments is below:
Next: Part B: Multi-Class Classification (more than two target classes)
You can follow me on these social networks:
YouTube
Github
Kaggle
Medium
Source: https://medium.com/deeplearningwithkeras/whichactivationlossfunctionspartae16f5ad6d82a