Errata

  • Page xiii: The heading for the section on page 329 should be “King – Man + Woman = Queen” without an exclamation mark after Woman. (Found by author)
  • Page 2: There is a missing comma between “weight” and “and a single output” in the first sentence that describes the perceptron. The corrected sentence is “The perceptron consists of a computational unit, a number of inputs (one of which is a special bias input, which is detailed later in this chapter), each with an associated input weight, and a single output.” That is, the perceptron has a single output (as opposed to each input having a single output). (Found by author)
  • Page 3: The vertical line in Figure 1-3 is somewhat confusing in that it makes it seem like the output can take on any value between -1 and 1 when the input is 0. It would have been better to omit the vertical line to make it clear that there is a discontinuity in the output value. (Found by Ruthie Lyle)
  • Page 10: In Code Snippet 1-4, the variable y refers to the ground truth. This is confusing given that the third sentence in the numbered list on page 8 uses a different notation (there y corresponds to the network output). (Found by Juraj Mavracic)
  • Page 17: The last sentence in the second paragraph makes it sound like the weights in Figure 1-9 correspond to the two lines in Figure 1-8. This is not the case. The chosen weights in Figure 1-9 will result in two lines of somewhat different slope and orientation, but those lines also fall in between the plusses and minuses in Figure 1-8.
  • Page 19: In multiple places the word “multilevel” is used (e.g., “multilevel neural network” and “multilevel perceptron”). The correct term is “multilayer”. (Found by Bart Pelle)
  • Page 19: The third sentence from the end of the page makes it sound like the weights of P0 in Figure 1-9 correspond to a specific NAND-gate presented earlier, but this is not the case. The intent is simply to state that the weights for P0 result in the behavior of a NAND-gate.
  • Page 19: At the bottom of the page, the notation used for the XOR function is not described. The letters A and B correspond to the two inputs. The dot symbol corresponds to AND. The plus symbol corresponds to OR. The line above a term corresponds to NOT (inverse). A worked example of this notation is given after the errata list.
  • Page 20: The last sentence on the page should be changed to the following “Further, if you have a graphics processing unit (GPU) capable of running CUDA, there is the CUDA BLAS (cuBLAS) library that enables these operations to be performed efficiently on the GPU, which can give you orders of magnitude speedup compared to running on a CPU.” That is, NumPy does not make use of CUDA or CuBLAS. As a side note, cuBLAS and cuDNN provide interfaces that accept NumPy arrays. (Found by Bart Pelle) 
  • Page 21: The x in the formula is only italic but should have been bold italic. (Found by Peter Hizalev)
  • Page 31: The text states that the upper left chart is identical to the chart in Figure 1-10, but it is the upper right chart that is identical. That is, the two upper charts need to swap places to be consistent with the description in the text. (Found by Wan-Ting Chen)
  • Page 38: At the end of the last sentence of the second to last paragraph, x1 should be x0. (Found by Wan-Ting Chen)
  • Page 65: On the second to last line of the last paragraph, -1 is incorrect. It should be 0. (Found by Wan-Ting Chen)
  • Page 78: The first calculation results in the value -0.24492, which should have been rounded to -0.24 instead of -0.25. As a side note, the subsequent calculations in this section were done in a spreadsheet that used the full-precision numbers rather than the rounded numbers printed in the book. (Found by Neil Braggio)
  • Page 94: In the last line of Code Snippet 4-1, “test_images” should be changed to “test_labels”. (Found by author)
  • Page 99: Figure 4-2 uses the word “epochs” without first having described what an epoch is. See page 111 for an explanation of what an epoch is. (Found by Ruthie Lyle)
  • Page 107: In the last paragraph before the programming example, “i=1” should be changed to “j=1”. (Found by author)
  • Page 126: In the first paragraph, in the sentence “Similarly, even if the neuron is not fully saturated, the derivative is less than 0” the 0 should have been 1. Additionally, the subsequent sentence should be changed to “Doing a series of multiplications (one per layer) where each number is a positive number less than 1 results in the gradient approaching 0.” (Found by Wan-Ting Chen)
  • Page 127: The first sentence in the paragraph after the yellow box references Code Snippet 5-4. It should have referenced Code Snippet 5-3. (Found by Wan-Ting Chen)
  • Page 132: The derivative of the error term (de/dy^) is incorrect. The plus sign between the two terms should be a negative sign. That is, replace (y/y^ + (1-y)/(1-y^)) by (y/y^ – (1-y)/(1-y^)). A short derivation is given after the errata list. (Found by Hank Shen)
  • Page 184: In the second sentence in the green box, “deep into the network” should be replaced by “layers deep into the network”. (Found by Nafez Qanadilo)
  • Page 189: In Table 7-3, in the two rightmost cells, the number 4,925,200 is incorrect. It should be 4,915,200 from Table 7-2. (Found by Wan-Ting Chen). 
  • Page 190: In the lower right cell of the table, “4,090” should be changed to “4,097”. (Found by author)
  • Page 193: The last sentence of the second paragraph ends with “so we can interpret the one-hot encoded outputs as probabilities.” In reality, even though the ground truth in the training examples was one-hot encoded, the outputs themselves will not be one-hot encoded (multiple outputs are non-zero for Softmax). Therefore, the corrected ending of the sentence should be “so we can interpret the outputs as probabilities.”
  • Page 202: There is a missing y at the end of Krizhevsky. (Found by author).
  • Page 208: Reading the third paragraph, it is not obvious what the benefit of 1×1 convolutions is in VGGNet. To address this, before the sentence that begins “We can use 1×1 convolutions”, add the sentence “That is, the 1×1 convolutions in VGGNet provide the ability to combine features from different channels into new features.” (Found by Wan-Ting Chen)
  • Page 217: In the second paragraph, x is a vector, so it should have been bold. The same applies to Figure 8-4. (Found by Wan-Ting Chen)
  • Page 223: On the first line of the first paragraph, “7-layer” should be replaced by “8-layer” because AlexNet consists of eight layers. (Found by Wan-Ting Chen). 
  • Page 232: In the second to last paragraph, last sentence, “three weights per additional output” should be changed to “four weights including bias per additional output”, and “nine weights per additional output” should be changed to “ten weights including bias per additional output”. (Found by Wan-Ting Chen)
  • Page 232: In the last paragraph, “M*K2+1” should be “M*(K2+1)”, and “N * M + 1” should be changed to “N * ( M + 1).” Similarly, the resulting formula on the next page becomes “W * H * (M * (K2 +1)) + W * H * (N * (M + 1))”. (Found by Wan-Ting Chen)
  • Page 241: In the last sentence of the first paragraph, “day” should be changed to “month”. (Found by Wan-Ting Chen)
  • Page 243: In each of the two equations, y is a vector and should have been using bold typeface. (Found by Wan-Ting Chen)
  • Page 244: In the equation, h is a vector and should have been using bold typeface. (Found by Wan-Ting Chen)
  • Page 257: In Code Snippet 9-4, the line that starts with “stddev =” should be “stddev = np.std(train_sales)”. This also has implications for the numbers quoted on pages 259, 260, and 263. However, the conclusions do not change. (Found by Jason Lee)
  • Page 262: In the caption for Code Snippet 9-9, “7 days” should be changed to “12 months”. (Found by Wan-Ting Chen)
  • Page 262: In Code Snippet 9-9, the following import is missing: “from tensorflow.keras.layers import Flatten”. Also note that you still need to first create a Sequential model before starting to add layers: “model = Sequential()”. (Found by Wan-Ting Chen)
  • Page 275: In the first sentence after the green box it says that a simple RNN would result in an output close to 1. This is not correct. The output of the activation function will often tend to get close to 1 or 0 for a logistic sigmoid activation and close to 1 or -1 for a tanh activation function, if the value of the input is of significant magnitude. However, the value the RNN will converge to also depends on the recurrent weight. (Found by Wan-Ting Chen)
  • Page 276: There are a couple of inaccuracies in the description of the LSTM cell under the red box. It says that “there is a gate that controls whether or not the remembered value should be sent to the output of the cell”, but it fails to mention that the remembered value is first passed through the output activation function (Out Act in Figure 10-5). Further down it states that the output activation receives its value from the output gate. This is not correct. It is the gate that receives its value from the output activation function. (Found by Wan-Ting Chen)
  • Page 277: In Figure 10-5, the input to the forget gate is marked as c^(t-1), which is a common notation in the literature. However, that was never explicitly mentioned in the text, which can be confusing given that it was named y^(t-1) in Figure 10-4. (Found by Wan-Ting Chen)
  • Page 292: In Code Snippet 11-1, the variable MAX_LENGTH is initialized to 50. This variable is not used for anything and can simply be deleted. (Found by author)
  • Page 293: “utf-8” should be changed to “utf-8-sig” to properly handle formatting codes found in the assumed text file. (Found by Ibrahim Abdelkader)
  • Page 297: In the second sentence of the last paragraph “word in the vocabulary” should be “character in the alphabet.” (Found by Peter Hizalev)
  • Page 298: In the middle of the page, the printed output from the programming example shows only six generated predictions. It should have shown eight predictions given that the variable BEAM_SIZE was set to 8 on page 292. (Found by Peter Hizalev)
  • Page 307: The description of language models does not clearly distinguish between n-grams and language models based on n-grams. In reality, a language model based on n-grams makes use of the Markov assumption to provide probabilities for sequences that are longer than n. That is, an n-gram based language model uses multiple n-grams to provide the probability of a sequence longer than n, by multiplying the probabilities of the individual n-grams in the sequence (see the worked equation after the errata list). The description in the book incorrectly makes it appear as if a single n-gram is equivalent to an n-gram based language model. (Found by author)
  • Page 309: In the first paragraph, in all places where it says 5-gram, it should have said 6-gram. (Found by Wan-Ting Chen)
  • Page 321: “utf-8” should be changed to “utf-8-sig” to properly handle formatting codes found in the assumed text file. (Found by Ibrahim Abdelkader)
  • Page 322: In the first paragraph it is described that converting an index that is not assigned to a word will result in UNK. This is correct, and it turns out that in the programming example this happens fairly often, because the vocabulary size in the input data file is smaller than 10,000. Therefore, you might get better results (fewer instances of UNK) if you set MAX_WORDS to 7500 instead of 10000. (Found by Wan-Ting Chen)
  • Page 322: “np.int” should be changed to “np.int64” (np.int is deprecated). (Found by Ibrahim Abdelkader)
  • Page 326: “np.int” should be changed to “np.int64” (np.int is deprecated). (Found by Ibrahim Abdelkader)
  • Page 329: In the first paragraph it says that it does not seem too farfetched to believe that slothful is used together with the word monster in the book. Upon further examination of the book, it turns out that slothful is not mentioned together with the word monster. (Found by Wan-Ting Chen)
  • Page 329: The section heading should be “King – Man + Woman = Queen” without an exclamation mark after Woman. (Found by author)
  • Page 332: In the third paragraph it says that “the authors of word2vec are the same authors who discovered the King/Queen property.” This is not correct. Only the first author is the same between the two papers. (Found by Wan-Ting Chen)
  • Page 338: The last paragraph describes how Jaccard similarity can be computed, and the description states to divide by the length of the vectors. This assumes that the vectors are both of the same length and that no position is non-zero in both vectors (this was not explicitly pointed out in the description). More formally, the Jaccard similarity is computed as the ratio between the size of the intersection and the size of the union of the two vocabularies (a short sketch is given after the errata list). (Found by Wan-Ting Chen)
  • Page 340: The text in Figure 12-9 is misspelled (non-normalized and normalized should only have a single l). (Found by author) 
  • Page 345: The second bullet (starting with “One or more hidden layers”) should be changed to “One or more hidden layers that are either feedforward or recurrent layers – high complexity (fully connected)”. (Found by Wan-Ting Chen)
  • Page 350: In the last sentence of the second to last paragraph, “Each training example” refers to a single combination of an input and output word (a single line in Table 13-1). This is confusing given that “training example” referred to five consecutive words earlier on that same page. (Found by Wan-Ting Chen)
  • Page 353: In the first paragraph, it is stated that “Press and Wolf (2017) have shown that it can be beneficial to tie the input and output embeddings together using weight sharing.” In reality, Press and Wolf showed that it can be beneficial in the context of language modeling and neural machine translation. For word2vec they found it to be better to not tie the input and output embeddings together. (Found by Wan-Ting Chen)
  • Page 353: In the second paragraph it is stated that “we train the network to make this dot product get close to 1.0.” This is incorrect. It is the output of the activation function that we train to be close to 1.0. (Found by Wan-Ting Chen)
  • Page 357: In the description of Code Snippet 13-2, “Three Words” should be replaced by “n Words”. (Found by Wan-Ting Chen)
  • Page 377: In the second paragraph “original src_input_data list” should be changed to “original src_token_seq list”. (Found by Wan-Ting Chen)
  • Page 395: In the last paragraph it says that each vector corresponds to the internal state of the decoder. This is incorrect. It corresponds to the internal state of the encoder. (Found by Wan-Ting Chen)
  • Page 398: In the caption to Figure 15-3 “create encoder input state” should be “create decoder input state”. (Found by Wan-Ting Chen)
  • Page 404: The first sentence in the second paragraph should be changed to “Up until now, the description has assumed a network with a single recurrent layer.” (Found by Wan-Ting Chen) 
  • Page 407: In the first paragraph it is stated that both Kalchbrenner (2016) and Gehring (2016) use convolutional networks with attention. This is incorrect. Kalchbrenner (2016) did not use attention. (Found by Wan-Ting Chen)
  • Page 407: The second reference in the second paragraph should be (Lin, Feng, et al., 2017). (Found by Wan-Ting Chen)
  • Page 409: Figure 15-8 does not contain enough information to clearly communicate how the multiple positions (words) in the previous layer connect to the self-attention mechanism. In reality, the self-attention mechanism for one position receives a value and a key from each position in the preceding layer, while it receives a query only from the same position in the preceding layer. That is, each attention mechanism receives a single query, N keys, and N values, where N is the number of positions in the preceding layer. A minimal sketch of this computation is given after the errata list. (Found by author)
  • Page 412: In Figure 15-10, the top skip connections in the encoder module should originate between the normalization layer and the feedforward layer. (Found by Wan-Ting Chen)
  • Page 412: On the third to last line on the page, the sentence that ends with “encoder modules” should be changed to end with “encoder stack”. In the sentence after that, the words “intermediate state” should be changed to “output state”. Further, Figure 15-11 should be modified so that the horizontal arrows connecting encoder modules to decoder modules all originate from the top-most encoder module, instead of from different encoder modules. (Found by Vijay Agrawal)
  • Page 438: The last three words before Code Snippet 16-14 should be changed from “our decoder model” to “our encoder model”. (Found by Wan-Ting Chen)
  • Page 488: The imports of Reshape and random are unused so those two lines can be deleted. (Found by author)
  • Page 492: The variable prev_size is unused so this line can be deleted. (Found by author)
  • Page 496: In the second sentence of the last paragraph, “validation error” should be replaced by “validation accuracy” to match the code. (Found by Wan-Ting Chen)
  • Page 500: “np.int” should be changed to “np.int64” (np.int is deprecated) in three places. (Found by Ibrahim Abdelkader)
  • Page 648: The URL for the paper by Bahdanau is incorrect. It should be https://arxiv.org/pdf/1409.0473 (Found by Wan-Ting Chen)
  • Page 649: The URL for the paper by Chung is incorrect. It should be https://arxiv.org/pdf/1412.3555 (Found by Wan-Ting Chen)
  • Page 654: The URL for the paper by Jozefowicz is incorrect. It should be https://arxiv.org/pdf/1602.02410.pdf (Found by Wan-Ting Chen)
  • Page 656: Missing space between the words “a” and “Taylor” in the entry for Linnainmaa. (Found by author)
  • Page 544: In the second paragraph it says to “run the entire image through the convolutional layers and max pooling layers of a pretrained VGGNet-16 model. That is, the two fully connected layers and the softmax layer have been removed.” This is incorrect. Not only are the two fully connected layers and the softmax layer removed, but also the last max-pooling layer. That is, looking at VGGNet-16 in Table 8-1 on page 209, the last four layers are removed. Note that two of the layers (the two fully connected ones) are later added back after the ROI pooling layer (described in the next paragraph as “we can connect the output of the ROI pooling layer to the pretrained fully connected layers”). This change also implies that the feature map will have a dimension of W/16 x H/16, instead of having 32 in the denominators. Similarly, the footnote needs to be changed to reflect this by changing “32” to “16” and by changing “five pooling layers” to “four pooling layers”. (Found by Wan-Ting Chen)
  • Page 546: In the upper left part of Figure B-5 it should say “K+1 classes” instead of “K classes” to be able to classify the input image as containing one out of K object classes or no object at all. (Found by Wan-Ting Chen)
  • Page 547: In the second paragraph it says that one of the sibling layers provides K outputs, where each output indicates whether or not an object is present. In the original paper, this sibling layer provided 2*K outputs to provide an explicit probability both for object vs. background (although they also mentioned that it is possible to implement with only K outputs). This also has implications on Figure B-6, which shows only K outputs in the left branch of the network.
  • Page 548: In the second paragraph that discusses the sliding window approach, it would have made sense to point out that the sliding window approach is conceptual. In an actual implementation, this would typically be implemented as a convolutional network, which applies the same operation across multiple locations (see the original paper for all details). (Found by Wan-Ting Chen)
  • Page 551: In the last paragraph, the third to last and second to last sentences should be changed to “Note that it is not the Euclidean distance between the red pixel and the blue pixel that determines the weight. Instead, the weight is computed as the product of (1 – distance) in each of the two (x, y) dimensions.” A small numeric example is given after the errata list. (Found by Wan-Ting Chen)
  • Page 566: In the section on FastText, the first sentence should be changed to “FastText (Bojanowski et al., 2017) extends both the word2vec continuous skip-gram model and continuous bag-of-words model. In this section we will describe their extension of the continuous skip-gram model.” (Found by Wan-Ting Chen)
  • Page 574: In the third paragraph it says that it is beneficial to combine the ELMo embeddings with another context-independent embedding, but it is not clearly spelled out how they are combined. Replace “combine” by “concatenate” for clarity. (Found by Wan-Ting Chen)
  • Page 581: In the fourth paragraph, the citation “Sennrich, Haddow, and Birch (2016)” should be changed to “Radford et al. (2018)” (Found by Wan-Ting Chen)
  • Page 587: In the third paragraph, in two sentences “Devin” should be replaced by “Devlin”. (Found by Wan-Ting Chen)
  • Page 588: In the second paragraph, the last sentence should be replaced by “Finally, the RoBERTa study also quantified the effect of the number of training steps, by studying a range from 100K to 500K.” (Found by Wan-Ting Chen)
  • Page 623: One of the code examples requires scipy and one of the PyTorch versions of the code examples requires scikit-learn, so you would also need to do “pip3 install scipy” and “pip3 install scikit-learn”. Or, perhaps better, make use of requirements.txt in the GitHub repository, which includes sub-dependencies as well. (Found by author)
  • Page 627: For the bookstore sales dataset, we only used sales data up until and including March 2020. If you want to use the same input data as was used in the book, you will have to manually delete the last nine data points. (Found by author)
  • Page 629: For consistency, the command line to install PyTorch should begin with “pip3” instead of “pip”, although in practice, “pip” will most likely be mapped to a 3.x pip version. (Found by author)
  • Page 633: The fifth bullet states that item() converts a single element into a NumPy value. It should have said that it converts it into a standard Python number. (Found by Wan-Ting Chen)

  • Page 616: The last paragraph states that Keras implements both reset-before and reset-after, which is true. However, Keras’ implementation has a subtle difference with respect to the number of bias terms for the reset-after version, which does not exactly match what is described in the book.

  • Page 618: The last equation is incorrect. It should be changed to the following:

(Found by Delyan Kalchev)
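
Worked example for the page 19 erratum (XOR notation): in that notation, the standard sum-of-products form of XOR is shown below. This is the generic Boolean identity, written here only for reference; the exact expression printed in the book may be arranged differently but uses the same symbols.

    \mathrm{XOR}(A, B) = A \cdot \overline{B} + \overline{A} \cdot B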
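
Derivation for the page 132 erratum: the two terms have the form of the derivative of the binary cross-entropy loss, so differentiating that loss term by term confirms the corrected sign.

    e = -\left( y \ln \hat{y} + (1 - y) \ln(1 - \hat{y}) \right)
    \frac{\partial e}{\partial \hat{y}} = -\left( \frac{y}{\hat{y}} - \frac{1 - y}{1 - \hat{y}} \right)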
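
Worked equation for the page 307 erratum: under the Markov assumption, an n-gram based language model approximates the probability of a sequence w1, ..., wm (with m greater than n) as a product of per-n-gram conditional probabilities (with suitable start-of-sequence padding for the first few words). This is the generic formulation, not a quote from the book.

    P(w_1, \dots, w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \dots, w_{i-1})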
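
Sketch for the page 338 erratum: a minimal Python illustration of the set-based definition of Jaccard similarity (the function and variable names are illustrative, not taken from the book's code).

    def jaccard_similarity(vocab_a, vocab_b):
        # Jaccard similarity = |A intersection B| / |A union B| for two vocabularies.
        a, b = set(vocab_a), set(vocab_b)
        if not a and not b:
            return 1.0  # convention: two empty vocabularies are considered identical
        return len(a & b) / len(a | b)

    # Example: the intersection has 3 words, the union has 5, so the similarity is 0.6.
    print(jaccard_similarity({"deep", "learning", "is", "fun"},
                             {"machine", "learning", "is", "fun"}))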
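
Sketch for the page 409 erratum: a minimal NumPy version of scaled dot-product attention for one position, which receives a single query together with N keys and N values from the preceding layer (the names, shapes, and random test data are illustrative assumptions, not the book's code).

    import numpy as np

    def single_query_attention(q, K, V):
        # q: (d,) query for one position; K, V: (N, d) keys and values from all N positions.
        d = q.shape[0]
        scores = K @ q / np.sqrt(d)        # (N,) one score per preceding position
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()           # softmax over the N positions
        return weights @ V                 # (d,) weighted sum of the N value vectors

    np.random.seed(0)
    N, d = 5, 4
    q, K, V = np.random.randn(d), np.random.randn(N, d), np.random.randn(N, d)
    print(single_query_attention(q, K, V))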
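
Numeric example for the page 551 erratum, using made-up coordinates: if the blue (grid) pixel is at (2, 3) and the red (sample) pixel is at (2.25, 3.60), the distances along x and y are 0.25 and 0.60, so the weight is (1 - 0.25) * (1 - 0.60), or about 0.30. In Python:

    dx = abs(2.25 - 2)             # distance along x = 0.25
    dy = abs(3.60 - 3)             # distance along y = 0.60
    weight = (1 - dx) * (1 - dy)   # 0.75 * 0.40 = about 0.30 (not based on Euclidean distance)
    print(weight)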

The errors above appeared in the English printed version of the book. Some or all of these errors may already have been addressed in later translated versions of the book and/or updated electronic versions.

Submit errors or problems

If you find something that looks incorrect or is simply unclear, please let us know so we can add it to the Errata. For each error, we will state the name of the person who reported it first.