Errata

  • Page xiii: The heading for the section on page 329 should be “King – Man + Woman = Queen” without an exclamation mark after Woman. (Found by author)
  • Page 2: There is a missing comma between “weight” and “and a single output” in the first sentence that describes the perceptron. The corrected sentence is “The perceptron consists of a computational unit, a number of inputs (one of which is a special bias input, which is detailed later in this chapter), each with an associated input weight, and a single output.” That is, the perceptron has a single output (as opposed to each input having a single output). (Found by author)
  • Page 3: The vertical line in Figure 1-3 is somewhat confusing in that it makes it seem like the output can take on any value between -1 and 1 when the input is 0. It would have been better to omit the vertical line to make it clear that there is a discontinuity in the output value. (Found by Ruthie Lyle)
  • Page 10: In Code Snippet 1-4, the variable y refers to the ground truth. This is confusing given that the third sentence in the numbered list on page 8 uses a different notation (there y corresponds to the network output). (Found by Juraj Mavracic)
  • Page 17: The last sentence in the second paragraph makes it sound like the weights in Figure 1-9 correspond to the two lines in Figure 1-8. This is not the case. The chosen weights in Figure 1-9 will result in two lines of somewhat different slope and orientation, but those lines also fall in between the plusses and minuses in Figure 1-8.
  • Page 19: In multiple places the word “multilevel” is used (e.g., “multilevel neural network” and “multilevel perceptron”). The correct term is “multilayer”. (Found by Bart Pelle)
  • Page 19: The third sentence from the end of the page makes it sound like the weights of P0 in Figure 1-9 correspond to a specific NAND-gate presented earlier, but this is not the case. The intent is simply to state that the weights for P0 result in the behavior of a NAND-gate.
  • Page 19: At the bottom of the page, the notation used for the XOR function is not described. The letters A and B correspond to the two inputs. The dot symbol corresponds to AND. The plus symbol corresponds to OR. The line above a term corresponds to NOT (inverse). A worked example of this notation is given after the errata list.
  • Page 20: The last sentence on the page should be changed to the following “Further, if you have a graphics processing unit (GPU) capable of running CUDA, there is the CUDA BLAS (cuBLAS) library that enables these operations to be performed efficiently on the GPU, which can give you orders of magnitude speedup compared to running on a CPU.” That is, NumPy does not make use of CUDA or CuBLAS. As a side note, cuBLAS and cuDNN provide interfaces that accept NumPy arrays. (Found by Bart Pelle) 
  • Page 21: The x in the formula is only italic but should have been bold italic. (Found by Peter Hizalev)
  • Page 31: The text states that the upper left chart is identical to the chart in Figure 1-10, but it is the upper right chart that is identical. That is, the two upper charts need to swap places to be consistent with the description in the text. (Found by Wan-Ting Chen)
  • Page 38: At the end of the last sentence of the second to last paragraph, x1 should be x0. (Found by Wan-Ting Chen)
  • Page 65: On the second to last line of the last paragraph, -1 is incorrect. It should be 0. (Found by Wan-Ting Chen)
  • Page 78: The first calculation results in the value -0.24492, which should have been rounded to -0.24 instead of -0.25. As a side note, the subsequent calculations in this section were done in a spreadsheet that used the full-precision numbers rather than the rounded numbers printed in the book. (Found by Neil Braggio)
  • Page 94: In the last line of Code Snippet 4-1, “test_images” should be changed to “test_labels”. (Found by author)
  • Page 99: Figure 4-2 uses the word “epochs” without first having described what an epoch is. See page 111 for an explanation of what an epoch is. (Found by Ruthie Lyle)
  • Page 107: In the last paragraph before the programming example, “i=1” should be changed to “j=1”. (Found by author)
  • Page 126: In the first paragraph, in the sentence “Similarly, even if the neuron is not fully saturated, the derivative is less than 0” the 0 should have been 1. Additionally, the subsequent sentence should be changed to “Doing a series of multiplications (one per layer) where each number is a positive number less than 1 results in the gradient approaching 0.” (Found by Wan-Ting Chen)
  • Page 127: The first sentence in the paragraph after the yellow box references Code Snippet 5-4. It should have referenced Code Snippet 5-3. (Found by Wan-Ting Chen)
  • Page 132: The derivative of the error term (de/dy^) is incorrect. The plus sign between the two terms should be a negative sign. That is, replace (y/y^ + (1-y)/(1-y^)) by (y/y^ – (1-y)/(1-y^)). A short derivation is given after the errata list. (Found by Hank Shen)
  • Page 184: In the second sentence in the green box, “deep into the network” should be replaced by “layers deep into the network”. (Found by Nafez Qanadilo)
  • Page 189: In Table 7-3, in the two rightmost cells, the number 4,925,200 is incorrect. It should be 4,915,200 from Table 7-2. (Found by Wan-Ting Chen). 
  • Page 190: In the lower right cell of the table, “4,090” should be changed to “4,097”. (Found by author)
  • Page 193: The last sentence of the second paragraph ends with “so we can interpret the one-hot encoded outputs as probabilities.” In reality, even though the ground truth in the training examples was one-hot encoded, the outputs themselves will not be one-hot encoded (multiple outputs are non-zero for Softmax). Therefore, the corrected ending of the sentence should be “so we can interpret the outputs as probabilities.”
  • Page 202: There is a missing y at the end of Krizhevsky. (Found by author).
  • Page 208: Reading the third paragraph, it is not obvious what the benefit of 1×1 convolutions is in VGGNet. To address this, before the sentence that begins “We can use 1×1 convolutions”, add the sentence “That is, the 1×1 convolutions in VGGNet provide the ability to combine features from different channels into new features.” (Found by Wan-Ting Chen)
  • Page 217: In the second paragraph, x is a vector, so it should have been bold. The same applies to Figure 8-4. (Found by Wan-Ting Chen)
  • Page 223: On the first line of the first paragraph, “7-layer” should be replaced by “8-layer” because AlexNet consists of eight layers. (Found by Wan-Ting Chen). 
  • Page 232: In the second to last paragraph, last sentence, “three weights per additional output” should be changed to “four weights including bias per additional output”, and “nine weights per additional output” should be changed to “ten weights including bias per additional output”. (Found by Wan-Ting Chen)
  • Page 232: In the last paragraph, “M*K2+1” should be “M*(K2+1)”, and “N * M + 1” should be changed to “N * ( M + 1).” Similarly, the resulting formula on the next page becomes “W * H * (M * (K2 +1)) + W * H * (N * (M + 1))”. (Found by Wan-Ting Chen)
  • Page 241: In the last sentence of the first paragraph, “day” should be changed to “month”. (Found by Wan-Ting Chen)
  • Page 243: In each of the two equations, y is a vector and should have been using bold typeface. (Found by Wan-Ting Chen)
  • Page 244: In the equation, h is a vector and should have been using bold typeface. (Found by Wan-Ting Chen)
  • Page 257: In Code Snippet 9-4, the line that starts with “stddev =” should be “stddev = np.std(train_sales)”. This also has implications for the numbers quoted on pages 259, 260, and 263. However, the conclusions do not change. (Found by Jason Lee)
  • Page 262: In the caption for Code Snippet 9-9, “7 days” should be changed to “12 months”. (Found by Wan-Ting Chen)
  • Page 262: In Code Snippet 9-9, the following import is missing: “from tensorflow.keras.layers import Flatten”. Also note that you still need to first create a Sequential model before starting to add layers: “model = Sequential()”. (Found by Wan-Ting Chen)
  • Page 275: In the first sentence after the green box it says that a simple RNN would result in an output close to 1. This is not correct. The output of the activation function will often tend to get close to 1 or 0 for a logistic sigmoid activation and close to 1 or -1 for a tanh activation function, if the value of the input is of significant magnitude. However, the value the RNN will converge to also depends on the recurrent weight. (Found by Wan-Ting Chen)
  • Page 276: There are a couple of inaccuracies in the description of the LSTM cell under the red box. It says that “there is a gate that controls whether or not the remembered value should be sent to the output of the cell”, but it fails to mention that the remembered value is first passed through the output activation function (Out Act in Figure 10-5). Further down it states that the output activation receives its value from the output gate. This is not correct. It is the gate that receives its value from the output activation function. (Found by Wan-Ting Chen)
  • Page 277: In Figure 10-5, the input to the forget gate is marked as c^(t-1), which is a common notation in the literature. However, that was never explicitly mentioned in the text, which can be confusing given that it was named y^(t-1) in Figure 10-4. (Found by Wan-Ting Chen)
  • Page 292: In Code Snippet 11-1, the variable MAX_LENGTH is initialized to 50. This variable is not used for anything and can simply be deleted. (Found by author)
  • Page 293: “utf-8” should be changed to “utf-8-sig” to properly handle formatting codes found in the assumed text file. (Found by Ibrahim Abdelkader)
  • Page 297: In the second sentence of the last paragraph “word in the vocabulary” should be “character in the alphabet.” (Found by Peter Hizalev)
  • Page 298: In the middle of the page, the printed output from the programming example shows only six generated predictions. It should have shown eight predictions given that the variable BEAM_SIZE was set to 8 on page 292. (Found by Peter Hizalev)
  • Page 307: The description of language models does not clearly distinguish between n-grams and language models based on n-grams. In reality, a language model based on n-grams makes use of the Markov assumption to provide probabilities for sequences that are longer than n. That is, an n-gram based language model uses multiple n-grams to provide the probability of a sequence longer than n, by multiplying the probabilities of the individual n-grams in the sequence (see the worked equation after the errata list). The description in the book incorrectly makes it appear as if a single n-gram is equivalent to an n-gram based language model. (Found by author)
  • Page 309: In the first paragraph, in all places where it says 5-gram, it should have said 6-gram. (Found by Wan-Ting Chen)
  • Page 321: “utf-8” should be changed to “utf-8-sig” to properly handle formatting codes found in the assumed text file. (Found by Ibrahim Abdelkader)
  • Page 322: In the first paragraph it is described that converting an index that is not assigned to a word will result in UNK. This is correct, and it turns out that in the programming example this happens fairly often, because the vocabulary size in the input data file is smaller than 10,000. Therefore, you might get better results (fewer instances of UNK) if you set MAX_WORDS to 7500 instead of 10000. (Found by Wan-Ting Chen)
  • Page 322: “np.int” should be changed to “np.int64” (np.int is deprecated). (Found by Ibrahim Abdelkader)
  • Page 326: “np.int” should be changed to “np.int64” (np.int is deprecated). (Found by Ibrahim Abdelkader)
  • Page 329: In the first paragraph it says that it does not seem too farfetched to believe that slothful is used together with the word monster in the book. Upon further examination of the book, it turns out that slothful is not mentioned together with the word monster. (Found by Wan-Ting Chen)
  • Page 329: The section heading should be “King – Man + Woman = Queen” without an exclamation mark after Woman. (Found by author)
  • Page 332: In the third paragraph it says that “the authors of word2vec are the same authors who discovered the King/Queen property.” This is not correct. Only the first author is the same between the two papers. (Found by Wan-Ting Chen)
  • Page 338: The last paragraph describes how Jaccard similarity can be computed, and the description states to divide by the length of the vectors. This assumes that the vectors are both of the same length and that no position is non-zero in both vectors (this was not explicitly pointed out in the description). More formally, the Jaccard similarity is computed as the ratio between the size of the intersection and the size of the union of the two vocabularies (a short sketch is given after the errata list). (Found by Wan-Ting Chen)
  • Page 340: The text in Figure 12-9 is misspelled (non-normalized and normalized should only have a single l). (Found by author) 
  • Page 345: The second bullet (starting with “One or more hidden layers”) should be changed to “One or more hidden layers that are either feedforward or recurrent layers – high complexity (fully connected)”. (Found by Wan-Ting Chen)
  • Page 350: In the last sentence of the second to last paragraph, “Each training example” refers to a single combination of an input and output word (a single line in Table 13-1). This is confusing given that “training example” referred to five consecutive words earlier on that same page. (Found by Wan-Ting Chen)
  • Page 353: In the first paragraph, it is stated that “Press and Wolf (2017) have shown that it can be beneficial to tie the input and output embeddings together using weight sharing.” In reality, Press and Wolf showed that it can be beneficial in the context of language modeling and neural machine translation. For word2vec they found it to be better to not tie the input and output embeddings together. (Found by Wan-Ting Chen)
  • Page 353: In the second paragraph it is stated that “we train the network to make this dot product get close to 1.0.” This is incorrect. It is the output of the activation function that we train to be close to 1.0. (Found by Wan-Ting Chen)
  • Page 357: In the description of Code Snippet 13-2, “Three Words” should be replaced by “n Words”. (Found by Wan-Ting Chen)
  • Page 377: In the second paragraph “original src_input_data list” should be changed to “original src_token_seq list”. (Found by Wan-Ting Chen)
  • Page 395: In the last paragraph it says that each vector corresponds to the internal state of the decoder. This is incorrect. It corresponds to the internal state of the encoder. (Found by Wan-Ting Chen)
  • Page 398: In the caption to Figure 15-3 “create encoder input state” should be “create decoder input state”. (Found by Wan-Ting Chen)
  • Page 404: The first sentence in the second paragraph should be changed to “Up until now, the description has assumed a network with a single recurrent layer.” (Found by Wan-Ting Chen) 
  • Page 407: In the first paragraph it is stated that both Kalchbrenner (2016) and Gehring (2016) use convolutional networks with attention. This is incorrect. Kalchbrenner (2016) did not use attention. (Found by Wan-Ting Chen)
  • Page 407: The second reference in the second paragraph should be (Lin, Feng, et al., 2017). (Found by Wan-Ting Chen)
  • Page 409: Figure 15-8 does not contain enough information to clearly communicate how the multiple positions (words) in the previous layer connect to the self-attention mechanism. In reality, the self-attention mechanism for one position receives a value and a key from each position in the preceding layer, while it receives a query only from the same position in the preceding layer. That is, each attention mechanism receives a single query, N keys, and N values, where N is the number of positions in the preceding layer. A minimal sketch of this computation is given after the errata list. (Found by author)
  • Page 412: In Figure 15-10, the top skip connections in the encoder module should originate between the normalization layer and the feedforward layer. (Found by Wan-Ting Chen)
  • Page 412: On the third to last line on the page, the sentence that ends with “encoder modules” should be changed to end with “encoder stack”. In the sentence after that, the words “intermediate state” should be changed to “output state”. Further, Figure 15-11 should be modified so that the horizontal arrows connecting encoder modules to decoder modules all originate from the top-most encoder module, instead of from different encoder modules. (Found by Vijay Agrawal)
  • Page 438: The last three words before Code Snippet 16-14 should be changed from “our decoder model” to “our encoder model”. (Found by Wan-Ting Chen)
  • Page 488: The imports of Reshape and random are unused so those two lines can be deleted. (Found by author)
  • Page 492: The variable prev_size is unused so this line can be deleted. (Found by author)
  • Page 496: In the second sentence of the last paragraph, “validation error” should be replaced by “validation accuracy” to match the code. (Found by Wan-Ting Chen)
  • Page 500: “np.int” should be changed to “np.int64” (np.int is deprecated) in three places. (Found by Ibrahim Abdelkader)
  • Page 648: The URL for the paper by Bahdanau is incorrect. It should be https://arxiv.org/pdf/1409.0473 (Found by Wan-Ting Chen)
  • Page 649: The URL for the paper by Chung is incorrect. It should be https://arxiv.org/pdf/1412.3555 (Found by Wan-Ting Chen)
  • Page 654: The URL for the paper by Jozefowicz is incorrect. It should be https://arxiv.org/pdf/1602.02410.pdf (Found by Wan-Ting Chen)
  • Page 656: Missing space between the words “a” and “Taylor” in the entry for Linnainmaa. (Found by author)
  • Page 544: In the second paragraph it says to “run the entire image through the convolutional layers and max pooling layers of a pretrained VGGNet-16 model. That is, the two fully connected layers and the softmax layer have been removed.” This is incorrect. Not only are the two fully connected layers and the softmax layer removed, but also the last max-pooling layer. That is, looking at VGGNet-16 in Table 8-1 on page 209, the last four layers are removed. Note that two of the layers (the two fully connected ones) are later added back after the ROI pooling layer (described in the next paragraph as “we can connect the output of the ROI pooling layer to the pretrained fully connected layers”). This change also implies that the feature map will have a dimension of W/16 x H/16, instead of having 32 in the denominators. Similarly, the footnote needs to be changed to reflect this by changing “32” to “16” and by changing “five pooling layers” to “four pooling layers”. (Found by Wan-Ting Chen)
  • Page 546: In the upper left part of Figure B-5 it should say “K+1 classes” instead of “K classes” to be able to classify the input image as containing one out of K object classes or no object at all. (Found by Wan-Ting Chen)
  • Page 547: In the second paragraph it says that one of the sibling layers provides K outputs, where each output indicates whether or not an object is present. In the original paper, this sibling layer provided 2*K outputs to provide an explicit probability both for object vs. background (although they also mentioned that it is possible to implement with only K outputs). This also has implications on Figure B-6, which shows only K outputs in the left branch of the network.
  • Page 548: In the second paragraph that discusses the sliding window approach, it would have made sense to point out that the sliding window approach is conceptual. In an actual implementation, this would typically be implemented as a convolutional network, which applies the same operation across multiple locations (see the original paper for all details). (Found by Wan-Ting Chen)
  • Page 551: In the last paragraph, the third to last and second to last sentences should be changed to “Note that it is not the Euclidean distance between the red pixel and the blue pixel that determines the weight. Instead, the weight is computed as the product of (1 – distance) in each of the two (x, y) dimensions.” A small numeric example is given after the errata list. (Found by Wan-Ting Chen)
  • Page 566: In the section on FastText, the first sentence should be changed to “FastText (Bojanowski et al., 2017) extends both the word2vec continuous skip-gram model and continuous bag-of-words model. In this section we will describe their extension of the continuous skip-gram model.” (Found by Wan-Ting Chen)
  • Page 574: In the third paragraph it says that it is beneficial to combine the ELMo embeddings with another context-independent embedding, but it is not clearly spelled out how they are combined. Replace “combine” by “concatenate” for clarity. (Found by Wan-Ting Chen)
  • Page 581: In the fourth paragraph, the citation “Sennrich, Haddow, and Birch (2016)” should be changed to “Radford et al. (2018)” (Found by Wan-Ting Chen)
  • Page 587: In the third paragraph, in two sentences “Devin” should be replaced by “Devlin”. (Found by Wan-Ting Chen)
  • Page 588: In the second paragraph, the last sentence should be replaced by “Finally, the RoBERTa study also quantified the effect of the number of training steps, by studying a range from 100K to 500K.” (Found by Wan-Ting Chen)
  • Page 623: One of the code examples requires scipy and one of the PyTorch versions of the code examples requires scikit-learn, so you would also need to do “pip3 install scipy” and “pip3 install scikit-learn”. Or, perhaps better, make use of requirements.txt in the GitHub repository, which includes sub-dependencies as well. (Found by author)
  • Page 627: For the bookstore sales dataset, we only used sales data up until and including March 2020. If you want to use the same input data as was used in the book, you will have to manually delete the last nine data points. (Found by author)
  • Page 629: For consistency, the command line to install PyTorch should begin with “pip3” instead of “pip”, although in practice, “pip” will most likely be mapped to a 3.x pip version. (Found by author)
  • Page 633: The fifth bullet states that item() converts a single element into a NumPy value. It should have said that it converts it into a standard Python number. (Found by Wan-Ting Chen)

  • Page 616: The last paragraph states that Keras implements both reset-before and reset-after, which is true. However, Keras’ implementation has a subtle difference with respect to the number of bias terms for the reset-after version, which does not exactly match what is described in the book.

  • Page 618: The last equation is incorrect. It should be changed to the following:

(Found by Delyan Kalchev)
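
Worked example for the page 19 erratum (XOR notation): in that notation, the standard sum-of-products form of XOR is shown below. This is the generic Boolean identity, written here only for reference; the exact expression printed in the book may be arranged differently but uses the same symbols.

    \mathrm{XOR}(A, B) = A \cdot \overline{B} + \overline{A} \cdot B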
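
Derivation for the page 132 erratum: the two terms have the form of the derivative of the binary cross-entropy loss, so differentiating that loss term by term confirms the corrected sign.

    e = -\left( y \ln \hat{y} + (1 - y) \ln(1 - \hat{y}) \right)
    \frac{\partial e}{\partial \hat{y}} = -\left( \frac{y}{\hat{y}} - \frac{1 - y}{1 - \hat{y}} \right)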
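
Worked equation for the page 307 erratum: under the Markov assumption, an n-gram based language model approximates the probability of a sequence w1, ..., wm (with m greater than n) as a product of per-n-gram conditional probabilities (with suitable start-of-sequence padding for the first few words). This is the generic formulation, not a quote from the book.

    P(w_1, \dots, w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \dots, w_{i-1})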
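
Sketch for the page 338 erratum: a minimal Python illustration of the set-based definition of Jaccard similarity (the function and variable names are illustrative, not taken from the book's code).

    def jaccard_similarity(vocab_a, vocab_b):
        # Jaccard similarity = |A intersection B| / |A union B| for two vocabularies.
        a, b = set(vocab_a), set(vocab_b)
        if not a and not b:
            return 1.0  # convention: two empty vocabularies are considered identical
        return len(a & b) / len(a | b)

    # Example: the intersection has 3 words, the union has 5, so the similarity is 0.6.
    print(jaccard_similarity({"deep", "learning", "is", "fun"},
                             {"machine", "learning", "is", "fun"}))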
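
Sketch for the page 409 erratum: a minimal NumPy version of scaled dot-product attention for one position, which receives a single query together with N keys and N values from the preceding layer (the names, shapes, and random test data are illustrative assumptions, not the book's code).

    import numpy as np

    def single_query_attention(q, K, V):
        # q: (d,) query for one position; K, V: (N, d) keys and values from all N positions.
        d = q.shape[0]
        scores = K @ q / np.sqrt(d)        # (N,) one score per preceding position
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()           # softmax over the N positions
        return weights @ V                 # (d,) weighted sum of the N value vectors

    np.random.seed(0)
    N, d = 5, 4
    q, K, V = np.random.randn(d), np.random.randn(N, d), np.random.randn(N, d)
    print(single_query_attention(q, K, V))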
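
Numeric example for the page 551 erratum, using made-up coordinates: if the blue (grid) pixel is at (2, 3) and the red (sample) pixel is at (2.25, 3.60), the distances along x and y are 0.25 and 0.60, so the weight is (1 - 0.25) * (1 - 0.60), or about 0.30. In Python:

    dx = abs(2.25 - 2)             # distance along x = 0.25
    dy = abs(3.60 - 3)             # distance along y = 0.60
    weight = (1 - dx) * (1 - dy)   # 0.75 * 0.40 = about 0.30 (not based on Euclidean distance)
    print(weight)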

The errors above appeared in the English printed version of the book. Some or all of these errors may already have been addressed in later translated versions of the book and/or updated electronic versions.

Submit errors or problems

If you find something that looks incorrect or is simply unclear, please let us know so we can add it to the Errata. For each error, we will state the name of the person who reported it first.