Homework 2, Part 1: Hebb and Delta Rules [50 points]

 1A [5 pts.] Explain why the weight from input:0 to output:0 is equal to 0.375, and why the weight from input:1 to output:4 is equal to -0.25.

Example answer:

The first pass of the Delta rule (starting with weights of zero) is equivalent to the Hebb rule (with an extra factor of 2.0 due to the exponent of 2.0 in the error function), so the change in weight w_ij for each pattern is equal to 2 e a_i t_j (where e is the learning rate). After all three patterns (a, b and c), w_ij = 2 e ( a_i^a t_j^a + a_i^b t_j^b + a_i^c t_j^c ). Thus:

For the weight from input 0 to output 0, w_00 = 2.0 x 0.0625 x (1 x 1 + 1 x 1 + 1 x 1) = 0.375.

For the weight from input 1 to output 4, w_14 = 2.0 x 0.0625 x (-1 x 1 + 1 x 0 + -1 x 1) = -0.25.
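The arithmetic above can be checked with a short sketch. The helper name is made up for illustration; the activation/target triples are exactly the values cited above, and the full patterns a, b and c are not reproduced here:

```python
# First-pass Delta-rule weight, equivalent to the Hebb rule with the
# extra factor of 2 from the squared-error exponent:
#   w_ij = 2 * e * sum over patterns of (a_i * t_j)
def first_pass_weight(activations, targets, lrate=0.0625):
    """activations[p], targets[p]: input activation a_i and target t_j on pattern p."""
    return 2.0 * lrate * sum(a * t for a, t in zip(activations, targets))

# Values cited above for patterns a, b and c:
w00 = first_pass_weight([1, 1, 1], [1, 1, 1])     # input 0 -> output 0
w14 = first_pass_weight([-1, 1, -1], [1, 0, 1])   # input 1 -> output 4
print(w00, w14)  # 0.375 -0.25
```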

Feedback:

Almost everyone did well on this question.

1B [5 pts.] If you click on "Train Network" again, the weights remain unchanged. Why? What would have happened if the Hebb rule had been applied instead?

Example answer:

Because the training patterns are orthogonal, the Hebb rule (and, hence, the Delta rule for the first pass) works perfectly; that is, after training on the patterns once, a_j = t_j for each pattern. [Strictly speaking, getting this to come out exactly depends on the learning rate; each pattern is composed of 8 +/- 1's, so that its dot-product with itself is 8, and 8 x 2 (factor from exponent) x 0.0625 (learning rate) = 1.0.] On the second pass, the Delta rule is no longer the same as the Hebb rule (because the weights are not zero); it changes weights according to 2 e ( t_j - a_j ) a_i. But because a_j = t_j, t_j - a_j = 0, and so the weight changes are all equal to zero.

If the Hebb rule were reapplied, exactly the same weight changes would be applied a second time, and since the weights started at zero, they would double.
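This can be verified with a small simulation. The three orthogonal +/-1 patterns and targets below are illustrative stand-ins, not the assignment's actual training set:

```python
lrate = 0.0625
# Mutually orthogonal 8-element +/-1 input patterns (all pairwise dot products are 0)
inputs  = [[1, 1, 1, 1, 1, 1, 1, 1],
           [1, -1, 1, -1, 1, -1, 1, -1],
           [1, 1, -1, -1, 1, 1, -1, -1]]
targets = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]  # arbitrary illustrative targets

W = [[0.0] * 8 for _ in range(3)]            # weights start at zero

def delta_pass(W):
    """One sweep of the Delta rule (dw = 2*e*(t - out)*a); returns total |dw|."""
    total = 0.0
    for a, t in zip(inputs, targets):
        out = [sum(w * x for w, x in zip(row, a)) for row in W]
        for j in range(3):
            for i in range(8):
                dw = 2.0 * lrate * (t[j] - out[j]) * a[i]
                W[j][i] += dw
                total += abs(dw)
    return total

first = delta_pass(W)    # equivalent to the Hebb rule (weights were zero)
second = delta_pass(W)   # 8 x 2 x 0.0625 = 1.0, so a_j = t_j and nothing changes
print(second)  # 0.0
```

Reapplying the Hebb rule instead would add the same outer-product increment a second time, doubling every weight.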

Feedback:

Almost everyone did well on this question as well.

1C [10 pts.] Describe and explain the similarities and differences between the weights that would be produced by the Hebb rule ( hebb-li.wt ) and those produced by the Delta rule ( delta-li.wt ) when training on the linearly independent set.

Example answer:

Patterns a and c are each orthogonal to pattern b but not to each other. This means that b can be learned perfectly by either learning rule without affecting the others, so that the differences between the Delta and Hebb rules reflect the relationship of a and c. These two patterns differ in inputs 6 and 7 (counting from 0) and targets 1, 3, 5, and 7. The eight weights between these two inputs and four outputs change the most from epoch 1 (Hebb) to epoch 10 (Delta), because these are the weights that allow the unique aspects of each pattern (inputs 6 and 7) to produce the necessary output differences (over outputs 1, 3, 5, and 7).

Feedback:

Most people did poorly on this question. Though almost everyone provided a good description of the observed differences in the weight matrices, hardly anyone explained why those differences arose with reference to patterns of overlap in the training set. Few people specifically noted that the non-orthogonality of a and c was the root problem. Many people struggled with the concept of linear independence, and failed to note that the Delta rule ends up strongly weighting the few units that differentiate otherwise-overlapping patterns.

1D [10 pts.] Why was training with the intermediate learning rate more effective than either the higher or lower rate?

Example answer:

Noise is added to both the inputs ai and the targets tj , so the weight changes themselves (based on the difference between these values) are noisy and may even change sign as a result of the noise. Consequently, the weight changes do not always reduce error and may even increase it. If the learning rate is too large, the resulting weight changes can cause performance to jump around rather wildly. When the learning rate is small, temporary increases in error can still occur, but they are fairly small and quickly reversed by subsequent weight changes. If the learning rate is very small, however, the weight changes are too small to improve performance sufficiently given the number of epochs.
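The learning-rate trade-off shows up even in a stripped-down, noise-free case: a single linear unit with one input clamped at 1.0, so the Delta rule reduces to w += 2*e*(t - w) and the error shrinks (or grows) by a factor of |1 - 2e| per step. This sketch omits the noise discussed above and isolates only the too-small-versus-too-large trade-off:

```python
def remaining_error(lrate, epochs=10, target=1.0):
    """Train one weight with the Delta rule (input fixed at 1.0) and
    return the error left after a fixed budget of epochs."""
    w = 0.0
    for _ in range(epochs):
        w += 2.0 * lrate * (target - w)   # error scales by |1 - 2e| each step
    return abs(target - w)

err_small = remaining_error(0.01)   # too small: error barely decreases
err_mid   = remaining_error(0.25)   # intermediate: error is nearly gone
err_large = remaining_error(1.5)    # too large: each step overshoots; error grows
```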

Feedback:

Again, people generally got the idea here.

1E [10 pts.] Why is learning so much slower using sigmoid units than when using linear units? Describe and explain the similarities and differences in the resulting sets of weights.

Example answer:

With linear units, output activations change in a way that is directly proportional to changes in the weights during learning. With sigmoid units, the amount of change in output activation that results from a change in net input gets smaller and smaller as the net input gets larger and larger (either positively or negatively) due to the asymptotic shape of the sigmoid function. Thus, it can take a very long time to accumulate enough weight changes to generate net inputs large enough to produce output activations near 0.0 or 1.0. The weights produced by the two unit functions generally agree in sign, but the weights for sigmoid units are very much larger.
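The flattening slope is easy to check numerically (a minimal sketch, assuming the standard logistic sigmoid 1/(1 + exp(-x))):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def slope(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # derivative of the logistic function

# The learning signal scales with this slope, which collapses at large net input:
for net in (0.0, 2.0, 4.0, 8.0):
    print(net, round(slope(net), 4))
```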

Another difference between the weights is that, with sigmoid units, all the weights are different from 0, whereas with linear units several weights are 0. This is because, in the case of linear units, when the target is 0 in all training examples, the error is always 0 and the weight doesn't change from 0. On the other hand, in the case of a sigmoid unit, its output is never 0; thus, even if the target is 0, there is a weight change (so long as in any of the training patterns the input unit is on at least once, which in this example is always true).

Feedback:

People did well on this question, but some forgot to comment specifically on the differences in the weights generated by the two activation functions, whereas others noted that there were few zero weights but did not notice how much larger the weights become with the sigmoid activation function.

1F [10 pts.] Why does learning fail here [with the “imposs” set], even with the Delta rule? Try to explain the pattern of weights that are produced.

Example answer:

The Delta rule fails because the target values for at least one of the output units are not linearly separable given the eight inputs. To see this, note that inputs 0, 1, 4 and 5 are identical across the four training patterns. Moreover, input 2 is just the negative of input 3 and input 7 is the negative of input 8, so these don't provide any additional information. So effectively there are two unique input values, creating four input patterns (-1 -1), (-1 1), (1 -1) and (1 1). The targets differ only in positions 2, 3, 4 and 5, but targets 2 and 3 are the same and targets 4 and 5 are the same, so again there are only two unique values, and each of these forms an exclusive-or (XOR) with the input values. As a result, all of the weights between inputs 2, 3, 7, 8 and outputs 2, 3, 4, 5 end up being zero (because the weight changes caused by two of the patterns are exactly canceled by the weight changes caused by the other two). The only weights that build up are those that generate the constant outputs from the constant inputs.
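The cancellation can be reproduced with a minimal two-input version of the problem (a sketch, not the actual "imposs" set): one linear output unit whose target is the XOR of two +/-1 inputs.

```python
lrate = 0.0625
# Target is 1 when the two inputs differ, 0 when they agree (XOR).
patterns = [((-1, -1), 0.0), ((-1, 1), 1.0), ((1, -1), 1.0), ((1, 1), 0.0)]
w = [0.0, 0.0]

for epoch in range(50):
    dw = [0.0, 0.0]
    for (x1, x2), t in patterns:          # batch Delta rule
        err = t - (w[0] * x1 + w[1] * x2)
        dw[0] += 2.0 * lrate * err * x1
        dw[1] += 2.0 * lrate * err * x2
    w[0] += dw[0]
    w[1] += dw[1]

# The changes from (-1, 1) and (1, -1) exactly cancel those from
# (-1, -1) and (1, 1), so the weights never leave zero.
print(w)  # [0.0, 0.0]
```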

Feedback:

Many people struggled with this question. Most people noticed that some output units have constant values across patterns, and that the weights into these units are the only ones that build up. However, few people explained precisely why this is the case. Some people confused linear separability with linear dependence among the input patterns. And, very few people were able to trace the problem to the particular input-output relationships highlighted in the example answer above.

Part 2: Your own delta-rule net [50 points]

Feedback:

People did a good job of explaining what the units in their models were intended to correspond to; but few people indicated why the learning problem they explored was interesting from a computational perspective. (People were happy to explain that their patterns were interesting because they referred to some domain of intrinsic interest, like beer, sports, or girlfriends, but the point of the assignment was really to show some knowledge of the computational issues behind delta-rule learning.) For your project, be sure to explain what led you to choose a particular problem, and why your patterns were constructed in a particular way. This explanation should pertain to some computational or psychological issue.

Also, although most people discovered some problems with learning their patterns, many people had difficulty clearly explaining why these problems occurred. Perhaps relatedly, although everyone always provided some examples of model behavior, some people did not really provide very comprehensive evidence. For instance, many people would say "The model did not generalize very well," and would proceed to report the error on a single generalization item. A better approach would be to show model outputs, targets, and error for all generalization patterns, indicate which items/outputs were hard to learn and which easy, and explain why. Similarly for the "time course of learning" question, people often contrasted just two time points and reported on a few unit activations at these points. A better approach would be to save all unit activations every n epochs or so, read these into Excel or another data-analysis program, and plot the activation of the units over time.

Finally, people are struggling a bit to explain why the sigmoid and linear units behave differently. People generally understand that learning slows when sigmoid units approach their extremes whereas this is not true of linear units. But, there are other important differences to consider:

i) For linear units, it is possible to "overshoot" targets in either direction. This means that the sign of the derivative on the weights can flip-flop between negative and positive. For sigmoid units it is impossible to overshoot.

ii) For linear units, a net input of zero leads to an activation of zero. Weights leaving a unit with zero activation will not change---so the only weights that will change will be those leaving units with non-zero activation.

iii) Note that, even with sigmoid units, it is impossible for a 2-layer network to learn a non-linearly separable mapping!
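Point (ii) is easy to demonstrate with a two-input linear unit (a sketch with made-up values, using the update dw_i = e*(t - out)*a_i):

```python
lrate = 0.5
a = [0.0, 1.0]     # input 0 is silent, input 1 is active
t = 1.0
w = [0.3, 0.0]     # arbitrary illustrative starting weights

for _ in range(20):
    out = sum(wi * ai for wi, ai in zip(w, a))
    for i in range(2):
        w[i] += lrate * (t - out) * a[i]   # a_i = 0 freezes that weight

# w[0] never moves from 0.3; w[1] converges toward the target.
```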

Below is an example answer written by David Plaut (who designed the original version of this course) that may help in understanding the level of detail and kind of problem you might tackle. The key take-home points from this example are: i) the clear explanation of the rationale for selecting the patterns, and ii) the presentation of evidence supporting the conclusions about how the different versions of the model behave. It is not necessary that your answer bear directly on some important question in psychology, but you should clearly explain what about your patterns make them *computationally* interesting.

2A [5 pts.] Hand in a table displaying the set of patterns you have constructed, and explain what led you to design them the way that you did.

For illustration purposes, I decided to investigate the “Rule of 78” problem. This problem involves input patterns which each have three bits on: one among (1 2 3), one among (4 5 6) and one among (7 8). This defines 18 possible inputs (3 x 3 x 2). For each of these, the correct output pattern also has three bits on: the corresponding bit from (1 2 3) and from (4 5 6), and the opposite bit from (7 8). Thus 247 -> 248. There is one exception pattern: 147 -> 147. This is supposed to be analogous (at a very small scale) to forming the past tense of verbs: for most verbs, you introduce a slight modification at the end by adding “-ED” (e.g., FIT-> FITTED) but for some you do nothing (e.g., HIT->HIT). This problem is interesting because the same set of weights must learn to handle this exception while also encoding the “rule” sufficiently well to support generalization (nb. when we consider language processing later in the course, we will see a number of arguments that rules and exceptions must be handled separately).
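The mapping is mechanical enough to generate in a few lines (a sketch; the bit numbering 1-8 follows the description above, and the function name is made up):

```python
import itertools

def rule_of_78():
    """Build the 18 input -> output pairs, then overwrite the one exception."""
    patterns = {}
    for a, b, c in itertools.product((1, 2, 3), (4, 5, 6), (7, 8)):
        flipped = 8 if c == 7 else 7        # the "rule": swap the (7 8) bit
        patterns[(a, b, c)] = (a, b, flipped)
    patterns[(1, 4, 7)] = (1, 4, 7)         # the exception: 147 -> 147
    return patterns

pats = rule_of_78()
print(len(pats), pats[(2, 4, 7)], pats[(1, 4, 7)])  # 18 (2, 4, 8) (1, 4, 7)
```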

Table 1 shows my training and testing patterns. I trained on the 6 patterns on the left (including the exception 147) and tested for generalization to the remaining patterns shown on the right.

2B [10 pts.] Examine how well each network does in learning the set of training pairs using each unit activation function, and explain its successes and failures. Are there any differences between the two functions in how well the patterns can be learned?

Figure 1 shows the total error for 20 epochs of training with either linear units or with sigmoid units, using a learning rate of 0.5. Learning with linear units is much faster than with sigmoid units, because the latter require much larger net inputs to achieve activations near 0 or 1 due to the asymptotic nature of the sigmoid function. This can be seen clearly by examining the weights that result from learning in the two cases (Table 2 here); note that training was continued for a total of 100 epochs when using sigmoid units. [This simulation was run on an older simulator in which the weight displays are text-based and formatted differently than in Lens. In these displays, input values are displayed along the bottom and output values and targets are displayed on the right-hand side. Activations and weight values are plotted as integers out of 100 (e.g., an activation of 98 is really 0.98, and a weight of -368 is really -3.68).]

The solution in the linear case is very clear and elegant. Each set of mutually exclusive bits (1 2 3) and (4 5 6) forms a block where each bit in the input supports itself in the output and inhibits its competitors. The same is broadly true for the (7 8) group, except the competition is reversed (since generally 7 -> 8 and 8 -> 7). Note, however, that for these last output bits, the network listens strongly to 1 and 4 (weights of 71 and -42), which are collectively sufficient to override the 7 -> 8 weight of -42 and to generate an output of 8. Also note that weights to the (7 8) group from other inputs have to compensate for these weights, in case only one of 1 or 4 is present (e.g., 167), to override the override, so to speak.

The solution in the sigmoid case has the same basic flavor but is a bit less clear, and involves much larger weights.

2C [20 pts.] Choosing one of the two activation functions, examine the time course of learning. Try to identify what aspects of your patterns the network learns rapidly and what aspects it learns less rapidly. Describe what you observe and try to explain why it happens.

I chose sigmoid units because they generalize better (see Tables 3a and b). To illustrate the time-course of learning, Figure 2 shows the error for three informative patterns over the course of training: the exception (147), a similar “regular” item (167), and a regular item with no similarity to the exception (258). Not surprisingly, the exception produces greater error and takes much longer to learn than the regular items. What is interesting, though, is that, even among the regulars, those that are similar to the exception are learned somewhat slower than the rest. This effect arises because the compensatory weights mentioned above—overriding the override—take a while to develop and are not completely successful. The contaminating influence of exceptions on similar regular items is also found in empirical data from language tasks such as forming the past tense of verbs or reading words aloud.

2D [15 pts.] Now, for both the linear and sigmoid network, consider how well the trained network generalizes to the two patterns that you set aside. Report what happens when you test with these, and explain the results.

Table 3a shows the outputs generated by the network after 20 epochs using linear units to the 12 patterns withheld from training. In general, performance is not particularly good: the total error across these patterns is 14.5. Predictably, the particular examples that cause the most difficulty (148, 158, 267, 348; shown in bold) are those that are somewhat similar to the exception 147. In fact, the one that is most similar, 148, is also the one that produced the most error by far, 4.5. Thus, with linear units, the network more directly reflects the similarity of items with the training corpus. Also, not surprisingly, the output values that produce the most error are the 7 and 8 positions.

Table 3b shows the equivalent data after 100 epochs of training using sigmoid units. Here, generalization is much better: the total error is only 0.98. In fact, all output units are on the correct side of 0.5 for all patterns. Again, the patterns that produce the most error (157, 158, 347, 348) are similar to 147, although these are not exactly the same patterns as were difficult for linear units. In particular, it is not always the items that are most similar to 147 that produce the most error; the nonlinearity of the sigmoid activation function allows the network to ignore certain types of similarity (e.g., with 148) when it is beneficial to do so. Also, it is not only positions 7 and 8 that cause problems; sometimes positions 1 or 4 are the most problematic. There are two major conclusions to draw from these results:

  1. Both regular and exception items can be processed successfully within a single system; moreover, the pattern of contamination from the exceptions to the regulars matches (qualitatively) the corresponding empirical findings.

  2. Despite having learned an exceptional item, the system can nonetheless apply the “rule” successfully in generalizing to novel inputs (e.g., the past tense of DIT is DITTED rather than DIT; MAVE is pronounced to rhyme with GAVE rather than HAVE). This is true even when the system has only been trained on a relatively small proportion of the regular items (5 of 17 in the current simulation). Again, there is some degree of contamination for items which are sensitive to the exception, but this contamination is not so strong as to prevent correct responding (at least for sigmoid units), but might be expected to make responses a bit slower in these cases. This also seems to be a property of human performance in these domains.