Neural history compressor
The vanishing gradient problem of automatic differentiation or backpropagation in neural networks was partially overcome in 1992 by an early generative model called the neural history compressor, implemented as an unsupervised stack of recurrent neural networks (RNNs). The RNN at the input level learns to predict its next input from the previous input history. Only unpredictable inputs of some RNN in the hierarchy become inputs to the next higher level RNN which therefore recomputes its internal state only rarely. Each higher level RNN thus learns a compressed representation of the information in the RNN below. This is done such that the input sequence can be precisely reconstructed from the sequence representation at the highest level. The system effectively minimises the description length or the negative logarithm of the probability of the data. If there is a lot of learnable predictability in the incoming data sequence, then the highest level RNN can use supervised learning to easily classify even deep sequences with very long time intervals between important events. In 1993, such a system already solved a "Very Deep Learning" task that requires more than 1000 subsequent layers in an RNN unfolded in time.
It is also possible to distill the entire RNN hierarchy into only two RNNs called the "conscious" chunker (higher level) and the "subconscious" automatizer (lower level). Once the chunker has learned to predict and compress inputs that are still unpredictable by the automatizer, then the automatizer can be forced in the next learning phase to predict or imitate through special additional units the hidden units of the more slowly changing chunker. This makes it easy for the automatizer to learn appropriate, rarely changing memories across very long time intervals. This in turn helps the automatizer to make many of its once unpredictable inputs predictable, such that the chunker can focus on the remaining still unpredictable events, to compress the data even further.