| Artificial neural network | | | | The Hopfield network is a recurrent neural |
| An artificial neural network (ANN), usually called | | | | network in which all connections are symmetric. |
| "neural network" (NN), is a mathematical model or | | | | Invented by John Hopfield in 1982, this network |
| computational model that tries to simulate the | | | | guarantees that its dynamics will converge. If the |
| structure and/or functional aspects of biological | | | | connections are trained using Hebbian learning then |
| neural networks. It consists of an interconnected | | | | the Hopfield network can perform as robust |
| group of artificial neurons and processes | | | | content-addressable (or associative) memory, |
| information using a connectionist approach to | | | | resistant to connection alteration. |
| computation. In most cases an ANN is an adaptive | | | | Echo state network |
| system that changes its structure based on | | | | The echo state network (ESN) is a recurrent |
| external or internal information that flows through | | | | neural network with a sparsely connected random |
| the network during the learning phase. Neural | | | | hidden layer. The weights of output neurons are |
| networks are non-linear statistical data modeling | | | | the only part of the network that can change and |
| tools. They can be used to model complex | | | | be learned. ESNs are good at (re)producing temporal patterns. |
| relationships between inputs and outputs or to find | | | | Long short term memory network |
| patterns in data. | | | | The Long short term memory is an artificial neural |
| Background | | | | net structure that unlike traditional RNNs doesn't |
| There is no precise agreed-upon definition among | | | | have the problem of vanishing gradients. It can |
| researchers as to what a neural network is, but | | | | therefore use long delays and can handle signals |
| most would agree that it involves a network of | | | | that have a mix of low and high frequency |
| simple processing elements (neurons), which can | | | | components. |
| exhibit complex global behavior, determined by | | | | Stochastic neural networks |
| the connections between the processing elements | | | | A stochastic neural network differs from a typical |
| and element parameters. The original inspiration | | | | neural network because it introduces random |
| for the technique came from examination of the | | | | variations into the network. In a probabilistic view |
| central nervous system and the neurons (and | | | | of neural networks, such random variations can |
| their axons, dendrites and synapses) which | | | | be viewed as a form of statistical sampling, such as Monte Carlo sampling. |
| constitute one of its most significant information | | | | Boltzmann machine |
| processing elements (see Neuroscience). In a | | | | The Boltzmann machine can be thought of as a |
| neural network model, simple nodes (called | | | | noisy Hopfield network. Invented by Geoff Hinton |
| variously "neurons", "neurodes", "PEs" ("processing | | | | and Terry Sejnowski in 1985, the Boltzmann |
| elements") or "units") are connected together to | | | | machine is important because it is one of the first |
| form a network of nodes — hence the term | | | | neural networks to demonstrate learning of latent |
| "neural network." While a neural network does not | | | | variables (hidden units). Boltzmann machine learning |
| have to be adaptive per se, its practical use | | | | was at first slow to simulate, but the contrastive |
| comes with algorithms designed to alter the | | | | divergence algorithm of Geoff Hinton (circa 2000) |
| strength (weights) of the connections in the | | | | allows models such as Boltzmann machines and |
| network to produce a desired signal flow. | | | | products of experts to be trained much faster. |
| These networks are also similar to the biological | | | | Modular neural networks |
| neural networks in the sense that functions are | | | | Biological studies have shown that the human |
| performed collectively and in parallel by the units, | | | | brain functions not as a single massive network, |
| rather than there being a clear delineation of | | | | but as a collection of small networks. This |
| subtasks to which various units are assigned (see | | | | realization gave birth to the concept of modular |
| also connectionism). Currently, the term Artificial | | | | neural networks, in which several small networks |
| Neural Network (ANN) tends to refer mostly to | | | | cooperate or compete to solve problems. |
| neural network models employed in statistics, | | | | Committee of machines |
| cognitive psychology and artificial intelligence. | | | | A committee of machines (CoM) is a collection of |
| Neural network models designed with emulation of | | | | different neural networks that together "vote" on |
| the central nervous system (CNS) in | | | | a given example. This generally gives a much |
| mind are a subject of theoretical neuroscience | | | | better result compared to other neural network |
| (computational neuroscience). | | | | models. Because neural networks suffer from |
| In modern software implementations of artificial | | | | local minima, starting with the same architecture |
| neural networks the approach inspired by biology | | | | and training but using different initial random |
| has for the most part been abandoned for a | | | | weights often gives vastly different networks. A |
| more practical approach based on statistics and | | | | CoM tends to stabilize the result. |
| signal processing. In some of these systems, | | | | The CoM is similar to the general machine learning |
| neural networks or parts of neural networks | | | | bagging method, except that the necessary |
| (such as artificial neurons) are used as | | | | variety of machines in the committee is obtained |
| components in larger systems that combine both | | | | by training from different random starting weights |
| adaptive and non-adaptive elements. While the | | | | rather than training on different randomly selected |
| more general approach of such adaptive systems | | | | subsets of the training data. |
| is more suitable for real-world problem solving, it | | | | Associative neural network (ASNN) |
| has far less to do with the traditional artificial | | | | The ASNN is an extension of the committee of |
| intelligence connectionist models. What they do | | | | machines that goes beyond a simple/weighted |
| have in common, however, is the principle of | | | | average of different models. ASNN represents a |
| non-linear, distributed, parallel and local processing | | | | combination of an ensemble of feed-forward |
| and adaptation. | | | | neural networks and the k-nearest neighbor |
| Models | | | | technique (kNN). It uses the correlation between |
| Neural network models in artificial intelligence are | | | | ensemble responses as a measure of distance |
| usually referred to as artificial neural networks | | | | between the analyzed cases for the kNN. This |
| (ANNs); these are essentially simple mathematical | | | | corrects the bias of the neural network ensemble. |
| models defining a function. Each type of ANN | | | | An associative neural network has a memory |
| model corresponds to a class of such functions. | | | | that can coincide with the training set. If new data |
| Employing artificial neural networks | | | | become available, the network instantly improves |
| Perhaps the greatest advantage of ANNs is their | | | | its predictive ability and provides data |
| ability to be used as an arbitrary function | | | | approximation (self-learn the data) without a need |
| approximation mechanism which 'learns' from | | | | to retrain the ensemble. Another important |
| observed data. However, using them is not so | | | | feature of ASNN is the possibility to interpret |
| straightforward and a relatively good | | | | neural network results by analysis of correlations |
| understanding of the underlying theory is essential. | | | | between data cases in the space of models. The |
| - Choice of model: This will depend on the data | | | | method is demonstrated online, where you can either |
| representation and the application. Overly complex | | | | use it or download it. |
| models tend to lead to problems with learning. | | | | Other types of networks |
| - Learning algorithm: There are numerous | | | | These special networks do not fit in any of the |
| tradeoffs between learning algorithms. Almost any | | | | previous categories. |
| algorithm will work well with the correct | | | | Holographic associative memory |
| hyperparameters for training on a particular fixed | | | | Holographic associative memory represents a |
| dataset. However selecting and tuning an | | | | family of analog, correlation-based, associative, |
| algorithm for training on unseen data requires a | | | | stimulus-response memories, where information is |
| significant amount of experimentation. | | | | mapped onto the phase orientation of complex |
| - Robustness: If the model, cost function and | | | | numbers operating. |
| learning algorithm are selected appropriately the | | | | Instantaneously trained networks |
| resulting ANN can be extremely robust. | | | | Instantaneously trained neural networks (ITNNs) |
| With the correct implementation ANNs can be | | | | were inspired by the phenomenon of short-term |
| used naturally in online learning and large dataset | | | | learning that seems to occur instantaneously. In |
| applications. Their simple implementation and the | | | | these networks the weights of the hidden and |
| existence of mostly local dependencies exhibited | | | | the output layers are mapped directly from the |
| in the structure allow for fast, parallel | | | | training vector data. Ordinarily, they work on |
| implementations in hardware. | | | | binary data, but versions for continuous data that |
| Applications | | | | require small additional processing are also available. |
| The utility of artificial neural network models lies in | | | | Spiking neural networks |
| the fact that they can be used to infer a function | | | | Spiking neural networks (SNNs) are models which |
| from observations. This is particularly useful in | | | | explicitly take into account the timing of inputs. |
| applications where the complexity of the data or | | | | The network input and output are usually |
| task makes the design of such a function by hand | | | | represented as series of spikes (delta function or |
| impractical. | | | | more complex shapes). SNNs have an advantage |
| Real life applications | | | | of being able to process information in the time |
| The tasks to which artificial neural networks are | | | | domain (signals that vary over time). They are |
| applied tend to fall within the following broad | | | | often implemented as recurrent networks. SNNs |
| categories: | | | | are also a form of pulse computer. |
| - Function approximation, or regression analysis, | | | | Spiking neural networks with axonal conduction |
| including time series prediction, fitness | | | | delays exhibit polychronization, and hence could |
| approximation and modeling. | | | | have a very large memory capacity. |
| - Classification, including pattern and sequence | | | | Networks of spiking neurons — and the |
| recognition, novelty detection and sequential | | | | temporal correlations of neural assemblies in such |
| decision making. | | | | networks — have been used to model figure |
| - Data processing, including filtering, clustering, blind | | | | ground separation and region linking in the visual |
| source separation and compression. | | | | system (see, for example, Reitboeck et al. in |
| - Robotics, including directing manipulators, | | | | Haken and Stadler: Synergetics of the Brain. Berlin, |
| computer numerical control. | | | | 1989). |
| Application areas include system identification and | | | | In June 2005 IBM announced construction of a |
| control (vehicle control, process control), quantum | | | | Blue Gene supercomputer dedicated to the |
| chemistry, game-playing and decision making | | | | simulation of a large recurrent spiking neural |
| (backgammon, chess, racing), pattern recognition | | | | network. |
| (radar systems, face identification, object | | | | Gerstner and Kistler have a freely available online |
| recognition and more), sequence recognition | | | | textbook on Spiking Neuron Models. |
| (gesture, speech, handwritten text recognition), | | | | Dynamic neural networks |
| medical diagnosis, financial applications (automated | | | | Dynamic neural networks not only deal with |
| trading systems), data mining (or knowledge | | | | nonlinear multivariate behaviour, but also include |
| discovery in databases, "KDD"), visualization and | | | | (learning of) time-dependent behaviour such as |
| e-mail spam filtering. | | | | various transient phenomena and delay effects. |
| Neural network software | | | | Cascading neural networks |
| Neural network software is used to simulate, | | | | Cascade-Correlation is an architecture and |
| research, develop and apply artificial neural | | | | supervised learning algorithm developed by Scott |
| networks, biological neural networks and in some | | | | Fahlman and Christian Lebiere. Instead of just |
| cases a wider array of adaptive systems. See | | | | adjusting the weights in a network of fixed |
| also logistic regression. | | | | topology, Cascade-Correlation begins with a |
| | | | minimal network, then automatically trains and |
| Types of neural networks | | | | adds new hidden units one by one, creating a |
| Feedforward neural network | | | | multi-layer structure. Once a new hidden unit has |
| The feedforward neural network was the first | | | | been added to the network, its input-side weights |
| and arguably simplest type of artificial neural | | | | are frozen. This unit then becomes a permanent |
| network devised. In this network, the information | | | | feature-detector in the network, available for |
| moves in only one direction, forward, from the | | | | producing outputs or for creating other, more |
| input nodes, through the hidden nodes (if any) and | | | | complex feature detectors. The |
| to the output nodes. There are no cycles or loops | | | | Cascade-Correlation architecture has several |
| in the network. | | | | advantages over existing algorithms: it learns very |
| Radial basis function (RBF) network | | | | quickly, the network determines its own size and |
| Radial Basis Functions are powerful techniques for | | | | topology, it retains the structures it has built even |
| interpolation in multidimensional space. An RBF is a | | | | topology, it retains the structures it has built even |
| function which has a built-in distance criterion | | | | if the training set changes, and it requires no |
| with respect to a center. Radial basis functions | | | | connections of the network. See: Cascade correlation algorithm. |
| have been applied in the area of neural networks | | | | Neuro-fuzzy networks |
| where they may be used as a replacement for | | | | A neuro-fuzzy network is a fuzzy inference |
| the sigmoidal hidden layer transfer characteristic in | | | | system in the body of an artificial neural network. |
| Multi-Layer Perceptrons. RBF networks have two | | | | Depending on the FIS type, there are several |
| layers of processing: In the first, input is mapped | | | | layers that simulate the processes involved in a |
| onto each RBF in the 'hidden' layer. The RBF | | | | fuzzy inference like fuzzification, inference, |
| chosen is usually a Gaussian. In regression | | | | aggregation and defuzzification. Embedding an FIS |
| problems the output layer is then a linear | | | | in a general structure of an ANN has the benefit |
| combination of hidden layer values representing | | | | of using available ANN training methods to find the |
| mean predicted output. The interpretation of this | | | | parameters of a fuzzy system. |
| output layer value is the same as a regression | | | | Compositional pattern-producing networks |
| model in statistics. In classification problems the | | | | Compositional pattern-producing networks (CPPNs) |
| output layer is typically a sigmoid function of a | | | | are a variation of ANNs which differ in their set of |
| linear combination of hidden layer values, | | | | activation functions and how they are applied. |
| representing a posterior probability. Performance | | | | While typical ANNs often contain only sigmoid |
| in both cases is often improved by shrinkage | | | | functions (and sometimes Gaussian functions), |
| techniques, known as ridge regression in classical | | | | CPPNs can include both types of functions and |
| statistics and known to correspond to a prior | | | | many others. Furthermore, unlike typical ANNs, |
| belief in small parameter values (and therefore | | | | CPPNs are applied across the entire space of |
| smooth output functions) in a Bayesian | | | | possible inputs so that they can represent a |
| framework. | | | | complete image. Since they are compositions of |
| RBF networks have the advantage of not | | | | functions, CPPNs in effect encode images at |
| suffering from local minima in the same way as | | | | infinite resolution and can be sampled for a |
| Multi-Layer Perceptrons. This is because the only | | | | particular display at whatever resolution is optimal. |
| parameters that are adjusted in the learning | | | | One-shot associative memory |
| process are the linear mapping from hidden layer | | | | This type of network can add new patterns |
| to output layer. Linearity ensures that the error | | | | without the need for re-training. It is done by |
| surface is quadratic and therefore has a single | | | | creating a specific memory structure, which |
| easily found minimum. In regression problems this | | | | assigns each new pattern to an orthogonal plane |
| can be found in one matrix operation. In | | | | using adjacently connected hierarchical arrays. The |
| classification problems the fixed non-linearity | | | | network offers real-time pattern recognition and |
| introduced by the sigmoid output function is most | | | | high scalability; however, it requires parallel |
| efficiently dealt with using iteratively re-weighted | | | | processing and is thus best suited for platforms |
| least squares. | | | | such as wireless sensor networks (WSN), grid |
| RBF networks have the disadvantage of requiring | | | | computing, and GPGPUs. |
| good coverage of the input space by radial basis | | | | Theoretical properties |
| functions. RBF centres are determined with | | | | Computational power |
| reference to the distribution of the input data, but | | | | The multi-layer perceptron (MLP) is a universal |
| without reference to the prediction task. As a | | | | function approximator, as proven by the Cybenko |
| result, representational resources may be wasted | | | | theorem. However, the proof is not constructive |
| on areas of the input space that are irrelevant to | | | | regarding the number of neurons required or the |
| the learning task. A common solution is to | | | | settings of the weights. |
| associate each data point with its own centre, | | | | Work by Hava Siegelmann and Eduardo D. Sontag |
| although this can make the linear system to be | | | | has provided a proof that a specific recurrent |
| solved in the final layer rather large, and requires | | | | architecture with rational valued weights (as |
| shrinkage techniques to avoid overfitting. | | | | opposed to the commonly used floating point |
| Associating each input datum with an RBF leads | | | | approximations) has the full power of a Universal |
| naturally to kernel methods such as Support | | | | Turing Machine using a finite number of neurons |
| Vector Machines and Gaussian Processes (the | | | | and standard linear connections. They have |
| RBF is the kernel function). All three approaches | | | | further shown that the use of irrational values for |
| use a non-linear kernel function to project the | | | | weights results in a machine with super-Turing |
| input data into a space where the learning | | | | power. |
| problem can be solved using a linear model. Like | | | | Capacity |
| Gaussian Processes, and unlike SVMs, RBF | | | | Artificial neural network models have a property |
| networks are typically trained in a Maximum | | | | called 'capacity', which roughly corresponds to their |
| Likelihood framework by maximizing the | | | | ability to model any given function. It is related to |
| probability (minimizing the error) of the data under | | | | the amount of information that can be stored in |
| the model. SVMs take a different approach to | | | | the network and to the notion of complexity. |
| avoiding overfitting by maximizing instead a | | | | Convergence |
| margin. RBF networks are outperformed in most | | | | Nothing can be said in general about convergence |
| classification applications by SVMs. In regression | | | | since it depends on a number of factors. Firstly, |
| applications they can be competitive when the | | | | there may exist many local minima. This depends |
| dimensionality of the input space is relatively small. | | | | on the cost function and the model. Secondly, the |
| Kohonen self-organizing network | | | | optimization method used might not be |
| The self-organizing map (SOM) invented by Teuvo | | | | guaranteed to converge when far away from a |
| Kohonen performs a form of unsupervised | | | | local minimum. Thirdly, for a very large amount of |
| learning. A set of artificial neurons learns to map | | | | data or parameters, some methods become |
| points in an input space to coordinates in an | | | | impractical. In general, it has been found that |
| output space. The input space can have different | | | | theoretical guarantees regarding convergence are |
| dimensions and topology from the output space, | | | | an unreliable guide to practical application. |
| and the SOM will attempt to preserve these. | | | | Generalisation and statistics |
| Recurrent network | | | | In applications where the goal is to create a |
| Contrary to feedforward networks, recurrent | | | | system that generalises well to unseen examples, |
| neural networks (RNs) are models with | | | | the problem of overtraining has emerged. This |
| bi-directional data flow. While a feedforward | | | | arises in overcomplex or overspecified systems |
| network propagates data linearly from input to | | | | when the capacity of the network significantly |
| output, RNs also propagate data from later | | | | exceeds the needed free parameters. There are |
| processing stages to earlier stages. | | | | two schools of thought for avoiding this problem: |
| Simple recurrent network | | | | The first is to use cross-validation and similar |
| A simple recurrent network (SRN) is a variation | | | | techniques to check for the presence of |
| on the Multi-Layer Perceptron, sometimes called | | | | overtraining and optimally select hyperparameters |
| an "Elman network" due to its invention by Jeff | | | | so as to minimize the generalisation error. The |
| Elman. A three-layer network is used, with the | | | | second is to use some form of regularisation. This |
| addition of a set of "context units" in the input | | | | is a concept that emerges naturally in a |
| layer. There are connections from the middle | | | | probabilistic (Bayesian) framework, where the |
| (hidden) layer to these context units fixed with a | | | | regularisation can be performed by selecting a |
| weight of one. At each time step, the input is | | | | larger prior probability over simpler models; but |
| propagated in a standard feed-forward fashion, | | | | also in statistical learning theory, where the goal is |
| and then a learning rule (usually back-propagation) | | | | to minimize over two quantities: the 'empirical risk' |
| is applied. The fixed back connections result in the | | | | and the 'structural risk', which roughly correspond |
| context units always maintaining a copy of the | | | | to the error over the training set and the |
| previous values of the hidden units (since they | | | | predicted error in unseen data due to overfitting. |
| propagate over the connections before the | | | | Confidence analysis of a neural network |
| learning rule is applied). Thus the network can | | | | Supervised neural networks that use an MSE cost |
| maintain a sort of state, allowing it to perform | | | | function can use formal statistical methods to |
| such tasks as sequence-prediction that are | | | | determine the confidence of the trained model. |
| beyond the power of a standard Multi-Layer | | | | The MSE on a validation set can be used as an |
| Perceptron. | | | | estimate for variance. This value can then be |
| In a fully recurrent network, every neuron | | | | used to calculate the confidence interval of the |
| receives inputs from every other neuron in the | | | | output of the network, assuming a normal |
| network. These networks are not arranged in | | | | distribution. A confidence analysis made this way |
| layers. Usually only a subset of the neurons | | | | is statistically valid as long as the output probability |
| receive external inputs in addition to the inputs | | | | distribution stays the same and the network is |
| from all the other neurons, and another disjoint | | | | not modified. |
| subset of neurons report their output externally | | | | By assigning a softmax activation function on the |
| as well as sending it to all the neurons. These | | | | output layer of the neural network (or a softmax |
| distinctive inputs and outputs perform the function | | | | component in a component-based neural network) |
| of the input and output layers of a feed-forward | | | | for categorical target variables, the outputs can |
| or simple recurrent network, and also join all the | | | | be interpreted as posterior probabilities. This is |
| other neurons in the recurrent processing. | | | | very useful in classification as it gives a certainty |
| Hopfield network | | | | measure on classifications. |
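
The two ideas in the section "Confidence analysis of a neural network" above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation. The function names confidence_interval and softmax, the residual values, the raw activations and the multiplier 1.96 (a 95% interval under the normal-distribution assumption) are all chosen here for illustration and do not come from the text.

```python
import math


def confidence_interval(prediction, validation_errors, z=1.96):
    """Interval around one network output.

    The MSE on a validation set is used as the variance estimate, and the
    errors are assumed to be normally distributed (z=1.96 gives about 95%).
    """
    mse = sum(e * e for e in validation_errors) / len(validation_errors)
    half_width = z * math.sqrt(mse)
    return prediction - half_width, prediction + half_width


def softmax(activations):
    """Map raw output-layer activations to positive values that sum to 1."""
    peak = max(activations)  # shift by the maximum for numerical stability
    exps = [math.exp(a - peak) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical validation residuals (target minus network output).
residuals = [0.1, -0.2, 0.05, 0.15, -0.1]
low, high = confidence_interval(2.0, residuals)

# Hypothetical raw activations for a three-class problem.
probs = softmax([2.0, 1.0, 0.1])

print(f"interval: [{low:.3f}, {high:.3f}]")
print("class probabilities:", [round(p, 3) for p in probs])
```

Subtracting the largest activation before exponentiating leaves the softmax result unchanged but avoids overflow for large inputs. As the text notes, the interval is only meaningful while the output distribution stays the same and the network is not modified.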