The Nature of Intelligence

Barco Jie You

Abstract

The human brain is the substrate of human intelligence. By simulating the human brain, artificial intelligence builds computational models that can learn and perform intelligent tasks at a level approaching that of humans. Deep neural networks use multiple computation layers to learn representations of data and have improved the state of the art in many recognition domains. However, the essence of intelligence common to both humans and AI remains unknown. Here, we show that the nature of intelligence is a series of mathematically functional processes that minimize system entropy by establishing functional relationships between datasets across space and time. Humans and AI have achieved intelligence by implementing these entropy-reducing processes in a reinforced manner that consumes energy. With this hypothesis, we establish mathematical models of language, unconsciousness and consciousness, predicting the evidence to be found by neuroscience and achieved by AI engineering. Furthermore, we conclude that the total entropy of the universe is conserved, and that intelligence counters spontaneous processes, decreasing entropy by physically or informationally connecting datasets that originally exist in the universe but are separated across space and time. This essay should be a starting point for a deeper understanding of the universe and of ourselves as human beings, and for achieving sophisticated AI models that equal or even surpass human intelligence. Finally, this essay argues that intelligence more advanced than that of humans should exist, provided that it reduces entropy in a more energy-efficient way.

Paper Structure

This paper contains 15 sections, 21 equations, and 15 figures.

Figures (15)

  • Figure 1: Neural network for deep learning. A multilayer neural network shown as connected nodes (circles), with the input layer taking in the scalar elements of one random variable and the output layer emitting the scalar elements of another random variable. The hidden layers, which vary in number, consist of nodes representing modular functions $(\theta \in \boldsymbol{\theta})$ that take inputs from the previous layer and pass outputs to the next layer.
  • Figure 2: Feedforward pathway and backpropagation in a multilayer neural network. A multilayer network is trained on a sample dataset through the feedforward pathway and backpropagation with gradient descent. Sample values of the variable $\boldsymbol{X}$ $\left(\left\{\boldsymbol{x}_{i}\right\} \in \boldsymbol{X}\right)$ are transformed layer by layer in the forward pass into prediction values $\left\{\widehat{\boldsymbol{y}}_{i}\right\}$. The errors between the prediction values and the actual sample values of $\boldsymbol{Y}$ are then calculated, and their gradients with respect to the modular function parameters $\left(\frac{\partial \delta}{\partial \boldsymbol{\theta}}\right)$ are driven towards 0 by backpropagation.
  • Figure 3: The agent-environment interaction in a reinforcement learning process. An agent represented by the function $f(\cdot)$ receives inputs ( $\boldsymbol{X}$ ) from the environment, which is represented by another function $g(\cdot)$ conceived by the agent. $\boldsymbol{X}$ is a composite variable that combines the currently perceived environment state ( $\boldsymbol{S}$ ) and the reward ( $\boldsymbol{R}$ ) given by the environment at the last time step; $f(\cdot)$ transforms it into an action ( $\boldsymbol{a}$ ) performed by the agent upon the environment. Then, $g(\cdot)$ transforms the current state ( $\boldsymbol{s}$ ) and action ( $\boldsymbol{a}$ ) into a new state ( $\boldsymbol{s}^{\prime}$ ) and a reward ( $\boldsymbol{r}$ ). Over time, the agent interacts with the environment continuously or in episodes to minimize the errors between the predictions of $g(\cdot)$ and their targets. In reinforcement learning, $f(\cdot)$ is called the policy, and $g(\cdot)$ is called the value function. The parameters of both functions are optimized by gradient descent over loss functions derived from the divergence between actual returns and estimated values, which is also a value variance reduction process based on the Bellman equation. The gradients flow from the value function $g(\cdot)$ to the policy $f(\cdot)$.
  • Figure 4: Information structure of GANs and gradient flow paths. GANs are composed of two functions, each of which is differentiable with respect to both its inputs and its parameters. The generator is a function $f\left(\cdot ; \boldsymbol{\theta}^{(G)}\right)$ that takes a random variable $\boldsymbol{Z}$ as input and $\boldsymbol{\theta}^{(G)}$ as parameters, while the discriminator is a function $g\left(\cdot ; \boldsymbol{\theta}^{(D)}\right)$ that takes samples $\boldsymbol{X}$ as input and $\boldsymbol{\theta}^{(D)}$ as parameters. Both components have loss functions defined in terms of both sets of parameters, $\delta^{(G)}\left(\boldsymbol{\theta}^{(G)}, \boldsymbol{\theta}^{(D)}\right)$ and $\delta^{(D)}\left(\boldsymbol{\theta}^{(G)}, \boldsymbol{\theta}^{(D)}\right)$, respectively, and each seeks to minimize its own loss by controlling only its own parameters, because it cannot control the other's. The optimization of $\left(\boldsymbol{\theta}^{(G)}, \boldsymbol{\theta}^{(D)}\right)$ aims to reach a Nash equilibrium, obtaining a local minimum of $\delta^{(G)}$ with respect to $\boldsymbol{\theta}^{(G)}$ and a local minimum of $\delta^{(D)}$ with
  • Figure 5: Encoder-decoder architecture of transformers. Transformers are a kind of seq2seq model with two components. The encoder, representing the function $f(\cdot)$, is a neural network with stacked multihead self-attention layers and other layers (feed-forward, normalization, etc.) that transforms input sequences into context vectors. The decoder, representing the function $g(\cdot)$, is a neural network with stacked multihead self-attention layers that receive the preceding words and multihead attention layers that receive outputs from the encoder as well as from previous decoder blocks. Through the attention mechanism, every output of the decoder depends on all words in the input sequence and on all preceding outputs of the decoder. The gradients flow from the decoder to the encoder.
  • ...and 10 more figures
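
The forward pass and backpropagation described in the Figure 2 caption can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's own model: the network shape (one scalar input, one tanh hidden layer, one scalar output), the mean squared error used for $\delta$, the target $y = x^{2}$, and all names (`train`, `hidden`, `lr`, `epochs`, `seed`) are hypothetical choices made for this example.

```python
import math
import random


def train(samples, hidden=4, lr=0.05, epochs=2000, seed=0):
    """Full-batch gradient descent on a 1-input, 1-output tanh network.

    Returns the mean squared error measured at the start of each epoch.
    All hyperparameters are illustrative assumptions.
    """
    rng = random.Random(seed)
    # Parameters theta: hidden-layer weights/biases and output weights/bias.
    w1 = [rng.uniform(-1.0, 1.0) for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [rng.uniform(-1.0, 1.0) for _ in range(hidden)]
    b2 = 0.0
    n = len(samples)
    losses = []
    for _ in range(epochs):
        g_w1 = [0.0] * hidden
        g_b1 = [0.0] * hidden
        g_w2 = [0.0] * hidden
        g_b2 = 0.0
        loss = 0.0
        for x, y in samples:
            # Forward pass: x -> hidden activations h -> prediction y_hat.
            h = [math.tanh(w1[j] * x + b1[j]) for j in range(hidden)]
            y_hat = sum(w2[j] * h[j] for j in range(hidden)) + b2
            err = y_hat - y
            loss += err * err / n
            # Backward pass: chain rule from the error back to each parameter.
            d_out = 2.0 * err / n  # d(loss)/d(y_hat)
            g_b2 += d_out
            for j in range(hidden):
                g_w2[j] += d_out * h[j]
                # tanh'(z) = 1 - tanh(z)^2
                d_pre = d_out * w2[j] * (1.0 - h[j] ** 2)
                g_w1[j] += d_pre * x
                g_b1[j] += d_pre
        losses.append(loss)
        # Gradient descent step: move each parameter against its gradient.
        for j in range(hidden):
            w1[j] -= lr * g_w1[j]
            b1[j] -= lr * g_b1[j]
            w2[j] -= lr * g_w2[j]
        b2 -= lr * g_b2
    return losses


# Sample dataset {(x_i, y_i)} with y = x^2 on [-1, 1] (an assumed toy target).
samples = [(k / 5.0, (k / 5.0) ** 2) for k in range(-5, 6)]
losses = train(samples)
```

Comparing `losses[0]` with `losses[-1]` shows how far the error has fallen as the gradients shrink. The gradients are accumulated over the whole sample set before each update, so every epoch is one deterministic descent step for a given `seed`.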