Globally, data traffic grew from 100 gigabytes (GB) per day in 1992 to 2,000 GB per second by 2007. It reached 46,000 GB per second by 2017 and is estimated to reach 150,700 GB per second by 2022 (Cisco, 2020). This rapid growth is driven by higher-quality video streaming, video streaming for gaming, augmented reality (AR), artificial intelligence (AI) training, autonomous vehicles with streaming cameras, and blockchain applications.

One area of concern relates to the massive increase in Internet data traffic from video-streaming services in recent years (Marks et al., 2020). The term video streaming refers to the provision of video files that are hosted on physical servers separate from individual content users and their devices (TVs, smartphones, PCs, laptops, tablets, etc.). Streaming is a delivery method in which media content is provided continuously to the consumer, who no longer has to download video files to their device. Online video-streaming services cover different usages, including in particular "video on demand" such as films and series (e.g. Netflix, Disney+ or Amazon Prime) and social network uses (e.g. Facebook, Instagram, Twitter, or TikTok). The share of video streaming and downloads in total global consumer Internet traffic is projected to grow from about 72% in 2017 to about 82% by 2022 (Cisco, 2019). This development reflects the nature of online video, which is a very dense medium of information. Further increases in video-streaming traffic are expected as 4K/8K resolution displays become more widespread.

Most online videos rely on a program called a codec to compress or encode the video at the source, transmit it over the Internet to the viewer, and then decompress or decode it for playback. A codec makes multiple decisions for each frame in a video. One of these decisions relates to the bitrate, the number of bits used per unit of playback time. Bitrate is an important factor in how much processing power and bandwidth are required to deliver and store video. It affects everything from a video’s load time to its resolution, buffering, and data usage.
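The link between bitrate and data usage is simple arithmetic, and a minimal sketch may make it concrete. The function name and the example bitrates below are illustrative assumptions, not figures from any particular streaming service.

```python
def stream_size_gb(bitrate_mbps: float, duration_s: float) -> float:
    """Data volume in gigabytes (10^9 bytes) for a stream at a constant average bitrate."""
    bits = bitrate_mbps * 1_000_000 * duration_s  # megabits per second -> total bits
    return bits / 8 / 1_000_000_000               # bits -> bytes -> gigabytes


ONE_HOUR_S = 3600

# Hypothetical average bitrates, chosen only to show how data usage scales.
for label, mbps in [("low", 1.5), ("medium", 5.0), ("high", 15.0)]:
    print(f"{label:>6}: {mbps:4.1f} Mbit/s -> {stream_size_gb(mbps, ONE_HOUR_S):.2f} GB per hour")
```

Because data volume is linear in bitrate, any percentage reduction in bitrate at equal quality translates directly into the same percentage reduction in data transferred and stored.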

With the increase in video traffic during the COVID-19 pandemic and the expected increase in overall Internet traffic in the future, video compression is an increasingly important problem. Decades of work have gone into optimizing codecs. However, since reinforcement learning is particularly well suited to sequential decision problems such as those a codec faces, MuZero could help optimize this process.

Planning algorithms based on lookahead search have already achieved remarkable success in artificial intelligence. Human world champions have been defeated in classic games such as checkers, chess, Go and poker, and planning algorithms have gained real-world acceptance in applications ranging from logistics to chemical synthesis. However, these planning algorithms all rely on knowledge of the dynamics of the environment, such as the rules of a game or an accurate simulator. This prevents their direct application to real-world domains where the dynamics are typically unknown, such as robotics, intelligent assistants or even video compression.

Model-based reinforcement learning (RL) aims to solve this problem by first learning a model of the dynamics of the environment and then planning given the learned model.
One example is MuZero, DeepMind’s approach to model-based RL, which achieves superhuman performance on precision planning tasks such as chess, shogi, and Go without prior knowledge of the game dynamics. MuZero builds on the powerful search and policy iteration algorithms of its predecessor, AlphaZero, but incorporates a learned model into the training procedure.
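The two-stage idea of model-based RL (learn a model, then plan with it) can be sketched on a toy problem. The example below is an assumption-laden illustration, not MuZero: the five-state line world, the tabular model, and the depth-limited search are all hypothetical stand-ins for the learned neural model and tree search used in practice.

```python
import random

N_STATES = 5        # states 0..4 on a line; state 4 is the goal
ACTIONS = (-1, +1)  # step left or step right


def true_step(state, action):
    """The real environment. The agent may sample it but never reads its rules."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward


def learn_model(num_samples=500, seed=0):
    """Stage 1: learn a tabular model (state, action) -> (next_state, reward) from samples."""
    rng = random.Random(seed)
    model = {}
    for _ in range(num_samples):
        s = rng.randrange(N_STATES)
        a = rng.choice(ACTIONS)
        model[(s, a)] = true_step(s, a)
    return model


def plan(model, state, depth, discount=0.9):
    """Stage 2: depth-limited lookahead using only the learned model.

    Returns (estimated_value, best_action)."""
    if depth == 0:
        return 0.0, None
    best_value, best_action = float("-inf"), None
    for a in ACTIONS:
        if (state, a) not in model:
            continue  # never observed, so the model cannot predict it
        nxt, reward = model[(state, a)]
        future, _ = plan(model, nxt, depth - 1, discount)
        value = reward + discount * future
        if value > best_value:
            best_value, best_action = value, a
    if best_action is None:
        return 0.0, None
    return best_value, best_action


model = learn_model()
value, action = plan(model, state=0, depth=6)
print(f"best first action from state 0: {action:+d}")
```

The planner never calls `true_step`; it consults only the learned table. That separation is what allows planning in domains where the real dynamics are not available as rules or a simulator.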

The MuZero Rate-Controller targets VP9, Google’s open-source codec, since it is widely used by YouTube and other streaming services. By learning the dynamics of video encoding and determining how best to allocate bits, it is able to reduce bitrate without quality degradation. While decades of research and engineering have resulted in efficient algorithms, MuZero automatically learns to make these encoding decisions to obtain the optimal rate-distortion trade-off, and it demonstrated an average bitrate reduction of 4% across a large, diverse set of videos.
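To illustrate what allocating bits under a rate constraint means, the following toy sketch searches exhaustively for the per-frame settings that minimize total distortion within a bit budget. All numbers and names are hypothetical, and brute-force search is a stand-in chosen for clarity; it is not how VP9 or the MuZero Rate-Controller makes these decisions.

```python
from itertools import product

# Hypothetical per-frame options as (bits_used, distortion), from coarse to fine.
FRAME_OPTIONS = [
    [(10, 9.0), (20, 4.0), (30, 2.5)],   # ordinary frame
    [(10, 20.0), (20, 9.0), (30, 3.0)],  # complex frame: extra bits help a lot
    [(10, 5.0), (20, 4.5), (30, 4.2)],   # near-static frame: extra bits help little
]


def best_allocation(frame_options, bit_budget):
    """Return (choice, total_distortion, total_bits) with the lowest distortion
    among all combinations that fit the budget, or None if none fits."""
    best = None
    for choice in product(*frame_options):
        bits = sum(b for b, _ in choice)
        distortion = sum(d for _, d in choice)
        if bits <= bit_budget and (best is None or distortion < best[1]):
            best = (choice, distortion, bits)
    return best


choice, distortion, bits = best_allocation(FRAME_OPTIONS, bit_budget=60)
uniform_distortion = sum(options[1][1] for options in FRAME_OPTIONS)  # 20 bits per frame

print("bits per frame:", [b for b, _ in choice])
print("distortion with tuned allocation:", distortion)
print("distortion with uniform allocation:", uniform_distortion)
```

With the same 60-bit budget, shifting bits from the near-static frame to the complex one lowers total distortion compared with a uniform split. Exhaustive search does not scale to real videos, since each decision affects later frames, which is why a learned sequential decision-maker is attractive for this problem.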

“The main idea of the algorithm is to predict those aspects of the future that are directly relevant for planning. The model receives the observation (for example, an image of the Go board) as input and converts it to a hidden state. The hidden state is then iteratively updated by a recurrent process that receives the previous hidden state and a hypothetical next action. At each of these steps, the model generates a strategy (prediction of the move to be played), a value function (prediction of the cumulative reward, e.g., the ultimate win), and an immediate reward prediction (e.g., the points earned by playing a move). The model is trained throughout with the single goal of accurately estimating these three important quantities to achieve the improved strategy and value function generated by the search, as well as the observed reward. There is no direct requirement or constraint for the hidden state to capture all the information necessary to reconstruct the original observation, drastically reducing the amount of information the model must retain and predict. There is also no requirement that the hidden state match the unknown true state of the environment, and there are no other constraints on the semantics of the state. Instead, the hidden states can represent any state that correctly estimates the strategy, value function, and reward. Intuitively, the agent can internally invent any dynamic that leads to accurate planning.”
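The recurrent process described in this passage can be sketched structurally as three functions: one that encodes the observation into a hidden state, one that advances the hidden state given a hypothetical action and predicts a reward, and one that predicts a strategy (policy) and value from a hidden state. The sketch below is only an illustration under loud assumptions: the weights are random and untrained, the sizes are arbitrary, and small linear maps stand in for the deep networks a real system would use.

```python
import math
import random

OBS_SIZE, HIDDEN_SIZE, NUM_ACTIONS = 6, 4, 3  # arbitrary toy dimensions
rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Untrained placeholder weights (assumption: real systems learn these end to end).
W_REPR = rand_matrix(HIDDEN_SIZE, OBS_SIZE)
W_DYN = rand_matrix(HIDDEN_SIZE, HIDDEN_SIZE + NUM_ACTIONS)
W_REWARD = rand_matrix(1, HIDDEN_SIZE + NUM_ACTIONS)
W_POLICY = rand_matrix(NUM_ACTIONS, HIDDEN_SIZE)
W_VALUE = rand_matrix(1, HIDDEN_SIZE)


def representation(observation):
    """Observation -> initial hidden state."""
    return [math.tanh(x) for x in matvec(W_REPR, observation)]


def dynamics(hidden, action):
    """(hidden state, hypothetical action) -> (next hidden state, predicted reward)."""
    one_hot = [1.0 if i == action else 0.0 for i in range(NUM_ACTIONS)]
    inputs = hidden + one_hot
    next_hidden = [math.tanh(x) for x in matvec(W_DYN, inputs)]
    reward = matvec(W_REWARD, inputs)[0]
    return next_hidden, reward


def prediction(hidden):
    """Hidden state -> (policy over actions, predicted value)."""
    return softmax(matvec(W_POLICY, hidden)), matvec(W_VALUE, hidden)[0]


def unroll(observation, actions):
    """Encode once, then step through hypothetical actions without touching the environment."""
    hidden = representation(observation)
    outputs = []
    for action in actions:
        hidden, reward = dynamics(hidden, action)
        policy, value = prediction(hidden)
        outputs.append((policy, value, reward))
    return outputs


for step, (policy, value, reward) in enumerate(unroll([0.1] * OBS_SIZE, [0, 2, 1]), start=1):
    print(step, [round(p, 3) for p in policy], round(value, 3), round(reward, 3))
```

Nothing in this structure decodes the hidden state back into an observation. Training would compare only the policy, value, and reward outputs against their targets, which is the sense in which the hidden state is free to represent whatever supports accurate planning.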