# Machine Learning - Reading Notes on Bayesian Optimization

### Brief Description

Bayesian optimization is a technique to perform optimization tasks when the objective function is unknown or hard to evaluate. A function is hard to evaluate if it takes time to get the results or if it's expensive to evaluate. Here we use the general sense of the term function: a process that generates outputs from inputs. For example, simulations and drug experiments can both be viewed as functions.

The primary goal of an optimization task is to find the optimal value of the function. If the objective function is unknown, there isn't anything we can do about it. We can evaluate the function at different points and select pick the best value among the results or we assume the function takes a certain form so that we can get a theoretical analytic result.

### Surrogate and Acquisition Function

Therefore, there are two aspects of the optimization problem when it involves an unknown objective function:

1. Which form should we use to approximate the unknown objective function?
2. How to select the points at which we want to evaluate the function? Remember the function is hard to evaluate and we want to minimize the number of function evaluations.

In Bayesian optimization, we assume the vector $$[f(x_1), f(x_2), ..., f(x_n)]$$ follows a multivariate normal distribution and we select the next data point to evaluate based on acquisition functions.

Now let's take a look at the while loop. At the beginning of the loop, we have

• n evaluation results: $$f(x_1), f(x_2), ..., f(x_n)$$
• A (prior) distribution of the value $$f(x)$$. (denoted by $$\mathbb{P}_n(f(x))$$).
• A distribution of the optimum $$x^*$$. (denoted by $$\mathbb{P}_n(x^*)$$).

At the end of the loop, we have

• A new evaluation result $$f(x_{n+1})$$.
• A (posterior) distribution of the value $$f(x)$$. (denoted by $$\mathbb{P}_n(f(x))$$).
• A new distribution of the optimum $$x^*$$. (denoted by $$\mathbb{P}_n(x^*)$$).

The important point is that at any time, we have the distribution of the value $$f(x)$$ for any $$x$$ in the domain space.

The next question is how to select the next data point $$x_{n+1}$$ to evaluate. In Bayesian optimization, the next data point is the maximizer of the acquisition function. There are three common acquisition functions used in Bayesian optimization:

• Expected Improvement
• Entropy Search

#### Expected Improvement

As we are evaluating the function at more data points, the known optimal value is increasing. Suppose the current optimum is $$x_n^{*}$$, according to the Expected Improvement acquisition function, we should select the next data point to evaluate so that it can maximize the expected increase of the optimal value.

At step $$n$$, we have the distribution of $$f(x)$$ for all $$x$$. Therefore, we can make an educational guess: the maximizer of the unknown function is the data point $$x^{*}$$ where $$\mathbb{P}_n(x^*)$$ has the best mean denoted by $$\mu_{n}^*$$. According to the Knowledge Gradient method, we should select the next data point so that it can maximize the expected increase of this value:

$$KG_n(x) = E_n[\mu_{n+1}^* - \mu_{n}^* | x_{n+1} = x]$$

#### Entropy Search

As we mentioned earlier, we also have the distribution of the optimum $$\mathbb{P}_n(x^*)$$ at each step. Because entropy represents uncertainty, the idea of Entropy search is that we should select the next data point so that it can maximize the reduction of the entropy of the optimum.

----- END -----

Welcome to join reddit self-learning community.

Want some fun stuff?