class: center, middle, inverse, title-slide

# Lec 12 - Numerical optimization
## Statistical Computing and Computation
### Sta 663 | Spring 2022
### Dr. Colin Rundel

---
exclude: true

```python
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

plt.rcParams['figure.dpi'] = 200

from scipy import optimize
```

```r
knitr::opts_chunk$set(
  fig.align="center",
  cache=FALSE
)
```

```r
local({
  hook_err_old <- knitr::knit_hooks$get("error")  # save the old hook
  knitr::knit_hooks$set(error = function(x, options) {
    # now do whatever you want to do with x, and pass
    # the new x to the old hook
    x = sub("## \n## Detailed traceback:\n.*$", "", x)
    x = sub("Error in py_call_impl\\(.*?\\)\\: ", "", x)
    hook_err_old(x, options)
  })
  
  hook_warn_old <- knitr::knit_hooks$get("warning")  # save the old hook
  knitr::knit_hooks$set(warning = function(x, options) {
    x = sub("<string>:1: ", "", x)
    hook_warn_old(x, options)
  })
})
```

---

## Numerical optimization - line search

Today we will be discussing one particular approach to numerical optimization - line search. This is a family of algorithmic approaches that attempt to find a (global or local) minimum by iterating on an initial guess. Generally they attempt to solve,

$$
\underset{\alpha>0}{\text{min}} f(x_k + \alpha \, p_k)
$$

where `\(f()\)` is the function we are attempting to minimize, `\(x_k\)` is our current guess at iteration `\(k\)`, `\(\alpha\)` is the step length, and `\(p_k\)` is the direction of movement.

We will only be dipping our toes in the water of this area, but the goal is to provide some context for the more common (and easier) use cases. With that in mind, we will be looking at methods for smooth functions (those whose 2nd derivative exists and is continuous).

---

## Naive Gradient Descent

We will start with a naive approach to gradient descent where we choose a fixed step size and determine the direction of movement from the gradient of the function at each iteration.

```python
def grad_desc_1d(x0, f, grad, step, max_step=100, tol = 1e-6):
  all_x_i = [x0]
  all_f_i = [f(x0)]
  
  x_i = x0
  
  try:
    for i in range(max_step):
      dx_i = grad(x_i)
      x_i = x_i - dx_i * step
      f_x_i = f(x_i)
      
      all_x_i.append(x_i)
      all_f_i.append(f_x_i)
      
      if np.abs(dx_i) < tol: break
    
  except OverflowError as err:
    print(f"{type(err).__name__}: {err}")
  
  if len(all_x_i) == max_step+1:
    print("Warning - Failed to converge!")
  
  return all_x_i, all_f_i
```

---

## A basic example

.pull-left[
$$
`\begin{aligned}
f(x) &= x^2 \\
\nabla f(x) &= 2x
\end{aligned}`
$$
]

.pull-right[
```python
f = lambda x: x**2
grad = lambda x: 2*x
```
]

--

<div>
.pull-left[
```python
opt = grad_desc_1d(-2., f, grad, step=0.25)
plot_1d_traj( (-2, 2), f, opt )
```

<img src="Lec12_files/figure-html/unnamed-chunk-4-1.png" width="90%" style="display: block; margin: auto;" />
]

--

.pull-right[
```python
opt = grad_desc_1d(-2, f, grad, step=0.5)
plot_1d_traj( (-2, 2), f, opt )
```

<img src="Lec12_files/figure-html/unnamed-chunk-5-3.png" width="90%" style="display: block; margin: auto;" />
]
</div>

---

## Where can it go wrong?

.pull-left[
```python
opt = grad_desc_1d(-2, f, grad, step=0.9)
plot_1d_traj( (-2,2), f, opt )
```

<img src="Lec12_files/figure-html/unnamed-chunk-6-5.png" width="90%" style="display: block; margin: auto;" />
]

--

.pull-right[
```python
opt = grad_desc_1d(-2, f, grad, step=1)
```

```
## Warning - Failed to converge!
```

```python
plot_1d_traj( (-2,2), f, opt )
```

<img src="Lec12_files/figure-html/unnamed-chunk-7-7.png" width="90%" style="display: block; margin: auto;" />
]
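---

## Aside: why `step=1` oscillates

For `\(f(x) = x^2\)` the fixed-step update is `\(x_{k+1} = x_k - 2\alpha x_k = (1 - 2\alpha)\,x_k\)`, so the iterates only contract when `\(0 < \alpha < 1\)`; at `\(\alpha = 1\)` they bounce between `\(-2\)` and `\(2\)` forever. Below is a small check of this (a sketch reusing `grad_desc_1d()` from above, not a result from the original slides).

```python
f = lambda x: x**2
grad = lambda x: 2*x

for step in [0.25, 0.5, 0.9, 1.0]:
  # grad_desc_1d() is defined a couple of slides back
  x_i, f_i = grad_desc_1d(-2., f, grad, step=step)
  print(f"step={step}: first iterates = {np.round(x_i[:4], 3)}, steps taken = {len(x_i)-1}")
```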
---

## Local minima of a quartic

.pull-left[
$$
`\begin{aligned}
f(x) &= x^4 + x^3 -x^2 - x \\
\nabla f(x) &= 4x^3 + 3x^2 - 2x - 1
\end{aligned}`
$$
]

.pull-right[
```python
f = lambda x: x**4 + x**3 - x**2 - x
grad = lambda x: 4*x**3 + 3*x**2 - 2*x - 1
```
]

--

<div>
.pull-left[
```python
opt = grad_desc_1d(-1.5, f, grad, step=0.2)
plot_1d_traj( (-1.5, 1.5), f, opt )
```

<img src="Lec12_files/figure-html/unnamed-chunk-9-9.png" width="90%" style="display: block; margin: auto;" />
]

--

.pull-right[
```python
opt = grad_desc_1d(-1.5, f, grad, step=0.25)
plot_1d_traj( (-1.5, 1.5), f, opt)
```

<img src="Lec12_files/figure-html/unnamed-chunk-10-11.png" width="90%" style="display: block; margin: auto;" />
]
</div>

---

## Alternative starting points

.pull-left[
```python
opt = grad_desc_1d(1.5, f, grad, step=0.2)
plot_1d_traj( (-1.5, 1.5), f, opt )
```

<img src="Lec12_files/figure-html/unnamed-chunk-11-13.png" width="90%" style="display: block; margin: auto;" />
]

--

.pull-right[
```python
opt = grad_desc_1d(1.25, f, grad, step=0.2)
plot_1d_traj( (-1.5, 1.5), f, opt)
```

<img src="Lec12_files/figure-html/unnamed-chunk-12-15.png" width="90%" style="display: block; margin: auto;" />
]

---

## Problematic step sizes

If the step size is too large it is possible for the algorithm to diverge, with the iterates overshooting the minimum and growing until they overflow.

.pull-left[
```python
opt = grad_desc_1d(-1.5, f, grad, step=0.75)
```

```
## OverflowError: (34, 'Result too large')
```

```python
plot_1d_traj( (-1.5, 1.5), f, opt )
```

<img src="Lec12_files/figure-html/unnamed-chunk-13-17.png" width="90%" style="display: block; margin: auto;" />
]

--

.pull-right[
```python
opt = grad_desc_1d(1.5, f, grad, step=0.25)
```

```
## OverflowError: (34, 'Result too large')
```

```python
plot_1d_traj( (-1.5, 1.5), f, opt)
```

<img src="Lec12_files/figure-html/unnamed-chunk-14-19.png" width="90%" style="display: block; margin: auto;" />
]
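---

## Aside: checking the quartic's minima

As a cross-check on the two basins of attraction we just saw, we can bracket each local minimum with scipy's `minimize_scalar()`. This is a small sketch that is not part of the original slides; the bounds below were chosen by eye from the plots above so that `\(f\)` is unimodal on each interval.

```python
from scipy import optimize

f = lambda x: x**4 + x**3 - x**2 - x

# bracket each basin separately (bounds chosen by eye from the plots)
left  = optimize.minimize_scalar(f, bounds=(-1.5, -0.5), method="bounded")
right = optimize.minimize_scalar(f, bounds=( 0.0,  1.5), method="bounded")

print(f"left minimum:  x = {left.x:.4f}, f(x) = {left.fun:.4f}")
print(f"right minimum: x = {right.x:.4f}, f(x) = {right.fun:.4f}")
```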
---

## Gradient Descent w/ backtracking

.pull-left[
As we have just seen, having too large of a step can<br/>
be problematic. One solution is to allow the step size<br/>
to adapt.

Backtracking involves checking if the proposed move is<br/>
advantageous (i.e. `\(f(x_k+\alpha p_k) < f(x_k)\)`),

* If it is advantageous then accept<br/> `\(x_{k+1} = x_k+\alpha p_k\)`.

* If not, shrink `\(\alpha\)` by a factor `\(\tau\)` (e.g. 0.5)<br/> and check again.

Pick a larger `\(\alpha\)` to start, since backtracking can only<br/>
shrink the step size - it cannot fix the inefficiency of<br/>
a step size that is too small.
]

.footnote[
This is a hand-wavy version of the [Armijo-Goldstein condition](https://en.wikipedia.org/wiki/Backtracking_line_search), <br/>
which checks `\(f(x_k-\alpha \nabla f(x_k)) \leq f(x_k) - c \alpha (\nabla f(x_k))^2\)`.
]

.pull-right[ .small[
```python
def grad_desc_1d_bt(x, f, grad, step, tau=0.5, max_step=100, max_back=10, tol = 1e-6):
  all_x_i = [x]
  all_f_i = [f(x)]
  
  try:
    for i in range(max_step):
      dx = grad(x)
      
*     for j in range(max_back):
*       new_x = x + step * (-dx)
*       new_f_x = f(new_x)
*       
*       if (new_f_x < all_f_i[-1]):
*         break
*       
*       step = step * tau
      
      x = new_x
      f_x = new_f_x
      
      all_x_i.append(x)
      all_f_i.append(f_x)
      
      if np.abs(dx) < tol: break
    
  except OverflowError as err:
    print(f"{type(err).__name__}: {err}")
  
  if len(all_x_i) == max_step+1:
    print("Warning - Failed to converge!")
  
  return all_x_i, all_f_i
```
] ]

---

.pull-left[
```python
opt = grad_desc_1d_bt(-1.5, f, grad, step=0.75, tau=0.5)
plot_1d_traj( (-1.5, 1.5), f, opt )
```

<img src="Lec12_files/figure-html/unnamed-chunk-16-21.png" width="90%" style="display: block; margin: auto;" />
]

--

.pull-right[
```python
opt = grad_desc_1d_bt(1.5, f, grad, step=0.25, tau=0.5)
plot_1d_traj( (-1.5, 1.5), f, opt)
```

<img src="Lec12_files/figure-html/unnamed-chunk-17-23.png" width="90%" style="display: block; margin: auto;" />
]

---

## A 2d cost function

We will be using `mk_quad()` to create quadratic functions with varying conditioning (as specified by the `epsilon` parameter).

$$
`\begin{align}
f(x,y) &= 0.33(x^2 + \epsilon^2 y^2 ) \\
\nabla f(x,y) &= \left[
\begin{matrix}
 0.66 \, x \\
 0.66 \, \epsilon^2 \, y
\end{matrix}
\right] \\
\nabla^2 f(x,y) &= \left[\begin{array}{cc}
 0.66 & 0 \\
 0 & 0.66 \, \epsilon^2
\end{array}\right]
\end{align}`
$$
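---

## Aside: what might `mk_quad()` look like?

The `mk_quad()` helper is not shown on these slides. The following is a minimal sketch that is consistent with the formulas above; the actual helper used to generate the plots may differ in its details (for example, it likely also supports vectorized evaluation over a grid for the contour plots).

```python
import numpy as np

def mk_quad(epsilon):
  # quadratic cost function whose conditioning is controlled by epsilon
  def f(x):
    x = np.asarray(x)
    return 0.33 * (x[0]**2 + epsilon**2 * x[1]**2)
  
  def grad(x):
    x = np.asarray(x)
    return np.array([0.66 * x[0], 0.66 * epsilon**2 * x[1]])
  
  def hess(x):
    # constant Hessian since f is quadratic
    return np.array([[0.66, 0], [0, 0.66 * epsilon**2]])
  
  return f, grad, hess
```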
---

## Examples

.pull-left[
```python
f, grad, hess = mk_quad(0.7)
plot_2d_traj((-1,2), (-1,2), f, title="$\\epsilon=0.7$")
```

<img src="Lec12_files/figure-html/unnamed-chunk-20-25.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
```python
f, grad, hess = mk_quad(0.02)
plot_2d_traj((-1,2), (-1,2), f, title="$\\epsilon=0.02$")
```

<img src="Lec12_files/figure-html/unnamed-chunk-21-27.png" width="100%" style="display: block; margin: auto;" />
]

---

## 2d gradient descent w/ backtracking

.midi[
```python
def grad_desc_2d(x0, f, grad, step, tau=0.5, max_step=100, max_back=10, tol = 1e-6):
  x_i = x0
  all_x_i = [x_i[0]]
  all_y_i = [x_i[1]]
  all_f_i = [f(x_i)]
  
  for i in range(max_step):
    dx_i = grad(x_i)
    
    for j in range(max_back):
      new_x_i = x_i - dx_i * step
      new_f_i = f(new_x_i)
      
      if (new_f_i < all_f_i[-1]):
        break
      
      step = step * tau
    
    x_i, f_i = new_x_i, new_f_i
    
    all_x_i.append(x_i[0])
    all_y_i.append(x_i[1])
    all_f_i.append(f_i)
    
*   if np.sqrt(np.sum(dx_i**2)) < tol: break
  
  return all_x_i, all_y_i, all_f_i
```
]

---

## Well conditioned cost function

.pull-left[
```python
f, grad, hess = mk_quad(0.7)
opt = grad_desc_2d((1.6, 1.1), f, grad, step=1)
plot_2d_traj((-1,2), (-1,2), f, title="$\\epsilon=0.7$", traj=opt)
```

<img src="Lec12_files/figure-html/unnamed-chunk-23-29.png" width="95%" style="display: block; margin: auto;" />
]

.pull-right[
```python
f, grad, hess = mk_quad(0.7)
opt = grad_desc_2d((1.6, 1.1), f, grad, step=2)
plot_2d_traj((-1,2), (-1,2), f, title="$\\epsilon=0.7$", traj=opt)
```

<img src="Lec12_files/figure-html/unnamed-chunk-24-31.png" width="95%" style="display: block; margin: auto;" />
]

---

## Ill-conditioned cost function

.pull-left[
```python
f, grad, hess = mk_quad(0.02)
opt = grad_desc_2d((1.6, 1.1), f, grad, step=1)
plot_2d_traj((-1,2), (-1,2), f, title="$\\epsilon=0.02$", traj=opt)
```

<img src="Lec12_files/figure-html/unnamed-chunk-25-33.png" width="95%" style="display: block; margin: auto;" />
]

.pull-right[
```python
f, grad, hess = mk_quad(0.02)
opt = grad_desc_2d((1.6, 1.1), f, grad, step=2)
plot_2d_traj((-1,2), (-1,2), f, title="$\\epsilon=0.02$", traj=opt)
```

<img src="Lec12_files/figure-html/unnamed-chunk-26-35.png" width="95%" style="display: block; margin: auto;" />
]

---

## Rosenbrock function (very ill conditioned)

.pull-left[
```python
f, grad, hess = mk_rosenbrock()
opt = grad_desc_2d((1.6, 1.1), f, grad, step=0.25)
plot_2d_traj((-2,2), (-2,2), f, traj=opt)
```

<img src="Lec12_files/figure-html/unnamed-chunk-27-37.png" width="95%" style="display: block; margin: auto;" />
]

.pull-right[
```python
f, grad, hess = mk_rosenbrock()
opt = grad_desc_2d((-0.5, 0), f, grad, step=0.25)
plot_2d_traj((-2,2), (-2,2), f, traj=opt)
```

<img src="Lec12_files/figure-html/unnamed-chunk-28-39.png" width="95%" style="display: block; margin: auto;" />
]

---

## Taylor Expansion

For any arbitrary smooth function, we can construct a 2nd order Taylor approximation as follows,

$$
`\begin{align}
f(x_k + \alpha \, p_k)
&= f(x_k) + \alpha \, p_k^T \nabla f(x_k + \alpha \, p_k) \\
&= f(x_k) + \alpha \, p_k^T \nabla f(x_k) + \frac{1}{2} \alpha^2 p_k^T \, \nabla^2 f(x_k + \alpha \, p_k) \, p_k \\
&\approx f(x_k) + \alpha \, p_k^T \nabla f(x_k) + \frac{1}{2} \alpha^2 p_k^T \, \nabla^2 f(x_k) \, p_k
\end{align}`
$$

---

## Newton's Method in 1d

Let's simplify things for now and consider just the 1d case, writing `\(\alpha\,p_k\)` as `\(\Delta\)`,

$$
f(x_k + \Delta) \approx f(x_k) + \Delta f'(x_k) + \frac{1}{2} \Delta^2 f''(x_k)
$$

To find the `\(\Delta\)` that minimizes this approximation we can take its derivative with respect to `\(\Delta\)` and set it equal to zero, which gives,

$$
0 = f'(x_k) + \Delta f''(x_k) \;\; \Rightarrow \;\; \Delta = -\frac{f'(x_k)}{f''(x_k)}
$$

which then suggests an iterative update rule of

$$
x\_{k+1} = x\_{k} -\frac{f'(x\_k)}{f''(x\_k)}
$$
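---

## Aside: Newton's method in 1d on the quartic

A quick sketch (not from the original slides) applying the update rule above to the quartic `\(f(x) = x^4 + x^3 - x^2 - x\)` from earlier, using `\(f''(x) = 12x^2 + 6x - 2\)`. Note that Newton's method only looks for a point where `\(f'(x) = 0\)`, so depending on the starting value it can land on the local maximum just as easily as on a minimum.

```python
f   = lambda x: x**4 + x**3 - x**2 - x
fp  = lambda x: 4*x**3 + 3*x**2 - 2*x - 1   # f'
fpp = lambda x: 12*x**2 + 6*x - 2           # f''

x = -1.5
for i in range(50):
  x_new = x - fp(x) / fpp(x)   # Newton update
  if abs(x_new - x) < 1e-10:
    break
  x = x_new

print(x, fp(x))   # a critical point, f'(x) should be ~0
```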
---

## Generalizing to `\(n\)`d

Based on the same argument we can derive the following result for a function in `\(\mathbb{R}^n\)`,

$$
f(x_k + \Delta) \approx f(x_k) + \Delta^T \nabla f(x_k) + \frac{1}{2} \Delta^T \, \nabla^2 f(x_k) \,\Delta
$$

$$
0 = \nabla f(x_k) + \nabla^2 f(x_k) \, \Delta \;\; \Rightarrow \;\; \Delta = -\left(\nabla^2 f(x_k)\right)^{-1} \nabla f(x_k)
$$

which then suggests an iterative update rule of

$$
x\_{k+1} = x\_{k} - (\nabla^2 f(x\_k))^{-1} \, \nabla f(x\_k)
$$

---

```python
def newtons_method(x0, f, grad, hess, max_iter=100, max_back=10, tol=1e-8):
  all_x_i = [x0[0]]
  all_y_i = [x0[1]]
  all_f_i = [f(x0)]
  
  x_i = x0
  
  for i in range(max_iter):
    g_i = grad(x_i)
    step = - np.linalg.solve(hess(x_i), g_i)
    
    for j in range(max_back):
      new_x_i = x_i + step
      new_f_i = f(new_x_i)
      
      if (new_f_i < all_f_i[-1]):
        break
      
      step /= 2
    
    x_i, f_i = new_x_i, new_f_i
    
    all_x_i.append(x_i[0])
    all_y_i.append(x_i[1])
    all_f_i.append(f_i)
    
    if np.sqrt(np.sum(g_i**2)) < tol: break
  
  return all_x_i, all_y_i, all_f_i
```

.footnote[Based on Chapter 5.1 from [Core Statistics](https://www.maths.ed.ac.uk/~swood34/core-statistics.pdf)]

---

## Well conditioned quadratic cost function

.pull-left[
```python
f, grad, hess = mk_quad(0.7)
opt = newtons_method((1.6, 1.1), f, grad, hess)
plot_2d_traj((-1,2), (-1,2), f, traj=opt)
```

<img src="Lec12_files/figure-html/unnamed-chunk-30-41.png" width="95%" style="display: block; margin: auto;" />
]

.pull-right[
```python
f, grad, hess = mk_quad(0.02)
opt = newtons_method((1.6, 1.1), f, grad, hess)
plot_2d_traj((-1,2), (-1,2), f, traj=opt)
```

<img src="Lec12_files/figure-html/unnamed-chunk-31-43.png" width="95%" style="display: block; margin: auto;" />
]

---

## Rosenbrock function (very ill conditioned)

.pull-left[
```python
f, grad, hess = mk_rosenbrock()
opt = newtons_method((1.6, 1.1), f, grad, hess)
plot_2d_traj((-2,2), (-2,2), f, traj=opt)
```

<img src="Lec12_files/figure-html/unnamed-chunk-32-45.png" width="95%" style="display: block; margin: auto;" />
]

.pull-right[
```python
f, grad, hess = mk_rosenbrock()
opt = newtons_method((-0.5, 0), f, grad, hess)
plot_2d_traj((-2,2), (-2,2), f, traj=opt)
```

<img src="Lec12_files/figure-html/unnamed-chunk-33-47.png" width="95%" style="display: block; margin: auto;" />
]

---

## Conjugate gradients

This is a general approach for solving a system of linear equations of the form `\(Ax=b\)` where `\(A\)` is an `\(n \times n\)` symmetric positive definite matrix and `\(b\)` is `\(n \times 1\)`, with `\(x\)` unknown.

This type of problem can also be expressed as a quadratic minimization problem of the form,

$$
\underset{x}{\text{min}} \; f(x) = \frac{1}{2} x^T \, A \, x - b^T x + c
$$

The goal is then to find `\(n\)` conjugate vectors ( `\(p^T_i \, A \, p_j = 0\)` for all `\(i \neq j\)`) and their coefficients such that

$$
x\_* = \sum_{i=1}^n \alpha_i \, p_i
$$
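---

## Aside: CG as a linear solver

To make the linear-system view concrete, here is a minimal sketch of the classic linear conjugate gradient iteration (the textbook algorithm, not code from this course's helpers), checked against `np.linalg.solve()`. The small SPD matrix `A` below is just an arbitrary example.

```python
import numpy as np

def linear_cg(A, b, x0, max_iter=100, tol=1e-8):
  # solve A x = b for symmetric positive definite A
  x = x0.astype(float)
  r = A @ x - b            # residual (gradient of the quadratic)
  p = -r
  
  for i in range(max_iter):
    alpha = (r @ r) / (p @ A @ p)
    x = x + alpha * p
    r_new = r + alpha * (A @ p)
    
    if np.sqrt(r_new @ r_new) < tol:
      break
    
    beta = (r_new @ r_new) / (r @ r)
    p = -r_new + beta * p
    r = r_new
  
  return x

A = np.array([[4., 1.], [1., 3.]])
b = np.array([1., 2.])

print(linear_cg(A, b, x0=np.zeros(2)))
print(np.linalg.solve(A, b))            # should agree (after at most n = 2 steps)
```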
---

## Conjugate gradient algorithm

.pull-left[
Given `\(x_0\)` we set the following initial values,

`$$\begin{align}
r_0 &= \nabla f(x_0) \\
p_0 &= -r_0 \\
k &= 0
\end{align}$$`

while `\(\|r_k\|_2 > \text{tol}\)`,

`$$\begin{align}
\alpha_k &= -\frac{r_k^T \, p_k}{p_k^T \, \nabla^2 f(x_k) \, p_k} \\
x_{k+1} &= x_k + \alpha_k \, p_k \\
r_{k+1} &= \nabla f(x_{k+1}) \\
\beta_{k} &= \frac{ r^T_{k+1} \, \nabla^2 f(x_k) \, p_{k} }{p_k^T \, \nabla^2 f(x_k) \, p_k} \\
p_{k+1} &= -r_{k+1} + \beta_{k} \, p_k \\
k &= k+1
\end{align}$$`
]

.footnote[
From Chapter 5.1 of [Numerical Optimization](https://find.library.duke.edu/catalog/DUKE004973775) 2006
]

.pull-right[
```python
def conjugate_gradient(x0, f, grad, hess, max_iter=100, tol=1e-8):
  all_x_i = [x0[0]]
  all_y_i = [x0[1]]
  all_f_i = [f(x0)]
  
  x_i = x0
  r_i = grad(x0)
  p_i = -r_i
  
  for i in range(max_iter):
    a_i = - r_i.T @ p_i / (p_i.T @ hess(x_i) @ p_i)
    x_i_new = x_i + a_i * p_i
    r_i_new = grad(x_i_new)
    b_i = (r_i_new.T @ hess(x_i) @ p_i) / (p_i.T @ hess(x_i) @ p_i)
    p_i_new = -r_i_new + b_i * p_i
    
    x_i, r_i, p_i = x_i_new, r_i_new, p_i_new
    
    all_x_i.append(x_i[0])
    all_y_i.append(x_i[1])
    all_f_i.append(f(x_i))
    
    if np.sqrt(np.sum(r_i_new**2)) < tol: break
  
  return all_x_i, all_y_i, all_f_i
```
]

---

## Trajectory

.pull-left[
```python
f, grad, hess = mk_quad(0.7)
opt = conjugate_gradient((1.6, 1.1), f, grad, hess)
plot_2d_traj((-1,2), (-1,2), f, title="$\\epsilon=0.7$", traj=opt)
```

<img src="Lec12_files/figure-html/unnamed-chunk-35-49.png" width="95%" style="display: block; margin: auto;" />
]

.pull-right[
```python
f, grad, hess = mk_quad(0.02)
opt = conjugate_gradient((1.6, 1.1), f, grad, hess)
plot_2d_traj((-1,2), (-1,2), f, title="$\\epsilon=0.02$", traj=opt)
```

<img src="Lec12_files/figure-html/unnamed-chunk-36-51.png" width="95%" style="display: block; margin: auto;" />
]

---

## Rosenbrock's function

.pull-left[
```python
f, grad, hess = mk_rosenbrock()
opt = conjugate_gradient((1.6, 1.1), f, grad, hess)
plot_2d_traj((-2,2), (-2,2), f, traj=opt)
```

<img src="Lec12_files/figure-html/unnamed-chunk-37-53.png" width="95%" style="display: block; margin: auto;" />
]

.pull-right[
```python
f, grad, hess = mk_rosenbrock()
opt = conjugate_gradient((-0.5, 0), f, grad, hess)
plot_2d_traj((-2,2), (-2,2), f, traj=opt)
```

<img src="Lec12_files/figure-html/unnamed-chunk-38-55.png" width="95%" style="display: block; margin: auto;" />
]

---

## CG in scipy

Scipy's optimize module implements the conjugate gradient algorithm of Polak and Ribiere, a variant that does not require the Hessian.

.pull-left[
#### Differences:

* `\(\alpha_k\)` is calculated via a line search along the direction `\(p_k\)`

* `\(\beta_{k+1}\)` is replaced with

$$
\beta\_{k+1}^{PR} = \frac{\nabla f(x\_{k+1})^T \left(\nabla f(x\_{k+1}) - \nabla f(x\_{k})\right)}{\nabla f(x\_k)^T \, \nabla f(x\_k)}
$$
]

.pull-right[
```python
def conjugate_gradient_scipy(x0, f, grad, tol=1e-8):
  all_x_i = [x0[0]]
  all_y_i = [x0[1]]
  all_f_i = [f(x0)]
  
  def store(X):
    x, y = X
    all_x_i.append(x)
    all_y_i.append(y)
    all_f_i.append(f(X))
  
  optimize.minimize(
    f, x0, jac=grad, method="CG",
    callback=store, tol=tol
  )
  
  return all_x_i, all_y_i, all_f_i
```
]
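---

## Aside: inspecting the CG result directly

The wrapper above only records the trajectory via `callback`, but `optimize.minimize()` also returns an `OptimizeResult` that is worth looking at on its own. Below is a small self-contained example using scipy's built-in `rosen()` / `rosen_der()` (the standard Rosenbrock function, which may be scaled differently than `mk_rosenbrock()`).

```python
from scipy import optimize

res = optimize.minimize(
  optimize.rosen, x0=(1.6, 1.1), jac=optimize.rosen_der,
  method="CG", tol=1e-8
)

print(res.success)   # did the method report convergence?
print(res.nit)       # number of iterations
print(res.x)         # location of the minimum found
```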
---

## Trajectory

.pull-left[
```python
f, grad, hess = mk_quad(0.7)
opt = conjugate_gradient_scipy((1.6, 1.1), f, grad)
plot_2d_traj((-1,2), (-1,2), f, title="$\\epsilon=0.7$", traj=opt)
```

<img src="Lec12_files/figure-html/unnamed-chunk-40-57.png" width="95%" style="display: block; margin: auto;" />
]

.pull-right[
```python
f, grad, hess = mk_quad(0.02)
opt = conjugate_gradient_scipy((1.6, 1.1), f, grad)
plot_2d_traj((-1,2), (-1,2), f, title="$\\epsilon=0.02$", traj=opt)
```

<img src="Lec12_files/figure-html/unnamed-chunk-41-59.png" width="95%" style="display: block; margin: auto;" />
]

---

## Rosenbrock's function

.pull-left[
```python
f, grad, hess = mk_rosenbrock()
opt = conjugate_gradient_scipy((1.6, 1.1), f, grad)
plot_2d_traj((-2,2), (-2,2), f, traj=opt)
```

<img src="Lec12_files/figure-html/unnamed-chunk-42-61.png" width="95%" style="display: block; margin: auto;" />
]

.pull-right[
```python
f, grad, hess = mk_rosenbrock()
opt = conjugate_gradient_scipy((-0.5, 0), f, grad)
plot_2d_traj((-2,2), (-2,2), f, traj=opt)
```

<img src="Lec12_files/figure-html/unnamed-chunk-43-63.png" width="95%" style="display: block; margin: auto;" />
]

---

## Method: Newton-CG

This is a variant of Newton's method that does not require inverting the Hessian, or even providing a Hessian function, in which case the Hessian is estimated by finite differencing of the gradient.

.pull-left[
```python
def newton_cg(x0, f, grad, hess=None, tol=1e-8):
  all_x_i = [x0[0]]
  all_y_i = [x0[1]]
  all_f_i = [f(x0)]
  
  def store(X):
    x, y = X
    all_x_i.append(x)
    all_y_i.append(y)
    all_f_i.append(f(X))
  
  optimize.minimize(
    f, x0, jac=grad, hess=hess, tol=tol,
    method="Newton-CG", callback=store
  )
  
  return all_x_i, all_y_i, all_f_i
```
]

---

## Trajectory - well conditioned

.pull-left[
```python
f, grad, hess = mk_quad(0.7)
opt = newton_cg((1.6, 1.1), f, grad)
plot_2d_traj((-1,2), (-1,2), f, traj=opt, title="w/o hessian")
```

<img src="Lec12_files/figure-html/unnamed-chunk-45-65.png" width="95%" style="display: block; margin: auto;" />
]

.pull-right[
```python
f, grad, hess = mk_quad(0.7)
opt = newton_cg((1.6, 1.1), f, grad, hess)
plot_2d_traj((-1,2), (-1,2), f, traj=opt, title="w/ hessian")
```

<img src="Lec12_files/figure-html/unnamed-chunk-46-67.png" width="95%" style="display: block; margin: auto;" />
]

---

## Trajectory - ill-conditioned

.pull-left[
```python
f, grad, hess = mk_quad(0.02)
opt = newton_cg((1.6, 1.1), f, grad)
plot_2d_traj((-1,2), (-1,2), f, traj=opt, title="w/o hessian")
```

<img src="Lec12_files/figure-html/unnamed-chunk-47-69.png" width="95%" style="display: block; margin: auto;" />
]

.pull-right[
```python
f, grad, hess = mk_quad(0.02)
opt = newton_cg((1.6, 1.1), f, grad, hess)
plot_2d_traj((-1,2), (-1,2), f, traj=opt, title="w/ hessian")
```

<img src="Lec12_files/figure-html/unnamed-chunk-48-71.png" width="95%" style="display: block; margin: auto;" />
]

---

## Rosenbrock's function

.pull-left[
```python
f, grad, hess = mk_rosenbrock()
opt = newton_cg((1.6, 1.1), f, grad)
plot_2d_traj((-2,2), (-2,2), f, traj=opt, title="w/o hessian")
```

<img src="Lec12_files/figure-html/unnamed-chunk-49-73.png" width="95%" style="display: block; margin: auto;" />
]

.pull-right[
```python
f, grad, hess = mk_rosenbrock()
opt = newton_cg((1.6, 1.1), f, grad, hess)
plot_2d_traj((-2,2), (-2,2), f, traj=opt, title="w/ hessian")
```

<img src="Lec12_files/figure-html/unnamed-chunk-50-75.png" width="95%" style="display: block; margin: auto;" />
]

---

## Method: BFGS

The Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm is a quasi-Newton method which iteratively improves its approximation of the Hessian.

.pull-left[
```python
def bfgs(x0, f, grad, hess=None, tol=1e-8):
  all_x_i = [x0[0]]
  all_y_i = [x0[1]]
  all_f_i = [f(x0)]
  
  def store(X):
    x, y = X
    all_x_i.append(x)
    all_y_i.append(y)
    all_f_i.append(f(X))
  
  optimize.minimize(
    f, x0, jac=grad, tol=tol,
    method="BFGS", callback=store
  )
  
  return all_x_i, all_y_i, all_f_i
```
]

---

## Trajectory

.pull-left[
```python
f, grad, hess = mk_quad(0.7)
opt = bfgs((1.6, 1.1), f, grad)
plot_2d_traj((-1,2), (-1,2), f, traj=opt)
```

<img src="Lec12_files/figure-html/unnamed-chunk-52-77.png" width="95%" style="display: block; margin: auto;" />
]

.pull-right[
```python
f, grad, hess = mk_quad(0.02)
opt = bfgs((1.6, 1.1), f, grad)
plot_2d_traj((-1,2), (-1,2), f, traj=opt)
```

<img src="Lec12_files/figure-html/unnamed-chunk-53-79.png" width="95%" style="display: block; margin: auto;" />
]

---

## Rosenbrock's function

.pull-left[
```python
f, grad, hess = mk_rosenbrock()
opt = bfgs((1.6, 1.1), f, grad)
plot_2d_traj((-2,2), (-2,2), f, traj=opt)
```

<img src="Lec12_files/figure-html/unnamed-chunk-54-81.png" width="95%" style="display: block; margin: auto;" />
]

.pull-right[
```python
f, grad, hess = mk_rosenbrock()
opt = bfgs((-0.5, 0), f, grad)
plot_2d_traj((-2,2), (-2,2), f, traj=opt)
```

<img src="Lec12_files/figure-html/unnamed-chunk-55-83.png" width="95%" style="display: block; margin: auto;" />
]
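---

## Aside: BFGS's Hessian approximation

A useful side effect of BFGS is that its final approximation of the *inverse* Hessian is returned as part of the result. A small self-contained example, again using scipy's built-in Rosenbrock (`rosen()` / `rosen_der()`) rather than `mk_rosenbrock()`:

```python
from scipy import optimize

res = optimize.minimize(
  optimize.rosen, x0=(-0.5, 0), jac=optimize.rosen_der,
  method="BFGS", tol=1e-8
)

print(res.x)         # minimum found
print(res.hess_inv)  # BFGS's approximation of the inverse Hessian at the solution
```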
---

## Method: Nelder-Mead

This is a gradient-free method that uses a sequence of simplexes to iteratively bracket the minimum.

.pull-left[
```python
def nelder_mead(x0, f, grad, hess=None, tol=1e-8):
  all_x_i = [x0[0]]
  all_y_i = [x0[1]]
  all_f_i = [f(x0)]
  
  def store(X):
    x, y = X
    all_x_i.append(x)
    all_y_i.append(y)
    all_f_i.append(f(X))
  
  optimize.minimize(
    f, x0, tol=tol,
    method="Nelder-Mead", callback=store
  )
  
  return all_x_i, all_y_i, all_f_i
```
]

---

## Nelder-Mead

.center[
<iframe width="1000" height="560" src="http://nelder-mead.s3-website.us-east-2.amazonaws.com/">
</iframe>
]

.footnote[From https://github.com/greg-rychlewski/nelder-mead]

---

## Trajectory

.pull-left[
```python
f, grad, hess = mk_quad(0.7)
opt = nelder_mead((1.6, 1.1), f, grad)
plot_2d_traj((-1,2), (-1,2), f, traj=opt)
```

<img src="Lec12_files/figure-html/unnamed-chunk-57-85.png" width="95%" style="display: block; margin: auto;" />
]

.pull-right[
```python
f, grad, hess = mk_quad(0.02)
opt = nelder_mead((1.6, 1.1), f, grad)
plot_2d_traj((-1,2), (-1,2), f, traj=opt)
```

<img src="Lec12_files/figure-html/unnamed-chunk-58-87.png" width="95%" style="display: block; margin: auto;" />
]

---

## Rosenbrock's function

.pull-left[
```python
f, grad, hess = mk_rosenbrock()
opt = nelder_mead((1.6, 1.1), f, grad)
plot_2d_traj((-2,2), (-2,2), f, traj=opt)
```

<img src="Lec12_files/figure-html/unnamed-chunk-59-89.png" width="95%" style="display: block; margin: auto;" />
]

.pull-right[
```python
f, grad, hess = mk_rosenbrock()
opt = nelder_mead((-0.5, 0), f, grad)
plot_2d_traj((-2,2), (-2,2), f, traj=opt)
```

<img src="Lec12_files/figure-html/unnamed-chunk-60-91.png" width="95%" style="display: block; margin: auto;" />
]
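---

## Aside: the cost of being gradient-free

Nelder-Mead never evaluates a gradient, so it pays for its simplicity in function evaluations. Below is a small sketch (not from the original slides) using the ill-conditioned quadratic from the `mk_quad()` formula above, written out explicitly so the block is self-contained.

```python
from scipy import optimize

# f(x, y) = 0.33 * (x^2 + epsilon^2 * y^2) with epsilon = 0.02
f = lambda x: 0.33 * (x[0]**2 + 0.02**2 * x[1]**2)

res = optimize.minimize(f, x0=(1.6, 1.1), method="Nelder-Mead", tol=1e-8)

print(res.x)     # minimum found (no jac or hess arguments needed)
print(res.nfev)  # function evaluations, often many more than a gradient-based method needs
```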