Lecture 5 Functions of several variables and optimization with several variables
Learning objectives
- Define a partial derivative
- Identify higher order derivatives and partial derivatives
- Define notation for calculus performed on vector and matrix forms
- Demonstrate multivariable calculus methods on social scientific research
- Calculate critical points, partial derivatives, and double integrals
Supplemental readings
- Chapter 14, Pemberton and Rau (2011)
- OpenStax Calculus: Volume 3, ch 4
5.1 Higher order derivatives
The first derivative is obtained by applying the definition of the derivative to a function, and it can be written as
\[f'(x), ~~ y', ~~ \frac{d}{dx}f(x), ~~ \frac{dy}{dx}\]
We can keep applying the differentiation process to functions that are themselves derivatives. The derivative of \(f'(x)\) with respect to \(x\) is \[f''(x)=\lim\limits_{h\to 0}\frac{f'(x+h)-f'(x)}{h}\] and is called the second derivative:
\[f''(x), ~~ y'', ~~ \frac{d^2}{dx^2}f(x), ~~ \frac{d^2y}{dx^2}\]
Similarly, the derivative of \(f''(x)\) is called the third derivative and is denoted \(f'''(x)\). By extension, the \(n\)th derivative is written \(\frac{d^n}{dx^n}f(x)\) or \(\frac{d^ny}{dx^n}\).
\[ \begin{aligned} f(x) &=x^3\\ f^{\prime}(x) &=3x^2\\ f^{\prime\prime}(x) &=6x \\ f^{\prime\prime\prime}(x) &=6\\ f^{\prime\prime\prime\prime}(x) &=0\\ \end{aligned} \]
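As a quick check, base R's `D()` function computes symbolic derivatives of simple expressions, so we can reproduce the repeated differentiation of \(x^3\) above (a minimal sketch; `D()` only handles a limited set of expressions):

```r
# Symbolic differentiation with base R's D(); start from f(x) = x^3
f1 <- D(quote(x^3), "x")   # first derivative: 3 * x^2
f2 <- D(f1, "x")           # second derivative (simplifies to 6 * x)
f3 <- D(f2, "x")           # third derivative (a constant, 6)
f1; f2; f3
```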
Earlier, we said that if a function is differentiable at a given point, then it must be continuous. Further, if \(f'(x)\) is itself continuous, then \(f(x)\) is called continuously differentiable. All of this matters because many of our results about optimization rely on differentiation, so we want functions that can be differentiated as many times as we need. A function that can be differentiated infinitely many times, with every derivative continuous, is called smooth. Some examples include:
\[ \begin{aligned} f(x) &= x^2 \\ f(x) &= e^x \end{aligned} \]
5.2 Multivariate function
A multivariate function is a function with more than one argument.
Example
\[ \begin{aligned} f(\mathbf{x} )&= f(x_{1}, x_{2}, \ldots, x_{N} ) \\ &= x_{1} +x_{2} + \ldots + x_{N} \\ &= \sum_{i=1}^{N} x_{i} \end{aligned} \]
5.2.1 Definition
Definition 5.1 (Multivariate function) Suppose \(f:\Re^{n} \rightarrow \Re^{1}\). We will call \(f\) a multivariate function. We will commonly write,
\[f(\mathbf{x}) = f(x_{1}, x_{2}, x_{3}, \ldots, x_{n} )\]
- \(\Re^{n} = \underbrace{\Re \times \Re \times \cdots \times \Re}_{n \text{ times}}\), the \(n\)-fold Cartesian product of \(\Re\)
- The function we consider will take \(n\) inputs and output a single number (that lives in \(\Re^{1}\), or the real line)
5.2.2 Evaluating multivariate functions
Example 5.1 \[f(x_{1}, x_{2}, x_{3}) = x_1 + x_2 + x_3\]
Evaluate at \(\mathbf{x} = (x_{1}, x_{2}, x_{3}) = (2, 3, 2)\)
\[ \begin{aligned} f(2, 3, 2) & = 2 + 3 + 2 \\ & = 7 \end{aligned} \]
Example 5.2 \[f(x_{1}, x_{2} ) = x_{1} + x_{2} + x_{1} x_{2}\]
Evaluate at \(\mathbf{w} = (w_{1}, w_{2} ) = (1, 2)\)
\[ \begin{aligned} f(w_{1}, w_{2}) & = w_{1} + w_{2} + w_{1} w_{2} \\ & = 1 + 2 + 1 \times 2 \\ & = 5 \end{aligned} \]
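In R, a multivariate function is simply a function with several arguments (or one vector argument). A minimal sketch reproducing Examples 5.1 and 5.2:

```r
# Example 5.1: f(x1, x2, x3) = x1 + x2 + x3
f <- function(x1, x2, x3) x1 + x2 + x3
f(2, 3, 2)   # 7

# Example 5.2: g(x1, x2) = x1 + x2 + x1 * x2
g <- function(x1, x2) x1 + x2 + x1 * x2
g(1, 2)      # 5
```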
Example 5.3 (Preferences for multidimensional policy) Recall that in the spatial model, we suppose policy and political actors are located in a space. Suppose that policy is \(N\) dimensional - or \(\mathbf{x} \in \Re^{N}\). Suppose that legislator \(i\)’s utility is a \(U:\Re^{N} \rightarrow \Re^{1}\) and is given by,
\[ \begin{aligned} U(\mathbf{x}) & = U(x_{1}, x_{2}, \ldots, x_{N} ) \\ & = - (x_{1} - \mu_{1} )^2 - (x_{2} - \mu_{2})^2 - \ldots - (x_{N} - \mu_{N})^{2} \\ & = -\sum_{j=1}^{N} (x_{j} - \mu_{j} )^{2} \end{aligned} \]
Suppose \(\mathbf{\mu} = (\mu_{1}, \mu_{2}, \ldots, \mu_{N} ) = (0, 0, \ldots, 0)\). Evaluate legislator’s utility for a policy proposal of \(\mathbf{m} = (1, 1, \ldots, 1)\)
\[ \begin{aligned} U(\mathbf{m} ) & = U(1, 1, \ldots, 1) \\ & = - (1 - 0)^2 - (1- 0) ^2 - \ldots - (1- 0) ^2 \\ & = -\sum_{j=1}^{N} 1 = - N \\ \end{aligned} \]
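A short sketch of Example 5.3 in R; the number of policy dimensions `N = 5` is an arbitrary choice for illustration, with the ideal point at the origin:

```r
# Quadratic (spatial) utility: U(x) = -sum((x - mu)^2)
quad_utility <- function(x, mu) -sum((x - mu)^2)

N  <- 5              # arbitrary number of policy dimensions
mu <- rep(0, N)      # legislator's ideal point at the origin
m  <- rep(1, N)      # proposal of all ones
quad_utility(m, mu)  # -N = -5
```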
Example 5.4 (Regression models and randomized treatments) Often we administer randomized experiments. The most recent wave of interest began with voter mobilization, asking whether individual \(i\) turns out to vote, \(\text{Vote}_{i}\):
- \(T = 1\) (treated): voter receives mobilization
- \(T = 0\) (control): voter does not receive mobilization
Suppose we find the following regression model, where \(x_{2}\) is a participant’s age:
\[ \begin{aligned} f(T,x_2) & = \Pr(\text{Vote}_{i} = 1 | T, x_{2} ) \\ & = \beta_{0} + \beta_{1} T + \beta_{2} x_{2} \end{aligned} \]
We can calculate the effect of the experiment as:
\[ \begin{aligned} f(T = 1, x_2) - f(T=0, x_2) & = \beta_{0} + \beta_{1} 1 + \beta_{2} x_{2} - (\beta_{0} + \beta_{1} 0 + \beta_{2} x_{2}) \\ & = \beta_{0} - \beta_{0} + \beta_{1}(1 - 0) + \beta_{2}(x_{2} - x_{2} ) \\ & = \beta_{1} \end{aligned} \]
5.3 Multivariate derivatives
What happens when there’s more than one variable that is changing?
Suppose we have a function \(f\) of two (or more) variables and we want to determine the rate of change relative to one of the variables. To do so, we find its partial derivative, which is defined similarly to the derivative of a function of one variable.
Definition 5.2 (Partial derivative) Let \(f\) be a function of the variables \((x_1,\ldots,x_n)\). The partial derivative of \(f\) with respect to \(x_i\) is
\[\frac{\partial f}{\partial x_i} (x_1,\ldots,x_n) = \lim\limits_{h\to 0} \frac{f(x_1,\ldots,x_i+h,\ldots,x_n)-f(x_1,\ldots,x_i,\ldots,x_n)}{h}\]
Only the \(i\)th variable changes — the others are treated as constants.
We can take higher-order partial derivatives, just as we did with functions of a single variable, except that now the higher-order partials can be taken with respect to different variables (e.g. first with respect to \(x\), then with respect to \(y\)).
Suppose \(f(x,y)=x^2+y^2\). Then
\[ \begin{aligned} \frac{\partial f}{\partial x}(x,y) &= 2x \\ \frac{\partial f}{\partial y}(x,y) &= 2y\\ \frac{\partial^2 f}{\partial x^2}(x,y) &= 2\\ \frac{\partial^2 f}{\partial x \partial y}(x,y) &= 0 \end{aligned} \]
Let \(f(x,y)=x^3 y^4 +e^x -\log y\). What are the following partial derivatives?
\[ \begin{aligned} \frac{\partial f}{\partial x}(x,y) &= 3x^2y^4 + e^x\\ \frac{\partial f}{\partial y}(x,y) &=4x^3y^3 - \frac{1}{y}\\ \frac{\partial^2 f}{\partial x^2}(x,y) &= 6xy^4 + e^x\\ \frac{\partial^2 f}{\partial x \partial y}(x,y) &= 12x^2y^3 \end{aligned} \]
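We can check these partials in R. A hedged sketch uses base R's symbolic `D()` for the analytic forms and a forward finite difference as a numerical cross-check (the step size `h` is an arbitrary choice):

```r
expr <- quote(x^3 * y^4 + exp(x) - log(y))
D(expr, "x")   # matches 3 * x^2 * y^4 + exp(x)
D(expr, "y")   # matches 4 * x^3 * y^3 - 1/y

# Numerical cross-check of df/dx at (x, y) = (1, 2)
f <- function(x, y) x^3 * y^4 + exp(x) - log(y)
h <- 1e-6
(f(1 + h, 2) - f(1, 2)) / h   # approx 3 * 1 * 16 + exp(1) = 50.7
```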
Example 5.5 (Rate of change, linear regression) Suppose we regress Trump's approval rate in month \(i\), \(\text{Approval}_{i}\), on \(\text{Employ}_{i}\) and \(\text{Gas}_{i}\). We obtain the following model:
\[\text{Approval}_{i} = 0.8 -0.5 \text{Employ}_{i} -0.25 \text{Gas}_{i}\]
We are modeling \(\text{Approval}_{i} = f(\text{Employ}_{i}, \text{Gas}_{i} )\). What is the partial derivative with respect to employment?
\[\frac{\partial f(\text{Employ}_{i}, \text{Gas}_{i} ) }{\partial \text{Employ}_{i} } = -0.5\]
5.4 Multivariate optimization
Just as we want to optimize functions of a single variable, we often wish to optimize functions of multiple variables. For example, we may seek:
- Parameters \(\mathbf{\beta} = (\beta_{1}, \beta_{2}, \ldots, \beta_{n} )\) such that \(f(\mathbf{\beta}| \mathbf{X}, \mathbf{Y})\) is maximized
- A policy \(\mathbf{x} \in \Re^{n}\) that maximizes \(U(\mathbf{x})\)
- Weights \(\mathbf{\pi} = (\pi_{1}, \pi_{2}, \ldots, \pi_{K})\) such that a weighted average of forecasts \(\mathbf{f} = (f_{1} , f_{2}, \ldots, f_{K})\) has minimum loss,
\[\min_{\mathbf{\pi}} \left( \sum_{j=1}^{K} \pi_{j} f_{j} - y \right)^{2}\]
As before, we will consider both analytic and computational approaches.
5.4.1 Differences from single variable optimization procedure
It is the same basic approach, except that we have multiple parameters of interest. This requires more explicit use of linear algebra to track all the components and to optimize over the multidimensional space.
Let \(\mathbf{x} \in \Re^{n}\) and let \(\delta >0\). Define a neighborhood of \(\mathbf{x}\), \(B(\mathbf{x}, \delta)\), as the set of points such that,
\[B(\mathbf{x}, \delta) = \{ \mathbf{y} \in \Re^{n} : ||\mathbf{x} - \mathbf{y}||< \delta \}\]
- That is, \(B(\mathbf{x}, \delta)\) is the set of vectors \(\mathbf{y} \in \Re^{n}\) whose distance from \(\mathbf{x}\) (the norm of \(\mathbf{x} - \mathbf{y}\)) is less than \(\delta\)
- So the neighborhood is at most \(\delta\) big
Now suppose \(f:X \rightarrow \Re\) with \(X \subset \Re^{n}\). A vector \(\mathbf{x}^{*} \in X\) is a global maximum if, for all other \(\mathbf{x} \in X\),
\[f(\mathbf{x}^{*}) > f(\mathbf{x} )\]
A vector \(\mathbf{x}^{\text{local}}\) is a local maximum if there is a neighborhood around \(\mathbf{x}^{\text{local}}\), \(Q \subset X\), such that, for all other \(\mathbf{x} \in Q\),
\[f(\mathbf{x}^{\text{local} }) > f(\mathbf{x} )\]
The maximum and minimum values of a function \(f:X \rightarrow \Re\) will be attained somewhere in \(X\). This is the same as we saw previously, except that \(X\) is now a subset of \(n\)-dimensional space, so the candidate points are vectors rather than scalars.
5.4.2 First derivative test: Gradient
Suppose \(f:X \rightarrow \Re^{1}\) with \(X \subset \Re^{n}\) is a differentiable function. Define the gradient vector of \(f\) at \(\mathbf{x}_{0}\), \(\nabla f(\mathbf{x}_{0})\), as
\[\nabla f (\mathbf{x}_{0}) = \left(\frac{\partial f (\mathbf{x}_{0}) }{\partial x_{1} }, \frac{\partial f (\mathbf{x}_{0}) }{\partial x_{2} }, \frac{\partial f (\mathbf{x}_{0}) }{\partial x_{3} }, \ldots, \frac{\partial f (\mathbf{x}_{0}) }{\partial x_{n} } \right)\]
- It is the vector of first partial derivatives of \(f\), one for each variable \(x_{j}\)
- The gradient points in the direction in which the function increases fastest
So if \(\mathbf{a} \in X\) is a local extremum, then,
\[ \begin{aligned} \nabla f(\mathbf{a}) &= \mathbf{0} \\ &= (0, 0, \ldots, 0) \end{aligned} \]
That is, the critical points of \(f\) are the roots of the gradient: the points \(\mathbf{a}\) where \(\nabla f(\mathbf{a})\) equals \(\mathbf{0}\), the zero vector in \(n\)-dimensional space.
Example 5.6 \[ \begin{aligned} f(x,y) &= x^2+y^2 \\ \nabla f(x,y) &= (2x, \, 2y) \end{aligned} \]
Example 5.7 \[ \begin{aligned} f(x,y) &= x^3 y^4 +e^x -\log y \\ \nabla f(x,y) &= (3x^2 y^4 + e^x, \, 4x^3y^3 - \frac{1}{y}) \end{aligned} \]
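A small sketch of the gradient in R using forward finite differences, checked against the analytic gradient from Example 5.6 (the step size and evaluation point are arbitrary choices):

```r
# Numerical gradient by forward differences
num_gradient <- function(f, x, h = 1e-6) {
  sapply(seq_along(x), function(i) {
    xh <- x
    xh[i] <- xh[i] + h
    (f(xh) - f(x)) / h
  })
}

f <- function(x) x[1]^2 + x[2]^2   # Example 5.6
num_gradient(f, c(1, -3))          # approx (2, -6), matching (2x, 2y)
```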
5.4.3 Second derivative test: Hessian
Suppose \(f:X \rightarrow \Re^{1}\) , \(X \subset \Re^{n}\), with \(f\) a twice differentiable function. We will define the Hessian matrix as the matrix of second derivatives at \(\mathbf{x}^{*} \in X\),
\[ \mathbf{H}(f)(\mathbf{x}^{*} ) = \begin{bmatrix} \frac{\partial^{2} f }{\partial x_{1} \partial x_{1} } (\mathbf{x}^{*} ) & \frac{\partial^{2} f }{\partial x_{1} \partial x_{2} } (\mathbf{x}^{*} ) & \ldots & \frac{\partial^{2} f }{\partial x_{1} \partial x_{n} } (\mathbf{x}^{*} ) \\ \frac{\partial^{2} f }{\partial x_{2} \partial x_{1} } (\mathbf{x}^{*} ) & \frac{\partial^{2} f }{\partial x_{2} \partial x_{2} } (\mathbf{x}^{*} ) & \ldots & \frac{\partial^{2} f }{\partial x_{2} \partial x_{n} } (\mathbf{x}^{*} ) \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^{2} f }{\partial x_{n} \partial x_{1} } (\mathbf{x}^{*} ) & \frac{\partial^{2} f }{\partial x_{n} \partial x_{2} } (\mathbf{x}^{*} ) & \ldots & \frac{\partial^{2} f }{\partial x_{n} \partial x_{n} } (\mathbf{x}^{*} ) \\ \end{bmatrix} \]
Hessians are symmetric (provided the second partial derivatives are continuous), and they describe the curvature of the function (how it bends). To calculate the Hessian, differentiate each element of the gradient with respect to each variable \(x_{j}\).
Example 5.8 \[ \begin{aligned} f(x,y) &= x^2+y^2 \\ \nabla f(x,y) &= (2x, \, 2y) \\ \mathbf{H}(f)(x,y) &= \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix} \end{aligned} \]
Example 5.9 \[ \begin{aligned} f(x,y) &= x^3 y^4 +e^x -\log y \\ \nabla f(x,y) &= (3x^2 y^4 + e^x, \, 4x^3y^3 - \frac{1}{y}) \\ \mathbf{H}(f)(x,y) &= \begin{bmatrix} 6xy^4 + e^x & 12x^2y^3 \\ 12x^2y^3 & 12x^3y^2 + \frac{1}{y^2} \end{bmatrix} \end{aligned} \]
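The Hessian can likewise be approximated numerically by differencing a finite-difference gradient. A self-contained sketch, checked against Example 5.8 (step size arbitrary):

```r
# Numerical Hessian of f at x by differencing a forward-difference gradient
num_hessian <- function(f, x, h = 1e-5) {
  g <- function(z) sapply(seq_along(z), function(i) {
    zh <- z; zh[i] <- zh[i] + h
    (f(zh) - f(z)) / h
  })
  n <- length(x)
  H <- matrix(NA_real_, n, n)
  for (j in seq_len(n)) {
    xh <- x
    xh[j] <- xh[j] + h
    H[, j] <- (g(xh) - g(x)) / h   # column j: change in gradient as x_j moves
  }
  H
}

f <- function(x) x[1]^2 + x[2]^2
num_hessian(f, c(1, -3))   # approx [[2, 0], [0, 2]], matching Example 5.8
```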
5.4.3.1 Definiteness of a matrix
Consider \(n \times n\) matrix \(\mathbf{A}\). If, for all \(\mathbf{x} \in \Re^{n}\) where \(\mathbf{x} \neq \mathbf{0}\):
\[ \begin{aligned} \mathbf{x}^{'} \mathbf{A} \mathbf{x} &> 0, \quad \mathbf{A} \text{ is positive definite} \\ \mathbf{x}^{'} \mathbf{A} \mathbf{x} &< 0, \quad \mathbf{A} \text{ is negative definite } \end{aligned} \]
If \(\mathbf{x}^{'} \mathbf{A} \mathbf{x} >0\) for some \(\mathbf{x}\) and \(\mathbf{x}^{'} \mathbf{A} \mathbf{x}<0\) for other \(\mathbf{x}\), then we say \(\mathbf{A}\) is indefinite.
5.4.3.2 Second derivative test
- If \(\mathbf{H}(f)(\mathbf{a})\) is positive definite then \(\mathbf{a}\) is a local minimum
- If \(\mathbf{H}(f)(\mathbf{a})\) is negative definite then \(\mathbf{a}\) is a local maximum
- If \(\mathbf{H}(f)(\mathbf{a})\) is indefinite then \(\mathbf{a}\) is a saddle point
5.4.3.3 Use the determinant to assess definiteness
How do we assess definiteness, given that the condition must hold for every nonzero \(\mathbf{x}\)? For a function of two variables, we can use the determinant of the Hessian of \(f\) at the critical value \(\mathbf{a}\):
\[ \mathbf{H}(f)(\mathbf{a}) = \begin{bmatrix} A & B \\ B & C \\ \end{bmatrix} \]
For this symmetric \(2 \times 2\) matrix the determinant is easy to calculate: \(AC - B^2\).
- \(AC - B^2> 0\) and \(A>0\) \(\leadsto\) positive definite \(\leadsto\) \(\mathbf{a}\) is a local minimum
- \(AC - B^2> 0\) and \(A<0\) \(\leadsto\) negative definite \(\leadsto\) \(\mathbf{a}\) is a local maximum
- \(AC - B^2<0\) \(\leadsto\) indefinite \(\leadsto\) saddle point
- \(AC- B^2 = 0\) inconclusive
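In R we can apply the \(2 \times 2\) determinant rule directly or, more generally, check the signs of the Hessian's eigenvalues (all positive \(\leadsto\) positive definite, all negative \(\leadsto\) negative definite, mixed signs \(\leadsto\) indefinite). A minimal sketch of the eigenvalue check:

```r
# Classify a critical point from its Hessian via eigenvalue signs
classify_critical_point <- function(H) {
  ev <- eigen(H, symmetric = TRUE)$values
  if (all(ev > 0)) {
    "local minimum (positive definite)"
  } else if (all(ev < 0)) {
    "local maximum (negative definite)"
  } else if (any(ev > 0) && any(ev < 0)) {
    "saddle point (indefinite)"
  } else {
    "inconclusive (zero eigenvalue)"
  }
}

classify_critical_point(matrix(c(2, 0, 0, 2), 2, 2))    # local minimum
classify_critical_point(matrix(c(2, 0, 0, -2), 2, 2))   # saddle point
```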
5.5 A simple optimization example
Suppose \(f:\Re^{2} \rightarrow \Re\) with
\[f(x_{1}, x_{2}) = 3(x_1 + 2)^2 + 4(x_{2} + 4)^2 \]
Calculate gradient:
\[ \begin{aligned} \nabla f(\mathbf{x}) &= (6 x_{1} + 12 , 8x_{2} + 32 ) \\ \mathbf{0} &= (6 x_{1}^{*} + 12 , 8x_{2}^{*} + 32 ) \end{aligned} \]
We now solve the system of equations to yield
\[x_{1}^{*} = - 2, \quad x_{2}^{*} = -4\]
\[ \textbf{H}(f)(\mathbf{x}^{*}) = \begin{bmatrix} 6 & 0 \\ 0 & 8 \\ \end{bmatrix} \]
\(\det(\textbf{H}(f)(\mathbf{x}^{*}))\) = 48 and \(6>0\) so \(\textbf{H}(f)(\mathbf{x}^{*})\) is positive definite. \(\mathbf{x^{*}}\) is a local minimum.
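We can confirm this numerically with R's general-purpose optimizer `optim()` (which minimizes by default; the starting value `c(0, 0)` is an arbitrary choice):

```r
f <- function(x) 3 * (x[1] + 2)^2 + 4 * (x[2] + 4)^2

res <- optim(par = c(0, 0), fn = f, method = "BFGS")  # BFGS uses gradient information
res$par     # approx (-2, -4)
res$value   # approx 0
```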
5.6 Maximum likelihood estimation for a normal distribution
Suppose that we draw an independent and identically distributed random sample of \(n\) observations from a normal distribution,
\[ \begin{aligned} Y_{i} &\sim \text{Normal}(\mu, \sigma^2) \\ \mathbf{Y} &= (Y_{1}, Y_{2}, \ldots, Y_{n} ) \end{aligned} \]
Our task:
- Obtain likelihood (summary estimator)
- Derive maximum likelihood estimators for \(\mu\) and \(\sigma^2\)
\[ \begin{aligned} L(\mu, \sigma^2 | \mathbf{Y} ) &\propto \prod_{i=1}^{n} f(Y_{i}|\mu, \sigma^2) \\ &\propto \prod_{i=1}^{n} \frac{\exp\left[ - \frac{ (Y_{i} - \mu)^2 }{2\sigma^2} \right]}{\sqrt{2 \pi \sigma^2}} \\ &\propto \frac{\exp\left[ -\sum_{i=1}^{n} \frac{(Y_{i} - \mu)^2}{2\sigma^2} \right]}{ (2\pi \sigma^{2})^{n/2} } \end{aligned} \]
Taking the logarithm, we have
\[l(\mu, \sigma^2|\mathbf{Y} ) = -\sum_{i=1}^{n} \frac{(Y_{i} - \mu)^2}{2\sigma^2} - \frac{n}{2} \log(2 \pi) - \frac{n}{2} \log (\sigma^2)\]
Let’s find the \(\widehat{\mu}\) and \(\widehat{\sigma}^{2}\) that maximize the log-likelihood.
Dropping the constant \(-\frac{n}{2} \log(2\pi)\), which does not depend on the parameters, and taking partial derivatives:
\[ \begin{aligned} l(\mu, \sigma^2|\mathbf{Y} ) &= -\sum_{i=1}^{n} \frac{(Y_{i} - \mu)^2}{2\sigma^2} - \frac{n}{2} \log (\sigma^2) \\ \frac{\partial l(\mu, \sigma^2|\mathbf{Y} )}{\partial \mu } &= \sum_{i=1}^{n} \frac{2(Y_{i} - \mu)}{2\sigma^2} \\ \frac{\partial l(\mu, \sigma^2|\mathbf{Y})}{\partial \sigma^2} &= -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{n} (Y_{i} - \mu)^2 \end{aligned} \]
Setting the partial derivatives to zero at the maximum,
\[ \begin{aligned} 0 &= \sum_{i=1}^{n} \frac{2(Y_{i} - \widehat{\mu})}{2\widehat{\sigma}^2} \\ 0 &= -\frac{n}{2\widehat{\sigma}^2 } + \frac{1}{2\widehat{\sigma}^4} \sum_{i=1}^{n} (Y_{i} - \widehat{\mu})^2 \end{aligned} \]
Solving for \(\widehat{\mu}\) and \(\widehat{\sigma}^2\) yields,
\[ \begin{aligned} \widehat{\mu} &= \frac{\sum_{i=1}^{n} Y_{i} }{n} \\ \widehat{\sigma}^{2} &= \frac{1}{n} \sum_{i=1}^{n} (Y_{i} - \overline{Y})^2 \end{aligned} \]
\[ \textbf{H}(f)(\widehat{\mu}, \widehat{\sigma}^2) = \begin{bmatrix} \frac{\partial^{2} l(\mu, \sigma^2|\mathbf{Y} )}{\partial \mu^{2}} & \frac{\partial^{2} l(\mu, \sigma^2|\mathbf{Y} )}{\partial \mu \, \partial \sigma^{2}} \\ \frac{\partial^{2} l(\mu, \sigma^2|\mathbf{Y} )}{\partial \sigma^{2} \, \partial \mu} & \frac{\partial^{2} l(\mu, \sigma^2|\mathbf{Y} )}{\partial (\sigma^{2})^{2}} \end{bmatrix} \]
Taking derivatives and evaluating at MLE’s yields,
\[ \textbf{H}(f)(\widehat{\mu}, \widehat{\sigma}^2) = \begin{bmatrix} \frac{-n}{\widehat{\sigma}^2} & 0 \\ 0 & \frac{-n}{2(\widehat{\sigma}^2)^2} \\ \end{bmatrix} \]
- \(\text{det}(\textbf{H}(f)(\widehat{\mu}, \widehat{\sigma}^2)) = \dfrac{n^2}{2(\widehat{\sigma}^2)^3} > 0\) and \(A = \dfrac{-n}{\widehat{\sigma}^2} < 0\), so the Hessian is negative definite \(\leadsto\) \((\widehat{\mu}, \widehat{\sigma}^2)\) is a local maximum
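We can also verify the closed-form MLEs numerically. A hedged sketch: simulate data, minimize the negative log-likelihood with `optim()`, and compare to \(\bar{Y}\) and \(\frac{1}{n}\sum_{i=1}^{n}(Y_i - \bar{Y})^2\). The true parameter values, sample size, and the log-variance parameterization are arbitrary choices for illustration:

```r
set.seed(123)
y <- rnorm(1000, mean = 5, sd = 2)   # simulated sample: mu = 5, sigma^2 = 4

# Negative log-likelihood (constant dropped); log(sigma^2) keeps the variance positive
negll <- function(par, y) {
  mu <- par[1]; sigma2 <- exp(par[2])
  sum((y - mu)^2) / (2 * sigma2) + length(y) / 2 * log(sigma2)
}

fit <- optim(par = c(0, 0), fn = negll, y = y, method = "BFGS")
c(mu_hat = fit$par[1], sigma2_hat = exp(fit$par[2]))
c(mean(y), mean((y - mean(y))^2))    # closed-form MLEs for comparison
```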
5.7 Computational optimization procedures
As the previous example suggests, analytical approaches can be difficult or impossible for many multivariate functions. Computational approaches simplify the problem.
5.7.1 Multivariate Newton-Raphson
Suppose \(f:\Re^{n} \rightarrow \Re\). Suppose we have a guess \(\mathbf{x}_{t}\). Then our update is:
\[\mathbf{x}_{t+1} = \mathbf{x}_{t} - [\textbf{H}(f)(\mathbf{x}_{t})]^{-1} \nabla f(\mathbf{x}_{t})\]
- Approximate the gradient \(\nabla f\) with a tangent (linear) approximation at \(\mathbf{x}_{t}\)
- Find the value \(\mathbf{x}_{t+1}\) that makes that approximation equal to zero
- Update again
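A minimal sketch of the multivariate Newton-Raphson update in R, applied to the example from Section 5.5 with its analytic gradient and Hessian (the starting point and tolerance are arbitrary choices):

```r
f_grad <- function(x) c(6 * x[1] + 12, 8 * x[2] + 32)   # gradient of 3(x1+2)^2 + 4(x2+4)^2
f_hess <- function(x) matrix(c(6, 0, 0, 8), 2, 2)       # Hessian (constant here)

newton_raphson <- function(grad, hess, x0, tol = 1e-8, max_iter = 100) {
  x <- x0
  for (i in seq_len(max_iter)) {
    step <- solve(hess(x), grad(x))   # H^{-1} %*% gradient
    x <- x - step
    if (sqrt(sum(step^2)) < tol) break
  }
  x
}

newton_raphson(f_grad, f_hess, c(10, 10))   # a quadratic converges to (-2, -4) in one step
```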
5.7.2 Grid search
- Example: MLE for a normal distribution
- In R, I drew 10,000 realizations from \(Y_{i} \sim \text{Normal}(0.25, 100)\)
- Used realized values \(y_{i}\) to evaluate \(l(\mu, \sigma^2| \mathbf{y} )\) across a range of values
- Computationally inefficient - have to try a large number of combinations of parameters
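A hedged sketch of that grid search in R: simulate from \(\text{Normal}(0.25, 100)\) (note that `rnorm()` takes the standard deviation, so `sd = 10`), evaluate the log-likelihood over a grid of \((\mu, \sigma^2)\) pairs, and keep the best one. The grid ranges and resolution are arbitrary choices:

```r
set.seed(123)
y <- rnorm(10000, mean = 0.25, sd = 10)   # sd = 10 corresponds to variance 100

loglik <- function(mu, sigma2, y) {
  -sum((y - mu)^2) / (2 * sigma2) - length(y) / 2 * log(sigma2)
}

grid <- expand.grid(mu = seq(-2, 2, by = 0.05), sigma2 = seq(80, 120, by = 0.5))
grid$ll <- mapply(loglik, grid$mu, grid$sigma2, MoreArgs = list(y = y))
grid[which.max(grid$ll), ]   # grid point closest to the MLEs
```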
5.7.3 Gradient descent
Same approach as before, but now the derivative is a vector (the gradient, hence the name “gradient descent”). For example, consider
\[f(x, y) = x^2 + 2y^2\]
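A minimal gradient-descent sketch for \(f(x, y) = x^2 + 2y^2\); the learning rate, starting point, and iteration count are arbitrary choices:

```r
grad_f <- function(x) c(2 * x[1], 4 * x[2])   # gradient of x^2 + 2*y^2

gradient_descent <- function(grad, x0, step_size = 0.1, n_iter = 100) {
  x <- x0
  for (i in seq_len(n_iter)) {
    x <- x - step_size * grad(x)   # move against the gradient
  }
  x
}

gradient_descent(grad_f, c(3, -2))   # converges toward the minimum at (0, 0)
```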