Abstract

In this paper, we hypothesize that gradient-based meta-learning (GBML) implicitly suppresses the Hessian along the optimization trajectory in the inner loop. Based on this hypothesis, we introduce an algorithm called SHOT (Suppressing the Hessian along the Optimization Trajectory) that minimizes the distance between the parameters of the target and reference models to suppress the Hessian in the inner loop. Despite dealing with high-order terms, SHOT does not increase the computational complexity of the baseline model much. It is agnostic to both the algorithm and architecture used in GBML, making it highly versatile and applicable to any GBML baseline. To validate the effectiveness of SHOT, we conduct empirical tests on standard few-shot learning tasks and qualitatively analyze its dynamics. We confirm our hypothesis empirically and demonstrate that SHOT outperforms the corresponding baseline.

Episodic Meta Learning


Meta-learning equips models with the ability to learn from scarce examples, akin to human learning, through episodic sampling, where models tackle a series of small, diverse tasks requiring rapid adaptation. Within each episode, two key processes unfold: the inner loop, where the model fine-tunes its parameters to the specific episode for immediate task performance, and the outer loop, where it generalizes this learning across episodes to enhance future task adaptability by updating its initial learning parameters.
Gradient-Based Meta-Learning (GBML) solves each episode with gradient descent, which means it adapts to each episode within only a few optimization steps.
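To make the two loops concrete, here is a minimal MAML-style sketch of one GBML training run in PyTorch. The architecture, data, and hyperparameters are synthetic placeholders, not the setup used in the paper.

```python
# Minimal MAML-style sketch of GBML: a few-step inner loop per episode,
# and an outer loop that updates the initialization from query-set loss.
import torch
import torch.nn as nn

def inner_loop(model, support_x, support_y, inner_lr=0.4, inner_steps=3):
    """Adapt a copy of the meta-parameters to one episode with a few gradient steps."""
    params = {name: p for name, p in model.named_parameters()}
    for _ in range(inner_steps):
        preds = torch.func.functional_call(model, params, (support_x,))
        loss = nn.functional.cross_entropy(preds, support_y)
        grads = torch.autograd.grad(loss, tuple(params.values()), create_graph=True)
        params = {name: p - inner_lr * g
                  for (name, p), g in zip(params.items(), grads)}
    return params

# Outer loop: meta-parameters are updated so the adapted parameters
# generalize to the query set of each episode.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 5))
meta_opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for episode in range(100):                      # episodic sampling (synthetic data)
    support_x, support_y = torch.randn(25, 32), torch.randint(0, 5, (25,))
    query_x, query_y = torch.randn(75, 32), torch.randint(0, 5, (75,))

    adapted = inner_loop(model, support_x, support_y)
    query_preds = torch.func.functional_call(model, adapted, (query_x,))
    outer_loss = nn.functional.cross_entropy(query_preds, query_y)

    meta_opt.zero_grad()
    outer_loss.backward()                       # backprop through the inner loop
    meta_opt.step()
```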

Problem formulation

At the start of each inner loop in GBML, the model is functionally equivalent to a random initialization point, as it has no prior knowledge of the new episode. This initiates a learning trajectory similar to conventional deep learning, but with a critical distinction: GBML achieves this task-specific adaptation in typically fewer than three optimization steps, in contrast to the potentially countless steps in standard deep learning settings. The constraint of very few optimization steps forces GBML to take very large parameter updates per step. As a result, the Hessian term becomes dominant in the optimization, whereas SGD with small steps can safely ignore it.

Our hypothesis: GBML suppresses the Hessian along the inner loop.
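The decrease condition referred to below can be written, assuming a single inner-loop step of size $\alpha$ and a second-order Taylor expansion of the episode loss $\mathcal{L}$ around the current parameters $\theta$, with $H$ the Hessian at $\theta$ (the paper's exact notation may differ):

$$\mathcal{L}\big(\theta - \alpha \nabla \mathcal{L}(\theta)\big) \approx \mathcal{L}(\theta) - \alpha \|\nabla \mathcal{L}(\theta)\|^2 + \frac{\alpha^2}{2}\, \nabla \mathcal{L}(\theta)^\top H\, \nabla \mathcal{L}(\theta),$$

so the loss decreases whenever

$$\alpha \|\nabla \mathcal{L}(\theta)\|^2 > \frac{\alpha^2}{2}\, \nabla \mathcal{L}(\theta)^\top H\, \nabla \mathcal{L}(\theta).$$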

The equation above is the condition under which the loss decreases. Because SGD takes small steps and does not account for the Hessian, the Hessian term behaves like noise in this condition. In GBML, however, the inner loop moves far in only a few steps, so the Hessian term is dominant. From this perspective, we conclude that every successful GBML model should keep the Hessian small along the inner loop, and we hypothesize that this is precisely what the outer loop implicitly does.

SHOT: Suppressing the Hessian along the Optimization Trajectory

Our goal then naturally becomes to explicitly suppress the Hessian along the inner loop. However, suppressing the Hessian directly is unattractive because it is computationally expensive. We resolve this by observing that it is the Hessian along the optimization trajectory that matters, not the full Hessian: we only need its effect in the direction of the optimization trajectory, which can be captured with first-order quantities. To do so, we provide an ad-hoc measure. From the perspective of gradient flow, a model is less influenced by the Hessian if it takes more gradient steps with a smaller learning rate. Therefore, minimizing the distance between the more-Hessian-influenced model and the less-Hessian-influenced model serves as a measure of the distortion caused by the Hessian. We call this measure SHOT, and it reduces to a simple distance between the parameters of the target and reference models.
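As a sketch (our notation, not necessarily the paper's): let the target model take $k$ large steps of size $\alpha$ and the reference model take $m > k$ smaller steps of size $\alpha k / m$ from the same initialization $\theta$; the measure is then

$$\theta^{\text{tgt}} = \text{InnerLoop}(\theta;\, k \text{ steps},\, \text{lr } \alpha), \qquad \theta^{\text{ref}} = \text{InnerLoop}(\theta;\, m \text{ steps},\, \text{lr } \alpha k / m),$$

$$\mathcal{L}_{\text{SHOT}}(\theta) = d\big(\theta^{\text{tgt}},\, \theta^{\text{ref}}\big),$$

where $d$ is any distance between parameter vectors (e.g., squared Euclidean distance).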
The detailed pseudo-code is as follows.
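Below is a minimal sketch of how such an objective could be added on top of the MAML-style episode above (it reuses `inner_loop` and the episode tensors from that sketch). The names `shot_weight` and `ref_multiplier`, and the choice of squared-L2 distance, are illustrative assumptions rather than the paper's exact implementation.

```python
# Sketch of SHOT: penalize the distance between a target model (few big steps)
# and a reference model (more, smaller steps with the same total step budget).

def shot_loss(model, support_x, support_y,
              inner_lr=0.4, inner_steps=3, ref_multiplier=5):
    """Distance between target and reference adapted parameters."""
    # Target: the usual inner loop (more influenced by the Hessian).
    target = inner_loop(model, support_x, support_y,
                        inner_lr=inner_lr, inner_steps=inner_steps)
    # Reference: the same trajectory budget split into more, smaller steps
    # (closer to gradient flow, hence less influenced by the Hessian).
    ref = inner_loop(model, support_x, support_y,
                     inner_lr=inner_lr / ref_multiplier,
                     inner_steps=inner_steps * ref_multiplier)
    return sum(((target[name] - ref[name]) ** 2).sum() for name in target)

# Outer-loop objective: ordinary query loss plus the SHOT penalty.
shot_weight = 0.1                                # illustrative coefficient
adapted = inner_loop(model, support_x, support_y)
query_preds = torch.func.functional_call(model, adapted, (query_x,))
outer_loss = nn.functional.cross_entropy(query_preds, query_y) \
             + shot_weight * shot_loss(model, support_x, support_y)
```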

Results

The following results show that SHOT behaves as expected.
We first trained the model with only the SHOT loss, i.e., without any target loss that aims to achieve meta-learning ability. As shown, the randomly initialized model fails to learn the task and is no better than random guessing, whereas the SHOT-initialized model performs better than random guessing. The loss landscape explains why: when the model is initialized with SHOT, the loss surface along the gradient direction is flat, so the loss is guaranteed to decrease along the gradient direction.
The table above shows that SHOT boosts performance across various benchmarks, both in-domain and cross-domain. Moreover, any distance measure between the target and reference models can be adopted.
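For example, the distance $d$ could be swapped out for other measures. A hypothetical helper (the specific choices below are illustrative, not the exact set evaluated in the paper):

```python
# Illustrative alternative distance measures between adapted parameter sets.
def param_distance(target, ref, measure="l2"):
    t = torch.cat([p.flatten() for p in target.values()])
    r = torch.cat([p.flatten() for p in ref.values()])
    if measure == "l2":
        return ((t - r) ** 2).sum()
    if measure == "cosine":
        return 1.0 - nn.functional.cosine_similarity(t, r, dim=0)
    raise ValueError(f"unknown measure: {measure}")
```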

Citation

Acknowledgements

This work was supported by an NRF grant (2021R1A2C3006659) and IITP grants (2021-0-01343, 2022-0-00953), all funded by the Korean Government (MSIT).