An optimizer faster than 'Adam'? by Stanford University

Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training

arxiv.org

This is the source of the paper.

It feels a little early to go through the whole paper, so for now I will only look at the abstract and the conclusion.

Abstract:

Given the massive cost of language model pre-training, a non-trivial improvement of the optimization algorithm would lead to a material reduction on the time and cost of training. Adam and its variants have been state-of-the-art for years, and more sophisticated second-order (Hessian-based) optimizers often incur too much per-step overhead. In this paper, we propose Sophia, Second-order Clipped Stochastic Optimization, a simple scalable second-order optimizer that uses a light-weight estimate of the diagonal Hessian as the pre-conditioner. The update is the moving average of the gradients divided by the moving average of the estimated Hessian, followed by element-wise clipping. The clipping controls the worst-case update size and tames the negative impact of non-convexity and rapid change of Hessian along the trajectory. Sophia only estimates the diagonal Hessian every handful of iterations, which has negligible average per-step time and memory overhead. On language modeling with GPT-2 models of sizes ranging from 125M to 770M, Sophia achieves a 2x speed-up compared with Adam in the number of steps, total compute, and wall-clock time. Theoretically, we show that Sophia adapts to the curvature in different components of the parameters, which can be highly heterogeneous for language modeling tasks. Our run-time bound does not depend on the condition number of the loss.

Given the massive cost of language model pre-training, a non-trivial improvement of the optimization algorithm would lead to a material reduction on the time and cost of training.

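Pre-training a language model is enormously expensive, so a meaningful improvement to the optimization algorithm translates directly into a substantial saving in training time and cost.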

Adam and its variants have been state-of-the-art for years, and more sophisticated second-order (Hessian-based) optimizers often incur too much per-step overhead.

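Adam and its variants have been the state of the art for years, while more sophisticated second-order (Hessian-based) optimizers often incur too much overhead per step. (Overhead here means the extra, indirect processing time needed to carry out a task.)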

In this paper, we propose Sophia, Second-order Clipped Stochastic Optimization, a simple scalable second-order optimizer that uses a light-weight estimate of the diagonal Hessian as the pre-conditioner.

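In this paper the authors propose an optimizer called Sophia (Second-order Clipped Stochastic Optimization). Sophia is a simple, scalable second-order optimizer that uses a lightweight estimate of the diagonal of the Hessian as its pre-conditioner, that is, as the per-parameter scaling applied to the update.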

The update is the moving average of the gradients divided by the moving average of the estimated Hessian, followed by element-wise clipping.

The update is the moving average of the gradients divided by the moving average of the estimated Hessian, and the result is then clipped element-wise.
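As a rough illustration of that sentence, here is a minimal NumPy sketch of one such update step. It is based only on the abstract's description and is not the authors' implementation: the function name `sophia_like_step`, the hyperparameter names (`lr`, `beta1`, `beta2`, `rho`, `eps`) and all default values are assumptions made for this example.

```python
import numpy as np

def sophia_like_step(theta, grad, hess_diag_est, m, h,
                     lr=0.1, beta1=0.9, beta2=0.99, rho=1.0, eps=1e-12):
    """One simplified update step; names and defaults are illustrative assumptions."""
    m = beta1 * m + (1.0 - beta1) * grad            # moving average of the gradients
    h = beta2 * h + (1.0 - beta2) * hess_diag_est   # moving average of the diagonal Hessian estimates
    ratio = m / np.maximum(h, eps)                  # element-wise pre-conditioning
    update = np.clip(ratio, -rho, rho)              # element-wise clipping
    return theta - lr * update, m, h

# Usage: two parameters, one with a small gradient and one with a very large gradient.
theta, m, h = sophia_like_step(
    theta=np.zeros(2),
    grad=np.array([1.0, -100.0]),
    hess_diag_est=np.ones(2),
    m=np.zeros(2),
    h=np.zeros(2),
)
```

In this usage both ratios exceed the clipping threshold, so both parameters move by exactly `lr * rho` in opposite directions, regardless of how large the raw gradient was.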

The clipping controls the worst-case update size and tames the negative impact of non-convexity and rapid change of Hessian along the trajectory.

The clipping bounds the worst-case update size and tames the negative effects of non-convexity and of a Hessian that changes rapidly along the optimization trajectory.
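To see why clipping matters, consider what happens when the curvature estimate is close to zero or even negative, which can happen on a non-convex loss. The sketch below is a hypothetical illustration: `preconditioned_update`, `rho` and `eps` are names chosen for this example, and flooring the curvature at `eps` is an assumption about how one might guard the division.

```python
import numpy as np

def preconditioned_update(m, h, rho=1.0, eps=1e-12, clip=True):
    """Gradient average divided by curvature estimate, optionally clipped element-wise."""
    ratio = m / np.maximum(h, eps)   # non-positive curvature is floored at eps (assumption)
    return np.clip(ratio, -rho, rho) if clip else ratio

m = np.array([0.5, 0.5, -0.5])
h = np.array([2.0, 1e-6, -3.0])     # healthy, nearly flat, and negative curvature

raw = preconditioned_update(m, h, clip=False)      # second and third entries blow up
clipped = preconditioned_update(m, h, clip=True)   # every entry stays within [-rho, rho]
```

Without clipping, the nearly flat and negative-curvature coordinates would take enormous steps. With clipping, no coordinate can move by more than the learning rate times `rho` in a single step, and the well-behaved first coordinate is left unchanged.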

Sophia only estimates the diagonal Hessian every handful of iterations, which has negligible average per-step time and memory overhead.

Sophia estimates the diagonal Hessian only once every few iterations, not at every step, so the average per-step time and memory overhead is negligible.
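The abstract does not say how the diagonal Hessian is estimated. One standard way to do it cheaply, assumed here purely for illustration, is a Hutchinson-style estimator: average `u * (H @ u)` over random vectors `u`, which needs only Hessian-vector products and never the full Hessian. The toy matrix `A`, the function name `hutchinson_diag`, and the refresh interval `k = 10` are all hypothetical.

```python
import numpy as np

def hutchinson_diag(hvp, dim, rng, n_samples=1):
    """Estimate diag(H) as the average of u * (H @ u) over random vectors u."""
    est = np.zeros(dim)
    for _ in range(n_samples):
        u = rng.standard_normal(dim)   # random probe vector, u ~ N(0, I)
        est += u * hvp(u)              # only a Hessian-vector product is needed
    return est / n_samples

# Toy quadratic loss 0.5 * x^T A x, whose Hessian is exactly A.
A = np.array([[4.0, 1.0],
              [1.0, 0.5]])
rng = np.random.default_rng(0)
diag_estimate = hutchinson_diag(lambda v: A @ v, dim=2, rng=rng, n_samples=20000)

# Refresh schedule: re-estimate the curvature only on every k-th step.
k, num_steps = 10, 100
refresh_steps = [t for t in range(num_steps) if t % k == 0]
```

With many samples the estimate approaches the true diagonal of `A`. In a training loop one would use far fewer samples and refresh only on the steps in `refresh_steps`, so the extra cost is paid on a small fraction of the steps.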

On language modeling with GPT-2 models of sizes ranging from 125M to 770M, Sophia achieves a 2x speed-up compared with Adam in the number of steps, total compute, and wall-clock time.

For language modeling with GPT-2 models ranging from 125M to 770M parameters, Sophia is 2x faster than Adam in terms of number of steps, total compute, and wall-clock time.

Theoretically, we show that Sophia adapts to the curvature in different components of the parameters, which can be highly heterogeneous for language modeling tasks.

Theoretically, the authors show that Sophia adapts to the curvature in the different components of the parameters, which can be highly heterogeneous in language modeling tasks.

Our run-time bound does not depend on the condition number of the loss.

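The run-time bound does not depend on the condition number of the loss (roughly, the ratio between the largest and the smallest curvature of the loss).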
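The intuition behind this claim can be shown on a toy problem. The sketch below is not the paper's analysis: it assumes a two-parameter quadratic loss `0.5 * sum(a * x**2)` with the exact diagonal Hessian `a` available, and compares plain gradient descent with a clipped, Hessian-preconditioned step. All function names, the step sizes and the tolerance are assumptions for this example.

```python
import numpy as np

def steps_to_converge(curvatures, step_fn, tol=1e-6, max_steps=100_000):
    """Count steps until every coordinate is within tol of the minimum at 0."""
    x = np.ones_like(curvatures)
    for t in range(max_steps):
        if np.max(np.abs(x)) < tol:
            return t
        x = step_fn(x, curvatures)
    return max_steps

def gd_step(x, a):
    lr = 1.0 / np.max(a)        # step size is limited by the steepest direction
    return x - lr * (a * x)     # the gradient of 0.5 * sum(a * x**2) is a * x

def clipped_preconditioned_step(x, a, lr=0.5, rho=1.0):
    grad = a * x
    return x - lr * np.clip(grad / a, -rho, rho)   # divide by the exact diagonal Hessian

results = {}
for kappa in (10.0, 1000.0):
    a = np.array([kappa, 1.0])  # the condition number of this loss is kappa
    results[kappa] = (steps_to_converge(a, gd_step),
                      steps_to_converge(a, clipped_preconditioned_step))
    print(kappa, results[kappa])
```

In this toy setting plain gradient descent needs many more steps as the condition number grows, because the flat direction is forced to use the small step size dictated by the steep one. The preconditioned step takes the same number of steps for both losses, since dividing by the curvature puts every coordinate on the same scale.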

Conclusion:

We introduced Sophia, a scalable second-order optimizer for language model pre-training. Sophia converges in fewer steps than first-order adaptive methods, while maintaining almost the same per-step cost. On language modeling with GPT-2, Sophia achieves a 2x speed-up compared with AdamW in the number of steps, total compute, and wall-clock time.

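The authors introduced Sophia, a scalable second-order optimizer for language model pre-training. Sophia converges in fewer steps than first-order adaptive methods while keeping almost the same per-step cost. For language modeling with GPT-2, Sophia is 2x faster than AdamW in terms of number of steps, total compute, and wall-clock time.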

So, in short, they have built Sophia, an optimizer that makes pre-training faster than Adam does!

Background:

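LLM (large language model): an artificial intelligence system that can process vast amounts of natural-language data and generate responses that are often indistinguishable from human-written text.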

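Sophia approximates the Hessian matrix and imposes a larger penalty on directions in which the loss rises steeply, so training proceeds faster than with the existing Adam optimizer. In addition, with Adam the parameters are updated only slowly along directions where the loss surface is flat, whereas Sophia is said to decrease the loss at a uniform rate across all dimensions.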

By 모두의 연구실

That's all.
