January 7, 2026 · 20 min read · JeTech Lab

FOIL: Time-Series Forecasting for Out-of-Distribution Generalization Using Invariant Learning

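This post summarizes the key ideas of the FOIL paper and walks through how the method was applied to several time-series forecasting models, with implementation details.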

FOIL

Out-of-Distribution

Invariant Learning

Time Series Forecasting

FOIL Paper Overview

FOIL (Time-Series Forecasting for Out-of-Distribution Generalization Using Invariant Learning) is a framework for addressing out-of-distribution (OOD) generalization in time-series forecasting.

Existing time-series forecasting models can degrade sharply when the training and test distributions differ. FOIL applies the principle of invariant learning to learn representations that support stable predictions regardless of environment.

FOIL Architecture Overview

FOIL consists of three main components:

1. Label Decomposing Component (CLD): Decomposes the target Y into a sufficiently predictable part (Y_suf) and an unpredictable part driven by unobserved variables. This isolates the deterministic portion that can be predicted from the input features.

2. Time-Series Environment Inference Module (MTEI): Infers temporal environments from the representation learned by the Time-Series Invariant Learning Module (MTIL). It uses a multi-head network to infer environments while preserving the temporal adjacency structure of the data.

3. Time-Series Invariant Learning Module (MTIL): Learns invariant representations across the environments inferred by MTEI. By capturing features whose relationship to the target stays stable across environments, it improves the model's robustness to distribution shift.

These components are jointly optimized with an alternating update strategy, with CLD acting as a preliminary step for MTIL and MTEI. At test time, only MTIL is used for prediction.

Key Concepts

1. Out-of-Distribution (OOD) Problem

1.1 Time-series Distribution Shift

The OOD problem in time-series data is tied to the following properties:

Stationary: Data whose statistical properties, such as mean and variance, do not change over time. Stationarity requires that there be no trend or seasonality.
Non-stationary: Most real-world data are not stationary because of variability such as seasonality, concept drift, and change points.
Temporal distribution shift: Time-series data are collected sequentially at fixed intervals, so their distribution can change as time passes, which makes them non-stationary.

Distribution shift in time series arises in two scenarios:

1. Distribution shift that occurs continuously over time: related to concept drift and non-stationary processes.
2. Distribution shift between training data and test data: the training and test distributions differ.

Most deep-learning approaches perform well on stationary time-series data but are weaker on data that exhibit various distribution shifts.
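
The second scenario can be made concrete by comparing summary statistics of an early window and a late window of the same series. The sketch below is only an illustration: the series values and the `window_stats` helper are hypothetical and not part of FOIL.

```python
from statistics import mean, pstdev

def window_stats(series, window):
    """Mean and standard deviation of consecutive non-overlapping windows."""
    return [
        (mean(series[i:i + window]), pstdev(series[i:i + window]))
        for i in range(0, len(series) - window + 1, window)
    ]

# Hypothetical series: a flat segment followed by a level shift (change point).
series = [10.0, 11.0, 9.0, 10.0, 20.0, 21.0, 19.0, 20.0]

# Treat the first window as "training" data and the second as "test" data.
train_stats, test_stats = window_stats(series, window=4)
print("train (mean, std):", train_stats)
print("test  (mean, std):", test_stats)
```

The two windows have the same spread but different means, which is the kind of train/test mismatch that breaks a model fitted only to the early window.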

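1.2 Definition of Out-of-Distribution Generalization

Definition: The ability to maintain performance on a test distribution $P_{test}(X, Y)$ that differs from the training distribution $P_{train}(X, Y)$.

Conventional data-driven methods assume that the training and test datasets are sampled independently from the same distribution (independent and identically distributed, i.i.d.).
When the i.i.d. assumption breaks, a model trained with Empirical Risk Minimization (ERM) may suffer degraded predictive performance.
Empirical Risk Minimization: Minimizing the average loss computed over a finite training set, used in place of the expected loss over the full distribution, which cannot be computed.

Environment (domain): Beyond the train/test split, this refers to the context in which data are collected. Each environment has its own distribution, and a collection of such environments is described as heterogeneous environments.

Spurious correlation: X and Y appear statistically correlated but have no causal relationship. In the OOD setting, the goal of domain generalization is to learn the invariant (causal) features shared by all environments, so that performance stays consistent on data from environments that were not seen during training.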

2. Invariant Learning

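Definition: A method that achieves OOD generalization by learning prediction rules that do not change (are invariant) across multiple environments (domains).

The training objective minimizes the empirical risk in the worst-performing environment rather than the risk over all environments.
Whereas ERM minimizes the risk averaged over all environments, invariant learning (IL) minimizes the risk in the worst environment. The aim is to learn rules that work in every environment and therefore carry over to unseen ones.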
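
The difference between the two objectives can be illustrated with a few numbers. This is a minimal sketch, assuming per-sample losses have already been computed and grouped by environment; the environment names and loss values are hypothetical.

```python
from statistics import mean

def erm_risk(env_losses):
    """Average risk over all samples pooled across environments (ERM view)."""
    pooled = [loss for losses in env_losses.values() for loss in losses]
    return mean(pooled)

def worst_env_risk(env_losses):
    """Largest per-environment average risk (worst-case view)."""
    return max(mean(losses) for losses in env_losses.values())

# Hypothetical per-sample losses, grouped by environment.
env_losses = {
    "summer": [0.10, 0.12, 0.08, 0.10],
    "winter": [0.50, 0.70],
}

print("ERM risk:      ", erm_risk(env_losses))        # pooled average
print("worst-env risk:", worst_env_risk(env_losses))  # dominated by "winter"
```

Because "summer" contributes more samples, the pooled average understates how poorly the model does in "winter"; the worst-environment objective keeps that gap visible.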

2.1 Conventional Assumption

The input features $X$ are assumed to be a mixture of invariant features $X_I$ and variant features $X_S$.

Sufficiency property: $$Y = g(X_I) + \epsilon$$

Here $g(\cdot)$ is an arbitrary mapping function and $\epsilon$ is random noise. In other words, $X_I$ consistently contains enough information to predict Y in every environment.

Invariance property: $$E[Y|X_I, e] = E[Y|X_I]$$

The equation above holds for every environment $e$.

If these assumptions hold, $X_I$ has sufficient and invariant predictive power for $Y$, and a predictor built on it can achieve optimal OOD performance.

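3. Problem Definition and Challenges in FOIL

3.1 Out-of-Distribution in Time-series Forecasting (OOD-TSF)

Time-series data are inherently dynamic and complex, which creates several difficulties for data-driven modeling:

1. The distribution of time-series data changes dynamically over time.
2. The inherent complexity of time-series forecasting can stem from unobserved exogenous factors.

Existing TSF models apply ERM: they greedily absorb every correlation in the data to minimize the average training error. Not every correlation persists under the new distribution at test time, however, so models trained this way may lack OOD generalization ability.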

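3.2 Challenges in Applying Invariant Learning to TSF

Challenge 1: Unobserved Variables

In time-series data, there are almost always variables that directly affect the target but are not observed.

Example:

Observed variables: temperature, humidity, day of week, holiday indicator
Unobserved variables (Z): solar irradiance, government policy (such as energy-saving campaigns), events (such as festivals)
Target variable: electricity generation

The presence of unobserved $Z$ violates the conventional assumptions of IL:

1. Violation of the sufficiency property:

$X_I$ should contain enough information to predict $Y$, but $Z$ is also needed and cannot be observed.
The model over-interprets the observable $X_I$ and overfits.

2. Violation of the invariance property:

If $Z$ depends on the environment, so that $P(Z|e)$ differs across environments, then $P(Y|X_I, e) \neq P(Y|X_I)$.
Even if the relationship between $X_I$ and $Y$ is itself invariant, marginalizing over an environment-dependent $Z$ introduces environment dependence. Assuming $Y$ depends on $e$ only through $Z$, $$P(Y|X_I, e) = \int P(Y|X_I, Z)\, P(Z|X_I, e)\, dZ$$ so the left-hand side varies with $e$ whenever the distribution of $Z$ does.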
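
A toy calculation shows how an environment-dependent $Z$ breaks the invariance property even when the mechanism $g$ is shared. Everything here is an assumption made for illustration: the additive form $Y = g(X_I) + Z$, the function `g`, and the values of $Z$ per environment.

```python
from statistics import mean

def g(x_i):
    """Invariant mechanism shared by every environment (hypothetical)."""
    return 2.0 * x_i

# Hypothetical unobserved variable Z: its distribution differs by environment.
z_by_env = {
    "weekday": [0.0, 0.0, 1.0, -1.0],   # mean 0
    "festival": [3.0, 4.0, 5.0, 4.0],   # mean 4
}

def conditional_mean_y(x_i, env):
    """E[Y | X_I = x_i, e] when Y = g(X_I) + Z and Z depends on e."""
    return mean(g(x_i) + z for z in z_by_env[env])

x_i = 10.0
per_env = {env: conditional_mean_y(x_i, env) for env in z_by_env}
print(per_env)
```

The same $X_I$ yields different conditional means in the two environments, so $E[Y|X_I, e] = E[Y|X_I]$ fails although `g` never changes.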

Challenge 2: No Explicit Environment Labels

Time-series data are generally collected without explicit environment labels:

Most TSF datasets do not provide environment labels.
Because temporal environments are complex, annotating them explicitly is difficult, and labels that are provided may not be optimal.
Most IL methods require explicit environment labels and cannot be trained without them.

3.3 FOIL's Approach

FOIL combines IL with environment inference and sets the sufficiently predictable target ($Y_{suf}$) as the prediction goal:

$Y$ is assumed to decompose into a part that is deterministic given the input $X$ and an uncertain part.
Predicting $Y$ directly from $X_I$ runs into errors caused by $Z$, so FOIL assumes there exists a $Y_{suf}$ that can be predicted deterministically from $X_I$ alone.
With $Y_{suf}$ as the target, the sufficiency and invariance assumptions hold, which makes IL applicable.

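4. Main Components of FOIL

FOIL's official architecture consists of the three components described above (CLD, MTEI, MTIL). For this project, they were modularized as follows for a practical implementation.

FOIL training proceeds as follows:

1. Stage 1 (Pre-training): Train the backbone model in the usual way.
2. Stage 2 (MTEI): Freeze the trained backbone model, then train the environment regressors $\rho^{(e)}(\cdot)$. Afterward, iterate over the training dataset once more and assign an environment label to each time step, to be used as meta-information in Stage 3.
3. Stage 3 (MTIL): Retrain the Stage 1 model using the environment labels stored during the MTEI stage.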

4.1 Label Decomposing Component (CLD)

A note on the abbreviation: the paper writes "Label Decomposing Component" as 𝓒LD. It does not spell out what each letter stands for; the symbol is simply used as shorthand for the component.

> "Overall Framework. As shown in Figure 2, FOIL consists of three parts: (1) Label Decomposing Component (𝓒LD), which decomposes sufficiently predictable 𝒀^suf from observed 𝒀." (paper, Section 4.1)

> "𝓒LD is used to decompose the sufficiently predictable 𝒀^suf from the observed 𝒀." (paper, Section 4.2)

The purpose of CLD is to approximate the sufficiently predictable $Y_{suf}$ from $Y$, but obtaining $Y_{suf}$ exactly is impossible: the generation function that produces the time series is unknown, and the unobserved variables $Z$ also affect $Y$.

In practice, the method therefore proposes a surrogate loss that reduces the influence of $Z$ when the loss on $Y$ is computed. In other words, rather than computing $Y_{suf}$ directly, the model learns $Y_{suf}$ indirectly through a loss from which the effect of $Z$ has been removed.

Mathematical modeling (paper, Equation 2):

The paper makes the following assumption:

$$\bm{Y} = q(\bm{Y}^{\text{suf}}, \bm{Z}) = \alpha(\bm{Z})(\bm{Y}^{\text{suf}}) + \beta(\bm{Z})\mathbf{1}$$

> "where α(·): ℝ^d_Z → ℝ and β(·): ℝ^d_Z → ℝ could be any mapping function, and 𝟏 ∈ ℝ^{h×d_out} is an all-one matrix." (paper, Section 4.2)

Here:

$\alpha(\bm{Z})$: the rate at which $Z$ scales $Y_{suf}$ (multiplicative effect)
$\beta(\bm{Z})$: the bias term induced by $Z$ (additive effect)
$\alpha, \beta$: arbitrary mapping functions

Many real-world relationships can be described as a combination of these two effects (for example, a simple linear equation or the Phillips curve).
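
The affine form matters because a positive scale and a shift can both be removed by standardizing each window (subtracting its mean and dividing by its standard deviation). The sketch below illustrates this under stated assumptions: $\alpha > 0$, a population standard deviation, a small `eps` for numerical safety, and hypothetical values for `y_suf`, `alpha`, and `beta`. It is not the paper's code.

```python
from statistics import mean, pstdev

def instance_normalize(series, eps=1e-8):
    """Subtract the mean and divide by the standard deviation of one window."""
    mu = mean(series)
    sigma = pstdev(series) + eps
    return [(v - mu) / sigma for v in series]

def normalized_mse(y_hat, y):
    """MSE between the separately normalized prediction and target windows."""
    y_hat_n = instance_normalize(y_hat)
    y_n = instance_normalize(y)
    return mean((a - b) ** 2 for a, b in zip(y_n, y_hat_n))

y_suf = [1.0, 2.0, 4.0, 3.0]               # hypothetical predictable part
alpha, beta = 2.5, -7.0                     # hypothetical effect of Z (alpha > 0)
y_obs = [alpha * v + beta for v in y_suf]   # observed target in the Eq. 2 form

loss_right_shape = normalized_mse(y_suf, y_obs)
loss_wrong_shape = normalized_mse([4.0, 3.0, 1.0, 2.0], y_obs)
print(loss_right_shape)   # close to 0: scale and shift are removed
print(loss_wrong_shape)   # much larger: the shape of the window is wrong
```

A prediction that matches $Y_{suf}$ is not penalized for missing the scale and shift induced by $Z$, while a prediction with the wrong shape still is.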

Instance Residual Normalization (IRN) (paper, Equation 3):

The IRN formula given in the paper:

$$\tilde{\mathbf{Res}}_t = \frac{\mathbf{Y}_t - \mu\left(\mathbf{Y}_t\right)}{\sigma(\mathbf{Y}_t)} - \frac{\hat{\mathbf{Y}}_t - \mu\left(\hat{\mathbf{Y}}_t\right)}{\sigma(\hat{\mathbf{Y}}_t)} = \tilde{\mathbf{Y}}_t - \tilde{\hat{\mathbf{Y}}}_t$$

> "IRN in Eq. 3 ensures the residuals to have a mean of 0 and a variance of 2−2cov(Ŷ, Y), where cov denotes the covariance." (paper, Section 4.2)

Surrogate Loss (paper, Equation 4):

$$\ell_{\text{suf}}(\hat{\bm{Y}}, \bm{Y}) = \text{MSE}(\tilde{\bm{Res}}, \bm{0}) = \ell(\tilde{\hat{\bm{Y}}}, \tilde{\bm{Y}})$$

> "where MSE(Res~, 0) = (1/h)∑_{j=1}^h (Res~_{t+j})^2. Note that our IRN fundamentally differs from the existing instance normalization (IN) methods." (paper, Section 4.2)

4.2 Time-Series Environment Inference Module (MTEI)

> "𝓜TEI aims to infer environments 𝑬_infer, thereby providing environment labels for the time-series invariant learning module 𝓜TIL." (paper, Section 4.3)

MTEI aims to infer the environments $E_{infer}$, for which no labels are given, and then supplies them as environment labels to the Time-Series Invariant Learning Module (MTIL) to support effective training.

The procedure is split into training the environment-specific regressors (M-step) and assigning environment labels (E-step).

M-step: Optimizing Environment-Specific Regressors (paper, Equation 6):

$$\min_{\{\rho^{(e)}\}} \bm{\mathcal{L}}_{\text{TEI}} = \mathbb{E}_{e \in \bm{E}_{\text{infer}}} \bm{\mathcal{R}}^{(e)}_{\text{suf}}(\rho^{(e)}, \phi^*)$$

$$= \sum_{e \in \bm{E}_{\text{infer}}} \frac{1}{|\mathcal{D}_e|} \sum_{(\mathbf{X}, \mathbf{Y}) \in \mathcal{D}_e} \ell_{\text{suf}}\left(\rho^{(e)}\left(\phi^*(\mathbf{X})\right), \mathbf{Y}\right)$$

> "In the M step, we optimize {ρ^(e)} to better fit the data from current environment partition 𝑬_infer of E step" (paper, Section 4.3)

E-step: Estimating Environment Labels (paper, Equations 7 and 8):

Step 1 - Environment reallocation (paper, Equation 7): $$\bm{E}_{\text{infer}}(t) \leftarrow \arg \min_{e \in \bm{E}_{\text{infer}}} \left\{\ell_{\text{suf}}\left(\rho^{(e)}\left(\phi^*(\mathbf{X}_t)\right), \mathbf{Y}_t\right)\right\}$$

> "Reallocating based on the distances with the center of each cluster (environment). We use the loss with respect to regressor ρ^(e) to describe the distance with the center of cluster e." (paper, Section 4.3)

Step 2 - Label Propagation (paper, Equation 8): $$\bm{E}_{\text{infer}}(t) \leftarrow \text{mode}\left\{\bm{E}_{\text{infer}}(t+j)\right\}_{j=-r}^{r}$$

> "where mode implements majority voting by considered temporal neighbors selected via the radius r ∈ ℤ^+." (paper, Section 4.3)

This allows environments to be inferred while preserving temporal adjacency.
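
The two E-step operations can be sketched without any deep-learning framework. This is an illustrative sketch, assuming the per-regressor losses are already available as plain lists; the loss values are hypothetical, and ties in the vote are broken arbitrarily here (by first occurrence), which the paper does not specify.

```python
from collections import Counter

def reassign(losses_per_env):
    """Eq. 7 style step: give each time step the environment with the lowest loss.

    losses_per_env[e][t] is the loss of regressor e on the sample at time t.
    """
    num_steps = len(losses_per_env[0])
    return [
        min(range(len(losses_per_env)), key=lambda e: losses_per_env[e][t])
        for t in range(num_steps)
    ]

def propagate(labels, radius):
    """Eq. 8 style step: majority vote over neighbors within `radius`."""
    smoothed = []
    for t in range(len(labels)):
        window = labels[max(0, t - radius): t + radius + 1]
        smoothed.append(Counter(window).most_common(1)[0][0])
    return smoothed

# Hypothetical losses for two regressors over six time steps.
losses = [
    [0.1, 0.2, 0.9, 0.1, 0.8, 0.9],  # regressor 0
    [0.5, 0.6, 0.3, 0.7, 0.2, 0.1],  # regressor 1
]
raw = reassign(losses)             # [0, 0, 1, 0, 1, 1]
smooth = propagate(raw, radius=1)  # [0, 0, 0, 1, 1, 1]
print(raw, smooth)
```

The raw argmin labels flip back and forth between neighbors, while the voted labels form two contiguous segments, which is the temporal-adjacency effect the label propagation step is meant to provide.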

4.3 Invariant Representation Extractor (MTIL Implementation)

This module extracts a representation that is stable regardless of environment. It corresponds to MTIL (Time-Series Invariant Learning Module) in the FOIL paper.

import torch.nn as nn

class InvariantRepresentation(nn.Module):
    """
    Invariant Feature Extractor
    Extracts a representation that is stable regardless of environment
    """
    def __init__(self, input_dim, hidden_dim=64, num_layers=2):
        super().__init__()
        layers = []
        dim = input_dim
        # Stack Linear -> ReLU -> Dropout blocks; the first block maps input_dim to hidden_dim
        for _ in range(num_layers):
            layers.extend([
                nn.Linear(dim, hidden_dim),
                nn.ReLU(),
                nn.Dropout(0.1)
            ])
            dim = hidden_dim
        self.network = nn.Sequential(*layers)

    def forward(self, x):
        """
        Args:
            x: (B, L, input_dim) input time series
        Returns:
            features: (B, L, hidden_dim) invariant representation
        """
        return self.network(x)

This module maps the input time series into a feature space intended to be independent of environment.

4.4 Time-Series Invariant Learning Module (MTIL)

> "𝓜TIL is used to learn invariant representations φ*(𝑿) across inferred environments 𝑬*_infer from 𝓜TEI." (paper, Section 4.4)

Using the environment labels obtained in the previous step, MTIL learns an encoder $\phi^*(X)$ that captures all the information in the invariant features $X_I$, so that $Y_{suf}$ can be predicted accurately from a representation that is both invariant and sufficiently predictive.

Theoretical objective (paper, Equation 9):

$$\phi^* = \arg \max_{\phi} I(\mathbf{Y}^{\text{suf}}; \phi(\mathbf{X})) - I(\mathbf{Y}^{\text{suf}}; \bm{E}^*_{\text{infer}} | \phi(\mathbf{X}))$$

> "where I(·;·) measures Shannon mutual information. The first and second terms correspond to ensure sufficiency and invariance property of φ(𝑿), respectively." (paper, Section 4.4)

Practical loss function (paper, Equation 10):

$$\min_{\rho, \phi} \mathcal{L}_{\text{TIL}} = \mathbb{E}_{e \in \bm{E}^*_{\text{infer}}} \bm{\mathcal{R}}^{(e)}_{\text{suf}}(\rho, \phi) + \lambda_1 \bm{\mathcal{R}}_{\text{ERM}}(\rho, \phi) + \lambda_2 \text{Var}_{e \in \bm{E}^*_{\text{infer}}}\left[\bm{\mathcal{R}}^{(e)}_{\text{suf}}(\rho, \phi)\right]$$

> "where λ₁, λ₂ are hyper-parameters, 𝓡ERM(ρ,φ) = 𝔼[𝐗,𝐘][ℓ(ρ(φ(𝐗)), 𝐘)] is the ERM loss on raw 𝒀, 𝓡^e_suf(ρ,φ) defined in Eq. 10 is the loss of inferred environment e on 𝒀^suf, and Var_{e∈𝑬*_infer}[𝓡^(e)_suf(ρ,φ)] implies the variance of loss across inferred environments." (paper, Section 4.4)

> "The third term further balanced by λ₂ ensures the invariance property and is robust to marginal distribution shifts of input, theoretically guaranteed by (Krueger et al., 2021)" (paper, Section 4.4)

Training goals:

1. Train primarily on $Y_{suf}$, the predictable part of $Y$, rather than on the full $Y$; the ERM term on raw $Y$ enters only with weight $\lambda_1$.
2. Enforce uniform performance across all environments to secure the invariance property. The variance of the per-environment losses is computed and minimized, which reduces performance gaps between environments.

The result is a representation that supports consistent predictions regardless of environment.
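
How the three terms of Equation 10 combine can be shown with scalar losses. This is a minimal sketch, assuming the per-environment surrogate losses and the ERM loss have already been computed; the numbers are hypothetical, and the population variance is used here for simplicity (an implementation might use the sample variance instead).

```python
from statistics import mean, pvariance

def til_loss(env_suf_losses, erm_loss, lambda1, lambda2):
    """Combine the three terms of Eq. 10 from already-computed scalar losses."""
    env_mean = mean(env_suf_losses)                       # first term
    env_var = pvariance(env_suf_losses) if len(env_suf_losses) > 1 else 0.0
    return env_mean + lambda1 * erm_loss + lambda2 * env_var

# Same average surrogate loss, different spread across environments.
balanced = til_loss([0.30, 0.30, 0.30], erm_loss=0.5, lambda1=0.1, lambda2=1.0)
skewed = til_loss([0.10, 0.30, 0.50], erm_loss=0.5, lambda1=0.1, lambda2=1.0)
print(balanced, skewed)
```

Both cases share the same mean surrogate loss and ERM loss, so the only difference in the total comes from the variance term, which penalizes the model that performs unevenly across environments.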

4.5 Environment Inference Module (MTEI Implementation)

This module infers environments automatically from time-series data while preserving temporal adjacency. It corresponds to MTEI (Time-Series Environment Inference Module) in the FOIL paper and uses a multi-head network to implement the environment-specific regressors.

import torch.nn as nn

class EnvironmentInference(nn.Module):
    """
    Time-series Environment Inference Module
    Infers environments while preserving temporal adjacency
    """
    def __init__(self, num_environments=3, hidden_dim=64, seq_len=100):
        super().__init__()
        self.num_environments = num_environments

        # Multi-head network for environment-specific regressors
        self.env_heads = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_dim, hidden_dim // 2),
                nn.ReLU(),
                nn.Linear(hidden_dim // 2, 1)
            ) for _ in range(num_environments)
        ])

        # Environment assignment network (temporal-aware)
        # An LSTM is used to preserve temporal adjacency
        self.env_lstm = nn.LSTM(
            input_size=hidden_dim,
            hidden_size=hidden_dim,
            num_layers=1,
            batch_first=True,
            bidirectional=False
        )
        self.env_assigner = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_environments),
            nn.Softmax(dim=-1)
        )

The LSTM is used to infer environment probabilities while preserving temporal continuity.

4.6 FOIL Loss

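FOIL trains with a combination of several loss functions. In this implementation, the combined loss also covers part of the role of CLD (Label Decomposing Component) by accounting for uncertainty from unobserved variables:

1. Invariant Loss: consistent predictions regardless of environment (the core goal of MTIL)
2. Environment-specific Loss: prediction error of the environment-specific regressors (per-environment prediction in MTEI)
3. Surrogate Loss: accounts for uncertainty from unobserved variables (part of the role of CLD)
4. Diversity Loss: keeps environments from becoming too similar (strengthens environment separation in MTEI)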

import torch
import torch.nn.functional as F

def compute_foil_loss(self, features, predictions, targets, env_probs):
    """
    Compute the FOIL loss

    Args:
        features: (B, L, hidden_dim) invariant representation
        predictions: (B, L, D) predictions (output of the original model)
        targets: (B, L, D) targets
        env_probs: (B, L, num_envs) environment probabilities
    Returns:
        total_loss: FOIL loss value
    """
    # 1. Invariant loss (consistent predictions regardless of environment)
    invariant_pred = self.invariant_predictor(features)
    invariant_loss = F.mse_loss(invariant_pred, targets)

    # 2. Environment-specific loss (per-environment regressors)
    env_predictions = self.env_inference(features, mode="predict")
    # Combine the per-environment predictions as a probability-weighted average
    weighted_pred = torch.sum(
        env_predictions.unsqueeze(-1) * env_probs.unsqueeze(-2), dim=-1
    )
    env_loss = F.mse_loss(weighted_pred, targets)

    # 3. Surrogate loss (mitigates unobserved variables)
    # Note: this variance term is a heuristic of this implementation; it is not
    # the IRN loss of Equation 4 (see _compute_irn_loss for that).
    pred_variance = torch.var(predictions, dim=1, keepdim=True)
    surrogate_loss = -self.foil_lambda * torch.mean(pred_variance)

    # 4. Environment diversity loss (keeps environments from becoming too similar)
    env_entropy = -torch.sum(
        env_probs * torch.log(env_probs + 1e-8), dim=-1
    )
    diversity_loss = -0.01 * torch.mean(env_entropy)

    total_loss = invariant_loss + env_loss + surrogate_loss + diversity_loss
    return total_loss

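FOIL Training Process in Detail

Three-Stage Training Process

FOIL follows this three-stage training process:

Stage 1: Pre-training

The backbone model is trained in the usual way. This stage uses standard ERM (Empirical Risk Minimization) to initialize the model.

Implementation in this project (the _pre_training method in trainer.py):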

import os

import numpy as np
import torch

def _pre_training(self, train_loader, vali_loader, test_loader, setting, exp_folder):
    """
    Stage 1: Pre-training - train the backbone model with standard ERM
    """
    model_optim = self._select_optimizer()
    criterion = self._select_criterion()

    # Pre-training epochs (use part of the total epochs when FOIL is enabled)
    pretrain_epochs = max(10, self.args.train_epochs // 3) if self.args.use_foil else self.args.train_epochs

    early_stopping = EarlyStopping(patience=self.args.patience, verbose=True)
    path = os.path.join(exp_folder, "checkpoints")

    for epoch in range(pretrain_epochs):
        train_loss = []
        self.model.train()

        for i, batch in enumerate(train_loader):
            model_optim.zero_grad()

            batch_x, batch_y, batch_x_mark, batch_y_mark = batch
            batch_x = batch_x.float().to(self.device)
            batch_x_mark = batch_x_mark.float().to(self.device) if batch_x_mark is not None else None
            batch_y = batch_y.float().to(self.device)
            batch_y_mark = batch_y_mark.float().to(self.device) if batch_y_mark is not None else None

            if torch.isnan(batch_x).any():
                continue

            # Forward pass
            outputs = self.model(batch_x, batch_x_mark, batch_y, batch_y_mark)

            # Align output and target shapes
            if self.args.features == "MS":
                f_dim = -1
                outputs = outputs[:, -self.args.pred_len :, f_dim:]
                batch_y = batch_y[:, -self.args.pred_len :, f_dim:].to(self.device)
            else:
                outputs = outputs[:, -self.args.pred_len :, :]
                batch_y = batch_y[:, -self.args.pred_len :, :].to(self.device)

            if outputs.shape != batch_y.shape:
                min_len = min(outputs.shape[1], batch_y.shape[1])
                min_dim = min(outputs.shape[2], batch_y.shape[2])
                outputs = outputs[:, :min_len, :min_dim]
                batch_y = batch_y[:, :min_len, :min_dim]

            # Standard ERM loss
            loss = criterion(outputs, batch_y)

            loss.backward()
            model_optim.step()
            train_loss.append(loss.item())

        train_loss = np.average(train_loss)
        vali_loss = self._validate(vali_loader, criterion)
        test_loss = self._validate(test_loader, criterion)

        print(f"Pre-training Epoch {epoch+1}/{pretrain_epochs}, Train Loss: {train_loss:.4f}, Val Loss: {vali_loss:.4f}, Test Loss: {test_loss:.4f}")

        early_stopping(vali_loss, self.model, path)
        if early_stopping.early_stop:
            print("Early stopping")
            break

    # Load the best model
    best_model_path = os.path.join(path, 'checkpoint.pth')
    if os.path.exists(best_model_path):
        self.model.load_state_dict(torch.load(best_model_path, map_location=self.device, weights_only=False))
        print(f"✅ Pre-training complete. Best model loaded from {best_model_path}")

    return self.model

Stage 2: MTEI (Time-Series Environment Inference)

After the trained backbone model is frozen, the environment regressors $\rho^{(e)}(\cdot)$ are trained:

1. M-step: Train each environment-specific regressor on the representation from the frozen backbone model.
2. E-step: Assign each instance to the environment with the smallest error, then apply label propagation to preserve temporal adjacency.

Afterward, the training dataset is traversed once more and an environment label is assigned to each time step, to be used as meta-information in Stage 3.

Implementation in this project (the _mtei_training, _compute_irn_loss, and _label_propagation methods in trainer.py):

def _compute_irn_loss(self, predictions, targets):
    """
    IRN-based loss (paper, Equation 4)
    ℓ_suf(Ŷ, Y) = MSE(Res~, 0) = ℓ(Ŷ~, Y~)
    """
    def _normalize(x):
        # Normalize over the time axis (paper, Equation 3):
        # dim 1 for (B, L, D) tensors, last dim for (B, L) tensors.
        # Normalizing over the feature axis would yield NaN when D == 1.
        time_dim = 1 if x.dim() == 3 else -1
        mean = x.mean(dim=time_dim, keepdim=True)
        std = x.std(dim=time_dim, keepdim=True) + 1e-8
        return (x - mean) / std

    pred_norm = _normalize(predictions)
    target_norm = _normalize(targets)

    # Compute normalized residual
    res_normalized = pred_norm - target_norm

    # MSE of normalized residual
    loss = F.mse_loss(res_normalized, torch.zeros_like(res_normalized))
    return loss

def _label_propagation(self, env_labels, radius=2):
    """
    Label Propagation: Majority voting (paper, Equation 8)
    E_infer(t) ← mode{E_infer(t+j)}_{j=-r}^r
    """
    new_labels = env_labels.clone()
    n = len(env_labels)

    for t in range(n):
        # Get temporal neighbors
        start_idx = max(0, t - radius)
        end_idx = min(n, t + radius + 1)
        neighbors = env_labels[start_idx:end_idx]

        # Majority voting
        mode_value = torch.mode(neighbors)[0]
        new_labels[t] = mode_value

    return new_labels

def _mtei_training(self, train_loader, num_environments=3, num_iterations=10):
    """
    Stage 2: MTEI - train the Environment Inference Module
    Corresponds to Stage 2 of Algorithm 1 in the paper
    """
    # Get the underlying model (handles DataParallel wrapping)
    if hasattr(self.model, 'module'):
        actual_model = self.model.module
    else:
        actual_model = self.model

    if not hasattr(actual_model, 'use_foil') or not actual_model.use_foil:
        print("⚠️  FOIL is not enabled. Skipping the MTEI stage.")
        return None

    if not hasattr(actual_model, 'foil'):
        print("⚠️  FOIL module not found. Skipping the MTEI stage.")
        return None

    foil_module = actual_model.foil

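    # Freeze the backbone model. The FOIL submodule is registered on
    # actual_model, so this freezes its parameters as well.
    for param in actual_model.parameters():
        param.requires_grad = False

    # Re-enable gradients for the environment regressors only, then train them.
    # Without this, backward() fails because no tensor in the graph requires grad.
    env_head_params = list(foil_module.env_inference.env_heads.parameters())
    for param in env_head_params:
        param.requires_grad = True
    env_optimizer = optim.Adam(env_head_params, lr=1e-3)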

    # Initialize environment labels randomly
    total_samples = len(train_loader.dataset)
    env_labels = torch.randint(0, num_environments, (total_samples,), device=self.device)

    for iteration in range(num_iterations):
        # M-step: Optimize environment-specific regressors (paper, Equation 6)
        foil_module.train()
        total_loss = 0

        for batch_idx, (batch_x, batch_y, batch_x_mark, batch_y_mark) in enumerate(train_loader):
            env_optimizer.zero_grad()

            batch_x = batch_x.float().to(self.device)
            batch_y = batch_y.float().to(self.device)

            # Get frozen backbone representation
            with torch.no_grad():
                # Extract the representation through the FOIL module
                features = foil_module.invariant_extractor(batch_x)  # φ*(X)

            # Get environment labels for this batch.
            # Slicing by batch position assumes train_loader iterates in a fixed
            # order (shuffle=False); otherwise labels and samples are misaligned.
            batch_start = batch_idx * train_loader.batch_size
            batch_end = min((batch_idx + 1) * train_loader.batch_size, total_samples)
            batch_env_labels = env_labels[batch_start:batch_end]

            # Compute environment-specific predictions and losses
            env_losses = []
            for e in range(num_environments):
                # Get predictions from environment-specific regressor ρ^(e)
                env_pred = foil_module.env_inference.env_heads[e](features)  # (B, L, 1)
                env_pred = env_pred.squeeze(-1)  # (B, L)

                # Compute loss for this environment (paper, Equation 6)
                mask = (batch_env_labels == e)
                if mask.sum() > 0:
                    # IRN-based loss (paper, Equation 4)
                    # Align the lengths of the predictions and targets
                    pred_len = min(env_pred.shape[1], batch_y.shape[1])
                    env_pred_aligned = env_pred[:, :pred_len]
                    batch_y_aligned = batch_y[:, :pred_len]

                    # Compute the loss per sample
                    sample_losses = []
                    for sample_idx in range(env_pred_aligned.shape[0]):
                        if mask[sample_idx]:
                            sample_pred = env_pred_aligned[sample_idx:sample_idx+1]
                            sample_target = batch_y_aligned[sample_idx:sample_idx+1]
                            sample_loss = self._compute_irn_loss(sample_pred, sample_target)
                            sample_losses.append(sample_loss)

                    if sample_losses:
                        env_loss = torch.stack(sample_losses).mean()
                        env_losses.append(env_loss)

            # M-step loss: average across environments
            if env_losses:
                m_step_loss = sum(env_losses) / len(env_losses)
                m_step_loss.backward()
                env_optimizer.step()
                total_loss += m_step_loss.item()

        print(f"MTEI Iteration {iteration+1}/{num_iterations}, M-step Loss: {total_loss/len(train_loader):.4f}")

        # E-step: Reallocate environment labels (paper, Equations 7 and 8)
        foil_module.eval()
        new_env_labels = []

        with torch.no_grad():
            for batch_x, batch_y, batch_x_mark, batch_y_mark in train_loader:
                batch_x = batch_x.float().to(self.device)
                batch_y = batch_y.float().to(self.device)

                # Get frozen backbone representation
                features = foil_module.invariant_extractor(batch_x)

                # Compute loss for each environment (paper, Equation 7)
                batch_size = batch_x.shape[0]
                env_losses_per_sample = []

                for e in range(num_environments):
                    env_pred = foil_module.env_inference.env_heads[e](features)
                    env_pred = env_pred.squeeze(-1)

                    # Compute the IRN loss per sample
                    pred_len = min(env_pred.shape[1], batch_y.shape[1])
                    env_pred_aligned = env_pred[:, :pred_len]
                    batch_y_aligned = batch_y[:, :pred_len]

                    sample_losses = []
                    for sample_idx in range(batch_size):
                        sample_pred = env_pred_aligned[sample_idx:sample_idx+1]
                        sample_target = batch_y_aligned[sample_idx:sample_idx+1]
                        sample_loss = self._compute_irn_loss(sample_pred, sample_target)
                        sample_losses.append(sample_loss)

                    env_losses_per_sample.append(torch.stack(sample_losses))

                # Assign to environment with minimum loss
                env_losses_tensor = torch.stack(env_losses_per_sample, dim=0)  # (num_envs, batch_size)
                assigned_envs = torch.argmin(env_losses_tensor, dim=0)  # (batch_size,)
                new_env_labels.append(assigned_envs.cpu())

        env_labels = torch.cat(new_env_labels, dim=0).to(self.device)

        # Label Propagation: Majority voting (paper, Equation 8)
        env_labels = self._label_propagation(env_labels, radius=2)

        print(f"  E-step: Environment distribution: {torch.bincount(env_labels.cpu())}")

    print("✅ MTEI complete")
    return env_labels.cpu()
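
One caveat of the code above is that it looks up labels by batch position, which only lines up with the samples when the loader iterates in a fixed order. The sketch below contrasts that with a lookup by dataset index. It is a hypothetical illustration: it assumes the dataset could be changed to return each sample's index alongside the batch, which the project code shown here does not do.

```python
def labels_by_position(batch_idx, batch_size, env_labels):
    """Slice labels by batch position, as the training loop above does."""
    start = batch_idx * batch_size
    return env_labels[start:start + batch_size]

def labels_by_index(sample_indices, env_labels):
    """Look labels up by the dataset index carried with each sample."""
    return [env_labels[i] for i in sample_indices]

env_labels = [0, 0, 1, 1, 2, 2]   # hypothetical label per dataset index
shuffled_batch = [4, 1, 3]        # dataset indices in a shuffled first batch

by_position = labels_by_position(0, 3, env_labels)      # [0, 0, 1]
by_index = labels_by_index(shuffled_batch, env_labels)  # [2, 0, 1]
print(by_position, by_index)
```

With a shuffled loader, the position-based slice returns labels for samples 0 to 2 rather than for the samples actually in the batch, so either shuffling should be disabled for these stages or the labels should be indexed by sample.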

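Stage 3: MTIL (Time-Series Invariant Learning)

The Stage 1 model is retrained using the environment labels stored during the MTEI stage:

Learn invariant representations based on the environment labels.
Enforce uniform performance across all environments to secure the invariance property.
Train to predict $Y_{suf}$ rather than the full $Y$.

Implementation in this project (the _mtil_training method in trainer.py):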

def _mtil_training(self, train_loader, vali_loader, test_loader, env_labels, setting, exp_folder):
    """
    Stage 3: MTIL - train the Time-Series Invariant Learning Module
    Corresponds to Stage 1 of Algorithm 1 in the paper (training based on environment labels)
    """
    # Get the underlying model
    if hasattr(self.model, 'module'):
        actual_model = self.model.module
    else:
        actual_model = self.model

    # Unfreeze backbone model for fine-tuning
    for param in actual_model.parameters():
        param.requires_grad = True

    # Optimizer for model and FOIL module
    model_optim = optim.Adam(
        list(actual_model.parameters()),
        lr=self.args.learning_rate * 0.1  # Lower learning rate for fine-tuning
    )

    criterion = self._select_criterion()
    early_stopping = EarlyStopping(patience=self.args.patience, verbose=True)
    path = os.path.join(exp_folder, "checkpoints")

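    # MTIL epochs (use the remaining epochs; at least one, so that a small
    # train_epochs does not produce an empty or negative range)
    mtil_epochs = max(1, self.args.train_epochs - max(10, self.args.train_epochs // 3))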

    lambda1 = getattr(self.args, 'foil_lambda', 0.1)  # ERM loss weight
    lambda2 = getattr(self.args, 'foil_lambda', 0.1)  # Variance loss weight

    for epoch in range(mtil_epochs):
        train_loss = []
        self.model.train()

        for batch_idx, (batch_x, batch_y, batch_x_mark, batch_y_mark) in enumerate(train_loader):
            model_optim.zero_grad()

            batch_x = batch_x.float().to(self.device)
            batch_x_mark = batch_x_mark.float().to(self.device) if batch_x_mark is not None else None
            batch_y = batch_y.float().to(self.device)
            batch_y_mark = batch_y_mark.float().to(self.device) if batch_y_mark is not None else None

            if torch.isnan(batch_x).any():
                continue

            # Forward pass
            outputs = self.model(batch_x, batch_x_mark, batch_y, batch_y_mark)

            # Align output and target shapes
            if self.args.features == "MS":
                f_dim = -1
                outputs = outputs[:, -self.args.pred_len :, f_dim:]
                batch_y = batch_y[:, -self.args.pred_len :, f_dim:].to(self.device)
            else:
                outputs = outputs[:, -self.args.pred_len :, :]
                batch_y = batch_y[:, -self.args.pred_len :, :].to(self.device)

            if outputs.shape != batch_y.shape:
                min_len = min(outputs.shape[1], batch_y.shape[1])
                min_dim = min(outputs.shape[2], batch_y.shape[2])
                outputs = outputs[:, :min_len, :min_dim]
                batch_y = batch_y[:, :min_len, :min_dim]

            # Get environment labels for this batch
            batch_start = batch_idx * train_loader.batch_size
            batch_end = min((batch_idx + 1) * train_loader.batch_size, len(env_labels))
            batch_env_labels = env_labels[batch_start:batch_end].to(self.device)

            # Compute environment-specific losses (paper, Equation 10, first term)
            env_losses = []
            for e in range(actual_model.foil.num_environments):
                mask = (batch_env_labels == e)
                if mask.sum() > 0:
                    # IRN-based loss for environment e
                    env_loss = self._compute_irn_loss(outputs[mask], batch_y[mask])
                    env_losses.append(env_loss)

            # First term: Average environment-specific loss
            env_avg_loss = sum(env_losses) / len(env_losses) if env_losses else torch.tensor(0.0, device=self.device)

            # Second term: ERM loss on raw Y (paper, Equation 10, second term)
            erm_loss = criterion(outputs, batch_y)

            # Third term: Variance of losses across environments (paper, Equation 10, third term)
            if len(env_losses) > 1:
                env_losses_tensor = torch.stack(env_losses)
                variance_loss = torch.var(env_losses_tensor)
            else:
                variance_loss = torch.tensor(0.0, device=self.device)

            # Total MTIL loss (paper, Equation 10)
            total_mtil_loss = env_avg_loss + lambda1 * erm_loss + lambda2 * variance_loss

            total_mtil_loss.backward()
            model_optim.step()

            train_loss.append(total_mtil_loss.item())

        train_loss = np.average(train_loss)
        vali_loss = self._validate(vali_loader, criterion)
        test_loss = self._validate(test_loader, criterion)

        print(f"MTIL Epoch {epoch+1}/{mtil_epochs}, Train Loss: {train_loss:.4f}, Val Loss: {vali_loss:.4f}, Test Loss: {test_loss:.4f}")

        early_stopping(vali_loss, self.model, path)
        if early_stopping.early_stop:
            print("Early stopping")
            break

    print("✅ MTIL complete")
    return self.model

# Full FOIL training process (implementation in this project)
def train_foil(model, foil_module, train_loader, val_loader, config):
    """
    Complete FOIL training pipeline

    Args:
        model: Backbone time-series forecasting model
        foil_module: FOIL module
        train_loader: Training data loader
        val_loader: Validation data loader
        config: Configuration dictionary
    """
    print("=" * 50)
    print("Stage 1: Pre-training")
    print("=" * 50)

    # Stage 1: Pre-training
    optimizer = torch.optim.Adam(model.parameters(), lr=config['lr'])
    model = pre_training(model, train_loader, optimizer, F.mse_loss,
                         num_epochs=config['pretrain_epochs'])

    print("\n" + "=" * 50)
    print("Stage 2: MTEI - Environment Inference")
    print("=" * 50)

    # Stage 2: MTEI
    env_labels = mtei_training(model, foil_module, train_loader,
                               num_environments=config['num_environments'],
                               num_iterations=config['mtei_iterations'])

    print("\n" + "=" * 50)
    print("Stage 3: MTIL - Invariant Learning")
    print("=" * 50)

    # Stage 3: MTIL
    optimizer = torch.optim.Adam(
        list(model.parameters()) + list(foil_module.parameters()),
        lr=config['lr'] * 0.1  # Lower learning rate for fine-tuning
    )
    model, foil_module = mtil_training(
        model, foil_module, train_loader, env_labels, optimizer,
        lambda1=config['lambda1'], lambda2=config['lambda2'],
        num_epochs=config['mtil_epochs']
    )

    print("\n" + "=" * 50)
    print("FOIL Training Complete!")
    print("=" * 50)

    return model, foil_module, env_labels


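### Core Ideas Behind FOIL

FOIL is built on the following core ideas:

1. **Use $Y_{suf}$ as the target**: learn only the sufficiently predictable part, which is not affected by the unobserved variable $Z$
2. **Automatic environment inference**: infer temporal environments without explicit environment labels
3. **Invariant representation learning**: learn features that support consistent predictions across the inferred environments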

## Integrating the FOIL Module

FOIL can be integrated into an existing model in much the same way as RevIN. `FOILModule` is used as follows:

    class FOILModule(nn.Module):
        """
        FOIL integration module, used like RevIN.

        Usage:
            # in __init__
            if getattr(configs, 'use_foil', False):
                from layers.foil import FOILModule
                self.foil = FOILModule(...)

            # in forecast
            if self.use_foil:
                invariant_features = self.foil(x_enc, mode="extract")
                env_probs = self.foil(x_enc, mode="infer_env")

            # in forward (during training)
            if self.use_foil and self.training:
                return dec_out, {'invariant_features': ..., 'env_probs': ...}
        """
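
To make the two call modes concrete, here is a minimal sketch of a module with the same interface. It is not the project's `layers/foil.py`: the class name `TinyFOILModule`, the constructor arguments, and the layer choices (a linear extractor, an LSTM followed by a softmax assigner) are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class TinyFOILModule(nn.Module):
    """Hypothetical stand-in exposing the same two call modes as FOILModule."""

    def __init__(self, n_vars: int, hidden_dim: int = 64, num_environments: int = 3):
        super().__init__()
        # Per-time-step feature extractor: (B, L, n_vars) -> (B, L, hidden_dim)
        self.extractor = nn.Sequential(nn.Linear(n_vars, hidden_dim), nn.GELU())
        # The LSTM carries temporal context into the environment scores
        self.env_lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.env_assigner = nn.Linear(hidden_dim, num_environments)

    def forward(self, x: torch.Tensor, mode: str = "extract") -> torch.Tensor:
        features = self.extractor(x)
        if mode == "extract":
            return features  # (B, L, hidden_dim)
        if mode == "infer_env":
            lstm_out, _ = self.env_lstm(features)
            # Softmax gives each time step a distribution over environments
            return torch.softmax(self.env_assigner(lstm_out), dim=-1)  # (B, L, num_envs)
        raise ValueError(f"unknown mode: {mode!r}")


x = torch.randn(8, 96, 7)  # 8 windows, look-back 96, 7 variables
foil = TinyFOILModule(n_vars=7)
features = foil(x, mode="extract")
env_probs = foil(x, mode="infer_env")
print(features.shape, env_probs.shape)
```

Dispatching on a `mode` string mirrors how RevIN is called with `"norm"` and `"denorm"`, which keeps the call sites inside each backbone's `forecast` short.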


## Applying FOIL to Various Models

### 1. PatchTST

PatchTST is a patch-based Transformer that splits a time series into patches before processing it. With FOIL applied, `forecast` looks like this:

    def forecast(self, x_enc, x_mark_enc, x_dec, x_mark_dec):
        # RevIN normalization
        x_enc = self.revin(x_enc, "norm")

        # FOIL: Invariant representation and environment inference
        foil_info = None
        if self.use_foil:
            invariant_features = self.foil(x_enc, mode="extract")
            env_probs = self.foil(x_enc, mode="infer_env")
            foil_info = {
                'invariant_features': invariant_features,
                'env_probs': env_probs
            }

        # Standard PatchTST processing
        x_enc = x_enc.permute(0, 2, 1)
        enc_out, n_vars = self.patch_embedding(x_enc)
        enc_out, attns = self.encoder(enc_out)
        # ... (remaining processing)

        # Return FOIL info during training
        if self.use_foil and self.training and foil_info is not None:
            return dec_out, foil_info
        return dec_out


**Effect**:
- Invariant representations are extracted before patch-based processing, so the model learns features that are robust to environment changes
- Environment inference makes it possible to adjust predictions to market conditions

### 2. TSMixer

TSMixer is a simple MLP-based architecture that performs temporal and channel mixing.

    def forecast(self, x_enc, x_mark_enc, x_dec, x_mark_dec, mask=None):
        # FOIL: Invariant representation and environment inference
        foil_info = None
        if self.use_foil:
            invariant_features = self.foil(x_enc, mode="extract")
            env_probs = self.foil(x_enc, mode="infer_env")
            foil_info = {
                'invariant_features': invariant_features,
                'env_probs': env_probs
            }

        # Standard TSMixer processing
        for i in range(self.layer):
            x_enc = self.model[i](x_enc)
        enc_out = self.projection(x_enc.transpose(1, 2)).transpose(1, 2)

        # Return FOIL info during training
        if self.use_foil and self.training and foil_info is not None:
            return enc_out, foil_info
        return enc_out


**Effect**:
- Predictions that are robust to environment changes are possible even with an MLP-based model
- OOD generalization improves with minimal overhead on a simple architecture

### 3. TimeLLM

TimeLLM is a time-series forecasting model built on an LLM. Applying FOIL lets the model use the LLM's strong representational power in a way that is robust to environment changes.

    def forecast(self, x_enc, x_mark_enc, x_dec, x_mark_dec):
        # FOIL: Invariant representation and environment inference
        foil_info = None
        if self.use_foil:
            invariant_features = self.foil(x_enc, mode="extract")
            env_probs = self.foil(x_enc, mode="infer_env")
            foil_info = {
                'invariant_features': invariant_features,
                'env_probs': env_probs
            }

        x_enc = self.normalize_layers(x_enc, "norm")

        # TimeLLM prompt generation and LLM processing
        # ... (remaining processing)


**Effect**:
- Combines the LLM's generalization ability with FOIL's environment-invariant learning
- Better zero-shot transfer to different domains

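### 4. Nonstationary Transformer

Nonstationary Transformer is a model designed to handle non-stationarity. Combined with FOIL, it can produce more robust predictions.

    # Nonstationary Transformer normalization and processing
    # ... (remaining processing)


**Effect**:
- Synergy between non-stationarity handling and environment-invariant learning
- More robust predictions under changing market environments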

### 5. Other Models

FOIL was applied to the following models using the same pattern:

- **FEDformer**: Frequency Enhanced Decomposed Transformer
- **TimeMixer**: Multi-scale mixing architecture
- **TimeXer**: Cross-attention based model
- **ETSformer**: Exponential Smoothing Transformer
- **LightTS**: Lightweight time series model
- **SimTS**: Simple time series model
- **FreTS**: Frequency-domain MLP model

All of these models use the same interface, so the integration is consistent across them.

## Integrating the FOIL Loss in the Trainer

During training, the FOIL loss is integrated as follows:

    # Model outputs
    model_outputs = self.model(batch_x, batch_x_mark, batch_y, batch_y_mark)

    # Extract the extra FOIL information when FOIL is enabled
    foil_info = None
    if hasattr(self.model, 'use_foil') and self.model.use_foil and isinstance(model_outputs, tuple):
        outputs, foil_info = model_outputs
    else:
        outputs = model_outputs

    # Base MSE loss
    loss = criterion(outputs, batch_y)

    # Add the FOIL loss
    if foil_info is not None and hasattr(self.model, 'foil'):
        invariant_features = foil_info.get('invariant_features')
        env_probs = foil_info.get('env_probs')
        if invariant_features is not None and env_probs is not None:
            # Length alignment, producing features_for_loss and env_probs_for_loss
            # ... (shape alignment)

            foil_loss = self.model.foil.compute_foil_loss(
                features=features_for_loss,
                predictions=outputs,
                targets=batch_y,
                env_probs=env_probs_for_loss
            )
            loss = loss + foil_loss


The FOIL loss is added to the base prediction loss, and the two are optimized jointly.

## Hyperparameter Settings

The main hyperparameters of the FOIL module are:

- `use_foil`: whether to use FOIL (default: False)
- `num_environments`: number of environments (default: 3)
- `foil_hidden_dim`: hidden dimension of the FOIL module (default: 64)
- `foil_lambda`: weight of the surrogate loss (default: 0.1)

    parser.add_argument('--use_foil', action='store_true', default=False, help='Use FOIL for OOD generalization')
    parser.add_argument('--num_environments', type=int, default=3, help='Number of environments for FOIL')
    parser.add_argument('--foil_hidden_dim', type=int, default=64, help='Hidden dimension for FOIL modules')
    parser.add_argument('--foil_lambda', type=float, default=0.1, help='FOIL surrogate loss weight')
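
As a quick check of how these flags behave, the sketch below registers the same four arguments on a standalone parser and parses an empty and a customized command line. The parser description and the example values are illustrative; a real run script would register many more options.

```python
import argparse

parser = argparse.ArgumentParser(description="FOIL options (illustrative sketch)")
parser.add_argument('--use_foil', action='store_true', default=False,
                    help='Use FOIL for OOD generalization')
parser.add_argument('--num_environments', type=int, default=3,
                    help='Number of environments for FOIL')
parser.add_argument('--foil_hidden_dim', type=int, default=64,
                    help='Hidden dimension for FOIL modules')
parser.add_argument('--foil_lambda', type=float, default=0.1,
                    help='FOIL surrogate loss weight')

# No flags: FOIL stays disabled and the defaults apply
default_args = parser.parse_args([])
# Explicit flags: enable FOIL with 4 environments and a smaller loss weight
custom_args = parser.parse_args(
    ['--use_foil', '--num_environments', '4', '--foil_lambda', '0.05']
)

print(default_args.use_foil, default_args.num_environments)
print(custom_args.use_foil, custom_args.num_environments, custom_args.foil_lambda)
```

Because `--use_foil` is a `store_true` flag, existing run commands keep their behavior until the flag is passed explicitly.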


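## Advantages of FOIL

### 1. Model-Independent Integration

FOIL can be integrated without major changes to an existing model's architecture. Like RevIN, it is added at the model's preprocessing stage.

### 2. Minimal Overhead

The FOIL module has relatively few parameters and adds little computation at inference time.

### 3. Automatic Environment Inference

Environments can be inferred automatically from time-series data without explicit environment labels.

### 4. Combination of Multiple Loss Functions

Several loss functions are combined to optimize environment invariance and prediction accuracy at the same time.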

## Comparison with the Paper

### Training Algorithm (Paper Appendix A)

The training procedure given in Algorithm 1 of the paper:

> "**Algorithm 1**: The training procedure of our FOIL.
>
> **Require**: Time-series dataset 𝒟 = {(𝐗_i, 𝐘_i)}_{i=1}^N
>
> **Ensure**: An optimized predictor ρ(φ(·)): 𝒳 → 𝒴
>
> 1. Initialize ρ(·), {ρ^(e)(·)}, φ(·)
> 2. Random assign environment label for each (𝐗_i, 𝐘_i).
>
> **while not converged do**
>
> **Stage 1: Time-series Invariant Learning**: Update φ(·), ρ(·) according to Equation 10.
>
> **Stage 2: Time-series Environment Inference**:
> - **M Step**: Fit models according to Equation 6, update {ρ^(e)}.
> - **E Step**: Reallocate environment labels according to Equation 7 and Equation 8.
>
> **end while**
>
> return ρ(·) and φ(·)."
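
The control flow of Algorithm 1 can be sketched as a small driver loop. This is only a skeleton under stated assumptions: the three callables stand in for the real update steps, the function and parameter names are hypothetical, and stopping when the labels no longer change is one possible reading of "while not converged", not something the paper specifies here.

```python
import random
from typing import Callable, List


def alternate_foil_training(
    num_samples: int,
    num_envs: int,
    invariant_learning_step: Callable[[List[int]], None],
    m_step: Callable[[List[int]], None],
    e_step: Callable[[], List[int]],
    max_rounds: int = 20,
    seed: int = 0,
) -> List[int]:
    """Alternate Stage 1 (MTIL) and Stage 2 (MTEI) until labels stabilize."""
    rng = random.Random(seed)
    # Step 2 of Algorithm 1: random initial environment label per sample
    env_labels = [rng.randrange(num_envs) for _ in range(num_samples)]
    for _ in range(max_rounds):
        invariant_learning_step(env_labels)  # Stage 1: update phi, rho (Eq. 10)
        m_step(env_labels)                   # Stage 2, M step: fit rho^(e) (Eq. 6)
        new_labels = e_step()                # Stage 2, E step: reassign (Eq. 7-8)
        if new_labels == env_labels:         # assumed convergence test
            break
        env_labels = new_labels
    return env_labels


# Stub steps that only record the call order, to show the alternation
calls = []
final_labels = alternate_foil_training(
    num_samples=6,
    num_envs=2,
    invariant_learning_step=lambda labels: calls.append("mtil"),
    m_step=lambda labels: calls.append("m_step"),
    e_step=lambda: [0, 0, 0, 1, 1, 1],
)
print(final_labels)
```

Passing the steps in as callables keeps the loop independent of any particular backbone, which is the same separation the alternating scheme relies on.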

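### Official GitHub Repository Implementation

The official [AdityaLab/FOIL](https://github.com/AdityaLab/FOIL) repository uses a **two-step training process**:

1. **Step 0 (environment inference)**: `ILI-Pred4-0.py`
   - MTEI stage: infers environment labels
   - Freezes the backbone model and trains the environment regressors

2. **Step 1 (invariant learning)**: `ILI-Pred4-1.py`
   - MTIL stage: learns an invariant representation based on the inferred environment labels

### Comparison with This Project's Design

**Training process**:
- ✅ Conceptually consistent with Algorithm 1 of the paper: alternating updates of Stage 1 (MTIL) and Stage 2 (MTEI)
- ⚠️ Differences:
  - Paper: alternating updates
  - Official repository: two-step training split across separate scripts
  - This project: a single integrated training process

**Implementation**:
- **Official**: Informer-specific implementation, with separate scripts per dataset
- **This project**: model-agnostic design that can be integrated into various models in the same way as RevIN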

### Summary of Key Equations

The key equations presented in the paper:

1. **CLD (Section 4.2)**
   - Equation (2): $\bm{Y} = \alpha(\bm{Z})(\bm{Y}^{\text{suf}}) + \beta(\bm{Z})\mathbf{1}$
   - Equation (3): IRN normalization
   - Equation (4): Surrogate loss $\ell_{\text{suf}}(\hat{\bm{Y}}, \bm{Y}) = \text{MSE}(\tilde{\bm{Res}}, \bm{0})$

2. **MTEI (Section 4.3)**
   - Equation (6): M-step loss
   - Equation (7): E-step environment reassignment
   - Equation (8): Label propagation (majority voting)

3. **MTIL (Section 4.4)**
   - Equation (9): Theoretical objective (mutual information)
   - Equation (10): Practical loss function (three terms: per-environment loss, ERM loss, variance loss)
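
The three-term structure of Equation (10) can be checked numerically with a small sketch. As a simplifying assumption it uses plain MSE for every term and hard environment labels; the function name `mtil_loss` and its arguments are illustrative, and the per-environment criterion used in the paper or in this project may differ.

```python
import torch
import torch.nn.functional as F


def mtil_loss(preds, targets, env_labels, num_envs, lambda1=1.0, lambda2=1.0):
    """Mean per-environment loss + lambda1 * ERM loss + lambda2 * variance of per-environment losses."""
    env_losses = []
    for e in range(num_envs):
        mask = env_labels == e
        if mask.any():  # skip environments with no sample in this batch
            env_losses.append(F.mse_loss(preds[mask], targets[mask]))
    env_losses = torch.stack(env_losses)
    erm_loss = F.mse_loss(preds, targets)
    # torch.var is the unbiased estimator, so it needs at least two environments
    variance = torch.var(env_losses) if env_losses.numel() > 1 else preds.new_zeros(())
    return env_losses.mean() + lambda1 * erm_loss + lambda2 * variance


preds = torch.zeros(4, 1)
targets = torch.tensor([[1.0], [1.0], [3.0], [3.0]])
env_labels = torch.tensor([0, 0, 1, 1])
loss = mtil_loss(preds, targets, env_labels, num_envs=2, lambda1=1.0, lambda2=0.1)
print(loss.item())
```

In this toy batch the per-environment losses are 1 and 9, so their mean is 5, the ERM loss is also 5, and the unbiased variance is 32, giving 5 + 5 + 0.1 × 32 = 13.2. The variance term is what penalizes a model that fits one environment much better than another.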

> **Note**: All paper excerpts in this post are quoted directly from [Time-Series Forecasting for Out-of-Distribution Generalization Using Invariant Learning](https://arxiv.org/html/2406.09130v1) (arXiv:2406.09130, ICML 2024), and each section cites the corresponding Section and Equation numbers from the paper.

## Experiments and Validation

### Experimental Setup in the Paper

The paper uses the following experimental setup:

**Datasets**:
- ETTh1, ETTh2: Train/Valid/Test = 6/2/2
- Exchange, Illness: Train/Valid/Test = 7/1/2

**Implementation details**:
- Look-back window: 96 (ILI: 36)
- Prediction length: [24, 48, 96, 168, 336, 720] (ILI: [4, 8, 12, 16, 20, 24])

**Backbone models**:
- Informer (AAAI 2021)
- Crossformer (ICLR 2023)
- PatchTST (ICLR 2023)

**Baselines**:
- **TSF distribution shift methods**: RevIN, NST
- **OOD generalization methods**:
  - Require environment labels: GroupDRO, IRM, IB-ERM, VREX, SD
  - Do not require environment labels: EIL
- **Hybrid**: IRM+RevIN, EIL+RevIN

For the baselines that require environment labels, the training data is split into k segments (k tuned between 2 and 10), which serve as predefined environment labels.
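
A minimal sketch of such predefined labels is shown below. It assumes equal-length contiguous segments over time-ordered samples, which the text above does not spell out, and `segment_env_labels` is a hypothetical helper name.

```python
import numpy as np


def segment_env_labels(num_samples: int, k: int) -> np.ndarray:
    """Label time-ordered sample i with the index of the contiguous segment it falls in."""
    if not 1 <= k <= num_samples:
        raise ValueError("k must be between 1 and num_samples")
    # Integer arithmetic maps indices 0..num_samples-1 onto segments 0..k-1
    return (np.arange(num_samples) * k) // num_samples


labels = segment_env_labels(10, 3)
print(labels)  # [0 0 0 0 1 1 1 2 2 2]
```

Labels built this way depend only on position in time, which is exactly the limitation that inferred environments are meant to remove.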

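### Results Reported in the Paper

#### Main Results

1. **FOIL improves performance consistently and significantly across all datasets and prediction lengths**
   - MSE improves by up to 85%, demonstrating FOIL's effectiveness
   - Even on PatchTST, a SOTA model, FOIL consistently improves performance, by up to 30%
   - On the comparatively weak Informer, FOIL yields far larger gains, enough to make its results competitive

2. **FOIL performs better on short-term forecasting than on long-term forecasting**
   - The likely reason is that long-horizon forecasts carry more uncertainty, which interferes with invariant learning
   - The improvement is largest on the ILI dataset, whose test data contains a severe OOD distribution shift during the COVID-19 period

#### Comparison with Distribution Shift Methods

- Across all datasets, FOIL improves on existing distribution shift methods by more than 10% in MSE and more than 5.5% in MAE on average
- The surrogate loss cannot be replaced by a simple instance normalization method, and it plays an important role in mitigating the problem of unobserved core covariates
- General OOD methods perform poorly, which confirms that applying existing invariant learning directly to TSF tasks without adaptation is inappropriate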

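#### Ablation Study

An ablation study evaluates the contribution of each FOIL module and loss:

- **FOIL∖Suf**: removes the surrogate loss function used to decompose the sufficiently predictable $Y_{suf}$
- **FOIL∖TEI**: removes the MTEI module
- **FOIL∖LP**: removes the label propagation step in MTEI

**Results**:
- Performance drops sharply for FOIL∖Suf, which shows that mitigating the unobserved covariate problem is necessary when applying invariant learning to TSF
- FOIL∖TEI consistently outperforms FOIL∖LP on all datasets, which validates the benefit of preserving the adjacency structure of time-series data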

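#### Analysis of Inferred Environments

To show that the environments inferred by FOIL are reasonable, the paper presents a case study on the ILI dataset:

- It visualizes the contributions of summer (June to August each year), winter (December to February each year), and the H1N1-09 period (April 2009 to August 2010)
- The total number of environments is set to 2

**Key findings**:
1. The dominant components of environment 1 and environment 2 are winter and summer, respectively. Influenza is a seasonal disease that spreads during winter and subsides before summer
2. The H1N1-09 period contributes more to environment 1 than to environment 2, which is consistent with the similarity between the H1N1-09 period and the winter flu season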

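### Experimental Setup in This Project

FOIL was applied to a range of time-series forecasting models to evaluate OOD generalization:

- **Datasets**: various financial time series (S&P 500, Nasdaq, bonds, etc.)
- **Metrics**: MSE, MAE, and performance in OOD scenarios
- **Comparisons**: before and after applying FOIL, plus various baseline models
- **Trading strategies**: Last Strategy, Confidence Strategy, Ratio Strategy

### Results: Cumulative Return Comparison

Cumulative returns on the validation and test sets were tracked during training of the FOIL-enabled model. Performance was compared epoch by epoch for the three trading strategies (Last, Confidence, Ratio).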

#### Performance Before Applying FOIL

![Cumulative Returns Comparison - Baseline](./epoch_returns_comparison.jpg)

Before FOIL is applied:

- **Last Strategy**: Returns are positive in some epochs on the validation set but persistently negative on the test set, which indicates poor generalization.

- **Confidence Strategy**: The most stable of the three strategies. Losses on the test set are relatively small (-0.01 to -0.02), and the gap between validation and test performance is the smallest.

- **Ratio Strategy**: The validation set shows high volatility with positive returns in some epochs, but the test set shows large negative returns (-0.8 to -1.0) throughout.

**Key observations**:
- Every strategy shows a gap between validation and test performance (a generalization problem)
- Returns on the test set are persistently negative (OOD generalization failure)
- Confidence Strategy is relatively the most stable

#### Performance After Applying FOIL

![Cumulative Returns Comparison - FOIL](./epoch_returns_comparison_fol.jpg)

After FOIL is applied:

- **Last Strategy**: On the test set, returns are stably positive (about 0.21) after epoch 5. The validation set still fluctuates but stays positive overall.

- **Confidence Strategy**: On the test set, returns are small but consistently positive (about 0.02) after epoch 3. The gap between validation and test performance shrinks considerably.

- **Ratio Strategy**: The validation set is still volatile, but test performance improves substantially. It recovers gradually after epoch 2 and ends at around -0.6.

**Key improvements**:
- **Better OOD generalization**: positive returns on the test set (Last and Confidence Strategy)
- **Smaller validation-test gap**: generalization improves considerably
- **Greater stability**: performance stays stable as epochs progress

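### Key Findings

1. **Better OOD generalization**: After applying FOIL, the performance drop on test data drawn from a different distribution than the training data decreased substantially. In particular, Last Strategy and Confidence Strategy achieved positive returns on the test set.

2. **Environment-aware prediction adjustment**: Environment inference makes it possible to adjust predictions to market conditions. The stability of Confidence Strategy likely comes from environment inference combining well with confidence-based position sizing.

3. **Consistency across models**: Consistent improvements were observed across architectures. FOIL's environment-invariant learning carried over effectively to the different trading strategies.

4. **Training stability**: Models with FOIL became more stable as epochs progressed, and the gap between validation and test performance narrowed.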

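## Implementation Details

### 1. Preserving Temporal Adjacency and Label Propagation

The Environment Inference Module uses an LSTM to preserve temporal continuity:

    # Infer environment probabilities while preserving temporal information with an LSTM
    lstm_out, _ = self.env_lstm(features)     # (B, L, hidden_dim)
    env_probs = self.env_assigner(lstm_out)   # (B, L, num_envs)


This encourages adjacent time steps to be assigned to similar environments.

**Label Propagation**: To respect the adjacency of time-series data and remove noise, a majority vote is taken over the environments assigned to neighboring instances. In the E-step, each instance is first assigned the environment with the lowest error, and the environment labels are then smoothed with temporal adjacency taken into account.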
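
The two-part E-step described above can be sketched with NumPy. The window radius, the tie-breaking rule (lowest environment index wins), and the function names `e_step` and `propagate_labels` are assumptions for illustration, not details taken from the paper or from this project's code.

```python
import numpy as np


def e_step(per_env_errors: np.ndarray) -> np.ndarray:
    """Assign each instance to the environment with the lowest error. Input shape: (N, E)."""
    return per_env_errors.argmin(axis=1)


def propagate_labels(labels: np.ndarray, num_envs: int, radius: int = 1) -> np.ndarray:
    """Replace each label with the majority label among its temporal neighbours."""
    smoothed = np.empty_like(labels)
    for t in range(len(labels)):
        window = labels[max(0, t - radius): t + radius + 1]
        counts = np.bincount(window, minlength=num_envs)
        # argmax returns the first maximum, so ties go to the lowest index
        smoothed[t] = counts.argmax()
    return smoothed


# Rows are consecutive time steps, columns are per-environment errors
errors = np.array([
    [0.1, 0.9],
    [0.2, 0.8],
    [0.7, 0.3],  # isolated switch to environment 1
    [0.1, 0.6],
    [0.9, 0.2],
    [0.8, 0.1],
    [0.7, 0.3],
])
raw_labels = e_step(errors)
smooth_labels = propagate_labels(raw_labels, num_envs=2, radius=1)
print(raw_labels)     # [0 0 1 0 1 1 1]
print(smooth_labels)  # [0 0 0 1 1 1 1]
```

The raw assignment flips back and forth around the boundary, while the smoothed labels form two contiguous runs, which is the temporal adjacency structure the text refers to.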

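### 2. Environment Diversity

A diversity loss keeps the environments from becoming too similar:

    env_entropy = -torch.sum(
        env_probs * torch.log(env_probs + 1e-8), dim=-1
    )
    diversity_loss = -0.01 * torch.mean(env_entropy)  # maximize entropy


Maximizing the entropy keeps the assignment from collapsing onto a single environment and so maintains diversity.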
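
To see which way this term pushes, the sketch below evaluates the same expression on a uniform and on a sharply peaked assignment. The wrapper function `diversity_loss` and the example tensors are illustrative; the 0.01 weight is the one used in the snippet above.

```python
import torch


def diversity_loss(env_probs: torch.Tensor, weight: float = 0.01) -> torch.Tensor:
    """Negative mean entropy of the environment probabilities, scaled by weight."""
    entropy = -torch.sum(env_probs * torch.log(env_probs + 1e-8), dim=-1)
    return -weight * entropy.mean()


# Shape (B=1, L=4, num_envs=3) in both cases
uniform = torch.full((1, 4, 3), 1.0 / 3.0)
peaked = torch.tensor([[[0.98, 0.01, 0.01]] * 4])

uniform_loss = diversity_loss(uniform)
peaked_loss = diversity_loss(peaked)
print(uniform_loss.item(), peaked_loss.item())
```

The uniform assignment gives the lower (more negative) loss, about -0.01 · ln 3, so minimizing this term moves probability mass away from a single dominant environment. Other terms in the objective have to supply the pressure toward confident assignments.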

### 3. Surrogate Loss

This term accounts for the uncertainty caused by unobserved variables:

    pred_variance = torch.var(predictions, dim=1, keepdim=True)
    surrogate_loss = -self.foil_lambda * torch.mean(pred_variance)


It increases the variance of the predictions to express that uncertainty.

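## Future Work

### 1. Automatically Determining the Number of Environments

The number of environments is currently a hyperparameter. Future work could determine it automatically from the data.

### 2. Hierarchical Environment Structure

Beyond a flat environment classification, a hierarchical environment structure could be modeled.

### 3. Combination with Meta-Learning

Combining FOIL with meta-learning could improve fast adaptation.

### 4. Better Interpretability

Making the environment inference results interpretable would allow their relationship to actual market conditions to be analyzed.

### 5. Comparison with Other Normalization Techniques

The effect of combining FOIL with the normalization techniques of RevIN and Nonstationary Transformer could be studied in more depth.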

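## Conclusion

FOIL is an effective framework for addressing Out-of-Distribution generalization in time-series forecasting. Its main contributions are summarized below.

### Core Contributions

1. **Reframing distribution shift in time-series data as an OOD generalization problem**
   - OOD generalization is achieved by learning prediction rules that are invariant across multiple environments (domains)

2. **Identifying and resolving the reasons standard Invariant Learning cannot be applied to time-series data**
   - Challenge 1: In time series, there are always unobserved variables that directly affect the target variable
   - Challenge 2: Time-series data is usually collected without explicit environment labels
   - By defining a hypothetical, deterministically predictable $Y_{suf}$, the paper shows that invariant learning is theoretically applicable

3. **A solution built from three modules**
   - **Label Decomposing Component (CLD)**: approximates the sufficiently predictable $Y_{suf}$ from the observed Y
   - **Environment Inference Module (MTEI)**: infers temporal environments based on the representation learned by MTIL
   - **Time-Series Invariant Learning Module (MTIL)**: learns an invariant representation across the environments inferred by MTEI

### Summary of Experimental Results

- Consistent and significant improvements across all datasets and prediction lengths (MSE improves by up to 85%)
- More than 10% average improvement over existing distribution shift methods
- The ablation study demonstrates the effectiveness of each module
- The inferred environments are shown to be reasonable (ILI dataset case study)

### Practical Value

FOIL is designed as a model-agnostic framework, so it can be integrated easily into a variety of time-series forecasting models. Like RevIN, it is added at the preprocessing stage of an existing model. In the experiments, models with FOIL were more robust in OOD scenarios, and environment inference made it possible to adjust predictions to market conditions.

Future research is expected to improve FOIL's performance further and extend it to a wider range of time-series forecasting problems.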

## References

1. **Liu, H., et al.** "Time-Series Forecasting for Out-of-Distribution Generalization Using Invariant Learning" (ICML 2024).
   - [Paper (arXiv:2406.09130)](https://arxiv.org/html/2406.09130v1)
   - [Official GitHub repository](https://github.com/AdityaLab/FOIL)
   - [ICML 2024 Proceedings](https://proceedings.mlr.press/v235/liu24ae.html)

2. Arjovsky, M., et al. "Invariant Risk Minimization" (2020)

3. Related work on Domain Generalization

---

### JeTech Lab

We quickly integrate recent research such as FOIL into our projects and carry it through to real trading performance.

OOD generalization is a critical problem in financial time-series forecasting. Market environments change constantly, and a model trained on past data has to keep working in future environments that differ from the ones it has seen.

We continue to research FOIL so that it can deliver stable predictions across diverse market environments.