---
title: "Quickstart: rerun the experiment in 5 minutes"
date: 2026-06-28
type: guide
domain: istio
tags: [istio, envoy, graceful-termination, haproxy]
---


> [!abstract]
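> This is a command sequence plus verification checks for **reopening and quickly reproducing** the homelab Istio graceful-termination experiment. The key conclusion in one line: every reproduction is the **same 3-step loop** of *toggle the mode → create a termination event → measure what happens to the requests in flight at that moment*, and the difference between the current and improved expect lines is the evidence for the hypothesis. The mechanism is explained in [big picture](/public/istio/gt__src-w1-big-picture.html), and the canonical scenario definitions (including S2) are in [test scenarios](/public/istio/gt__src-w5-test-scenarios.html).

**Target environment:** homelab k8s v1.30.6 (master1/worker1/worker2) + HAProxy on the .211 node + Istio IGW. **Audience:** someone who has already run this experiment once and wants to rerun it. **Scope:** reproduction commands, verification, and pitfalls only; the mechanism behind the results lives in the linked documents. **Prerequisites:** Envoy drain, k8s preStop/grace-period, HAProxy backend health.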

---

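## 0. Why this experiment exists: the problem to solve

Pods disappearing is routine in Kubernetes (rollout, scale-down, eviction). The question is **what happens to requests that a pod was already processing (in-flight requests) during the short window in which that pod dies**. With a naive shutdown, in-flight requests are cut off by an RST (TCP reset) or a stream CANCEL, and the client sees a 5xx or a non-zero exit code. This is "ungraceful termination", and in high-traffic production it leaks out as a handful of requests silently failing on every deployment.

The prescription for graceful termination is simple: **announce "I am no longer accepting requests" first (drain), secure enough time to finish the requests already accepted (delayed termination), and only then actually die.** This experiment measures whether that prescription *actually* works by contrasting two modes:

| Mode | Behavior | Fate of in-flight requests |
|---|---|---|
| **current** (broken) | abrupt shutdown | cut off by RST/CANCEL → 5xx or exit≠0 |
| **improved** (graceful) | drain + delayed termination | run to completion → 200, exit=0 |

This document condenses that contrast into "which commands, in which order, with which expectations". In other words, it is the *procedure that produces the conclusion*; *why that conclusion follows* is covered in [big picture](/public/istio/gt__src-w1-big-picture.html) and [test scenarios](/public/istio/gt__src-w5-test-scenarios.html).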

---

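## 1. The one-page mental model: the 3-step loop behind every reproduction

ANCHOR: **This guide runs three scenarios, but they share one skeleton: toggle → termination event → measure.** The scenarios only change the loop's *variables*: ① the kind of termination event (single pod kill vs rollout), ② the shape of the request (one-shot long request vs continuous traffic vs streaming). Once you understand one scenario, you only need the deltas for the rest.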

```mermaid
flowchart LR
  toggle["1. TOGGLE<br/>current ↔ improved"] --> inflight["2a. start in-flight<br/>request(s)"]
  inflight --> kill["2b. kill / rollout<br/>(termination event)"]
  kill --> measure["3. MEASURE<br/>http_code · exit · chunks"]
  measure --> verdict{"current vs improved<br/>expect lines differ?"}
  verdict -->|differ| proven["hypothesis supported"]
  verdict -->|same| broken["experiment is broken"]
```

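Why this structure: **what we measure is the fate of "requests that straddle the moment of termination"**. So the request is started *first* to put it in flight (2a), the pod is killed *afterwards* (2b), and we observe how the request ends while the pod is dying (3). If the order is reversed (for example, kill first and then send the request), there is nothing left to measure. The current and improved expect lines must also come out **different** for the result to count as evidence that the prescription works; if they are identical, the experiment is broken.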
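The verdict step of the loop can be sketched as a small shell helper. This is a minimal sketch under assumptions, not part of the experiment's scripts: the function name `gt_verdict` is hypothetical, and it assumes you kept one `curl.out` per mode in the format the S1 block writes (`http=...` and `exit=...` lines). Only the http code and the curl exit code are compared, since `time_total` varies a little between runs.

```bash
# Hypothetical helper (not in the repo): compare the result of a current run
# with the result of an improved run and report whether they differ.
gt_verdict() {
  local cur_file=$1 imp_file=$2 cur imp
  # Keep only the http=NNN and exit=N tokens from each curl.out.
  cur=$(grep -oE '(http|exit)=[0-9]+' "$cur_file" | tr '\n' ' ')
  imp=$(grep -oE '(http|exit)=[0-9]+' "$imp_file" | tr '\n' ' ')
  if [ "$cur" = "$imp" ]; then
    echo "identical: experiment is suspect"
    return 1
  fi
  # Strip the trailing space left by tr before printing.
  echo "differs: current=[${cur% }] improved=[${imp% }]"
}
```

A call such as `gt_verdict <ART_current>/curl.out <ART_improved>/curl.out` returns non-zero when the two modes produced the same outcome, which is the "experiment is broken" case described above.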

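### Variable isolation: replicas determines whether the measurement can be trusted

The loop has one hidden premise: the measurement only holds if **traffic actually reaches the pod being killed**. HAProxy uses `balance roundrobin`, so with replicas=2 the curl request can land on the other, still-alive worker pod and get a normal response (the variable you intervened on disappears, so the hypothesis cannot be tested; see §6 Q1). Therefore:

- **S1/S4 (replicas=1):** traffic is forced onto the single pod to be killed, so the termination impact is measured head-on.
- **S3 (replicas=2):** intentional. Measuring rollout disruption (connection stability while several pods are replaced in sequence) requires more than one pod.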

---

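## 2. Mechanism: why a mode switch moves "three things together"

One-line model of the switch: **a mode = {IGW manifest, HAProxy cfg, mode label} in a mutually consistent state**, and a toggle swaps all three at once. In practice the mode label is brought into line by removing the pods that still carry the old label. If any one of the three is missed, the state drifts and the results are contaminated.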

```mermaid
flowchart LR
  op[operator] --> k8s["K8s: apply 20-current<br/>or 21-improved IGW"]
  op --> ha["node .211: install + reload<br/>haproxy cfg (current/improved)"]
  op --> del["delete old-mode pods<br/>grace-period=5"]
  k8s --> done[rollout status OK]
  ha --> done
  del --> done
```

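| Target | current | improved | Why it must change |
|---|---|---|---|
| IGW manifest | `manifests/20-igw-current.yaml` | `manifests/21-igw-improved.yaml` | Defines the pod's preStop/drain/grace behavior itself; this is the core of the prescription |
| HAProxy cfg (.211) | `haproxy/haproxy-current.cfg` | `haproxy/haproxy-improved.cfg` | Health checks and connection handling at the L7 front end. Termination is only clean when this matches the pod's behavior |
| Old pods | force-delete `mode!=current` | force-delete `mode!=improved` | Even after the new manifest is applied, if old-mode pods do not die because of the deadlock (§5), the old behavior lingers |

**Why old pods must be killed by hand.** This is the least obvious part. With required anti-affinity + maxUnavailable=0 + N(pods)=N(nodes), the new ReplicaSet's pods have no free slot and stay Pending, while the old ReplicaSet's pods cannot be terminated because maxUnavailable=0 → **deadlock** (§6 Q3). Force-deleting with `--grace-period=5` frees a slot and unblocks the rollout.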

```bash
# switch to improved mode
kubectl --context homelab apply -f manifests/21-igw-improved.yaml
scp haproxy/haproxy-improved.cfg homelab:/tmp/haproxy.cfg
ssh homelab "scp /tmp/haproxy.cfg jinsoo@203.0.113.211:/tmp/ && \
  ssh jinsoo@203.0.113.211 'sudo install -m 0644 /tmp/haproxy.cfg /etc/haproxy/haproxy.cfg && sudo systemctl reload haproxy'"

# switch to current mode (the reverse): same pattern as improved, only the cfg/manifest differ
kubectl --context homelab apply -f manifests/20-igw-current.yaml
scp haproxy/haproxy-current.cfg homelab:/tmp/haproxy.cfg
ssh homelab "scp /tmp/haproxy.cfg jinsoo@203.0.113.211:/tmp/ && \
  ssh jinsoo@203.0.113.211 'sudo install -m 0644 /tmp/haproxy.cfg /etc/haproxy/haproxy.cfg && sudo systemctl reload haproxy'"

# force-terminate old-mode pods (avoids the anti-affinity deadlock); replace <TARGET_MODE> with current or improved
kubectl --context homelab -n service-a delete pod \
  $(kubectl --context homelab -n service-a get pod -l app=service-a-igw \
    -o jsonpath='{range .items[?(@.metadata.labels.mode!="<TARGET_MODE>")]}{.metadata.name} {end}') \
  --grace-period=5

# wait for the rollout to finish
kubectl --context homelab -n service-a rollout status deploy/service-a-igw --timeout=180s
```
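The forced-delete step above selects old-mode pods with a jsonpath filter. A label selector with `!=` expresses the same intent more compactly; the sketch below only builds that selector string. The function name `old_mode_selector` is hypothetical, and the `app=service-a-igw` and `mode` labels are taken from the commands above. One behavioral difference to keep in mind: a `mode!=<value>` selector also matches pods that have no `mode` label at all.

```bash
# Hypothetical helper (not in the repo): build the label selector that matches
# IGW pods which are NOT in the target mode.
old_mode_selector() {
  case "$1" in
    current|improved)
      printf 'app=service-a-igw,mode!=%s\n' "$1"
      ;;
    *)
      echo "usage: old_mode_selector current|improved" >&2
      return 2
      ;;
  esac
}

# Usage sketch (not executed here; requires the homelab context):
#   kubectl --context homelab -n service-a delete pod \
#     -l "$(old_mode_selector improved)" --grace-period=5
```

With a selector, a run that finds no old-mode pods should report that nothing matched, whereas the jsonpath form hands `kubectl delete pod` an empty name list.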

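> `reload` ≠ `restart`: HAProxy also holds the 6443 control-plane frontend, so `restart` briefly drops those connections. Always use `reload` for cfg changes (last row of the §5 table).

### Preflight check (30 seconds): four premises before running the loop

This experiment rests on four things being healthy: the cluster, istiod, the HAProxy backends, and the images. If any one is missing the measurement is contaminated, so check them before running the loop.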

```bash
# is the cluster alive?
kubectl --context homelab get nodes -o wide
# expect: master1/worker1/worker2 all Ready, v1.30.6

# is istiod up?
kubectl --context homelab -n istio-system get pod
# expect: istiod-* 1/1 Running

# are the HAProxy backends healthy?
ssh homelab "ssh jinsoo@203.0.113.211 'echo show stat | sudo socat /run/haproxy/admin.sock stdio'" \
  | awk -F, '/istio-http-backend/{print $2"="$18}'
# expect: master1=UP, worker1=UP, worker2=UP (or some DOWN, depending on pod placement)

# are the images present on all 3 nodes?
for n in 212 213 214; do
  ssh homelab "ssh jinsoo@203.0.113.$n 'sudo crictl images 2>/dev/null | grep service-a'" | head -2
done
# expect: service-a-backend:dev and service-a-hc:dev both listed on every node
```
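For the HAProxy check above, it can help to print only the servers that are not `UP`, so that an unexpected `DOWN` stands out. The following is a minimal sketch, assuming the standard `show stat` CSV layout in which field 2 is the server name and field 18 is the status (the same fields the one-liner above prints). The function name `haproxy_not_up` is hypothetical.

```bash
# Hypothetical helper (not in the repo): read `show stat` CSV on stdin and
# print name=status for every istio-http-backend server whose status is not UP.
# The aggregate BACKEND row is skipped so only individual servers are listed.
haproxy_not_up() {
  awk -F, '$1 ~ /istio-http-backend/ && $2 != "BACKEND" && $18 != "UP" { print $2 "=" $18 }'
}

# Usage sketch (not executed here; requires access to the .211 node):
#   ssh homelab "ssh jinsoo@203.0.113.211 'echo show stat | sudo socat /run/haproxy/admin.sock stdio'" \
#     | haproxy_not_up
```

Empty output means every server is `UP`. As noted in the expect line above, some `DOWN` entries can be legitimate depending on where the pods are placed, so the output is something to judge, not an automatic failure.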

If any of these is missing, see the recovery procedure in the [runbook](/public/istio/gt__src-runbook.html).

---

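## 3. Reproducing the scenarios: same loop, different deltas

The three scenarios are instances of the §1 loop. Each one is listed below with its *delta* (termination event × request shape × replicas) and the complete commands, ready to run as-is. S2 is the timing-verification variant of S1's improved drain polling FSM; it is absorbed into the S1 expect lines, so it has no separate block here and is covered in [test scenarios](/public/istio/gt__src-w5-test-scenarios.html).

| Scenario | Termination event | Request shape | replicas | Measured values |
|---|---|---|---|---|
| **S1** | single pod kill | one-shot long request | 1 | http_code · time_total |
| **S3** | rollout restart | continuous traffic (10 workers) | 2 | 5xx count · errors (RST) |
| **S4** | single pod kill | HTTP/2 streaming | 1 | chunk count · exit |

### S1: reproducing the long-request RST in current mode (replicas=1)

Start one request (`/sleep?seconds=60`) so it stays in flight for 60 seconds, then kill its pod 5 seconds later. In current mode the in-flight request is cut off with a 502 (t=~8.25s); in improved mode it runs to completion (http=200, t=~60s).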

```bash
kubectl --context homelab -n service-a scale deploy/service-a-igw --replicas=1
# … wait for rollout status …

ART=tests/artifacts/$(date +%Y%m%d-%H%M%S)/S1-rerun && mkdir -p $ART
TARGET=$(kubectl --context homelab -n service-a get pod -l app=service-a-igw -o name | head -1 | cut -d/ -f2)
kubectl --context homelab -n service-a logs $TARGET -c hc --follow --tail=0 > $ART/hc.log 2>&1 &
HC_PID=$!
(curl -sS --max-time 90 --no-buffer --cacert haproxy/certs/ca.pem \
   --resolve example.local:443:203.0.113.211 \
   -w '\n---\nhttp=%{http_code} t=%{time_total}\n' \
   'https://example.local/sleep?seconds=60' > $ART/curl.out; echo "exit=$?" >> $ART/curl.out) &
CURL_PID=$!
sleep 5
kubectl --context homelab -n service-a delete pod $TARGET --grace-period=210 --wait=false
wait $CURL_PID
kill $HC_PID 2>/dev/null
cat $ART/curl.out
# expect: http=502 t=~8.25s exit=0  (current mode, measured S1 value; exit=0 because curl runs without --fail)
# expect: http=200 t=~60s exit=0 (improved mode)
```

### S3: continuous traffic + rollout restart (replicas=2)

Ten worker loops hit `/fast` for 90 seconds while `rollout restart` rolls every pod. Rather than a single kill, this checks *whether connections leak during a deployment*. current shows ~9 errors caused by RST; improved shows 0.

```bash
kubectl --context homelab -n service-a scale deploy/service-a-igw --replicas=2
# … wait for rollout …

ART=tests/artifacts/$(date +%Y%m%d-%H%M%S)/S3-rerun && mkdir -p $ART
T_END=$(($(date +%s) + 90))
for i in $(seq 1 10); do
  (while [ $(date +%s) -lt $T_END ]; do
    curl -sS --cacert haproxy/certs/ca.pem --resolve example.local:443:203.0.113.211 \
      -o /dev/null -w "$(date +%s.%3N) %{http_code} %{time_total}\n" \
      https://example.local/fast 2>>$ART/curl-err.log
  done) >> $ART/curl.tsv &
done
sleep 10
kubectl --context homelab -n service-a rollout restart deploy/service-a-igw
wait

awk '$2~/^[0-9]+$/{c[$2]++} END{for(k in c) print k": "c[k]}' $ART/curl.tsv | sort
echo "errors: $(wc -l < $ART/curl-err.log)"
# current expect: 5xx=0, errors=~9 (connection RST)
# improved expect: 5xx=0, errors=0
```

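> In S3, 5xx=0 together with non-zero errors is the key clue: the failure is not an HTTP response code; **the connection itself is cut by an RST**, so curl reports an error. With graceful termination this RST disappears.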
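The two S3 numbers (5xx count and error count) can be produced in one line for a side-by-side comparison of modes. This is a minimal sketch under assumptions: the function name `s3_summary` is hypothetical, and it assumes the file formats the S3 block writes, namely `curl.tsv` with `timestamp http_code time_total` per line and `curl-err.log` with one curl error message per line.

```bash
# Hypothetical helper (not in the repo): summarize one S3 run.
#   $1 = curl.tsv      (timestamp, http_code, time_total per line)
#   $2 = curl-err.log  (one curl error message per line)
s3_summary() {
  local tsv=$1 errlog=$2 fivexx errors
  # Count responses whose http_code is 5xx; n+0 prints 0 when none matched.
  fivexx=$(awk '$2 ~ /^5[0-9][0-9]$/ { n++ } END { print n + 0 }' "$tsv")
  # Count error lines, mirroring the `wc -l` in the S3 block.
  errors=$(awk 'END { print NR }' "$errlog")
  echo "5xx=$fivexx errors=$errors"
}
```

Running `s3_summary $ART/curl.tsv $ART/curl-err.log` after each mode gives two lines that can be compared directly against the expect comments in the S3 block.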

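### S4: streaming (replicas=1)

While receiving an HTTP/2 stream (`/stream?seconds=60&interval=1`, one chunk per second), kill the pod 8 seconds in. In current mode the stream is CANCELed around chunk 11/12 and curl exits with 92; in improved mode nearly all 60 chunks arrive and curl exits with 0.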

```bash
# Premise: replicas=1 (traffic must reach the pod you kill for chunks=~12/exit=92 to reproduce; see §6 Q1 and
#   [test scenarios](/public/istio/gt__src-w5-test-scenarios.html)). S3 ended with replicas=2, so scale back to 1 first.
kubectl --context homelab -n service-a scale deploy/service-a-igw --replicas=1
kubectl --context homelab -n service-a rollout status deploy/service-a-igw --timeout=180s

ART=tests/artifacts/$(date +%Y%m%d-%H%M%S)/S4-rerun && mkdir -p $ART
TARGET=$(kubectl --context homelab -n service-a get pod -l app=service-a-igw -o name | head -1 | cut -d/ -f2)
(curl -sS --max-time 90 --no-buffer --cacert haproxy/certs/ca.pem \
   --resolve example.local:443:203.0.113.211 \
   'https://example.local/stream?seconds=60&interval=1' > $ART/curl.body 2>$ART/curl.err
 echo "exit=$?" > $ART/curl.exit) &
sleep 8
kubectl --context homelab -n service-a delete pod $TARGET --grace-period=210 --wait=false
wait

grep -c '^chunk' $ART/curl.body
cat $ART/curl.exit $ART/curl.err
# current expect: chunks=~12, exit=92 (HTTP/2 stream CANCEL @ chunk 11/12s)
# improved expect: chunks=59/60, exit=0
```
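The two S4 readings above (chunk count and curl exit code) can be folded into a single line that is easy to paste next to the expect comments. This is a minimal sketch under assumptions: the function name `s4_result` is hypothetical, and it assumes the `curl.body` and `curl.exit` files the S4 block writes, with body lines that start with `chunk`.

```bash
# Hypothetical helper (not in the repo): print "chunks=N exit=M" for one S4 run.
#   $1 = curl.body (streamed response, one "chunk ..." line per chunk)
#   $2 = curl.exit (a single "exit=N" line)
s4_result() {
  local body=$1 exitfile=$2 chunks code
  # grep -c prints 0 but returns non-zero when nothing matches, hence || true.
  chunks=$(grep -c '^chunk' "$body" || true)
  code=$(sed -n 's/^exit=//p' "$exitfile")
  echo "chunks=$chunks exit=$code"
}
```

For example, `s4_result $ART/curl.body $ART/curl.exit` on a current-mode run would be expected to print something close to `chunks=12 exit=92`, per the measured values quoted above.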

---

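## 4. Where results land and how to read them

Artifacts are collected in one directory per scenario. The client-side files (curl.*) are the *result*; hc.log/envoy.log/stat.csv are the *timeline explaining why that result occurred*.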

```
tests/artifacts/<YYYYMMDD-HHMMSS>/<scenario>/
  ├── curl.out / curl.body / curl.err      ← client-side results
  ├── hc.log                               ← FSM transitions (grep for event=transition lines)
  ├── envoy.log                            ← Envoy access log + drain lines
  ├── stat.csv / stat-timeline.csv         ← HAProxy show stat time series
  └── run.log                              ← scenario execution timeline
```

**The anchor for alignment is the transition timestamp in hc.log.** The first column the 6-events collection script looks at is the timestamp of the `event=transition from=... to=... reason=...` lines in `hc.log` (§6 Q2), which marks the start of event 1 (health_fail). envoy.log and stat.csv are then aligned against this timestamp.

```bash
# FSM transitions only
grep transition <ART>/hc.log

# HAProxy backend status changes
awk -F, '$2!=""{c[$2","$3]++} END{for(k in c) print k": "c[k]}' <ART>/stat.csv

# time of the first 503
grep -m1 503 <ART>/hc.log

# merge the 6 events (pass the whole artifacts dir)
bash tests/05-collect-timestamps.sh <ART>
```

---

## 5. Common problems

| Symptom | Cause | Fix |
|---|---|---|
| `kubectl rollout status` timeout | deadlock from required anti-affinity + maxUnavailable=0 + N(pods)=N(nodes) | Force-delete old-mode pods: `kubectl delete pod -l app=service-a-igw,mode=<old>` (mechanism details in the [runbook](/public/istio/gt__src-runbook.html)) |
| Pod `0/2 Running`, Envoy SDS errors | `workload-spiffe-uds` emptyDir missing | Check volumes/volumeMounts in `manifests/2X-igw-*.yaml`; all three of workload-socket/credential-socket/workload-certs are required |
| `ErrImagePull` from `203.0.113.2:5000` | the node's containerd attempts HTTPS | Sideload with `ssh homelab 'docker save ... && scp + ctr -n k8s.io image import'` |
| HAProxy backend `master1=DOWN check=L4TOUT` | no IGW pod on master1, or the pod's hc container is not ready | Check EndpointSlice readiness → then check the image pull or the readinessProbe |
| HAProxy 6443 briefly drops | `systemctl restart haproxy` restarts every frontend | Use `systemctl reload haproxy` (when changing a drop-in); reload is sufficient for this experiment |

---

## 6. Recall quiz

<details>
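<summary>Q1. How does the result change if S1 is run with replicas=2?</summary>

With HAProxy `balance roundrobin`, the curl traffic can go to the other worker pod → that pod is alive, so the response is normal. **The variable you intervened on disappears and the hypothesis cannot be tested.** That is why S1, S2, and S4 use replicas=1.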

</details>

<details>
<summary>Q2. Which column does `tests/05-collect-timestamps.sh` look at first?</summary>

The timestamp of the `event=transition from=... to=... reason=...` lines in `hc.log`. This is the starting point of the 6 events (event 1: health_fail). After that, envoy.log, stat.csv, and the rest are aligned against the same timestamp.

</details>

<details>
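<summary>Q3. What pattern keeps old pods from dying during a mode switch?</summary>

`anti-affinity required + maxUnavailable=0 + N=N nodes`. New RS pods have no free slot and stay Pending, and old RS pods cannot terminate because maxUnavailable=0 → deadlock. To resolve it: untaint master1 (one extra slot), force-delete the old pods, or change maxUnavailable to 1 in the manifest.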

</details>

---

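## Key takeaways
- **Every reproduction is one loop: toggle → termination event → in-flight measurement.** Scenarios differ only in the delta of (termination event × request shape × replicas).
- **Mode switch = three things kept consistent:** IGW manifest (`20`/`21`) + HAProxy cfg swap + force-deleting old-mode pods (§5 table). Miss one and the old behavior lingers.
- **replicas is the variable-isolation device:** S1/S4 use replicas=1 to force traffic onto the pod being killed; S3 uses replicas=2 to measure rollout disruption.
- **The current↔improved expect difference is the evidence:** S1: 502/8.25s↔200/60s, S3: errors~9↔0, S4: chunks~12·exit92↔59·exit0.
- **Anchor for aligning results:** artifacts land in `tests/artifacts/<ts>/<scenario>/`, and the transition timestamp in `hc.log` is the reference point for aligning the 6 events.
- Canonical scenario definitions (including S2) are in [test scenarios](/public/istio/gt__src-w5-test-scenarios.html); recovery and the deadlock mechanism are covered in depth in the [runbook](/public/istio/gt__src-runbook.html).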

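## What you might be missing
- **S2 is absent from the body on purpose.** S2 is the timing-verification variant of S1's improved drain polling FSM, so it is absorbed into the S1 expect line (http=200/t=~60s). If you need to reproduce it separately, follow [test scenarios](/public/istio/gt__src-w5-test-scenarios.html).
- **Forgetting to restore replicas breaks the reproduction.** Running S4 right after S3 (replicas=2) sends traffic to the other live pod, so chunks=~12/exit=92 does not appear. That is why the `scale --replicas=1` at the top of the S4 block is mandatory.
- **`reload` vs `restart`.** HAProxy also holds the 6443 frontend, so `restart` briefly drops control-plane connections. Always use `reload` for cfg changes (last row of the §5 table).
- **Looking only at 5xx misses graceful failures.** In S3, even with 5xx=0, RSTs occur at the connection layer and show up only as curl errors. Verifying graceful termination means checking exit≠0 and connection errors as well as the http code.
- **Force deletion is not the only way out of the deadlock.** Untainting master1 to add one slot, or changing maxUnavailable to 1 in the manifest, also works (§6 Q3).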