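From the introduction to cgroup, we know that by setting values under /sys/fs/cgroup/ and launching a program with cgroup-tools while assigning it to a cgroup, we can control how much of the system's resources a process uses.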

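Background

A Go program runs in a k8s environment. I recorded a start timestamp and an end timestamp around a single line of code and found that the p99 latency was sometimes very high: normally 1-3 ms, but 50-90 ms in extreme cases. I was baffled. After guessing at various causes and reading around, I concluded that runtime.GOMAXPROCS was probably not set correctly. After setting it to 1, the extreme-latency cases dropped noticeably.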

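Why

Three conditions must all hold for this problem to appear:

It is a Go program that uses the default GOMAXPROCS (this applies to Go releases before 1.25; newer runtimes take the cgroup CPU limit into account by default)
It runs in a container environment such as k8s or docker
The host machine has multiple CPU cores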

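What is GOMAXPROCS

Recall the GPM model of Go concurrency:

G stands for goroutine, i.e. the goroutines created by the user
P stands for Logical Processor, a concept similar to a CPU core; it controls how many Ms can run Go code concurrently
M is an operating system thread. Most of the time the number of Ms running Go code equals the number of Ps, because an M must hold a P to execute goroutines; the runtime may create additional Ms, for example when a thread blocks in a system call

GOMAXPROCS in the Go runtime is the number of Ps. By default it is derived from the CPU count that the runtime obtains by directly invoking the Linux system call sched_getaffinity():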

func getproccount() int32 {
	// This buffer is huge (8 kB) but we are on the system stack
	// and there should be plenty of space (64 kB).
	// Also this is a leaf, so we're not holding up the memory for long.
	// See golang.org/issue/11823.
	// The suggested behavior here is to keep trying with ever-larger
	// buffers, but we don't have a dynamic memory allocator at the
	// moment, so that's a bit tricky and seems like overkill.
	const maxCPUs = 64 * 1024
	var buf [maxCPUs / 8]byte
	r := sched_getaffinity(0, unsafe.Sizeof(buf), &buf[0])
	if r < 0 {
		return 1
	}
	n := int32(0)
	for _, v := range buf[:r] {
		for v != 0 {
			n += int32(v & 1)
			v >>= 1
		}
	}
	return n
}
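To see what the runtime has picked on a given machine, the public API can be queried directly. The following is a minimal sketch using only runtime.NumCPU and runtime.GOMAXPROCS from the standard library; setMaxProcs is a hypothetical helper name introduced here for illustration.

```go
package main

import (
	"fmt"
	"runtime"
)

// setMaxProcs is a hypothetical helper: it sets GOMAXPROCS to n and
// returns both the previous value and the value now in effect.
func setMaxProcs(n int) (prev, cur int) {
	prev = runtime.GOMAXPROCS(n) // returns the previous setting
	cur = runtime.GOMAXPROCS(0)  // an argument of 0 only queries the value
	return prev, cur
}

func main() {
	// NumCPU reports the logical CPUs usable by this process.
	fmt.Println("NumCPU:", runtime.NumCPU())
	fmt.Println("default GOMAXPROCS:", runtime.GOMAXPROCS(0))

	// Force a single P, as in the experiment described in this post.
	prev, cur := setMaxProcs(1)
	fmt.Println("previous:", prev, "current:", cur)
}
```

Running this inside and outside a CPU-limited container is a quick way to check whether the default matches the limit you expect.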

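What is different about the k8s container environment

Kubernetes, like docker --cpus, uses CFS Bandwidth Control to limit CPU usage. Under the hood, the cgroup settings cpu.cfs_period_us and cpu.cfs_quota_us cap the CPU time available to the processes in a Pod, so that the Pod appears to have only 2 CPUs, or half a CPU.

When a Go program runs in a pod, sched_getaffinity() is unaware that cgroups limit the pod's CPU and still returns the real number of CPUs on the host.

So suppose the Pod has only one CPU while the host has 8. By default the Go program believes it has 8 CPUs and creates the same number of Ps. The runtime then schedules and context-switches far more than it should, and with up to 8 threads running in parallel the process can use up its CFS quota early in a period and be throttled until the next one. The result is a very high p99 latency.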

How to fix it

The open-source library go.uber.org/automaxprocs solves this problem. Its core idea is to read cpu.cfs_period_us and cpu.cfs_quota_us itself and compute a suitable CPU value.
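The core calculation can be sketched with the standard library alone. This is an illustration of the idea, not the library's actual code: maxProcsFromQuota is a hypothetical helper, and rounding down with a minimum of 1 is an assumption made here for simplicity.

```go
package main

import (
	"fmt"
	"runtime"
)

// maxProcsFromQuota converts a CFS quota and period (both in
// microseconds) into a GOMAXPROCS value. It reports false when no
// limit is set (cgroup v1 uses a quota of -1 for "unlimited").
func maxProcsFromQuota(quotaUs, periodUs int64) (int, bool) {
	if quotaUs <= 0 || periodUs <= 0 {
		return 0, false
	}
	n := int(quotaUs / periodUs) // integer division rounds down
	if n < 1 {
		n = 1 // a fractional CPU still needs at least one P
	}
	return n, true
}

func main() {
	// Example values: a 200% quota, i.e. 2 cores.
	if n, ok := maxProcsFromQuota(200000, 100000); ok {
		runtime.GOMAXPROCS(n)
	}
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
}
```

In a real program the quota and period would come from the cgroup files rather than from constants.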

Its README.md contains the following passage:

Data measured from Uber's internal load balancer. 
We ran the load balancer with 200% CPU quota (i.e., 2 cores):

| GOMAXPROCS         |  RPS      | P50 (ms) | P99.9 (ms) |
| ------------------ | --------- | -------- | ---------- |
| 1                  | 28,893.18 | 1.46     | 19.70      |
| 2 (equal to quota) | 44,715.07 | 0.84     | 26.38      |
| 3                  | 44,212.93 | 0.66     | 30.07      |
| 4                  | 41,071.15 | 0.57     | 42.94      |
| 8                  | 33,111.69 | 0.43     | 64.32      |
| Default (24)       | 22,191.40 | 0.45     | 76.19      |

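As the table shows, GOMAXPROCS has a pronounced effect on tail latency (P99.9 here), while P50 stays below 1.5 ms in every row. This matches what I observed.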

How automaxprocs is implemented

The simplest PoC is just to open the file /sys/fs/cgroup/cpu.max (the cgroup v2 location), but as a library automaxprocs has to handle a variety of setups.
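Such a PoC could look like the sketch below. It assumes a cgroup v2 host where cpu.max holds two fields, the quota (or the word max for "no limit") followed by the period; parseCPUMax is a hypothetical helper and does not handle cgroup v1.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseCPUMax parses the content of a cgroup v2 cpu.max file,
// "<quota> <period>", and returns the CPU limit as quota/period.
// The bool is false when the quota is "max" (no limit).
func parseCPUMax(s string) (float64, bool, error) {
	fields := strings.Fields(s)
	if len(fields) != 2 {
		return 0, false, fmt.Errorf("unexpected cpu.max content: %q", s)
	}
	if fields[0] == "max" {
		return 0, false, nil
	}
	quota, err := strconv.ParseFloat(fields[0], 64)
	if err != nil {
		return 0, false, err
	}
	period, err := strconv.ParseFloat(fields[1], 64)
	if err != nil {
		return 0, false, err
	}
	if period <= 0 {
		return 0, false, fmt.Errorf("invalid period: %v", period)
	}
	return quota / period, true, nil
}

func main() {
	data, err := os.ReadFile("/sys/fs/cgroup/cpu.max")
	if err != nil {
		fmt.Println("cannot read cpu.max:", err)
		return
	}
	cpus, limited, err := parseCPUMax(string(data))
	fmt.Println("cpus:", cpus, "limited:", limited, "err:", err)
}
```

On a machine without cgroup v2 the file is absent and the program only prints the read error.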

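First, it supports both cgroup v1 and cgroup v2.

Second, automaxprocs does not read files directly under /sys/fs; it first reads /proc/self/mountinfo.

So what does /proc/self/mountinfo contain? The manual that ships with Linux, man 5 proc, explains it in detail (search for mountinfo). In short, the library parses mountinfo, finds the entries whose filesystem type is cgroup2 or cgroup, and extracts their mount paths.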
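The parsing step can be sketched as follows, based on the field layout documented in man 5 proc: the mount point is the fifth field, and the filesystem type is the first field after the "-" separator. findCgroupMount is a hypothetical helper, and this sketch ignores the octal escapes that mountinfo uses for special characters in paths.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// findCgroupMount inspects one mountinfo line and returns its mount
// point and filesystem type when the type is cgroup or cgroup2.
func findCgroupMount(line string) (mountPoint, fsType string, ok bool) {
	fields := strings.Fields(line)
	// Six fixed fields come first, then optional fields, then "-".
	for i := 6; i+1 < len(fields); i++ {
		if fields[i] != "-" {
			continue
		}
		fsType = fields[i+1]
		if fsType == "cgroup" || fsType == "cgroup2" {
			return fields[4], fsType, true
		}
		return "", "", false
	}
	return "", "", false
}

func main() {
	f, err := os.Open("/proc/self/mountinfo")
	if err != nil {
		fmt.Println("cannot open mountinfo:", err)
		return
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		if mp, fs, ok := findCgroupMount(scanner.Text()); ok {
			fmt.Println(fs, "mounted at", mp)
		}
	}
}
```

On a cgroup v1 host this prints one line per controller mount, while a cgroup v2 host typically shows a single cgroup2 mount.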

Finally, a small experiment: start a container with the docker command below,

$ docker run -it --cpu-period 100000 --cpu-quota 50000 ubuntu bash

# cd /sys/fs/cgroup/
# cat cpu.max
50000 100000

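Inside the container, cat /proc/self/mountinfo unsurprisingly gives /sys/fs/cgroup as the cgroup path, and cpu.max there holds the value we expect: a quota of 50000 over a period of 100000, i.e. half a CPU.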

Set GOMAXPROCS Properly in Go Program
