We can't find the internet
Attempting to reconnect
### 起因
今天在搭建一个递归服务器, 想要测试一下递归结果, 根据 ECS 来判断相关的递归服务器
结果
结果在进行递归查询的时候, 发现一个问题, 递归携带了 subnet, 查询后再更换 subnet,
递归直接读取了缓存, 而不是带着新的 subnet 去递归.
### 排查
为了对比这个问题,使用了不同的域名来进行递归,发现一个问题,两个域名在递归服务器有
不同的结果. 域名 z.gsmiot.com 在第一次递归后会一直缓存, aws.amazon.com 域名会根
据子网变换每次都会递归.
我对比里查询的结果, 发现了一个区别:
z.gsmiot.com
```text
[root@10-9-104-141 unbound]# dig z.gsmiot.com @127.0.0.1 +subnet=178.24.161.99/16
; <<>> DiG 9.11.4-P2-RedHat-9.11.4-16.P2.el7_8.6 <<>> z.gsmiot.com @127.0.0.1 +subnet=178.24.161.99/16
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 36175
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; CLIENT-SUBNET: 178.24.0.0/16/0
;; QUESTION SECTION:
;z.gsmiot.com. IN A
;; ANSWER SECTION:
z.gsmiot.com. 3600 IN A 8.8.8.8
;; Query time: 994 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Tue Nov 03 11:03:27 CST 2020
;; MSG SIZE rcvd: 67
```
aws.amazon.com
```text
[root@10-9-104-141 unbound]# dig aws.amazon.com @127.0.0.1 +subnet=178.24.161.99/16
; <<>> DiG 9.11.4-P2-RedHat-9.11.4-16.P2.el7_8.6 <<>> aws.amazon.com @127.0.0.1 +subnet=178.24.161.99/16
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 2145
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; CLIENT-SUBNET: 178.24.0.0/16/24
;; QUESTION SECTION:
;aws.amazon.com. IN A
;; ANSWER SECTION:
aws.amazon.com. 300 IN CNAME tp.8e49140c2-frontier.amazon.com.
tp.8e49140c2-frontier.amazon.com. 300 IN CNAME dr49lng3n1n2s.cloudfront.net.
dr49lng3n1n2s.cloudfront.net. 300 IN A 54.230.150.74
;; Query time: 420 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Tue Nov 03 11:03:53 CST 2020
;; MSG SIZE rcvd: 147
```
对比发现了响应的结果有区别, 在`OPT PSEUDOSECTION`这个部分,有一个地方有区别:
```text
AWS 权威
; CLIENT-SUBNET: 178.24.0.0/16/24
ZDNS 权威
; CLIENT-SUBNET: 178.24.0.0/16/0
```
又测试了 subnet 生效的 aws 域名, 递归会根据发现子网这个地
址`178.24.0.0/[A]/[B]`中的 A 和 B 两个 prefix 来做判断, 如果携带的 A 范围较大,
就是用 A, 而 B 范围较大, 就使用 B. 现在猜测这里的 B 可能是由权威返回的.
查询了一下相关文档
[RFC7871 Client Subnet in DNS Queries](https://tools.ietf.org/html/rfc7871#section-7.2.1)
其中在 7.2.1 节中内容:
```text
7.2.1. Authoritative Nameserver
When a query containing an ECS option is received, an Authoritative
Nameserver supporting ECS MAY use the address information specified
in the option to generate a tailored response.
Authoritative Nameservers that have not implemented or enabled
support for the ECS option ought to safely ignore it within incoming
queries, per [RFC6891], Section 6.1.2. Such a server MUST NOT
include an ECS option within replies to indicate lack of support for
it. Implementers of Intermediate Nameservers should be aware,
however, that some nameservers incorrectly echo back unknown EDNS0
options. In this protocol, that should be mostly harmless, as the
SCOPE PREFIX-LENGTH should come back as 0, thus marking the response
as covering all networks.
A query with a wrongly formatted option (e.g., an unknown FAMILY)
MUST be rejected and a FORMERR response MUST be returned to the
sender, as described in [RFC6891], "Transport Considerations".
```
其中一段
`In this protocol, that should be mostly harmless, as the SCOPE PREFIX-LENGTH should come back as 0, thus marking the response as covering all networks.`
这里`SCOPE PREFIX-LENGTH`如果为 0 的话, 响应就会应用到所有网络. 目前猜测 ZDNS 的
权威可能属于这个问题的范围.
查询了一下文档, 找到了 IETF 上有一个相关文档
[A Look at the ECS Behavior of DNS Resolvers](https://www.ietf.org/proceedings/106/slides/slides-106-maprg-a-look-at-the-ecs-behavior-of-dns-resolvers-kyle-schomp-01)
有如下一段内容:
```text
ECS Purpose
• Enable CDN server selection by ADNS based on client subnet
• ECS Option in DNS queries from resolvers to ADNS includes
• Client IP address prefix
• Source prefix length
• ECS Option in DNS responses from ADNS to resolvers includes
• Scope prefix length
```
现在可以推断极大概率是因为这个 scop prefix length 的问题了
### 验证
使用 tcpudump 抓包两次递归服务的递归结果
```shell
tcpdump -i eth0 port 53 -w zcloud.cap
tcpdump -i eth0 port 53 -w aws.cap
```
在 wireshark 中打开看到结果
在响应结果的记录中
- 打开 Domain Name System (response)
- 打开 Addional records
- 打开<Root>: type OPT
- 打开<Option: CSUBNET - Client subnet
对比看到 Scope Netmask 有区别, aws 返回为 24, zdns 返回为 0
问题确认, 是由于权威返回结果的问题导致的