• 流水账

    The day before yesterday I watched Bilibili's New Year's Eve gala and thought it was pretty good, so today I bought some Bilibili stock.

    Ran into two incidents right after getting to work today. In one, an FWS SIN machine failed to come back up after a reboot, probably due to a bad /etc/fstab entry. In the other, after I updated the Hickwall Kafka cluster configuration, the operator restarted the Pods directly, and the whole day went to data replication.

  • Disk Automatically Unmounts Immediately After Mounting

    https://www.bentasker.co.uk/posts/documentation/linux/480-disk-automatically-unmounts-immediately-after-mounting.html

    When it happens, it’s incredibly frustrating - you’ve had a disk replaced on a linux box, the disk has shown up with a different name in /dev, so you edit /etc/fstab and then try to mount the disk.

    The command runs without error, but the disk isn’t mounted and doesn’t appear in df.

    This documentation details the likely cause and how to resolve it.

    If you look in dmesg, you might see something like the following

    [  462.754500] XFS (sdc): Mounting V5 Filesystem
    [  462.857216] XFS (sdc): Ending clean mount
    [  462.871119] XFS (sdc): Unmounting Filesystem
    

    Which, whilst it shows the disk is getting unmounted almost immediately, isn’t otherwise very helpful. It doesn’t tell us why.

    However, if you look in syslog (e.g. /var/log/messages, journalctl or /var/log/syslog) you may well see this logged again with a couple of additional relevant lines

    kernel: XFS (sde): Mounting V5 Filesystem
    kernel: XFS (sde): Ending clean mount
    systemd: Unit cache2.mount is bound to inactive unit dev-sdc.device. Stopping, too.
    systemd: Unmounting /cache2...
    kernel: XFS (sde): Unmounting Filesystem
    systemd: Unmounted /cache2.
    

    We can now see that the erstwhile init system - systemd - decided to unmount the filesystem:

    systemd: Unit cache2.mount is bound to inactive unit dev-sdc.device. Stopping, too.

    The reason for this is that at boot time systemd-fstab-generator generates, in effect, a bunch of dynamic unit files for each mount. From the output above we can tell the disk used to be sdc but is now sde. Despite fstab saying

    /dev/sde      /cache2   xfs   defaults,nofail  0 0
    

    When we issue the command

    mount /cache2
    

    SystemD picks up on the fact that it has an inactive unit file (inactive because the block device has gone away) which should be mounted to that path, decides there’s a conflict and that it knows better, and unmounts your mount again. If you’re in this position, then, you should be able to resolve it, at least temporarily, with a single command:

    systemctl daemon-reload
    

    Keep in mind that if your disk moves back following a reboot, you’ll be back to this point where SystemD decides you can’t have wanted to mount your disk after all.

    SystemD has had a bug open for this since 2015, seemingly still unresolved (it’s certainly still attracting complaints at the time of writing). Rather worryingly, the thread suggests that the above will not always resolve the issue, and that a further workaround may be needed.
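    Independent of that bug, the rename problem itself can be sidestepped by referring to the filesystem by UUID instead of by device node, so the fstab entry stays valid however the kernel enumerates the disks (this is standard practice, not something the post above prescribes; the UUID below is a placeholder):

    ```shell
    # Look up the filesystem UUID (device name as currently enumerated)
    blkid /dev/sde

    # /etc/fstab - the UUID survives renames like sdc -> sde
    UUID=0a3b5e6f-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /cache2  xfs  defaults,nofail  0 0
    ```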

  • Linux Shell Variable Render

    What is the difference between `a='a' echo "$a"` and `a='a' ; echo "$a"`?
    

    My current understanding: it comes down to when the shell expands `$a`. Without the semicolon, `$a` is expanded first and only then is the command executed - `a='a'` is part of that command, a per-command environment assignment.

    The shell expands `$XX` references before executing the current command (unless they appear inside single quotes, in which case they are not expanded).

    With a semicolon or `&&`, there are two separate commands, and by the time the second one runs the variable has already been assigned. Without a semicolon or `&&`, `a='a'` is part of the single command, and since the shell expands variables before executing the command, `$a` has no value yet at expansion time.

    One more note: when `echo` actually executes, `a` does have a value - in `echo`'s environment. It's just that the argument handed to `echo` was already expanded to the empty string "", rather than being a lookup of env(a).
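    The behavior above can be checked directly in a shell (a minimal sketch; `a` is just an example name):

    ```shell
    # Without a semicolon: a='a' is a temporary environment assignment for echo.
    # "$a" is expanded by the shell BEFORE echo runs, so it is still empty.
    unset a
    a='a' echo "no semicolon: [$a]"    # prints: no semicolon: []

    # With a semicolon: two commands; the assignment completes first,
    # so "$a" expands to 'a' when echo runs.
    a='a' ; echo "semicolon: [$a]"     # prints: semicolon: [a]

    # The environment assignment IS visible to the command itself:
    unset a
    a='a' sh -c 'echo "child sees: [$a]"'   # prints: child sees: [a]
    ```

    The last line is the env(a) point above: the child process can expand `$a` from its own environment, but the parent shell had already expanded the double-quoted argument before the command ran.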

  • Org.xerial.snappy.snappy

    java.lang.NoClassDefFoundError: Could not initialize class org.xerial.snappy.Snappy

    Possibly because the /tmp mount point dropped off?
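    That would fit: as I understand it (an assumption worth verifying), snappy-java extracts its native library into java.io.tmpdir - /tmp by default - when the Snappy class initializes, so a missing or unwritable /tmp makes class initialization fail. If so, pointing the extraction directory somewhere else should work around it:

    ```shell
    # org.xerial.snappy.tempdir overrides snappy-java's native-library
    # extraction directory; the jar name is a placeholder.
    java -Dorg.xerial.snappy.tempdir=/var/tmp \
         -Djava.io.tmpdir=/var/tmp \
         -jar your-app.jar
    ```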

  • Resize Fs

    lsblk                                           # inspect current block devices and partitions
    fdisk /dev/nvme0n1                              # create a new partition (here: nvme0n1p4)
    partprobe                                       # re-read the partition table without rebooting
    pvcreate /dev/nvme0n1p4                         # initialize the new partition as an LVM PV
    vgextend VolGroup00 /dev/nvme0n1p4              # add the PV to the volume group
    lvdisplay                                       # check the logical volumes
    lvextend -l +100%FREE /dev/VolGroup00/lv_root   # grow the LV into all remaining free space
    lvs                                             # confirm the new LV size
    resize2fs /dev/VolGroup00/lv_root               # grow the ext filesystem (XFS would need xfs_growfs)
    df -h                                           # verify the new size
    
  • Tanyan

    When dealing with people who annoy me, I just want to get their business over with as fast as possible - which ends up letting them take advantage of me. Damn it, that's a contradiction.

  • Respect

    In dealing with people, I think the most important thing is respect.

    That said, everyone understands respect differently, so I won't force it. Reciprocity, I suppose.

  • Update Kafka Cert

    Before restarting ZK, first update the certificates inside the brokers and reload them.

    export c="" && for i in {0..9} ; do
      k exec -ti $c-shaxy-b-kafka-$i -n pro-kafka -- sh -c \
        'export CERTS_STORE_PASSWORD=$(grep listener.name.replication-9091.ssl.keystore.password /tmp/strimzi.properties | cut -f2 -d=) && sh /opt/kafka/kafka_tls_prepare_certificates.sh'
      k exec -ti $c-shaxy-b-kafka-$i -n pro-kafka -- bin/kafka-configs.sh \
        --bootstrap-server 127.0.0.1:9092 --entity-type brokers --entity-name $i --alter \
        --add-config listener.name.replication-9091.ssl.truststore.location=/tmp/kafka/cluster.truststore.p12
    done
    

    This command does appear to make the broker reload the certificate (even though the path and file name are unchanged): at one point I deleted the truststore and then ran the command, and it failed saying the file did not exist.

    I also ran a "destructive" test: restarting one broker directly without running kafka-configs.sh caused all the other brokers to report certificate validation failures. Running kafka-configs.sh again made the errors stop. So the script really does reload the new certificate.

    Following this procedure, all RB clusters were restarted with no impact.

    But about 6 XY clusters still had problems and needed their brokers restarted.

    Those brokers reported the following error when running the kafka-configs command above:

    Error while executing config command with args '--bootstrap-server 127.0.0.1:9092 --entity-type brokers --entity-name 0 --alter --add-config listener.name.replication-9091.ssl.truststore.location=/tmp/kafka/cluster.truststore.p12'
    java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.InvalidRequestException: Invalid config value for resource ConfigResource(type=BROKER, name='0'): Validation of dynamic config update of SSLFactory failed: javax.net.ssl.SSLHandshakeException: PKIX path validation failed: java.security.cert.CertPathValidatorException: validity check failed
    	at org.apache.kafka.common.internals.KafkaFutureImpl.wrapAndThrow(KafkaFutureImpl.java:45)
    	at org.apache.kafka.common.internals.KafkaFutureImpl.access$000(KafkaFutureImpl.java:32)
    	at org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:104)
    	at org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:272)
    	at kafka.admin.ConfigCommand$.alterConfig(ConfigCommand.scala:345)
    	at kafka.admin.ConfigCommand$.processCommand(ConfigCommand.scala:297)
    	at kafka.admin.ConfigCommand$.main(ConfigCommand.scala:90)
    	at kafka.admin.ConfigCommand.main(ConfigCommand.scala)
    Caused by: org.apache.kafka.common.errors.InvalidRequestException: Invalid config value for resource ConfigResource(type=BROKER, name='0'): Validation of dynamic config update of SSLFactory failed: javax.net.ssl.SSLHandshakeException: PKIX path validation failed: java.security.cert.CertPathValidatorException: validity check failed
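    The trailing "validity check failed" points at an expired (or not-yet-valid) certificate somewhere in the chain. One way to inspect what the replication listener is actually serving (a sketch; port 9091 matching the listener name above is an assumption):

    ```shell
    # Print the validity window of the certificate served on the replication listener
    openssl s_client -connect 127.0.0.1:9091 </dev/null 2>/dev/null \
      | openssl x509 -noout -dates
    ```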
    
  • Kafka Unclean Leader Election

    Just noticed a behavior - is this normal?

    If a topic looks like the one below, it will not run an unclean leader election to pick broker 0 as leader. You have to trigger it again by setting unclean.leader.election.enable=true - even if the topic already has it set to true:

    {
      "PartitionErrorCode": 72,
      "PartitionID": 43,
      "Leader": -1,
      "LeaderEpoch": 21,
      "Replicas": [
        0,
        3
      ],
      "Isr": [
        3
      ],
      "OfflineReplicas": [
        3
      ]
    }
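    Re-applying the topic config is what triggers it (a sketch using the stock Kafka CLI; the topic name and bootstrap address are placeholders):

    ```shell
    # Setting the value again - even though it is already true - makes the
    # controller re-evaluate unclean leader election for the topic.
    bin/kafka-configs.sh --bootstrap-server 127.0.0.1:9092 \
      --entity-type topics --entity-name my-topic --alter \
      --add-config unclean.leader.election.enable=true
    ```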
    
  • Zookeeper Reconfig

    语法: server.<positive id> = <address1>:<port1>:<port2>[:role];[<client port address>:]<client port>

    https://zookeeper.apache.org/doc/r3.5.2-alpha/zookeeperReconfig.html

    The client port specification is to the right of the semicolon. The client port address is optional, and if not specified it defaults to “0.0.0.0”. As usual, role is also optional, it can be participant or observer (participant by default).

    Prerequisites:

    1. ZK must be started with an admin superuser: add Environment="SERVER_JVMFLAGS=-Dzookeeper.DigestAuthenticationProvider.superDigest=admin:cGFzc3dvcmQK" at startup.

    2. Run `addauth digest admin:password` inside the zk shell to authenticate, then run reconfig.
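    Putting the two steps together inside the zk shell (a sketch; the server id, host names, and ports are placeholders):

    ```shell
    # zkCli.sh session: authenticate as the super user, then reconfigure
    addauth digest admin:password
    reconfig -add "server.4=host4:2888:3888:participant;2181"
    # removing a server works the same way
    reconfig -remove 4
    ```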