Troubleshooting an AWS instance that could not be reached over SSH

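On my first day back at work after the military parade holiday, I ran into something spooky: the business team suddenly told me that one of our AWS machines could no longer be reached over SSH. I logged in to the console and the instance status looked normal, nothing wrong. I then tried to SSH in from one of our machines in the US and that failed too. Time for the classic ops toolkit, starting with the heavy hammer: reboot. The console said the instance could not be rebooted, unknown error. Damn. Fine, if I cannot reboot it, surely I can shut it down? Same result: cannot stop, unknown error. So now it was picking a fight with me.

What next? The usual fix for this kind of problem is to launch a new instance, attach all the data volumes to it, and kill the old machine. I assumed that plan would do and checked with the business owner, only to be told that this machine is the SVN server, that it is very important, that many overseas users depend on it, and that rebuilding it would be difficult. Given how important it is, I would rather not rebuild unless there is truly no other way. So: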

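Step 1: open a support ticket and ask the AWS engineers to look at the problem from the platform side. Because we have a business support plan, they replied within an hour. The full email is below. To be clear, I am not pasting it to pad the word count; I want readers to see how an overseas engineer approaches and answers an incident, which is a useful reference for those of us doing ops in China. Private details have been replaced with xxxxx. The email reads: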

Hi,

Thanks for contacting AWS Support. I understand that you are facing an issue where you are unable to SSH to EC2 Linux instance i-xxxxx

Please let us know if this understanding is incorrect.

Looking at the details of the instance, the security group (sg-xxxxxx) associated with it allows all incoming connections on all ports from the full IP address space. The access control list also does not have rules to restrict incoming connections on port 22. The instance is also passing status checks which indicates that the instance and the underlying host is healthy.

I have attempted to connect to port 22 on the instance from outside and the connection times out.

We do not have access to the operating system on the instance to be able to troubleshoot issues at the operating system level, including those with the configuration of the SSH daemon (sshd).

Can you kindly confirm if any recent changes have been made to the instance which may have triggered the issue? Are there any firewalls (e.g. iptables) on the instance which may be preventing this connectivity. You may also find the troubleshooting procedure useful [1].

Can you kindly check whether you are able to connect to any other active port on the instance from within and outside the VPC (vpc-xxxx)? I attempted to check TCP ports 80, 443 and 3690 but am receiving timeouts on all of them. You can use “telnet” or “netcat” to check connectivity:

| telnet ip <port>

| nc -z ip <port>

At this moment, if there do not seem to be any option to SSH to the instance, I would recommend to stop and start the instance [2]. You may choose to do this in a maintenance window. Since the instance only has 3 EBS volumes attached, you do not usually risk losing data on the volumes when stopping/starting the instance.

This can help in reverting any non-permanent changes made to the instance operating system e.g. iptables rules.

If, however, permanent changes have been made to the instance, you can attempt stopping the instance, mounting the root volume on another EC2 instance and revert any changes that were made.

The procedure involves the following at a high level:

  1. Stop the instance [2].
  2. Take a snapshot of the root volume as a backup in case of failure [3].
  3. Detach the root volume (vol-b473e3fc) from the instance [4].
  4. Attach the volume to a running EC2 instance and mount it [5][6].
  5. Make the necessary changes.
  6. Unmount and detach the volume from the running instance and attach to back to instance i-xxxxxx.
  7. Start instance i-xxxxxx and check to see if the issue is resolved.

I am hopeful the above steps will be able to help in resolving the issue. If, however, the mentioned steps are unsuccessful in resolving the issue, you are requested to check for snapshots/AMI of the instance and attempt recovery from it.

Please let us know if you continue to face issues in this regard, and provide details. We will attempt to assist you in whatever way possible to troubleshoot and potentially resolve the issue.

References:

[1] http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/TroubleshootingInstancesConnecting.html#TroubleshootingInstancesConnectionTimeout

[2] http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Stop_Start.html

[3] http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-creating-snapshot.html

[4] http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-detaching-volume.html

[5] http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-attaching-volume.html

[6] http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-using-volumes.html

 

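Whatever else you say, in my recent dealings with AWS's overseas engineers they have all been extremely polite. Every email opens with either "Thanks for contacting AWS Support..." or "I am so sorry" or an apology of some kind, very humble. So first, a thumbs-up for the AWS engineers' attitude. Enough chatter: the email first confirms the instance details, then mostly walks through the troubleshooting steps, and finally lists documentation that may be relevant.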
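The seven-step procedure in the support email can be sketched in Python with boto3. This is only a rough outline under assumptions that are not in the original post: boto3 is installed and credentials are configured, the function `rescue_root_volume` and its parameter names are made up for illustration, and `repair` is a hypothetical callback in which you mount the volume on the helper instance, make your changes, and unmount it again.

```python
def _move_volume(ec2, volume_id, instance_id, device):
    """Detach a volume from wherever it is and attach it to instance_id."""
    ec2.detach_volume(VolumeId=volume_id)
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])
    ec2.attach_volume(VolumeId=volume_id, InstanceId=instance_id, Device=device)
    ec2.get_waiter("volume_in_use").wait(VolumeIds=[volume_id])


def rescue_root_volume(ec2, broken_id, helper_id, volume_id,
                       root_device, repair, helper_device="/dev/sdf"):
    """Follow the email's steps 1-7 using an EC2 client such as boto3.client("ec2").

    All IDs and device names are supplied by the caller; nothing here is
    taken from the instance in the post.
    """
    # 1. Stop the broken instance and wait until it is really stopped.
    ec2.stop_instances(InstanceIds=[broken_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[broken_id])

    # 2. Snapshot the root volume as a backup before touching it.
    snap = ec2.create_snapshot(VolumeId=volume_id,
                               Description="backup before SSH rescue")
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])

    # 3 + 4. Detach the root volume and attach it to the helper instance.
    _move_volume(ec2, volume_id, helper_id, helper_device)

    # 5. Mount, fix, and unmount on the helper (done by the caller's callback).
    repair()

    # 6. Move the volume back under its original root device name.
    _move_volume(ec2, volume_id, broken_id, root_device)

    # 7. Start the instance again.
    ec2.start_instances(InstanceIds=[broken_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[broken_id])
    return snap["SnapshotId"]
```

The root device name is deliberately a required argument with no default, because it has to match what the instance expects to boot from, and the volume must be unmounted on the helper before `repair` returns.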

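Judging from the email, the AWS engineer could not fix it directly from the platform side either. So I tried clicking Stop again, and this time it actually worked. That made things much simpler. Step 2, then: follow the procedure he recommended and investigate at the operating system level, iptables and so on. Here is what I did: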

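1. Launch a new instance.

2. Detach the root volume from the faulty instance (now stopped). Important: note down the device name and the volume ID. Mine were /dev/sda1 and vol-b475e3ac.

3. Attach the root volume to the newly launched instance and mount it directly.

4. The data is now visible on the new instance. I first checked the system logs for errors, then looked at /etc/sysconfig/iptables and found nothing wrong there either. I then added a crontab entry that starts the sshd service once an hour, put a command in /etc/rc.local that clears the firewall rules at boot, and created a second sshd_config file with the port changed to 6000, so that two SSH services would run.

5. When everything is set, attach the volume back to the original instance and start it. Note that attaching requires entering the device name, which is /dev/sda; remember, there is no digit 1 this time.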
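The changes from step 4 could be scripted along the following lines. This is a sketch under assumptions, not the exact files I wrote: the function name `prepare_rescue_access`, the file names `sshd_config_6000` and `rescue-sshd`, the extra `PidFile` line, and the exact commands placed in rc.local and cron are all illustrative, and a RHEL-style layout (`/sbin/service`, `/usr/sbin/sshd`) is assumed. `root` is wherever the rescued volume is mounted on the new instance.

```python
from pathlib import Path

RESCUE_PORT = 6000  # the alternate SSH port used in this post


def prepare_rescue_access(root):
    """Write rescue settings into a root volume mounted at `root`."""
    etc = Path(root) / "etc"

    # 1. Second sshd config: a copy of the original with active Port and
    #    PidFile lines dropped and new ones placed at the top, so they sit
    #    before any Match block.
    original = (etc / "ssh" / "sshd_config").read_text().splitlines()
    kept = [line for line in original
            if not line.strip().lower().startswith(("port ", "port\t", "pidfile"))]
    header = [f"Port {RESCUE_PORT}", f"PidFile /var/run/sshd_{RESCUE_PORT}.pid"]
    alt_config = etc / "ssh" / f"sshd_config_{RESCUE_PORT}"
    alt_config.write_text("\n".join(header + kept) + "\n")

    # 2. At boot: flush the iptables rules and start the second sshd.
    rc_local = etc / "rc.local"
    with rc_local.open("a") as fh:
        fh.write("\n# rescue access (assumed commands)\n")
        fh.write("iptables -F\n")
        fh.write(f"/usr/sbin/sshd -f /etc/ssh/{alt_config.name}\n")
    rc_local.chmod(0o755)  # rc.local only runs if it is executable

    # 3. Hourly cron job that starts the regular sshd service.
    cron = etc / "cron.d" / "rescue-sshd"
    cron.parent.mkdir(exist_ok=True)
    cron.write_text("0 * * * * root /sbin/service sshd start\n")
    return alt_config
```

For example, with the volume mounted at a hypothetical `/mnt/rescue`, `prepare_rescue_access("/mnt/rescue")` would be run as root on the new instance before unmounting.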

 

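After it came up I tried SSH again and still could not log in. Seriously, this thing was trying to finish me off. Then I tried SSH on port 6000 and, to my surprise, it worked. What is going on? Has port 22 been blocked??? I asked IT and the network team to check and have had no answer so far. Thinking about it, though, it should not be a network problem: SSH to every other machine works fine, so port 22 is clearly not blocked in general. So why? As of now I still do not know the answer. The good news is that I can reach the server on port 6000. Since the goal was to get back onto the machine, at this point we are all set; as for port 22, the investigation continues...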
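One way to gather more evidence on the port 22 question is to run the same TCP connect test against both ports from several places (office, another AWS instance, inside the VPC) and compare the results. The sketch below is a rough Python equivalent of the `telnet` and `nc -z` checks from the support email; the function name `port_state` and the 5-second timeout are my own choices, not anything from AWS.

```python
import socket


def port_state(host, port, timeout=5.0):
    """Classify a TCP connect attempt as 'open', 'refused' or 'timeout'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # A reset or reject came back: the address is reachable, but nothing
        # accepts connections on this port (or a firewall rejects them).
        return "refused"
    except socket.timeout:
        # No answer at all: packets are being dropped somewhere.
        return "timeout"
    except OSError as exc:
        # Anything else, e.g. name resolution failure or no route to host.
        return f"error: {exc}"
```

Running `port_state(host, 22)` and `port_state(host, 6000)` against the instance's address from each vantage point gives a small table to compare. If port 22 reports "timeout" everywhere while 6000 reports "open", that would suggest something is filtering port 22 for this instance specifically, though it would not by itself say whether the filter sits on the host or on the path.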