CyberWorld | VMware ESXi: Detect and Restart Unresponsive Virtual Machines on ESXi with Bash Script

Purpose of the Script

In my environment, I need to detect and restart unresponsive virtual machines on an ESXi host. I automated these steps with a Bash script running on Ubuntu.

Managing a virtualized environment can often present unexpected challenges. One such challenge is when a virtual machine (VM) becomes unresponsive, causing critical services to stop functioning and potentially impacting business operations. Knowing how to detect and recover (restart) from this scenario is essential for maintaining service availability and minimizing downtime.

In this article, I will guide you through how I detected and resolved an issue with an unresponsive Virtual Machine (VM) using a combination of ping checks and SSH access to the ESXi server.

Prerequisites

Root access to the ESXi Host
Enabled SSH Access for the ESXi Host
sshpass installed on the server that runs the scripts (used by the restart script)

The Logic I Use in my Script to Detect Unresponsive VM and Restart It

The first step in diagnosing an unresponsive VM is setting up a monitoring system that checks the VM’s availability. For this, I used a simple yet effective approach: periodic ping tests from another server. This external server continuously monitors the VM’s status by sending ping requests at regular intervals. If the VM responds, it is marked as available, but if there is no response after a set number of attempts, an alert is triggered to indicate that the VM might be unresponsive.

Once an alert is received, the next step is to remotely connect to the ESXi host via SSH. This requires enabling SSH access on the ESXi server, which should be secured and limited to authorized personnel. Using SSH, I connected to the ESXi server and utilized the vim-cmd command to interact with the VM.

Connect to the ESXi and List all VMs

Step 1: Connect to the ESXi and get VMID of VM

The command vim-cmd vmsvc/getallvms lists all registered VMs, making it easy to identify the VM in question using its VMID.
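Instead of reading the VMID off the listing by hand, the lookup can be scripted. The following is a minimal sketch, assuming the usual column layout of vim-cmd vmsvc/getallvms (Vmid in the first column, Name in the second) and a VM name without spaces. The function name find_vmid, the VM names, and the sample listing are hypothetical and only illustrate the parsing.

```shell
# Print the Vmid of the VM whose Name column matches $1.
# Reads the output of `vim-cmd vmsvc/getallvms` on stdin.
# Assumption: Vmid is column 1, Name is column 2, and the name has no spaces.
find_vmid() {
    awk -v name="$1" 'NR > 1 && $2 == name { print $1 }'
}

# Illustrative listing (not real output from the author's host)
sample_listing='Vmid   Name       File                                     Guest OS        Version
42     my-vm      [datastore1] my-vm/my-vm.vmx             ubuntu64Guest   vmx-14
43     other-vm   [datastore1] other-vm/other-vm.vmx       ubuntu64Guest   vmx-14'

printf '%s\n' "$sample_listing" | find_vmid my-vm
```

Against a real host, the same function could be fed over SSH from the monitoring server, for example: ssh root@192.168.10.102 vim-cmd vmsvc/getallvms | find_vmid my-vm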

In our case, we will restart the VM with ID 42 (this ID is used later in the script).

With the VM identified, one option is to run vim-cmd vmsvc/power.off <VMID> to force the VM to shut down, followed by vim-cmd vmsvc/power.on <VMID> to start it again.

In the script, we will instead use a single command on the ESXi host that performs a hard reset of the VM:

vim-cmd vmsvc/power.reset 42
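A reset normally applies only to a VM that is powered on, so it can help to check the power state first with vim-cmd vmsvc/power.getstate <VMID>. The sketch below only parses the text of that command; the helper name is_powered_on is hypothetical, and the sample output is illustrative rather than captured from the author's host.

```shell
# Succeeds if the text on stdin (output of `vim-cmd vmsvc/power.getstate <VMID>`)
# reports the VM as powered on.
is_powered_on() {
    grep -q 'Powered on'
}

# Illustrative output; on a real host it would come from:
#   vim-cmd vmsvc/power.getstate 42
sample_state='Retrieved runtime info
Powered on'

if printf '%s\n' "$sample_state" | is_powered_on; then
    echo "VM is powered on: power.reset applies"
else
    echo "VM is not powered on: power.on is needed instead"
fi
```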

This method has proven to be reliable for automating the recovery of unresponsive VMs. By integrating monitoring scripts and SSH-based recovery commands, I could promptly resolve service interruptions without needing to manually access the management interface. This approach also opens up possibilities for automation using cron jobs or monitoring tools that can execute scripts to perform these actions automatically.

Implementation of the Bash Script

Implementing such a solution ensures that downtime is minimized and the overall reliability of the virtualized environment is improved.

Create folders for scripts

# Create directories
mkdir -p /opt/scripts/log

Create Monitoring Script

# Create script
nano /opt/scripts/check_vm.sh

#!/bin/bash

PING_IP="192.168.10.201"        # IP to ping
MAX_LOST_PINGS=3        # Number of consecutive lost pings logged as the threshold
LOG_FILE="/opt/scripts/log/check_vm_$(date +%Y-%m-%d).log" # Location for log file

echo "Script is executed on $(date)"  >> "$LOG_FILE"

lost_pings=0   # Define variable to enable tracking lost pings
# Loop indefinitely
while :
do
    # Ping the IP once, wait for up to 1 second
    if ping -c 1 -W 1 "$PING_IP" > /dev/null; then
        echo "Ping to $PING_IP OK"
        # Log the recovery only if the threshold had been reached before
        if [ "$lost_pings" -ge "$MAX_LOST_PINGS" ]; then
            echo "$(date): Ping to $PING_IP OK" >> "$LOG_FILE"
        fi
        lost_pings=0
    else
        # Ping failed: log the date, time, IP, and the number of pings lost before this one
        if [ "$lost_pings" -lt "$MAX_LOST_PINGS" ]; then
            echo "$(date): Ping to $PING_IP failed, previously lost pings: $lost_pings" >> "$LOG_FILE"
        fi
        # Count the number of consecutive lost pings
        lost_pings=$((lost_pings + 1))
        # Log once when the number of lost pings reaches the threshold
        if [ "$lost_pings" -eq "$MAX_LOST_PINGS" ]; then
            echo "$(date): Reached the threshold of $MAX_LOST_PINGS lost pings in a row" >> "$LOG_FILE"
        fi
        # After more than 5 lost pings, call the restart script on every even count (6, 8, 10, ...)
        if [ "$lost_pings" -gt 5 ] && [ $((lost_pings % 2)) -eq 0 ]; then
            echo "$(date): $lost_pings lost pings is an even number, calling reboot script" >> "$LOG_FILE"
            sudo bash /opt/scripts/vm_reboot.sh >> "$LOG_FILE"
            echo "$(date): Script to reboot VM is called. Next check." >> "$LOG_FILE"
        fi


    fi
    # Wait for 3 seconds before pinging again
    sleep 3
    # Check if the date has changed
    today_log="/opt/scripts/log/check_vm_$(date +%Y-%m-%d).log"
    if [ "$LOG_FILE" != "$today_log" ]; then
        # Create a new log file for the new day
        LOG_FILE="$today_log"
        echo "Script is executed on $(date)" >> "$LOG_FILE"
    fi

done

chmod +x /opt/scripts/check_vm.sh
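The restart condition in the monitoring script is easy to misread: MAX_LOST_PINGS only controls logging, while the restart script is called once more than 5 pings in a row are lost and the count is even. The sketch below isolates that condition in a helper (the name should_restart is hypothetical) and lists which lost-ping counts would trigger a restart.

```shell
# Same condition as in check_vm.sh, isolated in a (hypothetical) helper:
# true when more than 5 pings in a row were lost and the count is even.
should_restart() {
    [ "$1" -gt 5 ] && [ $(( $1 % 2 )) -eq 0 ]
}

# Collect the lost-ping counts from 1 to 12 that would call vm_reboot.sh
triggers=""
n=1
while [ "$n" -le 12 ]; do
    if should_restart "$n"; then
        triggers="$triggers $n"
    fi
    n=$((n + 1))
done
echo "Restart is triggered at lost-ping counts:$triggers"
```

With a ping timeout of up to 1 second and a 3-second sleep, each failed check takes roughly 4 seconds, and vm_reboot.sh itself waits 20 seconds. If the VM still does not answer, the next reset would therefore follow roughly 28 seconds after the previous one, so a VM that boots slowly may need a longer wait.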

Script to Restart the VM on ESXi Host

To connect to a Linux server without a password, you can follow the steps described at the link.
In this script the password is hardcoded. This is not recommended; consider using SSH keys instead!
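As an alternative to the hardcoded password, a key-based variant might look like the sketch below. It assumes OpenSSH on the monitoring server and that the ESXi host reads root's public keys from /etc/ssh/keys-root/authorized_keys (the usual location on ESXi, but worth verifying on your version). The key path, the function names, and the SSH_BIN override are hypothetical and not part of the original script.

```shell
HOST="192.168.10.102"               # ESXi host from the article
PORT=22
KEY_FILE="/opt/scripts/esxi_key"    # hypothetical key location
SSH_BIN="${SSH_BIN:-ssh}"           # hypothetical override, e.g. SSH_BIN=echo for a dry run

# One-time setup: create an RSA key pair without a passphrase
generate_key() {
    ssh-keygen -t rsa -b 4096 -N "" -f "$KEY_FILE"
}

# One-time setup: append the public key on the ESXi host
# (asks for the root password once; path is an assumption, see above)
install_key() {
    "$SSH_BIN" -p "$PORT" "root@$HOST" 'cat >> /etc/ssh/keys-root/authorized_keys' < "$KEY_FILE.pub"
}

# Reset the VM with the given VMID using the key instead of sshpass.
# BatchMode=yes makes ssh fail instead of prompting when the key is rejected.
reset_vm() {
    "$SSH_BIN" -i "$KEY_FILE" -o BatchMode=yes -p "$PORT" "root@$HOST" "vim-cmd vmsvc/power.reset $1"
}
```

After running generate_key and install_key once, calling reset_vm 42 would replace the sshpass call. Setting SSH_BIN=echo beforehand prints the ssh arguments instead of connecting, which is a simple way to check the command line.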

nano /opt/scripts/vm_reboot.sh

#!/bin/bash

# Variables: ESXi host IP address, user, password
HOST="192.168.10.102"
USER="root"
PASSWORD="" # Not recommended to hardcode passwords; consider using SSH keys or prompting for password
PORT=22
LOG_FILE="/opt/scripts/log/reboot_$(date +%Y-%m-%d).log" # Location for log file


# Commands to reset the VM when ping is down
restart_commands=(
    "vim-cmd vmsvc/power.reset 42"
)


# Function to send commands via SSH
execute_commands() {
    local commands=("$@")
    sshpass -p "$PASSWORD" ssh -o StrictHostKeyChecking=no -p "$PORT" "$USER@$HOST" << EOF
$(for cmd in "${commands[@]}"; do echo "$cmd"; done)
EOF
}

# Restarting VM
echo "$(date): Script called to restart VM..."
echo "$(date): Restarting VM..."  >> "$LOG_FILE"

execute_commands "${restart_commands[@]}"

# Wait for 20 seconds
echo "$(date): Waiting for 20 seconds..."
echo "$(date): Waiting for 20 seconds..."  >> $LOG_FILE

sleep 20

echo "$(date): Done."
echo "$(date): Done."  >> $LOG_FILE
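The restart script writes most messages twice, once to the terminal and once to the log file. A small helper could do both in one call. This is a minimal sketch: the function name log is hypothetical, and the log path under /tmp is used only so the example runs anywhere.

```shell
# /tmp is used here only for the demo; the script above logs to /opt/scripts/log
LOG_FILE="${LOG_FILE:-/tmp/reboot_$(date +%Y-%m-%d).log}"

# Print a timestamped message and append the same line to $LOG_FILE
log() {
    echo "$(date): $*" | tee -a "$LOG_FILE"
}

log "Restarting VM..."
log "Waiting for 20 seconds..."
log "Done."
```

Because check_vm.sh redirects the output of vm_reboot.sh into its own log, messages printed this way would appear in both log files.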