Purpose of the Script
In my environment, I need to detect and restart unresponsive virtual machines on an ESXi host. I automated these steps with a Bash script running on Ubuntu.
Managing a virtualized environment can often present unexpected challenges. One such challenge is when a virtual machine (VM) becomes unresponsive, causing critical services to stop functioning and potentially impacting business operations. Knowing how to detect and recover (restart) from this scenario is essential for maintaining service availability and minimizing downtime.
In this article, I will guide you through how I detected and resolved an issue with an unresponsive Virtual Machine (VM) using a combination of ping checks and SSH access to the ESXi server.
Prerequisites
- Root access to the ESXi Host
- Enabled SSH Access for the ESXi Host
The Logic I Use in My Script to Detect an Unresponsive VM and Restart It
The first step in diagnosing an unresponsive VM is setting up a monitoring system that checks the VM’s availability. For this, I used a simple yet effective approach: periodic ping tests from another server. This external server continuously monitors the VM’s status by sending ping requests at regular intervals. If the VM responds, it is marked as available, but if there is no response after a set number of attempts, an alert is triggered to indicate that the VM might be unresponsive.
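The counting logic described above can be sketched as a small Bash function. This is a minimal illustration, not the full monitoring script: the function name `check_until_threshold` and its parameters are hypothetical, and the probe is passed in as a command so that the logic can be tried without a real ping. The interval between attempts is left out here for brevity.

```bash
#!/bin/bash
# Sketch: count consecutive failed probes and report when a threshold is hit.
# check_until_threshold is a hypothetical helper, not part of the original scripts.
# Usage: check_until_threshold <max_lost> <max_attempts> <probe command...>
# Returns 0 if the probe failed <max_lost> times in a row, 1 otherwise.
check_until_threshold() {
    local max_lost=$1 max_attempts=$2
    shift 2
    local lost=0 attempt
    for ((attempt = 1; attempt <= max_attempts; attempt++)); do
        if "$@" > /dev/null 2>&1; then
            lost=0                      # any success resets the counter
        else
            lost=$((lost + 1))
            if [ "$lost" -ge "$max_lost" ]; then
                return 0                # threshold reached: target looks unresponsive
            fi
        fi
    done
    return 1
}

# Demo with a stand-in probe that always fails (instead of a real ping)
if check_until_threshold 3 5 false; then
    echo "unresponsive"
fi
```

With a real target, the probe would be something like `ping -c 1 -W 1 192.168.10.201` in place of `false`.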
Once an alert is received, the next step is to connect remotely to the ESXi host via SSH. This requires enabling SSH access on the ESXi server, which should be secured and limited to authorized personnel. Using SSH, I connected to the ESXi server and used the vim-cmd command to interact with the VM.
Connect to the ESXi Host and List All VMs
Step 1: Connect to the ESXi host and get the VMID of the VM
The command vim-cmd vmsvc/getallvms lists all registered VMs, making it easy to identify the VM in question using its VMID.
In our case, we will restart the VM with ID 42. This ID is used later in the script.
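If you prefer to look up the VMID by name instead of reading it from the listing, a sketch like the following may help. It assumes that the listing has one header line, that the VMID is the first column and the name the second, and that the VM name contains no spaces. The helper `find_vmid`, the VM names, and the sample listing are made up for illustration.

```bash
#!/bin/bash
# Sketch: extract the VMID for a given VM name from `vim-cmd vmsvc/getallvms` output.
# Assumption: the VM name has no spaces, so it is the second whitespace-separated field.
# Usage: find_vmid <vm_name>   (reads the listing from stdin)
find_vmid() {
    awk -v name="$1" 'NR > 1 && $2 == name { print $1; exit }'
}

# Illustrative listing only; names and paths are invented for this example
sample_listing='Vmid   Name        File                                   Guest OS        Version   Annotation
41     webserver   [datastore1] webserver/webserver.vmx   ubuntu64Guest   vmx-14
42     adguard     [datastore1] adguard/adguard.vmx       ubuntu64Guest   vmx-14'

find_vmid adguard <<< "$sample_listing"   # prints 42
```

Against a real host, the listing could be piped in over SSH, for example `ssh root@192.168.10.102 vim-cmd vmsvc/getallvms | find_vmid adguard`.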
With the VM identified, I issued the command vim-cmd vmsvc/power.off <VMID> to force the VM to shut down and vim-cmd vmsvc/power.on <VMID> to start it again.
In the script, we will instead use a single command on the ESXi host that resets the VM in one step:
vim-cmd vmsvc/power.reset 42
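A reset can be expected to fail if the VM is already powered off, so it may be worth checking the power state first. The sketch below assumes that `vim-cmd vmsvc/power.getstate <VMID>` prints a line such as "Powered on" or "Powered off"; verify the exact wording on your ESXi version. The helper `choose_power_command` is hypothetical and only prints the command it would pick.

```bash
#!/bin/bash
# Sketch: pick reset or power-on depending on the reported power state.
# Assumption: the state text (read from stdin) contains "Powered on" for a running VM.
# Usage: choose_power_command <vmid>   (reads power.getstate output from stdin)
choose_power_command() {
    local vmid=$1
    if grep -q "Powered on"; then
        echo "vim-cmd vmsvc/power.reset $vmid"   # VM is running but unresponsive
    else
        echo "vim-cmd vmsvc/power.on $vmid"      # VM is off, so just start it
    fi
}

# Demo with sample state text instead of a live ESXi host
choose_power_command 42 <<< $'Retrieved runtime info\nPowered on'
```

On a real host, the state text would come from `vim-cmd vmsvc/power.getstate 42` run over SSH.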
This method has proven to be reliable for automating the recovery of unresponsive VMs. By integrating monitoring scripts and SSH-based recovery commands, I could promptly resolve service interruptions without needing to manually access the management interface. This approach also opens up possibilities for automation using cron jobs or monitoring tools that can execute scripts to perform these actions automatically.
Implementation of the Bash Script
Implementing such a solution ensures that downtime is minimized and the overall reliability of the virtualized environment is improved.
Create folders for scripts
# Create directories
mkdir -p /opt/scripts
mkdir -p /opt/scripts/log
Create Monitoring Script
# Create script
nano /opt/scripts/check_vm.sh
#!/bin/bash
PING_IP="192.168.10.201" # IP to ping
MAX_LOST_PINGS=3 # Number of consecutive lost pings before the threshold is logged
LOG_FILE="/opt/scripts/log/check_vm_adguard_$(date +%Y-%m-%d).log" # Location for log file
echo "Script is executed on $(date)" >> "$LOG_FILE"
lost_pings=0 # Counter for consecutive lost pings
# Loop indefinitely
while :
do
    # Ping the IP once, wait for up to 1 second
    if ping -c 1 -W 1 "$PING_IP" > /dev/null; then
        echo "Ping to $PING_IP OK"
        # Log the recovery only if the threshold had been reached
        if [ "$lost_pings" -ge "$MAX_LOST_PINGS" ]; then
            echo "$(date): Ping to $PING_IP OK" >> "$LOG_FILE"
        fi
        lost_pings=0
    else
        # Ping failed: log the date, time, and IP until the threshold is reached
        if [ "$lost_pings" -lt "$MAX_LOST_PINGS" ]; then
            echo "$(date): Ping to $PING_IP failed, previous consecutive failures: $lost_pings" >> "$LOG_FILE"
        fi
        # Count the number of consecutive lost pings
        lost_pings=$((lost_pings + 1))
        # Log once when the number of lost pings reaches the threshold
        if [ "$lost_pings" -eq "$MAX_LOST_PINGS" ]; then
            echo "$(date): Reached the threshold of $MAX_LOST_PINGS lost pings in a row" >> "$LOG_FILE"
        fi
        # If more than 5 pings are lost and the count is even, call the script to restart the VM
        if [ "$lost_pings" -gt 5 ] && [ $((lost_pings % 2)) -eq 0 ]; then
            echo "$(date): $lost_pings lost pings is an even number, calling reboot script" >> "$LOG_FILE"
            sudo bash /opt/scripts/vm_reboot.sh >> "$LOG_FILE"
            echo "$(date): Script to reboot VM is called. Next check." >> "$LOG_FILE"
        fi
    fi
    # Wait for 3 seconds before pinging again
    sleep 3
    # Check if the date has changed by comparing against today's log file name
    current_log_file="/opt/scripts/log/check_vm_adguard_$(date +%Y-%m-%d).log"
    if [ "$current_log_file" != "$LOG_FILE" ]; then
        # Create a new log file for the new day
        LOG_FILE="$current_log_file"
        echo "Script is executed on $(date)" >> "$LOG_FILE"
    fi
done
chmod +x /opt/scripts/check_vm.sh
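To make the restart condition in check_vm.sh easier to follow, the check "more than 5 lost pings and an even count" can be isolated into a small function and tried against a range of counter values. The function name `should_reboot` is hypothetical; the condition itself is copied from the script.

```bash
#!/bin/bash
# Sketch: the reboot trigger from check_vm.sh, isolated for experimentation.
# should_reboot is a hypothetical helper; it succeeds when the reboot script would be called.
should_reboot() {
    local lost=$1
    [ "$lost" -gt 5 ] && [ $((lost % 2)) -eq 0 ]
}

# Show which counter values trigger a reboot call
for n in 1 2 3 4 5 6 7 8 9 10; do
    if should_reboot "$n"; then
        echo "reboot script called at $n lost pings"
    fi
done
# prints lines for 6, 8 and 10
```

With a 1-second ping timeout and a 3-second sleep, the first call happens a little over 20 seconds after the first lost ping, and further calls follow on every second lost ping. Since the reboot script is called synchronously and waits 20 seconds itself, a VM that needs longer than roughly half a minute to answer pings again may be reset a second time; if that is a concern, consider raising the values in the condition.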
Script to Restart the VM on ESXi Host
- To connect to the server without a password, you can follow the steps described at the link.
- This script uses a hardcoded password. Hardcoding passwords is not recommended; consider using SSH keys instead!
nano /opt/scripts/vm_reboot.sh
#!/bin/bash
# Variables: ESXi host IP address, user, password
HOST="192.168.10.102"
USER="root"
PASSWORD="" # Not recommended to hardcode passwords; consider using SSH keys or prompting for password
PORT=22
LOG_FILE="/opt/scripts/log/reboot_$(date +%Y-%m-%d).log" # Location for log file
# Commands to reset the VM when ping is down
restart_commands=(
    "vim-cmd vmsvc/power.reset 42"
)
# Function to send commands via SSH
execute_commands() {
    local commands=("$@")
    sshpass -p "$PASSWORD" ssh -o StrictHostKeyChecking=no -p "$PORT" "$USER@$HOST" << EOF
$(for cmd in "${commands[@]}"; do echo "$cmd"; done)
EOF
}
# Restarting VM
echo "$(date): Script called to restart VM..."
echo "$(date): Restarting VM..." >> "$LOG_FILE"
execute_commands "${restart_commands[@]}"
# Wait for 20 seconds
echo "$(date): Waiting for 20 seconds..."
echo "$(date): Waiting for 20 seconds..." >> "$LOG_FILE"
sleep 20
echo "$(date): Done."
echo "$(date): Done." >> "$LOG_FILE"
chmod +x /opt/scripts/vm_reboot.sh
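As an alternative to sshpass with a hardcoded password, the SSH call in vm_reboot.sh could use a key pair. The following is a sketch under assumptions: the key path in `KEY_FILE` is hypothetical, the key pair must already exist, and the public key must be authorized for root on the ESXi host (on ESXi this is commonly the file /etc/ssh/keys-root/authorized_keys, but check this for your version). The helper `build_ssh_command` only assembles and prints the command so that it can be inspected safely.

```bash
#!/bin/bash
# Sketch: key-based replacement for the sshpass call in vm_reboot.sh.
HOST="192.168.10.102"
USER="root"
PORT=22
KEY_FILE="/opt/scripts/keys/esxi_ed25519"   # hypothetical path to the private key

# Build the ssh invocation as an array so every argument stays a single word.
# BatchMode=yes makes ssh fail instead of prompting for a password.
build_ssh_command() {
    local remote_cmd=$1
    SSH_CMD=(ssh -i "$KEY_FILE" -o BatchMode=yes -p "$PORT" "$USER@$HOST" "$remote_cmd")
}

build_ssh_command "vim-cmd vmsvc/power.reset 42"
# Print the command instead of running it; use "${SSH_CMD[@]}" to execute it for real
printf '%s\n' "${SSH_CMD[*]}"
```

With this variant, the `PASSWORD` variable and the sshpass dependency are no longer needed.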