It was a big mistake on my part, but I did it: I set a weird root password on my ESXi host, thought I’d remember it, didn’t write it down, didn’t set up SSH key access or vSphere… and lost admin access to my server. Bummer. The official documentation says “reinstalling the ESXi host is the only supported way to reset a password”, but I thought that was kinda lame and that they simply didn’t want to officially support shady config file manipulation. Surely there’s a way to just replace the password like you would on a standard Linux box, right? Right?… Well… not quite. Here’s how I had to do it.
First, how was my server set up? At the time it was running ESXi-7.0U3m, hosted at OVHCloud. The host doesn’t have a TPM (that’s quite important, you’ll see why), it already had the SSH service turned on (also quite important), and I can force a reboot into a recovery system that can access the data on the disks without booting the installed OS, as if I had physical access to the machine itself (it also supports IPMI, which effectively gives full control as if I were sitting right in front of the machine, but that wasn’t necessary in this case).
TL;DR: I had to reboot the server in rescue mode, mount the correct bootbank, fetch state.tgz, decrypt local.tgz.ve by installing a temporary ESXi server and replacing its encryption key, then add my SSH keys to the decrypted local.tgz and put it back on the server, so that I could SSH back in and change the root password.
According to the most popular tutorial on how to manually reset an ESXi password, you just reboot your server on a live CD (or on the rescue system in my case), mount the bootbank partition (sda5), open state.tgz, then local.tgz, change the etc/shadow file, repack everything back together, replace the state file in the bootbank and you’re good to go. So I went to work: I halted all my VMs, force rebooted the server into rescue, and grabbed the state.tgz file. And that’s when I discovered that, since ESXi 7.0.3, the local.tgz file is encrypted! That was very underwhelming. Its filename is local.tgz.ve and if you don’t have a TPM, it should be accompanied by an encryption.info file, containing the encryption keys.
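The inspection step can be sketched as a small shell helper for the rescue system. It only lists the members of a state.tgz and reports which form of the inner archive it holds. The function name `state_kind` is made up for illustration; it is not an ESXi or OVHCloud tool.

```shell
#!/bin/sh
# Report how the ESXi configuration is stored inside a given state.tgz.
# state_kind is a hypothetical helper name used only for this sketch.
state_kind() {
    # List the archive members and strip any leading "./"
    members=$(tar tzf "$1" | sed 's|^\./||')
    if printf '%s\n' "$members" | grep -qx 'local.tgz.ve'; then
        echo encrypted      # encrypted layout: needs crypto-util on ESXi
    elif printf '%s\n' "$members" | grep -qx 'local.tgz'; then
        echo plain          # older layout: the etc/ tree is directly editable
    else
        echo unknown
    fi
}
```

For example, with sda5 mounted on a hypothetical /mnt/bootbank, `state_kind /mnt/bootbank/state.tgz` should print `encrypted` for a host like the one described here, which is the cue that the classic etc/shadow procedure will not work.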
At first I thought: well, that doesn’t look so bad, surely there’s a tool to decrypt it using the encryption.info data, it’s just a little setback. Not so fast: there are two problems. First, you need an ESXi 7 system to decrypt such a file (you can only use the proprietary crypto-util command, and it’s not available on Linux), which is quite difficult when you don’t have access to your server or any other server. Second, each server uses its own encryption keys, so you cannot just “use” or “import” the decryption key (at least I didn’t find any easy way, tell me in the comments if you do). There is still a way around both problems, though. If you do have a TPM, however, you can’t decrypt the file outside that very same server, so I guess you have no choice but to reinstall everything.
So I needed to gain access to another ESXi server somehow. I could install it on a spare machine, but it’s a mess, I didn’t have the hardware on hand, and I didn’t even know if it would be supported at all… Maybe a VM? I tried with VMware Workstation 17 on Windows 10, but my CPU doesn’t support virtualizing Intel VT-x (nested virtualization). I thought I was out of luck, but I tried anyway: I lied to VMware by pretending I was installing Windows 10 instead, without nested virtualization support. I didn’t have high hopes, but as it turns out, you can install and run a fully virtualized ESXi system without proper CPU support! The installer warns you about it, but to my surprise it doesn’t prevent you from proceeding with the installation. You won’t be able to run any virtual machines, obviously, but that’s not what I needed, so it was fine for me. That’s it for step one.
So I finally had access to that crypto-util command. I knew the command line to use, having found it in some other post: “crypto-util envelope extract --aad ESXConfiguration local.tgz.ve local.tgz”. Would it work? No: it doesn’t read the provided encryption.info file, and there’s no proper way to make it do so. My virtual system has its own encryption.info file with a different keyset, so I got an “ESXi kernel key cache error”. I would have to import my server’s key somehow. Maybe… by replacing the key on the virtual system! It turns out the local.tgz file doesn’t need to be encrypted to be used by the system on boot; the system only tries to decrypt it if the local.tgz.ve file is present (see how the /bin/auto-backup.sh file works). So here is what I did on the virtual system: uncompress /bootbank/state.tgz, decrypt its local.tgz.ve file, remove the encrypted copy, and replace the encryption.info file with the one from my server. I then rebuilt the state.tgz file with both files and rebooted the virtual system.
    mkdir /tmp/a
    cd /tmp/a
    tar xzf /bootbank/state.tgz
    crypto-util envelope extract --aad ESXConfiguration local.tgz.ve local.tgz
    rm local.tgz.ve
    # (replace encryption.info)
    tar czf /bootbank/state.tgz encryption.info local.tgz
To my amazement, the virtual server happily booted and replaced its own encryption key with the key used by my real server! That means I was finally able to decrypt my server’s local.tgz.ve file, gaining access to all the configuration files. Success!… Well, not yet: on ESXi 7, there’s no shadow file anymore, the root password is stored in one of its proprietary config files, and it’s a pain to change it. So instead, I set up SSH key login by placing my public keys into the etc/ssh/keys-root/authorized_keys file. That way, I’d be able to log in without a password, and change it using the passwd command. Easy, right?
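Adding the key to the decrypted archive can be sketched as follows. This is a minimal illustration under assumptions, not the exact commands used here: `add_root_key` is a hypothetical helper name, and the sketch assumes that etc/ is the only top-level entry in local.tgz (it refuses to continue otherwise, so nothing gets dropped silently). It is meant to be run as root so that tar keeps the original file ownership.

```shell
#!/bin/sh
# Append a public key to root's authorized_keys inside a decrypted local.tgz.
# add_root_key is a hypothetical helper name used only for this sketch.
add_root_key() {
    archive=$1    # path to the decrypted local.tgz (rewritten in place)
    pubkey=$2     # path to the public key file to authorize
    work=$(mktemp -d) || return 1
    tar xzf "$archive" -C "$work" || return 1
    # Assumption: local.tgz contains a single top-level etc/ tree
    if [ "$(ls -A "$work")" != "etc" ]; then
        echo "unexpected top-level entries in $archive" >&2
        return 1
    fi
    mkdir -p "$work/etc/ssh/keys-root"
    cat "$pubkey" >> "$work/etc/ssh/keys-root/authorized_keys" || return 1
    # Repack the etc/ tree over the original archive
    tar czf "$archive" -C "$work" etc || return 1
    rm -rf "$work"
}
```

A call would look like `add_root_key ./local.tgz ./my_key.pub` (both file names are placeholders), after which state.tgz still has to be rebuilt around the modified local.tgz and copied back to the bootbank from the rescue system.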
Well, that’s all that was left to do. So I halted all my VMs and force rebooted my server for the second time, grabbed the state.tgz file from the server, did my little kitchenery to uncompress/decrypt/replace/repack everything, put the resulting state.tgz file back into place, rebooted and… nothing. I still couldn’t log in, as if I had done nothing! And that’s pretty much what I had done, nothing: I forgot about the altbootbank… I changed the wrong bootbank! That’s right, ESXi uses two boot filesystems, in case one doesn’t boot anymore after an update, and it switches back and forth between them after each update. sda5 is the first bootbank, and sda6 is the second. I had only done sda5. So, after redoing the same process all over again on the correct bootbank, it finally worked: I was able to SSH into the server, set a new password, and save it in a secure place. I also kept my SSH keys in place, just in case. And I will certainly not make the same mistake again.
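To avoid patching the wrong bank, the two boot.cfg files can be compared from the rescue system before touching anything. The sketch below rests on an assumption that the steps above did not verify: that each bank’s boot.cfg carries an `updated=` counter and that ESXi boots the bank with the higher value. `boot_updated` and `newest_bank` are hypothetical helper names.

```shell
#!/bin/sh
# Print the "updated" counter found in a boot.cfg file.
# Assumption: the bank with the higher counter is the one ESXi boots.
boot_updated() {
    sed -n 's/^updated=\([0-9][0-9]*\).*/\1/p' "$1"
}

# Given two mounted bootbank directories, print the one whose boot.cfg
# has the higher counter (the first one wins a tie).
newest_bank() {
    a=$(boot_updated "$1/boot.cfg")
    b=$(boot_updated "$2/boot.cfg")
    if [ "${a:-0}" -ge "${b:-0}" ]; then
        echo "$1"
    else
        echo "$2"
    fi
}
```

With sda5 and sda6 mounted on, say, /mnt/b5 and /mnt/b6 (made-up paths), `newest_bank /mnt/b5 /mnt/b6` prints the directory whose state.tgz is probably the live one. If the counters look odd, it is safer to inspect both banks by hand before writing anything.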