My desktop machine broke itself on a reboot last Friday evening, so that was a nice thing to look forward to fixing on a Monday morning …
I’m still trying to characterise the problem (it feels like a problem with the underlying nest of symbolic links that is the Debian “alternatives” mechanism). The two initial symptoms were the keyboard not working and the graphical interface not starting.
The underlying problem was that the X server wouldn’t start because it couldn’t find the X “nvidia” driver. (This is a confusing area to describe: graphics drivers come as a kernel “driver” or module that has to be loadable into the running operating system kernel, and an X “driver” or module that has to be loadable into the running X server during its startup. Both drivers have to be of matching versions, or they won’t talk to each other.)
The underlying underlying problem was that the nvidia-340-updates Ubuntu package (and nvidia-346-updates, and presumably all the other variations on that theme) was putting its nvidia driver for the X server into a location it controls, but not creating a symlink in any location where the X server would look.
The quick hacky fix is
ln -s /usr/lib/nvidia-340-updates/xorg/nvidia_drv.so /usr/lib/x86_64-linux-gnu/xorg/extra-modules/
but I really want to know why it didn’t get it right itself.
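A slightly more cautious version of that fix would check whether the link is actually missing before creating anything. This is only a sketch: check_x_driver is a name I made up, the two paths are the ones from my machine, and it prints the ln command rather than running it.

```shell
#!/bin/sh
# Hypothetical helper: report whether an X driver file is visible in a
# module directory, and print (not run) the symlink command if it is not.
check_x_driver() {
    driver=$1    # where the package installed the driver
    moddir=$2    # directory we expect the X server to search
    name=$(basename "$driver")
    if [ ! -e "$driver" ]; then
        echo "driver missing: $driver"
    elif [ -e "$moddir/$name" ]; then
        echo "ok: $moddir/$name"
    else
        echo "not linked: ln -s $driver $moddir/"
    fi
}

# Paths taken from this post; adjust for other driver versions.
check_x_driver /usr/lib/nvidia-340-updates/xorg/nvidia_drv.so \
    /usr/lib/x86_64-linux-gnu/xorg/extra-modules
```

Running it before and after starting the display manager might also help catch the moment the file goes missing.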
More alarming, and even more difficult to pin down or characterise: on at least two occasions I managed to hand-start the X server and it found the nvidia_drv.so file in an appropriate place, but when I started it via the display manager immediately afterwards it failed because the nvidia_drv.so file was no longer there. This does not feel like a sane or plausible situation to be in.
My suspicion that something is amiss with the “alternatives” infrastructure lies here:
ls -l /usr/lib/x86_64-linux-gnu/xorg/extra-modules
lrwxrwxrwx 1 root root 53 Aug 6 13:00 /usr/lib/x86_64-linux-gnu/xorg/extra-modules -> /etc/alternatives/x86_64-linux-gnu_xorg_extra_modules
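To see where that chain of links really ends up, something like the following could walk it one hop at a time. It is a rough sketch, not a tested diagnostic: follow_links is a made-up name, and the 40-hop limit is an arbitrary guard against loops.

```shell
#!/bin/sh
# Hypothetical helper: print each hop of a symlink chain, then say whether
# the final target exists.
follow_links() {
    path=$1
    hops=0
    while [ -L "$path" ] && [ "$hops" -lt 40 ]; do
        target=$(readlink "$path")
        printf '%s -> %s\n' "$path" "$target"
        case $target in
            /*) path=$target ;;                       # absolute target
            *)  path=$(dirname "$path")/$target ;;    # relative to the link
        esac
        hops=$((hops + 1))
    done
    if [ -e "$path" ]; then
        printf '%s (exists)\n' "$path"
    else
        printf '%s (MISSING)\n' "$path"
    fi
}

# Default to the directory from this post if no argument is given.
follow_links "${1:-/usr/lib/x86_64-linux-gnu/xorg/extra-modules}"
```

The alternatives system can also be asked directly with update-alternatives --display x86_64-linux-gnu_xorg_extra_modules, assuming the alternative's name matches the file name under /etc/alternatives shown above.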
I suspect this will stay with a bandaid plastered over it until next time there’s an update to the kernel version or the nvidia drivers, at which point it will sproing apart.
I’ve had a couple of X hangs on my CentOS 6 desktop with the latest nvidia driver update, kmod-nvidia-340xx-340.76-2.el6.elrepo.x86_64, enough that I’ve backed off to 76.1 in the hope of curing it.
I don’t seem to have had problems with the driver not loading, though the .ko files and symlinks to them seem to have changed structure – not what I’d have expected from a minor version number update.
I’ve not yet had any problems on my test CentOS 7 client, though it does get less use.
Proceed with caution seems to be the takeaway here.