Hm, it can be hard to describe without motion, but I'll try. It'll be very simplified, but basically check this:
Imagine that the vertical black line is a monitor, the little horizontal lines separate the pixels of the monitor, and the monitor is showing the scene on the left side (the eye on the right side is the viewer :-P). To render the scene the GPU takes "samples" from the scene for each pixel in the monitor:
The dotted lines are the samples - imagine a sample as a ray that begins at the pixel's center and travels towards the far end of the scene. These rays "hit" the objects in the scene, and the color of whatever object a ray hits becomes the pixel's color. Since only one sample is taken, any transition between objects (or between an object and emptiness) is hard - which is what creates jaggies.
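To make this a bit more concrete, here's a toy version in Python (the 1D "scene" of colored intervals and all the names are made up just for illustration) of taking a single sample at each pixel's center:

```python
# Toy 1D "renderer": one sample per pixel, taken at the pixel's center.
# The scene is a list of (start, end, color) intervals - purely illustrative.
SCENE = [(0.0, 2.5, "red"), (2.6, 2.7, "green"), (3.0, 5.0, "blue")]

def sample(x, scene, background="black"):
    """Return the color of whatever object the sample at position x hits."""
    for start, end, color in scene:
        if start <= x <= end:
            return color
    return background

def render(scene, num_pixels, width=5.0):
    pixel_size = width / num_pixels
    # One sample per pixel, at the pixel's center.
    return [sample((i + 0.5) * pixel_size, scene) for i in range(num_pixels)]

# The tiny green object never gets hit - no pixel center lands inside it:
print(render(SCENE, 10))
```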
Now on to shimmering: notice the three little objects at the top - these are very small and may or may not be sampled. In fact, in this image the top object is not sampled at all, but if these objects (or the camera) move a bit then this happens:
All objects moved down a little, which for the larger objects didn't change anything, but for the small ones it did: the object at the top is now sampled - which will make it appear on the monitor - and the second (green) object is no longer sampled - which will make it disappear from the monitor. Also, the yellow object is now only hit by one sample instead of two, so it appears a bit smaller on the monitor.
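Continuing the toy sketch from above (same made-up scene and functions), a tiny shift is enough to make a small object pop into existence:

```python
# Shift the whole scene slightly, as if the camera (or the objects) moved,
# and re-render: small objects pop in and out of the sample grid.
def shift(scene, dx):
    return [(start + dx, end + dx, color) for start, end, color in scene]

print(render(SCENE, 10))              # the tiny green object is missed
print(render(shift(SCENE, 0.1), 10))  # after a small move it is suddenly hit
```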
As objects (or the camera) move, these smaller objects can appear and disappear as they fall between samples. Note that in this image I used whole objects, but the GPU doesn't really sample objects - it samples geometry and runs a shader (a very small program) for each sample, which is what gives back the color. A very simple shader can simply apply a texture to an object, meaning that the object may appear more detailed than it really is, like this:
The object in the middle might be geometrically simple, but its texture causes it to have multiple colors - in this case the color that the sample hits is green, even though the object has four other colors, so the pixel will become green. As far as sampling is concerned, this object might as well be five smaller objects close to each other. Of course this means that as the camera or the object moves, its features will also come and go from view.
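In the same toy spirit, here's roughly what such a trivial "shader" looks like - one object, one made-up texture, several colors, and the sample only ever sees the texel it happens to land on:

```python
# A trivial "pixel shader": the object carries a tiny texture and the
# shader returns whichever texel the sample hits. Purely illustrative.
TEXTURE = ["red", "green", "blue", "green", "red"]  # 5 texels

def shade(x, start, end, texture):
    """Map the hit position to a texel - one object, several colors."""
    u = (x - start) / (end - start)          # 0..1 across the object
    texel = min(int(u * len(texture)), len(texture) - 1)
    return texture[texel]

# A sample hitting the middle of an object spanning 2.0..3.0 sees only
# the middle texel; the other four colors might as well not exist:
print(shade(2.5, 2.0, 3.0, TEXTURE))  # -> "blue"
```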
Basically all this is what we call aliasing (at least in computer graphics) and antialiasing is the effort to minimize its effect.
The best approach is simply to take more than one sample for each pixel, from different locations inside the pixel, and then average the results - this is called "supersampling". The idea is that with more samples you have more chances to hit smaller details (and so, in motion, fewer chances for those details to come and go), while by averaging the samples, the objects you hit more often contribute more to the final color (so a small detail will only contribute a little, whereas a larger one will contribute more).
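Sketched in the same toy setup (colors switched to RGB tuples so they can actually be averaged; the even sample spacing is just the simplest choice, real GPUs use smarter patterns):

```python
# Supersampling sketch: several samples spread inside each pixel, averaged.
SCENE_RGB = [(0.0, 2.5, (255, 0, 0)), (2.6, 2.7, (0, 255, 0)),
             (3.0, 5.0, (0, 0, 255))]

def sample_rgb(x, scene, background=(0, 0, 0)):
    for start, end, color in scene:
        if start <= x <= end:
            return color
    return background

def render_ssaa(scene, num_pixels, samples_per_pixel, width=5.0):
    pixel_size = width / num_pixels
    image = []
    for i in range(num_pixels):
        # Evenly spread sample positions inside the pixel.
        xs = [(i + (s + 0.5) / samples_per_pixel) * pixel_size
              for s in range(samples_per_pixel)]
        hits = [sample_rgb(x, scene) for x in xs]
        # Average the samples: objects hit more often contribute more.
        image.append(tuple(sum(h[k] for h in hits) // len(hits)
                           for k in range(3)))
    return image

# With 8 samples per pixel the tiny green object now contributes a little
# (1/8th) to the pixel it falls in, instead of flickering in and out:
print(render_ssaa(SCENE_RGB, 10, 8))
```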
The obvious issue with supersampling is that it is very slow, since it requires multiple samples and, depending on what is rendered, each sample can be very heavy (at minimum it requires performing several triangle checks and, if textures are used, accessing texture memory several times per pixel - at least once per sample). Computationally this is essentially the same as rendering at a higher resolution - 4x supersampling is like rendering the scene at 2x horizontal by 2x vertical resolution - while 4x is barely enough to get rid of jaggies and not really enough to get rid of the shimmering (detail appearing/disappearing in motion), for which you need at least 8x (at least for current monitor resolutions and sizes) and preferably 16x (...and really this is kind of subjective, but in general, the more the better).
As an example of pure SSAA (supersampling antialiasing) in action, see the "super resolution" features that modern GPU drivers provide, which let you select higher resolutions than your monitor supports by rendering at those resolutions and then downsampling - use 200% to get 4x supersampled output.
So, unsurprisingly, pretty much nothing does that - the only exception is, IIRC, 3dfx's Voodoo 4 and 5, which used SSAA (and AFAIK introduced antialiasing to consumer GPUs in general).
The next best thing is MSAA ("multisampling" antialiasing). MSAA was introduced back in the early GeForce days, before pixel shaders were a thing, so all you'd be worried about was textures - which, thanks to other techniques like bilinear filtering and mip-mapping, didn't really suffer as much from using a single sample per pixel. So what MSAA did was, instead of taking a fixed number of samples per pixel, take a single sample *inside* triangles and multiple samples at the triangle edges to figure out how much of a triangle covered that pixel. Once that "coverage" was found, it was used to calculate the triangle's contribution to the final pixel color, combined with the texture (if any), to produce the final image. This means that MSAA required a single texture access per pixel (assuming a single texture was used in the triangle being rendered, anyway), and since most pixels fall either fully inside or fully outside triangles (especially at the time, when models and environments were very low poly), it also required a single sample per pixel for the majority of pixels.
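A rough sketch of the coverage idea, reusing the RGB toy scene from the supersampling example (real MSAA resolves per-sample depth/color buffers, this just shows the "test coverage cheaply, shade once" split):

```python
# MSAA in miniature: coverage is tested at several positions per pixel,
# but the (potentially expensive) shading/texturing runs once per pixel.
def render_msaa(scene_rgb, num_pixels, coverage_samples, width=5.0):
    pixel_size = width / num_pixels
    image = []
    for i in range(num_pixels):
        pixel = [0.0, 0.0, 0.0]
        for start, end, color in scene_rgb:   # our stand-ins for triangles
            # Cheap test: how many coverage samples fall inside this object?
            covered = sum(
                1 for s in range(coverage_samples)
                if start <= (i + (s + 0.5) / coverage_samples) * pixel_size <= end)
            if covered == 0:
                continue
            # The expensive part (texture lookup / shading) happens ONCE
            # per pixel per triangle, then gets weighted by coverage.
            shaded = color                     # stand-in for real shading
            weight = covered / coverage_samples
            pixel = [p + c * weight for p, c in zip(pixel, shaded)]
        image.append(tuple(int(p) for p in pixel))
    return image

print(render_msaa(SCENE_RGB, 10, 8))  # edges get blended, interiors stay cheap
```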
Things got a bit more complicated with the introduction of pixel shaders. Pixel shaders are essentially little programs that run for each pixel to produce that pixel's final color. When used with MSAA, the pixel shader is still executed once per pixel, but it is also given the samples that were taken so it can decide what to do with them. A pixel shader can run the full shading calculations for each sample, or run the heavy calculations once and approximate the result for each sample. However, even when doing the full shading, MSAA still takes only a single sample per pixel in triangle interiors, meaning that a shader running for a pixel in the interior of a triangle will (practically) act as if there was no MSAA at all.
That last bit is important when considering techniques like normal maps. Normal maps are basically textures that, instead of containing color, contain small surface detail (per-texel surface orientations). They are used to approximate small detail without having actual geometry for it, allowing you to have, e.g., a single big triangle that looks like it is made of tons of smaller triangles. But since MSAA takes a single sample in triangle interiors, and as far as it knows what you are rendering isn't tons of smaller triangles but a single big one, it will render that detail as if there was no MSAA - i.e., with only a single sample per pixel.
Which brings us back to the issue in the third image above, with the objects falling between samples: as the camera or the normal-mapped object moves, those smaller details (even if the objects they lie on are big) appear and disappear all the time as they fall in between each pixel's single sample.
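If it helps, here's the same toy idea with a made-up normal map - the geometry is a single flat surface, but the shader fetches a per-texel normal and lights with it, and with one sample per pixel the baked-in highlight can be missed entirely:

```python
# Normal-map sketch: flat geometry, but the shader reads a per-texel normal
# and does Lambert lighting with it, so the surface looks bumpy. All the
# data here is made up for illustration.
NORMAL_MAP = [(0.0, 0.0, 1.0), (0.6, 0.0, 0.8), (0.0, 0.0, 1.0),
              (-0.6, 0.0, 0.8), (0.0, 0.0, 1.0)]   # unit normals, 5 texels
LIGHT = (0.6, 0.0, 0.8)                            # light direction

def shade_normal_mapped(u):
    """Lambert shading using the normal fetched at texture coordinate u."""
    n = NORMAL_MAP[min(int(u * len(NORMAL_MAP)), len(NORMAL_MAP) - 1)]
    return max(0.0, sum(a * b for a, b in zip(n, LIGHT)))

# One sample per pixel: with 3 pixels the bright texel (u ~ 0.2..0.4,
# intensity 1.0) is missed; with 5 pixels a sample happens to land on it.
print([round(shade_normal_mapped((i + 0.5) / 3), 2) for i in range(3)])
print([round(shade_normal_mapped((i + 0.5) / 5), 2) for i in range(5)])
```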
TAA is another attempt to fix that. TAA isn't really a single specific algorithm but a family of algorithms (AFAIK one of the earliest attempts in games was in a few PSP games that relied on how the LCD works to do antialiasing when the game ran at higher framerates), but all of them have the same idea at their core: approximate SSAA for the entire scene, but instead of taking multiple samples per pixel, take a single sample (or a few) per pixel and exploit the fact that most frames are the same as, or very similar to, the frames that preceded them - so the sample(s) taken in the current frame are combined with the samples taken in previous frames to calculate the final color. Given enough frames, TAA can approximate SSAA with a very high number of samples - this is why, if you stand still in games that use TAA, you get a very soft scene.
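The core of it can be sketched in a few lines, again on the RGB toy scene (random jitter and a fixed blend factor are the simplest possible choices here - real implementations use low-discrepancy jitter sequences, motion vectors, history clamping and so on):

```python
# TAA in miniature: one jittered sample per pixel per frame, blended into
# a running history. Over many static frames this converges towards the
# supersampled result.
import random

def render_taa_frame(scene_rgb, num_pixels, history, alpha=0.1, width=5.0):
    pixel_size = width / num_pixels
    frame = []
    for i in range(num_pixels):
        # One sample per pixel, at a randomly jittered position.
        x = (i + random.random()) * pixel_size
        current = sample_rgb(x, scene_rgb)
        if history is None:
            frame.append(current)
        else:
            # Exponential blend: mostly history, a little of the new sample.
            frame.append(tuple(int((1 - alpha) * h + alpha * c)
                               for h, c in zip(history[i], current)))
    return frame

history = None
for _ in range(60):          # a second's worth of frames of a static scene
    history = render_taa_frame(SCENE_RGB, 10, history)
print(history)               # close to the render_ssaa() result above
```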
TAA has its own problems, since the scene can change between samples (creating ghosting artifacts), and the sample position inside the pixel isn't configurable in all graphics APIs (it is only in D3D12, Vulkan and OpenGL, but in all of them it is optional - in Vulkan and OpenGL only via extensions - and it isn't available at all in Direct3D 11, which is what a ton of engines still use as their primary API). This means that to take multiple samples "manually", the scene is randomly shifted/jittered a little each frame, which in some cases can give the illusion of tiny flickering and, depending on how much it is shifted, can be the cause of excess blurring.
So there it is.
I think a TL;DR would be something like: "because tiny details can fall 'between' pixels, which causes them to appear and disappear as they cross pixels, which in motion looks like shimmering". This is independent of geometry (especially where complex shaders that try to add extra detail are concerned) and is even more visible when these small details are in high contrast.
For example, from a quick look at screenshots, Trials of Mana doesn't seem to have much in terms of small details, and there is little contrast in there either - which makes it look very smooth. Same with Animal Crossing on Switch. On the other hand, a game with tons of foliage, light shining through the leaves, dark areas with bright spots, and wind effects (that make leaves and foliage wave and move) will have a lot of shimmering without TAA, because those small details will often fall "between the pixels" (or actually, samples, if you read the big wall of text above :-P).