How many years before AI, under the control of an untrained person (i.e. not someone with digital art training of any kind), can produce 240p pixel art animated sprite work at the quality level of Metal Slug?
Estimates please.
My own estimate is maybe another 20-25 years.
In terms of AI being employed by digital artists as part of their process, and this being an undeniable productivity boost (as opposed to some idiosyncrasy of the artist) in producing *quality* game art, I would estimate another 10 years, plus considerable developer effort to build the tooling.
A bit difficult to estimate, as we have never had an "AI" tool for any application that could produce consistently quality results. Using the current approaches, my guess is probably never.
These tools seem to rely heavily on large amounts of data and regression; undoubtedly there are some CG techniques layered on top of that (feature extraction, edge detection, etc.). The model does not appear to have any grasp of form, pose, motion and so on, although I suspect there is some hard coding around humanoid generation (it knows a frontal face has two eyes, a mouth, etc.).
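Just to make concrete the sort of classical CV layer I mean (my own illustration, not anything I know about these models' internals), here is a bare-bones Sobel edge detector in Python/numpy:

    # A minimal sketch of a classical CV technique: Sobel edge
    # detection. "image" is assumed to be a 2D grayscale array.
    import numpy as np

    def sobel_edges(image):
        # Horizontal and vertical Sobel kernels.
        kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
        ky = kx.T
        h, w = image.shape
        gx = np.zeros((h - 2, w - 2))
        gy = np.zeros((h - 2, w - 2))
        # Naive convolution over the interior pixels.
        for i in range(h - 2):
            for j in range(w - 2):
                patch = image[i:i + 3, j:j + 3]
                gx[i, j] = np.sum(kx * patch)
                gy[i, j] = np.sum(ky * patch)
        # Gradient magnitude: large values mark edges.
        return np.hypot(gx, gy)

The point being: this kind of thing extracts local statistics from pixels. Nowhere in it is there a notion of a solid object.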
This does not scale up well, as we can see. Ask it to draw a cube, rotate it 25 degrees left and 15 up, and redraw. Trivial for a human with an understanding of form, but this very weak form of "AI" cannot even manage the basics. Instead it falls back to composing something from a regression over images of white cubes.
"White 3d cube on a black background, lighting top left corner. Cube rotated 15 degrees on the up axis. Top side is red colour."
It's worth noting that it has been trained(?), or algorithmically modelled, such that it can make a guess at where "top" is. I believe this is a few tricks employed to make it seem smarter than it is. They have obviously had a stab at "holding" as well.
But take the same prompt and extend it...
"White 3d cube on a black background, lighting top left corner. Cube rotated 15 degrees on the up axis. Top side is red colour. There is a monkey in a tracksuit holding a watermelon sitting on top."
Can form really be taught from images and tags alone? I would imagine you would need skin to sense through touch, as well as a muscle system engaged for handling weight. Underpinning that is a sense of space, and of objects in space. I suspect it may not be possible to encode this in a regression model using only images and tags.