Friday 26 April 2024

A RAPPING MONA LISA? MICROSOFT’S LATEST AI HAS GONE TOO FAR

 

One thing I definitely have not needed to hear in my life so far is the Mona Lisa gangsta rapping. I’m not even sure Leonardo da Vinci would have wanted to hear it either, had the tech existed back in the Renaissance. It certainly makes the woman a lot less enigmatic. But despite nobody wanting or needing it, AI has obliged in the form of Microsoft’s latest offering, VASA-1.


VASA-1 creates “lifelike audio-driven talking faces generated in real time”. In simple terms, it can take a still photo and an audio recording of a voice and create an incredibly convincing video of that person talking (or rapping). Microsoft has essentially opened a portal into deep-fake hell.

Any photo can be a talking head

Now, you may be wondering what’s so interesting and new about yet another AI thing. As Microsoft states in its TL;DR, VASA-1 offers “precise lip-audio sync, lifelike facial behaviour, and naturalistic head movements, generated in real-time.”


It’s that little detail about real-time generation that sends a shiver down the spine, honestly. Not only is the lip sync impeccably precise, the model also produces different expressions and deliveries, and it all happens in an instant. Essentially, then, any photo you upload online could have words put into your mouth.



Realism and controlled output

The examples shown by Microsoft demonstrate an incredible array of options for eye gaze, head tilt, and expression. The generated faces can convey a wide variety of nuanced facial expressions and natural head motions. Coupled with the accurate lip-syncing, this is pretty impressive.


The rapping Mona Lisa is a great example of what is possible. It uses audio of Anne Hathaway performing an impromptu rap that went viral in 2011.

The generator can also change the subject’s mood between happy, neutral, angry, and surprised. The other examples show just how lifelike the results are.


The examples given also show that the generator can handle non-English speech, singing, and artistic images (not photographs).

Real-time

Let’s go back to that real-time aspect. These videos can be generated rapidly, at a streaming rate of 45fps with only around 170ms of latency. Sure, at just 512×512px it’s not HD resolution; however, the fact that this can be streamed in near real time is both impressive and terrifying.

The future and deepfakes

We’ve already seen how destabilising fake images can be when circulated online. Imagine how much disruption something like this could cause, both on a wider political stage and on a more personal level.


From now on, any photo we take can be turned into a ‘talking head’ in a mere instant. We are quickly entering a world where photographic and video evidence cannot be trusted. How is this going to impact criminal justice proceedings? Intimate relationships?


Microsoft loosely acknowledges that this technology could have severe consequences, saying, “It is not intended to create content that is used to mislead or deceive. However, like other related content generation techniques, it could still potentially be misused for impersonating humans.”


Yet even while mentioning the potential for misuse, Microsoft is focussing on the “substantial positive potential of our technique” and is “aiming for positive applications.”


They may well be. However, I still feel that letting something like VASA-1 loose in our supercharged online political landscape could open a door into a world that none of us want. Rapping Mona Lisas and all.


https://www.diyphotography.net/a-rapping-mona-lisa-microsofts-latest-ai-has-gone-too-far/
