Google’s Gemini Omni turns images, audio, and text into video — and that’s just the beginning

When Google launched Gemini three years ago, the goal was to build a large multimodal language model: a single neural network trained on text, images, audio, and video that could generate content in any of these formats.

Today, at the Google I/O developer conference, the company took a concrete step toward that goal with Gemini Omni, a new family of multimodal models that Google CEO Sundar Pichai says will be able to “create anything from any input.”

Omni will start with video. Users can now combine images, audio, video, and text, and instead of just stringing these inputs together, Omni considers them all together to produce a coherent output. The result is high-quality video that reflects an understanding of physics, culture, history, and science.

Omni also allows users to edit images using plain text commands instead of complex editing software, much like Google’s Nano Banana.

Google already has a dedicated video model, Veo, which allows users to convert text and images into video clips, and even direct and customize avatars. But Nicole Brichtova, director of product management at Google DeepMind, says today’s release is more than just an update to Veo: “It’s the next step toward progress in combining Gemini’s intelligence with the rendering capabilities of our media models.”

Koray Kavukcuoglu, DeepMind’s chief technologist, gave reporters one example during a press conference on Monday: When Omni was given a prompt as simple as “Explain protein folding with clay,” it quickly produced a stop-motion demonstration video with a voiceover saying: “Proteins start out as chains of amino acids. They fold into patterns like alpha helices and flat parts called beta sheets, forming a perfect 3D shape.”

The long-term vision for Omni is broader: the model could eventually be used to do things like create images from audio, or audio from video.

“When we first announced Gemini, our first AI model was natively multimodal,” Pichai said during the press conference. “We knew that training it on a combination of text, code, audio, images and video would give it a deeper understanding of the world. With world models, AI goes from predicting text to simulating reality. Gemini Omni is the next step in this direction.”

As part of the release, users will also be able to create videos using their own digital avatars, something OpenAI popularized with Cameos in the now-defunct Sora app. To prevent deepfakes, users have to go through a dedicated process that involves recording themselves speaking a series of numbers aloud, according to Brichtova. The avatar is then stored for future use.

Additionally, all videos created with Omni will include Google’s SynthID digital watermark, which allows users to verify whether videos were created via Gemini products.

The first model in the family is Gemini Omni Flash, which launches today in the Gemini app, YouTube Shorts, and the AI creative studio Flow. Flash will be able to generate 10 seconds of video, which Brichtova says is not a limitation of the model, but rather a decision based on the desire to get it into more hands and the expectation that most users won’t want to create much longer videos yet. Longer video durations are in the works, however.

Google appears to be promoting Omni Flash as more of a consumer tool. The examples Brichtova and Gabi Barth-Maron, a research engineer at DeepMind, gave on a call with TechCrunch about the uses of digital avatars were all personal: creating a video of yourself winning an award or going to the moon, or removing a bystander from the background of a video you took while on vacation.

Barth-Maron put it more simply: “It’s like personal memes.”

“We definitely focused on making this easy to use for consumers,” Brichtova said. “Not many video models have been able to bridge that gap with consumers, so this is our game to do that.”

Ease of use comes with a caveat: Brichtova and Barth-Maron point out that editing prompts need to be very specific, otherwise Omni risks over-editing or inadvertently changing elements the user wants to keep, a problem Nano Banana users may also encounter.

Despite the near-term consumer focus, Omni’s creative implications are clear. Google will make Omni available via API in the coming weeks. The avatar creation tool, a capability available today in Shorts, is something Google expects content creators to pick up. But on a larger scale, end-to-end multimedia workflows could be transformative for advertisers and filmmakers.

Startup Luma AI is building something similar: an agentic tool that can create an entire advertising campaign from a short brief and a product image, powered by its own “unified” model.

“We’re actually very proud of the text rendering capabilities of the model, which is really useful for things like ads,” Brichtova said. “If you want a product out there, or even just a logo, it has to be precise… We definitely expect filmmakers and other creatives to use this model as well.”

More professional use cases may be better served by the Omni Pro model, which should perform better across all Omni tasks. Google hasn’t said when it will launch Pro yet, but Brichtova said it will happen when “we feel like we’re at a point where we have a step change over Flash.”
