Just text? Not at all!

Omni-modal AI can do more than just text – much more!

Last updated: 09.04.2025 11:30

Technological developments are coming thick and fast, and LLMs are no exception. What used to be purely text-based language models are now omni-modal. This means they can not only understand and generate text, but also recognize and generate images, audio and video. So what are the concrete benefits?

But what does "omni-modal" actually mean? Originally, LLMs were trained to analyze language: grammar, meaning, context. Thanks to new algorithms and huge amounts of training data, these models have since evolved and can now also process visual and acoustic information. In other words, artificial intelligence no longer just thinks in words: it sees, hears and speaks. This opens up new dimensions of interaction, because complex information can now be interpreted and used across different media. How do we benefit from this in everyday life? Five practical examples:

Who won?
A Yatzy evening is good fun, but keeping score at the end? Rather a chore. All it takes is a photo of the completed score sheet and the AI takes over: it recognizes the numbers, adds up all the categories automatically and shows who has won. No more arguments about miscalculations, no more discussions about bonus points, just clarity. And more time for the rematch.
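For anyone curious what this looks like under the hood: with a vision-capable model behind an OpenAI-compatible API, it is essentially a single request. The sketch below is a minimal illustration rather than production code; the model name, prompt and file path are placeholder assumptions.

```python
import base64

from openai import OpenAI  # assumes an OpenAI-compatible, vision-capable endpoint

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Encode the photo of the completed score sheet
with open("yatzy_sheet.jpg", "rb") as f:  # placeholder file name
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder: any omni-modal / vision model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": ("Read this Yatzy score sheet, add up every category "
                      "per player including the upper-section bonus, and "
                      "tell me who won.")},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)

print(response.choices[0].message.content)  # totals per player plus the winner
```

The same pattern, a photo plus a plain-language instruction, also covers the sightseeing and recipe examples below.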

Recognizing sights on vacation
Are you standing in front of an impressive building and wondering what it's all about? A photo is all it takes – the AI provides you with historical background, cultural facts and anecdotes.

Language barriers are a thing of the past
Asking for directions abroad or understanding the menu in a restaurant? No longer a problem. Simply speak the question into your cell phone – the AI recognizes your voice, translates into the desired language in real time and pronounces the translation clearly. So "Where is the nearest train station?" in Spanish becomes a fluent ¿Dónde está la estación de tren más cercana? – perfectly pronounced. Communication without a language course – and with a smile on both sides.
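Wired together as code, such a translation loop is just a chain of three calls: speech to text, translation, text to speech. The sketch below assumes OpenAI's audio and chat endpoints; the model names, voice and file names are placeholders.

```python
from openai import OpenAI  # assumes OpenAI's audio and chat endpoints

client = OpenAI()

# 1. Transcribe the spoken question recorded on the phone
with open("question.m4a", "rb") as audio_file:  # placeholder recording
    transcript = client.audio.transcriptions.create(
        model="whisper-1",  # placeholder speech-to-text model
        file=audio_file,
    )

# 2. Translate the transcript into the target language
translation = client.chat.completions.create(
    model="gpt-4o",  # placeholder chat model
    messages=[{
        "role": "user",
        "content": f"Translate into natural, spoken Spanish: {transcript.text}",
    }],
)
spanish_text = translation.choices[0].message.content

# 3. Speak the translation out loud
speech = client.audio.speech.create(
    model="tts-1",   # placeholder text-to-speech model
    voice="alloy",   # placeholder voice
    input=spanish_text,
)
speech.write_to_file("answer_es.mp3")
```

A natively omni-modal model can collapse these three steps into a single audio-in, audio-out request; the explicit chain simply makes the moving parts visible.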

What am I cooking today?
Take a look in the fridge, snap a photo of your ingredients, and the AI suggests suitable recipes. No extra shopping required.

Digitize business cards
After a business meeting, simply take a photo of the business card: the AI reads the data and transfers it directly to your contacts, or looks up the matching LinkedIn profile.
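A sketch of the extraction step, again assuming an OpenAI-compatible vision endpoint; the requested JSON fields and the file name are illustrative assumptions, and the hand-off to your address book or CRM is left out.

```python
import base64
import json

from openai import OpenAI  # assumes an OpenAI-compatible, vision-capable endpoint

client = OpenAI()

with open("business_card.jpg", "rb") as f:  # placeholder file name
    card_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder vision model
    response_format={"type": "json_object"},  # ask for machine-readable output
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": ("Extract name, role, company, email and phone number "
                      "from this business card and return them as a JSON "
                      "object.")},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{card_b64}"}},
        ],
    }],
)

contact = json.loads(response.choices[0].message.content)
print(contact)  # ready to be pushed into a contacts app or CRM
```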

Conclusion: AI now speaks every language, including image and sound
Omni-modal AI is more than just a technological milestone: it is changing the way we communicate with digital systems. From everyday work to leisure, from shopping advice to travel assistance, AI models that can process images, sound and speech are opening up completely new possibilities. And this is just the beginning. In the future, artificial intelligence will merge even more closely with us and our environment. Of course, these are all very simple examples that you can (and perhaps should) try out for yourself. But they show the immense possibilities that omni-modal AI opens up. All just text? Certainly not anymore!

    Author:


    Steffen Eichenberg

    Head of Software Engineering

    VIER
