(Editor’s Note: This article is part of the “Real Words or Buzzwords?” series, which examines how real words can become empty buzzwords and stifle technological progress.)
In the first two paragraphs of the comprehensive technical blog post Multimodality and Large Multimodal Models (LMMs), Chip Huyen, a computer scientist and author of books on AI, wrote the perfect introduction to this article. I have inserted the words in square brackets.
“For a long time, each ML [machine learning] model operated in a single data mode: text (translation, language modeling), image (object detection, image classification), or audio (speech recognition).
“However, natural intelligence is not limited to a single modality. Humans can read, speak and see. We listen to music to relax and pay attention to strange noises to detect danger. Being able to work with multimodal data is essential for us, or any AI, to function in the real world.”
The combination of large language models (LLMs) and large multimodal models is creating a step change (it has already begun) in the predictive, proactive and preventative operational capabilities of physical security. It will also significantly change physical security system design practices, including an evolution in how we use scenario-based security system design.
Solutions capable of achieving much of what I describe below already exist. Based on current trends in the application of these AI models, this article may read as “old news” within a year. The security industry and its customers (security practitioners and their organizations) still have much to learn about how best to use emerging AI. As I noted in my article on the watershed moment in physical security that we are living through, our thinking is the only limit on our risk mitigation capabilities.
Better, but not good enough
For 50 years, electronic security systems have been constrained by technological limitations. To compensate, we have relied on people and processes. Most readers know that humans are both expensive and subject to performance limitations, and that adding staff does not eliminate all security vulnerabilities.
The latest generation of AI-powered video analytics has significantly reduced false alarms in video streams. However, humans must still evaluate whether each detected person, object, or activity violates security policies or poses a risk. For example:
- A controlled access door alarm still requires someone to review the associated video feed and decide whether or not to dispatch an officer.
- Detection of tailgating often occurs after the fact, requiring officer intervention.
- Officers generally must investigate, determine whether further action is necessary (e.g., locate the offender, escort them), and assess whether an employee facilitated the unauthorized entry.
Some security systems now use AI-based analytics to predict and prevent tailgating at specific doors. For example, a system might temporarily disable access and announce “One entry at a time, please,” deterring further attempts. Although effective, these systems operate at the individual door level without correlating repeated attempts across multiple doors. This lack of correlation can leave threat patterns undetected, both in real time and in post-incident analysis.
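To make the missing cross-door correlation concrete, here is a minimal Python sketch of how repeated tailgating events might be linked across doors within a time window. The event fields, thresholds, and function names are hypothetical illustrations, not any vendor’s actual API.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

# Hypothetical illustration: treat tailgating detections as a facility-wide
# pattern rather than isolated, door-level alarms.
WINDOW = timedelta(minutes=30)   # look-back window for correlation (assumed value)
THRESHOLD = 3                    # attempts within the window that constitute a pattern

events_by_subject = defaultdict(deque)  # subject_id -> recent (timestamp, door) pairs

def record_tailgating_event(subject_id: str, door_id: str, timestamp: datetime) -> None:
    """Record a door-level tailgating detection and check for a cross-door pattern."""
    history = events_by_subject[subject_id]
    history.append((timestamp, door_id))

    # Drop events that fall outside the correlation window.
    while history and timestamp - history[0][0] > WINDOW:
        history.popleft()

    if len(history) >= THRESHOLD:
        doors = sorted({door for _, door in history})
        print(f"PATTERN ALERT: {subject_id} involved in {len(history)} "
              f"tailgating events at doors {doors} within {WINDOW}.")

# Example usage with fabricated event data:
now = datetime.now()
record_tailgating_event("badge-1042", "door-A", now)
record_tailgating_event("badge-1042", "door-C", now + timedelta(minutes=5))
record_tailgating_event("badge-1042", "door-F", now + timedelta(minutes=12))
```

Even this simple correlation step turns three “routine” door alarms into one reportable pattern, which is the kind of facility-level context today’s door-by-door analytics typically lack.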
New sensemaking and communication capabilities
Modern LLMs and LMMs can process and analyze large amounts of sequential or related data almost instantaneously, surpassing human capabilities for both historical and real-time analysis. These capabilities enable revolutionary advances in incident response, including the following (see the sketch after this list):
- Correlation between multiple inputs: Integrate and analyze diverse data sources (e.g., text, video, audio, and sensor alarm logs) instantly as they occur to quickly develop a coherent understanding of a situation.
- Situational understanding: Synthesize and contextualize a wide range of inputs to infer patterns, relationships, and the broader context of unfolding or historical scenarios.
- Application of rules or policies: Use predefined safety and security rules, learned models, or programmed policies to assess situations, identify anomalies, evaluate policy applicability, and detect threats.
- Plain-language explanations: Translate complex, multi-source data into clear narratives that describe the situation and its dynamics, generate notifications based on templates, and deliver reports tailored to the intended recipients (including in their native language).
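As a rough illustration of how these four capabilities could work together, the sketch below assembles multi-source events and a policy statement into a single prompt and hands it to a model for correlation, policy evaluation, and a plain-language summary. The `call_llm` stub, the event schema, and the policy text are assumptions for illustration only; a real deployment would use a specific vendor’s LLM/LMM API and its own data formats.

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder for an actual LLM/LMM API call (vendor-specific)."""
    raise NotImplementedError("Wire this to the model provider of your choice.")

def summarize_incident(events: list[dict], policy: str, recipient_language: str = "English") -> str:
    """Correlate raw multi-source events against policy and return a plain-language narrative."""
    prompt = (
        "You are a physical security analyst assistant.\n"
        f"Security policy in force:\n{policy}\n\n"
        "Chronological events from access control, video analytics, and audio sensors:\n"
        f"{json.dumps(events, indent=2)}\n\n"
        "1. Correlate the events into a single coherent account of what happened.\n"
        "2. State which policy provisions, if any, were violated.\n"
        f"3. Write the summary for a security operations officer in {recipient_language}."
    )
    return call_llm(prompt)

# Example input (fabricated):
events = [
    {"time": "14:02:10", "source": "access_control", "door": "B2", "event": "door forced open"},
    {"time": "14:02:12", "source": "video_analytics", "door": "B2", "event": "two persons entered, one badge read"},
    {"time": "14:09:45", "source": "access_control", "door": "B5", "event": "tailgating detected"},
]
policy = "One badge read per entry; all B-wing entries after hours require an escort."
```

The design point is not the prompt wording but the pipeline shape: structured, multi-source event data in; policy-aware, recipient-appropriate narrative out, produced in seconds rather than after a manual review.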