New Trends in Vision-Language Models (VLMs). The evolution of VLMs in recent months has been pretty impressive, and today's VLMs exhibit some remarkable capabilities. See the two links below on what VLMs can do and how they work.
Still, VLMs face challenges, for example around multimodal training datasets, resolution, long-form modalities, vision-language integration, and concept understanding. Somewhat along those lines, I see five trends happening in VLMs: 1) VLMs running in local environments, 2) emerging VLM video agents, 3) unified structure learning for VLMs, 4) personalisation of VLMs, and 5) fixing the VLM resolution curse. Let’s see…
VLMs in a local environment. In this blogpost, an independent AI researcher writes about experimenting with VLMs using only a local environment. Inspired by Phi-2: The surprising power of small LMs, and using the AnyMAL multimodal method from Facebook AI, the researcher describes in detail the challenges and the different architectures tried before achieving decent results locally, although still not close to academic SOTA. Blogpost: Findings on VLMs
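Below is a minimal PyTorch sketch of the AnyMAL-style recipe the post builds on: a frozen vision encoder's patch features are projected into the LLM's token-embedding space and prepended to the text embeddings. The dimensions, pooling step and the VisionToLLMProjector module are illustrative assumptions, not the blogpost's actual code.

```python
# Illustrative sketch (assumptions, not the blogpost's code) of an AnyMAL-style projector:
# frozen vision-encoder patch features -> LLM embedding space -> prepended to text tokens.
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    """Maps frozen vision-encoder features into the LLM's embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int, n_visual_tokens: int = 32):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(n_visual_tokens)   # compress patch tokens
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, n_patches, vision_dim)
        pooled = self.pool(patch_feats.transpose(1, 2)).transpose(1, 2)
        return self.proj(pooled)                              # (batch, n_visual_tokens, llm_dim)

# Toy usage with random tensors standing in for a ViT encoder and an LLM embedder.
vision_feats = torch.randn(1, 196, 1024)      # e.g. ViT-L patch features
text_embeds = torch.randn(1, 12, 2048)        # e.g. embedded prompt tokens
projector = VisionToLLMProjector(vision_dim=1024, llm_dim=2048)
visual_tokens = projector(vision_feats)
llm_input = torch.cat([visual_tokens, text_embeds], dim=1)   # fed to the frozen LLM
print(llm_input.shape)                                        # torch.Size([1, 44, 2048])
```

Only the projector (and optionally LoRA adapters on the LLM) would need training, which is what makes this sort of setup feasible on a single local GPU.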
New SOTA in long-form video understanding. Researchers at Stanford introduced a new approach for long-form video understanding that combines an LLM acting as an agent, a vision-language model (VLM), and a contrastive language-image model (CLIP). The researchers claim the approach is superior to the current SOTA in video understanding. Paper: VideoAgent: Long-form Video Understanding with LLM as Agent
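A rough sketch of the agent loop the paper describes: the LLM iteratively gathers frame-level evidence, CLIP retrieves candidate frames, a VLM captions them, and the loop stops when the agent is confident enough to answer. The clip_retrieve, vlm_caption and llm_agent functions are hypothetical stubs standing in for real model calls, not the authors' code.

```python
# Sketch of a VideoAgent-style loop; the three model calls are hypothetical stubs.
from typing import List

def clip_retrieve(query: str, video_frames: List[str], k: int = 3) -> List[str]:
    return video_frames[:k]                       # stub: pretend these are the top-k frames

def vlm_caption(frame: str) -> str:
    return f"caption of {frame}"                  # stub: frame-level description

def llm_agent(question: str, evidence: List[str]) -> dict:
    # stub: a real agent would return an answer plus a "do I need more frames?" decision
    return {"answer": "a person opens a door", "confident": len(evidence) >= 3}

def answer_video_question(question: str, video_frames: List[str], max_rounds: int = 4) -> str:
    evidence: List[str] = []
    for _ in range(max_rounds):
        # 1) retrieve frames relevant to what is still unknown
        for frame in clip_retrieve(question, video_frames):
            evidence.append(vlm_caption(frame))
        # 2) let the LLM agent decide whether the evidence suffices
        state = llm_agent(question, evidence)
        if state["confident"]:
            return state["answer"]
    return state["answer"]                        # best effort after max_rounds

print(answer_video_question("What happens at the end?", [f"frame_{i}.jpg" for i in range(100)]))
```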
New SOTA in UI & infographics understanding. Researchers at Google recently introduced a novel vision-language model that specialises in UI and infographics understanding. The model was trained on a unique mixture of datasets containing novel screen annotations, including the types and locations of UI elements. The researchers claim the model achieves SOTA in UI & infographics understanding. Paper: ScreenAI: A Vision-Language Model for UI and Infographics Understanding
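For illustration only (this is not ScreenAI's actual schema), here is the kind of screen-annotation training sample the description suggests: UI element types plus their normalised locations, serialised into text that a VLM can learn to emit from a screenshot.

```python
# Illustrative screen-annotation sample; field names and format are assumptions.
ui_elements = [
    {"type": "BUTTON",     "text": "Sign in",      "bbox": (0.62, 0.88, 0.94, 0.95)},
    {"type": "TEXT",       "text": "Welcome back", "bbox": (0.10, 0.12, 0.55, 0.18)},
    {"type": "TEXT_FIELD", "text": "Email",        "bbox": (0.10, 0.30, 0.90, 0.38)},
]

def serialise(elements):
    # normalised (x0, y0, x1, y1) boxes rendered as one flat annotation string
    parts = []
    for e in elements:
        coords = " ".join(f"{c:.2f}" for c in e["bbox"])
        parts.append(f'{e["type"]} "{e["text"]}" {coords}')
    return " ".join(parts)

print(serialise(ui_elements))
```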
New SOTA in visual document understanding. Researchers at Alibaba just introduced a new model for OCR-free visual document understanding that uses Unified Structure Learning (USL). With USL, the model trains on structure-aware parsing tasks and multi-grained text localisation tasks across five domains: document, webpage, table, chart, and natural image. The researchers claim the model achieves SOTA. Paper: mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding
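As a rough illustration (not the paper's exact data format), the two USL task types might look like this: a structure-aware parsing sample that maps a table image to structured text, and a multi-grained text localisation sample that pairs a text span with its bounding box.

```python
# Illustrative USL-style training samples; file names, fields and format are assumptions.
parsing_sample = {
    "image": "table_001.png",
    "task": "structure-aware parsing",
    "target": "| Product | Q1 | Q2 |\n| Widget | 120 | 135 |\n| Gadget | 80 | 95 |",
}

localisation_sample = {
    "image": "webpage_042.png",
    "task": "text localisation",
    "query": "Subscribe now",                    # granularity can be word / phrase / line / block
    "target_bbox": (0.71, 0.05, 0.93, 0.09),     # normalised (x0, y0, x1, y1)
}

for sample in (parsing_sample, localisation_sample):
    print(sample["task"], "->", sample["image"])
```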
Personalisation of VLMs. Most VLMs lack an understanding of user-specific concepts. Researchers at Snap et al. just introduced MyVLM, a new way to personalise VLMs. Given a small set of images depicting user-specific concepts, the researchers augmented a pretrained vision-language model (VLM) with learned concept embeddings so it can understand and reason over these user concepts. The researchers applied MyVLM to BLIP-2, LLaVA-1.6 and MiniGPT-v2 for personalised captioning, visual question answering, and referring expression comprehension. Check out the project page, code, data and demos here: MyVLM: Personalising VLMs for User-Specific Queries
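A minimal sketch of the personalisation idea, under my own assumptions rather than MyVLM's released code: a small concept head detects whether a user-specific concept appears in the image, and if so a learned concept embedding is appended to the frozen VLM's visual features so the language side can refer to it.

```python
# Illustrative concept head; the architecture details are assumptions, not MyVLM's code.
import torch
import torch.nn as nn

class ConceptHead(nn.Module):
    def __init__(self, feat_dim: int):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, 1)                          # "is my concept present?"
        self.concept_embedding = nn.Parameter(torch.randn(1, feat_dim))   # learned per concept

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (n_tokens, feat_dim) from the frozen vision encoder
        present = torch.sigmoid(self.classifier(visual_feats.mean(0)))    # detection probability
        if present > 0.5:
            # append the concept token so the VLM can be prompted with e.g. "<my-dog>"
            return torch.cat([visual_feats, self.concept_embedding], dim=0)
        return visual_feats

head = ConceptHead(feat_dim=768)
feats = torch.randn(32, 768)                      # stand-in for frozen VLM visual features
print(head(feats).shape)                          # 32 or 33 tokens depending on detection
```

Only the concept head and embedding are trained from the user's handful of images; the underlying VLM stays frozen, which is what keeps personalisation cheap.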
Fixing the resolution curse in VLMs. Resolution is a key problem in VLMs: they can't zoom, because they are limited by the input resolution of the pre-trained vision encoder, which is usually not very large. In this blogpost, Alex explains how you can use Visual Search, Visual Cropping and MC-LLaVA to work around this limitation. Blogpost: Breaking resolution curse of vision-language models.
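A minimal sketch of the visual-cropping idea: tile the full-resolution image, score each tile against the question, and pass the best tile alongside the downscaled full image to the VLM. The clip_score and vlm_answer functions are stubs standing in for real CLIP and VLM calls, not MC-LLaVA's implementation.

```python
# Illustrative visual-cropping loop; the scoring and answering calls are hypothetical stubs.
from PIL import Image

def clip_score(question: str, crop: Image.Image) -> float:
    return 0.0                                    # stub: replace with a real image-text similarity

def vlm_answer(question: str, images) -> str:
    return "stub answer"                          # stub: replace with a real VLM call

def answer_with_cropping(question: str, image: Image.Image, grid: int = 3) -> str:
    w, h = image.size
    tiles = [
        image.crop((i * w // grid, j * h // grid, (i + 1) * w // grid, (j + 1) * h // grid))
        for i in range(grid) for j in range(grid)
    ]
    best = max(tiles, key=lambda t: clip_score(question, t))   # visual search over tiles
    low_res = image.resize((336, 336))                          # what the encoder would normally see
    return vlm_answer(question, [low_res, best])                # global context + zoomed-in detail

print(answer_with_cropping("What does the small sign say?", Image.new("RGB", (2048, 1536))))
```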
Have a nice week.
- Evolutionary Model Merge: A New Way to Automate Model Dev
- A Visual Guide to Mamba and State Space Models
- DeepMind TacticAI: An AI-Assistant for Football Tactics
- How I Use Claude 3 and ChatGPT for Ad-hoc Tasks
- Visualisation of Large-scale Multimodal Datasets with Nomic Atlas
- New Embedding Quantisation for Faster, Cheaper Retrieval
- Introducing RAG 2.0: New SOTA Contextual Language Models
- Berkeley AIR – A New Approach to Modelling Extremely Large Images
- Mistral + Snowflake: The New Frontier in SQL Copilot Products
- Cosmopedia: How to Create Large-scale Synthetic Data for Pre-training
- LaVague: A Large Action Model for Automating Automation
- Claude-investor: Generative Stocks Investment Recommendations
- Devika – An Agentic AI Software Engineer that Follows Human Instructions
- An Overview of Contextual Bandits & RL
- The Bayesian Learning Rule & Adaptation in ML
- Reversible Residual Nets: How To Train NNs with Less GPU Memory
- DenseFormer: Faster Transformer Inference with Depth Weighted Averaging
- LlamaFactory v.2: Unified Efficient Fine-Tuning of 100+ Language Models
- Google Research – A Bag of Tricks for Few-Shot Class-Incremental Learning
- Autonomous Agents for Production Ready LLMs
- Predictive Scoring with MLOps and KubeDDR
- High-quality MLOps with Python’s ABC & Pydantic
- Announcing the 2024 Waymo Open Dataset Challenges
- Common Corpus: The Largest Public Domain Dataset, 500 Billion Words
- DROID (Distributed Robot Interaction Dataset), 76K Demonstrations
Tips? Suggestions? Feedback? Email Carlos
Curated by @ds_ldn in the middle of the night.