Published on June 14, 2024

Transformers Can Now Work Pixel by Pixel, Says Meta AI’s New Study

The study, exploring “Transformers on Individual Pixels,” challenges the long-held belief that locality – the notion that neighboring pixels are more related than distant ones – is a fundamental requirement for vision tasks.

by Gopika Raj

Recent research by Meta AI and the University of Amsterdam has shown that transformers, a popular neural network architecture, can operate directly on the individual pixels of an image without relying on the locality inductive bias present in most modern computer vision models.

Traditionally, computer vision architectures like Convolutional Neural Networks (ConvNets) and Vision Transformers (ViTs) have incorporated locality bias through techniques such as convolutional kernels, pooling operations, and patchification, assuming neighboring pixels are more related.

The researchers instead introduced Pixel Transformers (PiTs), which treat each pixel as an individual token and make no assumptions about the 2D grid structure of images. Surprisingly, PiTs achieved strong results across tasks including supervised classification, self-supervised learning, and image generation.
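To make the difference concrete, here is a minimal sketch of the two tokenization schemes in plain Python. It is an illustration, not the study's code: the function names `pixel_tokens` and `patch_tokens` and the tiny 4 x 4 image are assumptions made for this example, and a real model would go on to project each token into an embedding space.

```python
def pixel_tokens(image):
    """Flatten an H x W x C image (nested lists) into H*W tokens of length C."""
    return [list(pixel) for row in image for pixel in row]


def patch_tokens(image, patch):
    """Split an H x W x C image into non-overlapping patch x patch tokens.

    Each token concatenates the channels of every pixel in one patch, so the
    2D neighborhood is baked into the token itself (the locality bias).
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            token = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    token.extend(image[y][x])
            tokens.append(token)
    return tokens


# Hypothetical 4 x 4 RGB image whose pixel values encode their position.
image = [[[y, x, 0] for x in range(4)] for y in range(4)]

pix = pixel_tokens(image)           # one token per pixel
pat = patch_tokens(image, patch=2)  # one token per 2 x 2 patch

print(len(pix), len(pix[0]))  # 16 tokens, 3 values each
print(len(pat), len(pat[0]))  # 4 tokens, 12 values each
```

The pixel version yields more, smaller tokens, and nothing in a token says which pixels were its neighbors.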

Following the architecture of Diffusion Transformers (DiTs), PiTs operating on latent token spaces from VQGAN achieved better scores on quality metrics such as Fréchet Inception Distance (FID) and Inception Score (IS) than their locality-biased counterparts.

Pixel Transformers (PiTs) are computationally expensive because treating every pixel as a token produces much longer sequences, but they challenge the need for locality bias in vision models. Advances in handling large sequence lengths may make PiTs more practical.
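A rough back-of-the-envelope sketch shows why sequence length is the bottleneck. It assumes standard self-attention, whose cost grows with the square of the number of tokens; the 224 x 224 image size and 16 x 16 patch size are illustrative choices rather than figures taken from the study, and the helper names are hypothetical.

```python
def sequence_length(height, width, patch=1):
    """Number of tokens when an image is cut into patch x patch tokens."""
    return (height // patch) * (width // patch)


def attention_pairs(seq_len):
    """Token-to-token pairs scored by one full self-attention layer."""
    return seq_len * seq_len


# Illustrative sizes only: a 224 x 224 image, 16 x 16 patches vs single pixels.
patch_len = sequence_length(224, 224, patch=16)
pixel_len = sequence_length(224, 224, patch=1)

print(patch_len, attention_pairs(patch_len))  # 196 38416
print(pixel_len, attention_pairs(pixel_len))  # 50176 2517630976
print(attention_pairs(pixel_len) // attention_pairs(patch_len))  # 65536
```

Under these assumptions, going from patches to pixels multiplies the token count by 256 and the attention pairs by 65,536.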

The study highlights the potential of reducing inductive biases in neural architectures, which could lead to more versatile and capable systems for diverse vision tasks and data modalities.

Image generation using transformers

Several tools are available for image generation, including Midjourney, Stable Diffusion, and Invoke. Midjourney recently released a new feature, “Character Reference,” which it claims can generate consistent characters across multiple reference images.

Stability AI announced Stable Diffusion 3, which it describes as its most capable text-to-image model, featuring significantly enhanced performance in multi-subject prompts, image quality, and spelling abilities.
