Details the Q-Former architecture: a 12-layer BERT-based model using 32 learnable query embeddings. These queries use cross-attention to extract visual information for MLLM input.

Visual Prompt Generation: Cross-Attention in Q-Former


Abstract and 1 Introduction

  2. Related Work

    2.1. Multimodal Learning

    2.2. Multiple Instance Learning

  3. Methodology

    3.1. Preliminaries and Notations

    3.2. Relations between Attention-based VPG and MIL

    3.3. MIVPG for Multiple Visual Inputs

    3.4. Unveiling Instance Correlation in MIVPG for Enhanced Multi-instance Scenarios

  4. Experiments and 4.1. General Setup

    4.2. Scenario 1: Samples with Single Image

    4.3. Scenario 2: Samples with Multiple Images, with Each Image as a General Embedding

    4.4. Scenario 3: Samples with Multiple Images, with Each Image Having Multiple Patches to be Considered and 4.5. Case Study

  5. Conclusion and References

Supplementary Material

A. Detailed Architecture of QFormer

B. Proof of Proposition

C. More Experiments

Figure 7. Overview of QFormer

A. Detailed Architecture of QFormer

The architecture overview is depicted in Figure 7. Specifically, QFormer is initialized from a BERT-based model [8] comprising a total of L = 12 layers. In contrast to typical BERT models, which process textual inputs, QFormer takes R = 32 learnable query embeddings as inputs. These embeddings are used to extract visual information from the input visual data during Stage-1 pretraining in BLIP-2 [22]; after projection, they then serve as visual prompt embeddings for the LLM inputs.
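As a concrete illustration, the minimal PyTorch sketch below shows how such learnable queries stand in for BERT's token inputs. The hidden size of 768 is an assumption carried over from BERT-base, and the class and variable names are hypothetical.

```python
import torch
import torch.nn as nn

class QFormerQueries(nn.Module):
    """Learnable query embeddings that replace BERT's token inputs."""
    def __init__(self, num_queries=32, hidden_dim=768):
        super().__init__()
        # R = 32 learnable queries; BERT-style initialization (std = 0.02).
        self.queries = nn.Parameter(torch.empty(num_queries, hidden_dim))
        nn.init.normal_(self.queries, std=0.02)

    def forward(self, batch_size):
        # Broadcast the shared queries across the batch: (B, R, D).
        return self.queries.unsqueeze(0).expand(batch_size, -1, -1)

prompts = QFormerQueries()(batch_size=4)
print(prompts.shape)  # torch.Size([4, 32, 768])
```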

Inside the QFormer, each layer includes a self-attention module composed of a Multi-Head Attention component and a Forward module (consisting of Linear, LayerNorm, and a residual connection). A cross-attention module, initialized with random values, is inserted every G layers; there, the learnable query embeddings interact with the visual embeddings. In the main paper, for conciseness, we condensed the multi-head attention and forward modules into self- (cross-) attention modules, and we illustrated only the modifications made to the cross-attention module in MIVPG, as the self-attention modules remain unchanged. The final QFormer output is given by the last layer's query embeddings.
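The sketch below illustrates this layer structure under stated assumptions: self-attention over the queries, a randomly initialized cross-attention block present only on every G-th layer, and a Forward module. G = 2 follows the BLIP-2 reference implementation; the head count, feed-forward width, and input shapes are illustrative assumptions, not values from this paper.

```python
import torch
import torch.nn as nn

class QFormerLayerSketch(nn.Module):
    """One Q-Former layer: self-attention over the queries, an optional
    cross-attention block toward visual embeddings, and a Forward module."""
    def __init__(self, dim=768, heads=12, has_cross_attn=False):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_norm = nn.LayerNorm(dim)
        self.has_cross_attn = has_cross_attn
        if has_cross_attn:
            # Randomly initialized; only present on every G-th layer.
            self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.cross_norm = nn.LayerNorm(dim)
        # Forward module: Linear layers with LayerNorm and a residual connection.
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.ffn_norm = nn.LayerNorm(dim)

    def forward(self, q, visual):
        # Self-attention among the learnable queries.
        q = self.self_norm(q + self.self_attn(q, q, q, need_weights=False)[0])
        if self.has_cross_attn:
            # Queries attend to visual embeddings (Q from queries, K/V from vision).
            q = self.cross_norm(
                q + self.cross_attn(q, visual, visual, need_weights=False)[0]
            )
        return self.ffn_norm(q + self.ffn(q))

# Stack L = 12 layers with cross-attention inserted every G = 2 layers.
G, L = 2, 12
layers = nn.ModuleList(
    QFormerLayerSketch(has_cross_attn=(i % G == 0)) for i in range(L)
)

x = torch.randn(2, 32, 768)    # query embeddings (B, R, D)
v = torch.randn(2, 257, 768)   # assumed frozen-ViT patch embeddings (B, N, D)
for layer in layers:
    x = layer(x, v)            # final x is the Q-Former output
```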

For a more comprehensive understanding, readers are encouraged to refer to [22].


:::info Authors:

(1) Wenliang Zhong, The University of Texas at Arlington ([email protected]);

(2) Wenyi Wu, Amazon ([email protected]);

(3) Qi Li, Amazon ([email protected]);

(4) Rob Barton, Amazon ([email protected]);

(5) Boxin Du, Amazon ([email protected]);

(6) Shioulin Sam, Amazon ([email protected]);

(7) Karim Bouyarmane, Amazon ([email protected]);

(8) Ismail Tutar, Amazon ([email protected]);

(9) Junzhou Huang, The University of Texas at Arlington ([email protected]).

:::


:::info This paper is available on arXiv under a CC BY 4.0 Deed (Attribution 4.0 International) license.

:::

