Publications

* denotes equal contribution and joint lead authorship.


2025

  1. MAPS: Preserving Vision-Language Representations via Module-Wise Proximity Scheduling for Better Vision-Language-Action Generalization

    arXiv preprint, 2025. In submission to CVPR 2026.

    Vision-Language-Action (VLA) models inherit strong priors from pretrained Vision-Language Models (VLMs), but naïve fine-tuning often disrupts these representations and harms generalization. Existing fixes -- freezing modules or applying uniform regularization -- either overconstrain adaptation or ignore the differing roles of VLA components. We present MAPS (Module-Wise Proximity Scheduling), the first robust fine-tuning framework for VLAs. Through systematic analysis, we uncover an empirical order in which proximity constraints should be relaxed to balance stability and flexibility. MAPS linearly schedules this relaxation, enabling visual encoders to stay close to their pretrained priors while action-oriented language layers adapt more freely. MAPS is parameter-free, data-free, and plug-and-play with existing architectures. Across MiniVLA-VQ, MiniVLA-OFT, OpenVLA-OFT, and benchmarks like LIBERO, CALVIN, and SimplerEnv, MAPS boosts both in- and out-of-distribution performance (up to +30%). Our findings highlight empirically guided proximity to pretrained VLMs as a simple yet powerful principle for scalable VLA adaptation.
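
    The exact relaxation order and coefficients from the paper are not reproduced here; the sketch below only illustrates the mechanism the abstract describes -- a module-wise L2 proximity penalty toward the pretrained VLM weights whose strength is relaxed linearly over training -- with hypothetical module prefixes and rates.

    ```python
    import torch

    def proximity_penalty(model, pretrained_state, step, total_steps, schedules):
        """Module-wise L2 proximity to pretrained weights, with per-module
        coefficients relaxed linearly over training. A generic sketch of
        proximity scheduling, not the paper's exact schedule."""
        progress = min(step / max(total_steps, 1), 1.0)
        penalty = 0.0
        for name, param in model.named_parameters():
            for prefix, (w_start, w_end) in schedules.items():
                if name.startswith(prefix):
                    weight = w_start + (w_end - w_start) * progress  # linear relaxation
                    ref = pretrained_state[name].to(param.device)
                    penalty = penalty + weight * (param - ref).pow(2).sum()
                    break
        return penalty

    # Hypothetical schedule: keep the visual encoder near its pretrained prior
    # throughout fine-tuning, while action-oriented language layers relax to zero.
    schedules = {
        "vision_encoder.": (1e-2, 5e-3),
        "language_model.": (1e-2, 0.0),
    }
    # loss = task_loss + proximity_penalty(model, pretrained_state, step, total_steps, schedules)
    ```
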
  2. Towards Streaming LiDAR Object Detection with Point Clouds as Egocentric Sequences
    Mellon M. Zhang, Glen Chou, and Saibal Mukhopadhyay

    In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2026.

    Accurate and low-latency 3D object detection is essential for autonomous driving, where safety hinges on both rapid response and reliable perception. While rotating LiDAR sensors are widely adopted for their robustness and fidelity, current detectors face a trade-off: streaming methods process partial polar sectors on the fly for fast updates but suffer from limited visibility, cross-sector dependencies, and distortions from retrofitted Cartesian designs, whereas full-scan methods achieve higher accuracy but are bottlenecked by the inherent latency of a LiDAR revolution. We propose Polar-Fast-Cartesian-Full (PFCF), a hybrid detector that combines fast polar processing for intra-sector feature extraction with accurate Cartesian reasoning for full-scene understanding. Central to PFCF is a custom Mamba SSM-based streaming backbone with dimensionally-decomposed convolutions that avoid distortion-heavy planes, enabling parameter-efficient, translation-invariant, and distortion-robust polar representation learning. Local sector features are extracted via this backbone, then accumulated into a sector feature buffer to enable efficient inter-sector communication through a full-scan backbone. PFCF establishes a new Pareto frontier on the Waymo Open dataset, surpassing prior streaming baselines by 10% mAP and matching full-scan accuracy at twice the update rate.
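
    As a rough, hypothetical illustration of the streaming pattern described above (fast per-sector feature extraction feeding a rolling buffer that a full-scan stage consumes), not the paper's implementation:

    ```python
    from collections import deque

    import torch
    import torch.nn as nn

    class SectorStreamingDetector(nn.Module):
        """Illustrative streaming skeleton: a per-sector encoder runs as each polar
        sector arrives, its features enter a buffer covering one revolution, and a
        full-scan head reasons over the buffered sectors. The two submodules stand
        in for the paper's polar and Cartesian backbones."""

        def __init__(self, sector_encoder: nn.Module, full_scan_head: nn.Module,
                     num_sectors: int = 8):
            super().__init__()
            self.sector_encoder = sector_encoder      # fast intra-sector (polar) stage
            self.full_scan_head = full_scan_head      # full-scene (Cartesian) stage
            self.buffer = deque(maxlen=num_sectors)   # most recent revolution of sectors

        def forward(self, sector_points: torch.Tensor) -> torch.Tensor:
            # Extract intra-sector features as soon as the sector arrives.
            sector_feat = self.sector_encoder(sector_points)
            self.buffer.append(sector_feat)
            # Reason over all buffered sectors; the concatenation axis is illustrative.
            scene_feat = torch.cat(list(self.buffer), dim=1)
            return self.full_scan_head(scene_feat)
    ```
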
  3. Polar Hierarchical Mamba
    Mellon M. Zhang and Glen Chou

    In the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshop on 4D Vision: Modeling the Dynamic World, 2025.

    Accurate and efficient object detection is a crucial component of fully autonomous driving. LiDAR sensors are employed to augment or replace cameras for greater robustness across diverse driving conditions, making object detection on LiDAR point clouds a critical area of research. Traditional approaches wait for a full 360-degree revolution of the scanning sensor before processing the entire point cloud at once, introducing significant latency and lowering throughput. Previous streaming approaches process egocentric partial scans directly in the sensor's native polar coordinate system, but rely on translation-invariant convolutions, which are incompatible with polar coordinates and lead to performance degradation. In this paper, we show that this reliance on convolutions is unnecessary and propose a Mamba-only backbone built from Polar Hierarchical Mamba (PHiM) blocks, which aggregate per-point features within each partial scan using a local bidirectional state space model and capture higher-level global features in a streaming fashion using a global forward state space model. On the Waymo Open dataset, our model improves on the previous leading polar-based detector by 10%, achieving state-of-the-art performance among polar-based methods while remaining competitive with Cartesian-based detectors and delivering a 2x improvement in processing throughput, measured as predictions per second.
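
    A structural sketch of this local/global composition, with the state space modules passed in as placeholders (their interfaces below are assumptions; the actual blocks in the paper are Mamba-based):

    ```python
    import torch
    import torch.nn as nn

    class PolarHierarchicalBlock(nn.Module):
        """Sketch only: a bidirectional sequence model over the points of one
        partial scan, plus a forward-only (causal) model whose state is carried
        across scans for streaming. The submodules are placeholders."""

        def __init__(self, local_fwd: nn.Module, local_bwd: nn.Module,
                     global_fwd: nn.Module):
            super().__init__()
            self.local_fwd = local_fwd      # local SSM, forward direction
            self.local_bwd = local_bwd      # local SSM, backward direction
            self.global_fwd = global_fwd    # global SSM, forward/causal only

        def forward(self, sector_tokens: torch.Tensor, global_state):
            # Local bidirectional pass over the current partial scan.
            fwd = self.local_fwd(sector_tokens)
            bwd = torch.flip(self.local_bwd(torch.flip(sector_tokens, dims=[1])), dims=[1])
            local_feat = fwd + bwd
            # Global forward pass: only the new scan is processed, with hidden
            # state carried over from previously seen scans (assumed interface).
            global_feat, global_state = self.global_fwd(local_feat, global_state)
            return global_feat, global_state
    ```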

2024

  1. DFDNet: Directional Feature Diffusion for Efficient Fully-Sparse LiDAR Object Detection
    Mellon M. Zhang, Hemant Kumawat, and Saibal Mukhopadhyay

    In submission to TMLR.

    LiDAR-based object detection is essential for autonomous driving but remains computationally demanding. Conventional methods use dense feature map representations, leading to significant computational overhead and underutilizing the inherent sparsity of LiDAR data. Recent fully sparse detectors show promise but suffer from missing central object features due to the surface-dominant distribution of LiDAR points. Sparse feature diffusion methods attempt to address this by expanding features within object bounding boxes to cover neighboring regions before the detection head. However, these approaches incur excessive computational cost because larger objects require a large diffusion range. In this paper, we propose DFDNet, a fully sparse directional feature diffusion network that introduces a novel adaptive sparse feature realignment module, which dynamically projects sparse features along object centerlines prior to feature diffusion. This realignment enables efficient, directional feature diffusion along the object centerline. The resulting diffused features are then aggregated via max-pooling to construct a refined feature representation for each object. Our method reduces redundant sparse feature computations, achieving a two-fold reduction in computational load while improving performance over state-of-the-art detectors on the Waymo and nuScenes benchmarks.
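
    A simplified, hypothetical version of the diffusion-and-pooling step described above; the offsets, number of steps, and aggregation below are illustrative choices, not the paper's exact design:

    ```python
    import torch

    def directional_diffusion(coords, feats, directions, num_steps=3):
        """Replicate sparse voxel features at a few offsets along an estimated
        centerline direction, then max-pool features that land on the same voxel.
        coords:     (N, 3) integer voxel coordinates
        feats:      (N, C) sparse voxel features
        directions: (N, 3) unit vectors along estimated object centerlines
        """
        all_coords, all_feats = [coords], [feats]
        for step in range(1, num_steps + 1):
            offset = torch.round(step * directions).long()
            all_coords.append(coords + offset)   # diffuse forward along the centerline
            all_coords.append(coords - offset)   # and backward
            all_feats.extend([feats, feats])
        coords_cat = torch.cat(all_coords, dim=0)
        feats_cat = torch.cat(all_feats, dim=0)

        # Max-pool features that fall on the same voxel after diffusion.
        uniq, inverse = torch.unique(coords_cat, dim=0, return_inverse=True)
        pooled = feats_cat.new_full((uniq.shape[0], feats_cat.shape[1]), float("-inf"))
        index = inverse.unsqueeze(1).expand_as(feats_cat)
        pooled = pooled.scatter_reduce(0, index, feats_cat, reduce="amax")
        return uniq, pooled
    ```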