Announced at GTC 2025, Hafnia is a Milestone Systems project dedicated to offering services centered around its vision of delivering the largest real-world video library for computer vision. By reducing the time and cost required to develop compliant, high-performance AI solutions by up to 30x, this initiative gives computer vision developers access to vast amounts of high-quality, curated, and annotated video data. With thousands of hours of footage entering its library daily, searchable and refined through Milestone technology, the project ensures that AI models can be trained on the most relevant, highest-quality datasets. To prove the datasets’ capabilities in a demanding case study, we worked with NVIDIA NeMo Curator and DGX Cloud to fine-tune state-of-the-art vision language models (VLMs) for vehicle applications such as traffic and transportation.
VLMs have unlocked new AI capabilities. Visual AI agents, like those built with the NVIDIA AI Blueprint for video search and summarization (VSS), rely on accurate VLMs. All-purpose models have proven effective in diverse scenarios, but when those scenarios become too specific or complex, the outcomes are less useful. While RAG-like approaches help add information from those contexts, general models can struggle to grasp the relevant elements and dynamics of a scene. There are video inference gaps that can't be filled with external databases. Fine-tuning remains essential, but its effectiveness depends on access to large-scale, well-curated, and labeled video datasets, which, together with managing complex hardware infrastructure, have become major bottlenecks in AI development.
We retrieved 750,000+ hours of real-world traffic footage, around 90% of it enriched with contextual, instance, and/or scene-level annotations from the Project Hafnia library, to support VLM fine-tuning for vehicle applications. However, even with pre-annotations, generating the specific labels and meeting the scale and quality required for high-performance fine-tuning remained a massive challenge.
To tackle this, Project Hafnia integrated NVIDIA NeMo Curator into its data pipeline, leveraging AI-driven video selection and annotation to build a high-quality fine-tuning dataset for traffic-specific VLMs. This post explores how NeMo Curator on DGX Cloud transformed our curation workflow, processing 10,000+ curated hours of video in under five days and dramatically reducing manual effort while maintaining annotation quality.
Developing high-performance computer vision AI has traditionally been a slow, resource-intensive process. Project Hafnia was designed to change that by offering a vast, AI-powered video data ecosystem that reduces the time needed to build reliable and compliant AI solutions by up to 30x.
Instead of spending months or even years gathering, filtering, and annotating video data before even training a model, developers can retrieve, refine, and fine-tune AI models in days using Hafnia’s data and automated pipeline—from identifying the AI need to having a production-ready model.
Figure 1: Homepage of Project Hafnia.
Project Hafnia curates and processes one of the largest real-world video data libraries in the world, capturing diverse environments, scenarios, and contextual conditions essential for training robust computer vision models. For this specific VLM fine-tuning use case, we selected 750,000+ hours of real-world traffic footage, a subset of Hafnia’s data sources, to build a high-performance model for vehicle-related applications.
Why AI Developers Choose Project Hafnia
- Access to vast real-world video data: eliminates the need for expensive, manual data collection.
- Scalable and compliant retrieval and annotation pipelines: ensures high-quality, traceable datasets that meet regulatory standards.
- Seamless data-to-model pipeline: reduces AI development time from months to weeks.
With Hafnia’s AI-driven data library, developers no longer need to struggle with data bottlenecks. Instead, they can focus on innovation and model deployment, confident that they are working with the most relevant, high-quality, and regulation-compliant video data available.
Fine-tuning Vision-Language Models (VLMs) for traffic and transportation applications presents a significant challenge: generalist models struggle with domain-specific nuances that impact real-world performance. From varying road conditions to complex urban environments, traditional models often lack the fine-grained understanding needed for high-stakes applications like autonomous driving, traffic monitoring, and smart infrastructure management.
The Problem: When Generalist VLMs Fall Short
While VLMs have revolutionized AI in image/video by learning from vast multimodal datasets, they struggle with scene-specific complexities when deployed in real-world, high-precision scenarios like traffic analysis.
Limitations of Off-the-Shelf VLMs:
- Lack of domain-specific knowledge: general models aren’t optimized for traffic behavior, vehicle interactions, or infrastructure-specific elements.
- Contextual blind spots: VLMs can miss critical real-world conditions like road closures, variable lighting, or occlusions from traffic congestion.
- Inference gaps: while Retrieval-Augmented Generation (RAG) can add missing knowledge, it can’t compensate for missing visual understanding in video-based AI applications.
Fine-tuning remains the only way to bridge these gaps, but it requires vast, well-annotated datasets, which are difficult to build manually at scale.
The Solution: AI-Powered Fine-Tuning with Project Hafnia
For this use case, we curated and processed 750,000+ hours of real-world traffic footage from the Hafnia library, enriched with:
- Scene-level annotations – Weather conditions, lighting, road type, traffic flow.
- Instance-level metadata – Vehicle types, lane occupancy, occlusions.
- Camera-specific information – Angles, location, motion dynamics.
Figure 2: Distribution of the labels from the library used in the fine-tuning.
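To make the shape of this metadata concrete, a single annotated clip could be represented roughly as in the sketch below. The field names are an illustrative assumption about how scene-, instance-, and camera-level labels might be grouped, not the library's actual schema.

```python
# Illustrative record for one annotated clip. The field names are an
# assumption about how scene-, instance-, and camera-level metadata could be
# structured; they are not the actual Project Hafnia schema.
from dataclasses import dataclass, field


@dataclass
class AnnotatedClip:
    clip_id: str
    # Scene-level annotations
    weather: str
    lighting: str
    road_type: str
    traffic_flow: str
    # Instance-level metadata
    vehicle_types: list[str] = field(default_factory=list)
    lane_occupancy: dict[int, int] = field(default_factory=dict)  # lane -> vehicle count
    occlusions: bool = False
    # Camera-specific information
    camera_angle_deg: float = 0.0
    location: str = ""
    camera_motion: str = "static"


clip = AnnotatedClip(
    clip_id="cam_01_000123",
    weather="rain",
    lighting="night",
    road_type="highway",
    traffic_flow="congested",
    vehicle_types=["car", "truck"],
    lane_occupancy={1: 4, 2: 6, 3: 2},
)
print(clip)
```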
Even with pre-annotated metadata, refining and structuring this vast dataset into a VLM fine-tuning-ready corpus required an AI-driven approach. This is where NVIDIA’s NeMo Curator and DGX Cloud came into play, allowing us to intelligently filter, annotate, and optimize the dataset at scale.
The Results: Transforming Fine-Tuning from a Bottleneck to an AI-Driven Pipeline
- Dataset Selection Acceleration: curated 10,000 high-value hours in days instead of months.
- Automated Annotation at Scale: VLM-powered scene understanding produced structured, fine-tuning-ready labels with minimal human intervention.
- Seamless AI-Powered Processing: DGX Cloud infrastructure enabled petabyte-scale dataset transformation without computational slowdowns.
By leveraging Project Hafnia’s AI-ready data and NVIDIA’s AI-powered curation tools, we removed the traditional roadblocks in VLM fine-tuning, proving that large-scale, high-performance computer vision models can be trained efficiently and reliably when built on the right foundation.
Figure 3: Diagram flow of VLM fine-tuning by Project Hafnia powered by NVIDIA and AI blueprint for VSS.
Fine-tuning a VLM for traffic and transportation requires more than just raw video data; it demands a highly curated, well-annotated dataset that captures the right contextual, spatial, and temporal information. Manually creating such a dataset at scale would take months, if not years.
By integrating NVIDIA’s NeMo Curator into Project Hafnia’s AI-powered data pipeline, we automated video selection, annotation, and refinement, reducing dataset creation time from months to days while ensuring high annotation quality and compliance.
Phase 1: Identifying High-Value Fine-tuning Footage
With 750,000+ hours of raw traffic video available for this project, we needed a way to quickly extract the most relevant clips for VLM training. Instead of manual filtering, we leveraged NeMo Curator’s AI-driven selection process to identify the best data for fine-tuning.
Figure 4: Diagram of Project Hafnia relevance pipeline powered by NeMo Curator.
- Generating embeddings with NeMo Curator
- Ran NeMo Curator to extract embeddings from the 750,000-hour video library.
- Embedded data enabled semantic search and similarity-based retrieval.
- Leveraging Project Hafnia’s metadata for targeted filtering
- Used existing metadata (e.g., weather, camera type) to identify key scenarios, i.e., "golden clips".
- Retrieving additional relevant clips via embedding similarity
- Identified “golden clips” and expanded the dataset using nearest-neighbor search in the embedding space (a minimal sketch of this retrieval step appears after Figure 5).
- Final dataset selection
- Reduced 750,000 hours → 10,000 high-value hours for fine-tuning through embedding search.
Figure 5: Static views of golden clips retrieved.
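A minimal sketch of the golden-clip expansion step is shown below. It assumes the clip embeddings have already been produced by NeMo Curator's embedding stage and are L2-normalized; the nearest-neighbor retrieval itself is plain NumPy for illustration, not a Curator API.

```python
# Minimal sketch of expanding a set of "golden clips" via nearest-neighbor
# search over clip embeddings. The embeddings are assumed to come from the
# NeMo Curator embedding stage; the retrieval below is generic NumPy.
import numpy as np


def expand_golden_set(
    embeddings: np.ndarray,   # (num_clips, dim), L2-normalized clip embeddings
    golden_idx: list[int],    # indices of clips flagged via metadata filters
    per_seed: int = 50,       # neighbors retrieved per golden clip
) -> np.ndarray:
    """Return indices of golden clips plus their nearest neighbors."""
    selected = set(golden_idx)
    for i in golden_idx:
        # Cosine similarity reduces to a dot product for normalized vectors.
        sims = embeddings @ embeddings[i]
        neighbors = np.argsort(-sims)[: per_seed + 1]  # +1: the clip matches itself
        selected.update(int(j) for j in neighbors)
    return np.array(sorted(selected))


# Example with random stand-in embeddings
rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 512)).astype(np.float32)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
print(expand_golden_set(emb, golden_idx=[3, 42, 99], per_seed=10).shape)
```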
Once the most relevant video clips were selected, we needed to generate detailed, structured annotations that a VLM could learn from. Manually labeling 10,000+ hours of traffic video would have been impractical, so we automated the process with NeMo Curator’s multi-stage annotation pipeline.
Figure 6: Diagram of Project Hafnia curation pipeline powered by NeMo Curator.
- Metadata-assisted + Human Annotations Prompting
- Engineered prompts merged with annotations (from the Project Hafnia data library) and camera metadata (e.g., "This is a highway camera capturing three lanes…").
- VLM-Powered Scene Understanding
- NeMo Curator used different configurations of high-performing VLMs to generate captions and structured scene descriptions.
- Post-processing for Accuracy & Consistency
- Applied LLMs for caption refinement, ensuring temporal coherence and stability, and brought in additional manual annotations to improve the quality of the resulting captions.
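The sketch below illustrates how existing annotations and camera metadata can be folded into a captioning prompt. The field names and prompt template are assumptions for illustration; they are not the exact prompts used in the Project Hafnia pipeline.

```python
# Hypothetical sketch: building a metadata-assisted captioning prompt.
# Field names and the template are illustrative assumptions.

def build_caption_prompt(camera_meta: dict, scene_labels: dict) -> str:
    """Merge camera metadata and scene-level labels into a VLM prompt."""
    context = (
        f"This is a {camera_meta['road_type']} camera capturing "
        f"{camera_meta['lane_count']} lanes at {camera_meta['angle']} degrees."
    )
    conditions = (
        f"Known conditions: weather={scene_labels['weather']}, "
        f"lighting={scene_labels['lighting']}, "
        f"traffic_flow={scene_labels['traffic_flow']}."
    )
    task = (
        "Describe the vehicles, their movements, and any notable events "
        "in this clip. Be specific about lanes and directions."
    )
    return "\n".join([context, conditions, task])


# Example usage
prompt = build_caption_prompt(
    camera_meta={"road_type": "highway", "lane_count": 3, "angle": 35},
    scene_labels={"weather": "rain", "lighting": "night", "traffic_flow": "congested"},
)
print(prompt)
```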
Integrating NeMo Curator into Hafnia’s Pipeline
To efficiently process large-scale video data, we integrated NVIDIA NeMo Curator into the Project Hafnia pipeline. The Curator was built to receive video files and prompts and generate high-quality captions through a multi-stage AI pipeline.
NVIDIA's video curation pipeline offers a set of configuration parameters designed to streamline the video analysis process. The pipeline can be fine-tuned through several key options:
- Toggle embedding generation to create vector representations of video content.
- Enable preview clip generation for quick visual summaries.
- Activate caption generation using either the Qwen or VILA models for detailed video descriptions.
- The splitting algorithm parameter offers flexibility with either "panda70m" or "fixed-stride" options, determining how videos are segmented for processing.
- When using the panda70m splitting algorithm, you can select either "radio" or "clip" for the stitching embedding algorithm to control how segments are reconnected.
- For optimizing throughput, adjust the chunk_size parameter to process multiple videos together for shorter clips.
- The limit parameter allows you to cap the number of processed videos.
Figure 7: Diagram flow for NVIDIA NeMo Curator Video.
Each processing run was configured via JSON files specifying prompts, clip durations, and VLM parameters.
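As an illustration, a run configuration could look like the following. The key names mirror the options listed above but are assumptions, not the exact schema expected by NeMo Curator.

```python
# Hypothetical sketch of a per-run configuration file. The key names mirror
# the parameters described above but are assumptions, not the exact schema
# expected by NeMo Curator.
import json

run_config = {
    "prompt": "Describe the traffic scene, vehicle types, and lane occupancy.",
    "generate_embeddings": True,              # vector representations of each clip
    "generate_previews": False,               # skip quick visual summaries
    "generate_captions": True,
    "captioning_model": "qwen",               # or "vila"
    "splitting_algorithm": "fixed-stride",    # or "panda70m"
    "stitching_embedding_algorithm": "clip",  # with "panda70m": "radio" or "clip"
    "clip_duration_s": 20,
    "chunk_size": 16,                         # batch several short clips for throughput
    "limit": 0,                               # 0 = no cap on processed videos
}

with open("run_config.json", "w") as f:
    json.dump(run_config, f, indent=2)
```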
To scale this workflow, we leveraged DGX Cloud with direct integration to S3 storage:
- Video Storage: Five-minute MP4 video files were stored in structured directories within an S3 bucket.
- DGX Cloud Processing: Curator was deployed across up to 4 DGX nodes, each equipped with 8x A100 GPUs. This setup enabled parallel processing, drastically improving throughput.
- Automated Invocation: The Curator function was triggered per directory, ensuring that all videos from the same camera, requiring identical context, were processed together. Prompts were dynamically extracted from Langfuse’s prompt registry.
Figure 8: Snapshot on Langfuse used for prompt registry.
- Output Management: The processed videos, embeddings, and captions were retrieved and stored back into S3 for post-processing.
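The control flow of this per-directory invocation might look roughly like the sketch below. The helper functions are stand-ins, not real NeMo Curator or Langfuse APIs; they only show how camera directories, prompts, and outputs fit together.

```python
# Hypothetical orchestration sketch of the per-directory invocation described
# above. The helpers below are stand-ins that only illustrate the control flow.
from pathlib import Path


def fetch_prompt(camera_dir: str) -> str:
    """Stand-in for a Langfuse prompt-registry lookup."""
    return f"Describe the traffic captured by camera '{camera_dir}'."


def run_curator(input_dir: Path, prompt: str) -> list[dict]:
    """Stand-in for launching a Curator run over one directory of MP4 clips."""
    return [{"clip": str(p), "prompt": prompt} for p in sorted(input_dir.glob("*.mp4"))]


def upload_outputs(outputs: list[dict], dest: str) -> None:
    """Stand-in for writing clips, embeddings, and captions back to S3."""
    print(f"Uploading {len(outputs)} records to {dest}")


def process_library(root: Path, dest_bucket: str) -> None:
    # One invocation per camera directory, so that all clips sharing the same
    # context are captioned with the same camera-specific prompt.
    for camera_dir in sorted(d for d in root.iterdir() if d.is_dir()):
        prompt = fetch_prompt(camera_dir.name)
        outputs = run_curator(camera_dir, prompt)
        upload_outputs(outputs, dest=f"s3://{dest_bucket}/curated/{camera_dir.name}/")


if __name__ == "__main__":
    root = Path("./videos")
    if root.is_dir():
        process_library(root, dest_bucket="hafnia-curated")
    else:
        print("No ./videos directory found; this is only a structural sketch.")
```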
The combined NeMo Curator + DGX Cloud setup significantly improved both efficiency and annotation quality:
| Metric | Value | Per Day | Per Hour | Per Minute |
|---|---|---|---|---|
| Fine-tuning Base Volume | 18.5TB | - | - | - |
| Fine-tuning Hours (Base) | 10,000 hours | - | - | - |
| Number of 5-minute Clips | ~120,000 files | - | - | - |
| Processing Iterations | 6 complete cycles | - | - | - |
| Total Processing Volume | ~100TB | - | - | - |
| Total Hours Processed | 60,000+ hours | 12,000+ hours | 500+ hours | 8.33+ hours |
| Processing Rate | 60,000+ hours in 5 days | - | - | - |
| Clip Processing Rate | 720,000+ clips in 5 days | 144,000+ clips | 6,000+ clips | 100+ clips |
| Data Processing Rate | 100TB in 5 days | 20TB | 833GB | 13.9GB |
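As a quick sanity check, the per-day, per-hour, and per-minute figures in the table follow directly from dividing the five-day totals:

```python
# Quick arithmetic check of the throughput figures in the table above.
totals = {"hours": 60_000, "clips": 720_000, "terabytes": 100}
days = 5

for name, total in totals.items():
    per_day = total / days
    per_hour = per_day / 24
    per_minute = per_hour / 60
    print(f"{name}: {per_day:,.0f}/day, {per_hour:,.1f}/hour, {per_minute:,.2f}/minute")

# hours: 12,000/day, 500.0/hour, 8.33/minute
# clips: 144,000/day, 6,000.0/hour, 100.00/minute
# terabytes: 20/day, 0.8/hour (~833 GB), 0.01/minute (~13.9 GB)
```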
- Scalability: The Curator function’s ability to leverage multiple DGX Cloud nodes removed the complexity of managing large-scale GPU clusters.
- Annotation Accuracy: Initial metadata (e.g., road type, lane count, weather conditions) was integrated into prompts, significantly improving the relevance of generated captions. Additional metadata was injected during post-processing to enhance coherence, and sequential clips and their captions were stitched together during post-processing to further improve coherence (see the sketch after this list).
- Validation & Quality Control:
- Speed Gains: Running this process manually or on self-managed infrastructure would not have been feasible at this scale.
- Accuracy Checks: We performed manual golden clip reviews and sample-based caption validation to confirm improvements over traditional approaches.
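Below is a minimal sketch of how captions from sequential clips of the same camera could be stitched into a time-ordered block before LLM refinement. The record fields and grouping logic are assumptions about the post-processing step, not its actual implementation.

```python
# Hypothetical sketch of stitching captions from sequential clips of the same
# camera into one temporally ordered context block before LLM refinement.
from collections import defaultdict

records = [
    {"camera": "cam_01", "start_s": 0,   "caption": "Light traffic, three lanes, rain."},
    {"camera": "cam_01", "start_s": 300, "caption": "Congestion builds in the left lane."},
    {"camera": "cam_02", "start_s": 0,   "caption": "Intersection, trucks turning right."},
]

by_camera: dict[str, list[dict]] = defaultdict(list)
for rec in records:
    by_camera[rec["camera"]].append(rec)

for camera, clips in by_camera.items():
    clips.sort(key=lambda r: r["start_s"])
    # Concatenated, time-ordered captions give the refinement LLM the temporal
    # context it needs to keep descriptions consistent across adjacent clips.
    stitched = " ".join(f"[t={c['start_s']}s] {c['caption']}" for c in clips)
    print(f"{camera}: {stitched}")
```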
Figure 9: Snapshot of the output of the curation.
The results demonstrated that NeMo Curator, combined with DGX Cloud, effectively scaled dataset curation while maintaining high annotation quality—proving Project Hafnia’s dataset capabilities for VLM fine-tuning.
- Processing Speed: Reduced dataset generation time from months to days.
- Annotation Quality: Improved completeness and temporal consistency.
- Scalability: Enabled automated dataset curation at petabyte scale.
Fine-tuning Vision-Language Models (VLMs) for real-world applications has traditionally been a slow and costly process, requiring massive, high-quality datasets and extensive computational resources. Project Hafnia, powered by NVIDIA’s NeMo Curator and DGX Cloud, has fundamentally transformed this workflow, turning a months-long, manual bottleneck into an AI-driven, automated pipeline.
By automating dataset selection, annotation, and large-scale processing, we have made high-performance VLM fine-tuning not just feasible, but scalable and efficient.
- Dataset Curation Acceleration: From Months to Days
Traditional dataset curation for VLM fine-tuning involves manual video retrieval, filtering, and annotation—a process that can take months or even years. With Hafnia + NeMo Curator, this timeline has been compressed into days.
Key Results:
- 750,000+ hours of real-world traffic video processed at scale.
- 10,000 high-value hours selected through AI-driven semantic search.
- Fine-tuning-ready dataset generated 100x faster than traditional methods.
What this means for AI teams: Developers can now bypass the data bottleneck and move directly to model training with a high-quality, domain-specific dataset.
- Annotation Quality & Consistency: AI-Powered Labeling
VLM fine-tuning requires precise, structured labels—but manual annotation is slow, inconsistent, and expensive. By integrating NeMo Curator’s AI-powered scene understanding, Hafnia has automated the video labeling process while maintaining human-level accuracy.
Key Results:
- Scene-level and instance-level annotations auto-generated at scale.
- Metadata-assisted prompting improved annotation context and precision.
- LLM-powered caption refinement enhanced coherence and consistency.
What this means for AI teams: No more inconsistent, fragmented datasets—Hafnia ensures that every clip is accurately labeled and ready for fine-tuning.
- Scaling AI-Powered Processing: DGX Cloud Integration
Processing millions of video segments with AI models requires massive computational power. By running NeMo Curator on NVIDIA DGX Cloud, Hafnia unlocked the ability to process vast datasets at unprecedented speeds.
Key Results:
- 60,000+ hours of video processed in just 5 days.
- DGX Cloud enabled parallel processing across 4 DGX nodes (8x A100 GPUs per node).
- Automated data ingestion, annotation, and storage, reducing manual effort by 90%.
What this means for AI teams: No need to build and maintain expensive GPU clusters—Hafnia provides scalable AI-powered data processing on demand.
- Compliance & Traceability: AI Development Without Risk
One of the biggest challenges in AI development is ensuring compliance with data privacy, security, and regulatory requirements. Project Hafnia’s library is built for full traceability, providing auditable, regulation-ready datasets for AI training.
Key Results:
- Data provenance tracking for every selected and annotated clip.
- Regulatory-compliant dataset curation for AI applications in sensitive industries.
- Transparent metadata pipelines ensuring ethical and explainable AI.
What this means for AI teams: No more uncertainty about dataset origins—Hafnia ensures every piece of training data is compliant, traceable, and ready for deployment.
| Without Project Hafnia + NeMo Curator | With Project Hafnia + NeMo Curator |
|---|---|
| Months of manual dataset preparation | Access high-quality, AI-ready video data instantly |
| Expensive, error-prone human annotation | Curate and annotate massive datasets at 100x speed |
| Computational infrastructure challenges | Scale dataset processing without infrastructure limitations |
| Uncertainty in compliance and traceability | Ensure compliance while reducing development time by up to 30x |
By removing the data bottleneck, automating annotation, and enabling scalable processing, Hafnia + NeMo Curator has made fine-tuning VLMs not just possible—but truly efficient, scalable, and ready for real-world deployment.
As Vision-Language Models (VLMs) continue to advance, the demand for domain-specific fine-tuning is only growing. Generalist models are powerful, but real-world AI applications require precision, adaptability, and compliance, especially in sectors like transportation, security, healthcare, and smart infrastructure.
With Project Hafnia + NeMo Curator + DGX Cloud, we’ve proven that large-scale, high-quality dataset curation is no longer a bottleneck. But this is just the beginning.
The Next Evolution of AI-Ready Video Data
We are expanding Hafnia’s capabilities to enable seamless, scalable, and regulation-compliant AI development across multiple domains.
Continuous Dataset Expansion & AI-Powered Updates
- Hafnia will continuously grow its AI-ready video library, integrating new real-world data and improving annotation accuracy through AI-driven refinement loops.
- More industries and use cases will benefit from Hafnia’s automated data pipelines, from smart cities to industrial automation.
VLM Fine-Tuning as a Service
- Fine-tuned VLMs will soon be available on demand, allowing AI teams to skip dataset curation entirely and deploy pre-trained, domain-specific models.
- This will dramatically reduce AI development time, giving companies access to state-of-the-art models trained on high-quality, traceable data.
Seamless Integration with NVIDIA’s AI Ecosystem
- Hafnia is working to integrate fine-tuned VLMs into the NVIDIA AI Blueprint for video search and summarization (VSS) that makes it easy to build interactive visual AI agents to extract valuable insights from massive volumes of video data. The VSS blueprint combines NVIDIA Metropolis technologies with NIM microservices, allowing best-in-class reasoning and computer vision capabilities.
- This means end-to-end AI video intelligence, from data collection to real-time AI inference, all powered by Hafnia’s library and NVIDIA.
Democratizing AI: Removing the Barriers to Scalable Model Training
With Hafnia, NeMo Curator, and DGX Cloud, we are making domain-specific VLM fine-tuning accessible at scale.
- No more months of dataset prep: Hafnia reduces development time by up to 30x.
- No more infrastructure complexity: NeMo Curator + DGX Cloud scale AI workflows automatically.
- No more compliance concerns: every dataset is traceable, auditable, and ready for production AI.
- Explore NeMo Curator – Test AI-powered dataset curation and annotation at scale.
- Try Hafnia’s VLM as a Service – Fine-tune models on high-quality, compliant datasets.
- Get in touch with Project Hafnia – Access AI-ready video data for your applications today.
By removing data bottlenecks, automating annotation, and providing scalable fine-tuning, Project Hafnia is setting the standard for AI-driven video intelligence—and making high-performance, domain-specific AI a reality for everyone.
Authors: Fulgencio Navarro, Edward Mauser, Juan Manuel Perero and Danilo Dresen