Net Zhang, Claude Opus 4.6, and Elizabeth Campolongo committed on Mar 25

Commit 3b98575 (unverified) · 1 parent: b49de53

Reduce DuckDB metadata from 25.8 GB to 13.5 GB (#23)

* Add DuckDB optimization scripts (#11)

Reduce metadata.duckdb from 25.8 GB to 13.5 GB (47.8%) via column
pruning, ENUM types, taxonomy sort order, URL prefix splitting, and
type downcasting. Includes cleanup of 47 corrupted rows from GBIF
column-shift misalignment.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Adapt app to optimized DuckDB schema

Update METADATA_COLUMNS for split URL (url_prefix_id + identifier_suffix)
and add basisOfRecord. Reconstruct full URLs via in-memory prefix dict
lookup (410 entries loaded at startup) instead of SQL JOIN for zero
query latency impact. Falls back to direct identifier column for
legacy DB compatibility.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Update convert script as stage 1 of two-stage pipeline

convert_duckdb_lite.py now handles raw import only (SQLite/DuckDB →
base DB with has_url column). Removed idx_scope creation (handled by
optimize_duckdb.py in stage 2). SLURM script chains both stages.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Use temp file for intermediate DB in conversion pipeline

Avoid leaving a duplicate base DB on disk. The intermediate file is
cleaned up automatically via trap after the optimize step completes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add BioCLIP 2 Training scope and backfill metadata pipeline

- Add `in_bioclip2_training` boolean column to DuckDB pipeline
(convert, optimize, validate) from training catalog parquet
- Add "BioCLIP 2 Training" scope to app dropdown, config, and
search service
- Switch scope filtering from SQL WHERE to Python post-filter
(benchmarked ~370x faster for ID-based lookups)
- Fix `src.metadata` reference bug in optimize_duckdb.py validation
- Update README scope table and add filtering rationale
- Update HF data card with new column, backfill details, and
revised data coverage numbers
- Add scripts/data/README.md documenting the optimized schema

Closes #24

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Clarify in_bioclip2_training count discrepancy

* Address review feedback

- Update inline URL to point to the exact catalog file
- Add inline documentation in search_service.py to clarify the URL suffix
and prefix format

* Address review feedback

- Link in_bioclip2_training to catalog.parquet file instead of repo tree
- Document url_prefix_id and identifier_suffix columns in HF data card
- Add url_prefixes table schema and URL reconstruction section with SQL
& Python examples
- Update column mapping table for the prefix/suffix split
- Add inline comment in search_service.py clarifying prefix/suffix convention

* Apply suggestions from code review

Co-authored-by: Elizabeth Campolongo <38985481+egrace479@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Net Zhang <48858129+NetZissou@users.noreply.github.com>

* Add `.gitattributes` and normalize line endings to LF

GH web UI introduced CRLF line endings in `hf-data-card-README.md`,
causing noisy full-file diffs.

This commit normalizes line endings in that file to LF. Future commits
will be normalized automatically.

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Elizabeth Campolongo <38985481+egrace479@users.noreply.github.com>

Files changed (14)

.gitattributes +2 -0
.gitignore +3 -0
README.md +8 -3
app.py +1 -1
docs/hf-data-card-README.md +65 -18
scripts/data/README.md +65 -0
scripts/data/convert_duckdb_lite.py +78 -35
scripts/data/convert_duckdb_lite.slurm +10 -1
scripts/data/optimize_duckdb.py +487 -0
scripts/data/optimize_duckdb.slurm +31 -0
scripts/data/validate_optimized_duckdb.py +405 -0
scripts/data/validate_optimized_duckdb.slurm +31 -0
src/bioclip_lite/config.py +10 -4
src/bioclip_lite/services/search_service.py +51 -12

.gitattributes ADDED

@@ -0,0 +1,2 @@
+# Normalize line endings to LF in the repo, native on checkout
+* text=auto

.gitignore CHANGED

@@ -7,3 +7,6 @@ build/
 *.duckdb
 *.index
 logs/
+# SLURM output files
+*.out

README.md CHANGED

@@ -54,7 +54,7 @@ Everything runs in a single Gradio process. No microservices, no HDF5 files.
 | Component | Size |
 |-----------|------|
 | FAISS index | 5.8 GB |
-| DuckDB metadata | 25.8 GB |
 | Model weights | ~2.5 GB (downloaded on first run) |
 | Image storage | 0 (fetched from source URLs) |
@@ -105,16 +105,21 @@ Then open `http://<hostname>:7860` in your browser.
 ## Scope filtering
-Not all 234M images have source URLs. Use the scope dropdown to control which results appear:
 | Scope | Images | Description |
 |-------|--------|-------------|
 | All Sources | 234M | Everything, including results without images |
-| URL-Available Only | 207M (88%) | Only results with fetchable source URLs |
 | iNaturalist Only | 135M (58%) | iNaturalist observations via AWS Open Data |
 The app over-fetches from FAISS (3x by default) and filters post-search, so you still get the requested number of results after filtering.
 ## Architecture
 ```

 | Component | Size |
 |-----------|------|
 | FAISS index | 5.8 GB |
+| DuckDB metadata | ~14 GB (optimized) |
 | Model weights | ~2.5 GB (downloaded on first run) |
 | Image storage | 0 (fetched from source URLs) |
 ## Scope filtering
+Use the scope dropdown to control which results appear:
 | Scope | Images | Description |
 |-------|--------|-------------|
 | All Sources | 234M | Everything, including results without images |
+| URL-Available Only | 234M (99.99%) | Only results with fetchable source URLs |
 | iNaturalist Only | 135M (58%) | iNaturalist observations via AWS Open Data |
+| BioCLIP 2 Training | 206M (88%) | Records used in BioCLIP 2 model training |
 The app over-fetches from FAISS (3x by default) and filters post-search, so you still get the requested number of results after filtering.
+### Why scope filtering is done in Python
+Scope filters (`has_url`, `in_bioclip2_training`, etc.) are applied in Python after the DuckDB query, not as SQL WHERE clauses. Benchmarking showed that adding boolean WHERE clauses to ID-based lookups causes a ~370x slowdown (4ms to 1500ms for 50 IDs) because DuckDB scans the full boolean column rather than using the index for small IN-list queries. Since the majority of rows pass these filters (e.g., 100% have URLs, 88% are in training), fetching all results and filtering in Python adds negligible overhead (~3ms) while keeping query latency low.
 ## Architecture
 ```
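
The README hunk above argues for applying scope filters in Python after an ID-based lookup rather than as SQL WHERE clauses. A minimal sketch of that post-filter step, assuming the looked-up rows arrive as dicts in FAISS rank order. `SCOPE_FILTERS` and `post_filter` are hypothetical names used for illustration and are not taken from `search_service.py`; the column names and the iNaturalist substring match follow the schema and coverage table in this commit.

```python
from typing import Callable, Dict, List

# Hypothetical scope predicates keyed by the dropdown labels in app.py.
# Column names (has_url, source_dataset, publisher, in_bioclip2_training)
# follow the schema described in this commit.
SCOPE_FILTERS: Dict[str, Callable[[dict], bool]] = {
    "All Sources": lambda row: True,
    "URL-Available Only": lambda row: bool(row.get("has_url")),
    "iNaturalist Only": lambda row: (
        row.get("source_dataset") == "gbif"
        and "iNaturalist" in (row.get("publisher") or "")
    ),
    "BioCLIP 2 Training": lambda row: bool(row.get("in_bioclip2_training")),
}


def post_filter(rows: List[dict], scope: str, top_k: int) -> List[dict]:
    """Keep rows that pass the scope predicate, preserving rank order."""
    keep = SCOPE_FILTERS[scope]
    return [row for row in rows if keep(row)][:top_k]


# Illustrative rows, as if returned by an ID lookup without boolean predicates.
candidates = [
    {"id": 1, "has_url": True, "source_dataset": "gbif",
     "publisher": "iNaturalist", "in_bioclip2_training": True},
    {"id": 2, "has_url": True, "source_dataset": "eol",
     "publisher": None, "in_bioclip2_training": False},
    {"id": 3, "has_url": False, "source_dataset": "gbif",
     "publisher": "observation.org", "in_bioclip2_training": True},
]
training_only = post_filter(candidates, "BioCLIP 2 Training", top_k=2)
```

With the 3x over-fetch described in the README, a caller would look up roughly `3 * top_k` IDs and pass the resulting rows through `post_filter`, so the list is usually still long enough after filtering.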

app.py CHANGED

@@ -38,7 +38,7 @@ CSS = """
 .app-footer a { color: #f0a030 !important; }
 """
-SCOPE_CHOICES = ["All Sources", "URL-Available Only", "iNaturalist Only"]
 def _image_hash(img: Image.Image) -> str:

 .app-footer a { color: #f0a030 !important; }
 """
+SCOPE_CHOICES = ["All Sources", "URL-Available Only", "iNaturalist Only", "BioCLIP 2 Training"]
 def _image_hash(img: Image.Image) -> str:

docs/hf-data-card-README.md CHANGED

@@ -2,7 +2,7 @@
 license: cc0-1.0
 language:
 - en
-pretty_name: BioCLIP Image Search Lite
 task_categories:
 - image-feature-extraction
 tags:
@@ -55,7 +55,7 @@ The **FAISS index** enables sub-second approximate nearest-neighbor search over
 ### Dataset Description
-- **Curated by:** Net Zhang, Sreejith Menon, Elizabeth Campolongo, Matthew Thompson, Arnab Nandi, Hilmar Lapp, Jianyang Gu <!-- TODO: confirm full author list -->
 - **Demo:** [BioCLIP Image Search Lite Space](https://huggingface.co/spaces/imageomics/bioclip-image-search-lite)
 - **Repository:** [Imageomics/bioclip-image-search-lite](https://github.com/Imageomics/bioclip-image-search-lite)
 - **Paper:** [BioCLIP 2: Emergent Properties from Scaling Hierarchical Contrastive Learning](https://arxiv.org/abs/2505.23883)
@@ -88,7 +88,7 @@ imageomics/bioclip-image-search-lite/
     faiss/
         index.index          # FAISS IVF+PQ index (~5.8 GB, ~200M vectors)
     duckdb/
-        metadata.duckdb      # DuckDB metadata database (~27 GB, 234M rows)
 ```
 ### FAISS Index
@@ -127,8 +127,48 @@ imageomics/bioclip-image-search-lite/
 | `source_id` | `VARCHAR` | Unique identifier from source (e.g., GBIF `gbifID`, EOL content/page ID). |
 | `publisher` | `VARCHAR` | Organization that published the data (GBIF records only, e.g., `iNaturalist`). |
 | `img_type` | `VARCHAR` | Image type (e.g., `Citizen Science`, `Museum Specimen: Fungi`, `Camera-trap`). GBIF only; others are `Unidentified`. |
-| `identifier` | `VARCHAR` | URL to the original image, or `NULL` if unavailable. Corresponds to `source_url` in TreeOfLife-200M catalog. |
-| `has_url` | `BOOLEAN` | Materialized flag: `TRUE` if `identifier` is not null/empty. Used for scope filtering. |
 **Column name mapping from [TreeOfLife-200M](https://huggingface.co/datasets/imageomics/TreeOfLife-200M) catalog:**
@@ -137,8 +177,10 @@ imageomics/bioclip-image-search-lite/
 | `id` | — | New; FAISS vector position index |
 | `common_name` | `common` | Renamed |
 | `source_dataset` | `data_source` | Renamed |
-| `identifier` | `source_url` | Renamed |
 | `has_url` | — | Derived; materialized boolean |
 | All others | Same name | Direct mapping |
 **Columns from TreeOfLife-200M catalog not included:** `scientific_name`, `basis_of_record`, `shard_filename`, `shard_file_path`, `base_dataset_file_path`, `resolution_status`.
@@ -147,16 +189,19 @@ For more background on these columns, please see the [data field descriptions fr
 **Indexes:**
 - `idx_id` on `id` (primary lookup for FAISS result mapping)
-- `idx_scope` on `(source_dataset, has_url)` (scope filtering)
 **Data coverage:**
 | Scope | Count | Percentage |
 |-------|-------|------------|
 | Total rows | 234,391,308 | 100% |
-| With URL (`has_url = TRUE`) | ~207M | 88.4% |
-| iNaturalist (`source_dataset = 'gbif' AND publisher = 'iNaturalist'`) | ~136M | 58% |
-| Without URL | ~27M | 11.6% |
 ### Data Splits
@@ -252,7 +297,7 @@ for _, row in results.iterrows():
 The full [BioCLIP Vector DB](https://github.com/Imageomics/bioclip-vector-db) stores 234M images totaling ~92 TB — far too large for lightweight deployment. [BioCLIP Image Search Lite](https://huggingface.co/spaces/imageomics/bioclip-image-search-lite) was created to make the similarity search capability accessible on constrained infrastructure (e.g., Hugging Face Spaces free tier: 2 vCPU, 16 GB RAM, 50 GB disk) by:
 1. Replacing local image storage with on-demand URL fetching from publicly accessible external sources (primarily [iNaturalist AWS Open Data](https://github.com/inaturalist/inaturalist-open-data) S3).
-2. Compressing the metadata from an 80 GB SQLite database to a ~27 GB DuckDB database (optimized via columnar storage and compression).
 3. Packaging the FAISS index (~5.8 GB) and DuckDB metadata as the only deployment artifacts.
 This approach trades occasional missing thumbnails (when source URLs are unavailable) for a >1000x reduction in storage requirements. See [Imageomics/bioclip-vector-db#47](https://github.com/Imageomics/bioclip-vector-db/issues/47#issuecomment-3927846723) for the full design rationale.
@@ -269,7 +314,7 @@ These URLs are **reasonably persistent but not guaranteed stable**:
 - **AWS sponsorship is renewable.** The AWS Open Data Sponsorship runs on a [2-year renewable term](https://aws.amazon.com/opendata/open-data-sponsorship-program/terms/) with no uptime SLA.
 - **No explicit S3 rate limit.** The iNaturalist [API Recommended Practices](https://www.inaturalist.org/pages/api+recommended+practices) recommend <5 GB/hour and <24 GB/day for media downloads, though it is unclear whether this applies to direct S3 access. The [BioCLIP Image Search Lite application](https://github.com/Imageomics/bioclip-image-search-lite) respects these limits regardless.
-The remaining URLs point to other biodiversity platforms ([EOL](https://eol.org/), [BIOSCAN-5M](https://biodiversitygenomics.net/projects/5m-insects/), [FathomNet](https://www.fathomnet.org/)), each with their own availability characteristics. The ~11.6% of records without any URL are still searchable via the FAISS index but cannot display a source image.
 ### Source Data
@@ -297,9 +342,12 @@ The DuckDB metadata database was assembled from two sources produced by the [Bio
 The Lite repo merged these into a single DuckDB database ([`convert_duckdb_lite.py`](https://github.com/Imageomics/bioclip-image-search-lite/blob/main/scripts/data/convert_duckdb_lite.py)) with the following optimizations:
-- Added a materialized `has_url` boolean column for efficient scope filtering.
-- Created indexes: `idx_id` on `id` (primary FAISS lookup) and `idx_scope` on `(source_dataset, has_url)` (scope filtering).
-- Leveraged DuckDB's columnar storage and compression, reducing the database from ~80 GB (SQLite) to ~27 GB.
 #### Source Data Producers
@@ -322,8 +370,8 @@ This dataset does not include annotations created specifically for this reposito
 This dataset inherits biases and considerations from [TreeOfLife-200M](https://huggingface.co/datasets/imageomics/TreeOfLife-200M#considerations-for-using-the-data). The following are exaggerated in this instance (BioCLIP Image Search Lite) due to available image representation (those readily fetched by URL):
 - **Taxonomic coverage is uneven.** Despite including 952K+ unique taxa, coverage is heavily biased toward well-photographed organisms. Citizen science observations (primarily iNaturalist) comprise ~58% of the data, skewing representation toward charismatic species and regions where citizen science is most active (Western/developed countries).
-- **Incomplete taxonomic labels.** As inherited from TreeOfLife-200M, only ~89% of records have full species-level taxonomy. ~11% lack complete labels due to biodiversity data complexities (`NULL` values at lower ranks).
-- **URL availability is not guaranteed.** ~11.6% of records have no source URL. For records with URLs, images may become unavailable over time due to URL rot, server changes, or content removal.
 - **FAISS approximation.** The IVF+PQ index trades exactness for speed. Results are approximate nearest neighbors — some true nearest neighbors may be missed depending on the `nprobe` setting. Higher `nprobe` values improve recall at the cost of latency.
 - **Embedding bias.** Similarity is determined by BioCLIP 2 embeddings, which may encode biases from the training data.
@@ -346,7 +394,6 @@ We ask that you cite this dataset and associated papers if you make use of it in
 ## Citation
-<!-- TODO: confirm full author list and add DOI once generated -->
 **Data:**
 ```bibtex
 @misc{zhang2026biocliplite,

 license: cc0-1.0
 language:
 - en
+pretty_name: BioCLIP Image Search Lite FAISS Index
 task_categories:
 - image-feature-extraction
 tags:
 ### Dataset Description
+- **Curated by:** Net Zhang, Sreejith Menon, Elizabeth Campolongo, Matthew Thompson, Arnab Nandi, Hilmar Lapp, Jianyang Gu
 - **Demo:** [BioCLIP Image Search Lite Space](https://huggingface.co/spaces/imageomics/bioclip-image-search-lite)
 - **Repository:** [Imageomics/bioclip-image-search-lite](https://github.com/Imageomics/bioclip-image-search-lite)
 - **Paper:** [BioCLIP 2: Emergent Properties from Scaling Hierarchical Contrastive Learning](https://arxiv.org/abs/2505.23883)
     faiss/
         index.index          # FAISS IVF+PQ index (~5.8 GB, ~200M vectors)
     duckdb/
+        metadata.duckdb      # DuckDB metadata database (~14 GB optimized, 234M rows)
 ```
 ### FAISS Index
 | `source_id` | `VARCHAR` | Unique identifier from source (e.g., GBIF `gbifID`, EOL content/page ID). |
 | `publisher` | `VARCHAR` | Organization that published the data (GBIF records only, e.g., `iNaturalist`). |
 | `img_type` | `VARCHAR` | Image type (e.g., `Citizen Science`, `Museum Specimen: Fungi`, `Camera-trap`). GBIF only; others are `Unidentified`. |
+| `url_prefix_id` | `USMALLINT` | Foreign key into the `url_prefixes` lookup table. Together with `identifier_suffix`, reconstructs the full image URL as `<prefix><suffix>`. See [URL reconstruction](#url-reconstruction) below. |
+| `identifier_suffix` | `VARCHAR` | Path portion of the image URL (always starts with `/`, e.g., `/photos/12345/original.jpg`). `NULL` if no URL is available. |
+| `has_url` | `BOOLEAN` | Materialized flag: `TRUE` if a URL is available. Used for scope filtering. |
+| `in_bioclip2_training` | `BOOLEAN` | `TRUE` if the record's UUID appears in the BioCLIP 2 training data — TreeOfLife-200M (Revision [a8f38b4](https://huggingface.co/datasets/imageomics/TreeOfLife-200M/tree/a8f38b4388579862c56ae57d6f094c2ac0e92e12)). |
+**Table:** `url_prefixes` — 411 rows
+| Column | Type | Description |
+|--------|------|-------------|
+| `prefix_id` | `USMALLINT` | Primary key. |
+| `prefix` | `VARCHAR` | URL domain prefix (e.g., `https://inaturalist-open-data.s3.amazonaws.com`). Does not include a trailing `/`. |
+#### URL reconstruction
+The original `identifier` (full image URL) column from TreeOfLife-200M is split into a shared domain prefix and a per-row path suffix to reduce storage overhead. To reconstruct the full URL:
+```sql
+SELECT p.prefix || m.identifier_suffix AS url
+FROM metadata m
+JOIN url_prefixes p ON m.url_prefix_id = p.prefix_id
+WHERE m.identifier_suffix IS NOT NULL
+```
+```python
+import duckdb
+conn = duckdb.connect("metadata.duckdb", read_only=True)
+# Load prefix lookup table into a dict
+prefixes = dict(conn.execute("SELECT prefix_id, prefix FROM url_prefixes").fetchall())
+# Query metadata and reconstruct URLs
+rows = conn.execute("SELECT url_prefix_id, identifier_suffix FROM metadata LIMIT 5").fetchall()
+for prefix_id, suffix in rows:
+    url = prefixes.get(prefix_id, "") + (suffix or "")
+    print(url)
+# https://inaturalist-open-data.s3.amazonaws.com/photos/12345/original.jpg
+# https://content.eol.org/data/media/17/a6/537.jpg
+# ...
+```
+Prefixes are bare domains (e.g., `https://content.eol.org`) and suffixes always start with `/` (e.g., `/data/media/17/a6/537.jpg`), so simple concatenation produces a valid URL. This split saves ~40% storage compared to storing the full URL per row.
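
The commit message also mentions a fallback to the direct `identifier` column for legacy databases. A minimal sketch of how a lookup could handle both schemas, assuming each row is available as a dict of column values. `reconstruct_url` is a hypothetical helper, not the function in `search_service.py`, and returning `None` for rows without a URL is a choice made here for illustration.

```python
from typing import Mapping, Optional


def reconstruct_url(
    row: Mapping[str, object], prefixes: Mapping[int, str]
) -> Optional[str]:
    """Return the full image URL for a metadata row, or None if unavailable."""
    # Legacy schema (assumed): the full URL is stored in `identifier`.
    if "identifier_suffix" not in row:
        return row.get("identifier") or None

    # Optimized schema: concatenate the shared prefix and the per-row suffix.
    suffix = row.get("identifier_suffix")
    prefix = prefixes.get(row.get("url_prefix_id"))
    if not suffix or prefix is None:
        return None
    return prefix + suffix
```

Returning `None` instead of an empty string lets callers distinguish a missing URL from a malformed one without a separate `has_url` check.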
 **Column name mapping from [TreeOfLife-200M](https://huggingface.co/datasets/imageomics/TreeOfLife-200M) catalog:**
 | `id` | — | New; FAISS vector position index |
 | `common_name` | `common` | Renamed |
 | `source_dataset` | `data_source` | Renamed |
+| `url_prefix_id` | `source_url` | Split from `source_url`; foreign key to `url_prefixes` |
+| `identifier_suffix` | `source_url` | Split from `source_url`; path portion of URL |
 | `has_url` | — | Derived; materialized boolean |
+| `in_bioclip2_training` | — | Derived; matched against [training catalog revision `a8f38b4`](https://huggingface.co/datasets/imageomics/TreeOfLife-200M/blob/a8f38b4388579862c56ae57d6f094c2ac0e92e12/dataset/catalog.parquet) |
 | All others | Same name | Direct mapping |
 **Columns from TreeOfLife-200M catalog not included:** `scientific_name`, `basis_of_record`, `shard_filename`, `shard_file_path`, `base_dataset_file_path`, `resolution_status`.
 **Indexes:**
 - `idx_id` on `id` (primary lookup for FAISS result mapping)
+- `idx_scope` on `(source_dataset, has_url, in_bioclip2_training)` (scope filtering)
 **Data coverage:**
 | Scope | Count | Percentage |
 |-------|-------|------------|
 | Total rows | 234,391,308 | 100% |
+| With URL (`has_url = TRUE`) | ~234M | 99.99% |
+| iNaturalist (`source_dataset = 'gbif' AND publisher LIKE '%iNaturalist%'`) | ~136M | 58% |
+| In BioCLIP 2 training (`in_bioclip2_training = TRUE`) | ~206M | 87.9% |
+| With taxonomy (`kingdom IS NOT NULL`) | ~228M | 97.2% |
+> **Note on `in_bioclip2_training`:** This column identifies records whose UUID matches the BioCLIP 2 training catalog from [TreeOfLife-200M revision `a8f38b4`](https://huggingface.co/datasets/imageomics/TreeOfLife-200M/tree/a8f38b4388579862c56ae57d6f094c2ac0e92e12). The original BioCLIP 2 training set contained ~214M images. Of these, ~206M match records in the search corpus. The remaining ~8M were excluded from the FAISS index because they were identified as invalid after training (e.g., document scans, specimen labels, images with detected human faces) and removed during a post-training data cleanup before the embeddings were generated.
 ### Data Splits
 The full [BioCLIP Vector DB](https://github.com/Imageomics/bioclip-vector-db) stores 234M images totaling ~92 TB — far too large for lightweight deployment. [BioCLIP Image Search Lite](https://huggingface.co/spaces/imageomics/bioclip-image-search-lite) was created to make the similarity search capability accessible on constrained infrastructure (e.g., Hugging Face Spaces free tier: 2 vCPU, 16 GB RAM, 50 GB disk) by:
 1. Replacing local image storage with on-demand URL fetching from publicly accessible external sources (primarily [iNaturalist AWS Open Data](https://github.com/inaturalist/inaturalist-open-data) S3).
+2. Compressing the metadata from an 80 GB SQLite database to a ~14 GB DuckDB database (optimized via ENUM types, URL prefix deduplication, taxonomy sorting, and columnar compression).
 3. Packaging the FAISS index (~5.8 GB) and DuckDB metadata as the only deployment artifacts.
 This approach trades occasional missing thumbnails (when source URLs are unavailable) for a >1000x reduction in storage requirements. See [Imageomics/bioclip-vector-db#47](https://github.com/Imageomics/bioclip-vector-db/issues/47#issuecomment-3927846723) for the full design rationale.
 - **AWS sponsorship is renewable.** The AWS Open Data Sponsorship runs on a [2-year renewable term](https://aws.amazon.com/opendata/open-data-sponsorship-program/terms/) with no uptime SLA.
 - **No explicit S3 rate limit.** The iNaturalist [API Recommended Practices](https://www.inaturalist.org/pages/api+recommended+practices) recommend <5 GB/hour and <24 GB/day for media downloads, though it is unclear whether this applies to direct S3 access. The [BioCLIP Image Search Lite application](https://github.com/Imageomics/bioclip-image-search-lite) respects these limits regardless.
+The remaining URLs point to other biodiversity platforms ([EOL](https://eol.org/), [BIOSCAN-5M](https://biodiversitygenomics.net/projects/5m-insects/), [FathomNet](https://www.fathomnet.org/)), each with their own availability characteristics.
 ### Source Data
 The Lite repo merged these into a single DuckDB database ([`convert_duckdb_lite.py`](https://github.com/Imageomics/bioclip-image-search-lite/blob/main/scripts/data/convert_duckdb_lite.py)) with the following optimizations:
+- Added materialized boolean columns `has_url` and `in_bioclip2_training` for scope filtering.
+- Created indexes: `idx_id` on `id` (primary FAISS lookup) and `idx_scope` on `(source_dataset, has_url, in_bioclip2_training)`.
+- Applied ENUM types for low-cardinality columns, URL prefix deduplication, and taxonomy-based row sorting for better compression.
+- Leveraged DuckDB's columnar storage and compression, reducing the database from ~80 GB (SQLite) to ~14 GB.
+**Metadata backfill (March 2026):** 28.3M rows (12.1%) originally had NULL metadata because the entire `observation.org` GBIF server (27.2M rows) was missing from the metadata parquets used during ingestion. Taxonomy was recovered for ~21.7M rows from the resolved taxa pipeline, and source URLs were recovered for all 27.2M rows from the GBIF data parquets. An additional 1.1M EOL rows with failed taxonomy resolution had their `source_dataset` and `source_id` recovered. UUIDs were also normalized from mixed formats (non-hyphenated for observation.org rows) to a consistent hyphenated format. After backfill, only 2,973 rows remain with NULL `source_dataset`.
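
The UUID normalization mentioned in the backfill note can be illustrated with Python's standard `uuid` module, which accepts both hyphenated and non-hyphenated hex strings. This is a sketch of the idea only; `normalize_uuid` is a hypothetical name, and the pipeline scripts may perform the conversion differently (for example, by casting inside DuckDB).

```python
import uuid
from typing import Optional


def normalize_uuid(raw: Optional[str]) -> Optional[str]:
    """Return the canonical lowercase hyphenated form, or None for empty input."""
    if raw is None or not raw.strip():
        return None
    # uuid.UUID parses 32 hex digits with or without hyphens and raises
    # ValueError on malformed input.
    return str(uuid.UUID(raw.strip()))
```

Both `12345678123456781234567812345678` and `12345678-1234-5678-1234-567812345678` map to the same hyphenated string, which is what allows a consistent join against the training catalog.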
 #### Source Data Producers
 This dataset inherits biases and considerations from [TreeOfLife-200M](https://huggingface.co/datasets/imageomics/TreeOfLife-200M#considerations-for-using-the-data). The following are exaggerated in this instance (BioCLIP Image Search Lite) due to available image representation (those readily fetched by URL):
 - **Taxonomic coverage is uneven.** Despite including 952K+ unique taxa, coverage is heavily biased toward well-photographed organisms. Citizen science observations (primarily iNaturalist) comprise ~58% of the data, skewing representation toward charismatic species and regions where citizen science is most active (Western/developed countries).
+- **Incomplete taxonomic labels.** As inherited from TreeOfLife-200M, ~97% of records now have kingdom-level taxonomy after the March 2026 backfill. The remaining ~3% lack complete labels due to biodiversity data complexities (`NULL` values at lower ranks).
+- **URL availability is not guaranteed.** After the metadata backfill, nearly all records (99.99%) have source URLs, though images may become unavailable over time due to URL rot, server changes, or content removal.
 - **FAISS approximation.** The IVF+PQ index trades exactness for speed. Results are approximate nearest neighbors — some true nearest neighbors may be missed depending on the `nprobe` setting. Higher `nprobe` values improve recall at the cost of latency.
 - **Embedding bias.** Similarity is determined by BioCLIP 2 embeddings, which may encode biases from the training data.
 ## Citation
 **Data:**
 ```bibtex
 @misc{zhang2026biocliplite,

scripts/data/README.md ADDED

	@@ -0,0 +1,65 @@

+# Data Pipeline
+Two-stage pipeline to build the optimized DuckDB metadata database from source.
+## Pipeline
+```
+Source (SQLite or DuckDB)
+  → convert_duckdb_lite.py   # Stage 1: import, add has_url + in_bioclip2_training
+  → optimize_duckdb.py       # Stage 2: ENUM types, URL split, sort, index
+  → validate_optimized_duckdb.py  # Verify correctness
+```
+## Optimized Schema
+**Table:** `metadata` — 234,391,308 rows
+| Column | Type | Notes |
+|--------|------|-------|
+| `id` | `INTEGER` | FAISS vector index (downcast from BIGINT) |
+| `uuid` | `UUID` | Native 16-byte UUID (normalized hyphenated format) |
+| `kingdom`..`family` | `ENUM` | Low-cardinality taxonomy columns as ENUM types |
+| `genus`, `species` | `VARCHAR` | Too many distinct values for ENUM |
+| `common_name` | `VARCHAR` | |
+| `source_dataset` | `ENUM` | `gbif`, `eol`, `bioscan`, `fathomnet` |
+| `publisher` | `ENUM` | GBIF publisher (e.g., `iNaturalist`, `observation.org`) |
+| `img_type` | `ENUM` | Image type category |
+| `basisOfRecord` | `ENUM` | GBIF basis of record |
+| `source_id` | `VARCHAR` | Source-specific identifier |
+| `url_prefix_id` | `USMALLINT` | FK to `url_prefixes` table |
+| `identifier_suffix` | `VARCHAR` | URL path after domain prefix |
+| `has_url` | `BOOLEAN` | `TRUE` if image URL available |
+| `in_bioclip2_training` | `BOOLEAN` | `TRUE` if UUID in BioCLIP 2 training catalog |
+**Indexes:** `idx_id(id)`, `idx_scope(source_dataset, has_url, in_bioclip2_training)`
+## Optimizations Applied
+1. **ENUM types** — Low-cardinality columns (`kingdom`, `phylum`, `class`, `order`, `family`, `source_dataset`, `publisher`, `img_type`, `basisOfRecord`) stored as ENUM for ~10x compression.
+2. **URL prefix deduplication** — `identifier` split into a shared prefix table (`url_prefixes`) + per-row suffix, eliminating repeated domain strings.
+3. **Taxonomy sort** — Rows sorted by `source_dataset, kingdom, ..., species, common_name` for long runs of identical values and better compression.
+4. **Type downcasting** — `id` BIGINT→INTEGER, `uuid` VARCHAR→native UUID (16 bytes).
+5. **Corruption cleanup** — 44 rows with column-shift metadata corruption have taxonomy NULLed.
+Result: **80 GB (SQLite) → 14 GB (optimized DuckDB)**, 57% smaller than the unoptimized DuckDB.
+## Usage
+```bash
+# Stage 1: Import + add boolean columns
+python scripts/data/convert_duckdb_lite.py \
+    --from-duckdb /path/to/source.duckdb \
+    --output /path/to/base.duckdb \
+    --catalog-parquet /path/to/training/catalog.parquet
+# Stage 2: Optimize
+python scripts/data/optimize_duckdb.py \
+    --source /path/to/base.duckdb \
+    --output /path/to/metadata_optimized.duckdb
+# Validate
+python scripts/data/validate_optimized_duckdb.py \
+    --source /path/to/base.duckdb \
+    --optimized /path/to/metadata_optimized.duckdb
+```
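
The URL prefix deduplication listed under "Optimizations Applied" can be sketched in plain Python. This assumes the split point is the end of `scheme://host`, which matches the bare-domain prefixes described in the data card; `split_url` and `build_prefix_table` are hypothetical names, and `optimize_duckdb.py` (not shown in this excerpt) may implement the split differently, most likely in SQL.

```python
from typing import Dict, Iterable, List, Optional, Tuple
from urllib.parse import urlsplit


def split_url(url: str) -> Tuple[str, str]:
    """Split a URL into (scheme://host, remainder)."""
    parts = urlsplit(url)
    prefix = f"{parts.scheme}://{parts.netloc}"
    # The remainder starts with "/" whenever the URL has a path.
    suffix = url[len(prefix):]
    return prefix, suffix


def build_prefix_table(
    urls: Iterable[Optional[str]],
) -> Tuple[Dict[str, int], List[Tuple[Optional[int], Optional[str]]]]:
    """Assign a small integer ID to each distinct prefix and split every URL."""
    prefix_ids: Dict[str, int] = {}
    rows: List[Tuple[Optional[int], Optional[str]]] = []
    for url in urls:
        if not url:
            # Rows without a URL keep NULLs in both columns.
            rows.append((None, None))
            continue
        prefix, suffix = split_url(url)
        # setdefault returns the existing ID or assigns the next free one.
        prefix_id = prefix_ids.setdefault(prefix, len(prefix_ids))
        rows.append((prefix_id, suffix))
    return prefix_ids, rows
```

Because the number of distinct prefixes is small (a few hundred per the data card), the ID fits in a `USMALLINT`, and each row stores only the path instead of repeating the domain string.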

scripts/data/convert_duckdb_lite.py CHANGED

@@ -1,13 +1,20 @@
-"""Convert SQLite metadata to optimized DuckDB for BioCLIP Lite.
-Copies the existing research DuckDB and adds Lite-specific enhancements:
-  1. Materialized has_url BOOLEAN column
-  2. Compound index on (source_dataset, has_url) for scope filtering
-  3. URL coverage validation
 Usage:
-    python scripts/data/convert_duckdb_lite.py --from-duckdb SOURCE --output OUT
     python scripts/data/convert_duckdb_lite.py --from-sqlite SOURCE --output OUT
 """
 import argparse
@@ -20,7 +27,7 @@ import duckdb
 EXPECTED_ROW_COUNT = 234_391_308
-def convert_from_sqlite(sqlite_path: str, output_path: str):
     """Full conversion from the 80 GB SQLite source."""
     print(f"Converting from SQLite: {sqlite_path}")
     print(f"Output: {output_path}")
@@ -43,13 +50,17 @@ def convert_from_sqlite(sqlite_path: str, output_path: str):
     print("Creating index on id...")
     conn.execute("CREATE INDEX idx_id ON metadata (id)")
-    _add_lite_enhancements(conn)
     _validate(conn, output_path)
     conn.close()
-def convert_from_existing_duckdb(source_path: str, output_path: str):
-    """Copy existing research DuckDB and add Lite-specific enhancements."""
     print(f"Copying from: {source_path}")
     print(f"         to: {output_path}")
@@ -60,17 +71,18 @@ def convert_from_existing_duckdb(source_path: str, output_path: str):
     print(f"Copy complete ({os.path.getsize(output_path) / 1024**3:.1f} GB)")
     conn = duckdb.connect(output_path)
-    _add_lite_enhancements(conn)
     _validate(conn, output_path)
     conn.close()
-def _add_lite_enhancements(conn: duckdb.DuckDBPyConnection):
-    """Add has_url column and compound index for scope filtering."""
-    # Check if has_url already exists
     cols = [r[0] for r in conn.execute("DESCRIBE metadata").fetchall()]
     if "has_url" in cols:
-        print("has_url column already exists, skipping ALTER")
     else:
         print("Adding has_url column...")
         t0 = time.time()
@@ -81,22 +93,37 @@ def _add_lite_enhancements(conn: duckdb.DuckDBPyConnection):
         )
         print(f"has_url column populated in {time.time() - t0:.0f}s")
-    # Compound index for scope queries
-    existing_indexes = [
-        r[0] for r in conn.execute(
-            "SELECT index_name FROM duckdb_indexes()"
-        ).fetchall()
-    ]
-    if "idx_scope" not in existing_indexes:
-        print("Creating compound index idx_scope(source_dataset, has_url)...")
-        t0 = time.time()
-        conn.execute(
-            "CREATE INDEX idx_scope ON metadata (source_dataset, has_url)"
-        )
-        print(f"Index created in {time.time() - t0:.0f}s")
-    else:
-        print("idx_scope already exists, skipping")
 def _validate(conn: duckdb.DuckDBPyConnection, output_path: str):
@@ -119,15 +146,27 @@ def _validate(conn: duckdb.DuckDBPyConnection, output_path: str):
     print(f"iNaturalist:    {inat_count:>15,}  ({inat_count/total*100:.1f}%)")
     print(f"Without URL:    {total - with_url:>15,}  ({(total-with_url)/total*100:.1f}%)")
     if total != EXPECTED_ROW_COUNT:
         print(f"WARNING: Expected {EXPECTED_ROW_COUNT:,} rows, got {total:,}")
     size_gb = os.path.getsize(output_path) / 1024**3
     print(f"DuckDB size:    {size_gb:.1f} GB")
 def main():
-    parser = argparse.ArgumentParser(description="DuckDB Lite conversion")
     group = parser.add_mutually_exclusive_group(required=True)
     group.add_argument(
         "--from-sqlite", type=str, metavar="PATH",
@@ -135,20 +174,24 @@ def main():
     )
     group.add_argument(
         "--from-duckdb", type=str,
-        help="Copy from existing DuckDB and add Lite enhancements"
     )
     parser.add_argument(
         "--output", type=str, required=True,
-        help="Output DuckDB path"
     )
     args = parser.parse_args()
     os.makedirs(os.path.dirname(args.output), exist_ok=True)
     if args.from_sqlite:
-        convert_from_sqlite(args.from_sqlite, args.output)
     else:
-        convert_from_existing_duckdb(args.from_duckdb, args.output)
     print("\nDone.")

+"""Convert source metadata to optimized DuckDB for BioCLIP Lite.
+Two-stage pipeline:
+  Stage 1 (this script): Import raw metadata from SQLite or DuckDB source,
+           add has_url column, and create a base DuckDB.
+  Stage 2 (optimize_duckdb.py): Apply size optimizations (ENUM types, taxonomy
+           sort, URL prefix split, type downcasting, corruption cleanup).
 Usage:
+    # From SQLite source (slow, ~1-2 hours):
     python scripts/data/convert_duckdb_lite.py --from-sqlite SOURCE --output OUT
+    # From existing research DuckDB:
+    python scripts/data/convert_duckdb_lite.py --from-duckdb SOURCE --output OUT
+    # Then optimize:
+    python scripts/data/optimize_duckdb.py --source OUT --output OPTIMIZED
 """
 import argparse
 EXPECTED_ROW_COUNT = 234_391_308
+def convert_from_sqlite(sqlite_path: str, output_path: str, catalog_parquet: str = None):
     """Full conversion from the 80 GB SQLite source."""
     print(f"Converting from SQLite: {sqlite_path}")
     print(f"Output: {output_path}")
     print("Creating index on id...")
     conn.execute("CREATE INDEX idx_id ON metadata (id)")
+    _add_has_url(conn)
+    if catalog_parquet:
+        _add_in_bioclip2_training(conn, catalog_parquet)
     _validate(conn, output_path)
     conn.close()
+def convert_from_existing_duckdb(
+    source_path: str, output_path: str, catalog_parquet: str = None
+):
+    """Copy existing research DuckDB and add has_url if missing."""
     print(f"Copying from: {source_path}")
     print(f"         to: {output_path}")
     print(f"Copy complete ({os.path.getsize(output_path) / 1024**3:.1f} GB)")
     conn = duckdb.connect(output_path)
+    _add_has_url(conn)
+    if catalog_parquet:
+        _add_in_bioclip2_training(conn, catalog_parquet)
     _validate(conn, output_path)
     conn.close()
+def _add_has_url(conn: duckdb.DuckDBPyConnection):
+    """Add has_url BOOLEAN column if not present."""
     cols = [r[0] for r in conn.execute("DESCRIBE metadata").fetchall()]
     if "has_url" in cols:
+        print("has_url column already exists, skipping")
     else:
         print("Adding has_url column...")
         t0 = time.time()
         )
         print(f"has_url column populated in {time.time() - t0:.0f}s")
+def _add_in_bioclip2_training(conn: duckdb.DuckDBPyConnection, catalog_parquet: str):
+    """Add in_bioclip2_training BOOLEAN column from training catalog.
+    Marks rows whose UUID appears in the BioCLIP 2 training catalog.
+    """
+    cols = [r[0] for r in conn.execute("DESCRIBE metadata").fetchall()]
+    if "in_bioclip2_training" in cols:
+        print("in_bioclip2_training column already exists, skipping")
+        return
+    print("Adding in_bioclip2_training column...")
+    t0 = time.time()
+    conn.execute("ALTER TABLE metadata ADD COLUMN in_bioclip2_training BOOLEAN DEFAULT false")
+    # The catalog keys rows by uuid; compare both sides as VARCHAR so a native
+    # UUID column and a string column still match
+    conn.execute(f"""
+        UPDATE metadata m SET in_bioclip2_training = true
+        FROM (
+            SELECT DISTINCT uuid FROM read_parquet('{catalog_parquet}')
+        ) c
+        WHERE CAST(m.uuid AS VARCHAR) = CAST(c.uuid AS VARCHAR)
+    """)
+    total = conn.execute("SELECT COUNT(*) FROM metadata").fetchone()[0]
+    matched = conn.execute(
+        "SELECT COUNT(*) FROM metadata WHERE in_bioclip2_training = true"
+    ).fetchone()[0]
+    elapsed = time.time() - t0
+    print(f"in_bioclip2_training populated in {elapsed:.0f}s")
+    print(f"  Matched: {matched:,} / {total:,} ({matched/total*100:.1f}%)")
 def _validate(conn: duckdb.DuckDBPyConnection, output_path: str):
     print(f"iNaturalist:    {inat_count:>15,}  ({inat_count/total*100:.1f}%)")
     print(f"Without URL:    {total - with_url:>15,}  ({(total-with_url)/total*100:.1f}%)")
+    # Check for in_bioclip2_training column
+    cols = [r[0] for r in conn.execute("DESCRIBE metadata").fetchall()]
+    if "in_bioclip2_training" in cols:
+        training_count = conn.execute(
+            "SELECT COUNT(*) FROM metadata WHERE in_bioclip2_training = true"
+        ).fetchone()[0]
+        print(f"In training:    {training_count:>15,}  ({training_count/total*100:.1f}%)")
     if total != EXPECTED_ROW_COUNT:
         print(f"WARNING: Expected {EXPECTED_ROW_COUNT:,} rows, got {total:,}")
     size_gb = os.path.getsize(output_path) / 1024**3
     print(f"DuckDB size:    {size_gb:.1f} GB")
+    print(f"\nNext step: run optimize_duckdb.py --source {output_path} --output <optimized.duckdb>")
 def main():
+    parser = argparse.ArgumentParser(
+        description="Stage 1: Import metadata into base DuckDB. "
+        "Run optimize_duckdb.py afterward for size optimization."
+    )
     group = parser.add_mutually_exclusive_group(required=True)
     group.add_argument(
         "--from-sqlite", type=str, metavar="PATH",
     )
     group.add_argument(
         "--from-duckdb", type=str,
+        help="Copy from existing DuckDB and add has_url column"
     )
     parser.add_argument(
         "--output", type=str, required=True,
+        help="Output DuckDB path (base DB, not yet optimized)"
+    )
+    parser.add_argument(
+        "--catalog-parquet", type=str, default=None,
+        help="Path to BioCLIP 2 training catalog parquet (adds in_bioclip2_training column)"
     )
     args = parser.parse_args()
     os.makedirs(os.path.dirname(args.output), exist_ok=True)
     if args.from_sqlite:
+        convert_from_sqlite(args.from_sqlite, args.output, args.catalog_parquet)
     else:
+        convert_from_existing_duckdb(args.from_duckdb, args.output, args.catalog_parquet)
     print("\nDone.")
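
The "add column if missing, then flag matching rows" pattern used by `_add_has_url` and `_add_in_bioclip2_training` can be sketched in isolation. This is a minimal illustration, not the script's implementation: it swaps DuckDB for the standard-library `sqlite3` module so it runs without extra dependencies, and `add_flag_column`, the `catalog` temp table, and the `in_training` column name are hypothetical.

```python
import sqlite3


def add_flag_column(conn, flag, member_uuids):
    """Add a 0/1 flag column if absent and set it for rows whose uuid is listed.

    Hypothetical helper; sqlite3 stands in for DuckDB here.
    """
    cols = [row[1] for row in conn.execute("PRAGMA table_info(metadata)")]
    if flag in cols:
        return False  # column already present: skip, as the script does
    conn.execute(f"ALTER TABLE metadata ADD COLUMN {flag} INTEGER DEFAULT 0")
    # Stage the catalog UUIDs once so the UPDATE is a single set-based statement.
    conn.execute("CREATE TEMP TABLE catalog (uuid TEXT PRIMARY KEY)")
    conn.executemany(
        "INSERT OR IGNORE INTO catalog VALUES (?)",
        [(u,) for u in member_uuids],
    )
    conn.execute(
        f"UPDATE metadata SET {flag} = 1 "
        "WHERE uuid IN (SELECT uuid FROM catalog)"
    )
    conn.execute("DROP TABLE catalog")
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metadata (id INTEGER PRIMARY KEY, uuid TEXT)")
conn.executemany("INSERT INTO metadata VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])

first_run = add_flag_column(conn, "in_training", ["a", "c", "zzz"])
second_run = add_flag_column(conn, "in_training", ["b"])  # no-op: column exists
flags = conn.execute("SELECT id, in_training FROM metadata ORDER BY id").fetchall()
print(first_run, second_run, flags)
```

Because the second call returns early, rerunning the stage against an already-converted database leaves the existing flags untouched, which is the behaviour the "already exists, skipping" branches rely on.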

scripts/data/convert_duckdb_lite.slurm CHANGED

@@ -23,8 +23,17 @@ echo ""
 source "$VENV/bin/activate"
-python "$REPO_ROOT/scripts/data/convert_duckdb_lite.py" \
     --from-duckdb "${1:?Usage: sbatch convert_duckdb_lite.slurm <source.duckdb>}" \
     --output "$DATA_DIR/metadata.duckdb"
 echo ""

 source "$VENV/bin/activate"
+BASE_DB=$(mktemp "$DATA_DIR/metadata_base_XXXXXX.duckdb")
+trap 'rm -f "$BASE_DB"' EXIT
+# Stage 1: Import from source into temp file
+python -u "$REPO_ROOT/scripts/data/convert_duckdb_lite.py" \
     --from-duckdb "${1:?Usage: sbatch convert_duckdb_lite.slurm <source.duckdb>}" \
+    --output "$BASE_DB"
+# Stage 2: Optimize into final output
+python -u "$REPO_ROOT/scripts/data/optimize_duckdb.py" \
+    --source "$BASE_DB" \
     --output "$DATA_DIR/metadata.duckdb"
 echo ""
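
The `mktemp` plus `trap ... EXIT` idiom above has a direct Python counterpart, sketched here for readers who drive the pipeline from Python rather than SLURM. It is an assumption-laden illustration: `run_two_stage`, `fake_import`, and `fake_optimize` are hypothetical stand-ins for the two real scripts, and the extra `.wal` cleanup is an addition that the shell script does not perform.

```python
import os
import tempfile


def run_two_stage(stage1, stage2, final_path, work_dir):
    """Run stage1 into a temp file, stage2 into final_path, then drop the temp file."""
    fd, base_path = tempfile.mkstemp(
        prefix="metadata_base_", suffix=".duckdb", dir=work_dir
    )
    os.close(fd)
    try:
        stage1(base_path)
        stage2(base_path, final_path)
    finally:
        # Mirrors `trap 'rm -f "$BASE_DB"' EXIT`: runs on success and on failure.
        for leftover in (base_path, base_path + ".wal"):
            if os.path.exists(leftover):
                os.remove(leftover)
    return final_path


def fake_import(path):
    # Hypothetical stage 1: write a placeholder "base DB".
    with open(path, "w") as f:
        f.write("raw")


def fake_optimize(src, dst):
    # Hypothetical stage 2: derive the final output from the base file.
    with open(src) as f, open(dst, "w") as out:
        out.write(f.read().upper())


with tempfile.TemporaryDirectory() as work_dir:
    final = run_two_stage(
        fake_import, fake_optimize, os.path.join(work_dir, "metadata.duckdb"), work_dir
    )
    with open(final) as f:
        result = f.read()
    remaining = sorted(os.listdir(work_dir))

print(result, remaining)
```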

scripts/data/optimize_duckdb.py ADDED

	@@ -0,0 +1,487 @@

+"""Experiment: rebuild DuckDB with size optimizations.
+Optimizations applied:
+  1. Drop unused columns (scientific_name, resolution_status)
+  2. Cast id BIGINT → INTEGER, uuid VARCHAR → UUID (native 16-byte)
+  3. Sort rows by source_dataset, taxonomy (kingdom→species), common_name
+     for better compression via long runs of identical values
+  4. Split identifier URLs into prefix (domain) + suffix for dictionary compression
+  5. Cast low-cardinality VARCHAR columns to ENUM types
+Usage:
+    python scripts/data/optimize_duckdb.py \
+        --source /path/to/metadata.duckdb \
+        --output /path/to/metadata_optimized.duckdb
+"""
+import argparse
+import os
+import re
+import time
+import duckdb
+EXPECTED_ROW_COUNT = 234_391_308
+# Columns to drop (not used by the app — can be re-added later from source)
+DROP_COLUMNS = {"scientific_name", "resolution_status"}
+# Low-cardinality columns to convert to ENUM (column → cardinality cap;
+# the trailing comment on each entry gives the distinct count observed)
+ENUM_CANDIDATES = {
+    "source_dataset": 5,       # 2 + NULL
+    "kingdom": 50,             # 42 (some dirty data)
+    "phylum": 200,             # 135
+    "class": 500,              # 383
+    "order": 2000,             # 1,531
+    "family": 15000,           # 13,088
+    "publisher": 600,          # 472
+    "img_type": 20,            # 13
+    "basisOfRecord": 15,       # 8
+}
+# Valid biological kingdom values
+VALID_KINGDOMS = {
+    'Animalia', 'Plantae', 'Fungi', 'Chromista', 'Protozoa',
+    'Bacteria', 'Archaea', 'Viruses', 'Metazoa',
+    'Archaeplastida', 'incertae sedis',
+}
+def find_corrupted_ids(conn: duckdb.DuckDBPyConnection) -> set[int]:
+    """Find rows with column-shift metadata corruption.
+    These are GBIF records where taxonomy columns contain timestamps, UUIDs,
+    country names, boolean strings, or scientific names with authority citations
+    due to column misalignment during original ingestion.
+    """
+    placeholders = ",".join(f"'{k}'" for k in VALID_KINGDOMS)
+    # Rows with invalid kingdom values
+    kingdom_rows = conn.execute(f"""
+        SELECT id FROM metadata
+        WHERE kingdom IS NOT NULL AND kingdom NOT IN ({placeholders})
+    """).fetchall()
+    ids = {r[0] for r in kingdom_rows}
+    # Rows with valid kingdom but corrupted phylum
+    phylum_rows = conn.execute(f"""
+        SELECT id FROM metadata
+        WHERE (kingdom IS NULL OR kingdom IN ({placeholders}))
+          AND phylum IS NOT NULL
+          AND (phylum LIKE '2024-%'
+               OR phylum IN ('true', 'false', 'US', 'bracteatum')
+               OR phylum LIKE '%Wall.%' OR phylum LIKE '%Pers.%'
+               OR phylum LIKE '% L.' OR phylum LIKE '%Makino%'
+               OR phylum LIKE '%subsp.%' OR phylum LIKE '%var.%'
+               OR phylum LIKE '%Stokes%' OR phylum LIKE '%Reveal%'
+               OR phylum LIKE '%E.Wolf%')
+    """).fetchall()
+    ids |= {r[0] for r in phylum_rows}
+    # Rows with valid kingdom+phylum but corrupted class
+    class_rows = conn.execute(f"""
+        SELECT id FROM metadata
+        WHERE (kingdom IS NULL OR kingdom IN ({placeholders}))
+          AND (phylum NOT LIKE '2024-%' OR phylum IS NULL)
+          AND class IS NOT NULL
+          AND (class LIKE '2024-%'
+               OR class LIKE '%INVALID%'
+               OR class LIKE '%MATCH%'
+               OR (class LIKE '% var. %' AND class LIKE '%.%'))
+    """).fetchall()
+    ids |= {r[0] for r in class_rows}
+    return ids
+def build_enum_types(source_conn: duckdb.DuckDBPyConnection) -> dict[str, str]:
+    """Query source DB to discover distinct values and build ENUM type DDL.
+    Returns a dict of column_name → enum_type_name.
+    """
+    enum_types = {}
+    for col, max_card in ENUM_CANDIDATES.items():
+        quoted = f'"{col}"' if col in ("order", "class") else col
+        rows = source_conn.execute(
+            f"SELECT DISTINCT {quoted} FROM metadata "
+            f"WHERE {quoted} IS NOT NULL "
+            f"ORDER BY {quoted}"
+        ).fetchall()
+        values = [r[0] for r in rows]
+        if len(values) > max_card:
+            print(f"  SKIP ENUM for {col}: {len(values)} distinct > {max_card} limit")
+            continue
+        type_name = f"enum_{col}"
+        enum_types[col] = type_name
+        print(f"  ENUM {type_name}: {len(values)} distinct values")
+    return enum_types
+def build_url_prefix_table(source_conn: duckdb.DuckDBPyConnection) -> list[tuple[int, str]]:
+    """Extract top URL domain prefixes from identifier column.
+    Returns list of (prefix_id, prefix_string) tuples.
+    """
+    print("  Extracting URL domain prefixes...")
+    rows = source_conn.execute("""
+        SELECT
+            regexp_extract(identifier, '^(https?://[^/]+)', 1) AS domain,
+            COUNT(*) AS cnt
+        FROM metadata
+        WHERE identifier IS NOT NULL AND identifier != ''
+        GROUP BY domain
+        ORDER BY cnt DESC
+    """).fetchall()
+    prefixes = [(i, row[0]) for i, row in enumerate(rows) if row[0]]
+    print(f"  Found {len(prefixes)} distinct URL domains")
+    for domain, cnt in rows[:10]:
+        print(f"    {domain}: {cnt:,}")
+    return prefixes
+def create_optimized_db(source_path: str, output_path: str):
+    """Rebuild the DuckDB with all optimizations."""
+    print(f"Source: {source_path} ({os.path.getsize(source_path) / 1024**3:.1f} GB)")
+    print(f"Output: {output_path}")
+    if os.path.exists(output_path):
+        os.remove(output_path)
+    # Also remove WAL file if present
+    wal_path = output_path + ".wal"
+    if os.path.exists(wal_path):
+        os.remove(wal_path)
+    # Open source read-only
+    src = duckdb.connect(source_path, read_only=True)
+    src_count = src.execute("SELECT COUNT(*) FROM metadata").fetchone()[0]
+    print(f"Source rows: {src_count:,}")
+    # Open destination
+    dst = duckdb.connect(output_path)
+    # Allow more memory for sorting 234M rows
+    dst.execute("SET memory_limit = '100GB'")
+    dst.execute("SET threads = 8")
+    # Attach source
+    dst.execute(f"ATTACH '{source_path}' AS src (READ_ONLY)")
+    # ── Step 0: Identify corrupted rows ────────────────────────────
+    print("\n=== Step 0: Identifying corrupted rows ===")
+    corrupted_ids = find_corrupted_ids(src)
+    print(f"  Found {len(corrupted_ids)} rows with column-shift corruption")
+    if corrupted_ids:
+        for cid in sorted(corrupted_ids):
+            print(f"    id={cid}")
+        # Register as a temp table so we can use it in the CREATE TABLE query
+        id_list = ",".join(str(i) for i in corrupted_ids)
+        dst.execute(f"CREATE TEMP TABLE corrupted_ids AS SELECT unnest([{id_list}]) AS id")
+    # ── Step 1: Build ENUM types ─────────────────────────────────────
+    print("\n=== Step 1: Building ENUM types ===")
+    # Exclude corrupted rows from ENUM value discovery
+    exclude_clause = ""
+    if corrupted_ids:
+        exclude_clause = f" AND id NOT IN ({id_list})"
+    enum_types = build_enum_types(src)
+    for col, type_name in enum_types.items():
+        quoted = f'"{col}"' if col in ("order", "class") else col
+        values = src.execute(
+            f"SELECT DISTINCT {quoted} FROM metadata "
+            f"WHERE {quoted} IS NOT NULL{exclude_clause} ORDER BY {quoted}"
+        ).fetchall()
+        value_list = ", ".join(f"'{v[0].replace(chr(39), chr(39)+chr(39))}'" for v in values)
+        dst.execute(f"CREATE TYPE {type_name} AS ENUM ({value_list})")
+    # ── Step 2: Build URL prefix lookup ──────────────────────────────
+    print("\n=== Step 2: Building URL prefix table ===")
+    prefixes = build_url_prefix_table(src)
+    dst.execute("""
+        CREATE TABLE url_prefixes (
+            prefix_id USMALLINT,
+            prefix VARCHAR
+        )
+    """)
+    dst.executemany(
+        "INSERT INTO url_prefixes VALUES (?, ?)",
+        prefixes
+    )
+    # Build a lookup for the SQL CASE expression
+    prefix_map = {prefix: pid for pid, prefix in prefixes}
+    # ── Step 3: Create optimized metadata table ──────────────────────
+    print("\n=== Step 3: Creating optimized metadata table ===")
+    print("  Sorting by source_dataset, taxonomy, common_name...")
+    print("  Splitting identifier into prefix_id + suffix...")
+    # Build column expressions
+    col_exprs = []
+    # id: BIGINT → INTEGER
+    col_exprs.append("CAST(s.id AS INTEGER) AS id")
+    # uuid: VARCHAR → UUID native type
+    col_exprs.append("CAST(s.uuid AS UUID) AS uuid")
+    # Taxonomy columns — NULL out corrupted rows, ENUM cast the rest
+    # For corrupted rows, all taxonomy + common_name are garbage from column shift
+    has_corrupt = len(corrupted_ids) > 0
+    for col in ["kingdom", "phylum", "class", "order", "family", "genus", "species"]:
+        quoted_src = f's."{col}"' if col in ("order", "class") else f"s.{col}"
+        if has_corrupt:
+            clean_expr = (
+                f"CASE WHEN s.id IN (SELECT id FROM corrupted_ids) "
+                f"THEN NULL ELSE {quoted_src} END"
+            )
+        else:
+            clean_expr = quoted_src
+        if col in enum_types:
+            col_exprs.append(
+                f"TRY_CAST({clean_expr} AS {enum_types[col]}) AS \"{col}\""
+            )
+        else:
+            col_exprs.append(f"{clean_expr} AS \"{col}\"")
+    # common_name stays VARCHAR (177K distinct — too high for ENUM)
+    if has_corrupt:
+        col_exprs.append(
+            "CASE WHEN s.id IN (SELECT id FROM corrupted_ids) "
+            "THEN NULL ELSE s.common_name END AS common_name"
+        )
+    else:
+        col_exprs.append("s.common_name")
+    # source_dataset, publisher, img_type, basisOfRecord → ENUM
+    for col in ["source_dataset", "publisher", "img_type", "basisOfRecord"]:
+        if col in enum_types:
+            col_exprs.append(
+                f"TRY_CAST(s.{col} AS {enum_types[col]}) AS {col}"
+            )
+        else:
+            col_exprs.append(f"s.{col}")
+    col_exprs.append("s.source_id")
+    # identifier → split into prefix_id + identifier_suffix
+    # Build a CASE expression to map domain → prefix_id
+    case_parts = []
+    for prefix, pid in sorted(prefix_map.items(), key=lambda x: -len(x[0])):
+        escaped = prefix.replace("'", "''")
+        case_parts.append(
+            f"WHEN s.identifier LIKE '{escaped}%' THEN {pid}"
+        )
+    case_expr = "CASE " + " ".join(case_parts) + " ELSE NULL END"
+    col_exprs.append(f"{case_expr} AS url_prefix_id")
+    # suffix: strip the matched domain prefix
+    suffix_parts = []
+    for prefix, pid in sorted(prefix_map.items(), key=lambda x: -len(x[0])):
+        escaped = prefix.replace("'", "''")
+        suffix_parts.append(
+            f"WHEN s.identifier LIKE '{escaped}%' "
+            f"THEN substr(s.identifier, {len(prefix) + 1})"
+        )
+    suffix_expr = "CASE " + " ".join(suffix_parts) + " ELSE s.identifier END"
+    col_exprs.append(f"{suffix_expr} AS identifier_suffix")
+    col_exprs.append("s.has_url")
+    # in_bioclip2_training: carry through if present in source
+    src_cols = [r[0] for r in src.execute("DESCRIBE metadata").fetchall()]
+    has_training_col = "in_bioclip2_training" in src_cols
+    if has_training_col:
+        col_exprs.append("s.in_bioclip2_training")
+        print("  Including in_bioclip2_training column")
+    select_clause = ",\n    ".join(col_exprs)
+    # Sort order: source_dataset, taxonomy hierarchy, common_name
+    sort_order = (
+        'source_dataset, kingdom, phylum, class, "order", family, genus, species, '
+        "common_name"
+    )
+    t0 = time.time()
+    create_sql = f"""
+        CREATE TABLE metadata AS
+        SELECT
+            {select_clause}
+        FROM src.metadata s
+        ORDER BY {sort_order}
+    """
+    print("  Executing CREATE TABLE ... ORDER BY (this will take a while)...")
+    dst.execute(create_sql)
+    elapsed = time.time() - t0
+    print(f"  Table created in {elapsed:.0f}s ({elapsed/60:.1f} min)")
+    # ── Step 4: Create indexes ───────────────────────────────────────
+    print("\n=== Step 4: Creating indexes ===")
+    t0 = time.time()
+    dst.execute("CREATE INDEX idx_id ON metadata (id)")
+    print(f"  idx_id created in {time.time() - t0:.0f}s")
+    t0 = time.time()
+    if has_training_col:
+        dst.execute(
+            "CREATE INDEX idx_scope ON metadata (source_dataset, has_url, in_bioclip2_training)"
+        )
+    else:
+        dst.execute("CREATE INDEX idx_scope ON metadata (source_dataset, has_url)")
+    print(f"  idx_scope created in {time.time() - t0:.0f}s")
+    # ── Step 5: Validate ─────────────────────────────────────────────
+    print("\n=== Step 5: Validation ===")
+    validate(dst, src, output_path)
+    src.close()
+    dst.close()
+def validate(dst: duckdb.DuckDBPyConnection, src: duckdb.DuckDBPyConnection, output_path: str):
+    """Validate the optimized DB against the source."""
+    dst_count = dst.execute("SELECT COUNT(*) FROM metadata").fetchone()[0]
+    src_count = src.execute("SELECT COUNT(*) FROM metadata").fetchone()[0]
+    print(f"  Source rows:    {src_count:>15,}")
+    print(f"  Output rows:    {dst_count:>15,}")
+    if dst_count != src_count:
+        print("  ERROR: Row count mismatch!")
+    # Check a few random IDs match
+    sample_ids = src.execute(
+        "SELECT id FROM metadata ORDER BY random() LIMIT 20"
+    ).fetchall()
+    id_list = ",".join(str(r[0]) for r in sample_ids)
+    # Compare key fields
+    src_rows = src.execute(
+        f"SELECT id, uuid, kingdom, species, has_url FROM metadata "
+        f"WHERE id IN ({id_list}) ORDER BY id"
+    ).fetchall()
+    dst_rows = dst.execute(
+        f"SELECT id, uuid, kingdom, species, has_url FROM metadata "
+        f"WHERE id IN ({id_list}) ORDER BY id"
+    ).fetchall()
+    # Cast for comparison (uuid type differs in format: no hyphens vs hyphens)
+    mismatches = 0
+    for s, d in zip(src_rows, dst_rows):
+        s_uuid = str(s[1]).replace("-", "")
+        d_uuid = str(d[1]).replace("-", "")
+        if str(s[0]) != str(d[0]) or s_uuid != d_uuid or \
+           str(s[2]) != str(d[2]) or str(s[3]) != str(d[3]) or \
+           s[4] != d[4]:
+            print(f"  MISMATCH: src={s} dst={d}")
+            mismatches += 1
+    if mismatches == 0:
+        print(f"  Spot check: {len(src_rows)} random rows OK")
+    else:
+        print(f"  ERROR: {mismatches} mismatches in spot check!")
+    # URL reconstruction check
+    print("  Checking URL reconstruction...")
+    sample_urls = src.execute(
+        f"SELECT id, identifier FROM metadata "
+        f"WHERE id IN ({id_list}) AND identifier IS NOT NULL "
+        f"ORDER BY id"
+    ).fetchall()
+    dst_urls = dst.execute(
+        f"SELECT m.id, COALESCE(p.prefix, '') || COALESCE(m.identifier_suffix, '') "
+        f"FROM metadata m "
+        f"LEFT JOIN url_prefixes p ON m.url_prefix_id = p.prefix_id "
+        f"WHERE m.id IN ({id_list}) AND m.identifier_suffix IS NOT NULL "
+        f"ORDER BY m.id"
+    ).fetchall()
+    url_mismatches = 0
+    dst_url_map = {r[0]: r[1] for r in dst_urls}
+    for sid, surl in sample_urls:
+        durl = dst_url_map.get(sid)
+        if durl != surl:
+            print(f"  URL MISMATCH id={sid}: src={surl[:80]} dst={durl[:80] if durl else None}")
+            url_mismatches += 1
+    if url_mismatches == 0:
+        print(f"  URL reconstruction: {len(sample_urls)} URLs OK")
+    else:
+        print(f"  ERROR: {url_mismatches} URL mismatches!")
+    # Size report
+    size_gb = os.path.getsize(output_path) / 1024**3
+    print(f"\n  Output size: {size_gb:.2f} GB")
+    # Per-column storage estimate (count distinct blocks × 256 KB block size)
+    print("\n  Column storage breakdown:")
+    storage = dst.execute("""
+        SELECT column_name,
+               COUNT(DISTINCT block_id) * 256.0 / 1024 AS mb
+        FROM pragma_storage_info('metadata')
+        WHERE block_id IS NOT NULL
+        GROUP BY column_name
+        ORDER BY mb DESC
+    """).fetchall()
+    for col, mb in storage:
+        print(f"    {col:<25s} {mb:>8.1f} MB")
+    # Query performance sanity check
+    print("\n  Query performance check:")
+    test_ids = ",".join(str(r[0]) for r in sample_ids[:10])
+    t0 = time.time()
+    for _ in range(100):
+        dst.execute(
+            f"SELECT id, uuid, kingdom, phylum, class, \"order\", family, genus, species, "
+            f"common_name, source_dataset, source_id, publisher, img_type, "
+            f"COALESCE(p.prefix, '') || COALESCE(m.identifier_suffix, '') AS identifier, "
+            f"has_url "
+            f"FROM metadata m "
+            f"LEFT JOIN url_prefixes p ON m.url_prefix_id = p.prefix_id "
+            f"WHERE m.id IN ({test_ids})"
+        ).fetchall()
+    avg_ms = (time.time() - t0) / 100 * 1000
+    print(f"  Avg query time (10 IDs, 100 runs): {avg_ms:.2f} ms")
+    t0 = time.time()
+    for _ in range(100):
+        src.execute(
+            f"SELECT id, uuid, kingdom, phylum, class, \"order\", family, genus, species, "
+            f"common_name, source_dataset, source_id, publisher, img_type, identifier, has_url "
+            f"FROM metadata WHERE id IN ({test_ids})"
+        ).fetchall()
+    avg_ms_src = (time.time() - t0) / 100 * 1000
+    print(f"  Avg query time ORIGINAL (10 IDs, 100 runs): {avg_ms_src:.2f} ms")
+def main():
+    parser = argparse.ArgumentParser(
+        description="Optimize DuckDB: drop columns, ENUM types, sort, split URLs"
+    )
+    parser.add_argument(
+        "--source", required=True,
+        help="Path to source metadata.duckdb"
+    )
+    parser.add_argument(
+        "--output", required=True,
+        help="Path for optimized output .duckdb"
+    )
+    args = parser.parse_args()
+    os.makedirs(os.path.dirname(os.path.abspath(args.output)), exist_ok=True)
+    create_optimized_db(args.source, args.output)
+    print("\nDone.")
+if __name__ == "__main__":
+    main()
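
The URL prefix split in Step 2 and Step 3, and the in-memory reconstruction the app performs at query time, can be illustrated without a database. The sketch below is a simplified stand-in under stated assumptions: `build_prefix_table`, `split_url`, and `join_url` are hypothetical names, and prefixes are matched by exact scheme-plus-host rather than by the longest-prefix-first `LIKE` chain that the script generates.

```python
import re
from collections import Counter

# Same pattern the script passes to regexp_extract: scheme plus host.
DOMAIN_RE = re.compile(r"^(https?://[^/]+)")


def build_prefix_table(urls):
    """Map each distinct scheme+host prefix to a small integer, most frequent first."""
    counts = Counter()
    for url in urls:
        m = DOMAIN_RE.match(url or "")
        if m:
            counts[m.group(1)] += 1
    return {prefix: pid for pid, (prefix, _) in enumerate(counts.most_common())}


def split_url(url, prefix_ids):
    """Return (prefix_id, suffix); prefix_id is None when no known prefix matches."""
    m = DOMAIN_RE.match(url or "")
    if m and m.group(1) in prefix_ids:
        prefix = m.group(1)
        return prefix_ids[prefix], url[len(prefix):]
    return None, url


def join_url(prefix_id, suffix, prefixes_by_id):
    """Rebuild the original identifier, like COALESCE(prefix, '') || COALESCE(suffix, '')."""
    return prefixes_by_id.get(prefix_id, "") + (suffix or "")


urls = [
    "https://a.org/x/1.jpg",
    "https://a.org/x/2.jpg",
    "http://b.net/img",
    "not-a-url",
]
prefix_ids = build_prefix_table(urls)
prefixes_by_id = {pid: prefix for prefix, pid in prefix_ids.items()}

pairs = [split_url(u, prefix_ids) for u in urls]
rebuilt = [join_url(pid, suffix, prefixes_by_id) for pid, suffix in pairs]
print(pairs)
print(rebuilt == urls)
```

Storing a small integer plus a short suffix lets the repeated host strings live once in the lookup table, and a plain dictionary lookup replaces the SQL JOIN when URLs are rebuilt.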

scripts/data/optimize_duckdb.slurm ADDED

	@@ -0,0 +1,31 @@

+#!/bin/bash
+#SBATCH --job-name=duckdb_optimize
+#SBATCH --time=04:00:00
+#SBATCH --nodes=1
+#SBATCH --ntasks=1
+#SBATCH --cpus-per-task=8
+#SBATCH --mem=128G
+#SBATCH --partition=cpu
+#SBATCH --account=<YOUR_ACCOUNT>   # TODO: set your SLURM account
+set -euo pipefail
+# ── Config ──────────────────────────────────────────────────────────
+REPO_ROOT="$(cd "$(dirname "$0")/../.." && pwd)"
+VENV="${BIOCLIP_VENV:?Set BIOCLIP_VENV to your virtualenv path}"
+DATA_DIR="${BIOCLIP_DATA_DIR:?Set BIOCLIP_DATA_DIR to your data directory}"
+echo "=== DuckDB Optimization ==="
+echo "Job ID: $SLURM_JOB_ID"
+echo "Node:   $(hostname)"
+echo "Start:  $(date)"
+echo ""
+source "$VENV/bin/activate"
+python -u "$REPO_ROOT/scripts/data/optimize_duckdb.py" \
+    --source "${1:?Usage: sbatch optimize_duckdb.slurm <source.duckdb>}" \
+    --output "$DATA_DIR/metadata_optimized.duckdb"
+echo ""
+echo "End: $(date)"

scripts/data/validate_optimized_duckdb.py ADDED

	@@ -0,0 +1,405 @@

+"""Validate optimized DuckDB against the original source.
+Checks:
+  1. Row count matches
+  2. Random row spot-checks (id, uuid, taxonomy, has_url)
+  3. URL reconstruction (prefix table + suffix == original identifier)
+  4. Corrupted rows have NULLed taxonomy
+  5. Per-column storage breakdown
+  6. Query performance comparison (optimized vs original)
+  7. Schema and index verification
+Usage:
+    python scripts/data/validate_optimized_duckdb.py \
+        --source /path/to/metadata.duckdb \
+        --optimized /path/to/metadata_optimized.duckdb
+"""
+import argparse
+import os
+import time
+import duckdb
+VALID_KINGDOMS = {
+    'Animalia', 'Plantae', 'Fungi', 'Chromista', 'Protozoa',
+    'Bacteria', 'Archaea', 'Viruses', 'Metazoa',
+    'Archaeplastida', 'incertae sedis',
+}
+TAXONOMY_COLS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]
+# Columns the app selects (from config.py METADATA_COLUMNS)
+APP_COLUMNS = [
+    "id", "uuid", "kingdom", "phylum", "class", '"order"', "family", "genus",
+    "species", "common_name", "source_dataset", "source_id", "publisher",
+    "img_type", "identifier", "has_url", "in_bioclip2_training",
+]
+def validate(source_path: str, optimized_path: str):
+    passed = 0
+    failed = 0
+    src = duckdb.connect(source_path, read_only=True)
+    opt = duckdb.connect(optimized_path, read_only=True)
+    # ── 1. Row count ─────────────────────────────────────────────────
+    print("=== 1. Row Count ===")
+    src_count = src.execute("SELECT COUNT(*) FROM metadata").fetchone()[0]
+    opt_count = opt.execute("SELECT COUNT(*) FROM metadata").fetchone()[0]
+    print(f"  Source:    {src_count:>15,}")
+    print(f"  Optimized: {opt_count:>15,}")
+    if src_count == opt_count:
+        print("  PASS")
+        passed += 1
+    else:
+        print("  FAIL: row count mismatch")
+        failed += 1
+    # ── 2. Random spot-check ─────────────────────────────────────────
+    print("\n=== 2. Random Spot-Check (100 rows) ===")
+    sample_ids = src.execute(
+        "SELECT id FROM metadata ORDER BY random() LIMIT 100"
+    ).fetchall()
+    id_list = ",".join(str(r[0]) for r in sample_ids)
+    src_rows = src.execute(
+        f"SELECT id, uuid, kingdom, species, has_url, source_dataset "
+        f"FROM metadata WHERE id IN ({id_list}) ORDER BY id"
+    ).fetchall()
+    opt_rows = opt.execute(
+        f"SELECT id, uuid, kingdom, species, has_url, source_dataset "
+        f"FROM metadata WHERE id IN ({id_list}) ORDER BY id"
+    ).fetchall()
+    mismatches = 0
+    for s, o in zip(src_rows, opt_rows):
+        s_uuid = str(s[1]).replace("-", "")
+        o_uuid = str(o[1]).replace("-", "")
+        # kingdom/species may be NULL in optimized if row was corrupted
+        s_kingdom = str(s[2]) if s[2] else None
+        o_kingdom = str(o[2]) if o[2] else None
+        s_species = str(s[3]) if s[3] else None
+        o_species = str(o[3]) if o[3] else None
+        id_ok = s[0] == o[0]
+        uuid_ok = s_uuid == o_uuid
+        has_url_ok = s[4] == o[4]
+        source_ok = str(s[5]) == str(o[5])
+        # Taxonomy may differ if row was corrupted (NULLed in optimized)
+        taxonomy_ok = (o_kingdom == s_kingdom and o_species == s_species) or \
+                      (o_kingdom is None and s_kingdom not in VALID_KINGDOMS)
+        if not (id_ok and uuid_ok and has_url_ok and source_ok and taxonomy_ok):
+            print(f"  MISMATCH id={s[0]}:")
+            print(f"    src: uuid={s[1]}, kingdom={s[2]}, species={s[3]}, has_url={s[4]}")
+            print(f"    opt: uuid={o[1]}, kingdom={o[2]}, species={o[3]}, has_url={o[4]}")
+            mismatches += 1
+    if mismatches == 0:
+        print(f"  PASS ({len(src_rows)} rows checked)")
+        passed += 1
+    else:
+        print(f"  FAIL: {mismatches} mismatches")
+        failed += 1
+    # ── 3. URL reconstruction ────────────────────────────────────────
+    print("\n=== 3. URL Reconstruction ===")
+    has_prefix_table = opt.execute(
+        "SELECT COUNT(*) FROM information_schema.tables "
+        "WHERE table_name = 'url_prefixes'"
+    ).fetchone()[0] > 0
+    if has_prefix_table:
+        # Reuse the 100 IDs sampled in check 2; compare only rows that have a URL
+        url_sample = src.execute(
+            f"SELECT id, identifier FROM metadata "
+            f"WHERE id IN ({id_list}) AND identifier IS NOT NULL "
+            f"ORDER BY id"
+        ).fetchall()
+        opt_urls = opt.execute(
+            f"SELECT m.id, "
+            f"  COALESCE(p.prefix, '') || COALESCE(m.identifier_suffix, '') "
+            f"FROM metadata m "
+            f"LEFT JOIN url_prefixes p ON m.url_prefix_id = p.prefix_id "
+            f"WHERE m.id IN ({id_list}) AND "
+            f"  (m.identifier_suffix IS NOT NULL OR m.url_prefix_id IS NOT NULL) "
+            f"ORDER BY m.id"
+        ).fetchall()
+        opt_url_map = {r[0]: r[1] for r in opt_urls}
+        url_mismatches = 0
+        for sid, surl in url_sample:
+            ourl = opt_url_map.get(sid)
+            if ourl != surl:
+                print(f"  MISMATCH id={sid}:")
+                print(f"    src: {surl[:100]}")
+                print(f"    opt: {ourl[:100] if ourl else None}")
+                url_mismatches += 1
+        if url_mismatches == 0:
+            print(f"  PASS ({len(url_sample)} URLs checked)")
+            passed += 1
+        else:
+            print(f"  FAIL: {url_mismatches} URL mismatches")
+            failed += 1
+    else:
+        print("  SKIP: no url_prefixes table found")
+    # ── 4. Corrupted row cleanup ─────────────────────────────────────
+    print("\n=== 4. Corrupted Row Cleanup ===")
+    placeholders_str = ",".join(f"'{k}'" for k in VALID_KINGDOMS)
+    # Find corrupted IDs from source
+    corrupt_src = src.execute(f"""
+        SELECT id FROM metadata
+        WHERE kingdom IS NOT NULL AND kingdom NOT IN ({placeholders_str})
+    """).fetchall()
+    corrupt_ids = [r[0] for r in corrupt_src]
+    if corrupt_ids:
+        corrupt_id_list = ",".join(str(i) for i in corrupt_ids)
+        # Check that these rows have NULL taxonomy in optimized
+        opt_corrupt = opt.execute(f"""
+            SELECT id, kingdom, phylum, class, "order", family, genus, species, common_name
+            FROM metadata
+            WHERE id IN ({corrupt_id_list})
+        """).fetchall()
+        not_cleaned = 0
+        for row in opt_corrupt:
+            # All taxonomy cols (index 1-8) should be NULL
+            for i, col in enumerate(TAXONOMY_COLS + ["common_name"], 1):
+                if row[i] is not None:
+                    print(f"  NOT CLEANED id={row[0]}: {col}={row[i]}")
+                    not_cleaned += 1
+                    break
+        if not_cleaned == 0:
+            print(f"  PASS ({len(corrupt_ids)} corrupted rows have NULLed taxonomy)")
+            passed += 1
+        else:
+            print(f"  FAIL: {not_cleaned} rows still have non-NULL taxonomy")
+            failed += 1
+    else:
+        print("  SKIP: no corrupted rows found in source")
+    # ── 5. No new corruption introduced ──────────────────────────────
+    print("\n=== 5. No New Corruption ===")
+    # Check that all non-NULL kingdom values in optimized are valid
+    opt_kingdoms = opt.execute("""
+        SELECT DISTINCT kingdom FROM metadata WHERE kingdom IS NOT NULL
+    """).fetchall()
+    invalid = [r[0] for r in opt_kingdoms if str(r[0]) not in VALID_KINGDOMS]
+    if not invalid:
+        print(f"  PASS (all {len(opt_kingdoms)} distinct kingdoms are valid)")
+        passed += 1
+    else:
+        print(f"  FAIL: invalid kingdoms found: {invalid[:10]}")
+        failed += 1
+    # ── 5b. in_bioclip2_training column ─────────────────────────────
+    print("\n=== 5b. in_bioclip2_training Column ===")
+    opt_cols = [r[0] for r in opt.execute("DESCRIBE metadata").fetchall()]
+    src_cols = [r[0] for r in src.execute("DESCRIBE metadata").fetchall()]
+    if "in_bioclip2_training" in src_cols and "in_bioclip2_training" in opt_cols:
+        src_training = src.execute(
+            "SELECT COUNT(*) FROM metadata WHERE in_bioclip2_training = true"
+        ).fetchone()[0]
+        opt_training = opt.execute(
+            "SELECT COUNT(*) FROM metadata WHERE in_bioclip2_training = true"
+        ).fetchone()[0]
+        print(f"  Source training count:    {src_training:>15,}")
+        print(f"  Optimized training count: {opt_training:>15,}")
+        if src_training == opt_training:
+            print("  PASS")
+            passed += 1
+        else:
+            print("  FAIL: training count mismatch")
+            failed += 1
+        # Spot-check: verify a sample of training rows match
+        sample_training = src.execute(
+            "SELECT id FROM metadata WHERE in_bioclip2_training = true "
+            "ORDER BY random() LIMIT 50"
+        ).fetchall()
+        if sample_training:
+            training_ids = ",".join(str(r[0]) for r in sample_training)
+            opt_check = opt.execute(
+                f"SELECT COUNT(*) FROM metadata "
+                f"WHERE id IN ({training_ids}) AND in_bioclip2_training = true"
+            ).fetchone()[0]
+            if opt_check == len(sample_training):
+                print(f"  PASS (spot-check: {len(sample_training)} training rows verified)")
+                passed += 1
+            else:
+                print(f"  FAIL: only {opt_check}/{len(sample_training)} training rows found")
+                failed += 1
+    elif "in_bioclip2_training" not in src_cols:
+        print("  SKIP: column not in source DB")
+    else:
+        print("  FAIL: column missing from optimized DB")
+        failed += 1
+    # ── 6. Schema and indexes ────────────────────────────────────────
+    print("\n=== 6. Schema & Indexes ===")
+    schema = opt.execute("DESCRIBE metadata").fetchall()
+    col_types = {r[0]: r[1] for r in schema}
+    print("  Columns:")
+    for name, dtype in col_types.items():
+        # Truncate long ENUM type strings
+        dtype_str = str(dtype)
+        if len(dtype_str) > 60:
+            dtype_str = dtype_str[:57] + "..."
+        print(f"    {name:<25s} {dtype_str}")
+    indexes = opt.execute(
+        "SELECT index_name FROM duckdb_indexes()"
+    ).fetchall()
+    idx_names = {r[0] for r in indexes}
+    print(f"\n  Indexes: {', '.join(sorted(idx_names))}")
+    required_indexes = {"idx_id", "idx_scope"}
+    if required_indexes.issubset(idx_names):
+        print("  PASS (required indexes present)")
+        passed += 1
+    else:
+        missing = required_indexes - idx_names
+        print(f"  FAIL: missing indexes: {missing}")
+        failed += 1
+    # Check id type is INTEGER (not BIGINT)
+    if "INTEGER" in str(col_types.get("id", "")):
+        print("  PASS (id is INTEGER)")
+        passed += 1
+    else:
+        print(f"  FAIL: id type is {col_types.get('id')}, expected INTEGER")
+        failed += 1
+    # Check uuid type is UUID (not VARCHAR)
+    if "UUID" in str(col_types.get("uuid", "")):
+        print("  PASS (uuid is native UUID)")
+        passed += 1
+    else:
+        print(f"  FAIL: uuid type is {col_types.get('uuid')}, expected UUID")
+        failed += 1
+    # ── 7. Column storage breakdown ──────────────────────────────────
+    print("\n=== 7. Storage Breakdown ===")
+    src_size = os.path.getsize(source_path) / 1024**3
+    opt_size = os.path.getsize(optimized_path) / 1024**3
+    print(f"  Source:    {src_size:.2f} GB")
+    print(f"  Optimized: {opt_size:.2f} GB")
+    print(f"  Reduction: {(1 - opt_size/src_size)*100:.1f}%")
+    storage = opt.execute("""
+        SELECT column_name,
+               COUNT(DISTINCT block_id) * 256.0 / 1024 AS mb
+        FROM pragma_storage_info('metadata')
+        WHERE block_id IS NOT NULL
+        GROUP BY column_name
+        ORDER BY mb DESC
+    """).fetchall()
+    total = 0
+    print(f"\n  {'Column':<25s} {'Size (MB)':>10s}")
+    print(f"  {'-'*25} {'-'*10}")
+    for col, mb in storage:
+        print(f"  {col:<25s} {mb:>10.1f}")
+        total += mb
+    print(f"  {'-'*25} {'-'*10}")
+    print(f"  {'TOTAL':<25s} {total:>10.1f}")
+    # ── 8. Query performance ─────────────────────────────────────────
+    print("\n=== 8. Query Performance ===")
+    test_ids = ",".join(str(r[0]) for r in sample_ids[:10])
+    # Optimized query (with URL join)
+    opt_query = (
+        f"SELECT m.id, m.uuid, m.kingdom, m.phylum, m.class, m.\"order\", "
+        f"m.family, m.genus, m.species, m.common_name, m.source_dataset, "
+        f"m.source_id, m.publisher, m.img_type, "
+        f"COALESCE(p.prefix, '') || COALESCE(m.identifier_suffix, '') AS identifier, "
+        f"m.has_url "
+        f"FROM metadata m "
+        f"LEFT JOIN url_prefixes p ON m.url_prefix_id = p.prefix_id "
+        f"WHERE m.id IN ({test_ids})"
+    )
+    # Source query (direct)
+    src_query = (
+        f'SELECT id, uuid, kingdom, phylum, class, "order", family, genus, '
+        f"species, common_name, source_dataset, source_id, publisher, "
+        f"img_type, identifier, has_url "
+        f"FROM metadata WHERE id IN ({test_ids})"
+    )
+    # Warmup
+    opt.execute(opt_query).fetchall()
+    src.execute(src_query).fetchall()
+    iterations = 500
+    t0 = time.perf_counter()
+    for _ in range(iterations):
+        opt.execute(opt_query).fetchall()
+    opt_ms = (time.perf_counter() - t0) / iterations * 1000
+    t0 = time.perf_counter()
+    for _ in range(iterations):
+        src.execute(src_query).fetchall()
+    src_ms = (time.perf_counter() - t0) / iterations * 1000
+    print(f"  Optimized (10 IDs, {iterations} runs): {opt_ms:.2f} ms avg")
+    print(f"  Original  (10 IDs, {iterations} runs): {src_ms:.2f} ms avg")
+    ratio = opt_ms / src_ms if src_ms > 0 else float('inf')
+    if ratio < 2.0:
+        print(f"  PASS (ratio: {ratio:.2f}x)")
+        passed += 1
+    else:
+        print(f"  FAIL: optimized is {ratio:.1f}x slower than original")
+        failed += 1
+    # Also test scope-filtered queries
+    t0 = time.perf_counter()
+    for _ in range(iterations):
+        opt.execute(
+            f"{opt_query} AND m.has_url = true"
+        ).fetchall()
+    opt_scope_ms = (time.perf_counter() - t0) / iterations * 1000
+    t0 = time.perf_counter()
+    for _ in range(iterations):
+        src.execute(
+            f"{src_query} AND has_url = true"
+        ).fetchall()
+    src_scope_ms = (time.perf_counter() - t0) / iterations * 1000
+    print(f"  Optimized scoped (url_only): {opt_scope_ms:.2f} ms avg")
+    print(f"  Original  scoped (url_only): {src_scope_ms:.2f} ms avg")
+    # ── Summary ──────────────────────────────────────────────────────
+    print(f"\n{'='*50}")
+    print(f"PASSED: {passed}  FAILED: {failed}")
+    if failed == 0:
+        print("ALL CHECKS PASSED")
+    else:
+        print("SOME CHECKS FAILED — review above")
+    src.close()
+    opt.close()
+def main():
+    parser = argparse.ArgumentParser(description="Validate optimized DuckDB")
+    parser.add_argument("--source", required=True, help="Original metadata.duckdb")
+    parser.add_argument("--optimized", required=True, help="Optimized metadata.duckdb")
+    args = parser.parse_args()
+    validate(args.source, args.optimized)
+if __name__ == "__main__":
+    main()
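
The URL reconstruction check in section 3 relies on `prefix + suffix` rebuilding the original `identifier` exactly. Below is a minimal sketch of that round-trip property on plain strings. It assumes the prefix is the scheme plus host and uses `urllib.parse.urlsplit` to find the split point, which is a stand-in for the `substr()` logic in optimize_duckdb.py (not shown in this diff). `split_url`, `check_round_trip`, and the sample URLs are hypothetical.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Split a URL into (prefix, suffix), where prefix is scheme://host.

    Assumption: this mirrors the domain/path split described in the diff;
    the real pipeline performs the split in SQL.
    """
    parts = urlsplit(url)
    prefix = f"{parts.scheme}://{parts.netloc}"
    suffix = url[len(prefix):]
    return prefix, suffix


def check_round_trip(urls):
    """Return the URLs whose prefix + suffix does not rebuild the original."""
    mismatches = []
    for url in urls:
        prefix, suffix = split_url(url)
        if prefix + suffix != url:
            mismatches.append(url)
    return mismatches


# Made-up sample URLs for illustration only.
sample = [
    "https://content.eol.org/data/media/example.jpg",
    "https://example.org/photos/1/original.jpeg?size=large",
]
print(check_round_trip(sample))
```

An empty list means every sampled URL survives the split unchanged, which is the same pass condition the validation script applies to the sampled rows.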

scripts/data/validate_optimized_duckdb.slurm ADDED

	@@ -0,0 +1,31 @@

+#!/bin/bash
+#SBATCH --job-name=duckdb_validate
+#SBATCH --time=01:00:00
+#SBATCH --nodes=1
+#SBATCH --ntasks=1
+#SBATCH --cpus-per-task=4
+#SBATCH --mem=64G
+#SBATCH --partition=cpu
+#SBATCH --account=<YOUR_ACCOUNT>   # TODO: set your SLURM account
+set -euo pipefail
+# ── Config ──────────────────────────────────────────────────────────
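+# Under sbatch, $0 can resolve to Slurm's spooled copy of this script rather
+# than the repo path, so allow an optional BIOCLIP_REPO_ROOT override.
+REPO_ROOT="${BIOCLIP_REPO_ROOT:-$(cd "$(dirname "$0")/../.." && pwd)}"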
+VENV="${BIOCLIP_VENV:?Set BIOCLIP_VENV to your virtualenv path}"
+DATA_DIR="${BIOCLIP_DATA_DIR:?Set BIOCLIP_DATA_DIR to your data directory}"
+echo "=== DuckDB Validation ==="
+echo "Job ID: ${SLURM_JOB_ID:-n/a}"
+echo "Node:   $(hostname)"
+echo "Start:  $(date)"
+echo ""
+source "$VENV/bin/activate"
+python -u "$REPO_ROOT/scripts/data/validate_optimized_duckdb.py" \
+    --source "${1:?Usage: sbatch validate_optimized_duckdb.slurm <source.duckdb>}" \
+    --optimized "$DATA_DIR/metadata_optimized.duckdb"
+echo ""
+echo "End: $(date)"

src/bioclip_lite/config.py CHANGED

@@ -27,7 +27,7 @@ class LiteConfig:
     default_nprobe: int = 16
     over_fetch_factor: int = 3
-    # Scope: "all" | "url_only" | "inaturalist"
     scope: str = "all"
     # Server
@@ -46,10 +46,13 @@ class LiteConfig:
     image_fetch_max_workers: int = 8
     thumbnail_max_dim: int = 256
-    # Metadata columns to SELECT (15 of 18 — excludes resolution_status, basisOfRecord, scientific_name)
     METADATA_COLUMNS: str = (
         'id, uuid, kingdom, phylum, class, "order", family, genus, species, '
-        "common_name, source_dataset, source_id, publisher, img_type, identifier, has_url"
     )
@@ -138,7 +141,10 @@ def parse_args() -> LiteConfig:
     )
     p.add_argument("--device", default="cpu", choices=["cpu", "cuda", "mps"])
     p.add_argument("--model-str", default=None, help="Model identifier")
-    p.add_argument("--scope", default="all", choices=["all", "url_only", "inaturalist"])
     p.add_argument("--host", default="0.0.0.0")
     p.add_argument("--port", type=int, default=7860)
     p.add_argument("--enable-export", action="store_true")

     default_nprobe: int = 16
     over_fetch_factor: int = 3
+    # Scope: "all" | "url_only" | "inaturalist" | "bioclip2_training"
     scope: str = "all"
     # Server
     image_fetch_max_workers: int = 8
     thumbnail_max_dim: int = 256
+    # Metadata columns to SELECT from optimized DB.
+    # URL is split into url_prefix_id + identifier_suffix; reconstructed in Python.
     METADATA_COLUMNS: str = (
         'id, uuid, kingdom, phylum, class, "order", family, genus, species, '
+        "common_name, source_dataset, source_id, publisher, img_type, "
+        "basisOfRecord, url_prefix_id, identifier_suffix, has_url, "
+        "in_bioclip2_training"
     )
     )
     p.add_argument("--device", default="cpu", choices=["cpu", "cuda", "mps"])
     p.add_argument("--model-str", default=None, help="Model identifier")
+    p.add_argument(
+        "--scope", default="all",
+        choices=["all", "url_only", "inaturalist", "bioclip2_training"],
+    )
     p.add_argument("--host", default="0.0.0.0")
     p.add_argument("--port", type=int, default=7860)
     p.add_argument("--enable-export", action="store_true")

src/bioclip_lite/services/search_service.py CHANGED

@@ -19,6 +19,7 @@ SCOPE_MAP = {
     "All Sources": "all",
     "URL-Available Only": "url_only",
     "iNaturalist Only": "inaturalist",
 }
@@ -62,6 +63,10 @@ class SearchService:
         row_count = self.conn.execute("SELECT COUNT(*) FROM metadata").fetchone()[0]
         logger.info(f"DuckDB connected: {row_count:,} rows")
     @_timer
     def search(
         self,
@@ -76,7 +81,7 @@ class SearchService:
             query_vector: 1-D embedding vector (768-dim for BioCLIP-2).
             top_n: Number of results to return after scope filtering.
             nprobe: Number of IVF partitions to search.
-            scope: "all", "url_only", or "inaturalist".
         Returns:
             List of result dicts ordered by distance, each containing
@@ -126,28 +131,34 @@ class SearchService:
         distances: List[float],
         scope: str,
     ) -> List[Dict[str, Any]]:
-        """Query DuckDB for metadata, applying scope filter."""
-        id_list = ",".join(str(i) for i in ids)
-        where = [f"id IN ({id_list})"]
-        if scope == "url_only":
-            where.append("has_url = true")
-        elif scope == "inaturalist":
-            where.append("has_url = true")
-            where.append("source_dataset = 'gbif'")
-            where.append("publisher LIKE '%iNaturalist%'")
         query = (
             f"SELECT {self.metadata_columns} FROM metadata "
-            f"WHERE {' AND '.join(where)}"
         )
         rows = self.conn.execute(query).fetchall()
         col_names = [desc[0] for desc in self.conn.description]
-        # Build lookup keyed by id
         meta_map: Dict[int, Dict] = {}
         for row in rows:
             d = dict(zip(col_names, row))
             meta_map[d["id"]] = d
         # Merge with distances, preserving FAISS ranking
@@ -155,6 +166,20 @@ class SearchService:
         for fid, dist in zip(ids, distances):
             if fid in meta_map:
                 results.append({"distance": dist, **meta_map[fid]})
         return results
     @property
@@ -165,5 +190,19 @@ class SearchService:
     def total_vectors(self) -> int:
         return self.index.ntotal
     def close(self):
         self.conn.close()

     "All Sources": "all",
     "URL-Available Only": "url_only",
     "iNaturalist Only": "inaturalist",
+    "BioCLIP 2 Training": "bioclip2_training",
 }
         row_count = self.conn.execute("SELECT COUNT(*) FROM metadata").fetchone()[0]
         logger.info(f"DuckDB connected: {row_count:,} rows")
+        # Load URL prefix lookup (410 entries, ~50 KB in memory).
+        # Reconstructs full URLs in Python instead of a SQL JOIN.
+        self._url_prefixes = self._load_url_prefixes()
     @_timer
     def search(
         self,
             query_vector: 1-D embedding vector (768-dim for BioCLIP-2).
             top_n: Number of results to return after scope filtering.
             nprobe: Number of IVF partitions to search.
+            scope: "all", "url_only", "inaturalist", or "bioclip2_training".
         Returns:
             List of result dicts ordered by distance, each containing
         distances: List[float],
         scope: str,
     ) -> List[Dict[str, Any]]:
+        """Query DuckDB for metadata, filtering by scope in Python.
+        Scope filtering via SQL WHERE clauses causes ~370x slowdown on
+        ID-based lookups (4ms → 1600ms) because DuckDB scans the full
+        column even when nearly all rows match. Since has_url and
+        in_bioclip2_training are true for >87% of rows, post-filtering
+        in Python is far more efficient.
+        """
+        id_list = ",".join(str(i) for i in ids)
         query = (
             f"SELECT {self.metadata_columns} FROM metadata "
+            f"WHERE id IN ({id_list})"
         )
         rows = self.conn.execute(query).fetchall()
         col_names = [desc[0] for desc in self.conn.description]
+        # Build lookup keyed by id, reconstructing full URL from prefix + suffix
         meta_map: Dict[int, Dict] = {}
         for row in rows:
             d = dict(zip(col_names, row))
+            if self._url_prefixes and "url_prefix_id" in d:
+                # Prefixes are domains (e.g. "https://content.eol.org"),
+                # suffixes always start with "/" (e.g. "/data/media/...").
+                # Split is guaranteed by optimize_duckdb.py's substr().
+                prefix = self._url_prefixes.get(d.pop("url_prefix_id"), "")
+                suffix = d.pop("identifier_suffix", "") or ""
+                d["identifier"] = prefix + suffix if (prefix or suffix) else None
             meta_map[d["id"]] = d
         # Merge with distances, preserving FAISS ranking
         for fid, dist in zip(ids, distances):
             if fid in meta_map:
                 results.append({"distance": dist, **meta_map[fid]})
+        # Apply scope filter in Python (much faster than SQL WHERE)
+        if scope == "url_only":
+            results = [r for r in results if r.get("has_url")]
+        elif scope == "inaturalist":
+            results = [
+                r for r in results
+                if r.get("has_url")
+                and r.get("source_dataset") == "gbif"
+                and "iNaturalist" in (r.get("publisher") or "")
+            ]
+        elif scope == "bioclip2_training":
+            results = [r for r in results if r.get("in_bioclip2_training")]
         return results
     @property
     def total_vectors(self) -> int:
         return self.index.ntotal
+    def _load_url_prefixes(self) -> Dict[int, str]:
+        """Load url_prefixes table into a dict for fast in-Python URL reconstruction."""
+        try:
+            rows = self.conn.execute(
+                "SELECT prefix_id, prefix FROM url_prefixes"
+            ).fetchall()
+            prefixes = {row[0]: row[1] for row in rows}
+            logger.info(f"Loaded {len(prefixes)} URL prefixes")
+            return prefixes
+        except duckdb.CatalogException:
+            # Legacy DB without url_prefixes table — identifier is a direct column
+            logger.info("No url_prefixes table found, using direct identifier column")
+            return {}
     def close(self):
         self.conn.close()
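
The two Python-side steps in the new metadata lookup, URL reconstruction from the prefix dict and scope post-filtering, can be sketched without DuckDB or FAISS. This is an illustrative approximation of the logic in the diff, not the service code itself. `rebuild_identifier`, `filter_by_scope`, and the sample rows are hypothetical names and data; the column names are taken from `METADATA_COLUMNS`.

```python
from typing import Any, Dict, List, Optional


def rebuild_identifier(row: Dict[str, Any], prefixes: Dict[int, str]) -> Optional[str]:
    """Join the looked-up prefix and the stored suffix; None if both are empty."""
    prefix = prefixes.get(row.get("url_prefix_id"), "")
    suffix = row.get("identifier_suffix") or ""
    return (prefix + suffix) or None


def filter_by_scope(results: List[Dict[str, Any]], scope: str) -> List[Dict[str, Any]]:
    """Apply the scope filter to already-fetched rows, preserving their order."""
    if scope == "url_only":
        return [r for r in results if r.get("has_url")]
    if scope == "inaturalist":
        return [
            r for r in results
            if r.get("has_url")
            and r.get("source_dataset") == "gbif"
            and "iNaturalist" in (r.get("publisher") or "")
        ]
    if scope == "bioclip2_training":
        return [r for r in results if r.get("in_bioclip2_training")]
    return results  # "all": no filtering


# Made-up rows and prefix table for illustration only.
prefixes = {0: "https://example.org"}
rows = [
    {"id": 1, "url_prefix_id": 0, "identifier_suffix": "/a.jpg", "has_url": True,
     "source_dataset": "gbif", "publisher": "iNaturalist",
     "in_bioclip2_training": True},
    {"id": 2, "url_prefix_id": None, "identifier_suffix": None, "has_url": False,
     "source_dataset": "eol", "publisher": None,
     "in_bioclip2_training": False},
]
for row in rows:
    row["identifier"] = rebuild_identifier(row, prefixes)

print([r["id"] for r in filter_by_scope(rows, "url_only")])
```

Because the filter runs on rows that are already in FAISS rank order, it can only drop results, never reorder them.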