Net Zhang, Claude Opus 4.6, Elizabeth Campolongo committed on
Commit 3b98575 · unverified · 1 Parent(s): b49de53

Reduce DuckDB metadata from 25.8 GB to 13.5 GB (#23)

* Add DuckDB optimization scripts (#11)

Reduce metadata.duckdb from 25.8 GB to 13.5 GB (47.8%) via column
pruning, ENUM types, taxonomy sort order, URL prefix splitting, and
type downcasting. Includes cleanup of 47 corrupted rows from GBIF
column-shift misalignment.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Adapt app to optimized DuckDB schema

Update METADATA_COLUMNS for split URL (url_prefix_id + identifier_suffix)
and add basisOfRecord. Reconstruct full URLs via in-memory prefix dict
lookup (410 entries loaded at startup) instead of SQL JOIN for zero
query latency impact. Falls back to direct identifier column for
legacy DB compatibility.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Update convert script as stage 1 of two-stage pipeline

convert_duckdb_lite.py now handles raw import only (SQLite/DuckDB →
base DB with has_url column). Removed idx_scope creation (handled by
optimize_duckdb.py in stage 2). SLURM script chains both stages.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Use temp file for intermediate DB in conversion pipeline

Avoid leaving a duplicate base DB on disk. The intermediate file is
cleaned up automatically via trap after the optimize step completes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add BioCLIP 2 Training scope and backfill metadata pipeline

- Add `in_bioclip2_training` boolean column to DuckDB pipeline
(convert, optimize, validate) from training catalog parquet
- Add "BioCLIP 2 Training" scope to app dropdown, config, and
search service
- Switch scope filtering from SQL WHERE to Python post-filter
(benchmarked ~370x faster for ID-based lookups)
- Fix `src.metadata` reference bug in optimize_duckdb.py validation
- Update README scope table and add filtering rationale
- Update HF data card with new column, backfill details, and
revised data coverage numbers
- Add scripts/data/README.md documenting the optimized schema

Closes #24

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Clarify in_bioclip2_training count discrepancy

* Address review feedback

- update inline url to direct to the exact catalog file
- add inline documentation in search_service.py to clarify URL suffix
and prefix format

* Address review feedback

- Link in_bioclip2_training to catalog.parquet file instead of repo tree
- Document url_prefix_id and identifier_suffix columns in HF data card
- Add url_prefixes table schema and URL reconstruction section with SQL
& Python examples
- Update column mapping table for the prefix/suffix split
- Add inline comment in search_service.py clarifying prefix/suffix convention

* Apply suggestions from code review

Co-authored-by: Elizabeth Campolongo <38985481+egrace479@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Net Zhang <48858129+NetZissou@users.noreply.github.com>

* Add `.gitattributes` and normalize line endings to LF

The GitHub web UI introduced CRLF line endings in `hf-data-card-README.md`,
causing noisy full-file diffs.

This commit normalizes line endings in that file to LF. Future commits
will be normalized automatically.

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Elizabeth Campolongo <38985481+egrace479@users.noreply.github.com>

.gitattributes ADDED
@@ -0,0 +1,2 @@
+ # Normalize line endings to LF in the repo, native on checkout
+ * text=auto
.gitignore CHANGED
@@ -7,3 +7,6 @@ build/
7
  *.duckdb
8
  *.index
9
  logs/
 
 
 
 
7
  *.duckdb
8
  *.index
9
  logs/
10
+
11
+ # SLURM output files
12
+ *.out
README.md CHANGED
@@ -54,7 +54,7 @@ Everything runs in a single Gradio process. No microservices, no HDF5 files.
54
  | Component | Size |
55
  |-----------|------|
56
  | FAISS index | 5.8 GB |
57
- | DuckDB metadata | 25.8 GB |
58
  | Model weights | ~2.5 GB (downloaded on first run) |
59
  | Image storage | 0 (fetched from source URLs) |
60
 
@@ -105,16 +105,21 @@ Then open `http://<hostname>:7860` in your browser.
105
 
106
  ## Scope filtering
107
 
108
- Not all 234M images have source URLs. Use the scope dropdown to control which results appear:
109
 
110
  | Scope | Images | Description |
111
  |-------|--------|-------------|
112
  | All Sources | 234M | Everything, including results without images |
113
- | URL-Available Only | 207M (88%) | Only results with fetchable source URLs |
114
  | iNaturalist Only | 135M (58%) | iNaturalist observations via AWS Open Data |
 
115
 
116
  The app over-fetches from FAISS (3x by default) and filters post-search, so you still get the requested number of results after filtering.
117
 
 
 
 
 
118
  ## Architecture
119
 
120
  ```
 
54
  | Component | Size |
55
  |-----------|------|
56
  | FAISS index | 5.8 GB |
57
+ | DuckDB metadata | ~14 GB (optimized) |
58
  | Model weights | ~2.5 GB (downloaded on first run) |
59
  | Image storage | 0 (fetched from source URLs) |
60
 
 
105
 
106
  ## Scope filtering
107
 
108
+ Use the scope dropdown to control which results appear:
109
 
110
  | Scope | Images | Description |
111
  |-------|--------|-------------|
112
  | All Sources | 234M | Everything, including results without images |
113
+ | URL-Available Only | 234M (99.99%) | Only results with fetchable source URLs |
114
  | iNaturalist Only | 135M (58%) | iNaturalist observations via AWS Open Data |
115
+ | BioCLIP 2 Training | 206M (88%) | Records used in BioCLIP 2 model training |
116
 
117
  The app over-fetches from FAISS (3x by default) and filters post-search, so you still get the requested number of results after filtering.
118
 
119
+ ### Why scope filtering is done in Python
120
+
121
+ Scope filters (`has_url`, `in_bioclip2_training`, etc.) are applied in Python after the DuckDB query, not as SQL WHERE clauses. Benchmarking showed that adding boolean WHERE clauses to ID-based lookups causes a ~370x slowdown (4ms to 1500ms for 50 IDs) because DuckDB scans the full boolean column rather than using the index for small IN-list queries. Since the majority of rows pass these filters (e.g., 100% have URLs, 88% are in training), fetching all results and filtering in Python adds negligible overhead (~3ms) while keeping query latency low.
122
+
123
  ## Architecture
124
 
125
  ```
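The scope-filtering rationale added to README.md above (over-fetch from FAISS, look up rows by ID, then filter in Python) can be illustrated with a minimal sketch. The predicate table, `filter_by_scope`, `make_row`, and the sample rows are hypothetical names introduced for illustration; only the scope labels and column names come from the diff, and this is not the code in `search_service.py`.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical predicate table keyed by the scope labels in SCOPE_CHOICES.
# Column names follow the documented schema; the table itself is an assumption.
SCOPE_FILTERS: Dict[str, Callable[[dict], bool]] = {
    "All Sources": lambda row: True,
    "URL-Available Only": lambda row: bool(row.get("has_url")),
    "iNaturalist Only": lambda row: (
        row.get("source_dataset") == "gbif"
        and "iNaturalist" in (row.get("publisher") or "")
    ),
    "BioCLIP 2 Training": lambda row: bool(row.get("in_bioclip2_training")),
}


def filter_by_scope(rows: List[dict], scope: str, top_n: int) -> List[dict]:
    """Apply the scope predicate in Python, keeping the incoming rank order."""
    keep = SCOPE_FILTERS[scope]
    return [row for row in rows if keep(row)][:top_n]


def make_row(row_id: int, source_dataset: str, publisher: Optional[str],
             has_url: bool, in_training: bool) -> dict:
    """Build a stand-in for one metadata row fetched by ID (illustrative only)."""
    return {
        "id": row_id,
        "source_dataset": source_dataset,
        "publisher": publisher,
        "has_url": has_url,
        "in_bioclip2_training": in_training,
    }


# Six candidates for two requested results mirrors the 3x over-fetch default.
candidates = [
    make_row(10, "gbif", "iNaturalist", True, False),
    make_row(11, "eol", None, True, True),
    make_row(12, "gbif", "iNaturalist", False, True),
    make_row(13, "bioscan", None, True, True),
    make_row(14, "gbif", "observation.org", True, True),
    make_row(15, "fathomnet", None, True, False),
]

training_hits = filter_by_scope(candidates, "BioCLIP 2 Training", top_n=2)
inat_hits = filter_by_scope(candidates, "iNaturalist Only", top_n=2)
print([row["id"] for row in training_hits])
print([row["id"] for row in inat_hits])
```

Because most rows pass these filters, the list comprehension discards few candidates, which is why the over-fetch factor can stay small.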
app.py CHANGED
@@ -38,7 +38,7 @@ CSS = """
38
  .app-footer a { color: #f0a030 !important; }
39
  """
40
 
41
- SCOPE_CHOICES = ["All Sources", "URL-Available Only", "iNaturalist Only"]
42
 
43
 
44
  def _image_hash(img: Image.Image) -> str:
 
38
  .app-footer a { color: #f0a030 !important; }
39
  """
40
 
41
+ SCOPE_CHOICES = ["All Sources", "URL-Available Only", "iNaturalist Only", "BioCLIP 2 Training"]
42
 
43
 
44
  def _image_hash(img: Image.Image) -> str:
docs/hf-data-card-README.md CHANGED
@@ -2,7 +2,7 @@
2
  license: cc0-1.0
3
  language:
4
  - en
5
- pretty_name: BioCLIP Image Search Lite
6
  task_categories:
7
  - image-feature-extraction
8
  tags:
@@ -55,7 +55,7 @@ The **FAISS index** enables sub-second approximate nearest-neighbor search over
55
 
56
  ### Dataset Description
57
 
58
- - **Curated by:** Net Zhang, Sreejith Menon, Elizabeth Campolongo, Matthew Thompson, Arnab Nandi, Hilmar Lapp, Jianyang Gu <!-- TODO: confirm full author list -->
59
  - **Demo:** [BioCLIP Image Search Lite Space](https://huggingface.co/spaces/imageomics/bioclip-image-search-lite)
60
  - **Repository:** [Imageomics/bioclip-image-search-lite](https://github.com/Imageomics/bioclip-image-search-lite)
61
  - **Paper:** [BioCLIP 2: Emergent Properties from Scaling Hierarchical Contrastive Learning](https://arxiv.org/abs/2505.23883)
@@ -88,7 +88,7 @@ imageomics/bioclip-image-search-lite/
88
  faiss/
89
  index.index # FAISS IVF+PQ index (~5.8 GB, ~200M vectors)
90
  duckdb/
91
- metadata.duckdb # DuckDB metadata database (~27 GB, 234M rows)
92
  ```
93
 
94
  ### FAISS Index
@@ -127,8 +127,48 @@ imageomics/bioclip-image-search-lite/
127
  | `source_id` | `VARCHAR` | Unique identifier from source (e.g., GBIF `gbifID`, EOL content/page ID). |
128
  | `publisher` | `VARCHAR` | Organization that published the data (GBIF records only, e.g., `iNaturalist`). |
129
  | `img_type` | `VARCHAR` | Image type (e.g., `Citizen Science`, `Museum Specimen: Fungi`, `Camera-trap`). GBIF only; others are `Unidentified`. |
130
- | `identifier` | `VARCHAR` | URL to the original image, or `NULL` if unavailable. Corresponds to `source_url` in TreeOfLife-200M catalog. |
131
- | `has_url` | `BOOLEAN` | Materialized flag: `TRUE` if `identifier` is not null/empty. Used for scope filtering. |
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
132
 
133
  **Column name mapping from [TreeOfLife-200M](https://huggingface.co/datasets/imageomics/TreeOfLife-200M) catalog:**
134
 
@@ -137,8 +177,10 @@ imageomics/bioclip-image-search-lite/
137
  | `id` | — | New; FAISS vector position index |
138
  | `common_name` | `common` | Renamed |
139
  | `source_dataset` | `data_source` | Renamed |
140
- | `identifier` | `source_url` | Renamed |
 
141
  | `has_url` | — | Derived; materialized boolean |
 
142
  | All others | Same name | Direct mapping |
143
 
144
  **Columns from TreeOfLife-200M catalog not included:** `scientific_name`, `basis_of_record`, `shard_filename`, `shard_file_path`, `base_dataset_file_path`, `resolution_status`.
@@ -147,16 +189,19 @@ For more background on these columns, please see the [data field descriptions fr
147
 
148
  **Indexes:**
149
  - `idx_id` on `id` (primary lookup for FAISS result mapping)
150
- - `idx_scope` on `(source_dataset, has_url)` (scope filtering)
151
 
152
  **Data coverage:**
153
 
154
  | Scope | Count | Percentage |
155
  |-------|-------|------------|
156
  | Total rows | 234,391,308 | 100% |
157
- | With URL (`has_url = TRUE`) | ~207M | 88.4% |
158
- | iNaturalist (`source_dataset = 'gbif' AND publisher = 'iNaturalist'`) | ~136M | 58% |
159
- | Without URL | ~27M | 11.6% |
 
 
 
160
 
161
  ### Data Splits
162
 
@@ -252,7 +297,7 @@ for _, row in results.iterrows():
252
  The full [BioCLIP Vector DB](https://github.com/Imageomics/bioclip-vector-db) stores 234M images totaling ~92 TB — far too large for lightweight deployment. [BioCLIP Image Search Lite](https://huggingface.co/spaces/imageomics/bioclip-image-search-lite) was created to make the similarity search capability accessible on constrained infrastructure (e.g., Hugging Face Spaces free tier: 2 vCPU, 16 GB RAM, 50 GB disk) by:
253
 
254
  1. Replacing local image storage with on-demand URL fetching from publicly accessible external sources (primarily [iNaturalist AWS Open Data](https://github.com/inaturalist/inaturalist-open-data) S3).
255
- 2. Compressing the metadata from an 80 GB SQLite database to a ~27 GB DuckDB database (optimized via columnar storage and compression).
256
  3. Packaging the FAISS index (~5.8 GB) and DuckDB metadata as the only deployment artifacts.
257
 
258
  This approach trades occasional missing thumbnails (when source URLs are unavailable) for a >1000x reduction in storage requirements. See [Imageomics/bioclip-vector-db#47](https://github.com/Imageomics/bioclip-vector-db/issues/47#issuecomment-3927846723) for the full design rationale.
@@ -269,7 +314,7 @@ These URLs are **reasonably persistent but not guaranteed stable**:
269
  - **AWS sponsorship is renewable.** The AWS Open Data Sponsorship runs on a [2-year renewable term](https://aws.amazon.com/opendata/open-data-sponsorship-program/terms/) with no uptime SLA.
270
  - **No explicit S3 rate limit.** The iNaturalist [API Recommended Practices](https://www.inaturalist.org/pages/api+recommended+practices) recommend <5 GB/hour and <24 GB/day for media downloads, though it is unclear whether this applies to direct S3 access. The [BioCLIP Image Search Lite application](https://github.com/Imageomics/bioclip-image-search-lite) respects these limits regardless.
271
 
272
- The remaining URLs point to other biodiversity platforms ([EOL](https://eol.org/), [BIOSCAN-5M](https://biodiversitygenomics.net/projects/5m-insects/), [FathomNet](https://www.fathomnet.org/)), each with their own availability characteristics. The ~11.6% of records without any URL are still searchable via the FAISS index but cannot display a source image.
273
 
274
  ### Source Data
275
 
@@ -297,9 +342,12 @@ The DuckDB metadata database was assembled from two sources produced by the [Bio
297
 
298
  The Lite repo merged these into a single DuckDB database ([`convert_duckdb_lite.py`](https://github.com/Imageomics/bioclip-image-search-lite/blob/main/scripts/data/convert_duckdb_lite.py)) with the following optimizations:
299
 
300
- - Added a materialized `has_url` boolean column for efficient scope filtering.
301
- - Created indexes: `idx_id` on `id` (primary FAISS lookup) and `idx_scope` on `(source_dataset, has_url)` (scope filtering).
302
- - Leveraged DuckDB's columnar storage and compression, reducing the database from ~80 GB (SQLite) to ~27 GB.
 
 
 
303
 
304
  #### Source Data Producers
305
 
@@ -322,8 +370,8 @@ This dataset does not include annotations created specifically for this reposito
322
  This dataset inherits biases and considerations from [TreeOfLife-200M](https://huggingface.co/datasets/imageomics/TreeOfLife-200M#considerations-for-using-the-data). The following are exaggerated in this instance (BioCLIP Image Search Lite) due to available image representation (those readily fetched by URL):
323
 
324
  - **Taxonomic coverage is uneven.** Despite including 952K+ unique taxa, coverage is heavily biased toward well-photographed organisms. Citizen science observations (primarily iNaturalist) comprise ~58% of the data, skewing representation toward charismatic species and regions where citizen science is most active (Western/developed countries).
325
- - **Incomplete taxonomic labels.** As inherited from TreeOfLife-200M, only ~89% of records have full species-level taxonomy. ~11% lack complete labels due to biodiversity data complexities (`NULL` values at lower ranks).
326
- - **URL availability is not guaranteed.** ~11.6% of records have no source URL. For records with URLs, images may become unavailable over time due to URL rot, server changes, or content removal.
327
  - **FAISS approximation.** The IVF+PQ index trades exactness for speed. Results are approximate nearest neighbors — some true nearest neighbors may be missed depending on the `nprobe` setting. Higher `nprobe` values improve recall at the cost of latency.
328
  - **Embedding bias.** Similarity is determined by BioCLIP 2 embeddings, which may encode biases from the training data.
329
 
@@ -346,7 +394,6 @@ We ask that you cite this dataset and associated papers if you make use of it in
346
 
347
  ## Citation
348
 
349
- <!-- TODO: confirm full author list and add DOI once generated -->
350
  **Data:**
351
  ```bibtex
352
  @misc{zhang2026biocliplite,
 
2
  license: cc0-1.0
3
  language:
4
  - en
5
+ pretty_name: BioCLIP Image Search Lite FAISS Index
6
  task_categories:
7
  - image-feature-extraction
8
  tags:
 
55
 
56
  ### Dataset Description
57
 
58
+ - **Curated by:** Net Zhang, Sreejith Menon, Elizabeth Campolongo, Matthew Thompson, Arnab Nandi, Hilmar Lapp, Jianyang Gu
59
  - **Demo:** [BioCLIP Image Search Lite Space](https://huggingface.co/spaces/imageomics/bioclip-image-search-lite)
60
  - **Repository:** [Imageomics/bioclip-image-search-lite](https://github.com/Imageomics/bioclip-image-search-lite)
61
  - **Paper:** [BioCLIP 2: Emergent Properties from Scaling Hierarchical Contrastive Learning](https://arxiv.org/abs/2505.23883)
 
88
  faiss/
89
  index.index # FAISS IVF+PQ index (~5.8 GB, ~200M vectors)
90
  duckdb/
91
+ metadata.duckdb # DuckDB metadata database (~14 GB optimized, 234M rows)
92
  ```
93
 
94
  ### FAISS Index
 
127
  | `source_id` | `VARCHAR` | Unique identifier from source (e.g., GBIF `gbifID`, EOL content/page ID). |
128
  | `publisher` | `VARCHAR` | Organization that published the data (GBIF records only, e.g., `iNaturalist`). |
129
  | `img_type` | `VARCHAR` | Image type (e.g., `Citizen Science`, `Museum Specimen: Fungi`, `Camera-trap`). GBIF only; others are `Unidentified`. |
130
+ | `url_prefix_id` | `USMALLINT` | Foreign key into the `url_prefixes` lookup table. Together with `identifier_suffix`, reconstructs the full image URL as `<prefix><suffix>`. See [URL reconstruction](#url-reconstruction) below. |
131
+ | `identifier_suffix` | `VARCHAR` | Path portion of the image URL (always starts with `/`, e.g., `/photos/12345/original.jpg`). `NULL` if no URL is available. |
132
+ | `has_url` | `BOOLEAN` | Materialized flag: `TRUE` if a URL is available. Used for scope filtering. |
133
+ | `in_bioclip2_training` | `BOOLEAN` | `TRUE` if the record's UUID appears in the BioCLIP 2 training data — TreeOfLife-200M (Revision [a8f38b4](https://huggingface.co/datasets/imageomics/TreeOfLife-200M/tree/a8f38b4388579862c56ae57d6f094c2ac0e92e12)). |
134
+
135
+ **Table:** `url_prefixes` — 411 rows
136
+
137
+ | Column | Type | Description |
138
+ |--------|------|-------------|
139
+ | `prefix_id` | `USMALLINT` | Primary key. |
140
+ | `prefix` | `VARCHAR` | URL domain prefix (e.g., `https://inaturalist-open-data.s3.amazonaws.com`). Does not include a trailing `/`. |
141
+
142
+ #### URL reconstruction
143
+
144
+ The original `identifier` (full image URL) column from TreeOfLife-200M is split into a shared domain prefix and a per-row path suffix to reduce storage overhead. To reconstruct the full URL:
145
+
146
+ ```sql
147
+ SELECT p.prefix || m.identifier_suffix AS url
148
+ FROM metadata m
149
+ JOIN url_prefixes p ON m.url_prefix_id = p.prefix_id
150
+ WHERE m.identifier_suffix IS NOT NULL
151
+ ```
152
+
153
+ ```python
154
+ import duckdb
155
+
156
+ conn = duckdb.connect("metadata.duckdb", read_only=True)
157
+
158
+ # Load prefix lookup table into a dict
159
+ prefixes = dict(conn.execute("SELECT prefix_id, prefix FROM url_prefixes").fetchall())
160
+
161
+ # Query metadata and reconstruct URLs
162
+ rows = conn.execute("SELECT url_prefix_id, identifier_suffix FROM metadata LIMIT 5").fetchall()
163
+ for prefix_id, suffix in rows:
164
+ url = prefixes.get(prefix_id, "") + (suffix or "")
165
+ print(url)
166
+ # https://inaturalist-open-data.s3.amazonaws.com/photos/12345/original.jpg
167
+ # https://content.eol.org/data/media/17/a6/537.jpg
168
+ # ...
169
+ ```
170
+
171
+ Prefixes are bare domains (e.g., `https://content.eol.org`) and suffixes always start with `/` (e.g., `/data/media/17/a6/537.jpg`), so simple concatenation produces a valid URL. This split saves ~40% storage compared to storing the full URL per row.
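For the opposite direction, the sketch below shows one way a full URL could be split into a bare-domain prefix and a `/`-leading suffix, and how a prefix lookup table could be built. It assumes the split happens at the scheme-plus-host boundary, as the description above implies; `split_url` and `build_prefix_table` are hypothetical helpers, not functions from `optimize_duckdb.py`.

```python
from typing import Dict, Iterable, List, Tuple
from urllib.parse import urlsplit


def split_url(url: str) -> Tuple[str, str]:
    """Split a URL into (scheme://host prefix, remainder starting with '/')."""
    parts = urlsplit(url)
    prefix = f"{parts.scheme}://{parts.netloc}"
    # Slice rather than rebuild so any query string stays in the suffix.
    suffix = url[len(prefix):]
    return prefix, suffix


def build_prefix_table(
    urls: Iterable[str],
) -> Tuple[Dict[str, int], List[Tuple[int, str]]]:
    """Assign each distinct prefix a small integer ID and emit (id, suffix) rows."""
    prefix_ids: Dict[str, int] = {}
    rows: List[Tuple[int, str]] = []
    for url in urls:
        prefix, suffix = split_url(url)
        prefix_id = prefix_ids.setdefault(prefix, len(prefix_ids))
        rows.append((prefix_id, suffix))
    return prefix_ids, rows


urls = [
    "https://inaturalist-open-data.s3.amazonaws.com/photos/12345/original.jpg",
    "https://content.eol.org/data/media/17/a6/537.jpg",
    "https://inaturalist-open-data.s3.amazonaws.com/photos/67890/original.jpg",
]
prefix_ids, rows = build_prefix_table(urls)

# Round trip: prefix + suffix must reproduce the original URL.
id_to_prefix = {prefix_id: prefix for prefix, prefix_id in prefix_ids.items()}
rebuilt = [id_to_prefix[prefix_id] + suffix for prefix_id, suffix in rows]
print(rebuilt == urls)
```

Two of the three sample URLs share a prefix, so the table holds two entries while each row stores only a small integer and the path.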
172
 
173
  **Column name mapping from [TreeOfLife-200M](https://huggingface.co/datasets/imageomics/TreeOfLife-200M) catalog:**
174
 
 
177
  | `id` | — | New; FAISS vector position index |
178
  | `common_name` | `common` | Renamed |
179
  | `source_dataset` | `data_source` | Renamed |
180
+ | `url_prefix_id` | `source_url` | Split from `source_url`; foreign key to `url_prefixes` |
181
+ | `identifier_suffix` | `source_url` | Split from `source_url`; path portion of URL |
182
  | `has_url` | — | Derived; materialized boolean |
183
+ | `in_bioclip2_training` | — | Derived; matched against [training catalog revision `a8f38b4`](https://huggingface.co/datasets/imageomics/TreeOfLife-200M/blob/a8f38b4388579862c56ae57d6f094c2ac0e92e12/dataset/catalog.parquet) |
184
  | All others | Same name | Direct mapping |
185
 
186
  **Columns from TreeOfLife-200M catalog not included:** `scientific_name`, `basis_of_record`, `shard_filename`, `shard_file_path`, `base_dataset_file_path`, `resolution_status`.
 
189
 
190
  **Indexes:**
191
  - `idx_id` on `id` (primary lookup for FAISS result mapping)
192
+ - `idx_scope` on `(source_dataset, has_url, in_bioclip2_training)` (scope filtering)
193
 
194
  **Data coverage:**
195
 
196
  | Scope | Count | Percentage |
197
  |-------|-------|------------|
198
  | Total rows | 234,391,308 | 100% |
199
+ | With URL (`has_url = TRUE`) | ~234M | 99.99% |
200
+ | iNaturalist (`source_dataset = 'gbif' AND publisher LIKE '%iNaturalist%'`) | ~136M | 58% |
201
+ | In BioCLIP 2 training (`in_bioclip2_training = TRUE`) | ~206M | 87.9% |
202
+ | With taxonomy (`kingdom IS NOT NULL`) | ~228M | 97.2% |
203
+
204
+ > **Note on `in_bioclip2_training`:** This column identifies records whose UUID matches the BioCLIP 2 training catalog from [TreeOfLife-200M revision `a8f38b4`](https://huggingface.co/datasets/imageomics/TreeOfLife-200M/tree/a8f38b4388579862c56ae57d6f094c2ac0e92e12). The original BioCLIP 2 training set contained ~214M images. Of these, ~206M match records in the search corpus. The remaining ~8M were excluded from the FAISS index because they were identified as invalid after training (e.g., document scans, specimen labels, images with detected human faces) and removed during a post-training data cleanup before the embeddings were generated.
205
 
206
  ### Data Splits
207
 
 
297
  The full [BioCLIP Vector DB](https://github.com/Imageomics/bioclip-vector-db) stores 234M images totaling ~92 TB — far too large for lightweight deployment. [BioCLIP Image Search Lite](https://huggingface.co/spaces/imageomics/bioclip-image-search-lite) was created to make the similarity search capability accessible on constrained infrastructure (e.g., Hugging Face Spaces free tier: 2 vCPU, 16 GB RAM, 50 GB disk) by:
298
 
299
  1. Replacing local image storage with on-demand URL fetching from publicly accessible external sources (primarily [iNaturalist AWS Open Data](https://github.com/inaturalist/inaturalist-open-data) S3).
300
+ 2. Compressing the metadata from an 80 GB SQLite database to a ~14 GB DuckDB database (optimized via ENUM types, URL prefix deduplication, taxonomy sorting, and columnar compression).
301
  3. Packaging the FAISS index (~5.8 GB) and DuckDB metadata as the only deployment artifacts.
302
 
303
  This approach trades occasional missing thumbnails (when source URLs are unavailable) for a >1000x reduction in storage requirements. See [Imageomics/bioclip-vector-db#47](https://github.com/Imageomics/bioclip-vector-db/issues/47#issuecomment-3927846723) for the full design rationale.
 
314
  - **AWS sponsorship is renewable.** The AWS Open Data Sponsorship runs on a [2-year renewable term](https://aws.amazon.com/opendata/open-data-sponsorship-program/terms/) with no uptime SLA.
315
  - **No explicit S3 rate limit.** The iNaturalist [API Recommended Practices](https://www.inaturalist.org/pages/api+recommended+practices) recommend <5 GB/hour and <24 GB/day for media downloads, though it is unclear whether this applies to direct S3 access. The [BioCLIP Image Search Lite application](https://github.com/Imageomics/bioclip-image-search-lite) respects these limits regardless.
316
 
317
+ The remaining URLs point to other biodiversity platforms ([EOL](https://eol.org/), [BIOSCAN-5M](https://biodiversitygenomics.net/projects/5m-insects/), [FathomNet](https://www.fathomnet.org/)), each with their own availability characteristics.
318
 
319
  ### Source Data
320
 
 
342
 
343
  The Lite repo merged these into a single DuckDB database ([`convert_duckdb_lite.py`](https://github.com/Imageomics/bioclip-image-search-lite/blob/main/scripts/data/convert_duckdb_lite.py)) with the following optimizations:
344
 
345
+ - Added materialized boolean columns `has_url` and `in_bioclip2_training` for scope filtering.
346
+ - Created indexes: `idx_id` on `id` (primary FAISS lookup) and `idx_scope` on `(source_dataset, has_url, in_bioclip2_training)`.
347
+ - Applied ENUM types for low-cardinality columns, URL prefix deduplication, and taxonomy-based row sorting for better compression.
348
+ - Leveraged DuckDB's columnar storage and compression, reducing the database from ~80 GB (SQLite) to ~14 GB.
349
+
350
+ **Metadata backfill (March 2026):** 28.3M rows (12.1%) originally had NULL metadata because the entire `observation.org` GBIF server (27.2M rows) was missing from the metadata parquets used during ingestion. Taxonomy was recovered for ~21.7M rows from the resolved taxa pipeline, and source URLs were recovered for all 27.2M rows from the GBIF data parquets. An additional 1.1M EOL rows with failed taxonomy resolution had their `source_dataset` and `source_id` recovered. UUIDs were also normalized from mixed formats (non-hyphenated for observation.org rows) to a consistent hyphenated format. After backfill, only 2,973 rows remain with NULL `source_dataset`.
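The UUID normalization mentioned in the backfill note can be sketched with Python's standard `uuid` module, which parses both hyphenated and non-hyphenated hex strings. `normalize_uuid` is a hypothetical helper and the sample values are made up; the pipeline's actual normalization code is not shown in this diff.

```python
import uuid


def normalize_uuid(value: str) -> str:
    """Return the canonical lowercase, hyphenated form of a UUID string."""
    # uuid.UUID accepts 32 hex digits with or without hyphens.
    return str(uuid.UUID(value.strip()))


mixed = [
    "0123456789abcdef0123456789abcdef",      # non-hyphenated form
    "01234567-89AB-CDEF-0123-456789ABCDEF",  # hyphenated, upper case
]
normalized = [normalize_uuid(value) for value in mixed]
print(normalized)
```

Both inputs map to the same canonical string, so a join against the training catalog on `uuid` would no longer miss rows because of formatting differences.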
351
 
352
  #### Source Data Producers
353
 
 
370
  This dataset inherits biases and considerations from [TreeOfLife-200M](https://huggingface.co/datasets/imageomics/TreeOfLife-200M#considerations-for-using-the-data). The following are exaggerated in this instance (BioCLIP Image Search Lite) due to available image representation (those readily fetched by URL):
371
 
372
  - **Taxonomic coverage is uneven.** Despite including 952K+ unique taxa, coverage is heavily biased toward well-photographed organisms. Citizen science observations (primarily iNaturalist) comprise ~58% of the data, skewing representation toward charismatic species and regions where citizen science is most active (Western/developed countries).
373
+ - **Incomplete taxonomic labels.** As inherited from TreeOfLife-200M, ~97% of records now have kingdom-level taxonomy after the March 2026 backfill. The remaining ~3% lack complete labels due to biodiversity data complexities (`NULL` values at lower ranks).
374
+ - **URL availability is not guaranteed.** After the metadata backfill, nearly all records (99.99%) have source URLs, though images may become unavailable over time due to URL rot, server changes, or content removal.
375
  - **FAISS approximation.** The IVF+PQ index trades exactness for speed. Results are approximate nearest neighbors — some true nearest neighbors may be missed depending on the `nprobe` setting. Higher `nprobe` values improve recall at the cost of latency.
376
  - **Embedding bias.** Similarity is determined by BioCLIP 2 embeddings, which may encode biases from the training data.
377
 
 
394
 
395
  ## Citation
396
 
 
397
  **Data:**
398
  ```bibtex
399
  @misc{zhang2026biocliplite,
scripts/data/README.md ADDED
@@ -0,0 +1,65 @@
+ # Data Pipeline
+
+ Two-stage pipeline to build the optimized DuckDB metadata database from source.
+
+ ## Pipeline
+
+ ```
+ Source (SQLite or DuckDB)
+ → convert_duckdb_lite.py # Stage 1: import, add has_url + in_bioclip2_training
+ → optimize_duckdb.py # Stage 2: ENUM types, URL split, sort, index
+ → validate_optimized_duckdb.py # Verify correctness
+ ```
+
+ ## Optimized Schema
+
+ **Table:** `metadata` — 234,391,308 rows
+
+ | Column | Type | Notes |
+ |--------|------|-------|
+ | `id` | `INTEGER` | FAISS vector index (downcast from BIGINT) |
+ | `uuid` | `UUID` | Native 16-byte UUID (normalized hyphenated format) |
+ | `kingdom`..`family` | `ENUM` | Low-cardinality taxonomy columns as ENUM types |
+ | `genus`, `species` | `VARCHAR` | Too many distinct values for ENUM |
+ | `common_name` | `VARCHAR` | |
+ | `source_dataset` | `ENUM` | `gbif`, `eol`, `bioscan`, `fathomnet` |
+ | `publisher` | `ENUM` | GBIF publisher (e.g., `iNaturalist`, `observation.org`) |
+ | `img_type` | `ENUM` | Image type category |
+ | `basisOfRecord` | `ENUM` | GBIF basis of record |
+ | `source_id` | `VARCHAR` | Source-specific identifier |
+ | `url_prefix_id` | `USMALLINT` | FK to `url_prefixes` table |
+ | `identifier_suffix` | `VARCHAR` | URL path after domain prefix |
+ | `has_url` | `BOOLEAN` | `TRUE` if image URL available |
+ | `in_bioclip2_training` | `BOOLEAN` | `TRUE` if UUID in BioCLIP 2 training catalog |
+
+ **Indexes:** `idx_id(id)`, `idx_scope(source_dataset, has_url, in_bioclip2_training)`
+
+ ## Optimizations Applied
+
+ 1. **ENUM types** — Low-cardinality columns (`kingdom`, `phylum`, `class`, `order`, `family`, `source_dataset`, `publisher`, `img_type`, `basisOfRecord`) stored as ENUM for ~10x compression.
+ 2. **URL prefix deduplication** — `identifier` split into a shared prefix table (`url_prefixes`) + per-row suffix, eliminating repeated domain strings.
+ 3. **Taxonomy sort** — Rows sorted by `source_dataset, kingdom, ..., species, common_name` for long runs of identical values and better compression.
+ 4. **Type downcasting** — `id` BIGINT→INTEGER, `uuid` VARCHAR→native UUID (16 bytes).
+ 5. **Corruption cleanup** — 44 rows with column-shift metadata corruption have taxonomy NULLed.
+
+ Result: **80 GB (SQLite) → 14 GB (optimized DuckDB)**, 57% smaller than the unoptimized DuckDB.
+
+ ## Usage
+
+ ```bash
+ # Stage 1: Import + add boolean columns
+ python scripts/data/convert_duckdb_lite.py \
+ --from-duckdb /path/to/source.duckdb \
+ --output /path/to/base.duckdb \
+ --catalog-parquet /path/to/training/catalog.parquet
+
+ # Stage 2: Optimize
+ python scripts/data/optimize_duckdb.py \
+ --source /path/to/base.duckdb \
+ --output /path/to/metadata_optimized.duckdb
+
+ # Validate
+ python scripts/data/validate_optimized_duckdb.py \
+ --source /path/to/base.duckdb \
+ --optimized /path/to/metadata_optimized.duckdb
+ ```
scripts/data/convert_duckdb_lite.py CHANGED
@@ -1,13 +1,20 @@
1
- """Convert SQLite metadata to optimized DuckDB for BioCLIP Lite.
2
 
3
- Copies the existing research DuckDB and adds Lite-specific enhancements:
4
- 1. Materialized has_url BOOLEAN column
5
- 2. Compound index on (source_dataset, has_url) for scope filtering
6
- 3. URL coverage validation
 
7
 
8
  Usage:
9
-     python scripts/data/convert_duckdb_lite.py --from-duckdb SOURCE --output OUT
      python scripts/data/convert_duckdb_lite.py --from-sqlite SOURCE --output OUT
  """

  import argparse
@@ -20,7 +27,7 @@ import duckdb
  EXPECTED_ROW_COUNT = 234_391_308


- def convert_from_sqlite(sqlite_path: str, output_path: str):
      """Full conversion from the 80 GB SQLite source."""
      print(f"Converting from SQLite: {sqlite_path}")
      print(f"Output: {output_path}")
@@ -43,13 +50,17 @@ def convert_from_sqlite(sqlite_path: str, output_path: str):
      print("Creating index on id...")
      conn.execute("CREATE INDEX idx_id ON metadata (id)")

-     _add_lite_enhancements(conn)
      _validate(conn, output_path)
      conn.close()


- def convert_from_existing_duckdb(source_path: str, output_path: str):
-     """Copy existing research DuckDB and add Lite-specific enhancements."""
      print(f"Copying from: {source_path}")
      print(f" to: {output_path}")

@@ -60,17 +71,18 @@ def convert_from_existing_duckdb(source_path: str, output_path: str):
      print(f"Copy complete ({os.path.getsize(output_path) / 1024**3:.1f} GB)")

      conn = duckdb.connect(output_path)
-     _add_lite_enhancements(conn)
      _validate(conn, output_path)
      conn.close()


- def _add_lite_enhancements(conn: duckdb.DuckDBPyConnection):
-     """Add has_url column and compound index for scope filtering."""
-     # Check if has_url already exists
      cols = [r[0] for r in conn.execute("DESCRIBE metadata").fetchall()]
      if "has_url" in cols:
-         print("has_url column already exists, skipping ALTER")
      else:
          print("Adding has_url column...")
          t0 = time.time()
@@ -81,22 +93,37 @@ def _add_lite_enhancements(conn: duckdb.DuckDBPyConnection):
          )
          print(f"has_url column populated in {time.time() - t0:.0f}s")

-     # Compound index for scope queries
-     existing_indexes = [
-         r[0] for r in conn.execute(
-             "SELECT index_name FROM duckdb_indexes()"
-         ).fetchall()
-     ]
-     if "idx_scope" not in existing_indexes:
-         print("Creating compound index idx_scope(source_dataset, has_url)...")
-         t0 = time.time()
-         conn.execute(
-             "CREATE INDEX idx_scope ON metadata (source_dataset, has_url)"
-         )
-         print(f"Index created in {time.time() - t0:.0f}s")
-     else:
-         print("idx_scope already exists, skipping")



  def _validate(conn: duckdb.DuckDBPyConnection, output_path: str):
@@ -119,15 +146,27 @@ def _validate(conn: duckdb.DuckDBPyConnection, output_path: str):
      print(f"iNaturalist: {inat_count:>15,} ({inat_count/total*100:.1f}%)")
      print(f"Without URL: {total - with_url:>15,} ({(total-with_url)/total*100:.1f}%)")

      if total != EXPECTED_ROW_COUNT:
          print(f"WARNING: Expected {EXPECTED_ROW_COUNT:,} rows, got {total:,}")

      size_gb = os.path.getsize(output_path) / 1024**3
      print(f"DuckDB size: {size_gb:.1f} GB")


  def main():
-     parser = argparse.ArgumentParser(description="DuckDB Lite conversion")
      group = parser.add_mutually_exclusive_group(required=True)
      group.add_argument(
          "--from-sqlite", type=str, metavar="PATH",
@@ -135,20 +174,24 @@ def main():
      )
      group.add_argument(
          "--from-duckdb", type=str,
-         help="Copy from existing DuckDB and add Lite enhancements"
      )
      parser.add_argument(
          "--output", type=str, required=True,
-         help="Output DuckDB path"
      )
      args = parser.parse_args()

      os.makedirs(os.path.dirname(args.output), exist_ok=True)

      if args.from_sqlite:
-         convert_from_sqlite(args.from_sqlite, args.output)
      else:
-         convert_from_existing_duckdb(args.from_duckdb, args.output)

      print("\nDone.")

+ """Convert source metadata to optimized DuckDB for BioCLIP Lite.

+ Two-stage pipeline:
+   Stage 1 (this script): Import raw metadata from SQLite or DuckDB source,
+     add has_url column, and create a base DuckDB.
+   Stage 2 (optimize_duckdb.py): Apply size optimizations (ENUM types, taxonomy
+     sort, URL prefix split, type downcasting, corruption cleanup).

  Usage:
+     # From SQLite source (slow, ~1-2 hours):
      python scripts/data/convert_duckdb_lite.py --from-sqlite SOURCE --output OUT
+
+     # From existing research DuckDB:
+     python scripts/data/convert_duckdb_lite.py --from-duckdb SOURCE --output OUT
+
+     # Then optimize:
+     python scripts/data/optimize_duckdb.py --source OUT --output OPTIMIZED
  """

  import argparse
@@ -20,7 +27,7 @@ import duckdb
  EXPECTED_ROW_COUNT = 234_391_308


+ def convert_from_sqlite(sqlite_path: str, output_path: str, catalog_parquet: str = None):
      """Full conversion from the 80 GB SQLite source."""
      print(f"Converting from SQLite: {sqlite_path}")
      print(f"Output: {output_path}")
@@ -43,13 +50,17 @@ def convert_from_sqlite(sqlite_path: str, output_path: str):
      print("Creating index on id...")
      conn.execute("CREATE INDEX idx_id ON metadata (id)")

+     _add_has_url(conn)
+     if catalog_parquet:
+         _add_in_bioclip2_training(conn, catalog_parquet)
      _validate(conn, output_path)
      conn.close()


+ def convert_from_existing_duckdb(
+     source_path: str, output_path: str, catalog_parquet: str = None
+ ):
+     """Copy existing research DuckDB and add has_url if missing."""
      print(f"Copying from: {source_path}")
      print(f" to: {output_path}")

@@ -60,17 +71,18 @@ def convert_from_existing_duckdb(source_path: str, output_path: str):
      print(f"Copy complete ({os.path.getsize(output_path) / 1024**3:.1f} GB)")

      conn = duckdb.connect(output_path)
+     _add_has_url(conn)
+     if catalog_parquet:
+         _add_in_bioclip2_training(conn, catalog_parquet)
      _validate(conn, output_path)
      conn.close()


+ def _add_has_url(conn: duckdb.DuckDBPyConnection):
+     """Add has_url BOOLEAN column if not present."""
      cols = [r[0] for r in conn.execute("DESCRIBE metadata").fetchall()]
      if "has_url" in cols:
+         print("has_url column already exists, skipping")
      else:
          print("Adding has_url column...")
          t0 = time.time()
@@ -81,22 +93,37 @@ def _add_lite_enhancements(conn: duckdb.DuckDBPyConnection):
          )
          print(f"has_url column populated in {time.time() - t0:.0f}s")


+ def _add_in_bioclip2_training(conn: duckdb.DuckDBPyConnection, catalog_parquet: str):
+     """Add in_bioclip2_training BOOLEAN column from training catalog.
+
+     Marks rows whose UUID appears in the BioCLIP 2 training catalog.
+     """
+     cols = [r[0] for r in conn.execute("DESCRIBE metadata").fetchall()]
+     if "in_bioclip2_training" in cols:
+         print("in_bioclip2_training column already exists, skipping")
+         return
+
+     print("Adding in_bioclip2_training column...")
+     t0 = time.time()
+     conn.execute("ALTER TABLE metadata ADD COLUMN in_bioclip2_training BOOLEAN DEFAULT false")
+
+     # The catalog uses UUID column; join on normalized UUID
+     conn.execute(f"""
+         UPDATE metadata m SET in_bioclip2_training = true
+         FROM (
+             SELECT DISTINCT uuid FROM read_parquet('{catalog_parquet}')
+         ) c
+         WHERE CAST(m.uuid AS VARCHAR) = CAST(c.uuid AS VARCHAR)
+     """)
+
+     total = conn.execute("SELECT COUNT(*) FROM metadata").fetchone()[0]
+     matched = conn.execute(
+         "SELECT COUNT(*) FROM metadata WHERE in_bioclip2_training = true"
+     ).fetchone()[0]
+     elapsed = time.time() - t0
+     print(f"in_bioclip2_training populated in {elapsed:.0f}s")
+     print(f" Matched: {matched:,} / {total:,} ({matched/total*100:.1f}%)")


  def _validate(conn: duckdb.DuckDBPyConnection, output_path: str):
@@ -119,15 +146,27 @@ def _validate(conn: duckdb.DuckDBPyConnection, output_path: str):
      print(f"iNaturalist: {inat_count:>15,} ({inat_count/total*100:.1f}%)")
      print(f"Without URL: {total - with_url:>15,} ({(total-with_url)/total*100:.1f}%)")

+     # Check for in_bioclip2_training column
+     cols = [r[0] for r in conn.execute("DESCRIBE metadata").fetchall()]
+     if "in_bioclip2_training" in cols:
+         training_count = conn.execute(
+             "SELECT COUNT(*) FROM metadata WHERE in_bioclip2_training = true"
+         ).fetchone()[0]
+         print(f"In training: {training_count:>15,} ({training_count/total*100:.1f}%)")
+
      if total != EXPECTED_ROW_COUNT:
          print(f"WARNING: Expected {EXPECTED_ROW_COUNT:,} rows, got {total:,}")

      size_gb = os.path.getsize(output_path) / 1024**3
      print(f"DuckDB size: {size_gb:.1f} GB")
+     print(f"\nNext step: run optimize_duckdb.py --source {output_path} --output <optimized.duckdb>")


  def main():
+     parser = argparse.ArgumentParser(
+         description="Stage 1: Import metadata into base DuckDB. "
+         "Run optimize_duckdb.py afterward for size optimization."
+     )
      group = parser.add_mutually_exclusive_group(required=True)
      group.add_argument(
          "--from-sqlite", type=str, metavar="PATH",
@@ -135,20 +174,24 @@ def main():
      )
      group.add_argument(
          "--from-duckdb", type=str,
+         help="Copy from existing DuckDB and add has_url column"
      )
      parser.add_argument(
          "--output", type=str, required=True,
+         help="Output DuckDB path (base DB, not yet optimized)"
+     )
+     parser.add_argument(
+         "--catalog-parquet", type=str, default=None,
+         help="Path to BioCLIP 2 training catalog parquet (adds in_bioclip2_training column)"
      )
      args = parser.parse_args()

      os.makedirs(os.path.dirname(args.output), exist_ok=True)

      if args.from_sqlite:
+         convert_from_sqlite(args.from_sqlite, args.output, args.catalog_parquet)
      else:
+         convert_from_existing_duckdb(args.from_duckdb, args.output, args.catalog_parquet)

      print("\nDone.")
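
One edge case in `main()` of `convert_duckdb_lite.py`: `os.path.dirname(args.output)` is an empty string when `--output` is a bare filename, and `os.makedirs("")` raises `FileNotFoundError`. `optimize_duckdb.py` in this same commit wraps the path in `os.path.abspath` first. A minimal sketch of that guard, where `ensure_parent_dir` is a hypothetical helper name and not part of the commit:

```python
import os
import tempfile


def ensure_parent_dir(path: str) -> str:
    """Create the parent directory of `path` if needed and return it.

    Hypothetical helper: resolving to an absolute path first means a bare
    filename such as "out.duckdb" maps to the current working directory
    instead of the empty string that os.path.dirname would return.
    """
    parent = os.path.dirname(os.path.abspath(path))
    os.makedirs(parent, exist_ok=True)
    return parent


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        nested = os.path.join(tmp, "a", "b", "metadata.duckdb")
        print(ensure_parent_dir(nested))         # creates <tmp>/a/b
    print(ensure_parent_dir("metadata.duckdb"))  # current working directory
```

With this guard, `--output metadata.duckdb` would behave the same as `--output ./metadata.duckdb`.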
 
scripts/data/convert_duckdb_lite.slurm CHANGED
@@ -23,8 +23,17 @@ echo ""

  source "$VENV/bin/activate"

- python "$REPO_ROOT/scripts/data/convert_duckdb_lite.py" \
+ BASE_DB=$(mktemp "$DATA_DIR/metadata_base_XXXXXX.duckdb")
+ trap 'rm -f "$BASE_DB"' EXIT
+
+ # Stage 1: Import from source into temp file
+ python -u "$REPO_ROOT/scripts/data/convert_duckdb_lite.py" \
      --from-duckdb "${1:?Usage: sbatch convert_duckdb_lite.slurm <source.duckdb>}" \
+     --output "$BASE_DB"
+
+ # Stage 2: Optimize into final output
+ python -u "$REPO_ROOT/scripts/data/optimize_duckdb.py" \
+     --source "$BASE_DB" \
      --output "$DATA_DIR/metadata.duckdb"

  echo ""
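
The `mktemp` plus `trap ... EXIT` pattern in this SLURM script removes the intermediate base DB when the job exits, whether or not Stage 2 succeeds. For comparison, a rough Python equivalent of the same idea. `run_two_stage` and the two stage callables are hypothetical stand-ins for the real scripts, not part of the commit:

```python
import os
import tempfile
from typing import Callable


def run_two_stage(
    stage1: Callable[[str], None],
    stage2: Callable[[str, str], None],
    data_dir: str,
    final_name: str = "metadata.duckdb",
) -> str:
    """Run stage1 into a temp file, stage2 into the final path, then clean up."""
    # mkstemp creates the file and returns an open descriptor; close it so
    # the stages can reopen the path themselves.
    fd, base_db = tempfile.mkstemp(
        prefix="metadata_base_", suffix=".duckdb", dir=data_dir
    )
    os.close(fd)
    final_path = os.path.join(data_dir, final_name)
    try:
        stage1(base_db)              # stands in for convert_duckdb_lite.py
        stage2(base_db, final_path)  # stands in for optimize_duckdb.py
    finally:
        # Plays the role of `trap 'rm -f "$BASE_DB"' EXIT`.
        if os.path.exists(base_db):
            os.remove(base_db)
    return final_path
```

The `finally` block runs on both success and failure, so a failed optimize step does not leave a duplicate base DB behind.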
scripts/data/optimize_duckdb.py ADDED
@@ -0,0 +1,487 @@
+ """Experiment: rebuild DuckDB with size optimizations.
+
+ Optimizations applied:
+   1. Drop unused columns (scientific_name, resolution_status)
+   2. Cast id BIGINT → INTEGER, uuid VARCHAR → UUID (native 16-byte)
+   3. Sort rows by source_dataset, taxonomy (kingdom→species), common_name
+      for better compression via long runs of identical values
+   4. Split identifier URLs into prefix (domain) + suffix for dictionary compression
+   5. Cast low-cardinality VARCHAR columns to ENUM types
+
+ Usage:
+     python scripts/data/optimize_duckdb.py \
+         --source /path/to/metadata.duckdb \
+         --output /path/to/metadata_optimized.duckdb
+ """
+
+ import argparse
+ import os
+ import re
+ import time
+
+ import duckdb
+
+
+ EXPECTED_ROW_COUNT = 234_391_308
+
+ # Columns to drop (not used by the app; can be re-added later from source)
+ DROP_COLUMNS = {"scientific_name", "resolution_status"}
+
+ # Low-cardinality columns to convert to ENUM
+ # (column → max distinct values allowed; observed counts in trailing comments)
+ ENUM_CANDIDATES = {
+     "source_dataset": 5,  # 2 + NULL
+     "kingdom": 50,  # 42 (some dirty data)
+     "phylum": 200,  # 135
+     "class": 500,  # 383
+     "order": 2000,  # 1,531
+     "family": 15000,  # 13,088
+     "publisher": 600,  # 472
+     "img_type": 20,  # 13
+     "basisOfRecord": 15,  # 8
+ }
+
+
+ # Valid biological kingdom values
+ VALID_KINGDOMS = {
+     'Animalia', 'Plantae', 'Fungi', 'Chromista', 'Protozoa',
+     'Bacteria', 'Archaea', 'Viruses', 'Metazoa',
+     'Archaeplastida', 'incertae sedis',
+ }
+
+
+ def find_corrupted_ids(conn: duckdb.DuckDBPyConnection) -> set[int]:
+     """Find rows with column-shift metadata corruption.
+
+     These are GBIF records where taxonomy columns contain timestamps, UUIDs,
+     country names, boolean strings, or scientific names with authority citations
+     due to column misalignment during original ingestion.
+     """
+     placeholders = ",".join(f"'{k}'" for k in VALID_KINGDOMS)
+
+     # Rows with invalid kingdom values
+     kingdom_rows = conn.execute(f"""
+         SELECT id FROM metadata
+         WHERE kingdom IS NOT NULL AND kingdom NOT IN ({placeholders})
+     """).fetchall()
+     ids = {r[0] for r in kingdom_rows}
+
+     # Rows with valid kingdom but corrupted phylum
+     phylum_rows = conn.execute(f"""
+         SELECT id FROM metadata
+         WHERE (kingdom IS NULL OR kingdom IN ({placeholders}))
+           AND phylum IS NOT NULL
+           AND (phylum LIKE '2024-%%'
+                OR phylum IN ('true', 'false', 'US', 'bracteatum')
+                OR phylum LIKE '%%Wall.%%' OR phylum LIKE '%%Pers.%%'
+                OR phylum LIKE '%% L.' OR phylum LIKE '%%Makino%%'
+                OR phylum LIKE '%%subsp.%%' OR phylum LIKE '%%var.%%'
+                OR phylum LIKE '%%Stokes%%' OR phylum LIKE '%%Reveal%%'
+                OR phylum LIKE '%%E.Wolf%%')
+     """).fetchall()
+     ids |= {r[0] for r in phylum_rows}
+
+     # Rows with valid kingdom+phylum but corrupted class
+     class_rows = conn.execute(f"""
+         SELECT id FROM metadata
+         WHERE (kingdom IS NULL OR kingdom IN ({placeholders}))
+           AND (phylum NOT LIKE '2024-%%' OR phylum IS NULL)
+           AND class IS NOT NULL
+           AND (class LIKE '2024-%%'
+                OR class LIKE '%%INVALID%%'
+                OR class LIKE '%%MATCH%%'
+                OR (class LIKE '%% var. %%' AND class LIKE '%%.%%'))
+     """).fetchall()
+     ids |= {r[0] for r in class_rows}
+
+     return ids
+
+
+ def build_enum_types(source_conn: duckdb.DuckDBPyConnection) -> dict[str, str]:
+     """Query source DB to discover distinct values and build ENUM type DDL.
+
+     Returns a dict of column_name → enum_type_name.
+     """
+     enum_types = {}
+     for col, max_card in ENUM_CANDIDATES.items():
+         quoted = f'"{col}"' if col in ("order", "class") else col
+         rows = source_conn.execute(
+             f"SELECT DISTINCT {quoted} FROM metadata "
+             f"WHERE {quoted} IS NOT NULL "
+             f"ORDER BY {quoted}"
+         ).fetchall()
+         values = [r[0] for r in rows]
+
+         if len(values) > max_card:
+             print(f" SKIP ENUM for {col}: {len(values)} distinct > {max_card} limit")
+             continue
+
+         type_name = f"enum_{col}"
+         enum_types[col] = type_name
+         print(f" ENUM {type_name}: {len(values)} distinct values")
+
+     return enum_types
+
+
+ def build_url_prefix_table(source_conn: duckdb.DuckDBPyConnection) -> list[tuple[int, str]]:
+     """Extract top URL domain prefixes from identifier column.
+
+     Returns list of (prefix_id, prefix_string) tuples.
+     """
+     print(" Extracting URL domain prefixes...")
+     rows = source_conn.execute("""
+         SELECT
+             regexp_extract(identifier, '^(https?://[^/]+)', 1) AS domain,
+             COUNT(*) AS cnt
+         FROM metadata
+         WHERE identifier IS NOT NULL AND identifier != ''
+         GROUP BY domain
+         ORDER BY cnt DESC
+     """).fetchall()
+
+     prefixes = [(i, row[0]) for i, row in enumerate(rows) if row[0]]
+     print(f" Found {len(prefixes)} distinct URL domains")
+     for domain, cnt in rows[:10]:
+         print(f" {domain}: {cnt:,}")
+     return prefixes
+
+
+ def create_optimized_db(source_path: str, output_path: str):
+     """Rebuild the DuckDB with all optimizations."""
+     print(f"Source: {source_path} ({os.path.getsize(source_path) / 1024**3:.1f} GB)")
+     print(f"Output: {output_path}")
+
+     if os.path.exists(output_path):
+         os.remove(output_path)
+     # Also remove WAL file if present
+     wal_path = output_path + ".wal"
+     if os.path.exists(wal_path):
+         os.remove(wal_path)
+
+     # Open source read-only
+     src = duckdb.connect(source_path, read_only=True)
+     src_count = src.execute("SELECT COUNT(*) FROM metadata").fetchone()[0]
+     print(f"Source rows: {src_count:,}")
+
+     # Open destination
+     dst = duckdb.connect(output_path)
+     # Allow more memory for sorting 234M rows
+     dst.execute("SET memory_limit = '100GB'")
+     dst.execute("SET threads = 8")
+     # Attach source
+     dst.execute(f"ATTACH '{source_path}' AS src (READ_ONLY)")
+
+     # ── Step 0: Identify corrupted rows ────────────────────────────
+     print("\n=== Step 0: Identifying corrupted rows ===")
+     corrupted_ids = find_corrupted_ids(src)
+     print(f" Found {len(corrupted_ids)} rows with column-shift corruption")
+     if corrupted_ids:
+         for cid in sorted(corrupted_ids):
+             print(f" id={cid}")
+         # Register as a temp table so we can use it in the CREATE TABLE query
+         id_list = ",".join(str(i) for i in corrupted_ids)
+         dst.execute(f"CREATE TEMP TABLE corrupted_ids AS SELECT unnest([{id_list}]) AS id")
+
+     # ── Step 1: Build ENUM types ─────────────────────────────────────
+     print("\n=== Step 1: Building ENUM types ===")
+     # Exclude corrupted rows from ENUM value discovery
+     exclude_clause = ""
+     if corrupted_ids:
+         exclude_clause = f" AND id NOT IN ({id_list})"
+
+     enum_types = build_enum_types(src)
+
+     for col, type_name in enum_types.items():
+         quoted = f'"{col}"' if col in ("order", "class") else col
+         values = src.execute(
+             f"SELECT DISTINCT {quoted} FROM metadata "
+             f"WHERE {quoted} IS NOT NULL{exclude_clause} ORDER BY {quoted}"
+         ).fetchall()
+         value_list = ", ".join(f"'{v[0].replace(chr(39), chr(39)+chr(39))}'" for v in values)
+         dst.execute(f"CREATE TYPE {type_name} AS ENUM ({value_list})")
+
+     # ── Step 2: Build URL prefix lookup ──────────────────────────────
+     print("\n=== Step 2: Building URL prefix table ===")
+     prefixes = build_url_prefix_table(src)
+
+     dst.execute("""
+         CREATE TABLE url_prefixes (
+             prefix_id USMALLINT,
+             prefix VARCHAR
+         )
+     """)
+     dst.executemany(
+         "INSERT INTO url_prefixes VALUES (?, ?)",
+         prefixes
+     )
+     # Build a lookup for the SQL CASE expression
+     prefix_map = {prefix: pid for pid, prefix in prefixes}
+
+     # ── Step 3: Create optimized metadata table ──────────────────────
+     print("\n=== Step 3: Creating optimized metadata table ===")
+     print(" Sorting by source_dataset, taxonomy, common_name...")
+     print(" Splitting identifier into prefix_id + suffix...")
+
+     # Build column expressions
+     col_exprs = []
+
+     # id: BIGINT → INTEGER
+     col_exprs.append("CAST(s.id AS INTEGER) AS id")
+
+     # uuid: VARCHAR → UUID native type
+     col_exprs.append("CAST(s.uuid AS UUID) AS uuid")
+
+     # Taxonomy columns: NULL out corrupted rows, ENUM cast the rest
+     # For corrupted rows, all taxonomy + common_name are garbage from column shift
+     has_corrupt = len(corrupted_ids) > 0
+     for col in ["kingdom", "phylum", "class", "order", "family", "genus", "species"]:
+         quoted_src = f's."{col}"' if col in ("order", "class") else f"s.{col}"
+         if has_corrupt:
+             clean_expr = (
+                 f"CASE WHEN s.id IN (SELECT id FROM corrupted_ids) "
+                 f"THEN NULL ELSE {quoted_src} END"
+             )
+         else:
+             clean_expr = quoted_src
+         if col in enum_types:
+             col_exprs.append(
+                 f"TRY_CAST({clean_expr} AS {enum_types[col]}) AS \"{col}\""
+             )
+         else:
+             col_exprs.append(f"{clean_expr} AS \"{col}\"")
+
+     # common_name stays VARCHAR (177K distinct; too high for ENUM)
+     if has_corrupt:
+         col_exprs.append(
+             "CASE WHEN s.id IN (SELECT id FROM corrupted_ids) "
+             "THEN NULL ELSE s.common_name END AS common_name"
+         )
+     else:
+         col_exprs.append("s.common_name")
+
+     # source_dataset, publisher, img_type, basisOfRecord → ENUM
+     for col in ["source_dataset", "publisher", "img_type", "basisOfRecord"]:
+         if col in enum_types:
+             col_exprs.append(
+                 f"TRY_CAST(s.{col} AS {enum_types[col]}) AS {col}"
+             )
+         else:
+             col_exprs.append(f"s.{col}")
+
+     col_exprs.append("s.source_id")
+
+     # identifier → split into prefix_id + identifier_suffix
+     # Build a CASE expression to map domain → prefix_id
+     case_parts = []
+     for prefix, pid in sorted(prefix_map.items(), key=lambda x: -len(x[0])):
+         escaped = prefix.replace("'", "''")
+         case_parts.append(
+             f"WHEN s.identifier LIKE '{escaped}%' THEN {pid}"
+         )
+     case_expr = "CASE " + " ".join(case_parts) + " ELSE NULL END"
+
+     col_exprs.append(f"{case_expr} AS url_prefix_id")
+
+     # suffix: strip the matched domain prefix
+     suffix_parts = []
+     for prefix, pid in sorted(prefix_map.items(), key=lambda x: -len(x[0])):
+         escaped = prefix.replace("'", "''")
+         suffix_parts.append(
+             f"WHEN s.identifier LIKE '{escaped}%' "
+             f"THEN substr(s.identifier, {len(prefix) + 1})"
+         )
+     suffix_expr = "CASE " + " ".join(suffix_parts) + " ELSE s.identifier END"
+     col_exprs.append(f"{suffix_expr} AS identifier_suffix")
+
+     col_exprs.append("s.has_url")
+
+     # in_bioclip2_training: carry through if present in source
+     src_cols = [r[0] for r in src.execute("DESCRIBE metadata").fetchall()]
+     has_training_col = "in_bioclip2_training" in src_cols
+     if has_training_col:
+         col_exprs.append("s.in_bioclip2_training")
+         print(" Including in_bioclip2_training column")
+
+     select_clause = ",\n ".join(col_exprs)
+
+     # Sort order: source_dataset, taxonomy hierarchy, common_name
+     sort_order = (
+         'source_dataset, kingdom, phylum, class, "order", family, genus, species, '
+         "common_name"
+     )
+
+     t0 = time.time()
+     create_sql = f"""
+         CREATE TABLE metadata AS
+         SELECT
+             {select_clause}
+         FROM src.metadata s
+         ORDER BY {sort_order}
+     """
+
+     print(" Executing CREATE TABLE ... ORDER BY (this will take a while)...")
+     dst.execute(create_sql)
+     elapsed = time.time() - t0
+     print(f" Table created in {elapsed:.0f}s ({elapsed/60:.1f} min)")
+
+     # ── Step 4: Create indexes ───────────────────────────────────────
+     print("\n=== Step 4: Creating indexes ===")
+     t0 = time.time()
+     dst.execute("CREATE INDEX idx_id ON metadata (id)")
+     print(f" idx_id created in {time.time() - t0:.0f}s")
+
+     t0 = time.time()
+     if has_training_col:
+         dst.execute(
+             "CREATE INDEX idx_scope ON metadata (source_dataset, has_url, in_bioclip2_training)"
+         )
+     else:
+         dst.execute("CREATE INDEX idx_scope ON metadata (source_dataset, has_url)")
+     print(f" idx_scope created in {time.time() - t0:.0f}s")
+
+     # ── Step 5: Validate ─────────────────────────────────────────────
+     print("\n=== Step 5: Validation ===")
+     validate(dst, src, output_path)
+
+     src.close()
+     dst.close()
+
+
+ def validate(dst: duckdb.DuckDBPyConnection, src: duckdb.DuckDBPyConnection, output_path: str):
+     """Validate the optimized DB against the source."""
+     dst_count = dst.execute("SELECT COUNT(*) FROM metadata").fetchone()[0]
+     src_count = src.execute("SELECT COUNT(*) FROM metadata").fetchone()[0]
+
+     print(f" Source rows: {src_count:>15,}")
+     print(f" Output rows: {dst_count:>15,}")
+     if dst_count != src_count:
+         print(f" ERROR: Row count mismatch!")
+
+     # Check a few random IDs match
+     sample_ids = src.execute(
+         "SELECT id FROM metadata ORDER BY random() LIMIT 20"
+     ).fetchall()
+     id_list = ",".join(str(r[0]) for r in sample_ids)
+
+     # Compare key fields
+     src_rows = src.execute(
+         f"SELECT id, uuid, kingdom, species, has_url FROM metadata "
+         f"WHERE id IN ({id_list}) ORDER BY id"
+     ).fetchall()
+
+     dst_rows = dst.execute(
+         f"SELECT id, uuid, kingdom, species, has_url FROM metadata "
+         f"WHERE id IN ({id_list}) ORDER BY id"
+     ).fetchall()
+
+     # Cast for comparison (uuid type differs in format: no hyphens vs hyphens)
+     mismatches = 0
+     for s, d in zip(src_rows, dst_rows):
+         s_uuid = str(s[1]).replace("-", "")
+         d_uuid = str(d[1]).replace("-", "")
+         if str(s[0]) != str(d[0]) or s_uuid != d_uuid or \
+            str(s[2]) != str(d[2]) or str(s[3]) != str(d[3]) or \
+            s[4] != d[4]:
+             print(f" MISMATCH: src={s} dst={d}")
+             mismatches += 1
+
+     if mismatches == 0:
+         print(f" Spot check: {len(src_rows)} random rows OK")
+     else:
+         print(f" ERROR: {mismatches} mismatches in spot check!")
+
+     # URL reconstruction check
+     print(" Checking URL reconstruction...")
+     sample_urls = src.execute(
+         f"SELECT id, identifier FROM metadata "
+         f"WHERE id IN ({id_list}) AND identifier IS NOT NULL "
+         f"ORDER BY id"
+     ).fetchall()
+
+     dst_urls = dst.execute(
+         f"SELECT m.id, COALESCE(p.prefix, '') || COALESCE(m.identifier_suffix, '') "
+         f"FROM metadata m "
+         f"LEFT JOIN url_prefixes p ON m.url_prefix_id = p.prefix_id "
+         f"WHERE m.id IN ({id_list}) AND m.identifier_suffix IS NOT NULL "
+         f"ORDER BY m.id"
+     ).fetchall()
+
+     url_mismatches = 0
+     dst_url_map = {r[0]: r[1] for r in dst_urls}
+     for sid, surl in sample_urls:
+         durl = dst_url_map.get(sid)
+         if durl != surl:
+             print(f" URL MISMATCH id={sid}: src={surl[:80]} dst={durl[:80] if durl else None}")
+             url_mismatches += 1
+
+     if url_mismatches == 0:
+         print(f" URL reconstruction: {len(sample_urls)} URLs OK")
+     else:
+         print(f" ERROR: {url_mismatches} URL mismatches!")
+
+     # Size report
+     size_gb = os.path.getsize(output_path) / 1024**3
+     print(f"\n Output size: {size_gb:.2f} GB")
+
+     # Per-column storage estimate (count distinct blocks × 256 KB block size)
+     print("\n Column storage breakdown:")
+     storage = dst.execute("""
+         SELECT column_name,
+                COUNT(DISTINCT block_id) * 256.0 / 1024 AS mb
+         FROM pragma_storage_info('metadata')
+         WHERE block_id IS NOT NULL
+         GROUP BY column_name
+         ORDER BY mb DESC
+     """).fetchall()
+     for col, mb in storage:
+         print(f" {col:<25s} {mb:>8.1f} MB")
+
+     # Query performance sanity check
+     print("\n Query performance check:")
+     test_ids = ",".join(str(r[0]) for r in sample_ids[:10])
+
+     t0 = time.time()
+     for _ in range(100):
+         dst.execute(
+             f"SELECT id, uuid, kingdom, phylum, class, \"order\", family, genus, species, "
+             f"common_name, source_dataset, source_id, publisher, img_type, "
+             f"COALESCE(p.prefix, '') || COALESCE(m.identifier_suffix, '') AS identifier, "
+             f"has_url "
+             f"FROM metadata m "
+             f"LEFT JOIN url_prefixes p ON m.url_prefix_id = p.prefix_id "
+             f"WHERE m.id IN ({test_ids})"
+         ).fetchall()
+     avg_ms = (time.time() - t0) / 100 * 1000
+     print(f" Avg query time (10 IDs, 100 runs): {avg_ms:.2f} ms")
+
+     t0 = time.time()
+     for _ in range(100):
+         src.execute(
+             f"SELECT id, uuid, kingdom, phylum, class, \"order\", family, genus, species, "
+             f"common_name, source_dataset, source_id, publisher, img_type, identifier, has_url "
+             f"FROM metadata WHERE id IN ({test_ids})"
+         ).fetchall()
+     avg_ms_src = (time.time() - t0) / 100 * 1000
+     print(f" Avg query time ORIGINAL (10 IDs, 100 runs): {avg_ms_src:.2f} ms")
+
+
+ def main():
+     parser = argparse.ArgumentParser(
+         description="Optimize DuckDB: drop columns, ENUM types, sort, split URLs"
+     )
+     parser.add_argument(
+         "--source", required=True,
+         help="Path to source metadata.duckdb"
+     )
+     parser.add_argument(
+         "--output", required=True,
+         help="Path for optimized output .duckdb"
+     )
+     args = parser.parse_args()
+
+     os.makedirs(os.path.dirname(os.path.abspath(args.output)), exist_ok=True)
+     create_optimized_db(args.source, args.output)
+     print("\nDone.")
+
+
+ if __name__ == "__main__":
+     main()
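
The URL split performed in Steps 2 and 3 of `optimize_duckdb.py` can be illustrated in plain Python, without DuckDB. This is a minimal sketch, assuming the same domain regex as `build_url_prefix_table`. It looks up the extracted domain exactly, whereas the script uses a longest-first `LIKE` chain in SQL. `build_prefix_table`, `split_url`, and `reconstruct` are hypothetical names, and the sample URLs are made up:

```python
import re
from collections import Counter
from typing import Optional

# Same pattern the script passes to regexp_extract: scheme plus host.
DOMAIN_RE = re.compile(r"^(https?://[^/]+)")


def build_prefix_table(urls: list[str]) -> dict[str, int]:
    """Map each distinct domain prefix to an integer id, most frequent first."""
    counts: Counter = Counter()
    for url in urls:
        m = DOMAIN_RE.match(url)
        if m:
            counts[m.group(1)] += 1
    return {prefix: i for i, (prefix, _) in enumerate(counts.most_common())}


def split_url(url: str, table: dict[str, int]) -> tuple[Optional[int], str]:
    """Return (prefix_id, suffix); unknown or non-URL values keep the full string."""
    m = DOMAIN_RE.match(url)
    if m and m.group(1) in table:
        prefix = m.group(1)
        return table[prefix], url[len(prefix):]
    return None, url


def reconstruct(prefix_id: Optional[int], suffix: str, by_id: dict[int, str]) -> str:
    """Mirror COALESCE(prefix, '') || COALESCE(suffix, '') from the validation query."""
    return by_id.get(prefix_id, "") + suffix


if __name__ == "__main__":
    urls = [
        "https://a.example/photos/1.jpg",
        "https://a.example/photos/2.jpg",
        "http://b.example/img/3.png",
        "not-a-url",
    ]
    table = build_prefix_table(urls)
    by_id = {pid: prefix for prefix, pid in table.items()}
    for u in urls:
        pid, suffix = split_url(u, table)
        assert reconstruct(pid, suffix, by_id) == u
        print(pid, suffix)
```

The inverted dictionary `by_id` plays the role of the in-memory prefix lookup that the commit message describes the app loading at startup in place of a SQL JOIN.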
scripts/data/optimize_duckdb.slurm ADDED
@@ -0,0 +1,31 @@
+ #!/bin/bash
+ #SBATCH --job-name=duckdb_optimize
+ #SBATCH --time=04:00:00
+ #SBATCH --nodes=1
+ #SBATCH --ntasks=1
+ #SBATCH --cpus-per-task=8
+ #SBATCH --mem=128G
+ #SBATCH --partition=cpu
+ #SBATCH --account=<YOUR_ACCOUNT>  # TODO: set your SLURM account
+
+ set -euo pipefail
+
+ # ── Config ──────────────────────────────────────────────────────────
+ REPO_ROOT="$(cd "$(dirname "$0")/../.." && pwd)"
+ VENV="${BIOCLIP_VENV:?Set BIOCLIP_VENV to your virtualenv path}"
+ DATA_DIR="${BIOCLIP_DATA_DIR:?Set BIOCLIP_DATA_DIR to your data directory}"
+
+ echo "=== DuckDB Optimization ==="
+ echo "Job ID: $SLURM_JOB_ID"
+ echo "Node: $(hostname)"
+ echo "Start: $(date)"
+ echo ""
+
+ source "$VENV/bin/activate"
+
+ python -u "$REPO_ROOT/scripts/data/optimize_duckdb.py" \
+     --source "${1:?Usage: sbatch optimize_duckdb.slurm <source.duckdb>}" \
+     --output "$DATA_DIR/metadata_optimized.duckdb"
+
+ echo ""
+ echo "End: $(date)"
scripts/data/validate_optimized_duckdb.py ADDED
@@ -0,0 +1,405 @@
1
+ """Validate optimized DuckDB against the original source.
2
+
3
+ Checks:
4
+ 1. Row count matches
5
+ 2. Random row spot-checks (id, uuid, taxonomy, has_url)
6
+ 3. URL reconstruction (prefix table + suffix == original identifier)
7
+ 4. Corrupted rows have NULLed taxonomy
8
+ 5. Per-column storage breakdown
9
+ 6. Query performance comparison (optimized vs original)
10
+ 7. Schema and index verification
11
+
12
+ Usage:
13
+ python scripts/data/validate_optimized_duckdb.py \
14
+ --source /path/to/metadata.duckdb \
15
+ --optimized /path/to/metadata_optimized.duckdb
16
+ """
17
+
18
+ import argparse
19
+ import os
20
+ import time
21
+
22
+ import duckdb
23
+
24
+
25
+ VALID_KINGDOMS = {
26
+ 'Animalia', 'Plantae', 'Fungi', 'Chromista', 'Protozoa',
27
+ 'Bacteria', 'Archaea', 'Viruses', 'Metazoa',
28
+ 'Archaeplastida', 'incertae sedis',
29
+ }
30
+
31
+ TAXONOMY_COLS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]
32
+
33
+ # Columns the app selects (from config.py METADATA_COLUMNS)
34
+ APP_COLUMNS = [
35
+ "id", "uuid", "kingdom", "phylum", "class", '"order"', "family", "genus",
36
+ "species", "common_name", "source_dataset", "source_id", "publisher",
37
+ "img_type", "identifier", "has_url", "in_bioclip2_training",
38
+ ]
39
+
40
+
41
+ def validate(source_path: str, optimized_path: str):
42
+ passed = 0
43
+ failed = 0
44
+
45
+ src = duckdb.connect(source_path, read_only=True)
46
+ opt = duckdb.connect(optimized_path, read_only=True)
47
+
48
+ # ── 1. Row count ─────────────────────────────────────────────────
49
+ print("=== 1. Row Count ===")
50
+ src_count = src.execute("SELECT COUNT(*) FROM metadata").fetchone()[0]
51
+ opt_count = opt.execute("SELECT COUNT(*) FROM metadata").fetchone()[0]
52
+ print(f" Source: {src_count:>15,}")
53
+ print(f" Optimized: {opt_count:>15,}")
54
+ if src_count == opt_count:
55
+ print(" PASS")
56
+ passed += 1
57
+ else:
58
+ print(" FAIL: row count mismatch")
59
+ failed += 1
60
+
61
+ # ── 2. Random spot-check ─────────────────────────────────────────
62
+ print("\n=== 2. Random Spot-Check (100 rows) ===")
63
+ sample_ids = src.execute(
64
+ "SELECT id FROM metadata ORDER BY random() LIMIT 100"
65
+ ).fetchall()
66
+ id_list = ",".join(str(r[0]) for r in sample_ids)
67
+
68
+ src_rows = src.execute(
69
+ f"SELECT id, uuid, kingdom, species, has_url, source_dataset "
70
+ f"FROM metadata WHERE id IN ({id_list}) ORDER BY id"
71
+ ).fetchall()
72
+ opt_rows = opt.execute(
73
+ f"SELECT id, uuid, kingdom, species, has_url, source_dataset "
74
+ f"FROM metadata WHERE id IN ({id_list}) ORDER BY id"
75
+ ).fetchall()
76
+
77
+ mismatches = 0
78
+ for s, o in zip(src_rows, opt_rows):
79
+ s_uuid = str(s[1]).replace("-", "")
80
+ o_uuid = str(o[1]).replace("-", "")
81
+ # kingdom/species may be NULL in optimized if row was corrupted
82
+ s_kingdom = str(s[2]) if s[2] else None
83
+ o_kingdom = str(o[2]) if o[2] else None
84
+ s_species = str(s[3]) if s[3] else None
85
+ o_species = str(o[3]) if o[3] else None
86
+
87
+ id_ok = s[0] == o[0]
88
+ uuid_ok = s_uuid == o_uuid
89
+ has_url_ok = s[4] == o[4]
90
+ source_ok = str(s[5]) == str(o[5])
91
+ # Taxonomy may differ if row was corrupted (NULLed in optimized)
92
+ taxonomy_ok = (o_kingdom == s_kingdom and o_species == s_species) or \
93
+ (o_kingdom is None and s_kingdom not in VALID_KINGDOMS)
94
+
95
+ if not (id_ok and uuid_ok and has_url_ok and source_ok and taxonomy_ok):
96
+ print(f" MISMATCH id={s[0]}:")
97
+ print(f" src: uuid={s[1]}, kingdom={s[2]}, species={s[3]}, has_url={s[4]}")
98
+ print(f" opt: uuid={o[1]}, kingdom={o[2]}, species={o[3]}, has_url={o[4]}")
99
+ mismatches += 1
100
+
101
+ if mismatches == 0:
102
+ print(f" PASS ({len(src_rows)} rows checked)")
103
+ passed += 1
104
+ else:
105
+ print(f" FAIL: {mismatches} mismatches")
106
+ failed += 1
107
+
108
+ # ── 3. URL reconstruction ────────────────────────────────────────
109
+ print("\n=== 3. URL Reconstruction ===")
110
+ has_prefix_table = opt.execute(
111
+ "SELECT COUNT(*) FROM information_schema.tables "
112
+ "WHERE table_name = 'url_prefixes'"
113
+ ).fetchone()[0] > 0
114
+
115
+ if has_prefix_table:
116
+ # Reuse the 100 IDs sampled in section 2, keeping only rows that have a URL
117
+ url_sample = src.execute(
118
+ f"SELECT id, identifier FROM metadata "
119
+ f"WHERE id IN ({id_list}) AND identifier IS NOT NULL "
120
+ f"ORDER BY id"
121
+ ).fetchall()
122
+
123
+ opt_urls = opt.execute(
124
+ f"SELECT m.id, "
125
+ f" COALESCE(p.prefix, '') || COALESCE(m.identifier_suffix, '') "
126
+ f"FROM metadata m "
127
+ f"LEFT JOIN url_prefixes p ON m.url_prefix_id = p.prefix_id "
128
+ f"WHERE m.id IN ({id_list}) AND "
129
+ f" (m.identifier_suffix IS NOT NULL OR m.url_prefix_id IS NOT NULL) "
130
+ f"ORDER BY m.id"
131
+ ).fetchall()
132
+
133
+ opt_url_map = {r[0]: r[1] for r in opt_urls}
134
+ url_mismatches = 0
135
+ for sid, surl in url_sample:
136
+ ourl = opt_url_map.get(sid)
137
+ if ourl != surl:
138
+ print(f" MISMATCH id={sid}:")
139
+ print(f" src: {surl[:100]}")
140
+ print(f" opt: {ourl[:100] if ourl else None}")
141
+ url_mismatches += 1
142
+
143
+ if url_mismatches == 0:
144
+ print(f" PASS ({len(url_sample)} URLs checked)")
145
+ passed += 1
146
+ else:
147
+ print(f" FAIL: {url_mismatches} URL mismatches")
148
+ failed += 1
149
+ else:
150
+ print(" SKIP: no url_prefixes table found")
151
+
152
+ # ── 4. Corrupted row cleanup ─────────────────────────────────────
153
+ print("\n=== 4. Corrupted Row Cleanup ===")
154
+ placeholders_str = ",".join(f"'{k}'" for k in VALID_KINGDOMS)
155
+
156
+ # Find corrupted IDs from source
157
+ corrupt_src = src.execute(f"""
158
+ SELECT id FROM metadata
159
+ WHERE kingdom IS NOT NULL AND kingdom NOT IN ({placeholders_str})
160
+ """).fetchall()
161
+ corrupt_ids = [r[0] for r in corrupt_src]
162
+
163
+ if corrupt_ids:
164
+ corrupt_id_list = ",".join(str(i) for i in corrupt_ids)
165
+ # Check that these rows have NULL taxonomy in optimized
166
+ opt_corrupt = opt.execute(f"""
167
+ SELECT id, kingdom, phylum, class, "order", family, genus, species, common_name
168
+ FROM metadata
169
+ WHERE id IN ({corrupt_id_list})
170
+ """).fetchall()
171
+
172
+ not_cleaned = 0
173
+ for row in opt_corrupt:
174
+ # All taxonomy cols (index 1-8) should be NULL
175
+ for i, col in enumerate(TAXONOMY_COLS + ["common_name"], 1):
176
+ if row[i] is not None:
177
+ print(f" NOT CLEANED id={row[0]}: {col}={row[i]}")
178
+ not_cleaned += 1
179
+ break
180
+
181
+ if not_cleaned == 0:
182
+ print(f" PASS ({len(corrupt_ids)} corrupted rows have NULLed taxonomy)")
183
+ passed += 1
184
+ else:
185
+ print(f" FAIL: {not_cleaned} rows still have non-NULL taxonomy")
186
+ failed += 1
187
+ else:
188
+ print(" SKIP: no corrupted rows found in source")
189
+
190
+ # ── 5. No new corruption introduced ──────────────────────────────
191
+ print("\n=== 5. No New Corruption ===")
192
+ # Check that all non-NULL kingdom values in optimized are valid
193
+ opt_kingdoms = opt.execute("""
194
+ SELECT DISTINCT kingdom FROM metadata WHERE kingdom IS NOT NULL
195
+ """).fetchall()
196
+ invalid = [r[0] for r in opt_kingdoms if str(r[0]) not in VALID_KINGDOMS]
197
+ if not invalid:
198
+ print(f" PASS (all {len(opt_kingdoms)} distinct kingdoms are valid)")
199
+ passed += 1
200
+ else:
201
+ print(f" FAIL: invalid kingdoms found: {invalid[:10]}")
202
+ failed += 1
203
+
204
+ # ── 5b. in_bioclip2_training column ─────────────────────────────
205
+ print("\n=== 5b. in_bioclip2_training Column ===")
206
+ opt_cols = [r[0] for r in opt.execute("DESCRIBE metadata").fetchall()]
207
+ src_cols = [r[0] for r in src.execute("DESCRIBE metadata").fetchall()]
208
+
209
+ if "in_bioclip2_training" in src_cols and "in_bioclip2_training" in opt_cols:
210
+ src_training = src.execute(
211
+ "SELECT COUNT(*) FROM metadata WHERE in_bioclip2_training = true"
212
+ ).fetchone()[0]
213
+ opt_training = opt.execute(
214
+ "SELECT COUNT(*) FROM metadata WHERE in_bioclip2_training = true"
215
+ ).fetchone()[0]
216
+ print(f" Source training count: {src_training:>15,}")
217
+ print(f" Optimized training count: {opt_training:>15,}")
218
+ if src_training == opt_training:
219
+ print(" PASS")
220
+ passed += 1
221
+ else:
222
+ print(" FAIL: training count mismatch")
223
+ failed += 1
224
+
225
+ # Spot-check: verify a sample of training rows match
226
+ sample_training = src.execute(
227
+ "SELECT id FROM metadata WHERE in_bioclip2_training = true "
228
+ "ORDER BY random() LIMIT 50"
229
+ ).fetchall()
230
+ if sample_training:
231
+ training_ids = ",".join(str(r[0]) for r in sample_training)
232
+ opt_check = opt.execute(
233
+ f"SELECT COUNT(*) FROM metadata "
234
+ f"WHERE id IN ({training_ids}) AND in_bioclip2_training = true"
235
+ ).fetchone()[0]
236
+ if opt_check == len(sample_training):
237
+ print(f" PASS (spot-check: {len(sample_training)} training rows verified)")
238
+ passed += 1
239
+ else:
240
+ print(f" FAIL: only {opt_check}/{len(sample_training)} training rows found")
241
+ failed += 1
242
+ elif "in_bioclip2_training" not in src_cols:
243
+ print(" SKIP: column not in source DB")
244
+ else:
245
+ print(" FAIL: column missing from optimized DB")
246
+ failed += 1
247
+
248
+ # ── 6. Schema and indexes ────────────────────────────────────────
249
+ print("\n=== 6. Schema & Indexes ===")
250
+ schema = opt.execute("DESCRIBE metadata").fetchall()
251
+ col_types = {r[0]: r[1] for r in schema}
252
+ print(" Columns:")
253
+ for name, dtype in col_types.items():
254
+ # Truncate long ENUM type strings
255
+ dtype_str = str(dtype)
256
+ if len(dtype_str) > 60:
257
+ dtype_str = dtype_str[:57] + "..."
258
+ print(f" {name:<25s} {dtype_str}")
259
+
260
+ indexes = opt.execute(
261
+ "SELECT index_name FROM duckdb_indexes()"
262
+ ).fetchall()
263
+ idx_names = {r[0] for r in indexes}
264
+ print(f"\n Indexes: {', '.join(sorted(idx_names))}")
265
+
266
+ required_indexes = {"idx_id", "idx_scope"}
267
+ if required_indexes.issubset(idx_names):
268
+ print(" PASS (required indexes present)")
269
+ passed += 1
270
+ else:
271
+ missing = required_indexes - idx_names
272
+ print(f" FAIL: missing indexes: {missing}")
273
+ failed += 1
274
+
275
+ # Check id type is INTEGER (not BIGINT)
276
+ if "INTEGER" in str(col_types.get("id", "")):
277
+ print(" PASS (id is INTEGER)")
278
+ passed += 1
279
+ else:
280
+ print(f" FAIL: id type is {col_types.get('id')}, expected INTEGER")
281
+ failed += 1
282
+
283
+ # Check uuid type is UUID (not VARCHAR)
284
+ if "UUID" in str(col_types.get("uuid", "")):
285
+ print(" PASS (uuid is native UUID)")
286
+ passed += 1
287
+ else:
288
+ print(f" FAIL: uuid type is {col_types.get('uuid')}, expected UUID")
289
+ failed += 1
290
+
291
+ # ── 7. Column storage breakdown ──────────────────────────────────
292
+ print("\n=== 7. Storage Breakdown ===")
293
+ src_size = os.path.getsize(source_path) / 1024**3
294
+ opt_size = os.path.getsize(optimized_path) / 1024**3
295
+ print(f" Source: {src_size:.2f} GB")
296
+ print(f" Optimized: {opt_size:.2f} GB")
297
+ print(f" Reduction: {(1 - opt_size/src_size)*100:.1f}%")
298
+
299
+ storage = opt.execute("""
300
+ SELECT column_name,
301
+ COUNT(DISTINCT block_id) * 256.0 / 1024 AS mb
302
+ FROM pragma_storage_info('metadata')
303
+ WHERE block_id IS NOT NULL
304
+ GROUP BY column_name
305
+ ORDER BY mb DESC
306
+ """).fetchall()
307
+ total = 0
308
+ print(f"\n {'Column':<25s} {'Size (MB)':>10s}")
309
+ print(f" {'-'*25} {'-'*10}")
310
+ for col, mb in storage:
311
+ print(f" {col:<25s} {mb:>10.1f}")
312
+ total += mb
313
+ print(f" {'-'*25} {'-'*10}")
314
+ print(f" {'TOTAL':<25s} {total:>10.1f}")
315
+
316
+ # ── 8. Query performance ─────────────────────────────────────────
317
+ print("\n=== 8. Query Performance ===")
318
+ test_ids = ",".join(str(r[0]) for r in sample_ids[:10])
319
+
320
+ # Optimized query (with URL join)
321
+ opt_query = (
322
+ f"SELECT m.id, m.uuid, m.kingdom, m.phylum, m.class, m.\"order\", "
323
+ f"m.family, m.genus, m.species, m.common_name, m.source_dataset, "
324
+ f"m.source_id, m.publisher, m.img_type, "
325
+ f"COALESCE(p.prefix, '') || COALESCE(m.identifier_suffix, '') AS identifier, "
326
+ f"m.has_url "
327
+ f"FROM metadata m "
328
+ f"LEFT JOIN url_prefixes p ON m.url_prefix_id = p.prefix_id "
329
+ f"WHERE m.id IN ({test_ids})"
330
+ )
331
+
332
+ # Source query (direct)
333
+ src_query = (
334
+ f'SELECT id, uuid, kingdom, phylum, class, "order", family, genus, '
335
+ f"species, common_name, source_dataset, source_id, publisher, "
336
+ f"img_type, identifier, has_url "
337
+ f"FROM metadata WHERE id IN ({test_ids})"
338
+ )
339
+
340
+ # Warmup
341
+ opt.execute(opt_query).fetchall()
342
+ src.execute(src_query).fetchall()
343
+
344
+ iterations = 500
345
+ t0 = time.time()
346
+ for _ in range(iterations):
347
+ opt.execute(opt_query).fetchall()
348
+ opt_ms = (time.time() - t0) / iterations * 1000
349
+
350
+ t0 = time.time()
351
+ for _ in range(iterations):
352
+ src.execute(src_query).fetchall()
353
+ src_ms = (time.time() - t0) / iterations * 1000
354
+
355
+ print(f" Optimized (10 IDs, {iterations} runs): {opt_ms:.2f} ms avg")
356
+ print(f" Original (10 IDs, {iterations} runs): {src_ms:.2f} ms avg")
357
+ ratio = opt_ms / src_ms if src_ms > 0 else float('inf')
358
+ if ratio < 2.0:
359
+ print(f" PASS (ratio: {ratio:.2f}x)")
360
+ passed += 1
361
+ else:
362
+ print(f" FAIL: optimized is {ratio:.1f}x slower than original (threshold: 2.0x)")
363
+ failed += 1
364
+
365
+ # Also test scope-filtered queries
366
+ t0 = time.time()
367
+ for _ in range(iterations):
368
+ opt.execute(
369
+ f"{opt_query} AND m.has_url = true"
370
+ ).fetchall()
371
+ opt_scope_ms = (time.time() - t0) / iterations * 1000
372
+
373
+ t0 = time.time()
374
+ for _ in range(iterations):
375
+ src.execute(
376
+ f"{src_query} AND has_url = true"
377
+ ).fetchall()
378
+ src_scope_ms = (time.time() - t0) / iterations * 1000
379
+
380
+ print(f" Optimized scoped (url_only): {opt_scope_ms:.2f} ms avg")
381
+ print(f" Original scoped (url_only): {src_scope_ms:.2f} ms avg")
382
+
383
+ # ── Summary ──────────────────────────────────────────────────────
384
+ print(f"\n{'='*50}")
385
+ print(f"PASSED: {passed} FAILED: {failed}")
386
+ if failed == 0:
387
+ print("ALL CHECKS PASSED")
388
+ else:
389
+ print("SOME CHECKS FAILED — review above")
390
+
391
+ src.close()
392
+ opt.close()
393
+
394
+
395
+ def main():
396
+ parser = argparse.ArgumentParser(description="Validate optimized DuckDB")
397
+ parser.add_argument("--source", required=True, help="Original metadata.duckdb")
398
+ parser.add_argument("--optimized", required=True, help="Optimized metadata.duckdb")
399
+ args = parser.parse_args()
400
+
401
+ validate(args.source, args.optimized)
402
+
403
+
404
+ if __name__ == "__main__":
405
+ main()
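The spot-check in section 2 of the script above pairs source and optimized rows positionally with `zip`, which silently assumes both queries return exactly the same IDs. A minimal sketch of an ID-keyed comparison follows. The helper name `compare_by_id` and the reduced row layout `(id, uuid, kingdom, species)` are assumptions made for illustration and are not part of the committed script.

```python
from typing import Iterable, List, Optional, Set, Tuple

# Assumed row layout for this sketch: (id, uuid, kingdom, species)
Row = Tuple[int, str, Optional[str], Optional[str]]


def compare_by_id(src_rows: Iterable[Row], opt_rows: Iterable[Row],
                  valid_kingdoms: Set[str]) -> List[int]:
    """Return sorted IDs whose optimized row is missing or differs from the source."""
    opt_map = {row[0]: row for row in opt_rows}
    bad = []
    for sid, s_uuid, s_kingdom, s_species in src_rows:
        o = opt_map.get(sid)
        if o is None:
            # Row absent from the optimized DB: zip() would have misaligned here.
            bad.append(sid)
            continue
        _, o_uuid, o_kingdom, o_species = o
        # Compare UUIDs with dashes removed, as the script does.
        uuid_ok = str(s_uuid).replace("-", "") == str(o_uuid).replace("-", "")
        # Taxonomy may be NULL in the optimized DB when the source kingdom
        # was not a recognized value (a corrupted row).
        taxonomy_ok = (s_kingdom == o_kingdom and s_species == o_species) or (
            o_kingdom is None and s_kingdom not in valid_kingdoms
        )
        if not (uuid_ok and taxonomy_ok):
            bad.append(sid)
    return sorted(bad)
```

Keying by ID means a row missing from the optimized DB is reported as such, instead of shifting every later pair out of alignment.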
scripts/data/validate_optimized_duckdb.slurm ADDED
@@ -0,0 +1,31 @@
1
+ #!/bin/bash
2
+ #SBATCH --job-name=duckdb_validate
3
+ #SBATCH --time=01:00:00
4
+ #SBATCH --nodes=1
5
+ #SBATCH --ntasks=1
6
+ #SBATCH --cpus-per-task=4
7
+ #SBATCH --mem=64G
8
+ #SBATCH --partition=cpu
9
+ #SBATCH --account=<YOUR_ACCOUNT> # TODO: set your SLURM account
10
+
11
+ set -euo pipefail
12
+
13
+ # ── Config ──────────────────────────────────────────────────────────
14
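+ # NOTE: sbatch typically runs a spooled copy of this script, so "$0" may not
+ # point into the repo. If REPO_ROOT resolves incorrectly, derive it from
+ # $SLURM_SUBMIT_DIR or set an absolute path here instead.
+ REPO_ROOT="$(cd "$(dirname "$0")/../.." && pwd)"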
15
+ VENV="${BIOCLIP_VENV:?Set BIOCLIP_VENV to your virtualenv path}"
16
+ DATA_DIR="${BIOCLIP_DATA_DIR:?Set BIOCLIP_DATA_DIR to your data directory}"
17
+
18
+ echo "=== DuckDB Validation ==="
19
+ echo "Job ID: $SLURM_JOB_ID"
20
+ echo "Node: $(hostname)"
21
+ echo "Start: $(date)"
22
+ echo ""
23
+
24
+ source "$VENV/bin/activate"
25
+
26
+ python -u "$REPO_ROOT/scripts/data/validate_optimized_duckdb.py" \
27
+ --source "${1:?Usage: sbatch validate_optimized_duckdb.slurm <source.duckdb>}" \
28
+ --optimized "$DATA_DIR/metadata_optimized.duckdb"
29
+
30
+ echo ""
31
+ echo "End: $(date)"
src/bioclip_lite/config.py CHANGED
@@ -27,7 +27,7 @@ class LiteConfig:
27
  default_nprobe: int = 16
28
  over_fetch_factor: int = 3
29
 
30
- # Scope: "all" | "url_only" | "inaturalist"
31
  scope: str = "all"
32
 
33
  # Server
@@ -46,10 +46,13 @@ class LiteConfig:
46
  image_fetch_max_workers: int = 8
47
  thumbnail_max_dim: int = 256
48
 
49
- # Metadata columns to SELECT (15 of 18 — excludes resolution_status, basisOfRecord, scientific_name)
 
50
  METADATA_COLUMNS: str = (
51
  'id, uuid, kingdom, phylum, class, "order", family, genus, species, '
52
- "common_name, source_dataset, source_id, publisher, img_type, identifier, has_url"
 
 
53
  )
54
 
55
 
@@ -138,7 +141,10 @@ def parse_args() -> LiteConfig:
138
  )
139
  p.add_argument("--device", default="cpu", choices=["cpu", "cuda", "mps"])
140
  p.add_argument("--model-str", default=None, help="Model identifier")
141
- p.add_argument("--scope", default="all", choices=["all", "url_only", "inaturalist"])
 
 
 
142
  p.add_argument("--host", default="0.0.0.0")
143
  p.add_argument("--port", type=int, default=7860)
144
  p.add_argument("--enable-export", action="store_true")
 
27
  default_nprobe: int = 16
28
  over_fetch_factor: int = 3
29
 
30
+ # Scope: "all" | "url_only" | "inaturalist" | "bioclip2_training"
31
  scope: str = "all"
32
 
33
  # Server
 
46
  image_fetch_max_workers: int = 8
47
  thumbnail_max_dim: int = 256
48
 
49
+ # Metadata columns to SELECT from optimized DB.
50
+ # URL is split into url_prefix_id + identifier_suffix; reconstructed in Python.
51
  METADATA_COLUMNS: str = (
52
  'id, uuid, kingdom, phylum, class, "order", family, genus, species, '
53
+ "common_name, source_dataset, source_id, publisher, img_type, "
54
+ "basisOfRecord, url_prefix_id, identifier_suffix, has_url, "
55
+ "in_bioclip2_training"
56
  )
57
 
58
 
 
141
  )
142
  p.add_argument("--device", default="cpu", choices=["cpu", "cuda", "mps"])
143
  p.add_argument("--model-str", default=None, help="Model identifier")
144
+ p.add_argument(
145
+ "--scope", default="all",
146
+ choices=["all", "url_only", "inaturalist", "bioclip2_training"],
147
+ )
148
  p.add_argument("--host", default="0.0.0.0")
149
  p.add_argument("--port", type=int, default=7860)
150
  p.add_argument("--enable-export", action="store_true")
src/bioclip_lite/services/search_service.py CHANGED
@@ -19,6 +19,7 @@ SCOPE_MAP = {
19
  "All Sources": "all",
20
  "URL-Available Only": "url_only",
21
  "iNaturalist Only": "inaturalist",
 
22
  }
23
 
24
 
@@ -62,6 +63,10 @@ class SearchService:
62
  row_count = self.conn.execute("SELECT COUNT(*) FROM metadata").fetchone()[0]
63
  logger.info(f"DuckDB connected: {row_count:,} rows")
64
 
 
 
 
 
65
  @_timer
66
  def search(
67
  self,
@@ -76,7 +81,7 @@ class SearchService:
76
  query_vector: 1-D embedding vector (768-dim for BioCLIP-2).
77
  top_n: Number of results to return after scope filtering.
78
  nprobe: Number of IVF partitions to search.
79
- scope: "all", "url_only", or "inaturalist".
80
 
81
  Returns:
82
  List of result dicts ordered by distance, each containing
@@ -126,28 +131,34 @@ class SearchService:
126
  distances: List[float],
127
  scope: str,
128
  ) -> List[Dict[str, Any]]:
129
- """Query DuckDB for metadata, applying scope filter."""
130
- id_list = ",".join(str(i) for i in ids)
131
 
132
- where = [f"id IN ({id_list})"]
133
- if scope == "url_only":
134
- where.append("has_url = true")
135
- elif scope == "inaturalist":
136
- where.append("has_url = true")
137
- where.append("source_dataset = 'gbif'")
138
- where.append("publisher LIKE '%iNaturalist%'")
139
 
140
  query = (
141
  f"SELECT {self.metadata_columns} FROM metadata "
142
- f"WHERE {' AND '.join(where)}"
143
  )
144
  rows = self.conn.execute(query).fetchall()
145
  col_names = [desc[0] for desc in self.conn.description]
146
 
147
- # Build lookup keyed by id
148
  meta_map: Dict[int, Dict] = {}
149
  for row in rows:
150
  d = dict(zip(col_names, row))
 
 
 
 
 
 
 
151
  meta_map[d["id"]] = d
152
 
153
  # Merge with distances, preserving FAISS ranking
@@ -155,6 +166,20 @@ class SearchService:
155
  for fid, dist in zip(ids, distances):
156
  if fid in meta_map:
157
  results.append({"distance": dist, **meta_map[fid]})
 
 
 
 
 
 
 
 
 
 
 
 
 
 
158
  return results
159
 
160
  @property
@@ -165,5 +190,19 @@ class SearchService:
165
  def total_vectors(self) -> int:
166
  return self.index.ntotal
167
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
168
  def close(self):
169
  self.conn.close()
 
19
  "All Sources": "all",
20
  "URL-Available Only": "url_only",
21
  "iNaturalist Only": "inaturalist",
22
+ "BioCLIP 2 Training": "bioclip2_training",
23
  }
24
 
25
 
 
63
  row_count = self.conn.execute("SELECT COUNT(*) FROM metadata").fetchone()[0]
64
  logger.info(f"DuckDB connected: {row_count:,} rows")
65
 
66
+ # Load URL prefix lookup (410 entries, ~50 KB in memory).
67
+ # Reconstructs full URLs in Python instead of a SQL JOIN.
68
+ self._url_prefixes = self._load_url_prefixes()
69
+
70
  @_timer
71
  def search(
72
  self,
 
81
  query_vector: 1-D embedding vector (768-dim for BioCLIP-2).
82
  top_n: Number of results to return after scope filtering.
83
  nprobe: Number of IVF partitions to search.
84
+ scope: "all", "url_only", "inaturalist", or "bioclip2_training".
85
 
86
  Returns:
87
  List of result dicts ordered by distance, each containing
 
131
  distances: List[float],
132
  scope: str,
133
  ) -> List[Dict[str, Any]]:
134
+ """Query DuckDB for metadata, filtering by scope in Python.
 
135
 
136
+ Scope filtering via SQL WHERE clauses causes ~370x slowdown on
137
+ ID-based lookups (4ms → 1600ms) because DuckDB scans the full
138
+ column even when nearly all rows match. Since has_url and
139
+ in_bioclip2_training are true for >87% of rows, post-filtering
140
+ in Python is far more efficient.
141
+ """
142
+ id_list = ",".join(str(i) for i in ids)
143
 
144
  query = (
145
  f"SELECT {self.metadata_columns} FROM metadata "
146
+ f"WHERE id IN ({id_list})"
147
  )
148
  rows = self.conn.execute(query).fetchall()
149
  col_names = [desc[0] for desc in self.conn.description]
150
 
151
+ # Build lookup keyed by id, reconstructing full URL from prefix + suffix
152
  meta_map: Dict[int, Dict] = {}
153
  for row in rows:
154
  d = dict(zip(col_names, row))
155
+ if self._url_prefixes and "url_prefix_id" in d:
156
+ # Prefixes are domains (e.g. "https://content.eol.org"),
157
+ # suffixes always start with "/" (e.g. "/data/media/...").
158
+ # Split is guaranteed by optimize_duckdb.py's substr().
159
+ prefix = self._url_prefixes.get(d.pop("url_prefix_id"), "")
160
+ suffix = d.pop("identifier_suffix", "") or ""
161
+ d["identifier"] = prefix + suffix if (prefix or suffix) else None
162
  meta_map[d["id"]] = d
163
 
164
  # Merge with distances, preserving FAISS ranking
 
166
  for fid, dist in zip(ids, distances):
167
  if fid in meta_map:
168
  results.append({"distance": dist, **meta_map[fid]})
169
+
170
+ # Apply scope filter in Python (much faster than SQL WHERE)
171
+ if scope == "url_only":
172
+ results = [r for r in results if r.get("has_url")]
173
+ elif scope == "inaturalist":
174
+ results = [
175
+ r for r in results
176
+ if r.get("has_url")
177
+ and r.get("source_dataset") == "gbif"
178
+ and "iNaturalist" in (r.get("publisher") or "")
179
+ ]
180
+ elif scope == "bioclip2_training":
181
+ results = [r for r in results if r.get("in_bioclip2_training")]
182
+
183
  return results
184
 
185
  @property
 
190
  def total_vectors(self) -> int:
191
  return self.index.ntotal
192
 
193
+ def _load_url_prefixes(self) -> Dict[int, str]:
194
+ """Load url_prefixes table into a dict for fast in-Python URL reconstruction."""
195
+ try:
196
+ rows = self.conn.execute(
197
+ "SELECT prefix_id, prefix FROM url_prefixes"
198
+ ).fetchall()
199
+ prefixes = {row[0]: row[1] for row in rows}
200
+ logger.info(f"Loaded {len(prefixes)} URL prefixes")
201
+ return prefixes
202
+ except duckdb.CatalogException:
203
+ # Legacy DB without url_prefixes table — identifier is a direct column
204
+ logger.info("No url_prefixes table found, using direct identifier column")
205
+ return {}
206
+
207
  def close(self):
208
  self.conn.close()
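The two Python-side steps in `search_service.py`, rebuilding the URL from prefix and suffix and post-filtering by scope, can be sketched in isolation as below. This is a simplified illustration under stated assumptions, not the service code itself: the function names `rebuild_identifier` and `filter_by_scope` are hypothetical, and the sample rows are invented. The column names and filter conditions come from the diff above.

```python
from typing import Any, Dict, List, Optional


def rebuild_identifier(row: Dict[str, Any], prefixes: Dict[int, str]) -> Optional[str]:
    """Join the stored prefix and suffix into a full URL, or None if both are empty."""
    prefix = prefixes.get(row.get("url_prefix_id"), "")
    suffix = row.get("identifier_suffix") or ""
    return (prefix + suffix) or None


def filter_by_scope(results: List[Dict[str, Any]], scope: str) -> List[Dict[str, Any]]:
    """Post-filter ranked results in Python; the input order is preserved."""
    if scope == "url_only":
        return [r for r in results if r.get("has_url")]
    if scope == "inaturalist":
        return [
            r for r in results
            if r.get("has_url")
            and r.get("source_dataset") == "gbif"
            and "iNaturalist" in (r.get("publisher") or "")
        ]
    if scope == "bioclip2_training":
        return [r for r in results if r.get("in_bioclip2_training")]
    return results  # "all" (or an unrecognized scope): no filtering


# Hypothetical sample data for illustration only.
prefixes = {1: "https://content.eol.org"}
row = {"url_prefix_id": 1, "identifier_suffix": "/data/media/a.jpg"}
print(rebuild_identifier(row, prefixes))
```

Because the filter runs after the over-fetched FAISS results are merged, it can return fewer rows than were fetched, which is presumably why the config carries an `over_fetch_factor`.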