Language & seeing

Venus and Adonis, Peter Paul Rubens (mid-1630s)

What do you see in this image? How would you express it in language?

Art speaks to the eyes just as readily as we expect to articulate it in language. Yet this transformation, counter to our impulse, is not trivial. At once, I see Cupid's wings, or Adonis's calf and raised heel, or where Venus's hand clings to Adonis's arm. As we adjust the granularity of our vision, the way we approach the painting changes too.

Vasari aims to bridge the gap between language and sight by changing the way we search for, and interact with, works of art. A piece of art is not its aggregate nor its components. It is both, simultaneously. Vasari invites the viewer to drill into the detail and trace similarities across different works of art.

Brains & function

Vasari is rapidly changing, but here's how it's currently designed. In my opinion, Vasari's value proposition is distributed across the composite system: primarily in the novel approach to indexing (as applied to the art domain) and the user interface on the frontend (what detail is exposed to the user).

Indexing

Existing semantic search implementations take coarse embeddings of entire works. This is reductive, lossy, and approaches all art through the same aperture. To give credit where credit is due, it works remarkably well in some cases, particularly for searching by 'mood' or for cursory exploration.

The Harvesters, Pieter Bruegel the Elder

I believe the disconnect is remarkably simple: not all artworks are the same. This is quite the truism, so, more precisely: different artworks are understood and "queried" (with natural language) in different ways and at different granularities. A composition such as Bruegel's The Harvesters can be understood as a sum (the entire canvas) or by its parts (individual details, and groupings of detail). There are so many granularities by which to approach the canvas that it can almost be likened to the coastline paradox -- your aperture can become continuously finer. Vasari seizes on this observation.

Inspired by two-stage object detection models like ViLD and RegionCLIP, we first identify areas of interest (a la RPNs, or Region Proposal Networks) in a work using Meta's SAM (Segment Anything). This is the "finest" aperture, extracting "entities" from the original piece. We then coalesce entities based on geometric/spatial arrangement into "groups". Finally, "scenes" are extracted similar to entities, but at a much coarser granularity. These three categories (+ the entire painting itself) make up a hierarchy of detail, mimicking the way a human might approach a work (first looking at the sum, then decomposing it into parts). Similar to existing solutions, these proposals are then embedded via CLIP. CLIP pushes both text and image information to the same embedding space, meaning a user can query the indexed corpus with either natural language or similar images/detailed regions.
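As a rough illustration of the hierarchy described above -- entities coalesced into groups, groups sitting under scenes, scenes under the full work -- here is a minimal sketch. The names, fields, and traversal are my own stand-ins, not Vasari's actual code, and the real pipeline would attach a CLIP embedding to each proposal:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the detail hierarchy: work -> scene -> group -> entity.
# In the real pipeline each proposal would also carry a CLIP embedding.

@dataclass
class Proposal:
    level: str                    # "work", "scene", "group", or "entity"
    bbox: tuple                   # (x0, y0, x1, y1) in image coordinates
    children: list = field(default_factory=list)

def flatten(p: Proposal) -> list:
    """Walk the hierarchy coarse-to-fine, the way a viewer decomposes a work."""
    out = [p]
    for child in p.children:
        out.extend(flatten(child))
    return out

# A toy work: one scene containing a group of two entities.
work = Proposal("work", (0, 0, 100, 100), [
    Proposal("scene", (10, 10, 90, 90), [
        Proposal("group", (20, 20, 60, 60), [
            Proposal("entity", (22, 25, 40, 50)),
            Proposal("entity", (45, 30, 58, 55)),
        ]),
    ]),
])

levels = [p.level for p in flatten(work)]
print(levels)  # coarse-to-fine traversal order
```

Flattening the tree coarse-to-fine mirrors the viewing order the paragraph describes: the sum first, then its decomposition into parts.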

As it stands, the indexed corpus has a heavily read-skewed access pattern (index once, read forever). In the future, I'd love to explore incorporating (anonymized) user actions into an RLHF pipeline to improve reranking.
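A full RLHF pipeline is future work, but the simplest version of feedback-informed reranking can be sketched in a few lines. This is purely illustrative -- the function, the blend weight, and the click-rate prior are all my own assumptions, not anything Vasari ships today:

```python
# Hypothetical reranker: blend raw embedding similarity with a logged
# click-through prior. A far simpler stand-in for an RLHF-trained reranker.

def rerank(results, clicks, alpha=0.8):
    """results: list of (doc_id, similarity); clicks: doc_id -> click rate."""
    def score(item):
        doc_id, sim = item
        return alpha * sim + (1 - alpha) * clicks.get(doc_id, 0.0)
    return sorted(results, key=score, reverse=True)

results = [("bruegel-42", 0.91), ("rubens-7", 0.89)]
clicks = {"rubens-7": 0.9, "bruegel-42": 0.1}
print(rerank(results, clicks))  # user feedback promotes the second result
```

With alpha=1.0 the feedback term vanishes and the ranking reduces to pure similarity, which makes the blend easy to A/B test.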

User interface

While the analogy to RPNs attempts to capture the two-stage, spatially local process that drives our indexing, the comparison is not one-to-one. The region proposals are a systematic way for the model to "see", but this detail is also exposed to the user. Once again, Vasari's hope is not only to bridge the gap between sight and language, as other semantic search engines have attempted to do, but to alter the way we interact with paintings in the first place. Exposing these regions to the user encourages them to approach the painting from these different granularities and trace similar motifs across disparate works.

Infrastructure

Most[1] of the infrastructure is built on top of Cloudflare Workers. Vector embeddings are stored in Vectorize, and relational data (work tombstones, extracted detail) in D1 -- R2 is used lightly, as this iteration relies on museum CDNs instead, though this might change in the future. The Workers ecosystem allows us to move fast, try new configurations, and focus on the end interface.

To index such a large corpus, we need a decent amount of compute and parallelization (even if the individual jobs aren't LLM-scale demanding). Modal has been great for both indexing and real-time inference, allowing us to easily spin environments up and tear them down as we iterate. Although the Modal real-time embedding service is simple, it uses a bespoke container environment and thus needs to cold-start (or reserve a GPU) on each invocation unless a container is already warm. This is a sore spot we're trying to improve -- we currently see ~25s cold starts.
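The standard mitigation -- and roughly what a warm container buys you -- is to pay the model-load cost once per container rather than once per request. A generic sketch of that load-once pattern, with a stub loader standing in for pulling CLIP weights onto a GPU (nothing here is Modal-specific):

```python
from functools import lru_cache

LOADS = []  # instrumentation: counts how many times the "model" is loaded

@lru_cache(maxsize=1)
def load_model():
    """Stub for the expensive cold-start step (container boot + CLIP weights)."""
    LOADS.append(1)
    return "clip-model"

def embed(text: str) -> str:
    model = load_model()  # cached after the first call in this container
    return f"{model}:{text}"  # stub embedding

embed("venus")
embed("adonis")
print(len(LOADS))  # the model loaded only once across both requests
```

The same shape applies to a real deployment: hoist model loading into container initialization so only the first request in each container's lifetime eats the ~25s.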

Future & thanks

There's still a lot more to do. At its core, Vasari is comparative: we provide a platform for users to trace threads within and between works. As such, making the user interface seamless is just as important as making search intelligent. We hope to build better ways to place identified images (or details) next to one another. Imagine, for instance, a sustained study of hands in 16th-century Italian art. Simply seeing results similar to one particular painting is not all that helpful -- some might be erroneous, and of the ones you'd want to keep, you can only select one! Sustained study also opens the door to refining similarity queries by adding to a collection. An interface that aggregates and organizes similar details would continuously enrich the similarity query, yielding more accurate results and accelerating research.
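One plausible mechanic for collection-refined search -- an assumption of mine, not Vasari's actual implementation -- is to use the centroid of the collection's embeddings as the query, so each added detail sharpens the search:

```python
import math

# Toy sketch: refine a similarity query by averaging a curated collection's
# embeddings and ranking candidates against the centroid.

def centroid(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy 2-D "embeddings": a collection of hand studies vs. an unrelated landscape.
collection = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0]]
query = centroid(collection)
hand_candidate = [0.95, 0.1]
landscape = [0.0, 1.0]
print(cosine(query, hand_candidate) > cosine(query, landscape))
```

Because the centroid averages out the idiosyncrasies of any single exemplar, erroneous one-off matches get diluted as the collection grows -- exactly the refinement effect described above.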

If we set our sights on curation and collaboration, we must also tread carefully. To crop and extract is already to dissect and deconstruct artworks -- to decontextualize them. As we build toward these new features, we want to be careful not to suggest the malleability or impermanence of the indexed works, but to consider detail for what it is -- a study, a diminution of vision to focus on one thread.


But that doesn't mean Vasari can't be used today! If you find Vasari useful, or have ideas on how to make it better, please reach out -- I'd love to chat.

Interested in using Vasari programmatically? Check out the docs (note: some routes are subject to change).

Want to read more about building Vasari? Check out some writing on my site.

Disclaimer: Perhaps diametrically opposed to the mission here, this website features some AI generated imagery.

  1. Workers AI doesn't currently support the CLIP embedding model, so we don't use it for inference.