Table: hypothesis_search - Query Hypothesis Searches using SQL
Hypothesis is a service that allows users to annotate web pages and PDFs, fostering conversations within the text. It is used by educators, journalists, publishers, and researchers to anchor discussions, express opinions, and share insights directly on top of digital content. Hypothesis Searches are a resource within this service that allows users to query and retrieve these annotations.
Table Usage Guide
The hypothesis_search
table provides insights into Hypothesis Searches within the Hypothesis service. As a researcher or educator, explore annotation-specific details through this table, including the text, tags, and user who made the annotation. Utilize it to uncover information about annotations, such as those with specific tags, the users who made them, and their corresponding details.
Important Notes
- Searches for Hypothesis annotations matching a query. If you authenticate you'll search the Hypothesis public layer plus all your private annotations, and annotations in private groups you belong to. If you don't authenticate you'll just search the public layer.
Examples
Find 10 recent notes, by judell
, that have tags
Explore the recent activity of a specific user, 'judell', to identify instances where they have added tags to their notes. This is useful for understanding their areas of interest and their tagging habits.
select created, uri, tags, group_id, usernamefrom hypothesis_searchwhere query = 'user=judell' and jsonb_array_length(tags) > 0order by created desclimit 10;
select created, uri, tags, group_id, usernamefrom hypothesis_searchwhere query = 'user=judell' and json_array_length(tags) > 0order by created desclimit 10;
Find notes tagged with both media
and review
Explore the instances where notes are tagged with both 'media' and 'review', allowing you to focus on specific areas of interest. This can be particularly useful when you're looking for overlaps in topics or themes. NOTE This matches notes with:
- "media" and "review"
- "social media" and "review"
- "social media" and "peer review"
select uri, tagsfrom hypothesis_searchwhere query = 'tag=media&tag=review';
select uri, tagsfrom hypothesis_searchwhere query = 'tag=media&tag=review';
Find notes tagged with social media
and peer review
Explore the instances where notes are tagged with both 'social media' and 'peer review', which can be useful for understanding the intersection of these two topics in your data.
select uri, tagsfrom hypothesis_searchwhere query = 'tag=social+media&tag=peer+review';
select uri, tagsfrom hypothesis_searchwhere query = 'tag=social+media&tag=peer+review';
Find notes on the New York Times home page, by month
Explore the frequency of annotations made on the New York Times homepage in a given month. This can help you understand the level of engagement or significant events during specific periods.
with data as ( select substring( created from 1 for 7 ) as month, uri from hypothesis_search where query = 'uri=https://www.nytimes.com')select month, count(*)from datagroup by monthorder by count desc;
with data as ( select substr(created, 1, 7) as month, uri from hypothesis_search where query = 'uri=https://www.nytimes.com')select month, count(*)from datagroup by monthorder by count(*) desc;
Find URLs and note counts on articles annotated in the Times' Opinion section
Explore which articles in the Times' Opinion section have been annotated and understand the frequency of these annotations. This can be useful to identify the most discussed or controversial articles.
with data as ( select uri from hypothesis_search where query = 'wildcard_uri=https://www.nytimes.com/*/opinion/*')select count(*), urifrom datagroup by uriorder by count desc;
with data as ( select uri from hypothesis_search where query = 'wildcard_uri=https://www.nytimes.com/*/opinion/*')select count(*), urifrom datagroup by uriorder by count(*) desc;
Find page notes (i.e. notes referring to the URL, not a selection) on www.example.com
Determine the instances where notes were made on the entire webpage of www.example.com, rather than a specific selection. This is useful for identifying overall feedback or comments about the webpage as a whole.
with target_keys_to_rows as ( select id, username, created, text, uri, target, jsonb_object_keys(target -> 0) as target_key from hypothesis_search where query = 'uri=https://www.example.com' group by id, username, created, text, uri, target order by id)select distinct 'https://hypothes.is/a/' || id as link, *from target_keys_to_rowswhere target_key = 'Selector' and target -> 0 ->> 'Selector' is nullorder by created desc;
with target_keys_to_rows as ( select id, username, created, text, uri, target, json_each(target, '$[0]') as target_key from hypothesis_search where query = 'uri=https://www.example.com' group by id, username, created, text, uri, target order by id)select distinct 'https://hypothes.is/a/' || id as link, *from target_keys_to_rowswhere target_key.key = 'Selector' and json_extract(target, '$[0].Selector') is nullorder by created desc;
Find notes, in the Times' Opinion section, that quote selections matching "covid"
Discover the segments that quote selections matching a specific term within the Times' Opinion section. This is useful for exploring user-generated annotations and comments on current events or trending topics, such as 'covid'.
select 'https://hypothes.is/a/' || id as link, uri, username, created, exactfrom hypothesis_searchwhere query = 'wildcard_uri=https://www.nytimes.com/*/opinion/*' and exact ~* 'covid'order by created desc;
select 'https://hypothes.is/a/' || id as link, uri, username, created, exactfrom hypothesis_searchwhere query = 'wildcard_uri=https://www.nytimes.com/*/opinion/*' and exact LIKE '%covid%'order by created desc;
Find annotated GitHub repos, join with info from GitHub API
Explore annotated GitHub repositories and gain insights into their associated details from the GitHub API. This can help you identify patterns or trends in the data, enhancing your understanding of these repositories. NOTE This will take a minute or so. Once it's done, it's cached for 5 minutes, or another duration you can specify, so queries that touch the same data are instantaneous.
with annotated_urls as ( select regexp_matches(uri, 'github.com/([^/]+)/([^/]+)') as match, * from hypothesis_search where query = 'wildcard_uri=http://github.com/*&limit=1000' order by uri),and_repos as ( select *, match [ 1 ] || '/' || match [ 2 ] as repository_full_name from annotated_urls order by uri)select distinct g.name, g.description, g.owner_login, r.uri, r.id, r.usernamefrom github_repository g join and_repos r on g.full_name = r.repository_full_name;
with annotated_urls as ( select uri, * from hypothesis_search where query = 'wildcard_uri=http://github.com/*&limit=1000' order by uri),and_repos as ( select *, uri as repository_full_name from annotated_urls order by uri)select distinct g.name, g.description, g.owner_login, r.uri, r.id, r.usernamefrom github_repository g join and_repos r on g.full_name = r.repository_full_name;
Order annotated GitHub repos by count of annotations
This query is used to analyze the frequency of annotations made by users on different GitHub repositories. It can be beneficial to identify which repositories are receiving the most attention or interaction, potentially indicating areas of high interest or activity.
with annotated_urls as ( select id, created, uri, username, regexp_matches(uri, 'github.com/([^/]+)/([^/]+)') as match from hypothesis_search where query = 'wildcard_uri=http://github.com/*&limit=100' order by uri),and_repos as ( select a.*, a.match [ 1 ] || '/' || a.match [ 2 ] as repository_full_name from annotated_urls a order by a.uri),joined as ( select distinct g.name, g.owner_login, r.uri, r.id, r.username from github_repository g join and_repos r on g.full_name = r.repository_full_name)select count(j.*), j.name, j.owner_login, j.urifrom joined jgroup by j.name, j.owner_login, j.uriorder by count desc;
Error: The corresponding SQLite query is unavailable.
Find URIs with conversational threads spanning more than one day
with thread_data as ( select uri, count(*), min(created) as first, max(created) as last, sum(jsonb_array_length(refs)) as refs, array_agg(distinct username) as thread_participants from hypothesis_search where query = 'limit=1000' group by uri)select uri, count as annos, refs, first, last, date(last) - date(first) as days, thread_participantsfrom thread_datawhere date(last) - date(first) > 0 and refs is not null
Error: The corresponding SQLite query is unavailable.
Fetch the most recent 10000 annotations
Determine the areas in which the most recent 10,000 annotations exist. This is useful for gaining insights into the latest trends and patterns in your data, especially when dealing with large volumes of annotations that are continuously accumulating.
select *from hypothesis_searchwhere query = 'limit=10000';
select *from hypothesis_searchwhere query = 'limit=10000';
NOTE When you use limit
in the query string, it means: If there are limit
annotations that match your query, fetch all of them. They will be stored in the Steampipe cache for 5 minutes by default, or longer if you add an options
argument to your hypothesis.spc
file and adjust the cache_ttl
to a longer duration.
options "connection" { cache = "true" cache_ttl = 300 # default 5 minutes}
Merging historical and live data
Suppose you have 500,000 annotations and are continuing to accumulate them at the rate of several thousand per day. (This is a real scenario.) You could stash the 500,000 in a table, or in a materialized view, like so:
create materialized view my_hypothesis_annotations as ( select * from hypothesis_search where query = 'group=my_group&limit=500000') with data;
Error: SQLite does not support materialized views directly.
You could then merge those with live data like so.
with historical as ( select * from my_hypothesis_annotations),new as ( select * from hypothesis_search new where query = 'group=my_group&limit=5000' and not exists ( select * from historical where historical.id = new.id )),select *from historicalunionselect *from new
Error: The corresponding SQLite query is unavailable.
Schema for hypothesis_search
Name | Type | Operators | Description |
---|---|---|---|
_ctx | jsonb | Steampipe context in JSON form, e.g. connection_name. | |
created | text | The creation date of the annotation. | |
document | jsonb | An element that contains the title and maybe other metadata. | |
exact | text | The text of the selection (aka quote) to which the annotation anchors. | |
group_id | text | The annotation's group: __world__ or a private group id. | |
id | text | The annotation id, works with https://hypothes.is/a/{ID}. | |
query | text | = | The search query. |
refs | jsonb | IDs forming the reference chain to which this annotation belongs. | |
tags | jsonb | Tags on the annotation, as a JSONB array of strings. | |
target | jsonb | The selectors that define the document selection to which the annotation anchors. | |
text | text | Textual body of the annotation, as MarkDown/HTML. | |
title | text | The HTML doctitle of the annotated URL. | |
updated | text | The last update date of the annotation. | |
uri | text | URL of the annotated resource. | |
username | text | The Hypothesis username of the person who created the annotation. |
Export
This table is available as a standalone Exporter CLI. Steampipe exporters are stand-alone binaries that allow you to extract data using Steampipe plugins without a database.
You can download the tarball for your platform from the Releases page, but it is simplest to install them with the steampipe_export_installer.sh
script:
/bin/sh -c "$(curl -fsSL https://steampipe.io/install/export.sh)" -- hypothesis
You can pass the configuration to the command with the --config
argument:
steampipe_export_hypothesis --config '<your_config>' hypothesis_search