projects, tips, and detritus re: digital-humanities, libraries, jekyll, static-sites, iiif, json, yaml, ruby, python, pandas, numpy, minimal-computing, digital-collections, natural language processing, data-visualization, web-scraping, document-parsing, media-materialism, shell-scripting, speculative-fiction, and ________.


Lately, I’ve been all but living in Travis-CI. It runs my tasks, performs my tests, and now pushes my sites out for deployment. It’s almost as though (gasp!) I’ve achieved continuous integration. That being said, it took a ton of trial and error to get here. If this post can spare you some of the rigmarole, my efforts will be affirmed. You might be asking yourself, doesn’t Travis have a deploy provider for GitHub Pages? This is a fair question, especially because the answer is yes. However, the built-in GH Pages deploy is pretty inflexible. For example, I couldn’t get it to ignore...

Overview
1: GitHub is a place for Open Source projects, which are stored in repositories.
2: A fork is a copy of a repository that is all your own and helps you track changes.
3: Markdown is an easy-to-use markup schema for designating headings, lists, links, images and more.
4: Jekyll is software for creating static sites that GitHub can run for you. It is responsible for converting your markdown into your site’s HTML files.
5: Liquid is a templating engine that helps you work with your site’s data and content in more dynamic ways than Markdown. Jekyll...

This post is about a recent proof-of-concept demo I made for adding and storing IIIF-compliant annotation lists without configuring an endpoint or database. Feel free to skip ahead and play with the demo here.
what it is: A workflow for creating and storing annotations on IIIF manifests that leverages the Project Mirador viewer, Jekyll, and Rake.
mirador: serves as the UI for adding and displaying annotations.
jekyll: serves the site and helps to reconcile the lists.
rake tasks: format the annotation json and create a copy of the manifest to reference them.
some custom javascript: retrieves the...

Steps:
Scrape and save metadata with Python
Scrape and download .pdfs to local directory with Python
Convert .pdfs to .tifs with ImageMagick
Extract raw text from .tifs using Tesseract OCR
Connect raw text to metadata (in a static format like csv or in a database)
cia-scraper.python
```python
from bs4 import BeautifulSoup
from os.path import expanduser
import os
from tqdm import tqdm
import pandas as pd
import requests
import urllib2
import time
import lxml

BASEURL = "https://www.cia.gov"
PAGINATE_PATH = "/library/readingroom/collection/scientific-abstracts?page="
PDF_DIR = expanduser("~") + "/cia_pdfs/"
RANGE = 1654  # pages, of 20 docs per page
TEST_RANGE = 10
SKIPPED_FILES = []

def retrieve_file(url,...
```
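The conversion and OCR steps in the list above could be sketched roughly as follows. This is a minimal sketch and not the post's own code: the function names (`convert_command`, `ocr_command`, `run_pipeline`), the directory layout, and the 300 dpi density are all assumptions, and it presumes the ImageMagick `convert` and `tesseract` executables are on the PATH.

```python
import subprocess
from pathlib import Path


def convert_command(pdf_path, tif_dir, density=300):
    """Build the ImageMagick command that rasterizes one .pdf to a .tif.

    The density value is an assumption; adjust it for your scans.
    """
    pdf_path = Path(pdf_path)
    tif_path = Path(tif_dir) / (pdf_path.stem + ".tif")
    return ["convert", "-density", str(density), str(pdf_path), str(tif_path)]


def ocr_command(tif_path, text_dir):
    """Build the Tesseract command; Tesseract appends .txt to the output base."""
    tif_path = Path(tif_path)
    out_base = Path(text_dir) / tif_path.stem
    return ["tesseract", str(tif_path), str(out_base)]


def run_pipeline(pdf_path, tif_dir, text_dir):
    """Run both steps for one document and return the expected .txt path."""
    convert_cmd = convert_command(pdf_path, tif_dir)
    subprocess.run(convert_cmd, check=True)
    tif_path = convert_cmd[-1]
    subprocess.run(ocr_command(tif_path, text_dir), check=True)
    return Path(text_dir) / (Path(tif_path).stem + ".txt")
```

Keeping the command construction separate from `subprocess.run` makes it possible to inspect or log the exact commands before running them over a large batch of files.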

As part of my gig as the DH developer for Columbia University Libraries, I’ve become the data steward of the Foreign Relations of the United States (FRUS) collection for the History Lab project. This means that, as newly declassified and processed FRUS volumes are released by the State Department as XML files, it’s my job to re-process the collection with the newly added volumes, ingest the processed data into our MySQL database, and make sure the metadata connecting the volumes to History_Lab’s own bleeding-edge topic modeling and named-entity recognition (NER) is preserved for further research. In practice, this...

I’ve been making a lot of Jekyll static sites lately, and the more work I do, the more it looks like I’ll be making many more. Why? Because they’re quickly proving to be sustainable, extendable, and super powerful. (For more on the potential of Jekyll, check out my project jekyll-wax.) With more sites to manage and increasingly complex components to maintain, I’ve shifted my focus to automating acceptance tests with continuous integration. In context, this means that every time I make a branch commit to my Jekyll GitHub repository:
Gemnasium will evaluate my dependencies
Travis-CI will build my site from...

What is a headless feature test? The term “headless” refers to software capable of working without a GUI. Accordingly, headless feature tests are programmatic actions that simulate in-browser user interactions with site features (without needing to actually open a GUI browser!) and then return results on the success (or failure) of that interaction. In simpler terms: they’re programs that go test your features for you, and come back bearing some good or not-so-good news. Headless feature tests (like any unit tests) are an important part of any Continuous Integration (CI) architecture. If you’re new to CI and want to figure...

This post is part 4 of 4 in a series. Feel free to skip around to: part 1: the task, part 2: data transformation, or part 3: the site.
epilogue
The demo! The mostly finished demo has directories of plays, productions, performances, authors, performers, characters, kashira, scenes, image tags, slide images, and image albums with individual layouts displaying and linking object data together. It is navigable through the above directory listings, through several dynamic search boxes running client-side Lunrjs, and via clickable D3js data visualizations. It handles relatively massive image sets by implementing lazy loading in a jQuery carousel. To learn...

This post is part 3 of 4 in a series. Feel free to skip around to: part 1: the task, part 2: data transformation, or part 4: epilogue.
Act 3: The site emerges
iv. Ingest + generate
In: JSON
Tools: Jekyll / wax_tasks gem
Once I had my data packaged and ready in individual JSON array files (e.g. authors.json), I needed to create a Jekyll collection for each type, and ‘split’ the array of objects into individual markdown pages (e.g. /_authors/1.md) with YAML as the pages’ front matter:
```yaml
---
dates: fl. 1741-1767
id: '1'
label_eng: Asada Icchō
label_ka: 浅田一鳥
play_id:...
```
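The ‘split’ step described above can be sketched in a few lines of Python. To be clear, this is not how the wax_tasks gem does it (that is Ruby); it is a minimal illustration under stated assumptions: `front_matter` and `split_collection` are hypothetical names, each record is assumed to carry an `id` field, and values are written as JSON scalars, which YAML parsers generally accept, to avoid needing a YAML library.

```python
import json
import tempfile
from pathlib import Path


def front_matter(record):
    """Render one record as a front matter block delimited by '---' lines.

    Values are emitted as JSON scalars/arrays (an assumption of this sketch),
    so strings come out double-quoted rather than in bare YAML style.
    """
    lines = ["---"]
    for key, value in record.items():
        lines.append(f"{key}: {json.dumps(value, ensure_ascii=False)}")
    lines.append("---")
    return "\n".join(lines) + "\n"


def split_collection(json_path, site_dir):
    """Write one page per record to <site_dir>/_<collection>/<id>.md."""
    json_path = Path(json_path)
    records = json.loads(json_path.read_text(encoding="utf-8"))
    target = Path(site_dir) / f"_{json_path.stem}"
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for record in records:
        page = target / f"{record['id']}.md"
        page.write_text(front_matter(record), encoding="utf-8")
        written.append(page)
    return written


# Usage: split a one-record authors.json inside a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "authors.json"
    sample = [{"id": "1", "label_eng": "Asada Icchō", "label_ka": "浅田一鳥"}]
    source.write_text(json.dumps(sample, ensure_ascii=False), encoding="utf-8")
    for page in split_collection(source, tmp):
        print(page.name)
        print(page.read_text(encoding="utf-8"))
```

Naming the output directory after the JSON file's stem mirrors the authors.json to /_authors/ mapping in the example above.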

This post is part 2 of 4 in a series. Feel free to skip around to: part 1: the task, part 3: the site, or part 4: epilogue.
Act 2: Data transformation montage
After taking a detour prototyping with Google Lovefield + IndexedDB and making a pitstop to play with GraphQL, I finally settled on a simpler plan: export each object type as an array of JSON records, and have each record point to its relationships via arrays of IDs. With this thought, the montage began:
scene i. plan + tidy
In: MySQL dump
Tools: any ol’ erd / OpenRefine...
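The "records that point to their relationships via arrays of IDs" plan might look something like the sketch below. The table and field names here (`author_play`, `author_id`, `play_id`, `play_ids`) and the sample rows are hypothetical stand-ins, not the actual Bunraku schema, and `attach_ids` is an illustrative helper rather than anything from the post.

```python
import json
from collections import defaultdict


def attach_ids(records, join_rows, own_key, other_key, field):
    """Return copies of records with a sorted array of related IDs.

    join_rows stands in for rows of a many-to-many join table; records
    with no matching rows get an empty array.
    """
    related = defaultdict(set)
    for row in join_rows:
        related[row[own_key]].add(row[other_key])
    return [{**rec, field: sorted(related[rec["id"]])} for rec in records]


# Hypothetical rows, shaped like a flattened relational export.
authors = [{"id": "1", "label_eng": "Author A"}, {"id": "2", "label_eng": "Author B"}]
author_play = [
    {"author_id": "1", "play_id": "7"},
    {"author_id": "1", "play_id": "3"},
]

authors_out = attach_ids(authors, author_play, "author_id", "play_id", "play_ids")
print(json.dumps(authors_out, indent=2))
```

Each exported array stays flat and self-describing, so a client can follow a `play_ids` entry into the plays array without any server-side join.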

This post is part 1 of 4 in a series. Feel free to skip around to: part 2: data transformation, part 3: the site, or part 4: epilogue.
Enter: Bunraku
A few months ago, I was given access to a MySQL database with 27 tables of data on Bunraku, or Japanese puppet theater. The data consists primarily of digitized images from the Barbara Curtis Adachi Bunraku Collection here at Columbia University Libraries, but it also contains a ton of relational data on the Bunraku community as Barbara encountered it, which is to say, as a rich network of performers, plays, productions,...

Note: The following preliminary post has been superseded by a fully-fledged project! For more, check out jekyll-wax.
Objective 1: Create a set of proof-of-concept Jekyll sites that showcase the extent of what the static site generator can do (i.e. without complex server scripts or databases!) and push those limits.
Objective 2: Apply lessons from these sites in prototyping an Omeka-like, modular, minimal computing multitool using Jekyll for digital exhibitions, journals, and blogs, complete with a menu of components ranging from lazy-loading carousels to IIIF manuscript manifests, dynamic client-side search to data visualizations.
v0.1 / codename: paper gods
what it is: digital...

Before I start explaining myself and my hacky process, I should say that there is a WordPress plugin for exporting to Jekyll. It might work amazingly well, automatically doing everything I needed it to and more. I say ‘might’ because I genuinely don’t know; I didn’t get a chance to try it, which I’ll blame on the outdated PHP (5.3) that the WP server was running. (The plugin requires >= 5.5.) So, if you’re like me and can’t get the plugin to work (for whatever your reason might be), you’re going to need a Plan B. Depending on the relative...

The goal: Produce a d3.js force-directed graph that visualizes a collection in your Jekyll site via tag clusters.
How to do it:
1. Set up your Jekyll collection. You can find tips for how to do that here.
2. Add your tag(s) to each collection page’s YAML front matter. For example:
```yaml
---
layout: default
title: 土地 正神
tags:
  - Earth
  - God of Wealth
---
```
3. Write some Liquid that will translate your collection data/pages into the JSON file that drives the D3 visualization. This is the only tricky part of the process and, as such, will constitute pretty much the entirety of...
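The post does this translation in Liquid; as a language-neutral illustration of the target data shape, here is a minimal Python sketch. It assumes the common `nodes`/`links` layout in which links refer to nodes by `id` (which on the D3 side requires setting the `id` accessor on `d3.forceLink()`). The function name `tag_graph`, the `page:`/`tag:` prefixes, and the `group` field are assumptions made for this example.

```python
import json


def tag_graph(pages):
    """Build a nodes/links dict: one node per page, one per distinct tag.

    IDs are prefixed so a page title can never collide with a tag name.
    """
    nodes, links, seen_tags = [], [], set()
    for page in pages:
        page_id = "page:" + page["title"]
        nodes.append({"id": page_id, "group": "page"})
        for tag in page.get("tags", []):
            tag_id = "tag:" + tag
            if tag_id not in seen_tags:
                seen_tags.add(tag_id)
                nodes.append({"id": tag_id, "group": "tag"})
            links.append({"source": page_id, "target": tag_id})
    return {"nodes": nodes, "links": links}


# The first page mirrors the front matter example; the second is hypothetical.
pages = [
    {"title": "土地 正神", "tags": ["Earth", "God of Wealth"]},
    {"title": "Hypothetical page", "tags": ["Earth"]},
]
graph = tag_graph(pages)
print(json.dumps(graph, ensure_ascii=False, indent=2))
```

Because both pages share the Earth tag, they link to the same tag node, which is what pulls them into a cluster in a force layout.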