You've arrived at Everything is Miscellaneous's blog page that was active 2008-2012. You'll find links to some useful information about the book and its subject matter, but don't be surprised by some dead links, etc.
To order a copy, go to your local bookstore, or Amazon, etc.
For information about me, David Weinberger, click here.
To visit the page underneath this text, click here.

Thanks - David Weinberger

I’m finding Google Labs’ Dataset Publishing Language (DSPL) pretty fascinating.

Upload a set of data, and it will do some semi-spiffy visualizations of it. (As Apryl DeLancey points out, Martin Wattenberg and Fernanda Viegas now work for Google, so if they’re working on this project, the visualizations are going to get much better.) More important, the data you upload is now publicly available. And, more important than that, the site wants you to upload your data in Google’s DSPL format. DSPL aims at getting more metadata into datasets, making them more understandable, integrate-able, and re-usable.

So, let’s say you have spreadsheets of “statistical time series for unemployment and population by country, and population by gender for US states.” (This is Google’s example in its helpful tutorial.)

  • You would supply a set of concepts (“population”), each with a unique ID (“pop”), a data type (“integer”), and explanatory information (“name=population”, “definition=the number of human beings in a geographic area”). Other concepts in this example include country, gender, unemployment rate, etc. [Note that I’m not using the DSPL syntax in these examples, for purposes of readability.]

  • For concepts that have some known set of members (e.g., countries, but not unemployment rates), you would create a table — a spreadsheet in CSV format — of entries associated with that concept.

  • If your dataset uses one of the familiar types of data, such as a year, geographical position, etc., you would reference the “canonical concepts” defined by Google.

  • You create a “slice” or two, that is, “a combination of concepts for which data exists.” A slice references a table that consists of concepts you’ve already defined and the pertinent values (“dimensions” and “metrics” in Google’s lingo). For example, you might define a “countries slice” table that on each row lists a country, a year, and the country’s population in that year. This table uses the unique IDs specified in your concepts definitions.

  • Finally, you can create a dataset that defines topics hierarchically so that users can more easily navigate the data. For example, you might want to indicate that “population” is just one of several characteristics of “country.” Your topic dataset would define those relations. You’d indicate that your “population” concept is defined in the topic dataset by including the “population topic” ID (from the topic dataset) in the “population” concept definition.

When you’re done, you have a data set you can submit to Google Public Data Explorer, where the public can explore your data. But, more important, you’ve created a dataset in an XML format that is designed to be rich in explanatory metadata, is portable, and is able to be integrated into other datasets.
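To make the concept/slice relationship a bit more concrete, here is a minimal Python sketch of the steps above. It is emphatically not DSPL syntax (the real format is XML plus CSV tables) and not Google's implementation. The class names `Concept` and `Slice`, the `rows` method, the topic ID `population_topic`, and the table values are all invented for illustration, and "year" is defined locally here rather than pulled from Google's canonical concepts.

```python
import csv
import io
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Concept:
    """One defined term: a unique ID, a data type, and explanatory metadata."""
    id: str
    data_type: str
    name: str
    definition: str = ""
    topic: Optional[str] = None  # hypothetical ID pointing into a topic hierarchy


@dataclass
class Slice:
    """A combination of concepts for which data exists, backed by a CSV table."""
    id: str
    dimensions: List[str]  # concept IDs that identify a row (country, year)
    metrics: List[str]     # concept IDs that are measured (pop)
    table_csv: str

    def rows(self, concepts: Dict[str, Concept]) -> List[dict]:
        """Parse the table, checking its columns against the defined concepts."""
        expected = self.dimensions + self.metrics
        unknown = [cid for cid in expected if cid not in concepts]
        if unknown:
            raise ValueError(f"undefined concepts: {unknown}")
        reader = csv.DictReader(io.StringIO(self.table_csv))
        columns = reader.fieldnames or []
        if sorted(columns) != sorted(expected):
            raise ValueError(f"table columns {columns} do not match {expected}")
        parsed = []
        for row in reader:
            # Use each concept's declared data type to convert the raw CSV text.
            for cid in expected:
                if concepts[cid].data_type == "integer":
                    row[cid] = int(row[cid])
            parsed.append(row)
        return parsed


concepts = {
    c.id: c
    for c in [
        Concept("country", "string", "Country"),
        Concept("year", "integer", "Year"),
        Concept("pop", "integer", "Population",
                "The number of human beings in a geographic area",
                topic="population_topic"),
    ]
}

# Made-up placeholder values, not real statistics.
countries_slice = Slice(
    id="countries_slice",
    dimensions=["country", "year"],
    metrics=["pop"],
    table_csv="country,year,pop\nAA,2000,100\nAA,2001,110\nBB,2000,50\n",
)

rows = countries_slice.rows(concepts)
```

The point of the sketch is only that the table's column headers are the concept IDs, so the metadata (type, name, definition, topic) travels with the numbers instead of living in someone's head.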

Overall, I think this is a good thing. But:

  • While Google is making its formats public, and even its canonical definitions are downloadable, DSPL is “fully open” for use, but fully Google’s to define. Having the 800-lb gorilla defining the standard is efficient and provides the public platform that will encourage acceptance. And because the datasets are in XML, Google Public Data Explorer is not a roach motel for data. Still, it’d be nice if we could influence the standard more directly than via an email-the-developers text box.

  • Defining topics hierarchically is a familiar and useful model. I’m curious about the discussions behind the scenes about whether to adopt or at least enable ontologies as well as taxonomies.

  • Also, I’m surprised that Google has not built into this standard any expectation that data will be sourced. Suppose the source of your US population data is different from the source of your European unemployment statistics? Of course you could add links into your XML definitions of concepts and slices. But why isn’t that a standard optional element?

  • Further (and more science fictional), it’s becoming increasingly important to be able to get quite precise about the sources of data. For example, in the library world, the bibliographic data in MARC records often comes from multiple sources (local cataloguers, OCLC, etc.) and it is turning out to be a tremendous problem that no one kept track of who put which datum where. I don’t know how or if DSPL addresses the sourcing issue at the datum level. I’m probably asking too much. (At least Google didn’t include a copyright field as standard for every datum.)

Overall, I think it’s a good step forward.

Well, here’s an application of some of the ideas in Everything is Miscellaneous that I wasn’t expecting: The US GAAP Taxonomy. A post at the XBRL Business Information Exchange says:

The US GAAP Taxonomy was built by the accounting standards setter, the FASB. It was built by accountants. It is a consensus-based product. Not one SEC XBRL filer uses the US GAAP Taxonomy as is to file with the SEC. Every SEC [filer] reorganizes the US GAAP Taxonomy.

But the US GAAP Taxonomy is not built to be reorganized. The structure of the taxonomy is more like a book. Can the US GAAP Taxonomy be reorganized? Of course it can. But it is certainly not optimized to allow for reorganization and reorganization is not even mentioned in the design characteristics. As such, it will cost more and be harder to create and maintain these reorganizations.

So how do you make it easier to reorganize? Many smaller pieces which can be put together as needed is vastly easier for a computer to deal with than having one large piece and trying to break that piece apart. That is one example of what can be done. Another is communicating the metadata which exists in the taxonomy, for example the information modeling patterns employed. A third is to make the existing metadata real metadata, rather than burying it in the labels of the concepts. Another is to add more metadata.

The post points out that it’s not that everything about that taxonomy should be thrown into a big pile. There are key data points required by law and to achieve financial integrity. Still, this is not a place I would have thought miscellanizing would help. It seems, however, that I may well be happily wrong.

Great Ski Holidays lets you search for a place you want to go skiing using a faceted system, so you can specify tags such as alpine, beginner, nightlife, and spa. (For my ideal ski resort, the tags would be: free, low, and indoors.) It seems well done, but the thing I really like about it is that you can choose which authorities you want to use: ski review sites, ski resorts & club sites, trade sites & tour operators, and (coming soon) reader reviews.

The site started out as a demo of “Authority Driven Facet Tags” by an enterprise search agency called Metaphor Search. It went so well that they opened it up to the Web public, although it still shows some signs of its demo origins, including some typos, etc. It just adds to the charm.

One of their blog posts actually credits Everything Is Miscellaneous as one of the inspirations, which makes me happy. The post says part of the impetus for developing a faceted system with configurable authorities was experiencing the difficulty of coming up with a single, uncontested geographical classification for the Maldives: Asia? Indian Ocean? And it got worse when they tried to come up with a taxonomy of destination types. So, rather than try to figure out what each user’s unexpressed taxonomy is, they decided to let the user decide which authorities to trust and use those authorities’ ways of divvying up the world. Clever, and not unlike the multi-taxonomy approach taken by some species-of-the-world sites.
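For the curious, the "pick your authorities" idea can be sketched in a few lines of Python. This is a guess at the general shape, not Metaphor Search's actual implementation: the `search` function, the authority keys, and the resort names and tags are all hypothetical, and I'm assuming that a resort's tags are simply the union of what the selected authorities say about it.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical data: authority -> resort -> tags that authority assigns.
TAGS_BY_AUTHORITY: Dict[str, Dict[str, Set[str]]] = {
    "review_sites": {
        "Resort A": {"alpine", "nightlife"},
        "Resort B": {"beginner", "spa"},
    },
    "resort_sites": {
        "Resort A": {"alpine", "spa"},
        "Resort B": {"beginner"},
    },
}


def search(wanted: Iterable[str], authorities: Iterable[str],
           tags_by_authority: Dict[str, Dict[str, Set[str]]] = TAGS_BY_AUTHORITY,
           ) -> List[str]:
    """Return resorts that carry every wanted tag, counting only the
    tags assigned by the authorities the user has chosen to trust."""
    merged: Dict[str, Set[str]] = {}
    for authority in authorities:
        for resort, tags in tags_by_authority.get(authority, {}).items():
            # Assumption: trusted authorities' tags are combined by union.
            merged.setdefault(resort, set()).update(tags)
    wanted_set = set(wanted)
    return sorted(r for r, tags in merged.items() if wanted_set <= tags)


only_reviews = search({"alpine", "spa"}, ["review_sites"])
only_resorts = search({"alpine", "spa"}, ["resort_sites"])
both = search({"beginner", "spa"}, ["review_sites", "resort_sites"])
```

The same query gives different answers depending on whose tags you trust, which is the whole point: no single taxonomy has to win.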

Visualizing Wikipedia deletions

Notabilia has visualized the hundred longest discussion threads at Wikipedia that resulted in the deletion of an article and the hundred that did not. The visualized threads take on shapes depending on whether the discussion was controversial, swinging, or unanimous. For those whose brains can process visualized information (as mine cannot), you will undoubtedly learn much. For the rest of us: Oooooh, pretty!

They’ve posted some other analyses as well. For example, “The analysis [pdf] of a large sample of AfD discussions (200K discussions that took place between November 2002 and July 2010) suggests that the largest part of these discussions ends after only a few recommendations are expressed.” And: “Delete decisions tend to be fairly unanimous. In contrast, we found many Keep decisions resulting from a discussion that leaned towards deletion…”

Near- and far-in-laws

Keith Dawson has a suggestion for disambiguating “in-law,” which can refer to (for example) your wife’s brother or your sister’s husband. He’s got near-in-laws and far-in-laws. Very handy.

And it raises the question of why English doesn’t already have an easy way of making this distinction. Are we so binary about our family relations that we just don’t give a damn-in-law?

[2b2k] Citizen scientists

Alex Wright has an excellent article in the New York Times today about the great work being done by citizen scientists. (Alex follows up in his blog with some more worthy citizen science efforts.)

Alex, who I met a few years ago at a conference because we had written books on similar topics — his excellent Glut and my Everything Is Miscellaneous — quotes me a couple of times in the article. The first time, I say that the people who are gathering data and classifying images “are not doing the work of scientists.” Some in the comments have understandably taken issue with that characterization. It’s something I deal with at some length in Too Big to Know. Because of the curtness of the comment, it could easily be taken as dismissive, which was not my intent; these volunteers are making a real contribution, as Alex’s article documents. But, in many of the projects Alex discusses (and that I discuss in my manuscript), the volunteers are doing work for which they need no scientific training. They are doing the work of science — gathering data certainly counts — but not the work of scientists. But that’s what makes it such an exciting time: You don’t need a degree or even training beyond the instructions on a Web page, and you can be part of a collective effort that advances science. (Commenter kc I think makes a good argument against my position on this.)

FWIW, the origins of my participation in the article were a discussion with Alex about why in this age of the amateur it’s so hard to find the sort of serious leap in scientific thinking coming from amateurs. Amateurs drove science more in the 19th century than now. Of course, that’s not an apples-to-apples comparison because of the professionalization of science in the 20th century. Also, so much of basic science now requires access to equipment far too expensive for amateurs. (Although that’s scarily not the case for gene sequencers.)

Ordering your video store

Roger Beebe has posted a fascinating, polemical explanation of the thinking behind the way he physically arranged his Gainesville, Florida video store. He takes educating his visitors as an obligation of the layout. Here’s an excerpt:

There’s a pedagogy to this arrangement, and it’s clearly making a case for a certain kind of engagement with the cinema and with film history. The prevailing first-order logic is one of national cinemas as a way of thinking about large groups of films together. Within those national cinemas, there’s a decidedly auteurist bent, privileging works by significant directors (toward the start of each section) followed by non-auteurist works from those regions. US films get further important subdivisions based on the mode of production and circulation; they are subdivided into Sub-indie (underground, avant garde, etc.), Independent (following the standard nomenclature of that fraught area), and Hollywood. Hollywood is then subdivided further between auteurist works (with a breakdown stretching from Woody Allen to Robert Zemeckis) and non-auteurist works that are then subdivided by genre.

An additional strategy—and this may be more ideological than pedagogical—is the arrangement of sections from the front of the store to the rear. The store has a narrow central corridor with small alcoves of videos along each side. We consciously front-loaded the store with documentaries on one side and our Sub-indie section on the other. The more mainstream Hollywood fare is pushed much further back in the store, forcing anyone seeking out those titles to run the gauntlet past all of these alternative cinemas.

Roger makes reference to Everything Is Miscellaneous throughout, a book about which he has at best mixed feelings. He understandably takes it as an unabashed, “boosterish” argument in favor of the multiple categorizations and sortings that the digitizing and networking of information enables. But, I disagree with part of his interpretation of the book. I did not intend to argue against careful organization of physical goods (the prologue waxes enthusiastic about Staples’ store layout) or against the value of expertly curated collections. Rather, we benefit on the Web from having expert curations as well as curations by multiple, multiple experts, both professional and amateur. Mortimer Adler’s Great Books would have been a welcome addition to the Web, but it would have been only one of many “playlists.” The fact that Adler’s list would have had to compete with those of UnNamed_Teenager at Amazon is a serious problem on the Net, but it’s balanced by the unavoidable harm done during the Reign of Paper by the impact Adler’s list had on which books were actually printed and placed in libraries.

Of course, I’m responsible for not having communicated my intentions adequately.

The Boston Public Library has put 15,827 photos into Flickr, using the least restrictive Creative Commons licenses possible. Tom Blake, the Digital Projects Manager at the BPL, reports that “the images on our Flickr account have been viewed collectively over 1.6 million times since we launched the account in March of 2008.”

The photos I dipped into were well marked up with metadata, and tagged. (Their new collection is called “Misc.” :) Some great stuff there. E.g., if you’re interested in the early Red Sox, try these. Or stereopticon images.


[The next day:] Jon Udell, in a tweet [twitter: judell], points to Keene Public Library’s recent Flickr uploading. “KPL nicely models photo curation,” Jon tweets.

Here’s a hypothesis that emerged when talking with Henry Copeland [twitter:hc] about a panel at Web2.0 he’s leading:

Previous media have generally gone through a period in which their navigational systems were unsettled, but then developed stable systems that lasted for at least a couple of generations. Libraries certainly did. Television spawned tables of channels, times, and shows that are still in use today. Newspapers developed a semantic layout and use of fonts that is so standard that for generations all newspapers have looked and worked basically the same.

So, will the Internet’s navigation systems follow the same pattern? Will they settle down so that over the course of several generations, the Net will look and work basically the same? Even within particular functional areas, say, search engines? Or will we be constantly innovating the basic navigational systems of the Net? Or, will some systems become settled — say, search engines with text entry boxes (and their oral equivalent) and lists of results — while there is wild innovation in other areas?

I don’t know, of course. But, if I had to bet, I’d say that we’re in for perpetual innovation, with some inventions lasting longer than others. The Net may be the exception to the pattern because of its scale, its complexity, and the ease with which anyone can innovate.

(This of course assumes we continue to have an open Internet. But that’s a hobby horse for another trail.)

Everything is Warburg

The NY Review of Books gives a substantial taste of an upcoming article by Anthony Grafton and Jeffrey Hamburger about the library of the Warburg Institute. It organizes books on the shelves — it’s an open stacks library — into clusters of related materials, cutting across the usual subject classification. The University of London, which rescued and preserved the library, now is planning on dispersing its contents.

[The next day:] The full article has now been posted. Thanks, NYRB!
