You've arrived at Everything is Miscellaneous's blog page that was active 2008-2012. You'll find links to some useful information about the book and its subject matter, but don't be surprised by some dead links, etc.
To order a copy, go to your local bookstore, or Amazon, etc.
For information about me, David Weinberger, click here.
To visit the page underneath this text, click here.

Thanks - David Weinberger

Tibetan taggers

This is a couple of years old, but it’s interesting. (Thanks to Norm Jacknis for the tip.)

Tibetans living in Switzerland and non-Tibetan Swiss were asked to provide tags for an exhibit of traditional Tibetan work. Those tags were then analyzed to see what cultural differences might show up. Some were fairly obvious:

Taggers disagreed in their perceptions of the esoteric deity Chakrasamvara. Tibetans tagged it frequently with “buddha”, accurately identifying its wisdom aspect; however, Swiss Germans found it böse or “angry-looking” and associated it with death. This exemplifies how tags can help uncover cultural misunderstandings: rather than anger, Chakrasamvara actually embodies the union of bliss and emptiness.

It also revealed (or suggests) some differences in how people approach tagging itself:

When Tibetans were asked which images were easiest to tag and why, their responses were contradictory. One person said artworks she knew were easy to tag because she already has something to say about them. Another found unfamiliar works easier to tag because they seemed “freer.” The rating indicates that symbolic and familiar works do elicit less diverse responses from Tibetan taggers. And although some people may find them easier to tag because their meanings are culturally pre-defined, the way in which viewers react to them is likely to be less personal and even “less free.”

SpokenWord.org aggregates podcasts, almost all of which are free, and makes it easy for users to export them to, say, iTunes. It’s a non-profit site and is all about the openness. (Disclosure: I’m on its board.) Now SpokenWord is looking for volunteers to curate podcast feeds and episodes in topics that interest them. Their curated collections will be the main feature at the SpokenWord site, because nothing knows what’s interesting to humans better than other humans do. Details here.

Matthew Ingram at GigaOM blogs about an upcoming Twitter feature called Twitter Annotations. Well, it’s not actually a feature. It’s the ability to attach metadata to a tweet. This is potentially great news, since it will give us a way to add context to tweets and to enable machine-processing of tweets, not to mention that URLs could be sent as metadata rather than as subtractions from the 140-character limit. This is yet another example of information scaling to the point where we have to introduce more information to manage it. How about one of those bogus “laws” people seem to like (well, I know I do): Information sufficiently scaled creates a need for more information.

Twitter is specifying the way in which Annotations will be encoded, but not what the metadata types will be. You can declare a “type” with its own set of “attributes.” What types? Whatever you (or, more exactly, developers and hackers) find useful. Matthew cites a number of folks who are basically positive but who express a variety of worries, including Google open advocate Chris Messina who warns that there could be a mare’s nest of standards, that is, values for types and attributes. Dave Winer takes Google to task for slagging off on Twitter for this. I agree with his sentiment that Goliath Google ought to be careful about their casual criticisms. Nevertheless, I think Chris is right: Specifying the syntax but not the actual types and attributes will inevitably give rise to confusion: What one person tags as “topic,” someone else will tag as “subject,” and some people might have the nerve to actually use words for types in, say, Spanish or Arabic. The nerve! [THE NEXT DAY: Here’s Chris’ original post on the topic, which is more balanced than the bit Matthew excerpts, and which basically agrees with the next paragraph:]

But, so what? I’d put my money on Ev Williams and Biz Stone any time (important note: If I had money). You couldn’t have seriously proposed an idea as ridiculous as Twitter in the first place if you didn’t deeply understand the Web. So, yes, Chris is right that there’ll be some confusion, but he’s wrong in his fear. After the confusion there will be a natural folksonomic (and capitalist) pull toward whatever terms we need the most. Twitter can always step in and suggest particular terms, or surface the relative popularity of the various types, so that if you want to make money by selling via tweets, you’ll learn to use the type “price” instead of “cost_to_user,” or whatever. Or you’ll figure out that most of the Twitter clients are looking for a type called “rating” rather than “stars” or “popularity.” There’ll be some mess. There’ll be some angry angry hash tags. But better open confusion than expecting anyone — even the Twitter Lads — to do a better job of guessing what its users need and what clever developers will invent than those users and developers themselves.
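To make that folksonomic pull a little more concrete, here is a minimal Python sketch of how a client (or Twitter itself) might surface the relative popularity of competing type names. The annotation layout used here (a dict mapping a type name to its attributes) and the sample data are assumptions made up for illustration; nothing above specifies how Twitter actually encodes Annotations.

```python
from collections import Counter

# Hypothetical annotations: each tweet carries a list of {type: {attribute: value}} dicts.
# This layout is an assumption for illustration, not Twitter's actual encoding.
SAMPLE_TWEETS = [
    [{"price": {"amount": "9.99"}}],
    [{"cost_to_user": {"amount": "12.00"}}],
    [{"price": {"amount": "4.50"}}, {"rating": {"value": "4"}}],
    [{"stars": {"value": "5"}}],
    [{"price": {"amount": "20.00"}}, {"rating": {"value": "3"}}],
]


def type_popularity(tweets):
    """Count how often each annotation type name appears across all tweets."""
    counts = Counter()
    for annotations in tweets:
        for annotation in annotations:
            counts.update(annotation.keys())
    return counts


def most_used(counts, candidates):
    """Among competing names for the same idea, return the one used most often."""
    return max(candidates, key=lambda name: counts[name])


if __name__ == "__main__":
    counts = type_popularity(SAMPLE_TWEETS)
    for name, n in counts.most_common():
        print(f"{name}: {n}")
    print("suggested:", most_used(counts, ["price", "cost_to_user"]))
```

A developer deciding between “price” and “cost_to_user” could consult counts like these and pick whichever name everyone else is already using, which is the convergence the paragraph above is betting on.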

I’m embarrassed to say that I just read Randall Munroe’s fabulous color survey from early May. Readers were asked to supply names for colors. It’s a rich experiment: Naming and discrimination, gender differences, hacking, tagging, spamming, hilariousness. The results also seem to support prototype theory’s idea that we agree on what the “real” (prototypical) colors are, at least within a culture: This is blue, but that one is a variant that needs a modifier in front of it (“light blue”) or for which we use a variant name (“teal”).

Randall writes the webcomic XKCD, of course, which is the Doonesbury of his generation, except while you can imagine Garry Trudeau writing a satiric HBO series, you can’t imagine him running and analyzing a color survey.

(I heard about Randall’s color survey via the Mainstream: Christopher Shea at the Boston Globe blog. Christopher also points to Stephen von Worley’s color map. BTW, that post by Christopher also has a great note about iPad censoring a graphic version of the oft-banned James Joyce’s Ulysses. Anyway, I’ve really got to do a better job keeping up with XKCD.)

Democratized curation

JP Rangaswami has an excellent post about the democratizing of curation.

He begins by quoting Eric Schmidt (found at 19:48 in this video):

“…. the statistic that we have been using is between the dawn of civilisation and 2003, five exabytes of information were created. In the last two days, five exabytes of information have been created, and that rate is accelerating. And virtually all of that is what we call user-generated what-have-you. So this is a very, very big new phenomenon.”

He concludes — and I certainly agree — that we need digital curation. He says that digital curation consists of “Authenticity, Veracity, Access, Relevance, Consume-ability, and Produce-ability.” “Consume-ability” means, roughly, that you can play it on any device you want, and “produce-ability” means something like how easy it is to hack it (in the good O’Reilly sense).

JP seems to be thinking primarily of knowledge objects, since authenticity and veracity are high on his list of needs, and for that I think it’s a good list. But suppose we were to think about this not in terms of curation, which implies (against JP’s meaning, I think) a binary acceptance-rejection that builds a persistent collection, and instead view it as digital recommendations? In that case, for non-knowledge-objects, other terms will come to the fore, including amusement value, re-playability, and wiseacre-itude. In fact, people recommend things for every reason we humans may like something, not to mention the way we’re socially defined in part by what we recommend. (You are what you recommend.)

Anyway, JP is always a thought-provoking writer…

Search engines have traditionally focused on building lists. Increasingly, they’re turning to the rectangular display of information: Boxes and tables. Boxes require extracting the relevant information and presenting it four-square in front of the user. While lists sort in a single dimension, tables show at least two dimensions. Boxes and rectangles are useful filters.

Google today announced the further boxing and tabling of data, in response (one supposes) to Bing.com. The Google Blog recommends trying searching for dog breeds, broadway shows, catherine zeta-jones date of birth, or zebra. (Look for the “something different” list in the left margin when you do the zebra search.) I especially like the summary of sources Google gives when it flat-out answers a question.

More boxes! More tables!

Luis von Ahn of Carnegie Mellon University is giving a Berkman lunchtime talk. [NOTE: I’m liveblogging. I’m making mistakes, leaving stuff out, paraphrasing, getting things wrong. This is an unreliable record.]

Luis invented captchas, the random characters you have to type in to convince a web page that you are a human and not a hostile software program. (He shows randomly generated sequences that happened to spell out “wait” and “restart.”) Captchas are useful, he says, when you’re trying to prevent people from gaming a system by writing a program to enter data robotically. They’re also useful to prevent spammers from signing up for free email accounts. To get around this, spammers have started up sweat shops where humans type captchas all day long; it costs the spammers about $0.33/account. And some porn companies ask users to type in a captcha to see photos; the captchas are drawn from email account applications. Damn clever!

He shows some variants. A Russian one asks you to solve a mathematical limit. In India, one asks you to solve a circuit. Luis says these aren’t all that effective because computers can solve both problems, but they’re still better than the “what is 1 + 1?” captchas he’s found on US sites.

He says that about 200M captchas are typed every day. He was proud of that until he realized it takes about 10 seconds to type them, so his invention is wasting 500,000 hours per day. So, he wondered if there was a way to use captchas to solve some humungous problem ten seconds at a time. The result: ReCAPTCHA. For books written before 1900, the type is weak and about 30% of the text cannot be recognized by OCR. So, now many captchas ask you to type in a word unrecognized when OCR’ing a book. (The system knows which words are unrecognized by running multiple OCR programs; ReCAPTCHA uses those words.) To make sure that it’s not a software program typing in random words, ReCAPTCHA shows the user two words, one of which is known to be right. The user has to type in both, but doesn’t know which is which. If the user types in the known word correctly, the system knows it’s not dealing with a robot, and that the user probably got the unknown word right.
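The two-word check is easy to sketch in code. What follows is my own minimal Python illustration of the logic as Luis described it, not ReCAPTCHA’s actual implementation: the function names, the image ids, the idea of tallying answers as votes, and the three-vote threshold are all assumptions.

```python
from collections import Counter, defaultdict

# Votes collected for each unknown word image, keyed by a hypothetical image id.
votes = defaultdict(Counter)


def normalize(text):
    """Compare answers case-insensitively and ignore surrounding whitespace."""
    return text.strip().lower()


def check_challenge(known_answer, typed_known, unknown_id, typed_unknown):
    """Return True if the user passes; record their reading of the unknown word.

    The user passes only by typing the known (control) word correctly. A passing
    user's answer for the unknown word is stored as one vote, since they
    probably read that word correctly too.
    """
    if normalize(typed_known) != normalize(known_answer):
        return False  # likely a bot or a typo: discard the unknown-word guess
    votes[unknown_id][normalize(typed_unknown)] += 1
    return True


def consensus(unknown_id, min_votes=3):
    """Return the leading transcription once it has min_votes, else None.

    The threshold of 3 is an arbitrary assumption for this sketch.
    """
    if not votes[unknown_id]:
        return None
    word, count = votes[unknown_id].most_common(1)[0]
    return word if count >= min_votes else None
```

The point of the design is that the user can’t tell which word is the control, so the only reliable way to pass is to read both words honestly.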

ReCAPTCHA is a free service. Sites that use it have to feed back the entries for the unknown word. About 125,000 sites use it. They’re doing about 70M words per day, the equivalent of 2-4M books per year. If the growth continues, they’ll run out of books in 7 years, but Luis doesn’t think the growth will continue, so it might take twenty years. (There are 100M books.)

(In response to a backchannel question, Luis tells the penis captcha story.)

The ReCAPTCHA system filters out nationalities, known insult terms, and the like, to avoid unfortunate juxtapositions. It’s soon going to be released in 40 languages. Google acquired ReCAPTCHA.

Q: When will OCR be good enough to break captchas?
A: I don’t know. We’ll probably run out of books first.

Q: Business model?
A: Google Books gets help digitizing.

ReCAPTCHA “reuses wasted human processing power.” The average American spends 1.9 seconds per day typing captchas. We also spend 1.1 hours a day playing electronic games. We humans spent 9B hours playing solitaire in 2003. It took less than a day of that to build the Panama Canal. So, Luis switches topics a bit to talk about how to solve human problems by playing games.

First is tagging images with words. Image search works by looking at file names and html text, because computers can’t yet recognize objects in images very well.

Does typing two words take twice as long as typing random letters? No, it takes about the same time, he says. Luis says about 10% of the world’s population has typed in a captcha. The ESP game asks two people unknown to each other to label an image until they agree. The game taboos words that other players have already agreed on. The system passes images through until they get no new labels. They’ve gotten over 50M agreements. 5,000 players playing simultaneously could label all Google images in a month. Google has its own version; Google has an exclusive license to the patent.
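Here is a rough Python sketch of the ESP game’s matching rule as described in the talk: two players label the same image until they agree, and labels that earlier pairs agreed on are taboo. The function names and the strict turn-taking are my assumptions to keep the example small; the real game’s timing and scoring aren’t covered here.

```python
def play_round(guesses_a, guesses_b, taboo=frozenset()):
    """Return the first label both players have typed, or None if they never agree.

    guesses_a and guesses_b are the labels each player types, in order. Players
    alternate turns here purely to keep the sketch simple. Taboo words (labels
    earlier pairs already agreed on) are ignored.
    """
    taboo = {w.lower() for w in taboo}
    seen_a, seen_b = set(), set()
    for i in range(max(len(guesses_a), len(guesses_b))):
        if i < len(guesses_a):
            word = guesses_a[i].lower()
            if word not in taboo:
                if word in seen_b:
                    return word
                seen_a.add(word)
        if i < len(guesses_b):
            word = guesses_b[i].lower()
            if word not in taboo:
                if word in seen_a:
                    return word
                seen_b.add(word)
    return None


def label_image(pairs_of_guesses):
    """Run successive pairs on one image, tabooing each agreed label in turn."""
    labels = []
    for guesses_a, guesses_b in pairs_of_guesses:
        agreed = play_round(guesses_a, guesses_b, taboo=labels)
        if agreed is None:
            break  # no new label: the image is considered done
        labels.append(agreed)
    return labels
```

Because each agreed word becomes taboo for the next pair, later players are pushed past the obvious label toward less obvious ones, which is how a single image accumulates several distinct tags.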

Q: Demographics?
A: For my version, average age is 29 (with huge variance), evenly split between women and men.

Q: Compared to Flickr tags?
A: Only a small fraction of Flickr images have useful tags. The tags from Flickr tend to be significantly more exact, but also significantly noisier (e.g., a person tagging an image in a way that means something idiosyncratic).

Q: Bots?
A: Yes, we don’t want you to wait for a partner, so sometimes we’ll give you a bot that replays the moves a human had made with the same image.

Q: Google Images benefits from its version of your game. Who benefits from your version of the game?
A: No one.

For some images, guesses change over time. E.g., a Britney Spears photo five years ago got labels like britney and hot. About two years ago, the labels changed to crazy, rehab, and shaved head. Now they’re back to britney and hot. By watching a player for 15 mins, you can guess whether the player is male or female with 95-98% accuracy.

Why do people like the ESP game? Sometimes they feel an intimacy with their partners. They have to step outside of themselves to make the match. They can have a sense of achievement.

He ends by saying that about the same number of people (100,000) have worked on each of humanity’s big projects, e.g., the pyramids, the Panama Canal, putting a person on the moon. That’s in part (he says) because it is so hard to coordinate large numbers of people. Now we can get 100M people to work on something. What can we do?

Clay Shirky has given us a surprising number of Internet myths. And by this I mean not falsehoods but the opposite: Broad, illuminating ways of making sense of what’s going on. For example, Clay’s post about the power law distribution of links in the blogosphere (based on research by Cameron Marlow) changed how we view authority, fame, and success in the Web ecosystem, and provided the structure within which Chris Anderson could point to the Long Tail. And Clay’s Ontology Is Overrated made clear that a change in how we categorize our world affects very real power relationships; that essay was highly influential, including on my own Everything Is Miscellaneous.

Clay’s new post — The Collapse of Complex Business Models — gives us a broad way of understanding why those who used to provide us with content will not be the ones who give us content in the future…and why they cannot fathom why not.

Giulia Ricci investigates:

the shift between order and disorder within different systems, which is the reason why I recurrently use geometrical grids, although on a more abstract level I am also interested in systems of categorisation and lists and how these can be visualised with diagrams and geometrical drawings.

For example, take a look at these. I find them fascinating as they swim close to resolution but never quite make it.

Want to see one way to use the Web to teach? Berkman‘s Jonathan Zittrain and Stanford Law’s Elizabeth Stark are teaching a course called Difficult Problems in Cyberlaw. It looks like they have students creating wiki pages for the various topics being discussed. The one on “The Future of Wikipedia” is a terrific resource for exploring the issues Wikipedia is facing.

Among the many things I like about this approach: It implicitly makes the process of learning, which we have traditionally taken as an inward process, a social, outbound process. By learning this way, we are not only enriching ourselves, but enriching our world.

My only criticism: I wish the pages had prominent pointers to a main page that explains that the pages are part of a course.
