Analysis and visualization using large bodies of electronic text

Presented by Peter Leonard, Associate Director for Research Computing, Division of the Humanities, and Elisabeth Long, Associate University Librarian for Digital Services, University of Chicago

Transcript


0:16
To give you a sense of how we've approached this: Peter is going to take the
0:19
bulk of the presentation, talking about
0:22
four different projects that we've been working on with faculty, and then,
0:27
interjected between
0:29
these, I want to raise some of the
0:31
challenges that faculty face and some of the questions
0:35
we as librarians have about how we structure
0:39
the kinds of activities we're doing.
0:43
Thanks very much, Elisabeth, and thanks all for the opportunity to come
0:46
here and speak with you today. I'm really happy to be presenting alongside
0:49
Elisabeth Long,
0:50
because although I describe the work that I do as digital humanities,
0:54
I think that there's an increasing realization in the field
0:58
that long-term digital humanities work can't be done at research universities
1:01
without the long-term expertise of people in the library.
1:05
So it's wonderful that Chicago has an associate university librarian for digital services, and
1:08
Elisabeth and I are already working together pretty closely and anticipate even more
1:12
of that as we try to meet the needs of
1:15
my keys indians
1:18
What we have to present for you today is under the broad rubric of new
1:21
horizons in primary source research, as we just heard.
1:25
And the question that Elisabeth and I wanted to bring forth in
1:28
front of you was this:
1:31
if we just say, going forward from 2012,
1:34
what in fact do they consider valid primary sources for digital
1:37
investigation?
1:39
How do they want to read those texts?
1:42
And I put "read" in air quotes, because what we'll go through in this
1:45
presentation is some interesting ways of reading that don't just involve
1:48
the traditional
1:49
methods of going through graduate school.
1:51
But once we know this, once we know what the sources are and how we want to read
1:55
them, hopefully we'll start to be able to identify how libraries can build
1:59
collections of these new types of sources
2:01
and how libraries can help support new forms of this kind of work.
2:06
so let's begin with our com
2:07
without seeing your hike and you don't know what running data electronic text
2:11
one of the things that i think too
2:13
one it kind of posters that everything is a tax break
2:16
uh... political symbol example of electronic text which is a little text
2:21
We're all experienced in this room working with electronic texts. We know how
2:24
to scan books,
2:25
how to work with repositories. We know that, going forward, the HathiTrust
2:29
Research Center is going to be an important player in this world.
2:32
We know how to work with manuscripts; we know how to mark them up in semantic XML.
2:36
We know how to put them into corpus query engines; we have a great one at
2:39
Chicago called PhiloLogic, which comes from the ARTFL Project.
2:43
But the really important question is: let's assume that's a solved problem. Let's assume
2:47
we have HathiTrust for our books; let's assume all of our manuscripts are
2:50
put into PhiloLogic.
2:52
One of the things I want to point out as we move through this presentation today
2:55
is a possible motion away from a kind of scholar-generated query as being the only
3:01
thing we do with it,
3:02
and towards data itself being responsible for the generation of patterns.
3:08
And that's not to say that we will move completely away from humanists
3:11
knowing what to look for in a text,
3:13
but merely an argument that in the big data world you won't always know what
3:16
you're looking for until it emerges from the data itself.
3:19
So we use this phrase, "let the data organize itself."
3:22
This is what astrophysicists do when they have a radio telescope that is giving them
3:25
way too much data to look at and process: they have to separate signal from
3:28
noise.
3:30
There are a lot of techniques that come from outside the humanities
3:33
here, from fields such as information retrieval, computer science, mathematics,
3:36
statistics.
3:38
You've heard things like topic modeling, latent Dirichlet allocation,
3:42
latent semantic analysis.
3:44
If you haven't heard of these things, you will this year; they're all
3:47
ways to make sense of twenty million books.
3:48
But what I'm going to talk about today actually is a little lower on the
3:52
mathematical scale. It just has to do with sequence alignment.
3:55
in that endeavor sequence alignment beijing appears easily
4:00
Most geneticists, most biological scientists know that sequence alignment
4:03
is a way of finding patterns in genomes.
4:07
We're going to talk about it today not in the domain of biology but in the domain of
4:11
literature.
4:12
Your professors are already using sequence alignment,
4:15
even if they don't know it.
4:16
They're using sequence alignment every time
4:19
they have one of their undergraduates submit their papers through
4:22
websites like Turnitin.com and similar plagiarism checkers.
4:26
Even if your campus doesn't use a plagiarism detector,
4:28
if it uses Blackboard, there's a module in Blackboard which can optionally check
4:32
undergraduate work for
4:34
regions of similarity with Wikipedia
4:36
or other corpora of papers people have turned in before.
4:40
The underlying idea here is trying to find patterns of textual reuse, areas of
4:43
textual borrowing.
4:45
But plagiarism is maybe the least interesting thing that we can do with
4:48
sequence alignment.
4:49
At another level, what sequence alignment is about is finding things that overlap:
4:54
alignments of sequences, whether it's genomes or texts.
4:57
A lot of the things we discover with sequence alignment are pretty banal. If you put in your
5:01
fifteen million books, you'll find set phrases that occur over and over. Now,
5:05
this isn't plagiarism, and this isn't citation;
5:08
this is just the way people talk. So you get a lot of examples of "once upon a
5:12
time," right?
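
To make the idea of shared sequences concrete, here is a minimal sketch. It is not the software used in the Chicago projects, which the talk does not name. It assumes texts are compared word by word and uses Python's standard `difflib` module as a stand-in aligner; the helper name `shared_passages`, the `min_words` cutoff, and both sample sentences are invented for illustration.

```python
from difflib import SequenceMatcher


def shared_passages(text_a, text_b, min_words=4):
    """Return runs of at least `min_words` words that the two texts share.

    Hypothetical helper: difflib finds one non-crossing alignment of the
    two word lists, so this is a rough approximation of text-reuse detection.
    """
    words_a = text_a.lower().split()
    words_b = text_b.lower().split()
    # autojunk=False stops difflib from discarding very frequent words
    # when the inputs are long.
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    passages = []
    for block in matcher.get_matching_blocks():
        if block.size >= min_words:
            passages.append(" ".join(words_a[block.a:block.a + block.size]))
    return passages


# Invented sample sentences, not taken from any real corpus.
first = "the committee said that once upon a time there was a library"
second = "once upon a time there was a king"
print(shared_passages(first, second))
# prints ['once upon a time there was a']
```

Raising `min_words` filters out more of the banal set phrases mentioned above, at the cost of missing short quotations.
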
5:13
You can also find things like references from outside the corpus, right? So if you
5:17
analyze eighteenth- and nineteenth-century literature,
5:20
you find a lot of quotes from the Bible, right?
5:22
Again, not plagiarism; it's just a common reference point.
5:25
And what this brings up is the need to think carefully about your corpus when you're
5:28
running these types of analyses. Did you think that a book of the Bible
5:31
would be relevant to nineteenth-century American literature?
5:34
Well, it is, if there are a lot of Bible quotes, right?
5:37
And finally, the point I want to make on this is that not everything is
5:40
pure plagiarism or banal or a familiar citation. There's this notion
5:45
built on a smart house of commons higher critic right
5:48
people identified this book
5:50
aka lolita from what the twenties
5:52
uh... and its broad outlines of what it would not locked up for it it's not been
5:56
inspired by, right? So the notion here is that good artists copy all the time.
6:00
So what I can actually show you that's concrete, that explains how we can
6:04
identify textual reuse
6:06
when it works well:
6:07
we've been running some analysis in Chicago recently on about nine hundred
6:11
classical Latin texts.
6:13
What I'm showing on the screen, and I'll point it out to help make this clear, is a
6:16
chord diagram, or chord graph, showing connections between Virgil and other
6:20
texts.
6:22
It's colored in red on the screen. What we're seeing is that one text has a lot
6:26
of connections, right?
6:28
So this would suggest, if you're a classicist, that
6:30
Virgil is kind of an important person; people seem to refer to him a lot.
6:34
their religious beliefs you're understand
6:37
I can show you an example in English,
6:40
which is based on a bunch of speeches that are given to incoming University of
6:43
Chicago undergraduates every year.
6:44
We call this the Aims of Education address, after a famous Alfred North
6:48
Whitehead essay from decades ago.
6:51
But if we go through the one from Hanna Gray,
6:54
remember her,
6:56
even just the one address she gave,
6:58
we can find patterns of overlap.
7:01
I'm going to find one with Alfred North Whitehead.
7:03
Here is Hanna Gray giving a talk
7:06
in the eighties,
7:08
and she says we need to be careful of what Whitehead called "inert ideas," those ideas that
7:11
are received without being tested or criticized, or something like that.
7:16
I didn't guess that Whitehead would be important to an Aims of Education
7:19
speech, but the point is the algorithm found
7:21
it for me. It identified that,
7:23
and it showed me on the chord graph how many sequences she shares with not only
7:27
Alfred North Whitehead but other people who delivered this address over time.
7:31
back to our selection
7:35
etc et-cetera
7:37
So one of the challenges that faculty face: in the case of the work
7:41
that's being done right now on classical Latin,
7:44
we have a really nice corpus that already exists.
7:48
it's a wonderful world to work and if your arm
7:53
landry somebody's areas where this is already in the building you have a good
7:57
way back to that at least inserts updates
8:00
But a lot of areas don't have this kind of corpus.
8:03
And so how do
8:05
we support developing the kind of complete corpora that people need
8:10
to be able to do this kind of work? That is, I think, one question that we have.
8:13
As Peter pointed out, you also have to think about how you
8:18
even decide what corpus you need. If you want to start thinking about the
8:22
generation over time
8:24
of ideas and influences, you often need to go back a lot further than you were ready
8:30
to start with.
8:32
We've also had faculty talk about
8:35
how they can teach this kind of thing.
8:38
They want to be able to have students come into a class and actually
8:41
develop corpora for themselves.
8:43
It's not typical that people really want to do these things as academic, you know,
8:48
exercises on a corpus that happens to be available. They actually want to do them on
8:53
materials that they are interested in, in their subject area. And so it would be
8:57
interesting: can we get book-scanning machine systems that can just sit there and scan a
9:01
whole group of things at the beginning of the semester,
9:04
or so?
9:05
Again, how are we thinking about supporting these kinds of things?
9:09
Another challenge that faculty often face here:
9:13
for instance, in the case of the Latin corpus, it exists in one place,
9:19
but you need to get it out and prepared to put it in a tool that does some of
9:23
this kind of mapping and analysis.
9:25
And often that does not just mean export and import;
9:29
it means export,
9:30
transformation,
9:32
and import. Now, that transformation component is not actually a very difficult
9:35
kind of thing. It's the kind of thing a programmer can do relatively easily in
9:40
a lot of cases, but it's not something that faculty necessarily have the skills
9:44
and knowledge to do.
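
As an illustration of what the "transformation" step between export and import can look like, here is a small sketch. It assumes, purely for the example, that the exported corpus is XML with one `<l>` element per line of verse and occasional inline markup. That layout, the function name `xml_to_plain_lines`, and the sample fragment are hypothetical and do not describe the actual format of the Latin corpus mentioned in the talk.

```python
import xml.etree.ElementTree as ET


def xml_to_plain_lines(xml_string, line_tag="l"):
    """Flatten marked-up lines into plain strings for an analysis tool.

    Hypothetical helper: assumes each line of text sits in its own
    <l> element, possibly with nested inline tags.
    """
    root = ET.fromstring(xml_string)
    lines = []
    for element in root.iter(line_tag):
        # itertext() gathers the text inside nested tags as well.
        text = "".join(element.itertext())
        # Collapse runs of whitespace left over from the markup.
        lines.append(" ".join(text.split()))
    return lines


# Invented sample export; the tag names are assumptions.
sample = """<text>
  <l n="1">Arma virumque cano, <name>Troiae</name> qui primus ab oris</l>
  <l n="2">Italiam, fato profugus, Laviniaque venit</l>
</text>"""

for line in xml_to_plain_lines(sample):
    print(line)
```

The output of a step like this (one plain line per unit of text) is the kind of input that alignment or visualization tools typically expect.
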
9:46
So they are also starting to look
9:48
for what kinds of services
9:51
can intervene to help them,
9:53
and maybe actually just
9:55
have a place or a
9:56
person they can go to to do that transformation,
9:58
or
9:59
what kinds of training we need to be providing to people to be able to learn
10:03
how to do this kind of thing.
10:08
So we talked about one kind of connection between texts: passages
10:12
shared between texts, whether they're
10:14
quoted, plagiarized, or cited, or whether it's just talking about a general idea.
10:18
But the next example I want to talk about
10:20
uses a kind of extrinsic connection between texts,
10:23
and this has to do with
10:25
the application of network analysis techniques to modernist poetry.
10:30
I'm working with two faculty on this, Hoyt Long and Richard So, in Japanese
10:34
and English respectively, so this is a trans-
10:37
departmental project, because many countries have their own modernisms,
10:41
right, that are governed by kinds of
10:43
print cultures. So we're both interested in the
10:45
Anglo-American world and in the Japanese-language world.
10:48
If you were teaching a class
10:50
on modernist poetry in Japan,
10:53
or you were writing a book,
10:54
presumably you would remember the names of all the poets
10:57
you're working on, what they looked like.
10:59
But you'd also have a sense of what journals they published in, and you could
11:02
keep a lot of this in your head.
11:04
And as you work on this class, or for this article, you know that
11:08
certain people wrote in certain journals, and other people wrote in the same
11:12
journals, but that other people only wrote in one type of journal. But
11:15
eventually, at scale, this becomes very complex to keep in your head,
11:18
this notion of you who wrote the people of the new york stuff
11:22
So what we're exploring with network analysis is what happens when you do get
11:25
this data at scale.
11:27
So the answer is
11:28
you get a very complex network diagram, and this network diagram contains nodes,
11:33
which are the circles, and it also contains edges, which are the links between the
11:36
nodes. In this case,
11:38
the nodes are the poets,
11:39
and the edges, the curved lines between the circles,
11:42
are instances of people who published in the same journal.
11:46
And this isn't only something to look at: we're going to
11:49
interrogate it. You can use algorithms to interrogate the graph,
11:53
to find clusters of affiliation, to detect communities,
11:58
completely algorithmically, but based on
12:00
what journals people published their poems in.
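
The graph described here can be sketched in a few lines. This is only an illustration under stated assumptions: the input is a list of (poet, journal) pairs, an edge joins two poets who appeared in the same journal, and "clusters" are approximated by connected components, which is a much cruder grouping than the community-detection algorithms the talk alludes to. The function names and the placeholder poets and journals are invented.

```python
from collections import defaultdict
from itertools import combinations


def copublication_edges(publications):
    """Map each pair of poets to the number of journals they share."""
    poets_by_journal = defaultdict(set)
    for poet, journal in publications:
        poets_by_journal[journal].add(poet)
    edges = defaultdict(int)
    for poets in poets_by_journal.values():
        # Every pair of poets in the same journal gets (or strengthens) an edge.
        for a, b in combinations(sorted(poets), 2):
            edges[(a, b)] += 1
    return dict(edges)


def clusters(edges):
    """Group poets into connected components (a crude stand-in for communities)."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen, groups = set(), []
    for start in sorted(neighbors):
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbors[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


# Placeholder data, not drawn from the bibliography discussed in the talk.
records = [
    ("Poet A", "Journal X"), ("Poet B", "Journal X"),
    ("Poet B", "Journal Y"), ("Poet C", "Journal Y"),
    ("Poet D", "Journal Z"), ("Poet E", "Journal Z"),
]
print(clusters(copublication_edges(records)))
# prints [['Poet A', 'Poet B', 'Poet C'], ['Poet D', 'Poet E']]
```

Note that the only input is bibliographic data (who published where); no full text of the poems is needed to build the graph.
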
12:02
At scale, you can get the algorithm to color a bunch of the nodes green, another
12:06
one blue, another one purple.
12:08
And then you can put the resulting visualization in conversation with the
12:12
humanist, the literature person who's written a book on
12:14
nineteen-twenties and -thirties Japanese poetry,
12:17
and you can get him or her to say, "Yes, this cluster really makes sense,
12:20
right? All these people were publishing in the same provincial journals. This
12:23
really works.
12:24
But why did the algorithm cluster these other poets in other places?"
12:28
So when we have a conversation between the quantitative knowledge of the
12:31
algorithm and the qualitative knowledge of the expert in Japanese poetry, we think it's
12:36
very productive.
12:37
We can do this across all sorts of different languages. I can show
12:40
you American poetry
12:42
in the nineteen-teens and -twenties.
12:45
Everything that is orange or yellow here is a journal; everything that is
12:49
sort of red is a person.
12:51
So if I click on The New Republic, I'll see all the poets who published in The New Republic.
12:55
you can see these interactive exploratory right around the nose for us
12:58
to burn out of the new my question before i clearly not what the question
13:01
emerged from the podium organize
13:07
style
13:09
This explodes the question of the kind of corpus that faculty are looking for,
13:13
because
13:15
not only do they not necessarily need
13:17
the full text of all of the
13:21
journals that the poets were publishing in,
13:24
but what actually drives the graphs that you have just seen
13:29
is really bibliographic data.
13:31
And so, in the case of Hoyt Long, what he has done is take a modern
13:37
reference work.
13:38
He's been scanning
13:40
pages, sections of the areas he's interested in. It is
13:44
a bibliography of all the poems published by all the poets in
13:48
Japan.
13:49
And he is
13:51
scanning it, OCRing it, and then having to take that data and get it into fielded
13:56
form to put into an Excel spreadsheet
13:58
that works with the kind of software that he's interested in.
14:02
So one question he's asked, of course, is what role the library could play in
14:07
maybe working with the
14:10
vendor who is producing this volume and saying, is there a way we can get the
14:14
data in another way?
14:17
I don't know if any of you have heard about, I guess it was a couple of days ago,
14:21
articles about a vendor having worked with the University of British Columbia to
14:25
provide
14:27
data mining access to their
14:30
entire database for the faculty there, for research purposes. This, I think, is the
14:35
kind of thing we need to start looking at: people are not wanting the data
14:40
only in the traditional way we've thought about it, where you have an article that you pull
14:44
up and read.
14:45
They want to be able to get at this data in other ways.
14:48
They want to be able to get and manipulate the metadata about articles
14:52
as well.
14:53
I could imagine also,
14:55
you know, an
14:56
entire
14:57
dump of our catalogs as being the kind of data that you could pull in, just to
15:02
again start looking at some of these connections:
15:06
where people published, in a given area, if you want to take a subset and look at it that way.
15:11
How do we help support people to get things in the way that they're wanting?
15:15
I think another question this raises in my mind is
15:18
the question of what amount of work we would put into
15:23
digitizing material that's in copyright, and for this non-consumptive purpose. We've
15:27
tended
15:28
to want to digitize materials that we can make openly available.
15:32
yet faculty as they are not surprisingly want to work and there is a good work
15:36
and at
15:38
crosses that the other is development of nineteen twenty three times
15:43
areas and so i think that
15:45
question of what will we have providing this kind of work and play efforts into
15:50
doing this kind of work
15:53
that we can share
15:54
and thinking about how we also start
15:57
developing the kinds of services that would allow us not to have, you can
16:01
imagine, faculty member
16:02
after faculty member recreating some of this kind of work.
16:07
They make their own little
16:08
databases
16:10
that nobody else is ever able to consult.
16:15
Now I want to take us into a domain
16:17
that differs a little bit from Japanese poetry of the twenties, although it keeps
16:21
the decade,
16:22
and that is to talk about a project, a very experimental thing we're doing
16:25
with the Jazz Age magazine The Chicagoan.
16:29
It was discovered in a wonderful way by an emeritus professor of history at the
16:33
University of Chicago, Neil Harris,
16:34
who was walking through our research library, the Regenstein, and discovered a set of
16:38
bound volumes on the shelf.
16:39
He pulled one off and discovered this amazing sort of Chicago competitor to
16:44
The New Yorker. I've put up
16:45
one of the covers on the right of the slide.
16:48
It's amazing to think that
16:50
it died after a decade or two, but
16:53
at that one moment it kind of envisioned
16:56
itself as a Chicago version of The New Yorker. It captured
16:59
the color and
17:02
sort of the sights and sounds of this city
17:04
in a way that we didn't have the kind of documentary evidence of
17:08
before. So
17:09
it's been rescued from obscurity, and the best
17:11
qualitative work on The Chicagoan that ever could be done has already been done,
17:15
a couple of years ago, by Neil Harris. There is a beautiful monograph with reproductions and
17:18
newspapers
17:20
So we've got the qualitative analysis; that's worked out. But you can ask yourself,
17:24
what did Neil Harris find when he pulled the bound volume off the shelf in a
17:28
half-forgotten part of the Regenstein?
17:30
Maybe it's the value of the fiction that was printed in The Chicagoan.
17:34
Maybe the value is the political commentary, if you're a historian, about what
17:38
people thought about race relations in Chicago.
17:41
Maybe the value is the cartoons, if you're an art
17:44
history student: how do the
17:45
cartoons in The Chicagoan compare with cartoons in The New Yorker?
17:49
But my question is, maybe the text of The Chicagoan could be thought of as
17:52
the covers,
17:53
because the cover is a kind of indexical representation
17:57
of each fortnightly or weekly issue, right? It's designed to sell the issue on
18:01
the newsstand; it represents that week, just like the New Yorker covers.
18:05
So we can say, well, if you have hundreds of covers in brilliant
18:09
color,
18:10
twenty mobile had once henry
18:12
doesn't really hundreds parts
18:15
One answer is the mathematical dimensions of images, and by dimensions I
18:19
don't mean
18:20
width in inches or even height.
18:22
I mean things like luminance, the brightness;
18:25
the hue, the color;
18:28
the saturation of each cover. If you ask a computer algorithm to analyze this
18:32
corpus of covers from Jazz Age Chicago,
18:36
what you can end up with are some interesting visualizations.
18:39
This graph is of hue: the median hue of each cover, averaging the hue of all
18:44
the data on the cover,
18:45
each week or each fortnight.
18:47
The x-axis starts in 1926 and goes to 1935.
18:50
The y-axis is what the hue is. You'll see all the blue covers clustering in
18:54
the middle;
18:55
the top is red and the
18:57
bottom is essentially yellow and green.
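
The per-cover measurements described here can be approximated with Python's standard `colorsys` module. This is a sketch under assumptions, not the pipeline used for the Chicagoan covers: it takes a cover as a plain list of (red, green, blue) pixel values from 0 to 255 (loading a real image file would need an imaging library), and the function name `cover_color_summary` and the sample pixels are invented.

```python
import colorsys
from statistics import median


def cover_color_summary(pixels):
    """Return (median hue in degrees, median saturation from 0 to 1).

    Hypothetical helper. Hue is an angle on a color wheel, so a plain
    median is only a rough summary for covers dominated by reds near
    the 0/360 degree boundary.
    """
    hues, saturations = [], []
    for r, g, b in pixels:
        h, s, _v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
        hues.append(h * 360)
        saturations.append(s)
    return median(hues), median(saturations)


# Invented "cover": mostly saturated green with one white pixel.
sample_cover = [(0, 255, 0), (0, 255, 0), (255, 255, 255)]
hue, saturation = cover_color_summary(sample_cover)
print(round(hue), saturation)
```

Plotting one such hue value per issue against its publication date gives a chart of the kind described, with time on the x-axis and hue on the y-axis.
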
18:59
There's not much you can learn from that. But when I showed the saturation
19:02
graph to Neil Harris... this is the saturation, so the very top is full of color,
19:07
and at the bottom there's very little color; it could be black or white, but it has
19:10
little color.
19:11
When I showed this particular graph to Neil Harris, he immediately
19:14
pointed at one cover. I don't know if I can point to it.
19:18
Yes:
19:20
the very top cover, the green one.
19:23
He said they had a terrible time trying to reproduce this one; it's this
19:26
crazy, saturated green.
19:28
It's very difficult to get right.
19:29
The algorithm agrees: this cover is off the charts.
19:33
so what we do not want to do is you want to first of all the because the entire
19:35
porpoise king and really well up we have a sub sub sampling of purpose for all
19:40
these covers
19:41
people say whom he didn't really stop this couple
19:44
And look at the couple of yellow ones just down a little bit; those are also
19:47
pretty saturated, though not to this degree.
19:49
So can we discover anything about why these covers were so interesting? Is there
19:53
anything that came along with a new designer who designed the covers?
19:57
That's the type of information you can get at scale
19:59
from images, of course. And it's not just with Jazz Age magazine covers. Even in
20:03
one spot for chintu
20:05
to discover with when purple ink was introduced to people purple pigments
20:09
flat oil painting in the renaissance became more purple right thing to do
20:13
that just by greeting cards for
20:16
If we can, you know, find the hue, the saturation, the luminance of an individual photo,
20:19
and if I were to flip through all these covers right in front of you one by one,
20:23
it would almost look like a flip book. So how different is that from a
20:27
movie?
20:28
It's pretty simple, right? A movie is about twenty-four frames per second,
20:31
right? So what happens if we think about applying the same type of
20:34
technique to movies? Movies are just sequences of images.
20:38
So to simplify things, we'll do a black-and-white movie. The only value in a
20:41
black-and-white movie is luminance: how bright or dark the
20:45
screen is.
20:46
It doesn't have any color. So what we can do here is, of
20:50
course, Battleship Potemkin by Sergei Eisenstein: the Odessa Steps sequence.
20:55
We can ask questions of the sequence, because really the sequence is
20:58
just about eighteen frames per second of luminance.
21:01
And the questions we can ask are: how bright is every image in the sequence
21:05
of events,
21:07
and what does the distribution of the values look like?
21:10
I can show you this
21:11
on the website.
21:14
So here we go: we have the Odessa Steps sequence, and I'm controlling it, right?
21:18
I can make people run up and down the stairs, right? You know, I'm controlling
21:21
the time dimension.
21:23
And what you see my mouse hovering over is a kind of line
21:26
indicator here, kind of like a Richter scale visualization of how bright or
21:30
how dark the image is at every single frame.
21:34
Now, what I think this uncovers are certain points in time when
21:37
the screen goes really dark. You can guess what those are:
21:42
those are
21:43
intertitles, which are all black.
21:45
isn't even though you know that's assessing who's doing the right this
21:47
moment in the Odessa Steps sequence:
21:50
it's a famous point where a woman opens a white umbrella, and the white umbrella
21:54
completely takes over the screen.
21:56
Now, if you were watching this film in Russia, you know, eighty or ninety
21:59
years ago, it would be brighter than a hundred percent, because the projector would
22:03
be shooting light off the silver screen right into your eyeball in this kind of
22:06
way.
22:07
But the histogram tells me that this is literally the brightest part of the sequence.
22:12
And the histogram is tied to the image that I'm controlling. If anybody
22:15
does digital photography, you'll recognize this: it is a representation showing that
22:19
most of the pixels here are on the right of the spectrum, so they're bright, whereas
22:23
on an intertitle you have a lot of pixels on the left, which is black.
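
A rough sketch of the two measurements mentioned here, brightness per frame and a brightness histogram, follows. It assumes each frame has already been decoded into rows of grayscale values from 0 to 255; decoding film frames is outside the sketch. The function names, the four-bin histogram, and the tiny sample frames are all invented for illustration.

```python
def mean_luminance(frame):
    """Average brightness of a grayscale frame, scaled to the range 0.0 to 1.0."""
    total = sum(sum(row) for row in frame)
    count = sum(len(row) for row in frame)
    return total / (255 * count)


def histogram(frame, bins=4):
    """Count pixels per brightness bin, darkest bin first."""
    counts = [0] * bins
    for row in frame:
        for value in row:
            # Integer arithmetic maps 0..255 onto bin indexes 0..bins-1.
            counts[min(value * bins // 256, bins - 1)] += 1
    return counts


# Invented 2x2 "frames": a nearly black intertitle and a nearly white shot.
intertitle = [[0, 0], [0, 10]]
bright_shot = [[255, 250], [240, 255]]

print(histogram(intertitle))   # prints [4, 0, 0, 0]
print(histogram(bright_shot))  # prints [0, 0, 0, 4]
print(mean_luminance(intertitle) < mean_luminance(bright_shot))  # prints True
```

Pixels piling up in the first bin correspond to the "left of the spectrum" case described for intertitles, and pixels in the last bin to the bright frame.
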
22:31
optional for instance takes over talks about uh... the questions here
22:35
Just to put this in context: what happens if you take the luminance and you put it
22:39
in conversation with an algorithm for shot detection, so you can tell when Eisenstein
22:42
cuts, right?
22:43
That's a very well-solved problem; lots of people are working on it, right?
22:48
So put it in conversation with algorithms for shot detection, even facial recognition, to
22:52
tell when there's a face on the screen in an Eisenstein film. Now,
22:56
it seems kind of overkill for a two-minute sequence, but what if you take
23:00
Eisenstein's entire production
23:02
and you put it into a supercomputer to process it all at once? You could take all
23:06
film produced in Russia in a decade, and you could run simple-minded
23:10
correlations between
23:11
bright scenes, cut scenes, intertitles, faces,
23:15
even what the scene is focusing on.
23:17
What you have there is essentially a data mining project that tries to answer the
23:20
question of how these variables are dependent on or independent of one another.
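
One very simple way to combine the luminance series with cut detection is to flag frames where brightness jumps sharply from the previous frame. This is a deliberately crude stand-in for real shot-detection algorithms, which the talk does not specify; the function name `detect_cuts`, the `threshold` value, and the sample series are assumptions made for the example.

```python
def detect_cuts(luminance_series, threshold=0.3):
    """Return indexes of frames whose brightness differs sharply from the frame before.

    Hypothetical heuristic: a large jump in mean luminance is treated as a
    possible cut. Real shot detection compares much more than brightness.
    """
    cuts = []
    for i in range(1, len(luminance_series)):
        if abs(luminance_series[i] - luminance_series[i - 1]) > threshold:
            cuts.append(i)
    return cuts


# Invented per-frame mean luminance: a shot, a dark intertitle, a bright shot.
series = [0.5, 0.52, 0.05, 0.04, 0.6]
print(detect_cuts(series))  # prints [2, 4]
```

Once each frame carries labels such as "bright", "cut", or "intertitle", counting how often they coincide across many films is the kind of simple correlation described above.
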
23:26
This raises the question, related also, of
23:29
starting to think of elements like color, hue, and saturation as content in and of
23:34
themselves, which makes us have to think about our collections in new
23:39
and different ways. What kinds of things make up a sensible collection
23:44
when you suddenly have new ways of thinking about it, beyond simply the
23:48
elements that might be the subject matter of what's in those
23:52
materials?
23:54
These kinds of analyses also, I think, make us have to start thinking about and
23:59
understanding, as we digitize materials, whether or not we need to change our
24:03
digitization practices. Now, interestingly, in the case, for instance, of color
24:08
analysis, it turns out they don't even need high-resolution images for that. You
24:12
can use much lower-resolution images and still capture the information that
24:16
can be
24:18
used. But it
24:19
still points to the fact that we need to understand these things: that as
24:24
digital humanities is working on new kinds of techniques, we need to
24:28
understand what we need to do in our own conversion practices to support it.
24:33
Another example would be the fact that we tend to put these materials up
24:37
bundled together as,
24:39
you know, objects as a whole: you have an entire issue of The Chicagoan.
24:44
But what someone needs is, from every individual issue, just the cover of The Chicagoan,
24:49
and then they want to map that across time.
24:52
Peter showed them doing it just across the
24:55
spectrum of years that it was published, but you could also imagine you
24:59
want to say, let's be able to pull everything from, say,
25:04
the fall issues. You could think of various ways to pull that
25:07
together.
25:08
You need to be able to get that image, and you need to be able to get all of that
25:12
metadata, so we need to be bundling these in ways
25:15
that people can get all the elements that they need to put into some of these
25:18
kinds of tools.
25:20
The last thing I'll mention is, I think what we see in
25:25
what Peter has described
25:29
shows how some of these tools of digital humanities
25:33
go across disciplines.
25:35
The two faculty members in this case are from very, very different disciplines.
25:39
Something we didn't have time to show you was the fact that the work
25:43
being done on the classical Latin corpus with sequence alignment,
25:48
that same technology and approach to
25:51
doing analysis, is now being looked at for music, doing the same thing. We
25:56
have music faculty very interested in the issue of influence and
26:01
that kind of
26:02
relationship in music over time. And so what we're finding is we have faculty
26:09
who
26:09
need to be brought together
26:11
to learn and understand
26:13
these tools and think about, "How can I apply that to my collection?"
26:17
They often aren't having those conversations with each other, because they
26:21
come from completely different parts of the university and don't necessarily bump
26:25
into each other. And so they're very interested in looking at that, and I think
26:29
they often see the library as
26:31
the place where they can all come together and have this kind of
26:34
collaboration and communication.