Information architecture
June 08, 2005
Nodemap for the new website
I've produced the first page of a nodemap for the new website:
Posted by pj at 04:37 PM
June 02, 2005
Auditing and classifying content on an existing web-site for improved access
Website information architecture is the business of deciding where each document sits within a website and what it links from and to. It is about determining what the context of the information should be.
It is also about the navigational elements of a site and the topics, or subject headings, documents are listed under.
According to the World Wide Web Consortium's good practice guidelines, the location of each resource within its parent site should be reflected in its URL, or web address. In other words, the way your website is organised into hierarchies of documents should reflect the nature of the information published therein.
A news item about health should sit with other items of the same topic in the relevant section, giving a URL perhaps like the following:
http://mysite.org/news/topics/health/cold_cures.html
Further, each resource should have its own unique URL which should ideally be readable by humans who may need to remember or re-type it:
http://www.w3.org/QA/Tips/readable-uri
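As a small illustration of the principle, here is a hypothetical helper (the `slugify` and `build_url` names are invented for this sketch) that derives a readable URL from a document's place in the topic hierarchy:

```python
import re

def slugify(text):
    """Lower-case a title and replace runs of non-alphanumerics with underscores."""
    return re.sub(r'[^a-z0-9]+', '_', text.lower()).strip('_')

def build_url(site, sections, title):
    """Join the section hierarchy and a slugified title into a readable URL."""
    path = '/'.join(slugify(s) for s in sections)
    return f"{site}/{path}/{slugify(title)}.html"

print(build_url('http://mysite.org', ['News', 'Topics', 'Health'], 'Cold Cures'))
# -> http://mysite.org/news/topics/health/cold_cures.html
```

The point is that the URL is generated from the document's position in the hierarchy, so the two can never drift apart.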
This worked well and was relatively straightforward to achieve when websites were collections of HTML files in a series of nested directories, or folders, on a webserver. It is also still possible with modern content management systems which replicate the folder-and-file model. Things are less easy with database-driven websites which rely on session IDs and so forth.
The BBC News website is a good example of how this can be done well:
It is my experience, however, that no matter how well-intentioned and clear the rules about the organisation of information on a site are at the start of its life, the rush to publish and meet deadlines leads, over time, to ad hoc decisions about the location of documents, and to a slow but steady degradation of the quality of the information architecture.
And the larger the site, the worse things get.
The MEDEV website <http://www.medev.ac.uk/> has reached the point where the organisation of information on the site has become unwieldy.
We are in the process of designing and building a new website for the subject centre, and are therefore taking the opportunity to audit the information we currently publish so that the new information architecture can reflect the information actually on the site, rather than shoehorning documents into an increasingly baroque information architecture as best we can.
A major inconsistency has developed between the actual arrangement of information at the folder level of the site and the way this is reflected to the user in what is known as the 'landmark' navigation menu on the top right of each page. The landmark menu should appear consistently on each page and allow the user to navigate easily between the major sections of the site.
You will notice that there is a 'Resources' menu option which leads the user to the http://www.medev.ac.uk/resources/ section of the site. There are also links to the main 'News', 'Events', 'Discussion', 'Funding' and 'Links' sections of the site, which should correspond to the main sections of the website information architecture (the top-level folders, if you will).
You would therefore expect the 'News' section to have the following URL:
http://www.medev.ac.uk/news/
In fact, the 'News' section is in a sub-folder of the 'Resources' section:
http://www.medev.ac.uk/resources/news/
The same is true of the 'Events', 'Discussion', 'Funding' and 'Links' sections of the site.
Given that there are now between two and three thousand resources published on the site, and somewhere around seventy-two thousand links between those documents, making any changes to the information architecture is a major undertaking if we want to move large sections around without breaking huge swathes of links.
To re-organise what information sits where, we first need to get a grip on the real scope of the information now published on our site, so I have begun an audit of what we currently publish there.
To determine the context of each document (including those generated from databases), and where it is linked from and links to, I have written a web crawler, or robot, script. This crawls our own website, performing a number of different tasks as it goes:
1. It turns the HTML of each document into plain text and then inserts that text into a MySQL table so that we can quickly do free text searching across everything published on our site.
2. It gathers lists of all the links contained in that document and inserts that information into another table so that by searching with URLs we can quickly determine all the places where a particular document is linked from, in case we wish to change its URL as part of our re-organisation.
3. It also gathers information about the types of links (internal, external, mailto and so forth) and the content-type of each document (whether it is an HTML document, a PDF, a Word document and so forth).
4. It produces a separate list of links to documents external to our site and checks that they are still active and not broken, logging the HTTP response code for each URL in our database to aid fixing of broken links.
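A condensed sketch of what such a crawler's per-page work might look like, covering tasks 1 to 3 (task 4 would add a HEAD request per external target to log its HTTP status). The table and column names are invented for illustration, and sqlite3 stands in for MySQL to keep the sketch self-contained:

```python
import sqlite3
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class PageAudit(HTMLParser):
    """Collect the plain text and all href targets of one HTML page."""
    def __init__(self):
        super().__init__()
        self.text = []
        self.links = []
    def handle_data(self, data):
        self.text.append(data.strip())
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

def link_type(href, our_host):
    """Classify a link as mailto, internal or external (task 3)."""
    if href.startswith('mailto:'):
        return 'mailto'
    host = urlparse(href).netloc
    return 'internal' if host in ('', our_host) else 'external'

def audit_page(db, url, html, content_type, our_host):
    """Record one fetched page: its plain text (task 1) and its links (task 2)."""
    parser = PageAudit()
    parser.feed(html)
    db.execute("INSERT INTO pages (url, content_type, text) VALUES (?, ?, ?)",
               (url, content_type, ' '.join(t for t in parser.text if t)))
    for href in parser.links:
        db.execute("INSERT INTO links (source, target, type) VALUES (?, ?, ?)",
                   (url, urljoin(url, href), link_type(href, our_host)))
    db.commit()

db = sqlite3.connect(':memory:')
db.execute("CREATE TABLE pages (url TEXT, content_type TEXT, text TEXT)")
db.execute("CREATE TABLE links (source TEXT, target TEXT, type TEXT)")

# An invented sample page, as if just fetched from the site.
sample = ('<p>Cold cures news item</p>'
          '<a href="/funding/">Funding</a>'
          '<a href="http://www.w3.org/">W3C</a>'
          '<a href="mailto:pj@example.org">Mail</a>')
audit_page(db, 'http://mysite.org/news/health/cold_cures.html',
           sample, 'text/html', 'mysite.org')
```

With the `links` table populated, finding every page that links to a given URL before moving it is a single SELECT.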
Having knowledge of the full scope of the site is very powerful: we can now determine which sections contain the largest number of resources, which will help us prioritise which sections to move onto the new site first and what our new 'bottom up' information architecture might look like. We can also tell more easily which sections are redundant or anachronistic, so that these can be archived off on the old site.
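Even a simple count of resources per top-level folder helps with that prioritisation. A minimal sketch, with made-up URLs standing in for the crawler's output:

```python
from collections import Counter
from urllib.parse import urlparse

# Illustrative URLs; in practice these would come from the crawl database.
urls = [
    'http://www.medev.ac.uk/resources/news/item1.html',
    'http://www.medev.ac.uk/resources/news/item2.html',
    'http://www.medev.ac.uk/resources/funding/call.html',
    'http://www.medev.ac.uk/about/contact.html',
]

def top_section(url):
    """First path component of a URL, e.g. 'resources'."""
    first = urlparse(url).path.strip('/').split('/')[0]
    return first or '(root)'

# Tally resources by section, busiest section first.
counts = Counter(top_section(u) for u in urls)
for section, n in counts.most_common():
    print(section, n)
```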
Having a full-text index of the whole site also makes the second step in this process much easier: the job of classifying each resource on the new site according to a given set of subject headings or 'topics'.
Essentially we are talking about 'cataloguing' all of the resources on our current site to produce metadata for each. This metadata could form the basis of a very simplistic 'ontology layer': a browsable tree of subject-based navigation to ease access to all of the information held on our site for our users.
In the parlance of services like del.icio.us and flickr, we will be 'tagging' our resources. Basically, categorising them.
We have a number of ready-made vocabularies that we use for cataloguing and categorisation within the subject centre, including MeSH (Medical Subject Headings) and the Higher Education Academy Pedagogy and Policy themes, but our topics vocabulary also needs to be adaptable without breaking anything or requiring major re-classification.
If we can develop a robust set of classifications for our resources then this can form a major new element of our information architecture.
Given that we now have an index of all the text on the site, and once we have our vocabulary figured out, it might be possible to do some of the categorisation automatically. There are some very exciting developments in the area of auto-classification using Bayesian algorithms, and there are tools available for filtering email spam using these techniques which might be re-tasked for this purpose:
http://spambayes.sourceforge.net/
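As a toy illustration of the idea (not SpamBayes itself): a from-scratch multinomial naive Bayes classifier, trained on a few invented snippets, suggesting a topic for an unseen title. The topics and training text here are entirely hypothetical:

```python
import math
from collections import Counter, defaultdict

def tokens(text):
    return text.lower().split()

class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # topic -> word frequencies
        self.doc_counts = Counter()              # topic -> number of documents
        self.vocab = set()

    def train(self, text, topic):
        self.doc_counts[topic] += 1
        for w in tokens(text):
            self.word_counts[topic][w] += 1
            self.vocab.add(w)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, float('-inf')
        for topic in self.doc_counts:
            # log prior + log likelihood with add-one smoothing
            score = math.log(self.doc_counts[topic] / total_docs)
            topic_total = sum(self.word_counts[topic].values())
            for w in tokens(text):
                score += math.log((self.word_counts[topic][w] + 1)
                                  / (topic_total + len(self.vocab)))
            if score > best_score:
                best, best_score = topic, score
        return best

nb = NaiveBayes()
nb.train('funding call for mini-project proposals', 'funding')
nb.train('deadline for project funding applications', 'funding')
nb.train('workshop on problem based learning', 'events')
nb.train('booking open for assessment workshop', 'events')
print(nb.classify('new funding deadline announced'))  # suggests 'funding'
```

A real deployment would train on the full-text index built by the crawler, with a human confirming or correcting each suggestion.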
Nevertheless, the bulk of the work will probably need to be done by hand, certainly for new resources added to the site after it has gone live. The cataloguing process therefore needs to be quick and easy, taking up a minimum of time while still capturing metadata of suitable quality.
Fortunately, my colleagues and I are already very familiar with software which allows the 'cataloguing' and classification of resources quickly and effectively: we use MovableType for blogging useful resources.
Using the MT quickpost facility we can describe, classify and quickly add to our blog a basic entry for the resource currently in our browser window. Further, MT stores all the entries in a MySQL database, along with the categories available to classify each resource.
On a vanilla MT blog page the classification terms form a key part of the navigation for users and are the basis for the whole blog's information architecture.
Because MT is based on MySQL, and offers a good deal of flexibility in publishing chunks of blog content, it would be trivial to incorporate this menu / hierarchy of category terms into the information architecture of our new site.
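A sketch of what that incorporation might look like: counting catalogued entries per category straight from the database and rendering a menu fragment. The mt_category and mt_placement table and column names reflect my understanding of the MT 3.x schema and should be checked against the actual installation; sqlite3 stands in for MySQL, and the category data is invented:

```python
import sqlite3

db = sqlite3.connect(':memory:')
# Stand-in tables; real MT keeps these (with more columns) in MySQL.
db.executescript("""
    CREATE TABLE mt_category (category_id INTEGER, category_label TEXT);
    CREATE TABLE mt_placement (placement_entry_id INTEGER,
                               placement_category_id INTEGER);
    INSERT INTO mt_category VALUES (1, 'Navigation design'), (2, 'Assessment');
    INSERT INTO mt_placement VALUES (10, 1), (11, 1), (12, 2);
""")

# Count catalogued resources per category, most heavily used first.
rows = db.execute("""
    SELECT c.category_label, COUNT(p.placement_entry_id) AS n
    FROM mt_category c
    LEFT JOIN mt_placement p ON p.placement_category_id = c.category_id
    GROUP BY c.category_id, c.category_label
    ORDER BY n DESC
""").fetchall()

# Render as a simple HTML menu fragment for the new site's navigation.
menu = '\n'.join(f'<li>{label} ({n})</li>' for label, n in rows)
print(menu)
```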
Further, MT includes lots of options for syndication in formats such as RSS and Atom. We intend to incorporate the blog RSS feed as a 'Latest additions to this site' information pod on the new site's home page.
We could also use the MT facilities to allow comments and trackbacks about resources on our site.
Finally, MT produces Dublin Core metadata, expressed as RDF, for each entry added:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description
rdf:about="http://minnesota.ncl.ac.uk/fuzzybuckets/archives/2005/06/article_about_u.html"
trackback:ping="http://minnesota.ncl.ac.uk/cgi-bin/moveabletype/mt-tb.cgi/1146"
dc:title="Article about using patterns to improve web navigation"
dc:identifier="http://minnesota.ncl.ac.uk/fuzzybuckets/archives/2005/06/article_about_u.html"
dc:subject="Navigation design"
dc:description="Improving Web Information Systems with Navigational Patterns..."
dc:creator="pj"
dc:date="2005-06-01T16:30:54+00:00" />
</rdf:RDF>
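That RDF could be consumed programmatically when building the new site. A sketch using the standard library to pull the Dublin Core fields out of a trimmed, sample MT-style block (the URL and values below are placeholders):

```python
import xml.etree.ElementTree as ET

# A trimmed sample of the MT-generated RDF shown above; values are placeholders.
rdf = '''<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description
    rdf:about="http://example.org/entry.html"
    dc:title="Article about using patterns to improve web navigation"
    dc:subject="Navigation design"
    dc:creator="pj" />
</rdf:RDF>'''

DC = '{http://purl.org/dc/elements/1.1/}'
RDF = '{http://www.w3.org/1999/02/22-rdf-syntax-ns#}'

root = ET.fromstring(rdf)
for desc in root.iter(RDF + 'Description'):
    # Keep only the dc: attributes, stripping the namespace from each key.
    metadata = {key.split('}')[1]: value
                for key, value in desc.attrib.items()
                if key.startswith(DC)}
    print(metadata)
```

The dc:subject field is exactly the category term assigned at quickpost time, so the blog's tagging flows straight through into site metadata.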
Posted by pj at 12:44 PM
February 03, 2005
Types of content currently published on the LTSN-01 website
This is a stab at a rough, top-of-the-head, first draft information architecture:
The site contains content describing or arising out of subject centre activities, past and current, and also a number of resources aimed at our constituency. There is a caveat, in that obviously many of our activities also produce resources.
I would like to make a distinction between information describing activities and the resources which we publish.
Further, there is a distinction between internally sourced and externally sourced information. A good example is funding opportunities: these typically cover external sources of funding, but LTSN-01 also provides funding for, or participates in, research and development projects, mini-projects and workshops.
These inter-relationships between our areas of activity and the role of the website in promoting external opportunities need to be reflected in the information architecture of the site and in its navigation elements.
Activities:
- funding opportunities (including workshop and mini-project calls)
- workshop promotion and booking
- contact details maintenance
- mini-project information and publication of reports
- news and events service
- publications:
-- newsletter
-- special reports
-- email bulletins
- publication requests handling
- discussion boards
- research and development projects
- link of the month service
- cataloguing resources into BIOME
Resources:
- service for search of BIOME databases
- FAQ
- Glossary
- Free web courseware
- project reports and websites
- workshop resources
- newsletter
- special reports
Another issue is what to do with out-of-date content that we post. There should be some sort of archive area so that there's a clear delineation between current and older information.
Perhaps we should seek to archive and publish old email bulletins too?
There are two other issues that need to be decided upon as well. One is the issue of metadata for each page on the site. The other, related, matter is how each piece of content is classified, i.e. whether we should use MeSH or METRO, and whether we should be assigning terms from the Academy vocabularies to keep them happy.
Posted by pj at 03:29 PM
January 28, 2005
We probably need an intranet section for our site
Rather than an approach where users and editors visit the same URI and get a different view, or have a number of pages to visit for content admin purposes, we should have a distinct URI tree for content management tasks and related interfaces.
Posted by pj at 03:06 PM