Friday 29 November 2013

Sorting Algorithms: Integer Sorts

For my last entry, I'm going to try to cover integer sorting. This isn't something we really touched on in class, but it's something I decided to look into after seeing how count sort performed in the trials that Danny ran, and after Danny said that radix sort is his favourite sorting algorithm.

From what I've read, there are two broad approaches to sorting: comparison sorts (like mergesort), and integer sorts, like the aforementioned count sort and radix sort.

We went over count sort in class, and it's pretty simple. It takes a list of integers and creates a new list with a placeholder value for each item in the input list (i.e., the length of the new list is the number of items in the input list). It then places each input value at the index matching that value, and reads the indices back out into a new list, which it returns. This has the effect of sorting the list of integers. The flaws in this approach are pretty obvious. While the algorithm will handle the list [3,2,0,1] with ease, it can't handle something like [11,3,46], or even [5,0]. This is because the index numbers won't correspond to the integer values. That makes the algorithm seem a little bit useless to me (maybe it's just the implementation we used?). But it's useful for illustrating the difference between integer sorts and comparison sorts, as an exemplar of the approach of the former. Count sort (perhaps to its detriment, and to its utility) barely examines the input values: it just uses each one as an index, without ever comparing two of them. Compare (ha!) this to something like mergesort, whose approach to sorting depends on looking at, and comparing, these very values.
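To make the difference concrete, here's a minimal sketch of a counting-based sort in Python (my own illustration, not the in-class code). Counting occurrences, rather than assuming the input is a permutation of 0 to n-1, handles lists like [11,3,46] too, at the cost of an auxiliary list as long as the largest value:

```python
def count_sort(lst):
    """Sort a list of non-negative integers by counting occurrences.

    Note: the in-class version only handled permutations of 0..n-1;
    counting occurrences generalises it, but still never compares
    two elements against each other.
    """
    if not lst:
        return []
    counts = [0] * (max(lst) + 1)   # one slot per possible value
    for x in lst:
        counts[x] += 1              # the value itself is the index
    result = []
    for value, count in enumerate(counts):
        result.extend([value] * count)
    return result
```

Notice there isn't a single `<` or `>` between elements anywhere: the values are only ever used as indices.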

The other integer sort I'm going to try to cover is radix sort, which is the professor's favourite. Now, we didn't really cover radix sort at all in class, so I had to look into it. This obviously won't be an in-depth treatment, but I'll see what I can do. There seem to be a couple of different variants of radix sort, the main distinction being whether you sort by the least or the most significant digit first. I'll just cover the LSD (least significant digit) version here. Now, as radix sort is an integer sorting algorithm, it doesn't compare the values directly; it depends instead on keys. It seems to me that if you are sorting a list of integers, it's pretty safe to have the keys directly correspond to the integers you're dealing with, but I suppose it'd be a different story if you had to deal with strings or such. Once you have these keys, you bucket them by their least significant digit, then make repeated passes, sorting on the next digit each time. Because each pass keeps the order established by the previous ones, you eventually wind up with a list of keys that is properly sorted.
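Here's a rough sketch of the LSD version as I understand it (my own illustration, assuming non-negative integer keys):

```python
def radix_sort(lst, base=10):
    """LSD radix sort for non-negative integers.

    Repeatedly bucket the keys by one digit, least significant
    first. Because each per-digit pass is stable (buckets are
    read out in order), earlier passes are preserved by later ones.
    """
    if not lst:
        return []
    max_val = max(lst)
    digit = 1   # place value of the current digit: 1, 10, 100, ...
    while digit <= max_val:
        buckets = [[] for _ in range(base)]
        for x in lst:
            buckets[(x // digit) % base].append(x)
        lst = [x for bucket in buckets for x in bucket]
        digit *= base
    return lst
```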

Now, obviously (and probably evidently) I haven't delved into this very deeply, but it seems pretty nifty to me. Apparently radix sort is a preferred algorithm for sorting text strings, but I wasn't quite able to figure out why, as it seems to me that the obvious way to use it (sorting a list of keys based on the lengths of the strings) doesn't lend itself to radix sort any more than to any other sorting algorithm. But it's late, and maybe I'm just being dense!

This was my last entry. Thanks for reading!

Sunday 24 November 2013

Sorting Recap


It felt like our treatment of sorting was pretty brief, but I found the subject fairly engaging. It also served as our first introduction to the analysis of algorithms. I'm going to use this blog to try to get my thoughts straight on the concept of sorting, as I consider it pretty likely the subject will make a showing on the exam, considering it didn't appear on the second test.

Sorting is the process of ordering a list of elements in some fashion. For example, we could sort a list of integers from smallest to largest (ascending order) or vice versa (descending order), or a list of words from shortest to longest, etc. We covered a few different approaches to sorting; namely mergesort, count sort, quicksort, and selection sort.

Of the sorts I listed above, count sort came out on top, if memory serves, with mergesort coming in second. I looked into why count sort did so well in the tests that Professor Heap ran (it was briefly explained in class, but I didn't really catch the reason), and it turns out the answer is pretty interesting. Most of the sorts we covered, like mergesort or quicksort, are what are called comparison sorts, because they directly compare the items they are sorting. For example, quicksort works by picking a pivot, partitioning the list into sublists around it, and recursively quicksorting each sublist. Mergesort, on the other hand, breaks its input down into a series of pairs, which it sorts; it then merges each sorted pair with its neighbour, and so on, until it has ordered the entirety of the input. Algorithms like count sort, or Professor Heap's favourite, radix sort, appear to be what are called integer sorts. We didn't really cover this in class, so I'm kind of shaky on what exactly integer sorts do. I'm going to try to figure it out and make it the topic of my next blog entry.
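To illustrate the comparison-sort idea, here's a short quicksort sketch (a simple, not-in-place version; my own illustration, not the code from the trials). Every decision it makes comes from directly comparing elements:

```python
def quicksort(lst):
    """Comparison sort: pick a pivot, partition, recurse.

    Each element is compared against the pivot with < or >=,
    which is exactly what makes this a comparison sort.
    """
    if len(lst) <= 1:
        return lst
    pivot = lst[0]
    smaller = [x for x in lst[1:] if x < pivot]
    larger = [x for x in lst[1:] if x >= pivot]
    return quicksort(smaller) + [pivot] + quicksort(larger)
```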

Sunday 10 November 2013

Entry 4: Assignment 2 Post-Mortem Part Two

I'm going to post a more interesting entry later, but first I thought I would touch on an experience I had during assignment 2. I think I learned something here about analyzing my own code, about the prudence of changing one's approach, and about thinking the problem through before starting, instead of getting into a mess later.

First, an overview of the problem. In part 2 of the assignment, we had to write a function to match strings against regular expressions. The solution seemed pretty simple for me to grasp: take the regex, split it at the position of the top-level regex operator (the one on the first nesting level), then partition the string the same way, and then match. I made a couple of mistakes right off the bat. I had a gut feeling that it would be easier to work with regular expressions in string form than in RegexTree form, so I coded up a neat little function to convert them. In and of itself, writing that function didn't waste too much time, but it started me down what in retrospect was the wrong approach to the problem. It was actually MUCH easier just to work with the RegexTrees.

After coding up that function, I hit the second problem, where once again I took the wrong fork in the road. After figuring out how to handle the regexes, I had to decide how to handle the strings. They needed to be split in certain ways, according to the regex operator. This was tricky, for obvious reasons, as regex strings of only 5 or 6 characters could correspond to strings hundreds of characters long. Somehow, I created some extremely convoluted code to handle these situations for the bar and dot operators. The trouble really began when I turned my attention to the star operator.

Attempting to partition the string precisely here was nearly impossible, but that didn't stop me wasting several hours on an attempted solution. I did find a much better approach in the end: splitting the string into every possible configuration and checking whether one of them matched. I feel as if I learned some important lessons here:
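The "every possible configuration" idea looks roughly like this in Python (hypothetical helper names, not my actual assignment code; `inner_matches` stands in for whatever function tests the inner regex):

```python
def splits(s):
    """Yield every way to cut s into a (head, tail) pair."""
    for i in range(len(s) + 1):
        yield s[:i], s[i:]

def match_star(inner_matches, s):
    """Does s match inner*, given a predicate for the inner regex?

    Try every non-empty head: it must match the inner regex, and
    the remaining tail must again match inner*. The empty string
    always matches a starred expression.
    """
    if s == "":
        return True
    return any(inner_matches(head) and match_star(inner_matches, tail)
               for head, tail in splits(s) if head != "")
```

Brute-forcing all splits is obviously not the fastest strategy, but it is far simpler than trying to compute the one "correct" partition up front.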

1. Think before you code! Despite the fact that this is in the briefing for every assignment and exercise, I didn't follow this rule here, and paid the price in a lot of wasted time. The first approach I chanced on was not the right one, and I think I could have avoided it if I had put more thought in up front.

2. Don't be afraid to change your approach. This was my second problem. I leaped into the problem without looking, but I could have fixed that if I had given up sooner. Instead, I didn't want to feel like I was quitting, and so I pressed on to my detriment. It would have been far better to cut my losses and switch approaches, even if it meant throwing out some of my code.

Saturday 9 November 2013

Entry 3: Assignment 2 Wrap-Up



Unfortunately these entries are always easy to postpone for 'one more day', especially when it's so hard to think of a topic (and when there's work to do). Today I'm going to talk a little about assignment 2: specifically, what went right, what went less than right, and what I'm going to do about it in the future.

Overall, I felt it was a more straightforward assignment than assignment 1. I think this is chiefly because we wrote assignment 2 more or less from scratch, while in assignment 1 we were fitting our code into a pre-existing framework that involved passing information into a GUI, etc. The hardest problems in each (moving cheeses; creating RegexTrees and matching strings against regexes) were both best addressed with recursive strategies, but the ones in assignment 2 were less conceptually thorny: when one is faced with a series of self-similar subproblems (e.g., assembling a more complicated valid regex out of other valid regexes in a nested structure), a recursive solution comes to mind immediately. The task of coding a neat way to move the cheeses around, on the other hand, I found more challenging.

My problem in assignment 2, then, was the elegance of my code. I got it into working order pretty quickly, but it was an absolute mess. It relied heavily on special cases and the like, and it just felt ugly. I was able to fix this to a degree, but I couldn't banish the feeling entirely. My distaste for my solution was reinforced by comparing its length to what the instructors and my fellow students posted on Piazza. Needless to say, mine was much longer, and while I know that long code is not necessarily bad code, I couldn't help but feel I had missed some key insight. My two problems were as follows. In the first part of the assignment, writing the constructor for RegexTree, I couldn't figure out an efficient way to parse the string, so I had to write several helper functions to accomplish it in a somewhat messy fashion. My second problem was in the match function. I had to do what felt like an inordinate amount of fussing around with the strings as I matched them against the RegexTree, and I needed about six (!) helper functions to do it. All this for a function that the instructors said they had managed to tackle in about a dozen lines.

Overall, I don't feel bad about assignment 2. I think I grasped the core concepts (binary trees, recursion, traversal), and I expect I'll do pretty well on it (my code appeared to function without glitch or bug). I think my misgivings about my solution are probably a useful indicator that I need to spend more time programming, so I can improve my facility with some of the tools at my disposal (namely recursion).


Wednesday 23 October 2013

Entry 2: Trees

This week I've decided to talk about trees, as they seem to be the focus of the course at the moment. As Professor Heap outlined, the name 'tree', while it corresponds intuitively to the species of plant, appears to come from graph theory.

As I'm sure the reader knows, the tree is a recursive data structure, defined in terms of its nodes, each of which contains both a value (say, an integer) and some number of references (potentially zero) called children (which is a bit of a mixed metaphor, next to all the plant imagery). A tree is a root node and its subtrees (each of which has its own root node), and so on and so forth. At the bottom of a tree is a 'leaf', which, sensibly, is a node with no children (assuming the tree is finite; practically, I don't think we will ever have to deal with infinite trees except as an error).

By this definition, a linked list is simply a tree in which each node has exactly one child (except the last, which has none). In fact, a single node is also a kind of tree: a root with no subtrees.
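In code, that definition is short. A sketch (a plain Python class of my own; not the exact one from class):

```python
class Tree:
    """A tree: a root value plus a list of subtrees (children).

    A leaf is just a tree whose children list is empty.
    """
    def __init__(self, value, children=None):
        self.value = value
        self.children = children if children is not None else []

# A linked list, in these terms, is a tree where every node
# has exactly one child (until the last):
linked = Tree(1, [Tree(2, [Tree(3)])])

# And a single node is a tree with no subtrees:
lone = Tree(42)
```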

So far we've stuck mainly with binary trees, which are, unsurprisingly, trees where each node has at most two children (which may be other nodes or None). Binary trees obviously present some advantages: it's easier to write code to traverse them, for one, as one knows in advance that each node has at most two children (and therefore at most two subtrees). These are actually part of the focus of assignment 2 (which I need to get working on!).
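A traversal sketch, to show what "easier to write" means in practice (my own example, not the assignment's classes):

```python
class BTNode:
    """A binary tree node: a value and at most two children."""
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right

def inorder(node):
    """Visit the left subtree, then the root, then the right subtree."""
    if node is None:
        return []
    return inorder(node.left) + [node.value] + inorder(node.right)

tree = BTNode(2, BTNode(1), BTNode(3))
print(inorder(tree))   # [1, 2, 3]
```

Because there are only ever two child slots to consider, the whole traversal fits in four lines.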

Most of my interest in trees comes from their structure. It's very elegant, an elegance that comes from their recursive definition and implementation. What I'm not so sure about at this point is their utility. The only real 'use' for trees we've seen so far is their ability to represent regular expressions in assignment 2, but I'm not at all experienced with regular expressions, and therefore I can't quite speak to whether using trees with regular expressions is good for anything more than an academic exercise.

I can think of some data that trees would be useful for. For example, if for some reason you wanted to store a 'master list' of organisms, it would make a good deal of sense to do it with a tree. The root node would contain references to Animalia, Plantae, etc., each of which would contain references to sub-classifications, until finally one reached the leaves, which would be species or sub-species. However, it doesn't seem like most data is suited to a similar representation: I can't imagine wanting to represent a list of students in a school this way, for example.

Sunday 13 October 2013

Entry 1: Recursion and OOP

Object-oriented programming:

Insofar as I understand it, the primary distinction between object-oriented programming and other 'styles' (which, to my recollection, we have not dealt with explicitly) is the bundling of data with its appropriate methods into the titular 'objects', as opposed to simply writing a number of functions that use the data as you wish, while leaving the data separate from them in a mass (I assume the methods of the non-OOPers are not as primitive in actuality). The former approach seems as if it would have many advantages over the latter. Simply on the level of organisation, knowing from the structure of the code how functions (methods) and data relate would seem to be an advantage; if a change to the program were necessary, it seems to me that OOP styling would make its implementation easier and its debugging more expeditious. I am not currently equipped with the knowledge to comment on how this might affect the functioning of the program itself, but I suspect the impact is non-trivial.
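A toy sketch of the contrast (my own example):

```python
# Non-OOP style: the data (a balance) and the functions that
# operate on it are kept apart, and nothing ties them together.
def balance_after_deposit(balance, amount):
    return balance + amount

# OOP style: the data travels with its methods, so the code's
# structure itself says what operates on what.
class Account:
    def __init__(self, balance=0):
        self.balance = balance

    def deposit(self, amount):
        self.balance += amount

acct = Account(100)
acct.deposit(50)
print(acct.balance)   # 150
```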


Recursion:

Recursion thus far has come in two flavours: recursive data structures and the recursive functions we have written to operate on them (the latter has been the primary focus, and our use of recursive functions has not been limited to recursive data structures). Unless I'm missing something, the nested list has been the only real recursive data structure we've encountered, and usually only as a motivating example for the use of recursive functions. I've found writing recursive functions to be challenging and intellectually exciting. There's a certain elegance when you finally conceive of a recursive solution to a tough problem; this sense of elegance is only magnified by the fact that you often only have to write half as much code to implement it! Again, I don't feel confident making any grand pronouncement about the usefulness of recursive functions to computer science more broadly, but they seem to be the optimal approach in certain situations, especially when dealing with the aforementioned self-similar data structures.
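A small example of the pattern on nested lists, since that's the recursive structure we've seen (my own illustration):

```python
def nested_sum(lst):
    """Sum a possibly-nested list of numbers.

    The structure is self-similar (a list may contain lists),
    so the function simply calls itself on each sublist.
    """
    total = 0
    for item in lst:
        if isinstance(item, list):
            total += nested_sum(item)   # recurse into the sublist
        else:
            total += item
    return total

print(nested_sum([1, [2, [3, 4]], 5]))   # 15
```

An iterative version would have to manage its own stack of pending sublists; the recursive one lets the call stack do that for free, which is exactly the halving of code the entry mentions.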