Jamboree Social Media forensic session
Some thoughts to include so far:
- Some statistics
- Which platforms and why?
- What did we learn?
- What did not work?
- What worked?
- Tips and tricks
What would you be interested in hearing about?
While I still remember, some learning outcomes from running the World Scout Jamboree website.
I’m sure more will follow.
There are a lot of bad teachers in academia. Actually, there are a lot of super smart professors, PhD students, teaching assistants and whatnot, who are all really bad teachers. There are exceptions of course, but it need not be that way. The default should be awesome teachers.
I’ve heard several arguments such as “Yeah, but teaching is not what they like the most,” “I’m here for the research. Working on hard problems is what captivates me,” and “I only get funding for X%, the rest I have to fill up with teaching.” Arguably all valid reasons. Take note that what I am about to propose doesn’t mean professors should do it 100% of their time. I’m just suggesting a shift in their priorities, such that when we evaluate their research and our education, we will read comments like “Amazing course! Makes me want to do a PhD asap”, “Awesome!”, and “This is ground-breaking!”
Here’s why I think universities should make teaching the number one priority for their researchers, PhD students, professors, and otherwise:
1. Tap into awesome ideas
In every course and every class, there is a vast unused or under-used pool of resources. Students, freshmen and seniors alike, control an enormous amount of intelligent ideas. Ideas and students are all different, a positive diversity which enables a multitude of angles to be explored. Through teaching, professors can challenge students, explore alternatives, evaluate their ideas, and receive contributions to their current (and past) research.
2. Use free labour for a better cause
In computer science a large part of any course’s curriculum is project work. I’ve been part of the most awesome projects and also some really boring ones. The boring ones tend to follow the scheme: “Here are the requirements, implement XYZ, do some not-so-extensive testing at the last minute, write a report, get a grade assigned to it, never look at it again.” Not so motivating. So many times (too many times?) students spend hours and hours implementing some algorithm just to meet the requirements of the course. Here’s a thought: let professors use the students’ free labour during course work to implement and/or evaluate some of their ongoing research. I would love to see how my late nights of hacking contribute to something real. I don’t give a rat’s about a grade as long as I learn, and I’m sure many students feel the same.
3. Make students interested in research
This is more a result of the two previous ideas than something ground-breaking. If students are exposed, through course work for example, to current research activities by professors at their university, that will open students’ eyes. Much of the time I have no clue what the computer science department’s researchers are working on, even though I have a genuine interest. Much of it I will probably never understand. But if teaching is professors’ number one priority, I’d be able to see and learn about what problems they are working on. They are not the only ones who like to work on hard (real) problems. Show me what you are up to and even I might consider a PhD.
The result: a bunch of A-class students with a deep understanding of research. In fact, much more than that! Professors will likely produce even better research, have a greater chance of getting publications (there seems to be a frenzy about this in academia), get more work done in less time, and be highly appreciated by their students.
In short, we’ll all have a jolly good time while being awesome.
The Android Eclipse plugin is generally very handy. There are, however, a few limitations. For example, if you are working with multiple emulators running simultaneously and want to update the GPS coordinates in both, you will find that you can load only one KML file at a time. Needless to say, this is a very specific issue, but annoying enough when developing applications which heavily depend on GPS functionality while interacting with other clients at the same time.
@pauloricardomg wrote a GPS server to be used in his and @navaneethr’s course project. He programmatically updates the coordinates by sending instructions over Telnet to the emulator. For the evaluation of our own project, we needed something similar. Below is a hack which extracts the essential functionality and wraps it in a Python script.
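The original script isn’t reproduced here, so this is a minimal sketch of the idea: open the emulator’s console port and issue a `geo fix` command. The port numbers, coordinates, and function names below are my own placeholders, not the actual script.

```python
import socket

def geo_fix_command(longitude, latitude):
    # The emulator console expects longitude first, then latitude.
    return "geo fix %s %s\n" % (longitude, latitude)

def send_geo_fix(ports, longitude, latitude, host="localhost"):
    """Push the same GPS coordinate to every emulator console listed."""
    for port in ports:
        conn = socket.create_connection((host, port))
        try:
            conn.recv(1024)  # consume the console's greeting banner
            conn.sendall(geo_fix_command(longitude, latitude).encode("ascii"))
        finally:
            conn.close()

# Example: update both emulators at once (5554 and 5556 are the default
# console ports for the first two emulators; coordinates are Stockholm):
#   send_geo_fix([5554, 5556], 18.0686, 59.3293)
```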
Hope it may come in handy.
Who for some mysterious reason share a deep love for Sweden. Last weekend they demonstrated their passion by singing the Swedish national anthem, without any rehearsal or knowledge of Swedish. Love it.
The DHTs are in business!
Today Lalith, Bruno and I presented Cassandra, a distributed database with high availability and partition tolerance, i.e. the A and P of the CAP theorem.
Overall we were pleased with the delivery. We got some good feedback from our professor and our colleagues. One comment addressed something I try to keep in mind when shaping presentations.
Sometimes your presentation is directed to a public which is familiar with Distributed Systems. This is PADI, but many people don’t have Distributed Systems as their major or minor, but Software Engineering. I’m not familiar with a lot of systems you talk about in Parallel Computing or Cloud Computing.
We should always start with the audience. Who is listening to our talk? What do they know? Why are they here (apart from attendance)? Where do they come from? We should think about this before we start thinking about content, structure, and other presentation mechanisms that we can use in our delivery.
We probably thought about it too, but not in sufficient detail. I thought: “they’re students like us.” Now I know: next time I make a presentation, I should do my audience analysis more carefully and avoid generalising too much.
Thanks for the feedback!
For Friday’s presentation of Cassandra – a distributed storage system – I needed to understand how the system is able to detect node failures. In distributed systems a so-called failure detector is sometimes used to simplify an algorithm’s work. And Cassandra uses a failure detector called the Accrual Failure Detector. Accrual, for those of you who don’t know, means accumulation, or the act of accumulating over time.
The basic idea is that a node’s state is not just up or down, true or false. Rather, it is an educated guess which takes multiple factors into account. With an approximation we can, for example, take slow messages into consideration and, thus, allow ourselves to be wrong. How weird is that?
A server (node A) suspects that a node is down because it hasn’t received the last two heartbeats from node B. Node A assigns a Phi value of (let’s say) 1. Phi denotes the suspicion level that another server might be down, and it can be adjusted dynamically according to local conditions such as load.
Phi reflects how confident Node A can be in declaring Node B down: the longer the silence, the smaller the chance of being wrong. So, when a third heartbeat is considered lost, Phi increases, and eventually a configured threshold is reached. When that happens, the application is notified about the failed node.
Cassandra approximates Phi using exponential distribution. Thus, the higher the Phi, the bigger the confidence that Node B has failed. I still haven’t found any more detailed explanation than the following as to why exponential is used rather than Gaussian:
[We have found the] Exponential Distribution to be a better approximation, because of the nature of the gossip channel and its impact on latency.
Don’t know if that made sense to anyone else, but I think I get it now.
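To make the arithmetic concrete, here is a toy reconstruction of Phi under an exponential inter-arrival model. This is my own sketch, not Cassandra’s code, and the one-second mean heartbeat interval is an assumption:

```python
import math

def phi(time_since_last_heartbeat, mean_interval):
    """Suspicion level Phi under an exponential arrival model.

    The probability that a heartbeat still arrives after a silence of
    t seconds is exp(-t / mean_interval); Phi is -log10 of that
    probability, so suspicion grows linearly with the silence.
    """
    p_heartbeat_arrives_later = math.exp(
        -time_since_last_heartbeat / mean_interval)
    return -math.log10(p_heartbeat_arrives_later)

# With heartbeats arriving on average every second, one second of
# silence gives a Phi of about 0.43, and a threshold of Phi = 8 would
# fire after roughly 18.4 seconds of silence.
```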
The last post was a good exercise. I decided to do the same for the next paper titled: Cassandra - A Decentralized Structured Storage System written by Avinash Lakshman and Prashant Malik, both employed at Facebook. We’re presenting this paper on Friday as part of our Distributed Systems course.
Same premises as before. I write as I read. And the goal of this paper:
Cassandra system was designed to run on cheap commodity hardware and handle high write throughput while not sacrificing read efficiency.
It is immediately noticed in the introduction of this paper that there is some weight behind their motivation:
Facebook runs the largest social networking platform that serves hundreds of millions users at peak times using tens of thousands of servers located in many data centers around the world.
Maybe it is unfair to compare papers from industry with papers from academia? Not everyone has access to millions of test users… Anyway. Failures are treated as the norm rather than the exception. Moreover, their systems must support continuous growth. It is also noted that Cassandra does not provide anything new; it only draws on what existed before. What’s new is the combination of techniques, and their implementation of it.
It scales to 250 million users as of writing and was initially intended for Inbox Search.
Related work mentions two projects commonly referenced in mobile computing literature: Coda and Ficus. Both replicate files for availability but don’t provide consistency guarantees. They also compare to GFS, the Google File System, and its simple design. It is interesting to note that they make no distinction between academic and industrial projects. Conflict resolution, network partitioning and data schemes differ among all the related projects. Dynamo is highlighted for its gossip-based membership. None fulfils the goal in one aspect or another.
The data model is very similar to Bigtable’s, with the addition of super column families: a column family within a column family. Columns can be sorted by name or by time. In general very few specifics are given, only possibilities listed.
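As a sketch, a super column family can be pictured as one extra level of nesting on top of a plain column family. The names below are invented for illustration (loosely inspired by the Inbox Search use case):

```python
# column family:       row key -> column name -> value
# super column family: row key -> super column -> column name -> value
inbox_search = {            # a super column family
    "alice": {              # row key: the user
        "jamboree": {       # super column: the search term
            "msg-17": "raw message",
            "msg-42": "raw message",
        },
    },
}
```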
The architecture of a storage system that needs to operate in a production setting is complex.
No shit. They limit the system architecture section to: partitioning, replication, membership, failure handling and scaling. A high-level overview is provided.
The principal advantage of consistent hashing is that departure or arrival of a node only affects its immediate neighbors and other nodes remain unaffected.
Immediately afterwards they also provide the drawbacks: non-uniform distribution of load and heterogeneous nodes. The solution is based on being able to make deterministic decisions on load balancing. Clearly related to the goal set out early in the paper.
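A minimal sketch of the basic mechanism shows why only neighbours are affected when a node leaves. This is not Cassandra’s implementation; MD5 as the hash function and the class layout are my own choices:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring: each node owns the arc between
    its predecessor's position and its own."""

    def __init__(self, nodes=()):
        self._ring = []  # sorted (position, node) pairs
        for node in nodes:
            self.add(node)

    @staticmethod
    def _pos(key):
        # Hash a string to a point on the ring.
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node):
        bisect.insort(self._ring, (self._pos(node), node))

    def remove(self, node):
        self._ring.remove((self._pos(node), node))

    def node_for(self, key):
        # First node clockwise from the key's position, wrapping around.
        i = bisect.bisect(self._ring, (self._pos(key), ""))
        return self._ring[i % len(self._ring)][1]
```

Removing a node only reassigns the keys that node owned to its successor; every other key keeps its old placement.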
It is also interesting to note that Cassandra makes use of real-life notions:
Cassandra provides various replication policies such as “Rack Unaware”, “Rack Aware” (within a datacenter) and “Datacenter Aware”.
There is a clear separation of concerns. Things not related directly to Cassandra are left out of the implementation and instead Cassandra depends on other tools to perform some tasks for it. For example, the leader-election is done by Zookeeper (developed by Yahoo).
Ironically, there are some mishaps involved that the average programmer rarely has to deal with:
Data center failures happen due to power outages, cooling failures, network failures, and natural disasters.
They use a failure detector called the Accrual Failure Detector which, instead of sending a boolean value to other nodes in the system, sends a suspicion value. The original authors of the failure detector suggested an approximation using a Gaussian distribution, but it seems an exponential distribution is better in gossip-based settings. I’m not sure I understand this. However, do note that they make use of clearly relevant research and adapt it to their needs, and provide a reason for doing so (“adjust well to network conditions and server load conditions”).
Some optimisations are provided for the local persistence.
All implementation details are left to a section which is very compact and absolutely loaded with information. It was a bit hard to digest at 00h39. I’m unsure how relevant some of these details are to the understanding of Cassandra.
The evaluation is completely based on experience rather than experimental data. They note that many of the problems that arose were not, and could not have been, foreseen in an experimental environment.
One very fundamental lesson learned was not to add any new feature without understanding the effects of its usage by applications.
Might seem obvious, but oh, how often we forget what it really means.
It is also worth noting that they provide key data for some research projects on failure detectors and their scalability. Most of the detectors proved unsuitable as the size of clusters grew. They also present the case of Inbox Search. However, no comparison data is provided; only raw values, and those don’t tell me much, not with respect to anything else at least. Their qualitative analysis is a bit weak as it only provides some “interesting” scenarios, but does not do a very good job of linking these to Cassandra’s system architecture. It shows inclinations, or directions, rather than concrete qualitative statements about performance.
In conclusion, this paper is truly different from the previous one presented here. First of all it is a lot shorter, 6 pages compared to 18 pages. Second, there are real motivations provided along with very clear and constructive examples. Third, they make no distinction between academic and industrial projects, and seem to make equal use of both. Whatever works, works. And last, they don’t try to solve everything. It is focused on one thing, and one thing only: decentralised storage. In other words, it is not an “all in one” solution.
My previous post was a bit disorganised. I have too many opinions about academia, some of them very strong, and I couldn’t focus. Thus, I decided to do an analysis of the next paper. By the time I finished this post, I had finished reading the paper: “Mobile Computing with the Rover Toolkit” by Anthony D. Joseph, Joshua A. Tauber and M. Frans Kaashoek, published in IEEE Transactions on Computers, Vol. 46, No. 3, March 1997.
Let’s start with what they set out to show in this paper:
In this paper, we describe the Rover toolkit, a set of software tools that supports applications that operate obliviously to the underlying environment, while also enabling the construction of applications that use awareness of the mobile environment to adapt to its limitations.
Quite a lofty goal, and again I flag the “trying to take over the world” stance. Moreover, Lalith just popped in and we got into the discussion: should researchers really produce applications?
They’re apparently evaluating this by: “We illustrate the effectiveness of the toolkit using a number of distributed applications, each of which runs well over networks that differ by three orders of magnitude in bandwidth and latency.” It remains to be seen in which context, though. Immediately afterwards the authors go on about a number of assumptions made, and a lengthy description of the characteristics of mobile computing is presented. Too many details, especially regarding data consistency, are presented in my opinion. Eventually they reach some kind of conclusion:
[…] a mobile-aware application can store not only the value of a write, but also the operation associated with the write. That operation can include any relevant context. Storing the operation allows the application to use application-specific semantic and contextual information;
Great. Basically, applications need to be aware of what’s going on underneath (i.e. what network connectivity do we have? how much battery is left?) so that they can optimise performance accordingly.
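The idea of storing the operation plus its context, rather than just the written value, can be sketched roughly like this. This is my own reconstruction of the concept, not Rover’s actual API:

```python
from dataclasses import dataclass

@dataclass
class QueuedOperation:
    """A write recorded as an operation plus context, not just a value."""
    method: str
    args: tuple
    context: dict

class OperationLog:
    """Queue writes while disconnected; replay them when a link is back."""

    def __init__(self):
        self._queue = []

    def record(self, method, *args, **context):
        self._queue.append(QueuedOperation(method, args, context))

    def flush(self, send):
        # Replay in order once connectivity returns; application-specific
        # context (link quality, battery, ...) travels with each operation.
        while self._queue:
            send(self._queue.pop(0))
```

Because the log keeps the semantic operation, the application can later decide how to replay or merge it using whatever context it recorded.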
Here’s a summary of some implementation details
Here it is worth noting: does this really make life simpler for the programmer? Surely, it seems they don’t have to worry about moving the actual code. But how flexible and adaptable is it? What trade-offs do they have to make?
They present four main results:
In related works we find that Rover is, apparently, also the first toolkit to support both the development of mobile-aware applications and so-called proxies that enable untouched applications to benefit from the “mobile-awareness”. (This, once again, feels like a marketing trick… but sure, let’s go with it.)
I won’t describe anything from the implementation details. To be honest, I didn’t bother reading them too carefully either. Let me add a thought that emerged from these sections though: there are potentially some ideas here that have migrated into “real” products. The details are also irrelevant since the paper was written in 1997 and much has changed since.
Section 5 presents the programmer with something similar to guidelines for how to port, or integrate, Rover into their mobile applications. It looks like a good overview of what steps are required for using Rover. Question: a significant amount of thinking has gone into this project, so why spoil it with minuscule details like:
"The application developer also must decide which mechanisms to use for notifying users of the cache status of displayed data. In the e-mail application, color is used to distinguish operations that have not been propagated to a server."
Secondly, that is hardly unobtrusive. How many users know what a cache is?
A table shows the number of lines changed to integrate Rover into existing and new applications. Convincing? It is good, though, to see how much work is required, or at least to get an estimate of it.
Lab tests were carried out for evaluation. There is a concise list of hypotheses that they are evaluating. I like. Unfortunately, there are only internal comparisons. We discussed in class one day that it is hard to do benchmarks (in general) in Computer Science because the field is moving too fast. I believe that if you cannot do quantitative measurements, at least provide qualitative assessments.
Their evaluation obviously shows gains (I have only seen a few papers where a solution was disproved… those were an entertaining read!). I’m mostly concerned about their values; 17% doesn’t sound like a significant gain. Is it worth it? An increase in bugs? Their final graph on speedup shows some promising results. There is a significant improvement over the original versions (based on a subset sample of tasks). A 7.5× speedup over slow networks is mentioned in the conclusion.
However, they do a bad job of connecting to their original goal.
"We have found it quite easy to adapt applications to use these Rover facilities"
What does that mean, by the way? Really, when you’re making qualitative statements, provide a solid argument. Don’t make loose relative statements on the back of your quantitative 7.5× speedups.
In practice, we find the combination of the Rover cache, relocatable dynamic objects, and queued remote procedure calls results in a surprisingly useful system.
Surprise! Now, except for you guys, who did/does? Show me!
Once again, the paper presents some cool ideas, probably genuine and innovative at the time of writing. But seriously, even if this is 14 years old, this still happens today. Perhaps even more because of increased competition.
Footnote: I wrote this post as I was reading the paper in an attempt to track my thoughts. In other words, an experiment.
Our course literature for Mobile Computing consists of a lot of academic papers. A superb way of integrating academic work into classes, compared to the often long and partially irrelevant books. Some papers are well written and easy to understand, usually because the authors have paid careful attention to the structure of the paper. Sometimes though, actually more often than not, I find that researchers are really trying to “take over the world” with their proposed solutions. They offer glory and a remedy for all your problems.
This is obviously not true. Neither do I think the researchers think that it is the case. But it sounds like that. So why spice papers up with sentences like the following?
The Odyssey architecture supports application-aware adaptation while paying careful attention to a variety of practical considerations. Our prototype confirms the feasibility of realizing this architecture, and its ability to support a wide range of applications.
A paper is, inevitably, an argumentation. It is a space to convince the reader that the proposed solution maps to the problem defined in the introduction. My problem, I think, is that researchers rarely present any realistic and convincing arguments for the problem in the first place. Their solution might be great. It may even be innovative! But if it is not a problem perceived by users, and the proposed solution does not provide added value to the user, it is flawed from the start.
Odyssey is the first system to simultaneously address the problems of adaptation for mobility, application diversity, and application concurrency. It is the first effort to propose and implement an architecture for application-aware adaptation that pays careful attention to the needs of mobile computing.
The statement above, from the related works section of the same paper, has some legitimacy. They do build on previous research and manage to show that clearly. I can even see from the progression of their problem statement that this might be “true”. Nevertheless, I interpret it as a marketing trick.
Perhaps I am looking in the wrong place, but I’d like to see better motivations for the research conducted. And I shouldn’t have to resort to reading a ton of surveys before reading a new paper. Although this is only a hypothesis so far, I sense that papers stemming from industry are better at providing realistic and believable scenarios and motivations. Ultimately, the conclusions in those papers also hold up better.
How do we know we, and researchers, are spending time on “the right thing”?
[Update: note to self – be more structured next time you write a post]
An active Scout for many years, most recently team lead for the Info/PR team for Lägr1. Also a presentation nerd, food lover, active reader, cautiously enthusiastic, avid traveller, and a big fan of smart ideas.