Crowdsourcing and Games With A Purpose

From CS2610 Fall 2014

slides1 slides2



Readings

Reading Critiques

Xiaoyu Ge 14:27:20 11/25/2014

Designing Games with a Purpose: In this paper the authors present a wide range of games and game types used to help solve real-life problems that computers currently have difficulty with. The authors describe games with a purpose (GWAPs) as the constructive channeling of human brainpower through computer games: as a side effect of playing, users perform tasks computers are unable to perform. One metric left out of this paper is the cost associated with game development. I think the idea of creating applications that leverage user input is very interesting, in particular for problems that computers cannot solve on their own. If we think about other interesting problems, like autonomous driving, we might turn "Need for Speed" into a data collector and learn useful information for the computer. But for a problem like language translation, it is hard to think of something that is fun for users and that also generates useful annotations. I would say that GWAP is a promising direction, but I believe it still has a long way to go in research and development.

Crowdsourcing User Studies with Mechanical Turk: This paper examines the use of Amazon's Mechanical Turk system, particularly how to get real, significant results from a Turk experiment. Reading it made me realize that special care must be taken when designing user studies run through micro-task markets. For the user study, the authors focused on one specific task, rating the quality of Wikipedia articles, but they designed two slightly different experiments to analyze the impact of task design on annotation reliability, in other words, how well the ratings match expert ratings. The authors provide a good justification in the introduction for why studying this particular system is interesting. They also suggest a set of guidelines for good survey design in crowdsourcing; however, those suggestions read more like general tips for anyone conducting a survey, which was a little disappointing to me. In conclusion, I believe the purpose of this paper is to show how to control the quality and quantity of the results obtained with Mechanical Turk. With proper planning and careful evaluation of the results, crowdsourcing can be a great tool for running experiments that require user feedback, which is the case in many research projects.

Wenchen Wang 17:45:07 11/28/2014

<Crowdsourcing User Studies With Mechanical Turk> <Summary> This paper evaluates the utility of Amazon's Mechanical Turk through two experiments. The authors conclude that Mechanical Turk has both advantages and limitations, and that researchers should take special care when applying it to user studies that require subjective or qualitative information. <Paper Review> Micro-task markets are built on a very creative idea: recruiting users from all over the world to do user studies for researchers. By applying human intelligence, users can help researchers with topics such as natural language processing and image recognition. However, this kind of micro-task has some limitations. This paper studies and investigates Mechanical Turk in order to give researchers guidance and suggestions on how to use this user-study platform. I think evaluating an existing system or piece of software is helpful. For example, a previous paper we read, "Design Lessons from the Fastest Q&A Site in the West," drew lessons from an existing Q&A website, StackOverflow. Such work can help people learn from existing systems or advise them on how to use those systems better. I think this paper could also give Mechanical Turk's designers suggestions for improving the system so that it suits more kinds of user studies. <Designing Games With A Purpose> <Summary> This paper introduces a set of design principles for the development and evaluation of a class of games with a purpose. <Paper Review> GWAPs (games with a purpose) are an approach to designing computer programs that takes advantage of the constructive channeling of human brainpower through computer games. The ESP Game, Peekaboom, Phetch, and Verbosity are all GWAPs that are familiar to us. I think GWAP is a very novel idea, similar to Amazon's Mechanical Turk in harnessing human intelligence for research. GWAPs combine computation and gameplay, and serve as a platform for collecting training data for machine learning. One principle is that we should make the game more entertaining in order to attract users to play; designers can add timed responses, score keeping, skill levels, and high-score lists to draw in more players. I am most interested in GWAP evaluation: a GWAP is evaluated by throughput, by the average overall amount of time the game will be played (lifetime play), and by expected contribution. However, the proposed game principles, or templates, have limitations: they focus on agreement between players as the way to ensure output correctness and do not cover tasks that demand creativity and diverse viewpoints.
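The three evaluation quantities mentioned above fit together simply: throughput is problem instances solved per human-hour, ALP (average lifetime play) is hours played per player, and expected contribution is their product. Below is a rough Python sketch with an invented play log and field layout (nothing here comes from the paper's data), just to make the arithmetic concrete.

# Hypothetical sketch of the three GWAP metrics described in the paper
# (throughput, average lifetime play, expected contribution). The log
# format and the numbers are invented for illustration only.

# each entry: (player_id, session_hours, labels_produced)
play_log = [
    ("p1", 0.50, 40),
    ("p1", 0.25, 18),
    ("p2", 1.00, 70),
    ("p3", 0.10, 6),
]

total_hours = sum(hours for _, hours, _ in play_log)
total_solved = sum(solved for _, _, solved in play_log)
players = {pid for pid, _, _ in play_log}

throughput = total_solved / total_hours   # problem instances per human-hour
alp = total_hours / len(players)          # average lifetime play per player (hours)
expected_contribution = throughput * alp  # instances one player is expected to solve

print(f"throughput = {throughput:.1f} instances/hour")
print(f"ALP        = {alp:.2f} hours/player")
print(f"expected contribution = {expected_contribution:.1f} instances/player")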

Qiao Zhang 1:37:59 11/29/2014

Designing Games with a Purpose: While traditional computational approaches try to solve problems by improving artificial-intelligence algorithms, in this paper the authors present "games with a purpose," in which people, as a side effect of playing, perform tasks computers are unable to perform. For example, they designed a game called ESP, in which users tag objects in photos, a task that is very difficult to solve with computer-vision techniques. One important insight of their work is that users are merely looking for entertainment rather than being personally interested in solving an instance of a computational problem. I had heard of this kind of crowd-powered game before taking this class: http://fold.it/portal/ (Foldit, a computer game that lets players contribute to important scientific research). To me, the hardest part of a GWAP is finding a suitable scientific problem that can be transformed into a game, rather than encouraging users to devote themselves to playing. I know there are a lot of problems that are relatively easy for human beings and hard for computers (mainly those relying on context and background knowledge), but turning such problems into games is challenging. In this paper, several general guidelines for GWAP development are discussed, and three game-structure templates are presented: 1) output-agreement games, in which players try to generate the same output from the same input; 2) inversion-problem games, in which one player tries to guess what the other player is describing; and 3) input-agreement games, in which players try to determine whether they have been given the same input. All three templates treat the game as a black box. One problem with such a design, as stated in the conclusion, is that players are rewarded for thinking like other players, which potentially suppresses the creativity of the users and the diversity of the answers. These approaches may work well for simple problems where there is indeed a correct answer that most users will agree on. To increase player enjoyment, the authors draw on the literature on motivation in psychology and organizational behavior. They believe that in game design, goals should be both well-specified and challenging. However, I believe that telling users they are contributing to scientific research would also be beneficial, because it provides a meaningful incentive, making the game more than just a game. To ensure output accuracy, the authors suggest using random matching, player testing, repetition, and taboo outputs. Player testing proves successful in the other paper we are reading, and a similar idea appears in reCAPTCHA, where the user has to enter the known word correctly. Several evaluation metrics are given for determining GWAP success, including throughput, lifetime play, and expected contribution. ================================ Crowdsourcing User Studies with Mechanical Turk: In this paper, the authors investigate the utility of a micro-task market such as Amazon Mechanical Turk for collecting user measurements, and discuss design considerations for developing remote micro user-evaluation tasks. A micro-task is a bite-sized task that can be completed in a short amount of time. The inspiration for the system was to have human users complete simple tasks that would otherwise be extremely difficult (if not impossible) for computers to perform.
The authors conducted two experiments to test the utility of Mechanical Turk as a user-study platform. The first experiment mirrors the task given to Wikipedia admins as closely as possible, and the results show that the correlation with expert ratings is only marginally significant. The second experiment is designed so that creating a believable invalid response takes as much effort as completing the task in good faith. The results show that with four verifiable questions added, the correlation increases and becomes significant; the 66% agreement suggests that it is possible to use crowds to approximate expert judgments in this setting. To harness the capabilities of micro-task markets: first, it is important to have explicitly verifiable questions as part of the task; second, it is advantageous to design the task so that completing it accurately and in good faith requires as much or less effort than non-obvious random or malicious completion; third, it is useful to have multiple ways to detect suspect responses. All in all, by using micro-task markets such as Amazon's Mechanical Turk, hundreds of users can be recruited at marginal cost, but special care must be taken in the design of the task.
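The "verifiable questions" device from the second experiment amounts to a screening filter applied before the ratings are analyzed. Here is a minimal sketch of that idea, assuming hypothetical question names, answers, and a made-up threshold (the paper does not publish code or these field names).

# Minimal sketch (not the paper's code) of the screening idea in the second
# experiment: keep a rating only if the worker also answered the explicitly
# verifiable "gold" questions correctly. All names and values are invented.

gold_answers = {"num_references": "12", "num_images": "3",
                "num_sections": "7", "first_keyword": "history"}

def is_credible(response, min_correct=3):
    """A response counts as credible if enough gold questions are right."""
    correct = sum(1 for q, a in gold_answers.items()
                  if response["answers"].get(q, "").strip().lower() == a)
    return correct >= min_correct

responses = [
    {"worker": "w1", "rating": 5,
     "answers": {"num_references": "12", "num_images": "3",
                 "num_sections": "7", "first_keyword": "history"}},
    {"worker": "w2", "rating": 7,      # likely gaming: gold answers wrong
     "answers": {"num_references": "1", "num_images": "0",
                 "num_sections": "2", "first_keyword": "asdf"}},
]

kept = [r for r in responses if is_credible(r)]
print([r["worker"] for r in kept])   # -> ['w1']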

Eric Gratta 16:07:41 12/1/2014

Designing Games with a Purpose: This paper seems like a great research contribution because it covers fundamental principles that will help guide others to create games with a purpose that may have real benefit for society. We can assume the authors are credible because they have created many GWAPs. I immediately thought of the game Foldit, which may have been developed after this paper. The drawbacks highlighted in the related-work section were interesting; specifically, I began to wonder how this paper would recommend eliminating error from user contributions of the kind that could occur in the Open Mind Initiative. Another interesting observation was that gameplay should be tightly integrated with the work being accomplished to make the tasks appealing. The scope is refined so that "success" is measured in terms of human hours played, assuming that this represents users' desire to play. After providing the basic templates of gameplay, the authors turn their focus to increasing player enjoyment, which will increase the hours that people play as well as improve the quality of their output. They outline game-design principles that increase enjoyment. A clear theme running through this discussion was the use of a variety and multiplicity of feedback for motivational purposes. Although it is somewhat inconsistent with the earlier statements in the paper about success being measured in human hours, a metric of "efficiency" was introduced for gauging success in terms of the number of problems solved per human hour. ------------------------------------------------------------------------- Crowdsourcing User Studies with Mechanical Turk: The title of this paper states the unique opportunity that researchers have to improve user-study sample sizes at low cost by utilizing "micro-task markets." The authors claim to have discovered special considerations that must go into user studies taking advantage of the crowdsourcing approach. I have actually used Mechanical Turk briefly, just after hearing about it by word of mouth. Many potential difficulties with using Mechanical Turk were identified; most of these issues have to do with lack of knowledge about the users and the lack of mechanisms for ensuring quality responses from users who may be malicious. They conducted two experiments, one quantitative and another qualitative (I thought this was important), to try to assess the potential of micro-task markets in research studies. The researchers made a wise decision in giving users the "Featured article criteria" rubric; one would assume this improved the consistency of user responses by giving users a shared set of expectations. Results from that experiment, unsurprisingly, demonstrated that Mechanical Turk was extremely susceptible to users who wish to game the system (i.e., not provide meaningful responses) and receive rewards quickly. Their redesign of the experiment attempted to address these issues. I did not feel that the discussion of the redesign decisions was very clear. Why, for example, does providing concretely verifiable information reduce low-quality submissions? Were users blocked from submitting their responses if this information was incorrect? The verb "required" is used, but it is not clear that a user could not simply type in random numbers and continue on.
If my assumption is correct, then this was a clever technique to ensure that even users wishing to rush through their response would have to spend as long on the task as well-intentioned users, exposing them to the contents of the article and thereby improving their response quality.

Qihang Chen 18:26:52 12/1/2014

"In this paper “Crowdsourcing User Studies With Mechanical Turk”, the authors talks several aspects about the Micro-task Markets: Mechanical Turk. The authors first give some introduction about Mechanical Turk; then briefly talks about the benefits from Mechanical Turk; thirdly, give two experiments on Mechanical Turk and at last, provide some design recommendations. I agree with the first design recommendation in this paper. Explicitly verifiable questions are very important in Mechanical Turk. Actually, how to design these questions are needed to pay lots of attention. Those questions should be used to exclude those people for seeking money. On the other hand, we cannot promise the verifiable questions are able to keep those malicious away. So if we have some abnormal outliers or inaccurate results, can we detect them without manually selecting? What’s the acceptance range for those data we obtain from Mechanical Turk? I think people should think of these questions before using Mechanical Turk. In this paper, “Designing Games with A Purpose,” the authors want to utilize the large amounts of time people spend on Game. The authors present three general classes of games containing all the GWAPs they have created. And they also describe a set of design principles. Finally, the authors propose a set of metrics defining GWAP success. I have to say that this idea or motivation may be not new to researchers. But with the development of internet and GPU, there is more space to explore in this area. Briefly speaking, the authors want to obtain useful training data through game. Those data are devised elaborately and can improve the intelligence of computer. As the authors say “people enjoy the game makes them want to continue playing, in turn producing more useful output.” This is also the reason why the real measure of utility for a GWAP is a combination of throughput and enjoyability. I think this is also the most challenging part in designing GWAP. I would say that GWAP is a promising direction, but it still needs a long way to research and develop. "

Bhavin Modi 22:17:59 12/1/2014

Reading Critique on Designing Games with a Purpose: The paper encourages the use of games to collect data for various AI and machine-learning tasks, performing computational tasks during play that are easy for humans but very tough for computers. Beginning with the importance of GWAPs, it is a fact that by the age of 21 the average person has spent about 10,000 hours playing games. To make use of the fact that we like to entertain ourselves from time to time, the motivation arises to fold certain computational benefits into that play. Existing implementations include ESP, Peekaboom, Phetch, Verbosity, etc. The design principles for GWAPs (Games with a Purpose) are discussed so as to outline the major features and advantages of such games and what should be kept in mind while building them. These are not traditional games; the guideline is to design them so that useful work is incorporated into the gameplay, and playing should yield the desired outcome, with a direct one-to-one correspondence between play and output. The three game-structure templates are output-agreement, input-agreement, and inversion-problem games. The target is to make people play because they enjoy it, not out of altruism or for monetary benefit. This can be accomplished because people like playing games and collaborating with other individuals toward the common goal of winning. Other design considerations involve features we have looked at before: making the game more competitive by including a ranking system, skill levels, and time constraints, among other features. We then still have to address the problem of correctness, since we want to use the information gathered from users playing with each other. Correctness applies to the labelling and identification games discussed in the paper; randomness, player testing against known answers, and repetition across games can be used to enforce it. The use of taboo words, as mentioned, helps improve the variety of the information gathered. The conclusion from the paper is that this approach is promising in its ability to use crowdsourcing to contribute to the field of AI. The idea of creating a first-person shooter game for the task of system administration is a unique and promising concept. The paper highlights how we can make crowdsourcing more entertaining and still achieve the desired results; it is more of an overview or summary of related work, extracting the important design principles from those implementations. --------------------------------------------------------------------------------------------------------
Reading Critique on Crowdsourcing User Studies with Mechanical Turk: An introduction to Mechanical Turk (MTurk), a crowdsourcing platform hosted by Amazon that assigns micro-tasks to a wide user base (anyone with access to the internet) for a small monetary benefit. The problem identified for researchers is gathering people for user studies. These studies may be long, and finding people willing to participate and sacrifice their time for no gain is difficult, while providing high monetary rewards creates cost problems. Finding ready users over the internet at low cost is what Amazon's Mechanical Turk provides. As in the previous paper, one of the major problems with such systems is users trying to "game" the system. Such malicious users can affect the results one wishes to achieve, and their presence and effect can be clearly seen in the first experiment conducted by the authors. The silver lining appears in the second experiment, where quantifiable questions whose answers can be verified are used to segregate the malicious users; such users can also be banned to improve quality. The major benefit of such a platform is that it gives ready access to users all over the world, and user studies can be conducted very quickly. We lose out in that we have no control over the surrounding setting or the experiment in general, and many variables affect the validity of our results. Though this approach has a lot of potential, we need to work out many problems; perhaps using such techniques for local recruitment could prove beneficial, but applying it to many kinds of studies still seems a distant reality. The upside is that ratings derived from comparatively amateur users can be made to match those of expert admins, and a more detailed study could reveal useful facts about how to better design studies, such as checking users' qualifications and reputation levels and making the studies more interactive, drawing on principles from the previous paper.

Brandon Jennings 22:21:42 12/1/2014

Designing Games With A Purpose: One reason artificial intelligence is difficult is that training a computer system to make decisions like a human is hard; there are vast relationships between a given scenario and the available options for achieving a particular goal. This work is in a way an extension of the idea of AI video-game opponents adapting the difficulty based on how well the player is performing, only instead of an end goal that doesn't mean anything (winning the game), this paper uses the idea to perform useful tasks. The best way to imitate human decision-making is to have humans make decisions and model those thought processes given the variables of the scenario. The appeal of incentives like scores and winning will sustain the crowdsourcing needed for such AI training methods, unlike most crowdsourcing applications that seem like more work than they are worth. I appreciated the level of detail the authors went into when describing the design of their games and addressing the concern of making the games as entertaining as possible while still being useful. Crowdsourcing User Studies: Micro-task markets are extremely useful because humans are recruited to complete tasks that would otherwise be handled by complex and sometimes inaccurate algorithms. People are readily available and can be incentivized to complete such tasks. I think the analysis of Mechanical Turk was informative, and I appreciate the use of a real, widely used system. Another important aspect of crowdsourcing is real data from real people: it is not simulated and they are not lab test subjects, so the data is more or less unbiased and comes from a more representative population. As the paper suggests, one of the most important design points in collecting data this way is using questions with a definite answer. This helps ward off participants who give unreliable and inaccurate answers simply for completion. One thing I would like to see in the future is an analysis of the kinds of experiments that work most effectively with user testing in micro-task markets.

Wei Guo 22:48:22 12/1/2014

Title: Designing Games with a Purpose. A game that lets users improve AI algorithms while being entertained is called a GWAP. Three kinds of games mentioned in this paper have proved to be successful GWAPs: output-agreement games, inversion-problem games, and input-agreement games. GWAP (Games With a Purpose) was an academic project at CMU that explored the idea of human-computation games to solve problems that computers cannot solve. I think this is a very good idea for image annotation: it is very difficult for computers to determine what is in an image, and if users help add tags to images, later search becomes easier. The problem is how to get users to do such a job without being bored; entertainment is very important, as the authors point out several times. The authors present three successful kinds of games; I am interested in the unsuccessful kinds. I think the authors should list some of them and tell us why they did not succeed. Title: Crowdsourcing User Studies With Mechanical Turk. This paper investigates the utility of a micro-task market for collecting user measurements and discusses design considerations for developing remote micro user-evaluation tasks; Amazon's Mechanical Turk is the example the authors use. "Amazon Mechanical Turk is based on the idea that there are still many things that human beings can do much more effectively than computers, such as identifying objects in a photo or video, performing data de-duplication, transcribing audio recordings, or researching data details. Traditionally, tasks like this have been accomplished by hiring a large temporary workforce (which is time consuming, expensive, and difficult to scale) or have gone undone." I think the description Amazon gives differs from the description in this paper. I believe it is fair to say that Amazon Mechanical Turk is itself a kind of GWAP, where the entertainment is replaced by a small monetary reward. This paper also mentions that special care is needed in formulating tasks in order to harness the capabilities of the approach. Although micro-task markets offer quick access to a large pool of users and data, we still need to pay extra attention to the safety and reliability of the data collected this way.

Longhao Li 23:15:05 12/1/2014

Critique for Designing Games With a Purpose: In general, this paper introduces the idea of using games as a front end for tasks that can currently only be done by humans, with several examples described in the article. This is an important paper from my perspective: it not only introduces the idea of games with a purpose (GWAPs) but also presents examples and explains how to make a good one. Since computers are still not smart enough to do every task, some work has to be done by human beings, and some of that work is so boring that people don't want to do it. GWAPs can help solve this problem: designing a game in which people complete the task while playing makes the work not boring at all; it can actually make it fun. Another reason this can work is that a lot of people play games every day and spend a huge number of hours on them; if some of that time goes into GWAPs, a lot of tasks can be completed in a very short time. However, how to make the game fun needs a lot of thought, and the balance between throughput and enjoyability needs to be worked out. A well-designed GWAP gets the task done with good quality while the user is entertained. In today's market there are some similar products; instead of completing tasks, they help the user learn something, such as learning a language or exercising the brain. Even though they are different, the same idea of using games to help users do something with pleasure underlies all of them. I think this idea can be used a lot in the future, since it helps human-computer interaction. Critique for Crowdsourcing User Studies With Mechanical Turk: This article talks about online user studies, which can gather more feedback at a lower cost. A big sample size makes experimental results more accurate. The main points of this paper are the introduction of micro-task markets and the analysis of one example, Mechanical Turk. A micro-task market is a system that lets users do small tasks for a reward, which is great for collecting user feedback. The traditional way of doing a user study makes it hard to reach a very big sample size, and the sample may cluster in one area, which can bias the result; conducting it the traditional way can also cost a lot. Mechanical Turk, an example of a micro-task market, makes user studies much easier to conduct, and with a big sample size the results can be accurate. Users can preview a task and see how much money they can get for it. The authors conducted a pair of experiments to test whether it is a good tool, and the results are not bad: a good design of tasks can lead to ratings closer to expert ratings. This kind of user study benefits from the big sample size, but it has the problem that when the task load becomes big, the time needed for the task increases, and people may lose interest in finishing the study even if they get paid, which may lead to a smaller sample size. Online user studies may also raise security problems: users' private information may leak, which most user-study rules forbid, so improving the security of online user studies would make them better.

zhong zhuang 23:48:11 12/1/2014

This article introduces a very interesting research area: using games to facilitate computational tasks. There are two basic facts: one is that many tasks are trivial for humans but impossible for computer programs; the other is that people spend millions of hours playing computer or video games every day. Bringing these two together raises a very interesting research question: can we use computer games to facilitate computational tasks? The authors introduce the term "games with a purpose," or GWAP, in which people, as a side effect of playing, perform tasks computers are unable to perform. The approach is motivated by three basic facts: an increasing proportion of the world's population has access to the Internet; certain tasks are impossible for computers but easy for humans; and people spend lots of time playing games on computers. In the article, the authors introduce three game templates: output-agreement games, inversion-problem games, and input-agreement games. In output-agreement games, two players are given the same input and must agree on an appropriate output. In inversion-problem games, two players are given an input; player 1 produces an output and player 2 tries to guess the original input. In input-agreement games, two players are each given an input and must determine whether their inputs are the same or not.
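To make the first of those templates concrete, here is a rough sketch of an ESP-style output-agreement round, assuming a simplified two-player loop with taboo words; the function name, flow, and data are invented for illustration and are not the actual ESP Game implementation.

# Rough sketch of the output-agreement template: both players see the same
# input and a round succeeds only when they produce the same label, taboo
# words excluded. Everything here is a simplified illustration.

def output_agreement_round(input_id, guesses_a, guesses_b, taboo=frozenset()):
    """Return the first label both players produce, ignoring taboo words."""
    seen_a, seen_b = set(), set()
    for a, b in zip(guesses_a, guesses_b):   # players type guesses in parallel
        seen_a.add(a)
        seen_b.add(b)
        agreed = (seen_a & seen_b) - taboo
        if agreed:
            return (input_id, agreed.pop())  # agreed label becomes a training example
    return None                              # timed out with no agreement

match = output_agreement_round("img_42",
                               ["dog", "puppy", "grass"],
                               ["animal", "puppy", "park"],
                               taboo={"dog"})
print(match)   # -> ('img_42', 'puppy')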

SenhuaChang 0:25:25 12/2/2014

Crowdsourcing User Studies with Mechanical Turk: This paper covers several aspects of a micro-task market, Mechanical Turk. The authors first give some introduction to Mechanical Turk, then briefly discuss its benefits, then present two experiments run on Mechanical Turk, and finally provide some design recommendations. I agree with the first design recommendation in this paper: explicitly verifiable questions are very important on Mechanical Turk. Designing these questions actually requires a lot of attention; they should be used to screen out people who are only seeking money. On the other hand, we cannot guarantee that verifiable questions will keep all malicious users away. <………..> Designing Games with a Purpose: This paper shows how people can solve problems while playing games, so that we can borrow their brainpower for solving hard problems. The paper presents three templates for games that let users provide useful annotations while they enjoy the game. The annotations here are very simple. I played those games, and I don't think they're fun, but this is an interesting research direction. If we think about other interesting problems, like autonomous driving, turning "Need for Speed" into a user data collector might let us learn useful information for the computer. But for a problem like language translation, it is hard to think of something fun for users that also generates useful annotations. As the authors say, people enjoying the game makes them want to continue playing, in turn producing more useful output. That really inspires me; it is the reason why the real measure of utility for a GWAP is a combination of throughput and enjoyability.

zhong zhuang 1:27:44 12/2/2014

This paper discusses the possibility of using a micro-task market such as Amazon's Mechanical Turk to run user studies for researchers. User studies are essential in research, but recruiting a large number of participants requires a large monetary reward, while a small user pool means the study will not reveal real problems. To resolve this tension, using a micro-task market may be one solution. Micro-tasks are tasks that require only minimal time and effort to complete, and user-study questions normally fall into this category. Posting user-study tasks on a micro-task market can gather a large number of responses at a relatively low cost. The paper tests this idea with two experiments; the conclusion is that, in order to gather high-quality feedback, special care must be taken in the design of the task, especially for user measurements that are subjective or qualitative.

Yingjie Tang 1:51:59 12/2/2014

"Designing Games With A Purpose" is an article that introduces three templates for GWAP designers. I am quite interested in this paper; the art of this research is that people spend many hours on computer games, and we can harness the side effects of that play to do computations that are hard for computers to do on their own. This kind of contribution can be applied to machine learning as well as computer vision. The three types of templates are output-agreement games, inversion-problem games, and input-agreement games. When I read this paper, I realized that I had played a kind of inversion-problem game that was quite attractive to me back in China: it was played in groups, with one group member describing an object and the other group members guessing it. I think it was interesting because it inherently builds in a competitive mechanism; when people compete with each other, the game becomes interesting, and I think this can be a major way to increase player enjoyment. Since games are an area where it is very difficult for researchers to quantify attributes, the authors applied some simple but useful metrics to evaluate a game's contribution and popularity. Throughput, introduced in the evaluation section, is impressive to me: it means the average number of problem instances solved per human-hour. Besides the three basic templates for designing a high-quality game with a purpose, the paper also suggests other principles for developing these games, such as showing the rankings of highly skilled players in order to motivate players and thus increase enjoyment. I remember a time during my undergraduate study when I put a large amount of time into a game just to improve my ranking. —————————————————————————————————————— "Crowdsourcing User Studies With Mechanical Turk" is an article that describes studies on Amazon's micro-task market, Mechanical Turk. There are two studies in this paper, the second following the first with small changes. The hypothesis of these studies is that user rankings of Wikipedia pages collected from Mechanical Turk have a strong correlation with those produced by the administrators. However, the result of the first study reveals only a marginal correlation between the Mechanical Turk ratings and the original ratings. This may be caused by participants "playing" the investigation: the study also reveals that nearly half of the participants chose rankings randomly in order to get the small bonus. To eliminate the influence of these participants, the researchers designed another study with a slight change: adding some small questions, concerning the content of the article, before the ranking. This is useful because it lets them distinguish the "playing" participants from the genuine ones. This paper is quite short, only four pages, but it has had great influence, with more than 800 citations, and there is no prototype construction in this study. I think the reason it is so successful is that it found a way to evaluate research on crowdsourced user studies. Besides, I learned a lesson from this paper: good research comes from a good idea.
The most important idea in this paper is to add some questions about the content of the Wikipedia pages in order to figure out how familiar the user is with the material, and thus figure out who is just "playing" with the study. The problem of finding those who rank randomly is hard for computers to solve but easy for people, the participants, to accomplish; this is a perspective from computer-supported cooperative work. In crowd computing there remain unsolved challenges. Specifically, in crowdsourced user studies we cannot control the environment the participants are in, and the users come from all over the world, so it is hard to control for specific participant characteristics in order to draw conclusions about specific circumstances.

changsheng liu 2:29:04 12/2/2014

<Crowdsourcing user studies with Mechanical Turk> User studies are important for many aspects of the design process and involve techniques ranging from informal surveys to rigorous laboratory studies. However, the costs involved in engaging users often require practitioners to trade off between sample size, time requirements, and monetary costs. Micro-task markets, such as Amazon's Mechanical Turk, offer a potential paradigm for engaging a large number of users at low time and monetary cost. The paper mainly investigates the utility of a micro-task market for collecting user measurements and discusses design considerations for developing remote micro user-evaluation tasks. Although micro-task markets have great potential for rapidly collecting user measurements at low cost, the paper shows that special care is needed in formulating tasks in order to harness the capabilities of the approach. <Designing Games with a Purpose> The set of guidelines the paper presents for building GWAPs represents the first general method for seamlessly integrating computation and gameplay, though much work remains to be done, and researchers will no doubt improve on the methods and metrics described in the paper. The GWAP approach represents a promising opportunity for everyone to contribute to the progress of AI. By leveraging the human time spent playing games online, GWAP developers are able to capture large sets of training data that express uniquely human perceptual capabilities. This data can contribute to the goal of developing computer programs and automated systems with advanced perceptual or intelligence skills.

Christopher Thomas 5:47:53 12/2/2014

2-3 Sentence Summary of Designing Games With A Purpose: The authors discuss GWAPs (games with a purpose) and their utility. They provide a paradigm for designing and thinking about GWAPs' goal-directed tasks. Finally, the authors state a number of conclusions about GWAP. I liked this article. It was upbeat and easy to read. Even though it lacked any sort of evaluation section, I still thought it made its points well and they were well motivated. One of the fundamental things that we have been discussing all semester in this class is how we can bring humans into the loop to solve problems that could otherwise not be solved. We can see that paradigm clearly expressed in games with a purpose. Computers are unable to solve certain problems requiring real intelligence - such as comprehension of texts, understanding, etc. Instead of focusing on improving A.I. routines, HCI tries to solve the problem in a different way - by using humans. While this might seem ridiculous at first, it is actually a great way to get human-annotated data. Games with a purpose are games that solve real problems through gameplay. For instance, when a user tags certain things in an image for points in the game, the computer can use those tags for something else later (such as machine-learning annotations). The authors discuss the fundamentals of GWAP - chiefly that the work needs to be enjoyable for the user. After all, if the game is boring, it is unlikely that it will be played very much and thus not a lot of work will be done. Thus, one of the key observations made by the authors was that a good GWAP has a high throughput (a lot of useful data is being collected) and is actually enjoyable for the user. One of the most interesting parts of the paper was the game-structure analysis the authors provided of common techniques used in GWAPs. Because the data is initially unlabeled, the system must learn whatever it is learning entirely from the users. Unfortunately, users' inputs are not always accurate or correct, and abusive users and mistakes need to be filtered out to avoid bad tags. Since the goal of such crowdsourcing techniques is often to obtain "gold-standard" human-annotated data from a distributed setting, checking the agreement between players is critical. This enables the system to mitigate the effects of noisy human players. Thus, to accomplish this goal, the authors establish several paradigms of gameplay. The first is output-agreement, where players must agree on the same output when given the same input. Another type of game is the "inversion problem," where one player produces an output and another player guesses the original input given to the first player. Finally, the authors discuss input-agreement games, where the players must determine whether or not they have been given the same input. The authors provide a fairly thorough analysis of each type of game, describing what task each type of game is appropriate for, the rules, etc. The authors call these ideas "game templates" because they describe the fundamental structure of the game, divorced from whatever task it is trying to accomplish. I thought this was a nice, high-level abstraction that could be quite useful if someone needed to design a GWAP. The authors also discussed techniques that can be used to mitigate errors (such as repetition and agreement). Finally, they presented a number of metrics that could be used to analyze GWAPs, such as expected contribution, ALP, and throughput.
All in all, I felt that the article was very informative and a light read. It discussed many of the challenges in GWAP but also illustrated their promise. I think I was most surprised by the authors' analysis of how GWAP could be used to obtain data of arbitrary accuracy; I didn't realize that GWAP could provide such clean data. 2-3 Sentence Summary of Crowdsourcing User Studies With Mechanical Turk: The authors discuss Amazon's Mechanical Turk service, which enables anyone to post tasks that users from around the globe complete either for reputation or for money. The authors explain that Mechanical Turk can be used for user studies (such as acquiring human-annotated data), but they observed some peculiarities for subjective tasks. Finally, the authors present their conclusions and suggestions for designing user studies using the Mechanical Turk service. Mechanical Turk provides a quick way to get human input on an essentially limitless number of tasks. MT has been used in many domains, ranging from computer vision to NLP. In the past, I worked on a project that used MT for NLP. In that project, users were asked to rate words on a certain scale based on each word's affect (emotion). However, we found that the inter-annotator agreement was very poor; many people's opinions of the words were vastly different. When we compared this with past analyses of inter-annotator agreement done by students at the University, we found significant differences: the people on MT disagreed much more about the words' affect than raters at the University did. This experiment revealed a fundamental problem that is also revealed in the paper. The people on MT are often paid a very small amount of money per task they complete. For subjective things, there is no right or wrong answer. Knowing this, users just click any button as fast as they can to get to the next page (because they get paid for each word they complete); it is in their best interest to complete as many tasks as possible because they then earn the most money. The authors of this paper came to the same conclusions in their experiment with Wikipedia. They compared MT ratings of Wikipedia articles with those given by Wikipedia experts and found a very poor correlation between the expert ratings and the MT ratings. They also suspected (as we did) that users were "gaming" the system; in other words, users were just clicking through as many pages as possible to get as much money as fast as possible, without regard to quality. For example, they even saw that one user took the same phrase ("more pictures to break up the text") and used it on every article that he reviewed. In their second experiment, they added a "test" with verifiable facts that the Turkers had to answer. This enabled the system to automatically detect malicious users, because such users would get the questions about the article wrong. This experiment enabled the authors to provide some useful conclusions for conducting studies with these types of techniques. First of all, having explicitly verifiable questions provides a fast way to detect malicious users. It also tells users that their responses will be checked and that they are not likely to get away with putting down anything without doing the work.
It increases the amount of time users spend on each task, thereby increasing quality, the inter-annotator agreement score, and the correlation with the expert ratings. The take-away message of this paper for me is that we cannot always trust humans to be honest; we must build safeguards into our experiments to check for malicious users. Also, having an inter-annotator correlation metric is extremely useful because it shows how consistent the users are (how much agreement there is among the users). If there is low agreement, either the users are gaming the system or the task is not well designed (or perhaps it is just too subjective). Thus, this paper is extremely useful for anyone interested in crowdsourcing user studies or depending on unsupervised human data.
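The two checks highlighted above, agreement among annotators and correlation with expert ratings, are easy to compute once the ratings are collected. Below is a small illustrative sketch with fabricated ratings; none of the paper's numbers are reproduced here, and the simple spread measure stands in for a proper agreement statistic.

# Illustrative sketch: within-item spread of Turker ratings as a crude
# agreement check, plus Pearson correlation of the crowd's mean rating
# with an expert rating. All ratings below are fabricated.

from statistics import mean

turker_ratings = {                # article -> list of worker ratings (1-7)
    "A": [5, 6, 5, 4],
    "B": [2, 3, 2, 2],
    "C": [6, 7, 6, 5],
}
expert_ratings = {"A": 5, "B": 2, "C": 7}

def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

articles = sorted(turker_ratings)
crowd_means = [mean(turker_ratings[a]) for a in articles]
experts = [expert_ratings[a] for a in articles]

for a in articles:
    spread = max(turker_ratings[a]) - min(turker_ratings[a])
    print(a, "mean", mean(turker_ratings[a]), "spread", spread)
print("crowd vs expert r =", round(pearson(crowd_means, experts), 3))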

yeq1 6:29:25 12/2/2014

Yechen Qiao Review for 12/2/2014. Designing Games with a Purpose: In this article, the authors look at existing interactive approaches that leverage humans' need for entertainment to solve real-world problems, and define a new class of algorithm/game they call GWAP. They elaborate on GWAP by 1) categorizing existing GWAPs into three classes (output-agreement, input-agreement, and inversion-problem), 2) describing design principles for GWAPs (to increase enjoyment: timed response, score keeping, skill levels, scoreboards, and randomness; to increase output accuracy: random matching, player testing, repetition, and others), and 3) proposing new evaluation metrics (measuring speed by throughput per human-hour, enjoyability by average lifetime play, and overall value by expected contribution). This article is interesting in that the theory behind it seems similar to that of the IP complexity class, where convergence of agreement is interpreted as the answer because the probability of error drops sharply after each round/play. While we have no need to use humans to solve classic IP problems (we can easily automate both the prover/guesser and the oracle), the class of problems that can be solved by a GWAP within polynomial throughput may be a class of its own. Many small details are still missing from this article that I would have to figure out in order to use this approach. Some problems that came to mind: How many plays/rounds are good enough? How do we create variations to retain old players? What is the limitation of GWAP? Also, the proposed evaluation metrics seem more like a gold standard than something we can use directly, since quantities such as ALP can only be speculated about unless we have some kind of cutoff (and if we do, evaluations across games become problematic, as it may not be possible to simply set cutoffs on intervals because of other contributing factors). Overall this article is great at opening up new directions where researchers can spend their time. Crowdsourcing User Studies with Mechanical Turk: In this article, the authors describe their experience using Amazon Mechanical Turk to rate article quality in Wikipedia. Initially, a small group of users generated enough noise that the responses no longer correlated with the expert answers. However, some simple techniques, such as providing keywords and using quantifiable screening questions, can help pick out these participants. The authors therefore provide guidelines for checking suspicious responses, and they list advantages and disadvantages of using the system. The article suggests that while this is good for rapid prototyping that combines subjective and objective data gathering, traditional user studies are still necessary for formal evaluations because validity cannot be guaranteed. The paper mainly focuses on how to design tasks that are easy for good workers and difficult for bad workers to game. This problem in collecting human-subjects research results is not unique. A month ago I went to an IRB event where the guest speaker gave suggestions to the board and to researchers at Pitt about common guidelines for social-media recruiting strategies. The talk was excellent, and the speaker noted, among many other points, that it is a common problem that a small number of online users will attempt to cheat researchers.
She and one of her collaborators had this issue a year ago, when thousands of dollars were wasted on data that was later found to be useless. They uncovered a research-dollar farm originating in China, which hires people to complete tasks as fast as possible. This cheating is very sophisticated and can bypass many screening approaches. However, they discovered that these people generally have some characteristics that are difficult to hide. The study was done over Facebook recruiting, and they discovered that payment information is the biggest hint as to whether a respondent is part of a research farm or not. It is also interesting that while IRBs generally side with research ethics, they have some leeway when faced with equally unethical practice by a participant: anonymity and confidentiality can sometimes be sacrificed ethically when the researcher faces the choice between publishing invalid data and having the means to validate the data. Some of the speaker's suggestions were very interesting, and I would like to see them implemented in recruiting programs on social networks. In general, this new type of study has a lot of potential and has gained traction in both the HCI and social-science communities. However, from what I gathered in the talk, both researchers and IRBs are still trying to figure out guidelines for how to use it correctly.
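The "probability of error drops after each round" intuition mentioned earlier in this critique can be made concrete with a back-of-the-envelope calculation: if an unreliable player matches a particular wrong label by chance with probability p, then requiring n independent players to agree drives the chance of accepting that wrong label toward p to the power n. The value of p below is assumed purely for illustration.

# Back-of-the-envelope illustration of why repetition/agreement works.
p = 0.05   # assumed chance that a random guess hits a particular wrong label
for n in range(1, 6):
    print(f"agreements required: {n}, P(wrong label accepted) ~ {p**n:.2e}")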

Jose Michael Joseph 7:15:56 12/2/2014

Crowdsourcing user studies with Mechanical Turk: This paper talks about using micro-task markets to collect user information and discusses design considerations for developing micro user-evaluation tasks that are remote in nature. The main problem such an approach addresses is that it does not force researchers to trade off between sample size, time requirements, and monetary constraints. A micro-task market is a system in which small tasks are entered into a common system where users can select and complete them for rewards. This incentivizes users to participate in such studies while also ensuring that the researcher has a large number of subjects to test some functionality of their product. The authors then discuss Amazon's Mechanical Turk and its key characteristics: tasks require little time and effort, adapting the original problem to such tasks can be challenging, the system is best suited for applications that have a definite answer, and the population diversity of its user base is unknown. To further understand these characteristics, the authors performed two tests on the system. In the first experiment, users assessed Wikipedia pages and their responses were compared to those of admins. It was found that many users tried to game the system by providing junk answers to maximize their pay, and the correlation between the users' inputs and the admins' inputs was only marginal. The second experiment was similar to the first, except that it was designed so that making a believable fake answer takes as long as actually answering the question. The results show that the number of ratings per user was significantly smaller and the correlation between users and admins higher, which shows that the authors' approach had a beneficial impact. The biggest challenge in utilizing such a system would be finding a way to effectively adapt a researcher's current study so that users can complete it as a series of small tasks. Since the parts of many current problems are strongly interrelated, it will be hard to separate them into small fragments. Addressing that challenge is critical and is something the authors have not explicitly done.

Jose Michael Joseph 7:16:36 12/2/2014

Designing games with a purpose: This paper discusses methods for designing games in such a way that they subtly make humans perform computations as part of play. These computations are then used with AI to improve the computer's performance at some complex task. Computations that a human performs effortlessly are often quite difficult for a computer to reproduce, so incentivizing humans to do them helps the system collect data to train itself. The GWAP approach is based on three factors: an increasing number of people have access to the internet; what is easy for a human to perform is sometimes quite difficult for a computer to replicate; and people spend a lot of time playing video games. The different types of games that can be modeled as GWAPs are output-agreement games, input-agreement games, and inversion-problem games. Output-agreement games require the non-communicating players to reach a common output from the same input. Inversion-problem games have one person as the describer and the other as the guesser. Input-agreement games give the players inputs known to the system, and the players' objective is to find out whether they have the same or different inputs. The authors then describe various methods to increase player enjoyment and output accuracy. The biggest problem with this method is that, as we have seen with other papers in HCI, people often find it difficult to be consistent about naming things. The output obtained by such a method can therefore be very subjective and could also be shaped by cultural influences. Finding a way to remove both of these issues would be a difficult task in itself.

Xiyao Yin 8:32:25 12/2/2014

'Designing Games With A Purpose' provides a new idea for computational approaches: using the constructive channeling of human brainpower through computer games rather than improved artificial-intelligence algorithms. Many games have shown their effectiveness in this area. The ESP Game provides labels that can be used to improve Web-based image search, which typically involves noisy information; other GWAPs, including Peekaboom, Phetch, and Verbosity, are also useful. In this paper, the authors articulate three GWAP game "templates" representing three general classes of games containing all the GWAPs they have created to date, namely output-agreement games, inversion-problem games, and input-agreement games, and differentiate them by initial setup, rules, and winning condition. This paper provides many good ideas about collecting data: people play not because they are personally interested in solving an instance of a computational problem but because they wish to be entertained; it is essential that the number of tasks for players to complete within a given time period is calibrated to introduce challenge, and that the time limit and time remaining are displayed throughout the game. The real measure of utility for a GWAP is therefore a combination of throughput and enjoyability. These all show the advantages and effectiveness of GWAPs. Future work includes developing new templates for new kinds of tasks. In 'Crowdsourcing User Studies With Mechanical Turk,' the authors investigate the utility of a micro-task market for collecting user measurements and discuss design considerations for developing remote micro user-evaluation tasks. They found that special care is needed in formulating tasks in order to harness the capabilities of the approach. It is often not possible to acquire user input that is both low-cost and timely enough to impact development, so practitioners need new ways to collect input from users on the Web. A micro-task market is a system in which small tasks are entered into a common system where users can select and complete them for a reward, which can be monetary or non-monetary; it offers the practitioner a way to quickly access a large user pool, collect data, and compensate users with micro-payments. The authors set up two experiments to test the utility of Mechanical Turk as a user-study platform. In my opinion, both are convincing, but it would have been better to include more data and figures in the paper. Results show that micro-task markets may be useful for other types of user-study tasks that combine objective and subjective information gathering. Although there are still a number of limitations to Mechanical Turk, we can look forward to future work.

Vivek Punjabi 9:41:37 12/2/2014

Designing Games with a Purpose: The authors in this article describe a method of building games such that the data generated during play is used to solve simple computational problems and train AI algorithms. These problems are either difficult or impossible to solve with computers but trivial for humans. The users won't even realize they are doing this work, and they get entertained at the same time; the authors call these GWAPs, games with a purpose. The authors use examples such as the ESP Game, which became the Google Image Labeler. The basic guidelines or templates for creating GWAPs include output-agreement, input-agreement, and inversion-problem games, along with principles for increasing player enjoyment and output accuracy. A game's throughput together with its enjoyability can be used as the real measure of a GWAP's utility, and the authors provide metrics to evaluate and compare GWAPs: throughput, ALP, and expected contribution. The idea of using data generated while playing addictive games to solve computational problems is exploratory and interesting; extending it to complex games like RPGs and multiplayer games would be a great challenge for researchers. Crowdsourcing user studies with Mechanical Turk: In this paper, the authors study and investigate a method of collecting user data in the form of micro-tasks. They examine the utility of the micro-task market using the example of Amazon's Mechanical Turk system, which anyone can use to post tasks for users spread worldwide, offering monetary or non-monetary rewards in exchange. They conducted two experiments in which they asked users of the Turk system to rate Wikipedia articles and compared their ratings to those of expert Wikipedia administrators. They found that changing the rating task in certain ways, such as making questions quantitative and verifiable, helps gather more useful data and fewer suspect responses. They also discuss the advantages and limitations of the Turk system based on their study. This research seems small compared to other studies; the authors could have evaluated at least two more similar systems and compared the results, which would give better insight into the limitations of such systems. One way to encourage expert and genuine user input would be to improve the reputation system and separate tasks based on reputation levels.

Mengsi Lou 9:57:15 12/2/2014

Designing Games With a Purpose: This article discusses a different focus for programming and game design, the constructive channeling of human brainpower through computer games. The authors present general design principles for the development and evaluation of a class of games, games with a purpose, or GWAPs, in which people, as a side effect of playing, perform tasks computers are unable to perform. In ESP, people provide meaningful, accurate labels for images on the web as a side effect of playing the game; these labels can be used to improve Web-based image search, which typically involves noisy information. There are several prior approaches to harnessing human processing skills: first, networked individuals accomplishing work; second, the Open Mind Initiative; third, interactive machine learning; and fourth, making work fun. The GWAP approach is characterized by three motivating factors: an increasing proportion of the world's population has access to the Internet; certain tasks are impossible for computers but easy for humans; and people spend lots of time playing games on computers. /////////////////////////////////////// Crowdsourcing User Studies With Mechanical Turk: This paper investigates the utility of a micro-task market for collecting user measurements and discusses design considerations for developing remote micro user-evaluation tasks. User studies are vital to the success of virtually any design endeavor, and an important factor in planning user evaluation is the economics of collecting user input; collecting input from only a small set of participants is problematic in many design situations. So the authors investigate a different paradigm for collecting user input: the micro-task market. Micro-task markets offer a potential paradigm for engaging a large number of users at low time and monetary cost. In a micro-task market, anyone can post tasks to be completed and specify the prices paid for completing them. The system aims to have human users complete simple tasks; tasks typically require little time and effort, and users are paid a very small amount upon completion. Mechanical Turk is best suited for tasks for which there is a bona fide answer, as otherwise users would be able to "game" the system and provide nonsense answers in order to decrease their time spent and thus increase their rate of pay. The diversity and unknown nature of the Mechanical Turk user base is both a benefit and a drawback. Although micro-task markets have great potential for rapidly collecting user measurements at low cost, the authors found that special care is needed in formulating tasks in order to harness the capabilities of the approach.