
[Image: a crowd 'speaking' with labels "Level Up!", "5 stars", "A+"]

Leveling up crowd-sourced educational content, with a little help from machine learning and lessons from HCOMP 2019.

More lessons from HCOMP 2019

Crowd-sourcing content: Although publishers (I work for one) create high-quality content that is written, curated, and reviewed by subject matter experts (SMEs, pronounced ‘smeez’ in the industry), there are all sorts of reasons we always need more content. In assessments, learners need many, many practice items to continually test and refine their understanding and skills, and faculty and practice tools need a steady supply of new items to ‘test’ what students know and can do at a given point in time. Students need really good, clearly explained examples (Khan Academy is a great example of a massive collection of these). When students attempt homework problems, they also need feedback about what NOT to do and why those approaches aren’t correct. To provide any of that, we also need to know which core concepts each activity or practice item requires.

Faculty and students are already creating content! Because faculty and students already produce lots of content as part of their normal workflow, and faculty are assembling and relating content and learning activities, it would be great to figure out how to leverage the best of that content, and of that relationship labeling, for learning. A paper by Bates et al. looked at student-generated questions and solutions on the PeerWise platform (https://journals.aps.org/prper/pdf/10.1103/PhysRevSTPER.10.020105) and found that, with proper training, students generate high-quality questions and explanations.

So at HCOMP 2019 I was listening for ways to crowdsource examples and ground truth. In particular, it would be useful to see if machine learning algorithms could find faculty and student work that is highly effective at helping students improve their understanding. The two papers below address both finding content exemplars and training people to get better at producing effective content. 

Some highly rated examples are better than others. Doroudi et al. wanted to see what effect showing highly rated peer-generated examples to crowd workers would have on the quality of the work those workers submitted. In this study, the workers were writing informative comparison reviews of products to help a consumer decide which product better fits their needs. The researchers started with a small, curated ground truth of high-quality reviews. Workers who viewed highly rated reviews before writing their own ended up producing better (more highly rated) reviews. That isn’t surprising, but interestingly, even among equally highly rated reviews, some were much more effective than others at improving subsequent work! The researchers used machine learning to identify the most effective training examples. So, while viewing any highly rated example improves new contributors’ content, we can improve the training process even more by selecting and showing the examples with the best track record of improving workers’ creations (a toy sketch of that selection idea is below).
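
To make that last idea concrete, here is a minimal, entirely hypothetical sketch of ranking training examples by their downstream track record. The data, the example names, and the simple averaging are mine for illustration; this is not Doroudi et al.’s actual method.

```python
# Hypothetical sketch: rank each highly rated example by the average rating of
# the work produced by contributors who were shown that example first.
from collections import defaultdict
from statistics import mean

# (example_shown, rating_of_resulting_work) pairs, e.g. logged from past tasks.
observations = [
    ("review_A", 4.5), ("review_A", 4.8), ("review_A", 4.1),
    ("review_B", 3.9), ("review_B", 4.0),
    ("review_C", 4.9), ("review_C", 4.7),
]

by_example = defaultdict(list)
for example, rating in observations:
    by_example[example].append(rating)

# Show new contributors the examples with the best downstream track record.
ranked = sorted(by_example, key=lambda e: mean(by_example[e]), reverse=True)
print(ranked)  # ['review_C', 'review_A', 'review_B']
```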

Using training and leveling up: Madge et al. introduced a crowd-worker training ‘game’ built around game levels. They showed the method was more effective than standard practice at producing highly effective crowd workers. Furthermore, they showed how machine learning algorithms could determine which tasks belong in each game level by observing how difficult the tasks turned out to be in practice.

Crowd workers are often used to generate descriptive labels for large datasets. Examples include tagging content in images (‘dog’), identifying topics in Twitter feeds (‘restaurant review’), and labeling the difficulty of a particular homework problem (‘easy’). In this particular study, workers were identifying noun phrases in sentences. The typical method of finding good crowd workers is to start by giving a new worker tasks that have a known “right answer” and then picking the workers who do best on those tasks to complete the tasks you actually want done. The available tasks are then distributed to workers randomly, meaning a worker might get an easy or a difficult task at any time. These researchers showed that you can instead train new workers using a ‘game’, so that they improve over time and take on more and more difficult tasks (harder and harder levels of the game), and the overall quality of labeling for the group of workers is better. A simplified sketch of the level-assignment idea is below.
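
Here is a simplified sketch of what “let observed difficulty decide the level” might look like. The thresholding scheme, function name, and numbers are my own invention, not Madge et al.’s algorithm.

```python
# Simplified sketch: assign each labeling task to a game level using its
# empirically observed difficulty, here just the fraction of workers who got
# a gold-standard version of the task wrong.
def assign_levels(task_error_rates, n_levels=3):
    """task_error_rates: {task_id: observed error rate in [0, 1]}"""
    levels = {}
    for task, error in task_error_rates.items():
        # Easy tasks (low error) go to level 1, hard ones to the top level.
        levels[task] = min(n_levels, int(error * n_levels) + 1)
    return levels

observed = {"np_chunk_1": 0.05, "np_chunk_2": 0.40, "np_chunk_3": 0.85}
print(assign_levels(observed))  # {'np_chunk_1': 1, 'np_chunk_2': 2, 'np_chunk_3': 3}
# Workers start at level 1 and unlock harder levels as their accuracy improves.
```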

Better education content: Faculty and students could become more effective producers of education content with the help of these two techniques. Motivating, training, and selecting contributors by showing them highly rated examples, and by leveling them up to ‘harder’ or more complex content, could help them create high-quality learning content (example solutions, topic and difficulty labels, feedback). These techniques also sound really promising for training students to generate explanations for their peers, and potentially for training them to give more effective peer feedback.

Getting to the truth, the ground truth, and nothing but the ground truth.

Takeaways for learning from HCOMP 2019, Part 2

At HCOMP 2019, there was a lot of information about machine learning that I found relevant to building educational technology. To my surprise, I didn’t find other ed-tech companies or organizations at either the Fairness, Accountability, and Transparency conference I attended last year in Atlanta or at HCOMP 2019. Maybe ed-tech organizations don’t have research groups that publish openly, and thus they don’t come to these academic conferences. Maybe readers of this blog will send me pointers to who I missed!

Mini machine learning terminology primer from a novice (skippable if you already know these): To train a machine learning algorithm that is going to decide or categorize something, you need to start with a set of things for which you already know the correct decisions or categories. Those are the ‘ground truths’ you use to train the algorithm. You can think of the algorithm as a toddler. If you want it to recognize and distinguish dogs from cats, you need to show it a bunch of dogs and cats and tell it what they are. Mom and Dad say, “look, a kitty”; “see the puppy?” An algorithm can also be ‘over-fitted’ to the ground truth you give it. The toddler version of overfitting is a toddler who knows the animals you showed them (that Fifi is a cat and Fido is a dog) but can’t recognize new animals, for example the neighbor’s pet cat. To add a further wrinkle, when you are creating a ground truth it is always great if you have Mom and Dad to create the labels, but sometimes all you can get are toddlers (novices) labeling. Using novices to create training labels is related to the idea of the wisdom of the crowd, where the opinion of a collection of people is used rather than that of a single expert. You can also introduce bias into your algorithm by showing it only calico cats in the training phase, causing it to label only calicos as “cats” later on. Recent real-world examples of training bias come from facial recognition algorithms that were trained on light-skinned people and therefore have trouble recognizing black and brown faces.
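
To make the training and overfitting parts of the primer concrete, here is a tiny, entirely made-up sketch (not from any HCOMP talk): a ‘cats vs. dogs’ classifier trained on noisy novice labels, where an unconstrained model memorizes the noise while a simpler one generalizes better. The features, numbers, and choice of scikit-learn are mine, purely for illustration.

```python
# Made-up illustration of training on a labeled "ground truth" and overfitting.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Two invented features per animal: weight (kg) and ear length (cm).
# Label 0 = cat, 1 = dog.
weight = np.concatenate([rng.normal(4, 1, 200), rng.normal(25, 8, 200)])
ears = np.concatenate([rng.normal(6, 1, 200), rng.normal(10, 3, 200)])
X = np.column_stack([weight, ears])
y = np.array([0] * 200 + [1] * 200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Flip 20% of the training labels to mimic imperfect "toddler" labeling.
flip = rng.random(len(y_train)) < 0.2
y_train_noisy = np.where(flip, 1 - y_train, y_train)

# A deep tree can memorize the noisy training set (the Fifi-and-Fido problem);
# a depth-limited tree is forced to learn the broader cat/dog pattern.
for name, depth in [("deep tree", None), ("depth-2 tree", 2)]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train_noisy)
    print(name,
          "train:", round(accuracy_score(y_train_noisy, model.predict(X_train)), 2),
          "test:", round(accuracy_score(y_test, model.predict(X_test)), 2))
```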

Creating ground truth: A whole chunk of the talks was about different ways of creating ‘ground truths’ using ‘wisdom of the crowd’ techniques. Ed-tech needs quite a bit of ground truth about the world to train algorithms to help students learn effectively: “How difficult is this task or problem?” “What concepts are needed to do this task/problem?” “What concepts are a part of this text/example/explanation/video?” “Is this solution to this task/problem correct, partially correct, displaying a misconception, or just plain wrong?”

Finding the best-of-the-crowd: Several of the presentations were about finding and motivating the best of the crowd. If you can find and/or train ‘experts’ in the crowd, you can get to the ground-truth at lower cost (in time or money). I am hoping that ed-tech can use these techniques to crowdsource effective practice exercises, examples, solutions, and explanations. 

  1. Wisdom of the toddlers. Heinecke et al. (https://aaai.org/ojs/index.php/HCOMP/article/view/5279) described a three-step method for obtaining a ground truth from non-experts. First, they used a large number of people and expensive mathematical methods to obtain a small ground truth. (Sticking with the cats-and-dogs example from the primer above: you have a large number of toddlers tell you whether a few animals are cats or dogs, and use math to decide which animals ARE cats and which ARE dogs using the wisdom of the toddlers.) Step 2 is to find the small subset of that large pool of people who were best at determining the ground truth, and use them to create more ground truth. (Find a group of toddlers who together labeled the cats and dogs correctly, and have them label a whole bunch more cats and dogs.) Finally, you use the larger set of ground truth to train a machine learning algorithm. (A rough sketch of this pipeline follows this list.) I think this is very exciting for learning content, because we have students and faculty doing their day-to-day work, and we might be able to find sets of them who can help answer the questions above.
  2. Misconceptions of the herd: One complicating factor for educational-technology ground truths is the prominent presence of misconceptions. The Best Paper winner at the conference, Simoiu et al. (https://aaai.org/ojs/index.php/HCOMP/article/view/5271), found an interesting, relevant, and in hindsight unsurprising result. This group did a systematic study of crowds answering 1,000 questions across 50 topical domains. They found that averaging the crowd’s answers almost always yields significantly better results than the average (50th-percentile) person. They also wanted to see the effects of social influence on the crowd. When they showed the ‘consensus’ answer (the three currently most popular answers) to individuals, the crowd was swayed by early wrong answers and thus did NOT, on average, perform better than the average unswayed person. Since misconceptions (wrong answers due to faulty understanding) are a well-known phenomenon in learning, and are particularly resistant to change (if you haven’t seen Derek Muller’s wonderful 6-minute TED talk about this, go see it now!), we need to be particularly careful not to aid their contagion when introducing social features.
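
For the ‘wisdom of the toddlers’ pipeline in item 1, here is a rough sketch of the general shape of the idea. It substitutes a simple majority vote and an agreement threshold for the more sophisticated mathematical methods the paper actually uses, and all the names and data are invented.

```python
# Rough sketch of a three-step ground-truth pipeline (not Heinecke et al.'s method):
# 1) aggregate many novice labels into a small ground truth by majority vote,
# 2) keep the annotators who agree most with that consensus,
# 3) let that reliable subset label a much larger pool of items.
from collections import Counter, defaultdict

# seed_labels[item_id][annotator_id] = label  (made-up data for illustration)
seed_labels = {
    "img1": {"a": "cat", "b": "cat", "c": "dog", "d": "cat"},
    "img2": {"a": "dog", "b": "dog", "c": "dog", "d": "cat"},
}

# Step 1: consensus label per seed item.
consensus = {item: Counter(votes.values()).most_common(1)[0][0]
             for item, votes in seed_labels.items()}

# Step 2: score annotators by agreement with the consensus.
agreement = defaultdict(list)
for item, votes in seed_labels.items():
    for annotator, label in votes.items():
        agreement[annotator].append(label == consensus[item])
scores = {a: sum(hits) / len(hits) for a, hits in agreement.items()}
reliable = {a for a, s in scores.items() if s >= 0.8}

# Step 3: have only the reliable subset label the larger unlabeled pool,
# then train a model on the expanded ground truth.
print(consensus)  # {'img1': 'cat', 'img2': 'dog'}
print(scores)     # {'a': 1.0, 'b': 1.0, 'c': 0.5, 'd': 0.5}
print(reliable)   # {'a', 'b'}
```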

Are misconceptions like overfitting in machine learning? As an aside, my friend and colleague Sidney Burrus told an interesting story that sheds light on the persistence of misconceptions. Sidney described how, during the transition from an earth-centered to a sun-centered model of the solar system, the earth-centered model was much better at predicting orbits, because people had spent a lot of time adding detail to it so that it correctly predicted known phenomena. The first sun-centered models, by contrast, used circular orbits and did a poor job of prediction, even though they ultimately had more ‘truth’ in them. Those early earth-centered models were tightly ‘fitted’ to the known orbits; they would not have been good at predicting new orbits, just as an overfitted machine learning model will fail on new data.

HCOMP 2019 Part 1 – Motivation isn’t all about credit.

HCOMP 2019 Humans and machines as teams – takeaways for learning

The HCOMP (Human Computation and Crowdsourcing) 2019 conference was about humans and machines working as teams and, in particular, about combining ‘crowd workers’ (like those on Mechanical Turk and Figure Eight) with machine learning to solve problems effectively. I came to the conference to ‘map the field’: to learn what people are researching and exploring in this area and to find relevant tools for building effective educational technology (ed-tech). I had an idea that this conference could be useful because ed-tech often combines the efforts of large numbers of educators and learners with machine learning recommendations and assistance. I wasn’t disappointed. The next few posts contain a few of the things I took away from the conference.

Pay/Credit vs. Quality/Learning. Finding the sweet spot. Ed-tech innovators and crowd-work researchers have a similar optimization problem: finding the sweet spot between fairness and quality. For crowd workers, the tension comes from the need to pay fairly for time worked without inadvertently incentivizing lower-quality work. The sweet spot is fair pay for repeatably high-quality work. We have an almost identical optimization problem in student learning if you consider student “pay” to be credit for work and student “quality” to be learning outcomes. The good news is that while the two are often in tension, those sweet spots can be found. Two groups in particular reported interesting results in this area.

  1. Quality without rejection: One group investigating the repeatability of crowd work (Qarout et al.) found a difference in quality of about 10% between work produced on Amazon Mechanical Turk (AMT) and on Figure Eight: AMT allows requesters to reject work they deem low quality, Figure Eight doesn’t, and the AMT workers completed tasks at about 10% higher quality. However, the AMT workers also reported higher stress. Students also report high levels of stress over graded work and fear making mistakes, both of which can hurt learning, yet we have found that students on average put in less effort when work is graded for completion rather than correctness. Qarout et al. tried a simple equalizer. At the beginning of the job, on both platforms, they explicitly said that no work would be rejected, but that quality work would be bonused. This adjustment brought both platforms up to the original AMT quality, and the modified AMT tasks were picked up faster than the original ones because the work was more appealing once rejection was off the table. It makes me think we should be spending a lot of research time on how to optimize incentives for students to expend productive effort without relying too heavily on credit for correctness. If we can find an optimal incentive, we have a chance to both increase learning and decrease stress at the same time. Now that is a sweet spot.

  2. Paying fairly using the wisdom of the crowd: A second project with implications for learning is FairWork (Whiting et al.). This group at Stanford created a way for requesters who want to pay Amazon Mechanical Turk workers $15/hour to make sure, algorithmically, that workers are paid an average of $15/hour. Figuring out how long a task takes on AMT is hard, much like figuring out how long a homework assignment takes, so the Stanford group asked workers to report how long their task took, threw out outliers, and averaged the remaining times. They then used Amazon’s bonusing mechanism to automatically bonus work up to $15/hour. The researchers used integrated tools to time a sample of workers (with permission) to check whether the self-reported averages were accurate, and found that they were; they plan to keep studying how well this holds up over time. For student work, we want to know whether students are spending enough effort to learn, and we want them to get fair credit for their work. So it makes sense to try having students self-report their study time, and to use some form of bonusing for correctness to incentivize effort without penalizing the normal failure that is part of trying and learning. (A back-of-the-envelope sketch of the pay calculation is below.)
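
Here is a back-of-the-envelope sketch of that pay calculation as I understood it. The trimming rule, function names, and numbers are mine for illustration, not FairWork’s actual implementation.

```python
# Sketch: average self-reported task times after trimming outliers, then bonus
# each task up to a $15/hour effective rate.
import statistics

def trimmed_mean_minutes(reported_minutes, trim_fraction=0.1):
    """Drop the fastest and slowest reports before averaging."""
    ordered = sorted(reported_minutes)
    k = int(len(ordered) * trim_fraction)
    kept = ordered[k:len(ordered) - k] or ordered
    return statistics.mean(kept)

def bonus_per_task(reported_minutes, base_pay_dollars, target_rate=15.0):
    """Extra pay needed so the average worker earns target_rate per hour."""
    avg_hours = trimmed_mean_minutes(reported_minutes) / 60
    fair_pay = target_rate * avg_hours
    return max(0.0, round(fair_pay - base_pay_dollars, 2))

# Example: workers self-report 8-14 minutes for a task that paid $1.50 up front.
reports = [8, 9, 10, 10, 11, 12, 12, 13, 14, 45]  # 45 is an outlier
print(bonus_per_task(reports, base_pay_dollars=1.50))  # about $1.66 extra
```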