How To Structure and Run A Data Science Competition
Data science competitions are an effective and efficient way to crowdsource innovative artificial intelligence (AI) solutions to pressing problems in education. However, successful competitions require planning and preparation. This report outlines best practices for competition management and key decision points on how to create and launch smoothly running competitions that will result in innovative, high-quality solutions. These elements were identified and outlined in consultation with experts from the field.
Competition organizers should carefully plan and execute their challenges to create a smooth process that will yield quality results. Specifically, competition organizers should:
- Select or build a robust, novel, diverse dataset to enhance competition outcomes. For instance, when aiming to develop an AI essay detection algorithm, the dataset should contain both AI and human writing, including demographic diversity to assess potential bias. Additionally, organizers should differentiate the dataset, for example by increasing its size or by eliminating spelling errors that would otherwise make human writing easy to identify. By investing in dataset curation, organizers can help ensure the competition’s results will be generalizable, accurate, and innovative.
- Design the competition and its metrics to reflect the project’s goals. While there are many formats and adaptations for competitions, organizers must find the one that will work best for their dataset. For example, when creating a free public resource using a large, text-based dataset, an open competition on a popular data science competition platform may be the best choice. However, when utilizing a small unusual dataset – such as one based on audio or visual clips – recruiting specialists to a private competition might spur more dedicated and targeted results.
- Create a collaborative environment with clear, consistent communication and guidelines for competitors and other stakeholders. For example, detailed rules and metrics give potential participants the ability to understand the tasks, timeline, and parameters of the competition. This allows them to jump into task-related problem-solving and increases their engagement in the competition. Furthermore, communication tactics like discussion boards can also prevent competitor burnout by providing a place where people can collaborate on tough problems and alleviate their frustration when they encounter issues or roadblocks. Overall, strong communication can help participants persist through difficult tasks or when they are not on the leaderboard.
- Prepare for cost considerations of data and competitions. For example, the size of datasets and the complexity of a task can greatly impact the overall costs. Awareness of the general costs prevents mistakes in planning and preparation before projects even begin.
What Is A Data Science Competition?
Technology is advancing so quickly that systems are hard-pressed to keep up or adapt. In one survey, half of AI experts said that human-level AI will exist before 2061, and 90 percent said it will arrive within the next 100 years. The shifts are both exciting and daunting. The latest advancements in artificial intelligence have revolutionized industries like big tech and health care, but others, like education, have been slower to adapt. In this evolving landscape, aligning educational strategies with AI advancements is becoming ever more important for improving education and learning. Data science competitions can act as a catalyst for bringing AI tools into the classroom. This report outlines how to maximize the value of data science competitions for education.
Data science competitions can crowdsource innovative artificial intelligence solutions to pressing educational needs. However, to achieve this, competitions require foresight, planning, and preparation.
Educational data science competitions have the potential to create a lasting impact in the education field and can pioneer innovation in student- and teacher-facing tools. They can attract critical talent to education, establish new benchmarks for developing and evaluating machine learning models, and help democratize access to AI algorithms. They promote transparency in AI by creating an open-source community for sharing innovations.
Take, for example, the Automated Student Assessment Prize (ASAP). Launched with the support of the Hewlett Foundation 14 years ago, the ASAP dataset helped launch the practice of automated essay scoring, and even today, researchers and technologists actively discuss on message boards how they could better parse the data.
Last year, The Learning Agency Lab, along with Georgia State University and Vanderbilt University, launched five competitions that engaged over 8,000 data scientists, with the winning algorithms matching human-level performance. The tasks ranged from evaluating student-written summaries to predicting student performance in game-based learning environments. These public, open-source solutions can then be adopted by popular edtech platforms, showing how data science competitions are not just interesting research projects or thought experiments – they can have a real, lasting impact on students and teachers.
Best Practices For Running A Data Science Competition
The various best practices of competition management can be categorized into four sections: choosing a dataset, designing the competition, competition communications, and cost considerations.
Section 1: Select or build a robust, novel, diverse dataset to enhance competition outcomes.
Datasets are the backbone of any data science competition, and the resulting algorithms are only as good as the datasets they are trained on. There are many variables to consider when identifying whether a dataset is fit for a data science competition. To begin, organizers should pay particular attention to their dataset’s novelty or uniqueness, its size, demographic representation, and the potential impact of the data. Importantly, organizers must assess whether the data can be used to complete the task that the competition is built around.
Secondly, organizers should anticipate that participants who can quickly and easily understand the data will find the competition more enjoyable and productive. Additional considerations such as the sharing terms of the data, the train/test set split, and methodology explanations are important for framing a competition and for aligning the data to suit the desired outcome.
We identified eight important elements to consider when designing a competition dataset:
- Uniqueness of Data Collection
- Size
- Demographic Information
- Potential Impact
- Viable Competition Task
- Ethical Data Collection
- Building a Test Set
- Data Sharing Terms
Uniqueness of Data Collection
The dataset should contain information on a novel subject matter or have a level of detail that has not been previously collected. Novel and granular datasets enable contestants to develop innovative models that could not be created otherwise.
Size
Datasets used in competitions to produce machine-learning models must be large enough to train, build, and evaluate highly accurate models. Both the “length” and “width” of the dataset should be considered (number of samples and number of variables). Exact size requirements, however, depend on the novelty of the data collection, the complexity of the machine learning task, and the data’s suitability for simple or more complex machine learning algorithms.
If similar public datasets already exist, the prospective dataset should be “longer” or “wider” — having a larger number of samples or variables. If similar datasets do not exist, size requirements are flexible, given the novelty of the data.
For example, a dataset on keystroke logging, which tracks and records a student’s key entries on a computer during an assessment, would be considered a novel data collection because it is a new and emerging topic for researchers. In this case, a few hundred samples could suffice if each sample contains a robust number of data points. On the other hand, datasets of test scores are common, so such a dataset would need a substantially larger number of samples or variables to be considered superior to existing, open datasets.
Additionally, datasets that require deep learning techniques or pose a more challenging task, like sequence prediction or text segmentation, will require tens or hundreds of thousands of samples.
Demographic Information
If students are the focus of analysis in a dataset, the data should include information on important student demographic characteristics, such as:
- Economic disadvantage (i.e., student eligibility for free school meals or government assistance programs)
- Race and/or ethnicity
- Sex and/or gender identity
- Second language learner status
- Special needs or disability status
If there are multiple groups within the demographic category, it is highly recommended to have a sufficient representation of the minority or historically marginalized group(s). For example, if a U.S.-based dataset provides race information and includes data from White, Black/African American, Asian/Pacific Islander, and American Indian/Alaskan Native students, students who identify as members of those groups must be represented in the data on par with national populations or be overrepresented.
If there is no way to balance the data by recording more responses, the imbalance may be addressed by splitting or weighting the test set to reward models that better account for demographic data, as in the sketch below. However, getting the wrong split of training data can be frustrating for participants because their models may underperform on the test set.
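As a minimal sketch of one way to weight a test set by demographic group (an illustration, not a method prescribed in this report), each group can be given equal weight in the final score. The column names and toy data below are hypothetical.

```python
# Group-weighted accuracy: each demographic group counts equally, so a
# model cannot score well by performing well only on the majority group.
# Column names and toy data are hypothetical.
import pandas as pd
from sklearn.metrics import accuracy_score

def group_weighted_accuracy(df: pd.DataFrame, group_col: str = "race_ethnicity") -> float:
    """Average per-group accuracy, giving each group equal weight."""
    per_group = [
        accuracy_score(group["y_true"], group["y_pred"])
        for _, group in df.groupby(group_col)
    ]
    return sum(per_group) / len(per_group)

toy = pd.DataFrame({
    "y_true": [1, 0, 1, 1, 0, 1],
    "y_pred": [1, 0, 0, 1, 0, 1],
    "race_ethnicity": ["A", "A", "A", "A", "B", "B"],
})
print(group_weighted_accuracy(toy))  # macro average over groups A and B
```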
Potential Impact
Datasets should align with the needs of students, educators, parents, or other stakeholders and address persistent challenges in education. Overall, there should be a clear explanation of how the dataset can serve as a benchmark for the field, drawing the attention of researchers and technologists to encourage innovation in a specific domain.
Viable Competition Task
The dataset should lend itself to the development of machine learning or deep learning algorithms that can be used in a range of contexts. The competition’s prediction task should be clear. The challenge should be rigorous yet solvable to spur advancement in machine learning techniques and the development of state-of-the-art models.
Ideally, the algorithms created from the dataset could be adopted by several educational platforms and help to improve these platforms’ functionality and effectiveness. The dataset should help produce transferable models that can make accurate predictions given new data.
Experts recommend that the task be easy to define and understand so that people can grasp the goals and concepts. With more complicated challenges, focusing on one task at a time and breaking the challenge down into distinct objectives can help in planning and setting up the competition.
Ethical Data Collection
If the competition concerns an existing dataset, organizers should describe how the dataset was collected and whether ethical considerations, such as privacy and bias, guided the collection process. If competition organizers plan on creating a new dataset, they should have a clear plan for data collection that addresses ethical considerations.
For both existing and developing datasets, competition organizers should document the data collection process, the intended uses of the dataset, and what ethical considerations should guide the future use of this dataset in research, practice, and model development. Additionally, organizers should describe what materials they intend to make available alongside the dataset to support the ethical use of the dataset.
Building a Test Set
In a machine learning competition, teams usually receive only a portion of the data (i.e., training set) to develop their solutions while the rest of the data is withheld (i.e., test set) to evaluate solutions at the end of the competition. Carrying out this unbiased evaluation is a central part of machine learning and improves the quality of the models and algorithms. This is crucial in competitions that offer prizes and awards. Otherwise, teams could use test data to gain an unfair advantage and undermine the quality of their solutions. When test data is used in training a model, the accuracy of the metric is artificially inflated since the model has seen the test data before. Thus, the dataset (or test set, at the very least) should not be released publicly until after the competition closes.
It is worth noting that a well-prepared test set encourages more people to join the competition. Many competitors in these contests are data science experts, and their first step in any competition will be to review the evaluation procedure and test set.
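As a minimal sketch of the kind of held-out split described above (the file names, column names, and 70/30 split are assumptions for illustration), the labeled data is divided once, the training portion is released, and the test labels are withheld until final scoring.

```python
# Hold out a private test set: participants receive the training portion,
# while test labels are withheld for final evaluation.
# File names, column names, and the 70/30 split are illustrative assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv("competition_data.csv")  # hypothetical labeled dataset

train, test = train_test_split(
    data,
    test_size=0.3,            # e.g., withhold 30% for evaluation
    stratify=data["label"],   # keep class balance comparable across splits
    random_state=42,          # reproducible split
)

train.to_csv("train.csv", index=False)                                  # released to participants
test.drop(columns=["label"]).to_csv("test_features.csv", index=False)   # released without labels
test[["label"]].to_csv("test_labels_private.csv", index=False)          # kept private for scoring
```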
Data Sharing Terms
Issues of privacy and bias are key to building a robust dataset. Organizers need to ensure that proper procedures have been followed to remove any personally identifiable information. Similarly, the dataset should have appropriate data-sharing terms and establish strict licensing agreements for how the data can be used for personal, research, or commercial purposes. As an example, to run a competition on Kaggle, organizers must allow competitors to use the data but may also allow researchers to use the data for non-commercial purposes.
Section 2: Design the competition and its metrics to reflect the project’s goals.
A successful data science competition requires prior planning and preparation of the competition’s setup and design. The planning process includes defining the overall parameters for the competition, like whether the competition will be public or private, metric design, and whether prizes will be offered. Other aspects such as the size of the field, the competition task, the clarity of the goal and its anticipated benefit, as well as the processes of submission, evaluation, and recognition, also need to be reviewed and detailed in a competition plan.
Key best practices include addressing the following:
- Hosting a public vs. private challenge
- Importance of getting people outside the field
- Submission and evaluation procedures
- Metric design
- Prizes and recognition
Hosting a Public vs. Private Challenge
When creating a competition, organizers need to decide if it will be public or private. Public competitions are open to anyone who wants to join and most commonly occur on popular data science competition platforms such as Kaggle or CodaLab. Private competitions are invite-only and can be run on any forum.
Both have advantages. In private competitions, organizers can more closely monitor submissions and results. There is also less risk involved when using sensitive data, provided appropriate data security measures are in place. Additionally, special interest competitions offer a niche for experts, ensuring that the competition attracts individuals with a deep understanding of the subject matter. Small competitions can also make working with other experts feel less intimidating and more accessible since there are fewer participants.
Requests for private competitions generally originate with companies that want to offer their employees a motivating and fun challenge while maintaining privacy over their proprietary data. Another reason to keep a competition private is if the resulting algorithms are susceptible to bias. In these situations, it may be helpful to do a first pass with a smaller, private competition to gain topical knowledge that could help avoid problems before scaling up to a larger, public competition.
Public competitions, on the other hand, have the benefits of size. Their open nature democratizes access to the data, making them inclusive and accessible to a broad spectrum of individuals curious about the subject matter. These competitions are both competitive and collaborative. Even though it seems counterintuitive, people who are competing will share with each other what works or doesn’t work because it’s enjoyable, builds their knowledge, and creates a feeling of camaraderie.
In many ways, the question is quality versus quantity. In a public competition, the sheer volume of participants often results in a broader exploration of ideas and maximizes the collective hours spent tackling the problem at hand, whereas a private competition prioritizes highly skilled participants who can produce top-tier solutions. Striking the right balance between quantity and quality ultimately depends on the specific goals and nature of the competition, with each approach offering its own set of advantages and considerations.
Importance of Getting People Outside the Field
Even in a private competition, it is paramount to involve individuals from diverse fields beyond data science. Bringing together participants with varied backgrounds and expertise ensures the consideration of different perspectives and catalyzes creative and innovative solutions. Each participant, from learning sciences experts to computational linguists, brings a unique set of experiences, methodologies, and problem-solving approaches to the competition that, ultimately, benefits everyone.
This diversity of thought not only enriches the collaborative process but also injects a multitude of ideas into the solution. It encourages participants to consider novel angles, alternative methodologies, and unconventional insights that might not be apparent to a homogenous group. The synthesis of these diverse viewpoints often results in a comprehensive and well-rounded solution, ultimately elevating the overall quality of the competition’s outcome. The cross-pollination of ideas and approaches from different fields not only enhances problem-solving but also reflects the real-world interdisciplinary nature of data science applications.
When trying to ensure diverse perspectives, organizers must also consider the size of the field for the task. For instance, competitions for natural language processing tasks have a higher turnout regardless of prize money simply because there is greater interest in the topic. So even for niche topics in education, focusing on fields of current interest can help ensure a robust turnout.
Submission and Evaluation Procedures
All effective submission and evaluation procedures begin with two basic questions: What is the key goal? What other possible goals could be beneficial? Defining the goal and providing strong answers to these questions can help guide the submission and evaluation procedures while minimizing participant frustration. Participants will not want to spend months in a competition without some assurance that they are going to get a solid result that accurately addresses the competition’s goals. The only way to reassure competitors of the quality of their work is to clearly outline submission and evaluation procedures such as cross-validation and benchmarking.
For all competitions, the use of automated metrics for evaluation and scoring must be transparent and reproducible. Metric details should be readily available to participants so they understand how their solutions will be assessed. Posting a leaderboard will also facilitate transparency as teams will have clarity on their relative performance and may be more motivated to compete.
After the competition, winning solutions should be reviewed for integrity and compliance. This approach helps verify the plausibility of submissions, and participants should be made aware of this process before the competition launches.
Metric Design
To identify and communicate the competition’s precise outcome, organizers must craft a detailed and thorough metric. Identifying the exact objectives and translating them into a well-defined metric is invaluable and will ensure that participants understand the criteria for success and can tailor their solutions accordingly. Organizers should use standard benchmark metrics (e.g., F1 for a classification task) so there is a shorter learning curve for interpreting metric scores and comparing model performance. However, novel tasks may require a custom metric that synthesizes multiple evaluation metrics into one, as in the sketch below.
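As a brief sketch of what such a composite metric could look like (the choice of metrics and the weights are hypothetical, not drawn from any specific competition), a standard classification metric can be blended with a second metric into a single leaderboard score.

```python
# Hypothetical composite leaderboard score blending two standard metrics.
# The choice of metrics and the weights are illustrative assumptions.
from sklearn.metrics import f1_score, cohen_kappa_score

def leaderboard_score(y_true, y_pred, w_f1: float = 0.7, w_kappa: float = 0.3) -> float:
    """Weighted blend of macro F1 and quadratic-weighted kappa."""
    f1 = f1_score(y_true, y_pred, average="macro")
    kappa = cohen_kappa_score(y_true, y_pred, weights="quadratic")
    return w_f1 * f1 + w_kappa * kappa

# Toy example: ordinal labels 0-2, with one misclassification.
print(leaderboard_score([0, 1, 2, 2, 1], [0, 1, 2, 1, 1]))
```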
Another important consideration is to ensure that competition benchmarks can be implemented in the code. Having a benchmark describing the criteria is good, but when competitors can actually run the code and roughly achieve the benchmark, they come away with a much deeper understanding.
Additionally, no matter how well-curated and well-resourced the data is, if there is no good proxy or metric for quality, the competition will not make much sense to competitors. The chosen metric is only meaningful insofar as its scores correlate with desirable behaviors from a model. There are many scenarios, particularly in reinforcement learning, where competitors may optimize for an objective and achieve a good metric score without achieving the desired outcome.
Some competitions may require multiple “tracks” for prizes to account for different approaches and multiple solutions, such as an overall track measuring the solutions’ accuracy and an efficiency track that balances a measure of accuracy with the time taken to accomplish the task. Other aspects can also be prioritized. For instance, if hosts would like to encourage creativity rather than just problem-solving, they may consider creating a track where participants can self-nominate a submission to be evaluated for technical novelty. (It is necessary to note that human evaluation requires a substantial time commitment from experts so the more tracks, the more time-consuming a competition is to grade.)
Incentivizing different tracks encourages diverse problem-solving approaches. For multiple tracks, balance is important. For example, competition organizers may want to give an efficiency track sufficient weight to encourage people to work on it, but not so much weight that participants will only want to work on this track. Additionally, requirements should be clear and thorough so that participants cannot simply submit a print statement and win by being the fastest. A sketch of one possible efficiency-track score follows.
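As an illustration only (the runtime normalization constant and weights are assumptions, not values from any real competition), an efficiency track can combine accuracy with a capped runtime penalty so that speed matters but cannot dominate accuracy.

```python
# Hypothetical efficiency-track score: accuracy minus a capped runtime penalty.
# The max_runtime normalization and runtime_weight are illustrative assumptions.
def efficiency_score(accuracy: float, runtime_seconds: float,
                     max_runtime: float = 3600.0, runtime_weight: float = 0.25) -> float:
    """Reward accuracy while penalizing slow solutions, with the penalty capped."""
    runtime_penalty = min(runtime_seconds / max_runtime, 1.0)
    return (1 - runtime_weight) * accuracy - runtime_weight * runtime_penalty

# A fast, slightly less accurate model vs. a slower, more accurate one:
print(efficiency_score(accuracy=0.86, runtime_seconds=300))   # fast solution scores higher
print(efficiency_score(accuracy=0.88, runtime_seconds=3600))  # slow solution is penalized
```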
Organizers should provide a comprehensive explanation of the metric once it’s been selected. This not only empowers competitors to better comprehend and fulfill the competition requirements, but it also helps organizers anticipate and address any questions that may arise during the competition. This fosters a more transparent and constructive environment for both participants and organizers.
Prizes and Recognition
Rewards and prizes boost the motivation and engagement of participants, which directly increases the likelihood of high-quality solutions. These incentives can include a cash prize, academic publication opportunities, scholarship or fellowship opportunities, or certificates of completion.
Different incentives will appeal to different audiences. For instance, students may desire publication opportunities or academic credits while industry professionals gravitate toward cash prizes. Partnering with leading academics, university centers, or companies for name recognition will also attract more participation from their respective communities.
While the size of the prize pool highly correlates with engagement, this may not be because of the money itself but rather the increased attention and publicity that prizes can attract. For instance, prize money competitions are featured on the main page of Kaggle, so they are prominently displayed and seen by more potential participants. Additionally, these tend to be “headline” competitions with really interesting problem sets, which can create a prestige factor and add excitement to the community. Furthermore, by recognizing participants through the awarding of points and medals and by displaying a “leaderboard,” organizers can help participants feel a sense of pride in being involved in important work and research. This aspect of recognition should not be overlooked as an incentive.
Section 3: Create a collaborative environment with clear, consistent communication and guidelines for competitors and other stakeholders.
Effective communication plays a pivotal role in retaining skilled individuals who were recruited into data science competitions. Without strong organizer-led communication, participants may experience frustration, lose interest, or even disengage entirely from the competition. At a minimum, organizers should strive to build a community where competitors can connect, discuss, and find teammates.
All competitions should also have a competition communications plan. This plan is a strategic approach to address the different phases of the competition: pre-competition, during the competition, and post-competition. The following sections delve deeper into each phase, highlighting key communication strategies that can contribute to a successful and engaging data science competition experience.
Key best practices include addressing the following:
- Community Environment
- Communications Plan for Pre-, During, and Post-Competition
Community Environment
Even though data science challenges are inherently competitive, creating a sense of community and promoting knowledge sharing among participants is important. Community building and knowledge sharing promote innovation by connecting the education, research, and artificial intelligence communities through a shared experience they may not otherwise have. Strategies for structuring a community environment within a competition include:
- Public sharing of participants’ names and backgrounds
- Channel-based messaging platforms (e.g., Slack, Discord) for community discussions
- Flexibility with team membership, size, and collaborations
Communications Plan for Pre-, During, and Post-Competition
To make communication planning more approachable, it is best to break the process down into pre-, during, and post-competition phases. In the pre-competition phase, organizers should emphasize gauging participant interest. During the competition, the focus of communications should shift to competition dynamics, minimizing misunderstandings, and keeping morale high. Finally, post-competition engagement centers on submitted solutions and direct communication with winners. The details of each phase of the communication plan are discussed further in the following paragraphs.
Pre-Competition
One aspect particularly helpful in private competitions is pre-registration. Pre-registration has two benefits: First, it helps organizers estimate participation from highly skilled participants while preventing the influx of low-quality submissions. Second, organizers can use the pre-registration period to get familiar with their participant pool by collecting information on their backgrounds and interests.
During Competition
Comprehensive documentation, accompanied by example code, is essential to ensure participation. It not only facilitates a smooth start for competitors but also sets the stage for robust and transparent competition dynamics. Participants will have many questions about the dataset, and providing a data dictionary, codebook, or sample code for baseline models and exploratory analyses will minimize these inquiries. Involving the original dataset provider in these communications will also help ensure clarity and alignment. Requesting that competitors submit runnable code after the competition adds a layer of verification, ensuring the plausibility of their outcomes.
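As an example of the kind of baseline starter code organizers might share (the file name, column names, and model choice are hypothetical), a short, runnable script lets participants reproduce a reference score before building their own models.

```python
# Hypothetical starter/baseline script organizers could publish with a dataset.
# File name, column names, and the simple TF-IDF + logistic regression model
# are illustrative assumptions.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

train = pd.read_csv("train.csv")  # assumed columns: "text", "label"

baseline = make_pipeline(
    TfidfVectorizer(min_df=2, ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)

# A cross-validated score participants can roughly reproduce before submitting.
scores = cross_val_score(baseline, train["text"], train["label"], cv=5, scoring="f1_macro")
print(f"Baseline macro F1: {scores.mean():.3f} ± {scores.std():.3f}")
```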
It is crucial that competition organizers do not “poison the well” by communicating their ideas or assumptions about what a successful model might look like. When competitors are exposed to the organizer’s thinking, it reduces their creativity and leads participants to design models narrowly for the test set. In nearly all competitions, it is important to make models that are generalizable so they can function on new data. This does not mean that organizers should not respond to competitor questions. Hosts should be responsive to questions but refer competitors to published resources or encourage them to engage with other participants on the competition’s discussion boards.
Formal discussion boards and informal communication channels significantly enhance competitions. These platforms serve as virtual spaces where participants can talk, share insights, and collaborate. Formal discussion boards, whether integrated into the competition platform or hosted separately, provide an organized avenue for participants to resolve issues promptly. These boards serve as a centralized hub for addressing queries, clarifying competition guidelines, and sharing important updates. They also foster a sense of community and inclusivity among participants.
Post-Competition
Post-competition communications are crucial. They can enhance the effectiveness of the competition by providing a platform for in-depth exploration of winning solutions while also fostering a collaborative environment for continued improvement. Organizing video calls or virtual meetings among the winners enables direct interaction among the successful teams and allows organizers to learn more about the methodologies they employed and gain insights into the nuances of different approaches.
Presentations, whether in the form of webinars or conferences, offer winners the chance to showcase their work to a broader audience, encouraging knowledge dissemination and creating opportunities for valuable feedback. Writing publications documenting the winning solutions contributes to the wider body of knowledge in the field and serves as a comprehensive resource for future participants. This iterative feedback loop ensures that the competition remains a catalyst for ongoing advancements, drawing input from a diverse range of perspectives, and promoting a culture of continuous learning and refinement in the realm of data science.
Section 4: Prepare for cost considerations of data and competitions.
Two of the biggest factors influencing cost are 1) whether the data is prepared and collected externally or internally and 2) whether the dataset requires human annotation. Human annotation is necessary when the dataset contains written work across any discipline, as this establishes a baseline score. To mitigate potential bias in these scores, it is important that rubrics and other evaluation tools be reviewed for bias before they are used.
Data Preparation Costs
Regarding dataset preparation, an externally prepared dataset might come from an online educational platform. The platform captures and records the data, but not necessarily in the format required for a competition, so a data analyst would need to clean and reformat the data to make it usable. Similar cleaning is often required for internally prepared datasets that were not originally created for data modeling purposes.
With internally prepared datasets, the tasks can range dramatically. For example, if the dataset is a body of student essays, the essays would need to be sourced and collected; cleaned and uploaded into a dataframe; scrubbed for personally identifiable information (PII); annotated (after a rubric has been designed); and then reformatted a final time. A sketch of this kind of preparation pass follows.
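As a minimal sketch of the cleaning and reformatting steps described above (file names and column names are hypothetical), a short pandas pass can drop empty and duplicate submissions and standardize columns before PII review and annotation.

```python
# Hypothetical data-preparation pass prior to PII review and annotation.
# File names and column names are illustrative assumptions.
import pandas as pd

raw = pd.read_csv("raw_essays.csv")

prepared = (
    raw.rename(columns={"essay_text": "text", "student_grade": "grade"})
       .dropna(subset=["text"])              # remove empty submissions
       .drop_duplicates(subset=["text"])     # remove accidental duplicates
       .assign(text=lambda df: df["text"].str.strip())
)

prepared.to_csv("essays_for_annotation.csv", index=False)
print(f"{len(raw) - len(prepared)} rows removed during cleaning")
```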
Datasets can range widely in price, from $50,000 to more than $1,000,000, depending on the details. Costs typically fall between $250,000 and $550,000, including the costs to obtain the data and conduct basic annotation, if needed. The range depends on:
- Cost to obtain the dataset, if any
- Complexity of annotation, if required
- Time to clean and prepare the dataset
In an effort to provide transparency, here is an example of the needs and costs associated with the Feedback Prize – Predicting Effective Arguments competition. The Feedback Prize competitions featured 26,000 argumentative essays written by students in grades 6-12. These essays were sourced from various states, de-identified, and annotated. The goal of the competition was to generate algorithms to better inform equitable automated writing evaluation systems. To do this, however, the essays needed to be annotated with scores and scrubbed of PII. The total cost to score and remove PII from 26,000 essays was $550,000, of which annotation accounted for $250,000. A smaller subset of this dataset, consisting of 6,500 essays and used in a secondary competition, cost $40,000 to annotate.
There are common additional costs associated with datasets. The following chart outlines what these tasks are, their purpose, and what they can cost. The quantity of data and the timeline for completion influence the cost of each task.
| Task | Purpose | Cost Range |
| --- | --- | --- |
| Annotation and de-identification | When working with a body of written work, annotations are needed to establish fair baseline scores; developing a rubric is the first step in the annotation process. Nearly all education-related datasets will contain some PII, and datasets must be de-identified to be used ethically. If the PII is associated with written work, it is often de-identified during the annotation process. Vanderbilt University is working to create an algorithm that can remove PII from data at no cost. | $100,000 – $500,000 |
| Bias reviews | All datasets are inherently biased to some degree. If this bias remains unmitigated, it will lead to biased algorithms resulting from the competition. To combat this, thorough bias reviews should be conducted. These reviews help establish mitigation plans or, in some cases, determine that the dataset is ineligible for competition. | $50,000 – $100,000 |
Competition Costs
Data science competitions can either be run privately or through an established hosting platform, such as Kaggle. With Kaggle, the most well-known hosting platform, the fees depend on the complexity of the competition task, as this will impact the complexity of the metric that Kaggle has to develop. Competitions with standard metrics typically cost $50,000, and competitions with customized metrics can cost up to $60,000. In addition to the cost of running the competition, there is also the cost of the prize purse. Again, the amount will depend on the complexity of the competition – complex competitions with a small prize purse will not attract competitors – but typically ranges from $50,000 to $150,000.
What Is The Future of Data Competitions?
With AI pushing the advancement of technology ever faster, data science competitions could see great changes in the next few years. One consideration is the use of synthetic data in competitions. Personally identifiable information (PII) in datasets, for example, often prevents their use in competitions due to privacy concerns. However, if an entirely synthetic dataset could be made that mimics real data without any private information, it could solve these issues. Currently, there are challenges with this approach because synthesized data can be reverse engineered, but this may change in the coming years.
Technological advancements could also remove or reduce barriers to coding and annotation. For instance, for datasets that contain PII, someone needs to hand-code and review those datasets for any privacy concerns. This hand-coding and annotation process is cumbersome, expensive, and time-consuming. If an algorithm could accurately remove PII, it could open up many datasets that are currently unavailable.
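As a rough illustration of what a first automated pass at PII removal might look like (the regular expressions and placeholder labels are simplistic assumptions; real de-identification would still require trained models and human review), rule-based redaction can catch obvious patterns such as emails and phone numbers.

```python
# Simplistic, rule-based PII redaction sketch. The patterns and labels are
# illustrative assumptions; production de-identification needs far more care.
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
}

def redact_pii(text: str) -> str:
    """Replace matched spans with bracketed placeholders."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact_pii("Contact me at jane.doe@example.com or 555-123-4567."))
# -> "Contact me at [EMAIL] or [PHONE]."
```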
Data scraping has also become far more feasible in recent years. When it comes to competitions, creating a new dataset is always the better option because the uniqueness of created datasets makes it hard for people to find the data online and use it to overfit their models. Data scraping is also limited in scope. However, data scraping does relate strongly to data augmentation techniques, which will likely improve drastically over the next few years.
Previously, competitors were limited to the data they were given or data related enough to be scraped from the Internet. With data augmentation techniques, however, competitors can source brand-new material using generative AI. Already, in most competitions, competitors are generating new text-based data, and as AI technology grows, this could extend to many other kinds of augmentation of competition datasets. A brief sketch of this kind of text augmentation follows.
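As a hedged sketch of generative text augmentation (the model choice, prompt, and sampling parameters are placeholders; generated examples would still need quality and bias review before use), a small open model can be prompted to produce additional training examples.

```python
# Sketch of generative text augmentation using a small open model.
# Model choice, prompt, and sampling parameters are illustrative assumptions;
# generated text would still require quality and bias review.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "Write one sentence a middle school student might use to summarize a science article:"
outputs = generator(prompt, max_new_tokens=40, num_return_sequences=3, do_sample=True)

augmented_examples = [out["generated_text"][len(prompt):].strip() for out in outputs]
for example in augmented_examples:
    print(example)
```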
Furthermore, generative AI may be used in the future to account for and be sensitive to differences within specific populations. By enriching the presence of certain demographics, datasets that previously would have been unbalanced could potentially become more diverse and equitable.
Conclusion
Data science competitions play a vital role in harnessing collective intelligence to develop AI solutions that cater to educational needs. This research paper has delineated the best practices essential for effective competition management, with a specific emphasis on dataset selection and competition design. The outlined criteria for dataset selection encompass various factors such as impact, uniqueness, method of collection, size, demographic information, and data-sharing terms. Additionally, key considerations including the hosting of public vs. private challenges, engagement of participants from diverse backgrounds, implementation of robust submission and evaluation procedures, fostering community interaction, metric design, and incentivization strategies were discussed. Finally, considerations of cost in data preparation and competitions were outlined to provide a baseline understanding of financial needs.
Furthermore, the paper has identified forthcoming challenges and trends in the realm of data science competitions, including the utilization of synthetic data, automated alternatives to manual coding and annotation, dataset creation methods, and innovative data augmentation techniques. It is evident that meticulous planning, coupled with an exploration of emerging methodologies, is essential to foster innovation and ensure the delivery of high-quality solutions in data science competitions. This ongoing exploration and adaptation to evolving challenges will continue to shape the landscape of future competitions, driving advancements in the field of AI and education.
Acknowledgments
The Learning Agency and Vanderbilt University thank the many contributors to this report. Over a dozen individuals with experience designing or competing in data science competitions joined a virtual meeting to discuss the common elements of the most successful competitions. Individuals included: Dr. Katie McCarthy, Dr. Caitlin Mills, Amed Coulibaly, Thanaporn March Patikorn, Dr. John Stamper, Dr. Simon Woodhead, Addison Howard, Rohit Singh, Jamie Alexandre, Dr. John Whitmer, Ralph Abboud, Ryan Holbrook, Walter Reade, and Andreija Miličević.
– This article was written by Jules King, L Burleigh, Perpetual Baffour, Scott Crossley, Meg Benner, and Ulrich Boser