URLs and addresses are not completely covered. The tokenizer relies on clear markers for these. Assuming that any sequence including periods is likely to be a URL proves unwise, given that spacing between normal words is often irregular. And actually checking the existence of a proposed URL was computationally infeasible for the amount of text we intended to process. Finally, as the use of capitalization and diacritics is quite haphazard in the tweets, the tokenizer strips all words of diacritics and transforms them to lower case.
For those techniques where hyperparameters need to be selected, we used a leave-one-out strategy on the test material.
For each test author, we determined the optimal hyperparameter settings with regard to the classification of all other authors in the same part of the corpus, in effect using these as development material. In this way, we derived a classification score for each author without the system having any direct or indirect access to the actual gender of the author.
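The leave-one-out selection described above can be sketched as follows. This is a minimal illustration and not the authors' control shell: the function name `select_settings_leave_one_out`, the `quality` callback, and the toy data are all hypothetical, and the sketch keeps only the single best setting per author.

```python
from typing import Callable, Dict, Hashable, List, Sequence


def select_settings_leave_one_out(
    authors: Sequence[str],
    settings: Sequence[Hashable],
    quality: Callable[[Hashable, List[str]], float],
) -> Dict[str, Hashable]:
    """For each held-out author, pick the setting that scores best on all other authors.

    `quality(setting, development_authors)` is a hypothetical callback that
    returns how well a setting classifies the given development authors.
    """
    chosen: Dict[str, Hashable] = {}
    for held_out in authors:
        # The held-out author's own gender never enters the selection.
        development = [a for a in authors if a != held_out]
        chosen[held_out] = max(settings, key=lambda s: quality(s, development))
    return chosen


# Toy data (invented for illustration): 1 means the setting classifies
# that author correctly, 0 means it does not.
correct = {
    "s1": {"a": 1, "b": 1, "c": 0, "d": 0},
    "s2": {"a": 0, "b": 0, "c": 1, "d": 1},
}


def accuracy(setting: str, development: List[str]) -> float:
    """Fraction of development authors the setting classifies correctly."""
    return sum(correct[setting][a] for a in development) / len(development)


chosen = select_settings_leave_one_out(["a", "b", "c", "d"], ["s1", "s2"], accuracy)
```

In the toy data, author `a` is assigned setting `s2` even though `s1` is the one that classifies `a` correctly, because only the other authors are consulted. That is the property the text relies on: no direct or indirect access to the held-out author's gender.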
We then measured for which percentage of the authors in the corpus this score was in agreement with the actual gender. These percentages are presented below.
Profiling Strategies
In this section, we describe the strategies that we investigated for the gender recognition task. As we approached the task from a machine learning viewpoint, we needed to select text features to be provided as input to the machine learning systems, as well as the machine learning systems which are to use this input for classification. We first describe the features we used, and then explain how we used the three selected machine learning systems to classify the authors.
The use of syntax or even higher level features is for now impossible, as the language use on Twitter deviates too much from standard Dutch and we have no tools to provide reliable analyses.
Several errors could be traced back to the fact that the account had moved on to another user.
We could have used different dividing strategies, but chose balanced folds in order to give an equal chance to all machine learning techniques, including those that have trouble with unbalanced data.
If, in any application, unbalanced collections are expected, the effects of biases, and corrections for them, will have to be investigated.
Most of the features rely on the tokenization described above. We will illustrate the options we explored with an example tweet.
Top Function Words: The most frequent function words (see Kestemont for an overview). We used the most frequent ones, as measured on our tweet collection, of which the example tweet contains the words ik, dat, heeft, op, een, voor, and het.
Then, we used a set of feature types based on token n-grams, with which we already had previous experience (Van Bael and van Halteren). For all feature types, we used only those features which were observed with at least 5 authors in our whole collection (for skip bigrams, 10 authors).
Unigrams: Single tokens, similar to the top function words, but then using all tokens instead of a subset.
Bigrams: Two adjacent tokens.
Trigrams: Three adjacent tokens.
Skip bigrams: Two tokens in the tweet, but not adjacent, without any restrictions on the gap size.
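The token-based feature types listed above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' extraction code: the function names are hypothetical, skip bigrams are assumed to keep the order in which the two tokens occur, and the token list is an arbitrary arrangement of words mentioned in the text rather than the actual example tweet.

```python
from itertools import combinations
from typing import Dict, Hashable, Iterable, List, Set, Tuple


def token_ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """All runs of n adjacent tokens (n=1 unigrams, n=2 bigrams, n=3 trigrams)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def skip_bigrams(tokens: List[str]) -> List[Tuple[str, str]]:
    """Pairs of tokens from one tweet that are not adjacent, with any gap size."""
    return [
        (tokens[i], tokens[j])
        for i, j in combinations(range(len(tokens)), 2)
        if j - i > 1  # adjacent pairs are ordinary bigrams, so skip them
    ]


def features_with_author_support(
    per_author: Dict[str, Iterable[Hashable]], min_authors: int = 5
) -> Set[Hashable]:
    """Keep only features observed with at least `min_authors` different authors."""
    support: Dict[Hashable, int] = {}
    for features in per_author.values():
        for feature in set(features):  # count each author at most once
            support[feature] = support.get(feature, 0) + 1
    return {f for f, count in support.items() if count >= min_authors}


# Arbitrary token sequence for illustration only.
tokens = ["ik", "dat", "heeft", "op"]
bigrams = token_ngrams(tokens, 2)
skips = skip_bigrams(tokens)
```

With the four tokens above, `bigrams` holds the three adjacent pairs and `skips` holds the three non-adjacent pairs. The `min_authors` default of 5 follows the threshold given in the text; for skip bigrams the text uses 10.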
Finally, we included feature types based on character n-grams (following Kjell et al.). We used the n-grams with n from 1 to 5, again only when the n-gram was observed with at least 5 authors. However, we used two types of character n-grams. The first set is derived from the tokenizer output, and can be viewed as a kind of normalized character n-grams.
Normalized 1-gram
Normalized 3-gram: about 36K features
Normalized 4-gram
Normalized 5-gram
The second set of character n-grams is derived from the original tweets. This type of character n-gram has the clear advantage of not needing any preprocessing in the form of tokenization.
Original 1-gram
Original 3-gram: about 77K features
Original 4-gram
Original 5-gram
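The difference between the two character n-gram sets can be illustrated with a small sketch. It is an approximation under stated assumptions: the real normalized n-grams come from the authors' tokenizer output, which is not reproduced here. The hypothetical `normalize` function only mimics the effects the text mentions (diacritics stripped, lower case, regularized spacing), and the sample string is invented.

```python
import unicodedata
from typing import List


def normalize(text: str) -> str:
    """Rough stand-in for tokenizer normalization (an assumption, not the real tokenizer)."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lower-case and collapse irregular spacing to single spaces.
    return " ".join(stripped.lower().split())


def char_ngrams(text: str, n: int) -> List[str]:
    """All substrings of length n, in order of occurrence."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


sample = "Één  KEER"  # invented sample with diacritics, capitals and a double space
original_trigrams = char_ngrams(sample, 3)
normalized_trigrams = char_ngrams(normalize(sample), 3)
```

The original n-grams keep capitals, diacritics and spacing, so they need no preprocessing; the normalized ones discard exactly that surface variation.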
Again, we decided to explore more than one option, but here we preferred more focus and restricted ourselves to three systems. Our primary choice for classification was the use of Support Vector Machines. We chose Support Vector Regression (ν-SVR, to be exact) with an RBF kernel, as it had shown the best results in several research projects. With these main choices, we performed a grid search for well-performing hyperparameters.
The second classification system was Linguistic Profiling (LP; van Halteren), which was specifically designed for authorship recognition and profiling.
Roughly speaking, it classifies on the basis of noticeable over- and underuse of specific features. Before being used in comparisons, all feature counts were normalized to relative frequencies (counts per fixed number of words), and then transformed to Z-scores with regard to the average and standard deviation within each feature.
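The two normalization steps mentioned above can be sketched as follows. This is a minimal illustration, not the LP implementation: the function names are hypothetical, the `per` base of 1000 is a placeholder because the actual base is not given here, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev
from typing import List


def relative_count(count: int, n_words: int, per: int = 1000) -> float:
    """Scale a raw feature count to a count per `per` words (placeholder base)."""
    return count * per / n_words


def z_scores(values: List[float]) -> List[float]:
    """Z-scores of one feature across authors: (value - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        # A constant feature carries no over- or underuse signal.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Invented relative frequencies of one feature for three authors.
scores = z_scores([1.0, 2.0, 3.0])
```

After this transformation, a positive value means an author uses the feature more than average and a negative value less, measured in standard deviations, which is what makes over- and underuse comparable across features.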
The third system was TiMBL. As the input features are numerical, we used IB1 with k equal to 5, so that we can derive a confidence value. The only hyperparameters we varied in the grid search are the metric (Numerical and Cosine distance) and the weighting (no weighting, information gain, gain ratio, chi-square, shared variance, and standard deviation). However, the high dimensionality of our vectors presented us with a problem. For such high numbers of features, it is known that k-NN learning is unlikely to yield useful results (Beyer et al.).
This meant that, if we still wanted to use k-NN, we would have to reduce the dimensionality of our feature vectors, for which we used principal component analysis (PCA). For each system, we provided the first N principal components for various N. In effect, this N is a further hyperparameter, which we varied from 1 to the total number of components, using a stepsize of 1 from 1 to 10, and then slowly increasing the stepsize to a maximum of 20.
Rather than using fixed hyperparameters, we let the control shell choose them automatically in a grid search procedure, based on development data. As scaling is not possible when there are columns with constant values, such columns were removed first. For each setting and author, the systems report both a selected class and a floating point score, which can be used as a confidence score.
In order to improve the robustness of the hyperparameter selection, the best three settings were chosen and used for classifying the current author in question. For LP, this is by design. A model, called profile, is constructed for each individual class, and the system determines for each author to which degree they are similar to the class profile.
For SVR, one would expect symmetry, as both classes are modeled simultaneously, and differ merely in the sign of the numeric class identifier. However, we do observe different behaviour when reversing the signs. For this reason, we did all classification with SVR and LP twice, once building a male model and once a female model.
For both models the control shell calculated a final score, starting with the three outputs for the best hyperparameter settings. It normalized these by expressing them as the number of non-model class standard deviations over the threshold, which was set at the class separation value.
The control shell then weighted each score by multiplying it by the class separation value on the development data for the settings in question, and derived the final score by averaging. It then chose the class for which the final score is highest. In this way, we also get two confidence values.
Results
In this section, we will present the overall results of the gender recognition. We start with the accuracy of the various features and systems. Then we will focus on the effect of preprocessing the input vectors with PCA. After this, we examine the classification of individual authors.
For the measurements with PCA, the number of principal components provided to the classification system is learned from the development data. Starting with the systems, we see that SVR using original vectors consistently outperforms the other two.
For only one feature type, character trigrams, LP with PCA manages to reach a higher accuracy than SVR, but the difference is not statistically significant.
For SVR and LP, these are rather varied, but TiMBL's confidence value consists of the proportion of selected class cases among the nearest neighbours, which with k at 5 is practically always 0.
The class separation value is a variant of Cohen's d (Cohen). Where Cohen assumes the two distributions have the same standard deviation, we use the sum of the two, practically always different, standard deviations.
Accuracy Percentages for various Feature Types and Techniques.
In fact, for all the token n-grams, it would seem that the further one goes away from the unigrams, the worse the accuracy gets.
An explanation for this might be that recognition is mostly on the basis of the content of the tweet, and unigrams represent the content most clearly. Possibly, the other n-grams are just mirroring this quality of the unigrams, with the effectiveness of the mirror depending on how well unigrams are represented in the n-grams. For the character n-grams, our first observation is that the normalized versions are always better than the original versions.
This means that the content of the n-grams is more important than their form. This is in accordance with the hypothesis just suggested for the token n-grams, as normalization too brings the character n-grams closer to token unigrams. The best performing character n-grams (normalized 5-grams) will be most closely linked to the token unigrams, with some token bigrams thrown in, as well as a smidgen of the use of morphological processes. However, we cannot conclude that what is wiped away by the normalization (use of diacritics, capitals and spacing) holds no information for the gender recognition. To test that, we would have to experiment with new feature types, modeling exactly the difference between the normalized and the original form.
The number of principal components was treated as just another hyperparameter to be selected. As a result, the systems' accuracy was partly dependent on the quality of the hyperparameter selection mechanism. In this section, we want to investigate how strong this dependency may have been.
Figure 1. Recognition accuracy as a function of the number of principal components provided to the systems, using token unigrams.
Figures 1, 2, and 3 show accuracy measurements for the token unigrams, token bigrams, and normalized character 5-grams, for all three systems at various numbers of principal components. For the unigrams, it is, interestingly, SVR that degrades at higher numbers of principal components, while TiMBL, said to need fewer dimensions, manages to hold on to the recognition quality. LP peaks much earlier. However, it does not manage to achieve good results with the principal components that were best for the other two systems. Furthermore, LP appears to suffer some kind of mathematical breakdown for higher numbers of components. Although LP performs worse than it could on fixed numbers of principal components, its more detailed confidence score allows a better hyperparameter selection, on average selecting around 9 principal components, where TiMBL chooses a wide range of numbers, generally far lower than is optimal. We expect that the performance with TiMBL can be improved greatly with the development of a better hyperparameter selection mechanism.
For the bigrams (Figure 2), we see much the same picture, although there are differences in the details. SVR now reaches its peak earlier, TiMBL peaks a bit later, and LP just mirrors its behaviour with unigrams.
For the normalized character 5-grams (Figure 3), LP keeps its peak at 10, but now even lower than for the token n-grams. However, all systems are in principle able to reach the same quality.
Even with an automatically selected number, LP already profits clearly. And TiMBL is currently underperforming, but might be a challenger to SVR when provided with a better hyperparameter selection mechanism.
Figure 2. Recognition accuracy as a function of the number of principal components provided to the systems, using token bigrams.
We will focus on the token n-grams and the normalized character 5-grams. As for systems, we will involve all five systems in the discussion. However, our starting point will always be SVR with token unigrams, this being the best performing combination. We will only look at the final scores for each combination, and forgo the extra detail of any underlying separate male and female model scores which we have for SVR and LP; see above.
When we look at his tweets, we see a kind of financial blog, which is an exception in the population we have in our corpus. The exception also leads to more varied classification by the different systems, yielding a wide range of scores.
SVR tends to place him clearly in the male area with all the feature types, with unigrams at the extreme. SVR with PCA, on the other hand, is less convinced, and even classifies him as female for unigrams.
Figure 4 shows that the male population contains some more extreme exponents than the female population. Looking at the texts of the most obvious male author, we indeed see a prototypical young male Twitter user. From this point on in the discussion, we will present female confidence as positive numbers and male as negative.
Figure 3. Recognition accuracy as a function of the number of principal components provided to the systems, using normalized character 5-grams.
All systems have no trouble recognizing him as a male, with the lowest scores around 1 for the top function words. If we look at the rest of the top males (Table 2), we may see more varied topics, but the wide recognizability stays. Unigrams are mostly closely mirrored by the character 5-grams, as could already be suspected from the content of these two feature types. For the other feature types, we see some variation, but most scores are found near the top of the lists.
On the female side, everything is less extreme.
The best recognizable female author is not as focused as her male counterpart. There is much more variation in the topics, but most of it is clearly girl talk of the type described in Section 5. In scores, too, we see far more variation. Even the character 5-grams have ranks up to 40 for these top-ranking authors.
Another interesting group of authors is formed by the misclassified ones. Taking again SVR on unigrams as our starting point, this group contains 11 males and 16 females.
Figure 4. Confidence scores for gender assignment with regard to the female and male profiles built by SVR on the basis of token unigrams. The dashed line represents the separation threshold. The dotted line represents exactly opposite scores for the two genders.
Top ranking females in SVR on token unigrams, with ranks and scores for SVR with various feature types.
With one exception (an author who is recognized as male when using trigrams), all feature types agree on the misclassification. This may support our hypothesis that all feature types are doing more or less the same. But it might also mean that the gender just influences all feature types to a similar degree. In addition, the recognition is of course also influenced by our particular selection of authors, as we will see shortly. Apart from the general agreement on the final decision, the feature types vary widely in the scores assigned, but this also allows for both conclusions.
The male who is attributed the most female score has, on re-examination, a clearly male first name and also profile photo. However, his Twitter network contains mostly female friends. This apparently colours not only the discussion topics, which might be expected, but also the general language use (this has also been remarked by Bamman et al.). The unigrams do not judge him to write in an extremely female way, but all other feature types do.
There is an extreme number of misspellings (even for Twitter), which may possibly confuse the systems' models.
The most extreme misclassification is reserved for a female author. This turns out to be Judith Sargentini, a member of the European Parliament. Although clearly female, she is judged as rather strongly male. In this case, it would seem that the systems are thrown off by the political texts. If we search for the word parlement (parliament) in our corpus, which is used 40 times by Sargentini, we find two more female authors (each using it once), as compared to 21 male authors (with up to 9 uses). Apparently, in our sample, politics is a male thing.
We did a quick spot check with another author, a girl who plays soccer and is therefore also often misclassified; here, the PCA version agrees with the original unigrams, and misclassifies even more strongly. In later research, when we will try to identify the various user types on Twitter, we will certainly have another look at this phenomenon.
Are they mostly targeting the content of the tweets? In this section, we will attempt to get closer to the answer to this question. Again, we take the token unigrams as a starting point. However, looking at SVR is not an option here. Because of the way in which SVR does its classification, hyperplane separation in a transformed version of the vector space, it is impossible to determine which features do the most work.
Instead, we will just look at the distribution of the various features over the female and male texts. Figure 5 shows all token unigrams. The ones used more by women are plotted in green, those used more by men in red. The position in the plot represents the relative number of men and women who used the token at least once somewhere in their tweets.
However, for classification, it is more important how often the token is used by each gender. We represent this quality by the class separation value that we described in Section 4. As the separation value and the percentages are generally correlated, the bigger tokens are found further away from the diagonal, while the area close to the diagonal contains mostly unimportant and therefore unreadable tokens.
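The class separation value used here was described earlier as a variant of Cohen's d in which the difference between the two class means is divided by the sum of the two standard deviations. A minimal sketch, assuming sample standard deviations and a hypothetical function name `class_separation`; the numbers are invented and do not come from the corpus:

```python
from statistics import mean, stdev
from typing import Sequence


def class_separation(class_a: Sequence[float], class_b: Sequence[float]) -> float:
    """Difference of the class means divided by the sum of the class standard deviations."""
    spread = stdev(class_a) + stdev(class_b)  # sample standard deviations (an assumption)
    if spread == 0:
        raise ValueError("both classes are constant; separation is undefined")
    return (mean(class_a) - mean(class_b)) / spread


# Invented usage rates of one token for three female and three male authors.
female_use = [4.0, 5.0, 6.0]
male_use = [1.0, 2.0, 3.0]
separation = class_separation(female_use, male_use)
```

The sign shows which group uses the token more, and the magnitude shows how cleanly the two groups are separated. In the figure, this is the quantity that determines the font size of a token.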
On the female side, we see a representation of the world of the prototypical young female Twitter user. And also some more negative emotions, such as haat (hate) and pijn (pain). Next we see personal care, with nagels (nails), nagellak (nail polish), makeup (makeup), mascara (mascara), and krullen (curls). Clearly, shopping is also important, as is watching soaps on television (gtst). The age is reconfirmed by the endearingly high presence of mama and papa. As for style, the only real factor is echt (really). The word haar may be the pronoun her, but just as well the noun hair.
Identity disclosed with permission. And by TweetGenie as well. An alternative hypothesis was that Sargentini does not write her own tweets, but assigns this task to a male press spokesperson. However, we received confirmation that she writes almost all her tweets herself (Sargentini, personal communication).
Figure 5. Percentages of use of tokens by female and male authors. The font size of the words indicates to which degree they differentiate between the genders when also taking into account the relative frequencies of occurrence.